Discontinuous Data-Oriented Parsing: disco-dop¶

The aim of this project is to parse discontinuous constituents in natural language using Data-Oriented Parsing (DOP). The grammar is extracted from a treebank of sentences annotated with (discontinuous) phrase-structure trees. Concretely, this project provides a statistical constituency parser with support for discontinuous constituents and Data-Oriented Parsing. Discontinuous constituents are supported through the grammar formalism Linear Context-Free Rewriting System (LCFRS), which is a generalization of Probabilistic Context-Free Grammar (PCFG). Data-Oriented Parsing allows re-use of arbitrary-sized fragments from previously seen sentences using Tree-Substitution Grammar (TSG).
Getting started¶
Parsing a single sentence¶
The parser can be used as an off-the-shelf parser using an existing grammar model,
either one of the grammars used in publications for English, German,
Dutch, and French, or a new model produced by discodop runexp
or discodop grammar param
(more on that below).
To download and extract a grammar:
$ curl -Ssl https://staff.fnwi.uva.nl/a.w.vancranenburgh/grammars/en_ptb.tar | tar -xf -
To parse a single sentence:
$ echo 'Why did the chicken cross the road ?' | discodop parser en_ptb/
parsing 0: Why did the chicken cross the road ?
0.078917 s
(ROOT (SBARQ (SQ (VP (WHADVP (WRB 0=Why)) (VB 4=cross) (NP (DT 5=the) (NN 6=road))) (VBD 1=did) (NP (DT 2=the) (NN 3=chicken))) (. 7=?)))
average time per sentence 0.0789169999999999
unparsed sentences: 0
finished
We can add an extra command to visualize the tree:
$ echo 'Why did the chicken cross the road ?' | discodop parser en_ptb/ | discodop treedraw
[...]
1. (len=8):
ROOT
│
SBARQ
┌─────────┴────────────────────────┐
SQ │
┌───┴───┬─────┐ │
│ │ VP │
┌──── │ ───── │ ────┴──────┬────────┐ │
WHADVP │ NP │ NP │
│ │ ┌───┴─────┐ │ ┌───┴───┐ │
WRB VBD DT NN VB DT NN .
│ │ │ │ │ │ │ │
Why did the chicken cross the road ?
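The indexed leaves in the parser output (e.g. `(WRB 0=Why)`) record each word's position in the original sentence; this is what makes discontinuous constituents representable, since a constituent's leaves need not be adjacent. As a minimal illustration (a standalone sketch, not part of disco-dop), the sentence can be recovered from such a tree by sorting on those indices:

```python
import re

def sentence_from_tree(tree):
    """Recover the sentence from a discbracket-style tree whose leaves
    have the form 'index=word'; the indices give the word order."""
    pairs = re.findall(r'(\d+)=([^\s()]+)', tree)
    return ' '.join(word for _, word in sorted(pairs, key=lambda p: int(p[0])))

tree = ('(ROOT (SBARQ (SQ (VP (WHADVP (WRB 0=Why)) (VB 4=cross) '
        '(NP (DT 5=the) (NN 6=road))) (VBD 1=did) '
        '(NP (DT 2=the) (NN 3=chicken))) (. 7=?)))')
print(sentence_from_tree(tree))  # Why did the chicken cross the road ?
```

Note that the VP covers the non-adjacent words 0 and 4-6, which is exactly what a PCFG cannot express and an LCFRS can.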
Parsing a text¶
Similarly we can parse a whole document. The text needs to be tokenized, for example using ucto, the output of which can be passed to the parser:
$ ucto -L en -n "CONRAD, Joseph - Lord Jim.txt" | discodop parser en_ptb/
To parse sentences from a treebank in bracketed format:
$ discodop treetransforms treebankExample.mrg --inputfmt=bracket --outputfmt=tokens | discodop parser en_ptb/
Inducing a simple grammar¶
Suppose we have the following treebank in the file treebankExample.mrg:
(S (NP (DT The) (NN cat)) (VP (VBP saw) (NP (DT the) (JJ hungry) (NN dog))))
(S (NP (DT The) (NN cat)) (VP (VBP saw) (NP (DT the) (NN dog))))
(S (NP (DT The) (NN mouse)) (VP (VBP saw) (NP (DT the) (NN cat))))
(S (NP (DT The) (NN mouse)) (VP (VBP saw) (NP (DT the) (JJ yellow) (NN cat))))
(S (NP (DT The) (JJ little) (NN mouse)) (VP (VBP saw) (NP (DT the) (NN cat))))
(S (NP (DT The) (NN cat)) (VP (VBP ate) (NP (DT the) (NN dog))))
(S (NP (DT The) (NN mouse)) (VP (VBP ate) (NP (DT the) (NN cat))))
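The simplest grammar that can be read off from such a treebank is a plain treebank PCFG: count every production and divide by the frequency of its left-hand side. The following standalone sketch (it does not use disco-dop) illustrates this relative-frequency estimation on two of the trees above:

```python
from collections import Counter
import re

def productions(tree):
    """Extract (LHS, RHS) productions from one bracketed tree string."""
    tokens = re.findall(r'\(|\)|[^\s()]+', tree)
    prods = []
    def parse(i):
        label = tokens[i + 1]          # token after '(' is the node label
        i += 2
        rhs = []
        while tokens[i] != ')':
            if tokens[i] == '(':
                child, i = parse(i)
            else:                      # a terminal word
                child = tokens[i]
                i += 1
            rhs.append(child)
        prods.append((label, tuple(rhs)))
        return label, i + 1
    parse(0)
    return prods

treebank = [
    '(S (NP (DT The) (NN cat)) (VP (VBP saw) (NP (DT the) (NN dog))))',
    '(S (NP (DT The) (NN mouse)) (VP (VBP saw) (NP (DT the) (NN cat))))',
]
counts = Counter(p for tree in treebank for p in productions(tree))
lhstotals = Counter()
for (lhs, rhs), n in counts.items():
    lhstotals[lhs] += n
# relative-frequency estimate: count(LHS -> RHS) / count(LHS)
pcfg = {prod: n / lhstotals[prod[0]] for prod, n in counts.items()}
print(pcfg[('NP', ('DT', 'NN'))])   # 1.0: every NP here expands to DT NN
```

A DOP grammar generalizes this idea from single productions to arbitrary-sized tree fragments.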
You can view the trees using the following command:
$ discodop treedraw < treebankExample.mrg | less -R
Before inducing a grammar, we need to binarize the treebank:
$ discodop treetransforms --binarize --fmt=bracket treebankExample.mrg /tmp/bintrees
transformed 7 trees with action 'binarize'
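To see what binarization does, here is a standalone sketch (assuming right factoring and no markovization, so the artificial labels keep all remaining siblings) that reproduces the style of disco-dop's artificial labels, X|&lt;...&gt;, for a single n-ary production:

```python
def binarize(label, children):
    """Right-factored binarization of one n-ary production, introducing
    artificial labels of the form 'X|<A,B>'."""
    if len(children) <= 2:
        return [(label, list(children))]
    # first rule: original label rewrites to first child + artificial label
    rules = [(label, [children[0], '%s|<%s>' % (label, ','.join(children[1:]))])]
    # chain of artificial labels, consuming one child at a time
    for i in range(1, len(children) - 2):
        cur = '%s|<%s>' % (label, ','.join(children[i:]))
        nxt = '%s|<%s>' % (label, ','.join(children[i + 1:]))
        rules.append((cur, [children[i], nxt]))
    # last artificial label rewrites to the final two children
    last = '%s|<%s>' % (label, ','.join(children[-2:]))
    rules.append((last, list(children[-2:])))
    return rules

for lhs, rhs in binarize('NP', ['DT', 'JJ', 'NN']):
    print(lhs, '->', ' '.join(rhs))
# NP -> DT NP|<JJ,NN>
# NP|<JJ,NN> -> JJ NN
```

With markovization (the h and v parameters of discodop treetransforms), only a limited number of siblings and ancestors would be recorded in these artificial labels.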
We can then induce a Double-DOP grammar with the following command:
$ discodop grammar doubledop --inputfmt=bracket /tmp/bintrees /tmp/example
extracting recurring fragments
[...]
found 34 fragments
wrote backtransform to /tmp/example.backtransform
wrote grammar to /tmp/example.rules and /tmp/example.lex.
[...]
This produces a grammar consisting of a series of text files in /tmp, which can be used as follows:
$ echo 'The cat saw the hungry dog' | discodop parser --simple /tmp/example.rules /tmp/example.lex --bt=/tmp/example.backtransform -s S -b 5
parsing 0: The cat saw the hungry dog
(S (NP (DT 0) (NN 1)) (VP (VBP 2) (NP (DT 3) (NP|<JJ,NN> (JJ 4) (NN 5)))))
0.0 s
[...]
Note that the -s option should specify the root label of the trees in the treebank.
Run an experiment¶
To parse with large treebanks, it may be necessary to apply pruning, and it
is useful to specify all the preprocessing steps and other configuration for an
experiment in a single file. This can be done with the discodop runexp
command.
To run an end-to-end experiment from grammar extraction to evaluation on a test
set, make a copy of the file sample.prm
and edit its parameters. For
example:
stages=[
dict(name='pcfg', mode='pcfg',
split=True, markorigin=True),
dict(name='plcfrs', mode='plcfrs',
prune='pcfg', splitprune=True, k=1000),
dict(name='dop', mode='plcfrs',
prune='plcfrs', k=50, m=1000,
dop='doubledop',
estimator='rfe', objective='mpp')
],
traincorpus=dict(
path='alpinosample.export', encoding='utf-8',
numsents=3, maxwords=100),
testcorpus=dict(
path='alpinosample.export', encoding='utf-8',
numsents=3, maxwords=100, skiptrain=False),
postagging=dict(
method='unknownword', model='4',
unknownthreshold=1, openclassthreshold=50,
simplelexsmooth=True),
binarization=dict(
method='default', factor='right',
h=1, v=1),
See the documentation on the available parameters. The parameter file can then be used by executing:
$ discodop runexp filename.prm
This will create a new directory with the base name of the parameter file, i.e., filename/ in this case. This directory must not exist yet, to avoid accidentally overwriting previous results. The directory will contain the grammar rules and lexicon in a text format, as well as the parsing results and the gold standard file in Negra’s export format.

Note that there is an option to utilize multiple processor cores by launching a specific number of processes. This greatly speeds up parsing, but note that for a nontrivial DOP grammar, each process may require anywhere from 4 to 16 GB of memory.
Corpora can be read in Negra’s export format, or in the bracketed Penn treebank format. Access to the Negra corpus can be requested for non-commercial purposes, while the Tiger corpus is freely available for download for research purposes.
Alternatively, it is possible to achieve similar results with separate commands:
$ discodop grammar param filename.prm filename/
$ discodop treetransforms alpinosample.export sentences.txt --inputfmt=export --outputfmt=tokens
$ discodop parser filename/ sentences.txt --fmt=export >filename/dop.export
$ discodop eval alpinosample.export filename/dop.export proper.prm
Overview¶
Command-line tools¶
Aside from the parser there are some standalone tools, invoked as discodop <cmd>:
- fragments
  Finds recurring or common fragments in one or more treebanks. It can be used with discontinuous as well as Penn-style bracketed treebanks. Example:
  $ discodop fragments wsj-02-21.mrg > wsjfragments.txt
  Specify the option --numproc n to use multiple processes, as with runexp.
- eval
  Discontinuous evaluation. Reports F-scores and other metrics. Accepts EVALB parameter files:
  $ discodop eval sample/gold.export sample/dop.export proper.prm
- treetransforms
  A command line interface to perform transformations on treebanks such as binarization.
- grammar
  A command line interface to read off grammars from (binarized) treebanks.
- treedraw
  Visualize (discontinuous) trees. Command-line interface:
  $ discodop treedraw < negra-corpus.export | less -RS
- treesearch
  Search through treebanks with queries.
- parser
  A basic command line interface to the parser comparable to bitpar. Reads grammars from text files.
- demos
  Contains examples of various formalisms encoded in LCFRS grammars.
- gen
  An experiment in generation with LCFRS.
For instructions, pass the --help option to a command or see the links above.
Web interfaces¶
There are three web-based tools in the web/ directory. These require some extra libraries such as Flask and pandas to be installed:
pip install --user -r web/requirements.txt
- parse.py
  A web interface to the parser. Expects a series of grammars in subdirectories of web/grammars/, each containing grammar files as produced by running discodop runexp. Download grammars for English, German, Dutch, and French, as used in a forthcoming paper.
- treesearch.py
  A web interface for searching through treebanks. Expects one or more treebanks in the directory web/corpus/ (sample included). See the treesearch manual for instructions on corpus preparation. To enable basic authentication, put a text file web/treesearchpasswd.txt with lines of the form username: password (see the example file; only use over HTTPS).
- treedraw.py
  A web interface for drawing discontinuous trees in various formats.


Reference¶
Command line options¶
eval¶
Evaluation of (discontinuous) parse trees, following EVALB as much as possible.
Usage: discodop eval <gold> <parses> [param] [options]
where gold and parses are files with parse trees, and param is an EVALB parameter file.
Options¶
--cutofflen=n | Overrides the sentence length cutoff of the parameter file. |
--verbose | Print table with per sentence information. |
--debug | Print debug information with per sentence bracketings etc. |
--disconly | Only evaluate discontinuous bracketings (affects bracketing scores: precision, recall, f-measure, exact match). |
--goldfmt, --parsesfmt=<export|bracket|discbracket|tiger|alpino|dact> | Specify corpus format [default: export]. |
--fmt=<...> | Shorthand for setting both --goldfmt and --parsesfmt . |
--goldenc, --parsesenc=<utf-8|iso-8859-1|...> | Specify encoding [default: utf-8]. |
--la | Enable leaf-ancestor evaluation. |
--ted | Enable tree-edit distance evaluation. NB: it is not clear whether this score is applicable to discontinuous trees. |
--headrules=x | Specify file with rules for head assignment of constituents that do not already have a child marked as head; this enables dependency evaluation. |
--functions=x | ‘remove’=default: strip functions off labels, ‘leave’: leave syntactic labels as is, ‘add’: evaluate both syntactic categories and functions, ‘replace’: only evaluate grammatical functions. |
--morphology=x | ‘no’=default: only evaluate POS tags, ‘add’: concatenate morphology tags to POS tags, ‘replace’: replace POS tags with morphology tags, ‘between’: add morphological node between POS tag and word. |
Function tags¶
If the parses file contains function tags, these are evaluated with the non-null metric of Blaheta & Charniak (2000), which scores function tags of correctly parsed bracketings. Multiple tags on a constituent are scored separately. We additionally consider function tags on preterminals; this does not change the evaluation for the Penn treebank, as it does not contain function tags on preterminals. A more stringent metric is to combine phrasal & function tags with the option --functions=add, which incorporates function tags in the bracketing scores.
Parameter file¶
See the reference documentation on evaluation parameter files.
The commonly used parameter file for Penn-treebank parsing is COLLINS.prm, distributed as part of EVALB. The file proper.prm in the code repository is a version adapted for discontinuous parsing.
Examples¶
Discontinuous parsing:
$ discodop eval negra-parses.export negra-dev.export proper.prm
Continuous parsing:
$ discodop eval wsj-24.mrg parses.mrg COLLINS.prm --fmt=bracket
fragments¶
Extract recurring tree fragments from constituency treebanks.
discodop fragments <treebank1> [treebank2] [options]
discodop fragments --batch=<dir> <treebank1> <treebank2>... [options]
If only one treebank is given, extract fragments in common between its pairs of trees. If two treebanks are given, extract fragments in common between the trees of the first & second treebank. Input is in Penn treebank format (S-expressions), one tree per line. Output contains lines of the form “tree<TAB>frequency”. Frequencies refer to the first treebank by default. Output is sent to stdout; to save the results, redirect to a file.
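The output format is easy to consume in post-processing. A minimal standalone sketch (not part of disco-dop) that reads such output into a dictionary of counts:

```python
def read_fragments(lines):
    """Parse 'tree<TAB>frequency' lines, as produced by discodop fragments,
    into a dict mapping fragment strings to integer counts."""
    frags = {}
    for line in lines:
        # rsplit on the last tab also handles the discontinuous output
        # format tree<TAB>sentence<TAB>frequency (keying on tree+sentence)
        tree, freq = line.rstrip('\n').rsplit('\t', 1)
        frags[tree] = int(freq)
    return frags

sample = ['(NP (DT the) (NN ))\t3', '(S (NP ) (VP ))\t5']
print(read_fragments(sample))
# {'(NP (DT the) (NN ))': 3, '(S (NP ) (VP ))': 5}
```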
Options:¶
--fmt=<export|bracket|discbracket|tiger|alpino|dact> | When the format is not bracket, work with discontinuous trees; output is in discbracket format: tree<TAB>sentence<TAB>frequency, where tree has indices as leaves, referring to elements of sentence, a space-separated list of words. |
--numtrees=n | only read first n trees from first treebank |
--encoding=x | specify treebank encoding, e.g. utf-8 [default], iso-8859-1, etc. |
-o file | Write output to file instead of stdout. |
--complete | treebank1 is a list of fragments (needle), result is the
indices / counts of these fragments in treebank2 (haystack). |
--batch=dir | enable batch mode; any number of treebanks > 1 can be given;
first treebank (A) will be compared to each (B) of the rest.
Results are written to filenames of the form dir/A_B .
Counts/indices are from B. |
--indices | report sets of 0-based indices where fragments occur instead of frequencies. |
--relfreq | report relative frequencies wrt. root node of fragments of the form n/m . |
--approx | report counts of occurrence as maximal fragment (lower bound) |
--nofreq | do not report frequencies. |
--cover=<n[,m]> | include all non-maximal/non-recurring fragments up to depth n of first treebank; optionally, limit number of substitution sites to m (default is unlimited). |
--twoterms=x | only extract fragments with at least two lexical terminals,
one of which has a POS tag which matches the given regex.
For example, to match POS tags of content words in the
Penn treebank: ^(?:NN(?:[PS]|PS)?|(?:JJ|RB)[RS]?|VB[DGNPZ])$ |
--adjacent | only compare pairs of adjacent trees (i.e., sent no. n, n + 1 ). |
--debin | debinarize fragments. Since fragments may contain incomplete binarized constituents, the result may still contain artificial nodes from the binarization in the root or frontier non-terminals of the fragments. |
--alt | alternative output format: (NP (DT "a") NN)
default: (NP (DT a) (NN )) |
--numproc=n | use n independent processes, to enable multi-core usage
(default: 1); use 0 to detect the number of CPUs. |
--debug | extra debug information, ignored when numproc > 1 . |
--quiet | disable all messages. |
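The regular expression given for --twoterms can be tested on its own. This standalone snippet shows which Penn treebank POS tags it accepts (note that bare VB, the base form, is not matched):

```python
import re

# POS tags of content words in the Penn treebank, as given for --twoterms
contentpos = re.compile(r'^(?:NN(?:[PS]|PS)?|(?:JJ|RB)[RS]?|VB[DGNPZ])$')

for tag in ['NN', 'NNS', 'NNP', 'JJR', 'VBD', 'DT', 'IN', 'VB']:
    print(tag, bool(contentpos.match(tag)))
```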
gen¶
Generate random sentences with a PLCFRS or PCFG. Reads grammar from a text file in PLCFRS or bitpar format.
discodop gen [--verbose] <rules> <lexicon>
discodop gen --test
Grammar is assumed to be in utf-8; may be gzip’ed (.gz extension).
grammar¶
Read off grammars from treebanks.
Usage, one of:
discodop grammar param <parameter-file> <output-directory>
discodop grammar <type> <input> <output> [options]
discodop grammar info <rules-file>
discodop grammar merge (rules|lexicon|fragments) <input1> <input2>... <output>
The first format extracts a grammar according to a parameter file. See the documentation on parameter files.
The second format is for extracting simple grammars (e.g., no unknown word handling or coarse-to-fine parsing).
type is one of:
pcfg: | Probabilistic Context-Free Grammar (treebank grammar). |
plcfrs: | Probabilistic Linear Context-Free Rewriting System (discontinuous treebank grammar). |
ptsg: | Probabilistic Tree-Substitution Grammar. |
dopreduction: | All-fragments PTSG using Goodman’s reduction. |
doubledop: | PTSG from recurring fragments. |
dop1: | PTSG from all fragments up to given depth. |
input is a binarized treebank, or in the ptsg case, weighted fragments in the same format as the output of the discodop fragments command; input may contain discontinuous constituents, except for the pcfg case.
output is the base name for the filenames to write the grammar to; the filenames will be <output>.rules and <output>.lex.
Other subcommands:
info: | Print statistics for PLCFRS/bitpar grammar rules. |
merge: | Interpolate given sorted grammars into a single grammar. Input can be a rules, lexicon or fragment file. |
NB: both the info and merge commands expect grammars to be sorted by LHS, such as the ones created by this tool.
Options¶
--inputfmt=<export|bracket|discbracket|tiger|alpino|dact> | The treebank format [default: export]. |
--inputenc=<utf-8|iso-8859-1|...> | Treebank encoding [default: utf-8]. |
--numproc=<1|2|...> | Number of processes to start [default: 1]. Only relevant for Double-DOP fragment extraction. |
--gzip | compress output with gzip, view with zless &c. |
--packed | use packed graph encoding for DOP reduction. |
--bitpar | produce an unbinarized grammar for use with bitpar. |
-s X | start symbol to use for PTSG. |
--dopestimator=<rfe|ewe|shortest|...> | The DOP estimator to use with dopreduction/doubledop [default: rfe]. |
--maxdepth=N, --maxfrontier=N | When extracting a ‘dop1’ grammar, the limit on which fragments are extracted; 3 or 4 is a reasonable depth limit. |
Grammar formats¶
When a PCFG is requested, or the input format is bracket (Penn format), the output will be in bitpar format. Otherwise the grammar is written as a PLCFRS.
The encoding of the input treebank may be specified. Output encoding will be
ASCII for the rules, and UTF-8 for the lexicon.
See the documentation on grammar formats.
Examples¶
Extract a Double-DOP grammar given binarized trees:
$ discodop grammar doubledop --inputfmt=bracket /tmp/bintrees /tmp/example
Extract grammars specified in a parameter file:
$ discodop grammar param filename.prm /tmp/example
parser¶
A command line interface for parsing sentences with an existing grammar.
discodop parser [options] <grammar/> [input files]
discodop parser --simple [options] <rules> <lexicon> [input [output]]
grammar/ is a directory with a model produced by discodop runexp.
When no filename is given, input is read from standard input and the results
are written to standard output. Input should contain one sentence per line
with space-delimited tokens. Output consists of bracketed trees in the selected format. Files must be encoded in UTF-8.
General options¶
-x | Input is one token per line, sentences separated by two newlines (like bitpar). |
-b k | Return the k-best parses instead of just 1. |
--prob | Print probabilities as well as parse trees. |
--tags | Tokens are of the form word/POS ; give both to parser. |
--sentid | Sentences in input are prefixed by IDs, delimited by | . |
--fmt=<export|bracket|discbracket|alpino|conll|mst|wordpos> | Format of output [default: discbracket]. |
--numproc=k | Launch k processes, to exploit multiple cores. |
--verbosity=x | 0 <= x <= 4. Same effect as verbosity in parameter file. |
--simple | Parse with a single grammar and input file; similar interface
to bitpar. The files rules and lexicon define a binarized
grammar in bitpar or PLCFRS format. |
Options for simple mode¶
-s x | Use x as start symbol instead of default TOP . |
--bt=file | Apply backtransform table to recover TSG derivations. |
--mpp=k | By default, the output consists of derivations, with the most probable derivation (MPD) ranked highest. With a PTSG such as DOP, it is possible to aim for the most probable parse (MPP) instead, whose probability is the sum of any number of the k-best derivations. |
--obj=<mpd|mpp|mcc|shortest|sl-dop> | Objective function to maximize [default: mpd]. |
-m x | Use x derivations to approximate objective functions; mpd and shortest require only 1. |
--bitpar | Use bitpar to parse with an unbinarized grammar. |
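The difference between the mpd and mpp objectives can be illustrated independently of the parser: with a tree-substitution grammar, several derivations can yield the same tree, and MPP sums their probabilities before ranking. A toy standalone example with hypothetical derivation probabilities:

```python
from collections import defaultdict

# hypothetical k-best derivations: (parse tree, derivation probability);
# in a TSG, distinct derivations can yield the same tree
derivations = [
    ('(S (NP x) (VP y))', 0.4),
    ('(S (NP x) (VP y))', 0.35),  # another derivation of the same tree
    ('(S (X x) (Y y))',   0.45),
]

# MPD: the single most probable derivation wins
mpd = max(derivations, key=lambda d: d[1])[0]

# MPP: sum derivation probabilities per tree, then pick the best tree
treeprob = defaultdict(float)
for tree, p in derivations:
    treeprob[tree] += p
mpp = max(treeprob, key=treeprob.get)

print('MPD:', mpd)   # (S (X x) (Y y)), with p=0.45
print('MPP:', mpp)   # (S (NP x) (VP y)), with p=0.4+0.35=0.75
```

This is why -m (the number of derivations used to approximate the objective) matters for mpp but not for mpd.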
Examples¶
To parse a single sentence:
$ echo 'Why did the chicken cross the road ?' | discodop parser en_ptb/
Tokenize and parse a text:
$ ucto -L en -n "CONRAD, Joseph - Lord Jim.txt" | discodop parser en_ptb/
Parse sentences from a treebank in bracketed format:
$ discodop treetransforms treebankExample.mrg --inputfmt=bracket --outputfmt=tokens | discodop parser en_ptb/
runexp¶
Run an experiment given a parameter file. Does grammar extraction, parsing, and evaluation.
Usage: discodop runexp <parameter file> [--rerun]
If a parameter file is given, an experiment is run.
Given the parameter file sample.prm, a new directory will be created with the base name of the parameter file, i.e., sample/ in this case. This directory must not exist yet, to avoid accidentally overwriting previous results. To this directory the grammar rules and lexicon will be written in a text format, as well as the parsing results and the gold standard parse trees in the same format.
To repeat an experiment with an existing grammar, pass the option --rerun. The directory with the name of the parameter file without extension must exist in the current path; its results will be overwritten.
Parameter file and example invocation¶
See the reference documentation on parameter files. A minimal parameter file:
stages=[
dict(
name='pcfg', # an identifier, used as filename when writing results
mode='pcfg', # use the PCFG CKY parser
),
],
evalparam='proper.prm', # EVALB-style parameter file
# train / test sets
corpusfmt='bracket', # choices: export, bracket, discbracket, alpino, tiger
traincorpus=dict(
path='ptb-02-21.mrg',
maxwords=100, # max number of words for sentences in train corpus
),
testcorpus=dict(
path='ptb-24.mrg',
maxwords=100, # max number of words for sentences in test corpus
),
See sample.prm in the code repository for a more extensive example. The file proper.prm can also be found there; it is a version of the COLLINS.prm file typically used with EVALB, adapted for discontinuous parsing. Ensure that all referenced files are in the current directory or specified with a path, and run as:
$ discodop runexp sample.prm
Parsing statistics¶
After running discodop runexp
, a number of additional files are produced
with parsing statistics:
output.log : | a log file with all messages displayed during parsing. This file contains ANSI codes for colors, so view it with less -R. |
pcdist.txt : | shows the distribution of parsing complexity (cf. Gildea, NAACL 2010 for the definition) among the grammar rules. |
stats.tsv : | |
treedraw¶
Visualize parse trees with ASCII art or LaTeX/SVG.
Usage: discodop treedraw [<treebank>...] [options]
--fmt=<export|bracket|discbracket|tiger|alpino|dact> | Specify input format [default: export]. |
--encoding=enc | Specify input encoding [default: utf-8]. |
--functions=x | see the corresponding option of the eval command. |
--morphology=x | see the corresponding option of the eval command. |
--abbr | abbreviate labels longer than 5 characters. |
--plain | disable ANSI/HTML colors. |
--frontier | only show terminal and non-terminal leaves. |
--output=x | |
-n, --numtrees=x | only display the first x trees from the input. |
If no treebank is given, input is read from standard input and the format is detected; this has the advantage that the input is read incrementally.
If more than one treebank is specified, trees will be displayed in parallel.
The ANSI colors can be viewed in a terminal with less -RS
.
Examples¶
View trees in a treebank:
$ discodop treedraw < wsj-02-21.mrg | less -RS
Apply a transformation and view the result:
$ discodop treetransforms --binarize --fmt=bracket wsj-02-21.mrg | discodop treedraw | less -RS
treesearch¶
Search through treebanks with queries.
Usage: discodop treesearch [--engine=X] [-t|-s|-c] <query> <treebank>...
Options:
-e X, --engine=X | Select query engine; possible options: frag (tree fragments), regex (regular expressions), tgrep2 (TGrep2 queries), xpath (XPath queries); see the sections below. |
-c, --counts | Report counts; multiple queries can be given. |
-s, --sents | Output sentences (default); multiple queries can be given. |
-t, --trees | Output visualizations of trees. |
-b, --brackets | Output raw trees in the original corpus format. |
--indices | Report sentence numbers of matches for each corpus. |
--breakdown | Report counts of types that match the query. |
--csv | Report counts in CSV format instead of the default flat format. |
-f, --file | Read queries (one per line) from the filename given as first argument. |
--slice=<N:M> | Only search in sentences N to M of each file; either N or M may be left out; slice indexing is 1-based and inclusive. |
-m N, --max-count=N | Stop after finding N matches; 0 for no limit. |
-n, --line-number | Prefix each line of output with the sentence number within its input file. |
-o, --only-matching | Only output the matching portion with --sents, --trees, and --brackets. |
-i, --ignore-case | Ignore case in regex queries. |
-M X, --macros=X | A file with macros. |
--numproc=N | Use N independent processes, to enable multi-core usage (default: use all detected cores). |
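Note that the --slice convention differs from Python's: it is 1-based and inclusive at both ends. A small illustrative helper (hypothetical, not part of disco-dop) that maps an N:M specification onto Python indexing:

```python
def onebased_slice(spec, items):
    """Interpret an N:M slice as in discodop treesearch --slice:
    1-based and inclusive at both ends; either side may be left out."""
    n, _, m = spec.partition(':')
    start = int(n) - 1 if n else 0       # 1-based start -> 0-based
    stop = int(m) if m else len(items)   # inclusive end == exclusive 0-based
    return items[start:stop]

sents = ['s1', 's2', 's3', 's4', 's5']
print(onebased_slice('2:4', sents))   # ['s2', 's3', 's4']
print(onebased_slice(':2', sents))    # ['s1', 's2']
```

Contrast this with discodop treetransforms --slice, which uses Python's 0-based, exclusive convention.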
Tree fragments¶
Search for literal matches of tree fragments, i.e., one or more connected grammar productions.
Supports (binarized, optionally discontinuous) treebanks with the extensions .mrg (bracket format), .dbr (discbracket format), and .export (export format).
regular bracket trees:
(S (NP Mary) (VP (VB is) (JJ rich)) (. .))
(S (NP ) (VP (VB is) (JJ )) (. .))
discontinuous trees:
(S (VP (VB 0=is) (JJ 2=)) (NP 1=) (? 3=?))
(VP (VB 0=is) (JJ 2=rich))
More information on the format of fragments: file format documentation
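As an illustration of the leaf notation (index=word, with the word left empty at frontier substitution sites), here is a standalone sketch (not part of disco-dop) that maps leaf indices to words in a discbracket fragment:

```python
import re

def discbracket_leaves(fragment):
    """Map leaf indices to words in a discbracket fragment such as
    '(VP (VB 0=is) (JJ 2=rich))'; frontier leaves like '(JJ 2=)'
    get None as their word."""
    leaves = {}
    for idx, word in re.findall(r'(\d+)=([^\s()]*)', fragment):
        leaves[int(idx)] = word or None
    return leaves

print(discbracket_leaves('(S (VP (VB 0=is) (JJ 2=)) (NP 1=) (? 3=?))'))
# {0: 'is', 2: None, 1: None, 3: '?'}
```

The gap between indices 0 and 2 in the VP is what makes the fragment discontinuous.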
This query engine only works on binarized trees, and each node has a single label that is matched on an all-or-nothing basis. Treebanks that are not already binarized will be binarized automatically; treebanks may be discontinuous. It is useful to perform the binarization in advance, so that specific options and tree transformations can be applied, e.g., to handle punctuation and to include functional or morphological tags.
A cached copy of the treebank is created in an indexed format; given filename.mrg, this indexed version is stored as filename.mrg.ct (in the same directory). Another file, treesearchvocab.idx, contains a global index of productions; this index should automatically be recreated when the list of files changes or any file is updated.
For the treesearch web interface, these indexed files need to be created in advance.
This can be done by running a dummy query on a set of files:
$ discodop treesearch -e frag '(prepare corpus)' *.dbr
Regular expressions¶
In contrast with the other engines, regular expressions treat the input as plain text files, one sentence per line. The options --trees and --brackets are not applicable. The syntax is Python’s re syntax.
Regular expressions can contain both special and ordinary characters. Most ordinary characters, like “A”, “a”, or “0”, are the simplest regular expressions; they simply match themselves.
The special characters are:
"." Matches any character except a newline.
"^" Matches the start of the string.
"$" Matches the end of the string or just before the newline at
the end of the string.
"*" Matches 0 or more (greedy) repetitions of the preceding RE.
Greedy means that it will match as many repetitions as possible.
"+" Matches 1 or more (greedy) repetitions of the preceding RE.
"?" Matches 0 or 1 (greedy) of the preceding RE.
*?,+?,?? Non-greedy versions of the previous three special characters.
{m,n} Matches from m to n repetitions of the preceding RE.
{m,n}? Non-greedy version of the above.
"\\" Either escapes special characters or signals a special sequence.
[] Indicates a set of characters.
A "^" as the first character indicates a complementing set.
"|" A|B, creates an RE that will match either A or B.
(...) Matches the RE inside the parentheses.
The contents can be retrieved or matched later in the string.
(?:...) Non-grouping version of regular parentheses.
(?i) Perform case-insensitive matching.
The special sequences consist of “\” and a character from the list below. If the ordinary character is not on the list, then the resulting RE will match the second character:
\A Matches only at the start of the string.
\Z Matches only at the end of the string.
\b Matches the empty string, but only at the start or end of a word.
\B Matches the empty string, but not at the start or end of a word.
\d Matches any decimal digit.
\D Matches any non-digit character.
\s Matches any whitespace character.
\S Matches any non-whitespace character.
\w Matches any alphanumeric character.
\W Matches the complement of \w.
\\ Matches a literal backslash.
More information: https://docs.python.org/3/library/re.html#regular-expression-syntax
This query engine creates a cached index of line numbers in all files
treesearchline.idx
; this index should automatically be recreated when
the list of files changes or any file is updated.
TGrep2 syntax overview¶
Only treebanks in bracket format are supported, but trees can be n-ary. Note that the tgrep2 command needs to be installed. A version with small improvements is available from https://github.com/andreasvc/tgrep2
TGrep2 operators:
A < B A is the parent of (immediately dominates) B.
A > B A is the child of B.
A <N B B is the Nth child of A (the first child is <1).
A >N B A is the Nth child of B (the first child is >1).
A <, B Synonymous with A <1 B.
A >, B Synonymous with A >1 B.
A <-N B B is the Nth-to-last child of A (the last child is <-1).
A >-N B A is the Nth-to-last child of B (the last child is >-1).
A <- B B is the last child of A (synonymous with A <-1 B).
A >- B A is the last child of B (synonymous with A >-1 B).
A <` B B is the last child of A (also synonymous with A <-1 B).
A >` B A is the last child of B (also synonymous with A >-1 B).
A <: B B is the only child of A.
A >: B A is the only child of B.
A << B A dominates B (A is an ancestor of B).
A >> B A is dominated by B (A is a descendant of B).
A <<, B B is a left-most descendant of A.
A >>, B A is a left-most descendant of B.
A <<` B B is a right-most descendant of A.
A >>` B A is a right-most descendant of B.
A <<: B There is a single path of descent from A and B is on it.
A >>: B There is a single path of descent from B and A is on it.
A . B A immediately precedes B.
A , B A immediately follows B.
A .. B A precedes B.
A ,, B A follows B.
A $ B A is a sister of B (and A != B).
A $. B A is a sister of and immediately precedes B.
A $, B A is a sister of and immediately follows B.
A $.. B A is a sister of and precedes B.
A $,, B A is a sister of and follows B.
A = B A is also matched by B.
More information: http://tedlab.mit.edu/~dr/Tgrep2/
TGrep2 uses its own indexed file format. These files are automatically created when using this query engine. Given a file example.mrg, the file example.mrg.t2c.gz is created (in the same directory).
XPath syntax examples¶
Search through treebanks in XML format with XPath; treebanks must be in dact format. Note: XPath support depends on the alpinocorpus library; see https://github.com/rug-compling/alpinocorpus-python
Find a particular word:
//node[@word='loopt']
This is case-sensitive.
If you want to find all inflectional variants of the verb lopen, do:
//node[@lemma='lopen']
To find main clauses:
//node[@cat="smain"]
Finite subordinate clauses:
//node[@cat="cp" and node[@rel="body" and @cat="ssub"]]
This locates cp nodes with an ssub child that has body as function tag (relation).
General XPath overview: https://en.wikipedia.org/wiki/XPath Using XPath on Alpino treebanks: http://rug-compling.github.io/dact/cookbook/
To create files in dact format, the alpinocorpus tools may be used. Alternatively, discodop treetransforms can be used:
$ discodop treetransforms --input=alpino --output=dact 'mycorpus/*.xml' mycorpus.dact
Examples¶
Show trees that contain an NP modified by a PP:
$ discodop treesearch --trees -e frag '(NP (NP ) (PP ))' wsj-02-21.mrg
Same query, but only show matching terminals:
$ discodop treesearch --only-matching --sents -e frag '(NP (NP ) (PP ))' ~/data/wsj-02-21.mrg
Perform a large number of regex queries from a file, and store counts in a CSV file:
$ discodop treesearch --csv --counts -e regex --file queries.txt corpus.txt > results.csv
treetransforms¶
Treebank binarization and conversion
Usage: discodop treetransforms [input [output]] [options]
where input and output are treebanks; standard input/output is used if not given.
Main transformation options¶
The following transforms are applied in the order given on the command line.
--introducepreterminals | Add preterminals to terminals without a dedicated preterminal. |
--transforms=<NAME1,NAME2,...> | Apply specific treebank transforms; available presets:
negra, wsj, alpino, green2013ftb, km2003wsj,
km2003simple, fraser2013tiger, lassy, lassy-func
For details cf. source of discodop.treebanktransforms module. |
--reversetransforms=<NAME1,NAME2,...> | Undo specified transforms; specify transforms in original order. |
--binarize [-h x] [-v x] [--factor=<left|right>] [...] | Markovized binarization; also see --headrules and other options below. |
--optimalbinarize [-h x] [-v x] | Binarization that minimizes LCFRS fan-out/complexity. |
--unbinarize | Restore original n-ary trees. |
--splitdisc [--markorigin] | Split discontinuous nodes into several continuous nodes. |
--mergedisc | Reverse the node splitting operation. |
Other options¶
--inputfmt=<export|bracket|discbracket|tiger|alpino|dact> | Input treebank format [default: export]. |
--outputfmt=<export|bracket|discbracket|dact|conll|mst|tokens|wordpos> | Output treebank format [default: export].
Selecting the formats conll or mst results in an
unlabeled dependency conversion and requires the use of
heuristic head rules (--headrules ), to ensure that all
constituents have a child marked as head. |
--fmt=x | Shortcut to specify both input and output format. |
--inputenc, --outputenc, --enc=<utf-8|iso-8859-1|...> | Treebank encoding [default: utf-8]. |
--slice=<n:m> | select a range of sentences from input starting with n, up to but not including m; as in Python, n or m can be left out or negative, and the first index is 0. |
--renumber | Replace sentence IDs with numbers starting from 1, padded with 8 spaces. |
--sentid | With ‘tokens’ or ‘wordpos’ output format, prefix lines with identifiers of the form ID| . |
--maxlen=n | only select sentences with up to n tokens. |
--punct=x | |
--functions=x | |
--morphology=x | |
--lemmas=x | |
--ensureroot=x | add root node labeled x to trees if not already present. |
--removeempty | remove empty / -NONE- terminals. |
--factor=<left|right> | specify left- or right-factored binarization [default: right]. |
-h n | horizontal markovization. default: infinite (all siblings) |
-v n | vertical markovization. default: 1 (immediate parent only) |
--headrules=x | turn on head finding; turns on head-outward binarization.
reads rules from file x (e.g., “negra.headrules”). |
--markhead | include label of the head child in all auxiliary labels of binarization. |
--direction | mark direction when using head-outward binarization. |
--labelfun=x | x is a Python lambda function that takes a node and returns
a label to be used for markovization purposes. For example,
to get labels without state splits, pass this function:
'lambda n: n.label.split("^")[0]' |
--leftunary | make initial / final productions of binarized constituents |
--rightunary | ... unary productions. |
--tailmarker | mark rightmost child (the head if headrules are applied), to
avoid cyclic rules when --leftunary and --rightunary
are used. |
Example¶
Binarize a treebank:
$ discodop treetransforms --binarize --fmt=bracket treebankExample.mrg /tmp/bintrees
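To make the binarization options concrete, here is a self-contained sketch of right-factored binarization with horizontal markovization, over trees written as nested lists. The auxiliary-label format and the function itself are illustrative assumptions; they do not reproduce discodop's exact labels.

```python
def binarize(tree, h=1):
    """Right-factored binarization with horizontal markovization (sketch).

    Trees are nested lists [label, child1, child2, ...]; strings are
    terminals. Auxiliary nodes get labels like 'S|<B>' recording up
    to h sibling labels."""
    if isinstance(tree, str):  # terminal
        return tree
    label = tree[0]
    children = [binarize(child, h) for child in tree[1:]]
    if len(children) <= 2:
        return [label] + children
    # Peel off children from the left, chaining auxiliary nodes.
    result = [label, children[0]]
    node = result
    for i in range(1, len(children) - 1):
        horiz = ','.join(c if isinstance(c, str) else c[0]
                         for c in children[i:i + h])
        aux = ['%s|<%s>' % (label, horiz), children[i]]
        node.append(aux)
        node = aux
    node.append(children[-1])
    return result

tree = ['S', ['A', 'a'], ['B', 'b'], ['C', 'c'], ['D', 'd']]
print(binarize(tree))
# ['S', ['A', 'a'], ['S|<B>', ['B', 'b'], ['S|<C>', ['C', 'c'], ['D', 'd']]]]
```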
A parser for Probabilistic Linear Context-Free Rewriting Systems (PLCFRS) and Probabilistic Context-Free Grammars (PCFG), as well as facilities to extract and parse with data-oriented parsing (DOP) grammars.
Usage: discodop <command> [arguments]
Command is one of:
runexp | Run experiment: grammar extraction, parsing & evaluation. |
fragments | Extract recurring fragments from treebanks. |
eval | Evaluate discontinuous parse trees; similar to EVALB. |
treetransforms | Apply tree transformations and convert between formats. |
treedraw | Visualize (discontinuous) trees. |
treesearch | Search through treebanks with queries. |
grammar | Read off grammars from treebanks. |
parser | Simple command line parser. |
gen | Generate sentences from a PLCFRS. |
demos | Show some demonstrations of formalisms encoded in LCFRS. |
for additional instructions issue: discodop <command> --help
or refer to the man page discodop-<command>
.
Parser parameters¶
A parser is defined by a sequence of stages, and a set of global options:
stages=[
stage1,
stage2,
],
corpusfmt='...',
traincorpus=dict(...),
testcorpus=dict(...),
binarization=dict(...),
key1=val1,
key2=val2,
The parameters consist of a Python expression, implicitly surrounded by
'dict('
and ')'
. Note that the key=value
pairs are separated by commas.
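As a concrete (hypothetical) illustration of the implicit dict(...) wrapping, a parameter file body can be evaluated like this; the stage names and values below are made up for the example:

```python
# Hypothetical parameter file body; in practice this text is read
# from a file. Wrapping it in 'dict(' ... ')' yields the parameters.
paramtext = """
stages=[
    dict(name='pcfg', mode='pcfg'),
    dict(name='plcfrs', mode='plcfrs'),
],
corpusfmt='export',
numproc=1,
"""
params = eval('dict(%s)' % paramtext)
print(params['corpusfmt'])                        # export
print([stage['name'] for stage in params['stages']])  # ['pcfg', 'plcfrs']
```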
Corpora¶
corpusfmt: | The corpus format; choices: |
traincorpus: | a dictionary with the following keys: |
testcorpus: | a dictionary with the following keys: |
Binarization¶
binarization: | a dictionary with the following keys: |
Stages¶
Through the use of stages it is possible to run multiple parsers on the same test set, or to exploit coarse-to-fine pruning.
A stage has the form:
dict(
key1=val1,
key2=val2,
...
)
Where the keys and values are:
name: | identifier, used for filenames |
mode: | The type of parser to use |
prune: | specify the name of a previous stage to enable coarse-to-fine pruning. |
split: | split disc. nodes |
splitprune: | treat |
markorigin: | mark origin of split nodes: |
k: | pruning parameter: |
kbest: | extract m-best derivations from chart |
sample: | sample m derivations from chart |
m: | number of derivations to sample / enumerate. |
binarized: | when using |
dop: | enable DOP mode: |
estimator: | DOP estimator. Choices: |
objective: | Objective function to choose DOP parse tree. Choices: |
sldop_n: | When using sl-dop or sl-dop-simple, number of most likely parse trees to consider. |
maxdepth: | with |
maxfrontier: | with |
collapse: | apply a multilevel coarse-to-fine preset. values are of the form |
packedgraph: | use packed graph encoding for DOP reduction |
iterate: | for Double-DOP, whether to add fragments of fragments |
complement: | for Double-DOP, whether to include fragments which form the complement of the maximal recurring fragments extracted |
neverblockre: | do not prune nodes with label that match this regex |
estimates: | compute, store & use context-summary (outside) estimates |
beam_beta: | beam pruning factor, between 0 and 1; 1 to disable.
if enabled, new constituents must have a larger probability
than the probability of the best constituent in a cell multiplied by this
factor; i.e., a smaller value implies less pruning.
Suggested value: |
beam_delta: | if beam pruning is enabled, only apply it to spans up to this length. |
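The beam_beta condition can be expressed as a small predicate; the function below is an illustration of the stated rule, not disco-dop's implementation (which applies it per chart cell during parsing):

```python
def beam_allows(newprob, bestprob, beam_beta=1.0):
    """Sketch of the beam_beta rule: a new constituent survives only
    if its probability exceeds the cell's best probability times
    beam_beta; beam_beta=1 disables pruning."""
    if beam_beta == 1.0:  # pruning disabled
        return True
    return newprob > bestprob * beam_beta

# A smaller beam_beta lowers the threshold, so more candidates pass:
print(beam_allows(0.001, 0.5, beam_beta=0.01))   # False: 0.001 <= 0.5 * 0.01
print(beam_allows(0.001, 0.5, beam_beta=0.001))  # True: 0.001 > 0.5 * 0.001
```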
Other options¶
evalparam: | EVALB-style parameter file to use for reporting F-scores |
postagging: | To disable POS tagging and use the gold POS tags from the
test set, set this to |
punct: | one of ... |
functions: | one of ... |
morphology: | one of ... |
lemmas: | one of ... |
removeempty: | |
ensureroot: | Ensure every tree has a root node with this label |
transformations: | Apply specific treebank transforms; available presets: |
relationalrealizational: | apply RR-transform; see |
verbosity: | control the amount of output to console; a logfile |
numproc: | default 1; increase to use multiple CPUs; |
File formats¶
Treebanks¶
export¶
Negra export format (v3).
For example:
#BOS 0
is VB -- -- 500
John NP -- -- 0
rich JJ -- -- 500
? ? -- -- 0
#500 VP -- -- 0
#EOS 0
An optional lemma field is supported. Secondary edges are ignored. The preamble listing the tag sets is ignored and not reproduced when trees are written in this format.
This format is supported when input is read incrementally from
standard input with the treedraw
and treetransforms
commands.
Cf. http://www.coli.uni-saarland.de/projects/sfb378/negra-corpus/exformat3.ps
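As an illustration, the terminal lines of the example above can be collected with a few lines of Python; the function and tuple layout are this example's own, not disco-dop's treebank reader, and non-terminal (#500...) lines are skipped:

```python
def read_export(text):
    """Collect (word, tag, parent-id) triples per sentence from
    Negra export text; only terminal lines are kept."""
    sentences, current = [], []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == '#BOS':
            current = []
        elif fields[0] == '#EOS':
            sentences.append(current)
        elif not fields[0].startswith('#'):
            current.append((fields[0], fields[1], fields[-1]))
    return sentences

example = """#BOS 0
is VB -- -- 500
John NP -- -- 0
rich JJ -- -- 500
? ? -- -- 0
#500 VP -- -- 0
#EOS 0"""
print(read_export(example)[0][:2])  # [('is', 'VB', '500'), ('John', 'NP', '0')]
```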
bracket¶
Penn-treebank style bracketed trees, one tree per line.
For example:
(S (NP John) (VP (VB is) (JJ rich)) (. .))
This format is supported when input is read incrementally from
standard input with the treedraw
and treetransforms
commands.
Tree fragments can be represented by leaving out the children of any non-terminal, e.g.:
(VP (VB is) (JJ ))
(S (NP John) (VP ) (. .))
discbracket¶
A corpus format for discontinuous trees in bracket notation, where the leaves are prefixed with indices indicating word order. Each leaf must have at least one index; the indices form an unbroken range starting from 0, with each index occurring exactly once.
For example:
(S (VP (VB 0=is) (JJ 2=rich)) (NP 1=John) (? 3=?))
(sentence: is John rich ?)
Note that the leaves are not in the same order as in the sentence. The leaves must be sorted by the indices to restore the original sentence order. There is one parse tree per line. Compared to Negra’s export format, this format lacks separate fields for morphology, lemmas, and functional edges. On the other hand, it is close to the internal representation employed here, so it can be read efficiently.
This format is supported when input is read incrementally from
standard input with the treedraw
and treetransforms
commands.
Tree fragments can be formed as with bracket trees, by leaving out terminals or whole subtrees:
(S (VP (VB 0=is) (JJ 2=)) (NP 1=) (? 3=?))
(VP (VB 0=is) (JJ 2=rich))
There is an extra case that should be handled, which is how to represent a discontinuous frontier non-terminal. This requires expressing how the spans of the discontinuous node relate to the other spans in the tree:
(S (VP 0= 2=) (NP 1=) (? 3=?))
While the VP node does not dominate any terminals, if they were to be added, they would end up before and after the NP node.
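The sorting step can be sketched in Python; this illustrative helper handles only leaves with non-empty terminals, not the frontier cases just described:

```python
import re

def discbracket_sent(line):
    """Recover the sentence from a discbracket tree by sorting the
    'index=word' leaves on their indices."""
    leaves = re.findall(r'(\d+)=([^\s()]+)', line)
    return [word for _, word in sorted(leaves, key=lambda x: int(x[0]))]

tree = '(S (VP (VB 0=is) (JJ 2=rich)) (NP 1=John) (? 3=?))'
print(' '.join(discbracket_sent(tree)))  # is John rich ?
```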
alpino¶
Alpino XML format. One file per sentence. The hierarchical tree structure is mirrored in the XML structure, which makes it possible to query trees in this format with XPath (as opposed to TigerXML which maintains the tabular structure of the Negra export format).
dact¶
Alpino XML trees in an XML database as used by Dact. Cf. http://rug-compling.github.io/dact/
Read-only formats¶
tiger : | Tiger XML format. Cf. http://www.ims.uni-stuttgart.de/forschung/ressourcen/werkzeuge/TIGERSearch/doc/html/TigerXML.html |
Write-only formats¶
conll , mst : | unlabeled dependencies; relies on head identification. |
tokens : | one sentence per line, tokens separated by spaces. |
wordpos : | similar to tokens but of the form token/POS . |
Grammars¶
PCFG¶
PCFG grammars are stored in bitpar format. From the bitpar manual:
The grammar file contains one grammar rule per line. Each grammar rule starts with its frequency [...] followed by the parent category (symbol on the left-hand side) and the child categories (symbols on the right-hand side). The symbols are separated by whitespace. [...]
The lexicon file contains one lexicon entry per line. Each lexicon entry starts with the word [...] followed by a sequence of part-of-speech tag + frequency pairs. The POS tag is preceded by a tab character and followed by a blank or tab character.
PLCFRS¶
The PLCFRS format is as follows. Rules are delimited by newlines. Fields are separated by tabs. The fields are:
LHS RHS1 [RHS2] yield-function weight
The yield function defines how the spans of the RHS nonterminals are combined to form the spans of the LHS nonterminal. Components of the yield function are comma-separated; within a component, 0 refers to a component of the first RHS nonterminal, and 1 to a component of the second. Weights are expressed as rational fractions. The lexicon is defined in a separate file. Lines start with a single word, followed by pairs of possible tags and their probabilities:
WORD TAG1 PROB1 [TAG2 PROB2 ...]
Example, rules file:
S VP_2 NP 010 1/2
VP_2 VB NP 0,1 2/3
NP NN 0 1/4
lexicon file:
is VB 1/3
John NN 1/2
rich JJ 1/5
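As an illustration, these files can be read into Python data structures as follows; the reader functions are this example's own, not disco-dop's grammar loader:

```python
from fractions import Fraction

def read_rules(text):
    """Parse rule lines: LHS RHS1 [RHS2] yield-function weight.
    Real files are tab-separated; split() accepts any whitespace."""
    rules = []
    for line in text.strip().splitlines():
        fields = line.split()
        lhs, rhs = fields[0], tuple(fields[1:-2])
        yf, weight = fields[-2], Fraction(fields[-1])
        rules.append((lhs, rhs, yf, weight))
    return rules

def read_lexicon(text):
    """Parse lexicon lines: WORD TAG1 PROB1 [TAG2 PROB2 ...]."""
    lexicon = {}
    for line in text.strip().splitlines():
        word, *pairs = line.split()
        lexicon[word] = [(tag, Fraction(prob))
                         for tag, prob in zip(pairs[::2], pairs[1::2])]
    return lexicon

rules = read_rules("S VP_2 NP 010 1/2\nVP_2 VB NP 0,1 2/3\nNP NN 0 1/4")
lexicon = read_lexicon("is VB 1/3\nJohn NN 1/2\nrich JJ 1/5")
print(rules[0])       # ('S', ('VP_2', 'NP'), '010', Fraction(1, 2))
print(lexicon['is'])  # [('VB', Fraction(1, 3))]
```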
backtransform¶
Double-DOP grammars and other PTSGs employ a grammar in which internal nodes are removed from fragments to obtain a more compact grammar. Fragments are restored in derivations using a backtransform table with the original fragments for each grammar rule.
The backtransform file contains one fragment per line, with the lines
corresponding to the lines of the grammar rule file. Frontier non-terminals
are indicated as {0}
, {1}
, etc.
The fragments on which this backtransform is based are also saved, with a
filename of the form .fragments.gz
.
To view the grammar rules together with the corresponding fragments, issue the
following command:
$ paste <(zcat dop.rules.gz) <(zcat dop.fragments.gz)
A X Y 01 1 (A (X 0) (Y 1)) 1
A_2 X Z 0,1 1 (A_2 (X 0) (Z 2)) 2
RIGHT A_2 Y 010 1 (RIGHT (A_2 0 2) (Y 1)) 2
S S}<0> Z@z 01 2/5 (S (RIGHT (A_2 (X 0) (Z 2)) (Y 1))) x y z 2
S RIGHT 0 2/5 (S (RIGHT 0)) 2
S WRONG 0 1/5 (S (WRONG 0)) 1
WRONG A Z 01 1 (WRONG (A 0) (Z 1)) 1
S}<0> X@x Y@y 01 1
alternate weights¶
DOP grammars can contain multiple probability models. The alternate models are stored in a NumPy array:
$ python
>>> import numpy
>>> probs = numpy.load('dop.probs.npz')
>>> probs.keys()
['default', 'shortest', 'bon', 'ewe']
>>> probs['shortest'][:10]
array([ 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5])
In this case, we see the model for shortest derivation parsing, where every fragment is assigned a uniform weight of 0.5.
Miscellaneous¶
head assignment rules¶
This file specifies a set of heuristic rules to pick for every constituent one of its children as being the head of the constituent, based on syntactic categories.
The file is case insensitive. Lines starting with %
are treated as comments
and ignored. Each line specifies a rule of the form:
CAT direction child1 child2...
This rule specifies how a head child is assigned for a constituent labeled as CAT
.
The second argument specifies whether the children of the constituent should
be considered starting from the left or from the right (corresponding to whether
a category is head-first or head-final):
left: (or left-to-right) for each of the possible heads, try all children from left to right.
right: (or right-to-left) for each of the possible heads, try all children from right to left.
leftdis: go from left to right and try each possible head.
rightdis: go from right to left and try each possible head.
like: treat this label as if it were another label; e.g. ‘TOP like ROOT’.
There may be multiple rules for a category, for example if they go in opposite directions. The rules are applied in the order as they appear in the file.
The list of children may be empty; in that case the leftmost (or rightmost, in the second case) child will be chosen as head. If the list of possible children is non-empty, the children of the constituents are iterated over for each possible child, and the first matching child is picked as the head.
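The iteration order for the left and right directions can be sketched as follows; the rule table and function here are illustrative, not disco-dop's head finder, which reads rules from the file format described above:

```python
def find_head(label, children, headrules):
    """Pick a head child for a constituent labeled `label`.
    headrules maps a category to a list of (direction, labels) rules;
    only the 'left' and 'right' directions are sketched here."""
    for direction, labels in headrules.get(label, []):
        seq = children if direction == 'left' else children[::-1]
        if not labels:
            # empty list: take the leftmost (or rightmost) child
            return seq[0]
        for candidate in labels:   # outer loop: possible heads
            for child in seq:      # inner loop: children in order
                if child == candidate:
                    return child
    return children[-1]  # fallback if no rule matches

rules = {'VP': [('left', ['VB', 'VBD'])], 'NP': [('right', [])]}
print(find_head('VP', ['ADVP', 'VB', 'NP'], rules))  # VB
print(find_head('NP', ['DT', 'NN'], rules))          # NN
```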
evaluation parameters¶
The format of this file is a superset of the parameters for EVALB, cf. http://nlp.cs.nyu.edu/evalb/
The parameter file should be encoded in UTF-8 and supports the following options in addition to those supported by EVALB:
DELETE_ROOT_PRETERMS: if nonzero, ignore preterminals directly under the root in gold trees for scoring purposes.
DISC_ONLY: if nonzero, only consider discontinuous bracketings (affects precision, recall, f-measure, exact match).
LA: if nonzero, report leaf-ancestor scores [default: disabled].
TED: if nonzero, report tree-edit distance scores; disabled by default as these are slow to compute. NB: it is not clear whether this score is applicable to discontinuous trees.
DEBUG:
-1: only print summary table.
0: additionally, print category / tag breakdowns (default; after application of cutoff length).
1: give per-sentence results ('--verbose').
2: give detailed information for each sentence ('--debug').
MAX_ERROR: this value is ignored; no errors are tolerated. The parameter is accepted to support usage of unmodified EVALB parameter files.
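A sketch of reading such a parameter file in Python; which keys repeat and accumulate follows EVALB convention, and disco-dop's reader may differ:

```python
def read_evalparam(text):
    """Read 'KEY value' lines; '#' starts a comment. Some keys, such
    as DELETE_LABEL, may occur multiple times and accumulate."""
    params = {}
    for line in text.splitlines():
        line = line.split('#', 1)[0].strip()
        if not line:
            continue
        key, value = line.split(None, 1)
        if key in ('DELETE_LABEL', 'DELETE_LABEL_FOR_LENGTH', 'EQ_LABEL'):
            params.setdefault(key, []).append(value)
        else:
            params[key] = int(value)
    return params

params = read_evalparam("""
# example parameter file
DEBUG 0
CUTOFF_LEN 40
DELETE_LABEL TOP
DELETE_LABEL ,
DISC_ONLY 0
LA 1
""")
print(params['LA'])            # 1
print(params['DELETE_LABEL'])  # ['TOP', ',']
```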
parser parameters¶
API documentation¶
Discontinuous Data-Oriented Parsing (disco-dop).
Main components:
- A parser for Probabilistic Linear Context-Free Rewriting Systems (PLCFRS), as well as Probabilistic Context-Free Grammars (PCFG).
- Facilities to extract and parse with tree fragments using data-oriented parsing (DOP) grammars.
Python modules¶
cli |
Command-line interfaces to modules. |
demos |
Examples of various formalisms encoded in LCFRS grammars. |
eval |
Evaluation of (discontinuous) parse trees. |
fragments |
Extract recurring tree fragments from constituency treebanks. |
functiontags |
Function tags classifier. |
gen |
Generate random sentences with an LCFRS. |
grammar |
Assorted functions to read off grammars from treebanks. |
heads |
Functions related to finding the linguistic head of a constituent. |
lexicon |
Add rules to handle unknown words and smooth lexical probabilities. |
parser |
Parser object that performs coarse-to-fine parsing and postprocessing. |
punctuation |
Punctuation related functions. |
runexp |
Run an experiment given a parameter file: grammar extraction, parsing & evaluation. |
tree |
Various Tree objects for representing syntax or morphological trees. |
treebank |
Read and write treebanks. |
treebanktransforms |
Treebank transformations. |
treedist |
Tree edit distance implementations. |
treesearch |
Search through treebanks with queries. |
treetransforms |
Treebank-independent tree transformations. |
util |
Misc code to avoid cyclic imports. |
Cython modules¶
_fragments |
Fragment extraction with tree kernels. |
bit |
Functions for working with bitvectors. |
coarsetofine |
Project selected items from a chart to corresponding items in next grammar. |
containers |
Data types for chart items, edges, &c. |
disambiguation |
Disambiguate parse forests with various methods for parse selection. |
estimates |
Computation of outside estimates for best-first or A* parsing. |
kbest |
Extract the k-best derivations from a probabilistic parse forest. |
pcfg |
CKY parser for Probabilistic Context-Free Grammar (PCFG). |
plcfrs |
Parser for string-rewriting Linear Context-Free Rewriting Systems. |