Discontinuous Data-Oriented Parsing: disco-dop

[figure: a contrived discontinuous constituent, for expository purposes]

The aim of this project is to parse discontinuous constituents in natural language using Data-Oriented Parsing (DOP), with a focus on global world domination. The grammar is extracted from a treebank of sentences annotated with (discontinuous) phrase-structure trees. Concretely, this project provides a statistical constituency parser with support for discontinuous constituents and Data-Oriented Parsing. Discontinuous constituents are supported through the grammar formalism Linear Context-Free Rewriting System (LCFRS), which is a generalization of Probabilistic Context-Free Grammar (PCFG). Data-Oriented Parsing allows re-use of arbitrary-sized fragments from previously seen sentences using Tree-Substitution Grammar (TSG).
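To make this concrete: a constituent is discontinuous when the word positions it covers do not form a contiguous range. The following Python sketch (an illustration only, not part of disco-dop) checks this for a tree whose leaves are word indices; the tree mirrors the parse of "Why did the chicken cross the road ?" shown in the next section:

```python
# Sketch: detecting discontinuous constituents in a tree whose leaves
# are word indices (as in disco-dop's discbracket format).
# Nested tuples are (label, children...); leaves are integer indices.

def leaves(tree):
    """Collect the integer leaf indices covered by a (sub)tree."""
    if isinstance(tree, int):
        return {tree}
    _label, *children = tree
    result = set()
    for child in children:
        result |= leaves(child)
    return result

def discontinuous(tree, found=None):
    """Return labels of constituents whose leaf indices are not contiguous."""
    if found is None:
        found = []
    if isinstance(tree, int):
        return found
    label, *children = tree
    idx = leaves(tree)
    if max(idx) - min(idx) + 1 != len(idx):
        found.append(label)
    for child in children:
        discontinuous(child, found)
    return found

tree = ('ROOT', ('SBARQ',
        ('SQ',
            ('VP', ('WHADVP', ('WRB', 0)), ('VB', 4),
                   ('NP', ('DT', 5), ('NN', 6))),
            ('VBD', 1),
            ('NP', ('DT', 2), ('NN', 3))),
        ('.', 7)))

print(discontinuous(tree))  # prints ['VP']: it covers {0, 4, 5, 6}
```

A PCFG cannot produce such a VP, since each of its non-terminals derives a single contiguous span; an LCFRS non-terminal may span several contiguous blocks.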

Getting started

Parsing a single sentence

The parser can be used as an off-the-shelf parser using an existing grammar model, either one of the grammars used in publications for English, German, Dutch, and French, or a new model produced by discodop runexp or discodop grammar param (more on that below). To download and extract a grammar:

$ curl -Ssl https://staff.fnwi.uva.nl/a.w.vancranenburgh/grammars/en_ptb.tar | tar -xf -

To parse a single sentence:

$ echo 'Why did the chicken cross the road ?' | discodop parser en_ptb/
parsing 0: Why did the chicken cross the road ?
0.078917 s
(ROOT (SBARQ (SQ (VP (WHADVP (WRB 0=Why)) (VB 4=cross) (NP (DT 5=the) (NN 6=road))) (VBD 1=did) (NP (DT 2=the) (NN 3=chicken))) (. 7=?)))
average time per sentence 0.0789169999999999
unparsed sentences: 0
finished

We can add an extra command to visualize the tree:

$ echo 'Why did the chicken cross the road ?' | discodop parser en_ptb/ | discodop treedraw
[...]
1. (len=8):
                     ROOT
                      │
                    SBARQ
            ┌─────────┴────────────────────────┐
            SQ                                 │
        ┌───┴───┬─────┐                        │
        │       │     VP                       │
  ┌──── │ ───── │ ────┴──────┬────────┐        │
WHADVP  │       NP           │        NP       │
  │     │   ┌───┴─────┐      │    ┌───┴───┐    │
 WRB   VBD  DT        NN     VB   DT      NN   .
  │     │   │         │      │    │       │    │
 Why   did the     chicken cross the     road  ?

Parsing a text

Similarly, we can parse a whole document. The text needs to be tokenized, for example using ucto, the output of which can be passed to the parser:

$ ucto -L en -n "CONRAD, Joseph - Lord Jim.txt" | discodop parser en_ptb/

To parse sentences from a treebank in bracketed format:

$ discodop treetransforms treebankExample.mrg --inputfmt=bracket --outputfmt=tokens | discodop parser en_ptb/

Inducing a simple grammar

Suppose we have the following treebank in the file treebankExample.mrg:

(S (NP (DT The) (NN cat)) (VP (VBP saw) (NP (DT the) (JJ hungry) (NN dog))))
(S (NP (DT The) (NN cat)) (VP (VBP saw) (NP (DT the) (NN dog))))
(S (NP (DT The) (NN mouse)) (VP (VBP saw) (NP (DT the) (NN cat))))
(S (NP (DT The) (NN mouse)) (VP (VBP saw) (NP (DT the) (JJ yellow) (NN cat))))
(S (NP (DT The) (JJ little) (NN mouse)) (VP (VBP saw) (NP (DT the) (NN cat))))
(S (NP (DT The) (NN cat)) (VP (VBP ate) (NP (DT the) (NN dog))))
(S (NP (DT The) (NN mouse)) (VP (VBP ate) (NP (DT the) (NN cat))))
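Conceptually, inducing a treebank grammar amounts to counting the rewrite rules that occur in the trees. The following Python sketch (an illustration only, not disco-dop's implementation) reads off productions from a bracketed tree such as the ones above:

```python
# Sketch (not disco-dop code): reading off a treebank grammar from
# Penn-style bracketed trees.
from collections import Counter

def parse_bracket(s):
    """Parse one bracketed tree into nested (label, children) tuples."""
    tokens = s.replace('(', ' ( ').replace(')', ' ) ').split()
    def read(pos):
        assert tokens[pos] == '('
        label = tokens[pos + 1]
        pos += 2
        children = []
        while tokens[pos] != ')':
            if tokens[pos] == '(':
                child, pos = read(pos)
            else:
                child, pos = tokens[pos], pos + 1
            children.append(child)
        return (label, children), pos + 1
    tree, _ = read(0)
    return tree

def productions(tree, counts):
    """Count rewrite rules label -> child labels (words for preterminals)."""
    label, children = tree
    rhs = tuple(c[0] if isinstance(c, tuple) else c for c in children)
    counts[(label, rhs)] += 1
    for child in children:
        if isinstance(child, tuple):
            productions(child, counts)

counts = Counter()
productions(parse_bracket(
    '(S (NP (DT The) (NN cat)) (VP (VBP saw) (NP (DT the) (NN dog))))'),
    counts)
for (lhs, rhs), n in sorted(counts.items()):
    print(lhs, '->', ' '.join(rhs), n)
```

Dividing each count by the total count of its left-hand side gives the relative-frequency estimate of a PCFG.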

You can view the trees using the following command:

$ discodop treedraw < treebankExample.mrg | less -R

Before inducing a grammar, we need to binarize the treebank:

$ discodop treetransforms --binarize --fmt=bracket treebankExample.mrg /tmp/bintrees
transformed 7 trees with action 'binarize'
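Binarization replaces productions with more than two children by a chain of binary productions with artificial labels of the form parent|&lt;remaining children&gt;. A Python sketch of right-factored binarization (an illustration only; disco-dop's binarization offers many more options, such as markovization and head rules):

```python
# Sketch (not disco-dop's implementation): right-factored binarization.
# A node with more than two children is split into a chain of artificial
# nodes labeled parent|<remaining children>, matching labels such as
# NP|<JJ,NN> that appear in the parser output below.

def binarize(label, children):
    """Right-factor a production into a list of binary rules."""
    rules = []
    while len(children) > 2:
        head, *rest = children
        new = '%s|<%s>' % (label, ','.join(rest))
        rules.append((label, (head, new)))
        label, children = new, rest
    rules.append((label, tuple(children)))
    return rules

for lhs, rhs in binarize('NP', ['DT', 'JJ', 'NN']):
    print(lhs, '->', ' '.join(rhs))
# prints:
# NP -> DT NP|<JJ,NN>
# NP|<JJ,NN> -> JJ NN
```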

We can then induce a Double-DOP grammar with the following command:

$ discodop grammar doubledop --inputfmt=bracket /tmp/bintrees /tmp/example
extracting recurring fragments
[...]
found 34 fragments
wrote backtransform to /tmp/example.backtransform
wrote grammar to /tmp/example.rules and /tmp/example.lex.
[...]

This produces a grammar consisting of a series of text files in /tmp, which can be used as follows:

$ echo 'The cat saw the hungry dog' | discodop parser --simple /tmp/example.rules /tmp/example.lex --bt=/tmp/example.backtransform -s S -b 5
parsing 0: The cat saw the hungry dog
(S (NP (DT 0) (NN 1)) (VP (VBP 2) (NP (DT 3) (NP|<JJ,NN> (JJ 4) (NN 5)))))
0.0 s
[...]

Note that the -s option should specify the root label of the trees in the treebank.

Run an experiment

When working with large treebanks, it may be necessary to apply pruning, and it is useful to specify all the preprocessing steps and other configuration for an experiment in a single file. This can be done with the discodop runexp command.

To run an end-to-end experiment from grammar extraction to evaluation on a test set, make a copy of the file sample.prm and edit its parameters. For example:

stages=[
    dict(name='pcfg', mode='pcfg',
        split=True, markorigin=True),
    dict(name='plcfrs', mode='plcfrs',
        prune='pcfg', splitprune=True, k=1000),
    dict(name='dop', mode='plcfrs',
        prune='plcfrs', k=50, m=1000,
        dop='doubledop',
        estimator='rfe', objective='mpp')
],
traincorpus=dict(
    path='alpinosample.export', encoding='utf-8',
    numsents=3, maxwords=100),
testcorpus=dict(
    path='alpinosample.export', encoding='utf-8',
    numsents=3, maxwords=100, skiptrain=False),
postagging=dict(
    method='unknownword', model='4',
    unknownthreshold=1, openclassthreshold=50,
    simplelexsmooth=True),
binarization=dict(
    method='default', factor='right',
    h=1, v=1),

See the documentation on the available parameters. This parameter file can be invoked by executing:

$ discodop runexp filename.prm

This will create a new directory with the base name of the parameter file, i.e., filename/ in this case. This directory must not exist yet, to avoid accidentally overwriting previous results. The directory will contain the grammar rules and lexicon in a text format, as well as the parsing results and the gold standard file in Negra’s export format.

[screenshot: runexp showing a parse tree]

Note that there is an option to utilize multiple processor cores by launching a specified number of processes. This greatly speeds up parsing, but for a nontrivial DOP grammar, each process may require anywhere from 4GB to 16GB of memory.

Corpora can be read in Negra’s export format, or in the bracketed Penn treebank format. Access to the Negra corpus can be requested for non-commercial purposes, while the Tiger corpus is freely available for download for research purposes.

Alternatively, it is possible to achieve similar results with separate commands:

$ discodop grammar param filename.prm filename/
$ discodop treetransforms alpinosample.export sentences.txt --inputfmt=export --outputfmt=tokens
$ discodop parser filename/ sentences.txt --fmt=export >filename/dop.export
$ discodop eval alpinosample.export filename/dop.export proper.prm

Overview

Command-line tools

Aside from the parser there are some standalone tools, invoked as discodop <cmd>:

fragments

Finds recurring or common fragments in one or more treebanks. It can be used with discontinuous as well as Penn-style bracketed treebanks. Example:

$ discodop fragments wsj-02-21.mrg > wsjfragments.txt

Specify the option --numproc n to use multiple processes, as with runexp.

eval

Discontinuous evaluation. Reports F-scores and other metrics. Accepts EVALB parameter files:

$ discodop eval sample/gold.export sample/dop.export proper.prm

treetransforms
A command line interface to perform transformations on treebanks such as binarization.
grammar
A command line interface to read off grammars from (binarized) treebanks.
treedraw

Visualize (discontinuous) trees. Command-line interface:

$ discodop treedraw < negra-corpus.export | less -RS

treesearch
Search through treebanks with queries.
parser
A basic command line interface to the parser comparable to bitpar. Reads grammars from text files.
demos
Contains examples of various formalisms encoded in LCFRS grammars.
gen
An experiment in generation with LCFRS.

For instructions, pass the --help option to a command or the links above.

Web interfaces

There are three web based tools in the web/ directory. These require some extra libraries such as Flask and pandas to be installed:

pip install --user -r web/requirements.txt
parse.py
A web interface to the parser. Expects a series of grammars in subdirectories of web/grammars/, each containing grammar files as produced by running discodop runexp. Download grammars for English, German, Dutch, and French, as used in a forthcoming paper.
treesearch.py
A web interface for searching through treebanks. Expects one or more treebanks in the directory web/corpus/ (sample included). See the treesearch manual for instructions on corpus preparation. To enable basic authentication, put a text file web/treesearchpasswd.txt with lines of the form username: password (see the example file; only use over HTTPS).
treedraw.py
A web interface for drawing discontinuous trees in various formats.
[screenshots: treesearch showing counts of a pattern; treesearch showing a bar plot of a pattern]

Reference

Command line options

eval

Evaluation of (discontinuous) parse trees, following EVALB as much as possible.

Usage: discodop eval <gold> <parses> [param] [options]

where gold and parses are files with parse trees, param is an EVALB parameter file.

Options
--cutofflen=n Overrides the sentence length cutoff of the parameter file.
--verbose Print table with per sentence information.
--debug Print debug information with per sentence bracketings etc.
--disconly Only evaluate discontinuous bracketings (affects bracketing scores: precision, recall, f-measure, exact match).
--goldfmt, --parsesfmt=<export|bracket|discbracket|tiger|alpino|dact>
 Specify corpus format [default: export].
--fmt=<...> Shorthand for setting both --goldfmt and --parsesfmt.
--goldenc, --parsesenc=<utf-8|iso-8859-1|...>
 Specify encoding [default: utf-8].
--la Enable leaf-ancestor evaluation.
--ted Enable tree-edit distance evaluation. NB: it is not clear whether this score is applicable to discontinuous trees.
--headrules=x Specify file with rules for head assignment of constituents that do not already have a child marked as head; this enables dependency evaluation.
--functions=x ‘remove’=default: strip functions off labels, ‘leave’: leave syntactic labels as is, ‘add’: evaluate both syntactic categories and functions, ‘replace’: only evaluate grammatical functions.
--morphology=x ‘no’=default: only evaluate POS tags, ‘add’: concatenate morphology tags to POS tags, ‘replace’: replace POS tags with morphology tags, ‘between’: add morphological node between POS tag and word.
Function tags

If the parses file contains function tags, these are evaluated with the non-null metric of Blaheta & Charniak (2000), which scores function tags of correctly parsed bracketings. Multiple tags on a constituent are scored separately. We additionally consider function tags on preterminals; this does not change the evaluation for the Penn treebank as it does not contain function tags on preterminals. A more stringent metric is to combine phrasal & function tags with the option --functions=add, which incorporates function tags in the bracketing scores.
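The bracketing scores can be illustrated as follows: a bracketing is a (label, yield) pair, where for discontinuous trees the yield is the full set of covered word indices rather than a contiguous span. A Python sketch (an illustration only, not disco-dop's evaluator) with trees as nested tuples and integer leaf indices:

```python
# Sketch (not disco-dop's evaluator): labeled bracketing scores in the
# EVALB sense.  A bracketing is a (label, yield) pair; for discontinuous
# trees the yield is the set of covered indices, not just a range.

def brackets(tree):
    """Collect (label, frozenset of leaf indices) for non-terminal nodes."""
    if isinstance(tree, int):
        return set(), {tree}
    label, *children = tree
    result, idx = set(), set()
    for child in children:
        sub, subidx = brackets(child)
        result |= sub
        idx |= subidx
    if not isinstance(children[0], int):  # skip preterminals (POS tags)
        result.add((label, frozenset(idx)))
    return result, idx

def scores(gold, parse):
    """Precision, recall, and F-measure over labeled bracketings."""
    g, _ = brackets(gold)
    p, _ = brackets(parse)
    match = len(g & p)
    prec, rec = match / len(p), match / len(g)
    return prec, rec, 2 * prec * rec / (prec + rec)

gold = ('S', ('NP', ('DT', 0), ('NN', 1)),
             ('VP', ('VBP', 2), ('NP', ('DT', 3), ('NN', 4))))
# parse missed the inner NP:
parse = ('S', ('NP', ('DT', 0), ('NN', 1)),
              ('VP', ('VBP', 2), ('DT', 3), ('NN', 4)))
print(scores(gold, parse))
```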

Parameter file

See the reference documentation on evaluation parameter files.

The commonly used parameter file for Penn-treebank parsing is COLLINS.prm, distributed as part of EVALB. The file proper.prm in the code repository is a version adapted for discontinuous parsing.

Examples

Discontinuous parsing:

$ discodop eval negra-parses.export negra-dev.export proper.prm

Continuous parsing:

$ discodop eval wsj-24.mrg parses.mrg COLLINS.prm --fmt=bracket

fragments

Extract recurring tree fragments from constituency treebanks.

Usage: discodop fragments <treebank1> [treebank2] [options]
or: discodop fragments --batch=<dir> <treebank1> <treebank2>... [options]

If only one treebank is given, extract fragments in common between all pairs of its trees. If two treebanks are given, extract fragments in common between the trees of the first & second treebank. Input is in Penn treebank format (S-expressions), one tree per line. Output contains lines of the form “tree<TAB>frequency”. Frequencies refer to the first treebank by default. Output is sent to stdout; to save the results, redirect to a file.
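The idea can be sketched as follows (a simplification; disco-dop uses an efficient tree-kernel method and also handles discontinuous trees): starting from two nodes with the same production, the common fragment extends downward wherever the children's productions also coincide, and children where the trees diverge remain as frontier non-terminals:

```python
# Sketch of the idea behind fragment extraction (not the actual
# algorithm): the largest fragment shared by two trees, with frontier
# non-terminals such as (NP ) where the trees diverge.

def prod(tree):
    """The production at the root: label plus the child labels."""
    label, *children = tree
    return label, tuple(c[0] if isinstance(c, tuple) else None
                        for c in children)

def common(a, b):
    """Largest fragment shared by two nodes, as a bracketed string."""
    if isinstance(a, str):              # aligned terminal words
        return a if a == b else ''
    if prod(a) != prod(b):              # trees diverge below this node:
        return '(%s )' % a[0]           # frontier non-terminal
    parts = [common(x, y) for x, y in zip(a[1:], b[1:])]
    return '(%s %s)' % (a[0], ' '.join(parts))

t1 = ('S', ('NP', ('DT', 'The'), ('NN', 'cat')),
           ('VP', ('VBP', 'saw'),
                  ('NP', ('DT', 'the'), ('JJ', 'hungry'), ('NN', 'dog'))))
t2 = ('S', ('NP', ('DT', 'The'), ('NN', 'cat')),
           ('VP', ('VBP', 'saw'),
                  ('NP', ('DT', 'the'), ('NN', 'dog'))))
print(common(t1, t2))
# prints: (S (NP (DT The) (NN cat)) (VP (VBP saw) (NP )))
```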

Options:
--fmt=<export|bracket|discbracket|tiger|alpino|dact>
 when format is not bracket, work with discontinuous trees; output is in discbracket format: tree<TAB>sentence<TAB>frequency where tree has indices as leaves, referring to elements of sentence, a space separated list of words.
--numtrees=n only read first n trees from first treebank
--encoding=x specify treebank encoding, e.g. utf-8 [default], iso-8859-1, etc.
-o file Write output to file instead of stdout.
--complete treebank1 is a list of fragments (needle), result is the indices / counts of these fragments in treebank2 (haystack).
--batch=dir enable batch mode; any number of treebanks > 1 can be given; first treebank (A) will be compared to each (B) of the rest. Results are written to filenames of the form dir/A_B. Counts/indices are from B.
--indices report sets of 0-based indices where fragments occur instead of frequencies.
--relfreq report relative frequencies wrt. root node of fragments of the form n/m.
--approx report counts of occurrence as maximal fragment (lower bound)
--nofreq do not report frequencies.
--cover=<n[,m]>
 include all non-maximal/non-recurring fragments up to depth n of first treebank; optionally, limit number of substitution sites to m (default is unlimited).
--twoterms=x only extract fragments with at least two lexical terminals, one of which has a POS tag which matches the given regex. For example, to match POS tags of content words in the Penn treebank: ^(?:NN(?:[PS]|PS)?|(?:JJ|RB)[RS]?|VB[DGNPZ])$
--adjacent only compare pairs of adjacent trees (i.e., sent no. n, n + 1).
--debin debinarize fragments. Since fragments may contain incomplete binarized constituents, the result may still contain artificial nodes from the binarization in the root or frontier non-terminals of the fragments.
--alt alternative output format: (NP (DT "a") NN) default: (NP (DT a) (NN ))
--numproc=n use n independent processes, to enable multi-core usage (default: 1); use 0 to detect the number of CPUs.
--debug extra debug information, ignored when numproc > 1.
--quiet disable all messages.

gen

Generate random sentences with a PLCFRS or PCFG. Reads grammar from a text file in PLCFRS or bitpar format.

Usage: discodop gen [--verbose] <rules> <lexicon>
or: discodop gen --test

Grammar is assumed to be in utf-8; may be gzip’ed (.gz extension).
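The generation procedure can be sketched as follows (with a toy inline grammar rather than disco-dop's grammar file formats): repeatedly expand each non-terminal by sampling a rule according to its probability:

```python
# Sketch (toy inline grammar; not disco-dop's file formats): generating
# a random sentence from a PCFG by recursively expanding non-terminals,
# choosing among rules according to their probabilities.
import random

rules = {   # lhs -> list of (rhs, probability)
    'S':   [(('NP', 'VP'), 1.0)],
    'NP':  [(('DT', 'NN'), 1.0)],
    'VP':  [(('VBP', 'NP'), 1.0)],
    'DT':  [(('the',), 1.0)],
    'NN':  [(('cat',), 0.5), (('dog',), 0.5)],
    'VBP': [(('saw',), 1.0)],
}

def generate(symbol, rng):
    """Expand symbol into a list of words by sampling rules."""
    if symbol not in rules:      # terminal
        return [symbol]
    rhss, probs = zip(*rules[symbol])
    rhs = rng.choices(rhss, probs)[0]
    return [word for sym in rhs for word in generate(sym, rng)]

print(' '.join(generate('S', random.Random(1))))
```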

grammar

Read off grammars from treebanks.

Usage, one of:

discodop grammar param <parameter-file> <output-directory>
discodop grammar <type> <input> <output> [options]
discodop grammar info <rules-file>
discodop grammar merge (rules|lexicon|fragments) <input1> <input2>... <output>

The first format extracts a grammar according to a parameter file. See the documentation on parameter files.

The second format is for extracting simple grammars (e.g., no unknown word handling or coarse-to-fine parsing).

type is one of:

pcfg:Probabilistic Context-Free Grammar (treebank grammar).
plcfrs:Probabilistic Linear Context-Free Rewriting System (discontinuous treebank grammar).
ptsg:Probabilistic Tree-Substitution Grammar.
dopreduction:All-fragments PTSG using Goodman’s reduction.
doubledop:PTSG from recurring fragments.
dop1:PTSG from all fragments up to given depth.

input is a binarized treebank, or in the ptsg case, weighted fragments in the same format as the output of the discodop fragments command; input may contain discontinuous constituents, except for the pcfg case. output is the base name for the filenames to write the grammar to; the filenames will be <output>.rules and <output>.lex.

Other subcommands:

info:Print statistics for PLCFRS/bitpar grammar rules.
merge:Interpolate given sorted grammars into a single grammar. Input can be a rules, lexicon or fragment file.

NB: both the info and merge commands expect grammars to be sorted by LHS, such as the ones created by this tool.

Options
--inputfmt=<export|bracket|discbracket|tiger|alpino|dact>
 The treebank format [default: export].
--inputenc=<utf-8|iso-8859-1|...>
 Treebank encoding [default: utf-8].
--numproc=<1|2|...>
 Number of processes to start [default: 1]. Only relevant for Double-DOP fragment extraction.
--gzip compress output with gzip, view with zless &c.
--packed use packed graph encoding for DOP reduction.
--bitpar produce an unbinarized grammar for use with bitpar.
-s X start symbol to use for PTSG.
--dopestimator=<rfe|ewe|shortest|...>
 The DOP estimator to use with dopreduction/doubledop [default: rfe].
--maxdepth=N, --maxfrontier=N
 When extracting a ‘dop1’ grammar, the limit on what fragments are extracted; 3 or 4 is a reasonable depth limit.
Grammar formats

When a PCFG is requested, or the input format is bracket (Penn format), the output will be in bitpar format. Otherwise the grammar is written as a PLCFRS. The encoding of the input treebank may be specified. Output encoding will be ASCII for the rules, and UTF-8 for the lexicon.

See the documentation on grammar formats.

Examples

Extract a Double-DOP grammar given binarized trees:

$ discodop grammar doubledop --inputfmt=bracket /tmp/bintrees /tmp/example

Extract grammars specified in a parameter file:

$ discodop grammar param filename.prm /tmp/example

parser

A command line interface for parsing sentences with an existing grammar.

Usage: discodop parser [options] <grammar/> [input files]
or: discodop parser --simple [options] <rules> <lexicon> [input [output]]

grammar/ is a directory with a model produced by discodop runexp. When no filename is given, input is read from standard input and the results are written to standard output. Input should contain one sentence per line with space-delimited tokens. Output consists of bracketed trees in the selected format. Files must be encoded in UTF-8.

General options
-x Input is one token per line, sentences separated by two newlines (like bitpar).
-b k Return the k-best parses instead of just 1.
--prob Print probabilities as well as parse trees.
--tags Tokens are of the form word/POS; give both to parser.
--sentid Sentences in input are prefixed by IDs, delimited by |.
--fmt=<export|bracket|discbracket|alpino|conll|mst|wordpos>
 Format of output [default: discbracket].
--numproc=k Launch k processes, to exploit multiple cores.
--verbosity=x 0 <= x <= 4. Same effect as verbosity in parameter file.
--simple Parse with a single grammar and input file; similar interface to bitpar. The files rules and lexicon define a binarized grammar in bitpar or PLCFRS format.
Options for simple mode
-s x Use x as start symbol instead of default TOP.
--bt=file Apply backtransform table to recover TSG derivations.
--mpp=k By default, the output consists of derivations, with the most probable derivation (MPD) ranked highest. With a PTSG such as DOP, it is possible to aim for the most probable parse (MPP) instead; its probability is the sum of the probabilities of the derivations (among the k best) that produce it.
--obj=<mpd|mpp|mcc|shortest|sl-dop>
 Objective function to maximize [default: mpd].
-m x Use x derivations to approximate objective functions; mpd and shortest require only 1.
--bitpar Use bitpar to parse with an unbinarized grammar.
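The difference between the mpd and mpp objectives can be illustrated with made-up numbers: with a tree-substitution grammar, several derivations may produce the same tree, and MPP sums their probabilities per tree before picking the best, so a tree without a single high-probability derivation can still win:

```python
# Sketch: most probable derivation (MPD) vs. most probable parse (MPP).
# The derivation probabilities below are made up for illustration.
from collections import defaultdict

derivations = [   # (tree produced, probability) for the k-best derivations
    ('tree A', 0.20),
    ('tree B', 0.15), ('tree B', 0.12), ('tree B', 0.10),
]

# MPD: the single most probable derivation wins.
mpd = max(derivations, key=lambda d: d[1])[0]

# MPP: sum derivation probabilities per tree, then take the maximum.
totals = defaultdict(float)
for tree, p in derivations:
    totals[tree] += p
mpp = max(totals, key=totals.get)

print(mpd, mpp)  # tree A wins on MPD; tree B (0.37 total) wins on MPP
```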
Examples

To parse a single sentence:

$ echo 'Why did the chicken cross the road ?' | discodop parser en_ptb/

Tokenize and parse a text:

$ ucto -L en -n "CONRAD, Joseph - Lord Jim.txt" | discodop parser en_ptb/

Parse sentences from a treebank in bracketed format:

$ discodop treetransforms treebankExample.mrg --inputfmt=bracket --outputfmt=tokens | discodop parser en_ptb/

runexp

Run an experiment given a parameter file. Does grammar extraction, parsing, and evaluation.

Usage: discodop runexp <parameter file> [--rerun]

If a parameter file is given, an experiment is run. Given the parameter file sample.prm, a new directory will be created with the base name of the parameter file, i.e., sample/ in this case. This directory must not exist yet, to avoid accidentally overwriting previous results. To this directory the grammar rules and lexicon will be written in a text format, as well as the parsing results and the gold standard parse trees in the same format.

To repeat an experiment with an existing grammar, pass the option --rerun. The directory with the name of the parameter file without extension must exist in the current path; its results will be overwritten.

Parameter file and example invocation

See the reference documentation on parameter files. A minimal parameter file:

stages=[
  dict(
    name='pcfg',  # an identifier, used as filename when writing results
    mode='pcfg',  # use the PCFG CKY parser
  ),
],

evalparam='proper.prm',  # EVALB-style parameter file
# train / test sets
corpusfmt='bracket',  # choices: export, bracket, discbracket, alpino, tiger
traincorpus=dict(
    path='ptb-02-21.mrg',
    maxwords=100,  # max number of words for sentences in train corpus
),
testcorpus=dict(
    path='ptb-24.mrg',
    maxwords=100,  # max number of words for sentences in test corpus
),

See sample.prm in the code repository for a more extensive example. The file proper.prm can also be found there, which is a version of the COLLINS.prm file typically used with EVALB, adapted for discontinuous parsing. Ensure that all referenced files are in the current directory or specified with a path, and run as:

$ discodop runexp sample.prm
Parsing statistics

After running discodop runexp, a number of additional files are produced with parsing statistics:

output.log:

a log file with all messages displayed during parsing. This file contains ANSI codes for colors, so view it with less -R in a terminal, or remove them: sed "s,\x1B\[[0-9;]*[a-zA-Z],,g" output.log | less.

pcdist.txt:

shows the distribution of parsing complexity (cf. Gildea, NAACL 2010 for the definition) among the grammar rules.

stats.tsv:

is a tab-separated file with additional information. For each tuple of sentid, len, stage, the following columns are given:

elapsedtime:CPU time to complete a given stage (not including any required preceding stages)
logprob:the log probability of the selected parse tree.
frags:the number of fragments in the best derivation of the selected parse tree. If this grammar did not use tree fragments, this number will equal the number of non-terminal nodes in the tree.
numitems:the total number of items (label, span) in the chart of this stage.
golditems:if this stage is pruned by the previous stage, the number of items from the gold tree that remain after pruning. if this stage is not pruned, the number of gold items in the chart.
totalgolditems:the number of items in the gold tree, for the binarized tree; discontinuities in the tree are split to make the number of items comparable between discontinuous and split trees.

treedraw

Visualize parse trees with ASCII art or LaTeX/SVG.

Usage: discodop treedraw [<treebank>...] [options]

--fmt=<export|bracket|discbracket|tiger|alpino|dact>
 Specify input format [default: export].
--encoding=enc Specify input encoding [default: utf-8].
--functions=x
‘leave’:leave syntactic labels as is [default],
‘remove’:strip functions off labels,
‘add’:show both syntactic categories and functions,
‘replace’:only show grammatical functions.
--morphology=x
‘no’:only show POS tags [default],
‘add’:concatenate morphology tags to POS tags,
‘replace’:replace POS tags with morphology tags,
‘between’:add morphological node between POS tag and word.
--abbr abbreviate labels longer than 5 characters.
--plain disable ANSI/HTML colors.
--frontier only show terminal and non-terminal leaves.
--output=x
‘text’:(default) output in ASCII/ANSI art format.
‘html’:similar to ‘text’, but wrap output in HTML.
‘svg’:SVG wrapped in HTML.
‘tikznode’:generate LaTeX code using TiKZ.
‘tikzmatrix’:generate LaTeX code using TiKZ.
‘tikzqtree’:generate LaTeX code using TiKZ-qtree, only applicable to continuous trees.
-n, --numtrees=x
 only display the first x trees from the input.

If no treebank is given, input is read from standard input and the format is detected automatically; this has the advantage that the input is read incrementally. If more than one treebank is specified, trees will be displayed in parallel. The ANSI colors can be viewed in a terminal with less -RS.

Examples

View trees in a treebank:

$ discodop treedraw < wsj-02-21.mrg | less -RS

Apply a transformation and view the result:

$ discodop treetransforms --binarize --fmt=bracket wsj-02-21.mrg | discodop treedraw | less -RS

treesearch

Search through treebanks with queries.

Usage: discodop treesearch [--engine=X] [-t|-s|-c] <query> <treebank>...

Options:

-e X, --engine=X
 Select query engine; possible options:

frag:tree fragment queries (default); queries and files are in bracket, discbracket, or export format.
regex:search through tokenized sentences with Python regexps.
tgrep2:tgrep2 queries; files are bracket corpora (optionally precompiled into tgrep2 format).
xpath:arbitrary xpath queries; files are dact XML corpora.
-c, --counts Report counts; multiple queries can be given.
-s, --sents Output sentences (default); multiple queries can be given.
-t, --trees Output visualizations of trees.
-b, --brackets Output raw trees in the original corpus format.
--indices Report the sentence numbers of matches for each corpus file.
--breakdown Report counts of types that match query.
--csv Report counts in CSV format instead of the default flat format.
-f, --file Read queries (one per line) from filename given as first argument.
--slice=<N:M> Only search in sentences N to M of each file; either N or M may be left out; slice indexing is 1-based and inclusive.
-m N, --max-count=N
 Stop after finding N matches; 0 for no limit.
-n, --line-number
 Prefix each line of output with the sentence number within its input file.
-o, --only-matching
 Only output the matching portion with --sents, --trees, and --brackets.
-i, --ignore-case
 Ignore case in regex queries.
-M X, --macros=X
 A file with macros.
--numproc=N Use N independent processes, to enable multi-core usage (default: use all detected cores).
Tree fragments

Search for literal matches of tree fragments, i.e., one or more connected grammar productions. Supports (binarized, optionally discontinuous) treebanks with the extensions .mrg (bracket format), .dbr (discbracket format), and .export (export format).

regular bracket trees:

(S (NP Mary) (VP (VB is) (JJ rich)) (. .))
(S (NP ) (VP (VB is) (JJ )) (. .))

discontinuous trees:

(S (VP (VB 0=is) (JJ 2=)) (NP 1=) (? 3=?))
(VP (VB 0=is) (JJ 2=rich))

More information on the format of fragments: file format documentation
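To illustrate the notation (a sketch, not disco-dop's actual reader): the words of such a tree can be recovered in sentence order by collecting the index=word pairs:

```python
# Sketch (not disco-dop's reader): recovering the word order from a
# discontinuous tree whose leaves are "index=word" pairs, as in the
# discbracket examples above.
import re

def words(disctree):
    """Return the words of a discontinuous tree in sentence order."""
    pairs = re.findall(r'(\d+)=(\S*?)[\s)]', disctree)
    return [word for _idx, word in
            sorted(pairs, key=lambda p: int(p[0]))]

print(words('(S (VP (VB 0=is) (JJ 2=rich)) (NP 1=Mary) (? 3=?))'))
# prints: ['is', 'Mary', 'rich', '?']
```

Note how the discontinuous VP contributes words 0 and 2, with the NP's word in between.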

This query engine only works on binarized trees, and each node only has a single label that is matched on an all-or-nothing basis. Treebanks will be automatically binarized if they are not already binarized; treebanks may be discontinuous. It is useful to perform the binarization in advance, so that specific options and tree transformations can be applied; e.g., to handle punctuation, and include functional or morphological tags.

A cached copy of the treebank is created in an indexed format; given filename.mrg, this indexed version is stored as filename.mrg.ct (in the same directory). Another file, treesearchvocab.idx, contains a global index of productions; this index should automatically be recreated when the list of files changes or any file is updated. For the treesearch web interface, these indexed files need to be created in advance. This can be done by running a dummy query on a set of files:

$ discodop treesearch -e frag '(prepare corpus)' *.dbr
Regular expressions

In contrast with the other engines, regular expressions treat the input as plain text files, one sentence per line. The options --trees and --brackets are not applicable. The syntax is Python’s re syntax.
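Conceptually, the regex engine does something like the following (a Python sketch with made-up sentences, not disco-dop code): compile the query and report the sentences, i.e., lines, that match:

```python
# Sketch: what the regex engine does conceptually, on a made-up
# tokenized corpus with one sentence per line.
import re

sentences = [
    'The cat saw the hungry dog',
    'The mouse ate the cat',
    'Why did the chicken cross the road ?',
]
pattern = re.compile(r'\b(cat|mouse) (saw|ate)\b')
matches = [(n, sent) for n, sent in enumerate(sentences, 1)
           if pattern.search(sent)]
print(matches)   # sentences 1 and 2 match; sentence 3 does not
```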

Regular expressions can contain both special and ordinary characters. Most ordinary characters, like “A”, “a”, or “0”, are the simplest regular expressions; they simply match themselves.

The special characters are:

"."      Matches any character except a newline.
"^"      Matches the start of the string.
"$"      Matches the end of the string or just before the newline at
         the end of the string.
"*"      Matches 0 or more (greedy) repetitions of the preceding RE.
         Greedy means that it will match as many repetitions as possible.
"+"      Matches 1 or more (greedy) repetitions of the preceding RE.
"?"      Matches 0 or 1 (greedy) of the preceding RE.
*?,+?,?? Non-greedy versions of the previous three special characters.
{m,n}    Matches from m to n repetitions of the preceding RE.
{m,n}?   Non-greedy version of the above.
"\\"     Either escapes special characters or signals a special sequence.
[]       Indicates a set of characters.
         A "^" as the first character indicates a complementing set.
"|"      A|B, creates an RE that will match either A or B.
(...)    Matches the RE inside the parentheses.
         The contents can be retrieved or matched later in the string.
(?:...)  Non-grouping version of regular parentheses.
(?i)     Perform case-insensitive matching.

The special sequences consist of “\” and a character from the list below. If the ordinary character is not on the list, then the resulting RE will match the second character:

\A       Matches only at the start of the string.
\Z       Matches only at the end of the string.
\b       Matches the empty string, but only at the start or end of a word.
\B       Matches the empty string, but not at the start or end of a word.
\d       Matches any decimal digit.
\D       Matches any non-digit character.
\s       Matches any whitespace character.
\S       Matches any non-whitespace character.
\w       Matches any alphanumeric character.
\W       Matches the complement of \w.
\\       Matches a literal backslash.

More information: https://docs.python.org/3/library/re.html#regular-expression-syntax

This query engine creates a cached index of line numbers in all files treesearchline.idx; this index should automatically be recreated when the list of files changes or any file is updated.

TGrep2 syntax overview

Only treebanks in bracket format are supported, but trees can be n-ary. Note that the tgrep2 command needs to be installed. A version with small improvements is available from https://github.com/andreasvc/tgrep2

TGrep2 operators:

A < B       A is the parent of (immediately dominates) B.
A > B       A is the child of B.
A <N B      B is the Nth child of A (the first child is <1).
A >N B      A is the Nth child of B (the first child is >1).
A <, B      Synonymous with A <1 B.
A >, B      Synonymous with A >1 B.
A <-N B     B is the Nth-to-last child of A (the last child is <-1).
A >-N B     A is the Nth-to-last child of B (the last child is >-1).
A <- B      B is the last child of A (synonymous with A <-1 B).
A >- B      A is the last child of B (synonymous with A >-1 B).
A <` B      B is the last child of A (also synonymous with A <-1 B).
A >` B      A is the last child of B (also synonymous with A >-1 B).
A <: B      B is the only child of A.
A >: B      A is the only child of B.
A << B      A dominates B (A is an ancestor of B).
A >> B      A is dominated by B (A is a descendant of B).
A <<, B     B is a left-most descendant of A.
A >>, B     A is a left-most descendant of B.
A <<` B     B is a right-most descendant of A.
A >>` B     A is a right-most descendant of B.
A <<: B     There is a single path of descent from A and B is on it.
A >>: B     There is a single path of descent from B and A is on it.
A . B       A immediately precedes B.
A , B       A immediately follows B.
A .. B      A precedes B.
A ,, B      A follows B.
A $ B       A is a sister of B (and A != B).
A $. B      A is a sister of and immediately precedes B.
A $, B      A is a sister of and immediately follows B.
A $.. B     A is a sister of and precedes B.
A $,, B     A is a sister of and follows B.
A = B       A is also matched by B.

More information: http://tedlab.mit.edu/~dr/Tgrep2/

TGrep2 uses its own indexed file format. These files are automatically created when using this query engine. Given a file example.mrg, the file example.mrg.t2c.gz is created (in the same directory).

XPath syntax examples

Search through treebanks in XML format with XPath; treebanks must be in dact format. Note: XPath support depends on the alpinocorpus library; see https://github.com/rug-compling/alpinocorpus-python

Find a particular word:

//node[@word='loopt']

This is case-sensitive. If you want to find all inflectional variants of the verb lopen, do:

//node[@lemma='lopen']

To find main clauses:

//node[@cat="smain"]

Finite subordinate clauses:

//node[@cat="cp" and node[@rel="body" and @cat="ssub"]]

This locates cp nodes with an ssub child that has body as function tag (relation).
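The first two queries can be tried without the alpinocorpus library, using the limited XPath subset of Python's standard xml.etree module; the XML below is a contrived Alpino-style fragment for illustration only:

```python
import xml.etree.ElementTree as ET

# Contrived Alpino-style fragment (not from a real treebank).
doc = ET.fromstring("""
<alpino_ds>
  <node cat="top">
    <node cat="smain">
      <node rel="su" word="Jan" lemma="Jan"/>
      <node rel="hd" word="loopt" lemma="lopen"/>
    </node>
  </node>
</alpino_ds>""")
# Find a particular word (case-sensitive):
lemmas = [n.get('lemma') for n in doc.findall(".//node[@word='loopt']")]
# Find main clauses:
mains = doc.findall(".//node[@cat='smain']")
print(lemmas, len(mains))
```

Note that nested predicates such as the cp/ssub query require a full XPath engine like the one in alpinocorpus.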

General XPath overview: https://en.wikipedia.org/wiki/XPath Using XPath on Alpino treebanks: http://rug-compling.github.io/dact/cookbook/

To create files in dact format, the alpinocorpus tools may be used. Alternatively, discodop treetransforms can be used:

$ discodop treetransforms --input=alpino --output=dact 'mycorpus/*.xml' mycorpus.dact
Examples

Show trees that can contain a NP modified by a PP:

$ discodop treesearch --trees -e frag '(NP (NP ) (PP ))' wsj-02-21.mrg

Same query, but only show matching terminals:

$ discodop treesearch --only-matching --sents -e frag '(NP (NP ) (PP ))' ~/data/wsj-02-21.mrg

Perform a large number of regex queries from a file, and store counts in a CSV file:

$ discodop treesearch --csv --counts -e regex --file queries.txt corpus.txt > results.csv

treetransforms

Treebank binarization and conversion

Usage: discodop treetransforms [input [output]] [options]

where input and output are treebanks; standard in/output is used if not given.

Main transformation options

The following transforms are applied in the order given on the command line.

--introducepreterminals
 Add preterminals to terminals without a dedicated preterminal.
--transforms=<NAME1,NAME2...>
 Apply specific treebank transforms; available presets: negra, wsj, alpino, green2013ftb, km2003wsj, km2003simple, fraser2013tiger, lassy, lassy-func. For details, cf. the source of the discodop.treebanktransforms module.
--reversetransforms=<NAME1,NAME2,...>
 Undo specified transforms; specify transforms in original order.
--binarize [-h x] [-v x] [--factor=<left|right>] [...]
 Markovized binarization; also see --headrules and other options below.
--optimalbinarize [-h x] [-v x]
 Binarization that minimizes LCFRS fan-out/complexity.
--unbinarize Restore original n-ary trees.
--splitdisc [--markorigin]
 Split discontinuous nodes into several continuous nodes.
--mergedisc Reverse the node splitting operation.
Other options
--inputfmt=<export|bracket|discbracket|tiger|alpino|dact>
 Input treebank format [default: export].
--outputfmt=<export|bracket|discbracket|dact|conll|mst|tokens|wordpos>
 Output treebank format [default: export]. Selecting the formats conll or mst results in an unlabeled dependency conversion and requires the use of heuristic head rules (--headrules), to ensure that all constituents have a child marked as head.
--fmt=x Shortcut to specify both input and output format.
--inputenc, --outputenc, --enc=<utf-8|iso-8859-1|...>
 Treebank encoding [default: utf-8].
--slice=<n:m> select a range of sentences from input starting with n, up to but not including m; as in Python, n or m can be left out or negative, and the first index is 0.
--renumber Replace sentence IDs with numbers starting from 1, padded with 8 spaces.
--sentid With ‘tokens’ or ‘wordpos’ output format, prefix lines with identifiers of the form ID|.
--maxlen=n only select sentences with up to n tokens.
--punct=x
‘remove’:remove any punctuation.
‘move’:re-attach punctuation to nearest constituent to minimize discontinuity.
‘restore’:attach punctuation under root node.
--functions=x
‘leave’:(default): leave syntactic labels as is,
‘remove’:strip away hyphen-separated function labels
‘add’:concatenate syntactic categories with functions,
‘replace’:replace syntactic labels w/grammatical functions.
--morphology=x
‘no’ (default):use POS tags as preterminals
‘add’:concatenate morphological information to POS tags, e.g., DET/sg.def
‘replace’:use morphological information as preterminal label
‘between’:insert node with morphological information between POS tag and word, e.g., (DET (sg.def the))
--lemmas=x
‘no’ (default):do not use lemmas.
‘add’:concatenate lemmas to terminals, e.g., word/lemma
‘replace’:use lemma instead of terminals
‘between’:insert node with lemma between POS tag and word, e.g., (NN (man men))
--ensureroot=x add root node labeled x to trees if not already present.
--removeempty remove empty / -NONE- terminals.
--factor=<left|right>
 specify left- or right-factored binarization [default: right].
-h n horizontal markovization. default: infinite (all siblings)
-v n vertical markovization. default: 1 (immediate parent only)
--headrules=x turn on head finding; turns on head-outward binarization. reads rules from file x (e.g., “negra.headrules”).
--markhead include label of the head child in all auxiliary labels of binarization.
--direction mark direction when using head-outward binarization.
--labelfun=x x is a Python lambda function that takes a node and returns a label to be used for markovization purposes. For example, to get labels without state splits, pass this function: 'lambda n: n.label.split("^")[0]'
--leftunary make the initial productions of binarized constituents unary productions.
--rightunary make the final productions of binarized constituents unary productions.
--tailmarker mark the rightmost child (the head if head rules are applied), to avoid cyclic rules when --leftunary and --rightunary are used.
Example

Binarize a treebank:

$ discodop treetransforms --binarize --fmt=bracket treebankExample.mrg /tmp/bintrees
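Right-factored markovized binarization can be sketched in a few lines of plain Python; this is a toy illustration of the transform, not disco-dop's own implementation, which has many more options:

```python
def binarize(tree, h=1):
    """Toy right-factored binarization with horizontal markovization.

    Trees are (label, children) tuples; leaves are plain strings.
    Auxiliary labels like 'S|<B>' record h sibling labels.
    """
    if isinstance(tree, str):
        return tree
    label, children = tree
    children = [binarize(c, h) for c in children]
    if len(children) <= 2:
        return (label, children)
    node = children[-1]
    # Pair up children from right to left under auxiliary nodes.
    for i in range(len(children) - 2, 0, -1):
        sibs = ','.join(c[0] for c in children[i:i + h])
        node = ('%s|<%s>' % (label, sibs), [children[i], node])
    return (label, [children[0], node])

tree = ('S', [('A', ['a']), ('B', ['b']), ('C', ['c']), ('D', ['d'])])
print(binarize(tree))
```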

A parser for Probabilistic Linear Context-Free Rewriting Systems (LCFRS) and Probabilistic Context-Free Grammars (PCFG), as well as facilities to extract and parse with data-oriented parsing (DOP) grammars.

Usage: discodop <command> [arguments]

Command is one of:

runexp Run experiment: grammar extraction, parsing & evaluation.
fragments Extract recurring fragments from treebanks.
eval Evaluate discontinuous parse trees; similar to EVALB.
treetransforms Apply tree transformations and convert between formats.
treedraw Visualize (discontinuous) trees
treesearch Search through treebanks with queries.
grammar Read off grammars from treebanks.
parser Simple command line parser.
gen Generate sentences from a PLCFRS.
demos Show some demonstrations of formalisms encoded in LCFRS.

For additional instructions, issue discodop <command> --help or refer to the man page discodop-<command>.

Parser parameters

A parser is defined by a sequence of stages, and a set of global options:

stages=[
    stage1,
    stage2,
],
corpusfmt='...',
traincorpus=dict(...),
testcorpus=dict(...),
binarization=dict(...),
key1=val1,
key2=val2,

The parameters consist of a Python expression surrounded by an implicit 'dict(' and ')'. Note that each key=value is separated by a comma.
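One way to read such a parameter file is to wrap its contents in a dict(...) expression and evaluate it; this is a sketch with a minimal, hypothetical parameter text (disco-dop's own reader may differ, and eval should only be used on trusted input):

```python
# Hypothetical contents of a parameter file, for illustration:
params_text = """
stages=[
    dict(name='pcfg', mode='pcfg'),
],
corpusfmt='export',
"""
# The implicit dict( ... ) wrapper described above:
params = eval('dict(%s)' % params_text)
print(params['corpusfmt'], params['stages'][0]['name'])
```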

Corpora

corpusfmt:

The corpus format; choices:

'export':Negra export format
'bracket':Penn treebank style bracketed trees.
'discbracket':Bracketed parse trees with numeric indices and words of sentence specified separately.
'alpino':Alpino XML format
'tiger':Tiger XML format
traincorpus:

a dictionary with the following keys:

path:filename of training corpus; may include wildcards / globbing characters * and ?.
encoding:encoding of training corpus (defaults to 'utf-8')
maxwords:maximum sentence length to base grammar on
numsents:number of sentences to use from training corpus
testcorpus:

a dictionary with the following keys:

path:filename of test corpus (may be same as traincorpus, set skiptrain to True in that case).
encoding:encoding of test corpus (defaults to 'utf-8')
maxwords:maximum sentence length to parse from test set
numsents:number of sentences to parse
skiptrain:when training & test corpus are from same file, start reading test set after training set sentences
skip:number of (additional) sentences to skip before test corpus starts
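For example, a configuration that trains and tests on a single file could look as follows; the filename and sentence counts here are hypothetical:

```
traincorpus=dict(
    path='mytreebank.export',  # hypothetical filename
    maxwords=40,
    numsents=1000,
),
testcorpus=dict(
    path='mytreebank.export',  # same file as the training corpus,
    skiptrain=True,            # so skip the training sentences
    maxwords=40,
    numsents=100,
),
```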

Binarization

binarization:

a dictionary with the following keys:

method:

Binarization method; choices:

None:Treebank is already binarized.
'default':basic binarization (recommended).
'optimal':binarization which optimizes for lowest fan-out or parsing complexity.
'optimalhead':like optimal, but only considers head-driven binarizations.
factor:

'left' or 'right'. The direction of binarization when using default.

headrules:

file with rules for finding heads of constituents

markhead:

whether to prepend the label of the head child to sibling labels

v:

vertical markovization context; default 1; 2 means 1 extra level of parent annotation.

h:

horizontal markovization context

revh:

horizontal markovization context preceding child being generated

leftmostunary:

whether to start binarization with unary node

rightmostunary:

whether to end binarization with unary node

tailmarker:

with headrules, head is last node and can be marked

markovthreshold:

reduce the horizontal markovization of auxiliary labels whose frequency is lower than this threshold.

fanout_marks_before_bin:

whether to add fanout markers before binarization

labelfun:

specify a function from nodes to labels; can be used to change how labels appear in markovization, e.g., to strip off annotations.

Stages

Through the use of stages it is possible to run multiple parsers on the same test set, or to exploit coarse-to-fine pruning.

A stage has the form:

dict(
    key1=val1,
    key2=val2,
    ...
)

Where the keys and values are:

name:

identifier, used for filenames

mode:

The type of parser to use

'pcfg':CKY parser
'plcfrs':use the agenda-based PLCFRS parser
'pcfg-bitpar-nbest':
 Use external bitpar parser. Produces n-best list (up to n=1000) without producing a parse forest; works with non-binarized grammars (experimental).
'pcfg-bitpar-forest':
 Use external bitpar parser (experimental).
'dop-rerank':Rerank parse trees from previous stage with DOP reduction (experimental).
prune:

specify the name of a previous stage to enable coarse-to-fine pruning.

split:

split disc. nodes VP_2[101] as { VP*[100], VP*[001] }

splitprune:

treat VP_2[101] as {VP*[100], VP*[001]} for pruning

markorigin:

mark origin of split nodes: VP_2 => {VP*1, VP*2}
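The bitvector notation above can be made concrete: in an item such as VP_2[101], the bitvector 101 encodes which sentence positions the discontinuous node covers, and splitting it yields one item per continuous run of covered positions. The sketch below recovers those continuous components; the encoding (position 0 as the lowest bit) is an assumption for illustration:

```python
def components(vec):
    """Return (start, end) pairs for the continuous components of a
    span encoded as a bitvector (position 0 is the lowest bit)."""
    out, start, pos = [], None, 0
    while vec:
        if vec & 1:
            if start is None:
                start = pos      # a new continuous run begins
        elif start is not None:
            out.append((start, pos))  # a run just ended
            start = None
        vec >>= 1
        pos += 1
    if start is not None:
        out.append((start, pos))
    return out

# VP_2[101] covers positions 0 and 2, i.e., two continuous components:
print(components(0b101))
```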

k:

pruning parameter:

k=0:filter only (only prune items that do not lead to a complete derivation)
0 < k < 1:posterior threshold for inside-outside probabilities
k > 1:no. of coarse pcfg derivations to prune with
kbest:

extract m-best derivations from chart

sample:

sample m derivations from chart

m:

number of derivations to sample / enumerate.

binarized:

when using mode='pcfg-bitpar-nbest', this option can be set to False, to disable the two auxiliary binarizations needed for Double-DOP. This enables bitpar to do the binarization internally, which is more efficient.

dop:

enable DOP mode:

None:Extract treebank grammar
'reduction':DOP reduction (Goodman 1996, 2003)
'doubledop':Double-DOP (Sangati & Zuidema 2011)
'dop1':DOP1 (Bod 1992)
estimator:

DOP estimator. Choices:

'rfe':relative frequencies.
'ewe':equal weights estimate; relative frequencies with correction factor to remove bias for larger fragments; useful with DOP reduction.
'bon':Bonnema estimator; another correction factor approach.
objective:

Objective function to choose DOP parse tree. Choices:

'mpp':Most Probable Parse. Marginalizes over multiple derivations.
'mpd':Most Probable Derivation.
'mcc':Maximum Constituents Parse (Goodman 1996); approximation as in Sangati & Zuidema (2011); experimental.
'shortest':Most Probable Shortest Derivation; i.e., shortest derivation (with minimal number of fragments), where ties are broken using probabilities specified by estimator.
'sl-dop':Simplicity-Likelihood. Simplest Tree from the n most Likely trees.
'sl-dop-simple':
 An approximation which does not require parsing the sentence twice.
sldop_n:

When using sl-dop or sl-dop-simple, number of most likely parse trees to consider.

maxdepth:

with 'dop1', the maximum depth of fragments to extract; with 'doubledop', likewise but applying to the non-recurring/non-maximal fragments extracted to augment the set of recurring fragments.

maxfrontier:

with 'dop1', the maximum number of frontier non-terminals in extracted fragments; with 'doubledop', likewise but applying to the non-recurring/non-maximal fragments extracted to augment the set of recurring fragments.

collapse:

apply a multilevel coarse-to-fine preset. Values are of the form ('treebank', level); e.g., ('ptb', 0) for the coarsest level of the Penn treebank. For the presets, see the source of discodop.treebanktransforms.MAPPINGS. Include a stage for each of the collapse levels in ascending order (0, 1, and 2 in the current presets), and then add a stage where labels are not collapsed.

packedgraph:

use packed graph encoding for DOP reduction

iterate:

for Double-DOP, whether to add fragments of fragments

complement:

for Double-DOP, whether to include fragments which form the complement of the maximal recurring fragments extracted

neverblockre:

do not prune nodes with label that match this regex

estimates:

compute, store & use context-summary (outside) estimates

beam_beta:

beam pruning factor, between 0 and 1; 1 disables it. If enabled, new constituents must have a larger probability than the probability of the best constituent in a cell multiplied by this factor; i.e., a smaller value implies less pruning. Suggested value: 1e-4.
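The pruning condition can be spelled out with hypothetical probabilities (the numbers below are made up for illustration):

```python
beam_beta = 1e-4   # suggested value from the text
best = 0.01        # probability of the best constituent in this cell
candidate = 5e-7   # probability of a newly derived constituent
# The new constituent survives only if it beats best * beam_beta:
keep = candidate > best * beam_beta
print(keep)
```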

beam_delta:

if beam pruning is enabled, only apply it to spans up to this length.

Other options

evalparam:

EVALB-style parameter file to use for reporting F-scores

postagging:

To disable POS tagging and use the gold POS tags from the test set, set this to None. Otherwise, pass a dictionary with the keys below; for details, see discodop.lexicon

method:

one of:

'unknownword':incorporate unknown word model in grammar
'stanford':use external Stanford tagger
'treetagger':use external tagger 'treetagger'
'frog':use external tagger ‘frog’ for Dutch; produces CGN tags, use morphology=’replace’.
model:
with ‘unknownword’, one of:
 
'4':Stanford model 4; language agnostic
'6':Stanford model 6, for the English Penn treebank
'base':Stanford ‘base’ model; language agnostic
'ftb':Stanford model 2 for French treebank
with external taggers:

filename of tagger model (not applicable to ‘frog’)

retag:

if True, re-tag the training corpus using the external tagger.

unknownthreshold:

use probabilities of words that occur this number of times or less for unknown words

openclassthreshold:

add unseen tags for known words when the tag rewrites at least this number of words; 0 to disable.

simplelexsmooth:

enable/disable sophisticated smoothing (untested)

punct:

one of ...

None:leave punctuation as is.
'move':move punctuation to appropriate constituents using heuristics.
'moveall':same as ‘move’, but moves all preterminals under root, instead of only recognized punctuation.
'prune':prune away leading & ending quotes & periods, then move.
'remove':eliminate punctuation.
'root':attach punctuation directly to root (as in original Negra/Tiger treebanks).
functions:

one of ...

None:leave syntactic labels as is.
'add':concatenate grammatical function to syntactic label, separated by a hyphen, e.g., NP => NP-SBJ
'remove':strip away hyphen-separated grammatical function, e.g., NP-SBJ => NP
'replace':replace syntactic label with grammatical function, e.g., NP => SBJ
morphology:

one of ...

None:use POS tags as preterminals
'add':concatenate morphological information to POS tags, e.g., DET/sg.def
'replace':use morphological information as preterminal label
'between':add node with morphological information between POS tag and word, e.g., (DET (sg.def the))
lemmas:

one of ...

None:ignore lemmas
'between':insert lemma as node between POS tag and word.
removeempty:

True or False; whether to remove empty terminals from train, test sets.

ensureroot:

Ensure every tree has a root node with this label

transformations:

Apply specific treebank transforms; available presets: negra, wsj, alpino, green2013ftb, km2003wsj, km2003simple, fraser2013tiger, lassy, lassy-func. For details, cf. the source of the discodop.treebanktransforms module.

relationalrealizational:

apply the RR-transform; see discodop.treebanktransforms.rrtransform()

verbosity:

control the amount of output to console; a logfile output.log is also kept with a fixed log level of 2.

0:silent
1:summary report
2:per sentence results
3:dump derivations/parse trees
4:dump chart
numproc:

default 1; increase to use multiple CPUs; None: use all CPUs.

File formats

Treebanks

export

Negra export format (v3).

For example:

#BOS 0
is  VB  --  --  500
John    NP  --  --  0
rich    JJ  --  --  500
?   ?   --  --  0
#500    VP  --  --  0
#EOS 0

An optional lemma field is supported. Secondary edges are mostly ignored and not guaranteed to be preserved. The preamble listing the tag sets is ignored and not reproduced when trees are written in this format.

This format is supported when input is read incrementally from standard input with the treedraw and treetransforms commands.

Cf. http://www.coli.uni-saarland.de/projects/sfb378/negra-corpus/exformat3.ps
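The example above can be read back with a few lines of Python (a sketch): token lines are those whose first field does not start with #, so the sentence is recovered directly.

```python
# The export-format example from above, as a literal string:
export = """#BOS 0
is\tVB\t--\t--\t500
John\tNP\t--\t--\t0
rich\tJJ\t--\t--\t500
?\t?\t--\t--\t0
#500\tVP\t--\t--\t0
#EOS 0"""
# Keep only token lines (those not starting with '#'); the word is
# the first tab-separated field.
words = [line.split('\t')[0] for line in export.splitlines()
         if not line.startswith('#')]
print(' '.join(words))
```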

bracket

Penn-treebank style bracketed trees, one tree per line.

For example:

(S (NP John) (VP (VB is) (JJ rich)) (. .))

This format is supported when input is read incrementally from standard input with the treedraw and treetransforms commands.

Tree fragments can be represented by leaving out the children of any non-terminal, e.g.:

(VP (VB is) (JJ ))
(S (NP John) (VP ) (. .))

discbracket

A corpus format for discontinuous trees in bracket notation, where the leaves are prefixed with indices indicating word order. Each leaf must have at least one index; the indices form an unbroken range starting from 0, with each index occurring exactly once.

For example:

(S (VP (VB 0=is) (JJ 2=rich)) (NP 1=John) (? 3=?))
(sentence: is John rich ?)

Note that the leaves are not in the same order as in the sentence. The leaves must be sorted by the indices to restore the original sentence order. There is one parse tree per line. Compared to Negra’s export format, this format lacks separate fields for morphology, lemmas, and functional edges. On the other hand, it is close to the internal representation employed here, so it can be read efficiently.
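Sorting the leaves on their indices can be sketched with a regular expression over the bracketing (an illustration, not disco-dop's own reader):

```python
import re

# The discbracket example from above:
line = '(S (VP (VB 0=is) (JJ 2=rich)) (NP 1=John) (? 3=?))'
# Each leaf has the form "index=word"; collect and sort by index.
leaves = re.findall(r'(\d+)=([^\s()]+)', line)
sentence = [word for _, word in sorted(leaves, key=lambda t: int(t[0]))]
print(' '.join(sentence))
```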

This format is supported when input is read incrementally from standard input with the treedraw and treetransforms commands.

Tree fragments can be formed as with bracket trees, by leaving out terminals or whole subtrees:

(S (VP (VB 0=is) (JJ 2=)) (NP 1=) (? 3=?))
(VP (VB 0=is) (JJ 2=rich))

There is an extra case that should be handled, which is how to represent a discontinuous frontier non-terminal. This requires expressing how the spans of the discontinuous node relate to the other spans in the tree:

(S (VP 0= 2=) (NP 1=) (? 3=?))

While the VP node does not dominate any terminals, if they were to be added, they would end up before and after the NP node.

alpino

Alpino XML format. One file per sentence. The hierarchical tree structure is mirrored in the XML structure, which makes it possible to query trees in this format with XPath (as opposed to TigerXML which maintains the tabular structure of the Negra export format).

Cf. http://www.let.rug.nl/~vannoord/Lassy/alpino_ds.dtd

dact

Alpino XML trees in an XML database as used by Dact. Cf. http://rug-compling.github.io/dact/

Write-only formats
conll, mst:unlabeled dependencies; relies on head identification.
tokens:one sentence per line, tokens separated by spaces.
wordpos:similar to tokens but of the form token/POS.

Grammars

PCFG

PCFG grammars are stored in bitpar format. From the bitpar manual:

The grammar file contains one grammar rule per line. Each grammar rule starts with its frequency [...] followed by the parent category (symbol on the left-hand side) and the child categories (symbols on the right-hand side). The symbols are separated by whitespace. [...]

The lexicon file contains one lexicon entry per line. Each lexicon entry starts with the word [...] followed a sequence of part-of-speech tag + frequency pairs. The POS tag is preceded by a tab character and followed by a blank or tab character.

Cf. http://www.cis.uni-muenchen.de/~schmid/tools/BitPar/

PLCFRS

The PLCFRS format is as follows. Rules are delimited by newlines. Fields are separated by tabs. The fields are:

LHS RHS1    [RHS2]  yield-function  weight

The yield function defines how the spans of the RHS nonterminals are combined to form the spans of the LHS nonterminal. Components of the yield function are comma-separated; 0 refers to a component of the first RHS nonterminal, and 1 to a component of the second. Weights are expressed as rational fractions. The lexicon is defined in a separate file. Lines start with a single word, followed by pairs of possible tags and their probabilities:

WORD    TAG1    PROB1   [TAG2   PROB2 ...]

Example, rules file:

S  NP  VP  010 1/2
VP_2   VB  NP  0,1 2/3
NP NN  0   1/4

lexicon file:

is  VB  1/3
John    NN 1/2
rich    JJ 1/5
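Since fields are tab-separated and weights are rational fractions, a rule line can be decoded with Python's fractions module (a sketch):

```python
from fractions import Fraction

# The second rule from the example rules file above:
line = 'VP_2\tVB\tNP\t0,1\t2/3'
fields = line.split('\t')
lhs, rhs = fields[0], fields[1:-2]
yieldfunc, weight = fields[-2], Fraction(fields[-1])
# Yield function '0,1': VP_2 has two span components; the first comes
# from RHS1 (VB), the second from RHS2 (NP).
print(lhs, rhs, yieldfunc, weight)
```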
backtransform

Double-DOP grammars and other PTSGs employ a grammar in which internal nodes are removed from fragments to obtain a more compact grammar. Fragments are restored in derivations using a backtransform table with the original fragments for each grammar rule.

The backtransform file contains one fragment per line, with the lines corresponding to the lines of the grammar rule file. Frontier non-terminals are indicated as {0}, {1}, etc. The fragments on which this backtransform is based are also saved, with a filename of the form .fragments.gz. To view the grammar rules together with the corresponding fragments, issue the following command:

$ paste <(zcat dop.rules.gz) <(zcat dop.fragments.gz)
A       X       Y       01      1       (A (X 0) (Y 1)) 1
A_2     X       Z       0,1     1       (A_2 (X 0) (Z 2))       2
RIGHT   A_2     Y       010     1       (RIGHT (A_2 0 2) (Y 1)) 2
S       S}<0>   Z@z     01      2/5     (S (RIGHT (A_2 (X 0) (Z 2)) (Y 1)))     x y z   2
S       RIGHT   0       2/5     (S (RIGHT 0))   2
S       WRONG   0       1/5     (S (WRONG 0))   1
WRONG   A       Z       01      1       (WRONG (A 0) (Z 1))     1
S}<0>   X@x     Y@y     01      1
alternate weights

DOP grammars can contain multiple probability models. The alternate models are stored in a NumPy array:

$ python
>>> import numpy
>>> probs = numpy.load('dop.probs.npz')
>>> probs.keys()
['default', 'shortest', 'bon', 'ewe']
>>> probs['shortest'][:10]
array([ 0.5,  0.5,  0.5,  0.5,  0.5,  0.5,  0.5,  0.5,  0.5,  0.5])

In this case, we see the model for shortest derivation parsing, where every fragment is assigned a uniform weight of 0.5.

Miscellaneous

head assignment rules

This file specifies a set of heuristic rules to pick for every constituent one of its children as being the head of the constituent, based on syntactic categories.

The file is case insensitive. Lines starting with % are treated as comments and ignored. Each line specifies a rule of the form:

CAT direction child1 child2...

This rule specifies how a head child is assigned for a constituent labeled as CAT. The second argument specifies whether the children of the constituent should be considered starting from the left or from the right (corresponding to whether a category is head-first or head-final):

left:(or left-to-right) for each of the possible heads, try all children from left to right
right:(or right-to-left) for each of the possible heads, try all children from right to left
leftdis:go from left to right and try each possible head.
rightdis:go from right to left and try each possible head.
like:treat this label as if it were another label; e.g. ‘TOP like ROOT’.

There may be multiple rules for a category, for example if they go in opposite directions. The rules are applied in the order as they appear in the file.

The list of children may be empty; in that case the leftmost (or rightmost, in the second case) child will be chosen as head. If the list of possible children is non-empty, the children of the constituents are iterated over for each possible child, and the first matching child is picked as the head.
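The procedure for the 'left' and 'right' directions can be sketched as follows; the fallback to the first child when no candidate matches is an assumption for illustration:

```python
def findhead(children, direction, candidates):
    """Toy sketch of head assignment: outer loop over candidate head
    labels, inner loop over children in the given direction."""
    seq = children if direction == 'left' else list(reversed(children))
    if not candidates:
        return seq[0]  # empty list: leftmost (or rightmost) child
    for cand in candidates:
        for child in seq:
            if child == cand:
                return child
    return seq[0]  # assumption: fall back to first child in direction

# Hypothetical rule "VP left VB VBD" applied to children NP VBD NP:
print(findhead(['NP', 'VBD', 'NP'], 'left', ['VB', 'VBD']))
```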

See also: http://www.cs.columbia.edu/~mcollins/papers/heads

evaluation parameters

The format of this file is a superset of the parameters for EVALB, cf. http://nlp.cs.nyu.edu/evalb/

The parameter file should be encoded in UTF-8 and supports the following options in addition to those supported by EVALB:

DELETE_ROOT_PRETERMS:

if nonzero, ignore preterminals directly under the root in gold trees for scoring purposes.

DISC_ONLY:

if nonzero, only consider discontinuous bracketings (affects precision, recall, f-measure, exact match).

LA:

if nonzero, report leaf-ancestor scores [default: disabled].

TED:

if nonzero, report tree-edit distance scores; disabled by default as these are slow to compute. NB: it is not clear whether this score is applicable to discontinuous trees.

DEBUG:
-1:only print summary table
0:additionally, print category / tag breakdowns (default) (after application of cutoff length).
1:give per-sentence results ('--verbose')
2:give detailed information for each sentence ('--debug')
MAX_ERROR:

This value is ignored; no errors are tolerated. The parameter is accepted to support the use of unmodified EVALB parameter files.

API documentation

Discontinuous Data-Oriented Parsing (disco-dop).

Main components:

  • A parser for Probabilistic Linear Context-Free Rewriting Systems (LCFRS), as well as Probabilistic Context-Free Grammars (PCFG).
  • Facilities to extract and parse with tree fragments using data-oriented parsing (DOP) grammars.

Python modules

cli Command-line interfaces to modules.
demos Examples of various formalisms encoded in LCFRS grammars.
eval Evaluation of (discontinuous) parse trees.
fragments Extract recurring tree fragments from constituency treebanks.
functiontags Function tags classifier.
gen Generate random sentences with an LCFRS.
grammar Assorted functions to read off grammars from treebanks.
heads Functions related to finding the linguistic head of a constituent.
lexicon Add rules to handle unknown words and smooth lexical probabilities.
parser Parser object that performs coarse-to-fine parsing and postprocessing.
punctuation Punctuation related functions.
runexp Run experiment: grammar extraction, parsing & evaluation.
tree Various Tree objects for representing syntax or morphological trees.
treebank Read and write treebanks.
treebanktransforms Treebank transformations.
treedist Tree edit distance implementations.
treesearch Search through treebanks with queries.
treetransforms Treebank-independent tree transformations.
util Misc code to avoid cyclic imports.

Cython modules

_fragments Fragment extraction with tree kernels.
bit Functions for working with bitvectors.
coarsetofine Project selected items from a chart to corresponding items in next grammar.
containers Data types for chart items, edges, &c.
disambiguation Disambiguate parse forests with various methods for parse selection.
estimates Computation of outside estimates for best-first or A* parsing.
kbest Extract the k-best derivations from a probabilistic parse forest.
pcfg CKY parser for Probabilistic Context-Free Grammar (PCFG).
plcfrs Parser for string-rewriting Linear Context-Free Rewriting Systems.
