Read off grammars from treebanks.

Usage, one of:

discodop grammar param <parameter-file> <output-directory>
discodop grammar <type> <input> <output> [options]
discodop grammar info <rules-file>
discodop grammar merge (rules|lexicon|fragments) <input1> <input2>... <output>

The first form extracts a grammar according to a parameter file. See the documentation on parameter files.

The second form extracts simple grammars (e.g., without unknown word handling or coarse-to-fine parsing).

type is one of:

pcfg: Probabilistic Context-Free Grammar (treebank grammar).
plcfrs: Probabilistic Linear Context-Free Rewriting System (discontinuous treebank grammar).
ptsg: Probabilistic Tree-Substitution Grammar.
dopreduction: All-fragments PTSG using Goodman’s reduction.
doubledop: PTSG from recurring fragments.
dop1: PTSG from all fragments up to given depth.

input is a binarized treebank or, in the ptsg case, a file of weighted fragments in the same format as the output of the discodop fragments command. The input may contain discontinuous constituents, except in the pcfg case.

output is the base name for the files the grammar is written to: <output>.rules and <output>.lex.

Other subcommands:

info: Print statistics for PLCFRS/bitpar grammar rules.
merge: Interpolate given sorted grammars into a single grammar. Input can be a rules, lexicon or fragment file.

NB: both the info and merge commands expect grammars to be sorted by left-hand side (LHS), as the grammars created by this tool are.
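The sortedness requirement is what allows grammars to be combined in a single streaming pass. The sketch below illustrates the idea under stated assumptions: rules are held in memory as hypothetical (lhs, rhs, weight) tuples rather than in the actual file formats, each input is sorted on (lhs, rhs) (which implies sorted by LHS), and identical rules are combined by summing their weights. How the merge subcommand actually combines weights is not specified here.

```python
import heapq
from itertools import groupby


def merge_sorted_rules(*grammars):
    """Merge rule lists that are each sorted on (lhs, rhs).

    Assumption for this sketch: duplicate rules are combined by summing
    their weights. Because the inputs are sorted, duplicates end up
    adjacent in the merged stream and no input is held twice in memory.
    """
    def key(rule):
        return rule[:2]

    merged = heapq.merge(*grammars, key=key)
    return [
        (lhs, rhs, sum(weight for _, _, weight in group))
        for (lhs, rhs), group in groupby(merged, key=key)
    ]


# Hypothetical inputs, each already sorted on (lhs, rhs).
g1 = [("NP", ("DT", "NN"), 2), ("S", ("NP", "VP"), 1)]
g2 = [("NP", ("DT", "NN"), 1), ("VP", ("VBD",), 1)]
merged = merge_sorted_rules(g1, g2)
```

With unsorted input, identical rules would no longer be adjacent after merging and would be emitted as separate entries, which is why sorted grammars are required.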


Options

--inputfmt=<export|bracket|...>
 The treebank format [default: export].
 Treebank encoding [default: utf-8].
 Number of processes to start [default: 1]. Only relevant for Double-DOP fragment extraction.
--gzip compress output with gzip; view with zless etc.
--packed use packed graph encoding for DOP reduction.
-s X start symbol to use for PTSG.
 The DOP estimator to use with dopreduction/doubledop [default: rfe].
--maxdepth=N, --maxfrontier=N
 When extracting a ‘dop1’ grammar, the limit on what fragments are extracted; 3 or 4 is a reasonable depth limit.
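The reason a depth limit matters is that the number of fragments grows quickly with depth. The following sketch counts fragment occurrences in a single tree under the usual DOP1 definition, where every nonterminal child of a fragment node is either left as a frontier node or expanded further. The tuple representation and function names are assumptions for illustration; the tool's exact depth convention and its --maxfrontier limit are not modelled.

```python
def count_fragments(tree, maxdepth):
    """Number of fragments rooted at this node with depth <= maxdepth."""
    if maxdepth < 1:
        return 0
    _, *children = tree
    total = 1
    for child in children:
        if not isinstance(child, str):  # words are always kept as-is
            # A nonterminal child is either a frontier node (1 way) or is
            # expanded by one of its own fragments of depth <= maxdepth - 1.
            total *= 1 + count_fragments(child, maxdepth - 1)
    return total


def subtrees(tree):
    """Yield every nonterminal node of the tree, root first."""
    yield tree
    for child in tree[1:]:
        if not isinstance(child, str):
            yield from subtrees(child)


def total_fragments(tree, maxdepth):
    """Count fragment occurrences rooted at any node of the tree."""
    return sum(count_fragments(node, maxdepth) for node in subtrees(tree))


# Hypothetical example tree.
tree = ("S", ("NP", ("DT", "the"), ("NN", "cat")), ("VP", ("VBD", "slept")))
counts = {depth: total_fragments(tree, depth) for depth in (1, 2, 3)}
```

For this three-word tree the counts are 6, 13 and 24 for depth limits 1, 2 and 3. The growth is multiplicative in the number of children, so realistic treebank trees yield far larger numbers at the same depth.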

Grammar formats

When a PCFG is requested, or the input format is bracket (Penn format), the output will be in bitpar format. Otherwise the grammar is written as a PLCFRS. The encoding of the input treebank may be specified. Output encoding will be ASCII for the rules, and UTF-8 for the lexicon.

See the documentation on grammar formats.
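The rule for which output format is chosen can be restated as a small decision function. This is only a paraphrase of the paragraph above, not code from the tool; the function name output_format and the returned dictionary are hypothetical.

```python
def output_format(grammar_type, inputfmt="export"):
    """Return the grammar format and file encodings described above."""
    if grammar_type == "pcfg" or inputfmt == "bracket":
        grammar_format = "bitpar"
    else:
        grammar_format = "plcfrs"
    return {
        "format": grammar_format,
        "rules_encoding": "ascii",   # <output>.rules
        "lexicon_encoding": "utf-8",  # <output>.lex
    }


pcfg_case = output_format("pcfg")
bracket_case = output_format("doubledop", inputfmt="bracket")
export_case = output_format("plcfrs", inputfmt="export")
```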


Examples

Extract a Double-DOP grammar given binarized trees:

$ discodop grammar doubledop --inputfmt=bracket /tmp/bintrees /tmp/example

Extract grammars specified in a parameter file:

$ discodop grammar param filename.prm /tmp/example