grammar
Read off grammars from treebanks.
Usage, one of:
discodop grammar param <parameter-file> <output-directory>
discodop grammar <type> <input> <output> [options]
discodop grammar info <rules-file>
discodop grammar merge (rules|lexicon|fragments) <input1> <input2>... <output>
The first form extracts a grammar according to a parameter file; see the documentation on parameter files.
The second form extracts simple grammars (e.g., without unknown word handling or coarse-to-fine parsing).
type is one of:
| Type | Description |
|---|---|
| pcfg | Probabilistic Context-Free Grammar (treebank grammar). |
| plcfrs | Probabilistic Linear Context-Free Rewriting System (discontinuous treebank grammar). |
| ptsg | Probabilistic Tree-Substitution Grammar. |
| dopreduction | All-fragments PTSG using Goodman’s reduction. |
| doubledop | PTSG from recurring fragments. |
| dop1 | PTSG from all fragments up to given depth. |
input is a binarized treebank, or in the ptsg case, weighted fragments
in the same format as the output of the discodop fragments command;
input may contain discontinuous constituents, except for the pcfg case.
output is the base name for the filenames to write the grammar to; the
filenames will be <output>.rules and <output>.lex.
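As a rough illustration of the second form, the sketch below extracts a plain PCFG from a bracketed treebank and names the two files that should result. The paths /tmp/bintrees and /tmp/example are placeholders borrowed from the examples on this page, and the guard around the call is only there so the snippet does nothing harmful on a machine without discodop or without that treebank.

```shell
# Placeholder paths (assumptions for illustration); substitute your own.
treebank=/tmp/bintrees   # binarized treebank in bracket (Penn) format
output=/tmp/example      # base name for the grammar files

# Only run the extraction when discodop and the treebank are both present.
if command -v discodop >/dev/null 2>&1 && [ -f "$treebank" ]; then
    discodop grammar pcfg --inputfmt=bracket "$treebank" "$output"
fi

# Per the description above, the grammar is written to these two files.
rules_file="$output.rules"
lex_file="$output.lex"
echo "rules:   $rules_file"
echo "lexicon: $lex_file"
```

Replacing pcfg with plcfrs and pointing the command at a treebank in one of the discontinuous formats should follow the same pattern.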
Other subcommands:
| Subcommand | Description |
|---|---|
| info | Print statistics for PLCFRS/bitpar grammar rules. |
| merge | Interpolate given sorted grammars into a single grammar. Input can be a rules, lexicon or fragment file. |
NB: both the info and merge commands expect grammars to be sorted by
LHS, such as the ones created by this tool.
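A minimal sketch of the two subcommands, following the usage lines at the top of this page. The file names are hypothetical, and treating the last argument of merge as the path of the merged rules file is an assumption read off the usage line rather than something stated explicitly here.

```shell
# Hypothetical file names (assumptions for illustration).
rules_a=/tmp/a.rules
rules_b=/tmp/b.rules
merged=/tmp/merged.rules

if command -v discodop >/dev/null 2>&1 && [ -f "$rules_a" ] && [ -f "$rules_b" ]; then
    # Print statistics for one of the rule files.
    discodop grammar info "$rules_a"
    # Interpolate the two LHS-sorted rule files into a single one.
    discodop grammar merge rules "$rules_a" "$rules_b" "$merged"
else
    echo "skipped: discodop or input rule files not found"
fi
```

Both inputs are expected to be sorted by LHS already, which holds for rule files written by discodop grammar itself.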
Options
| Option | Description |
|---|---|
| --inputfmt=<export\|bracket\|discbracket\|tiger\|alpino> | The treebank format [default: export]. |
| --inputenc=<utf-8\|iso-8859-1\|…> | Treebank encoding [default: utf-8]. |
| --numproc=<1\|2\|…> | Number of processes to start [default: 1]. Only relevant for double dop fragment extraction. |
| --gzip | Compress output with gzip, view with zless &c. |
| --packed | Use packed graph encoding for DOP reduction. |
| -s X | Start symbol to use for PTSG. |
| --dopestimator=<rfe\|ewe\|shortest\|…> | The DOP estimator to use with dopreduction/doubledop [default: rfe]. |
| --maxdepth=N, --maxfrontier=N | When extracting a ‘dop1’ grammar, the limit on what fragments are extracted; 3 or 4 is a reasonable depth limit. |
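The options combine with the grammar types roughly as sketched below. The treebank path, the output base names, the process count of 4 and the choice of the ewe estimator are all illustrative assumptions; only the option names and the suggested depth range come from the list above.

```shell
# Hypothetical inputs (assumptions for illustration).
treebank=/tmp/bintrees.export   # binarized treebank in export format (the default)
output=/tmp/example
maxdepth=3                      # 3 or 4 is described above as a reasonable limit

if command -v discodop >/dev/null 2>&1 && [ -f "$treebank" ]; then
    # PTSG from all fragments up to the given depth.
    discodop grammar dop1 --maxdepth="$maxdepth" "$treebank" "$output-dop1"
    # Double-DOP with parallel fragment extraction and a non-default estimator.
    discodop grammar doubledop --numproc=4 --dopestimator=ewe "$treebank" "$output-ddop"
else
    echo "skipped: discodop or treebank not found"
fi
```

Since --inputfmt is omitted, both commands rely on the default export format.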
Grammar formats
When a PCFG is requested, or the input format is bracket (Penn format), the
output will be in bitpar format. Otherwise the grammar is written as a PLCFRS.
The encoding of the input treebank may be specified. Output encoding will be
ASCII for the rules, and UTF-8 for the lexicon.
See the documentation on grammar formats.
Examples
Extract a Double-DOP grammar given binarized trees:
$ discodop grammar doubledop --inputfmt=bracket /tmp/bintrees /tmp/example
Extract grammars specified in a parameter file:
$ discodop grammar param filename.prm /tmp/example