fragments
Extract recurring tree fragments from constituency treebanks.
Usage:
discodop fragments <treebank1> [treebank2] [options]
or:
discodop fragments --batch=<dir> <treebank1> <treebank2>... [options]

If only one treebank is given, extract the fragments that pairs of its trees have in common. If two treebanks are given, extract the fragments that trees of the first treebank have in common with trees of the second. Input is in Penn treebank format (S-expressions), one tree per line. Output contains lines of the form “tree<TAB>frequency”. Frequencies refer to the first treebank by default. Output is sent to stdout; to save the results, redirect to a file or use -o.
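As a rough illustration of consuming this output, the sketch below parses lines of the form tree<TAB>frequency. It assumes the default bracket output with plain integer frequencies. The function name `parse_fragments` and the sample lines are hypothetical, made up for illustration; they are not part of discodop.

```python
from typing import Iterable, Iterator, Tuple


def parse_fragments(lines: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Yield (tree, frequency) pairs from 'tree<TAB>frequency' lines."""
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        # Split on the last tab only, so the tree part is kept intact.
        tree, freq = line.rsplit("\t", 1)
        yield tree, int(freq)


# Hypothetical sample output, made up for illustration only.
sample = [
    "(NP (DT the) (NN ))\t4\n",
    "(S (NP ) (VP ))\t12\n",
]
fragments = list(parse_fragments(sample))

# Sort by descending frequency to see the most common fragments first.
most_frequent = sorted(fragments, key=lambda pair: pair[1], reverse=True)
print(most_frequent[0])
```

The same loop could read from a file saved by redirecting stdout; options that change the last column (such as --relfreq or --indices) would need a different conversion than `int`.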
Options:
| --fmt=<export|bracket|discbracket|tiger|alpino|dact> | when format is not bracket, work with discontinuous trees;
output is in discbracket format:
tree<TAB>sentence<TAB>frequency
where tree has indices as leaves, referring to elements of
sentence, a space separated list of words. |
| --numtrees=n | only read first n trees from first treebank |
| --encoding=x | specify treebank encoding, e.g. utf-8 [default], iso-8859-1, etc. |
| -o file | write output to file instead of stdout. |
| --complete | treebank1 is a list of fragments (needle), result is the
indices / counts of these fragments in treebank2 (haystack). |
| --batch=dir | enable batch mode; any number of treebanks > 1 can be given;
first treebank (A) will be compared to each (B) of the rest.
Results are written to filenames of the form dir/A_B.
Counts/indices are from B. |
| --indices | report sets of 0-based indices where fragments occur instead of frequencies. |
| --relfreq | report relative frequencies with respect to the root node of fragments, in the form n/m. |
| --approx | report how often each fragment occurs as a maximal fragment (a lower bound on its frequency). |
| --nofreq | do not report frequencies. |
| --cover=<n[,m]> | include all non-maximal/non-recurring fragments up to depth n
of first treebank; optionally, limit number of substitution
sites to m (default is unlimited). |
| --twoterms=x | only extract fragments with at least two lexical terminals,
one of which has a POS tag which matches the given regex.
For example, to match POS tags of content words in the
Penn treebank: ^(?:NN(?:[PS]|PS)?|(?:JJ|RB)[RS]?|VB[DGNPZ])$ |
| --adjacent | only compare pairs of adjacent trees (i.e., sentence numbers n and n + 1). |
| --debin | debinarize fragments. Since fragments may contain incomplete binarized constituents, the result may still contain artificial nodes from the binarization in the root or frontier non-terminals of the fragments. |
| --alt | alternative output format: (NP (DT "a") NN)
default: (NP (DT a) (NN )) |
| --numproc=n | use n independent processes, to enable multi-core usage
(default: 1); use 0 to detect the number of CPUs. |
| --debug | extra debug information, ignored when numproc > 1. |
| --quiet | disable all messages. |
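The example pattern given for --twoterms can be sanity-checked against a few Penn treebank POS tags with Python's `re` module. This only tests the regex itself, not how discodop applies it, and the list of tags is chosen for illustration. Note that, as written, the pattern requires a suffix after VB, so the bare tag VB does not match.

```python
import re

# The content-word pattern quoted in the --twoterms description.
CONTENT_POS = re.compile(r"^(?:NN(?:[PS]|PS)?|(?:JJ|RB)[RS]?|VB[DGNPZ])$")

# A handful of Penn treebank POS tags, chosen for illustration.
tags = ["NN", "NNS", "NNP", "NNPS", "JJ", "JJR", "RB", "RBS",
        "VBD", "VBZ", "VB", "DT", "IN", "PRP"]

matching = [tag for tag in tags if CONTENT_POS.search(tag)]
rejected = [tag for tag in tags if not CONTENT_POS.search(tag)]

print("match:   ", " ".join(matching))
print("no match:", " ".join(rejected))
```

When passing such a pattern on the command line, it generally needs to be quoted so that the shell does not interpret characters like `|`, `(` and `$`.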