fragments

Extract recurring tree fragments from constituency treebanks.

Usage: discodop fragments <treebank1> [treebank2] [options]
or: discodop fragments --batch=<dir> <treebank1> <treebank2>... [options]

If only one treebank is given, extract fragments in common between its pairs of trees. If two treebanks are given, extract fragments in common between the trees of the first & second treebank. Input is in Penn treebank format (S-expressions), one tree per line. Output contains lines of the form “tree<TAB>frequency”. Frequencies refer to the first treebank by default. Output is sent to stdout; to save the results, redirect to a file.

Options:

--fmt=<export|bracket|discbracket|tiger|alpino>
 when format is not bracket, work with discontinuous trees; output is in discbracket format: tree<TAB>sentence<TAB>frequency where tree has indices as leaves, referring to elements of sentence, a space separated list of words.
--numtrees=n only read first n trees from first treebank
--encoding=x specify treebank encoding, e.g. utf-8 [default], iso-8859-1, etc.
-o file Write output to file instead of stdout.
--complete treebank1 is a list of fragments (needle), result is the indices / counts of these fragments in treebank2 (haystack).
--batch=dir enable batch mode; any number of treebanks > 1 can be given; first treebank (A) will be compared to each (B) of the rest. Results are written to filenames of the form dir/A_B. Counts/indices are from B.
--indices report sets of 0-based indices where fragments occur instead of frequencies.
--relfreq report relative frequencies wrt. root node of fragments of the form n/m.
--approx report counts of occurrence as maximal fragment (lower bound)
--nofreq do not report frequencies.
--cover=<n[,m]>
 include all non-maximal/non-recurring fragments up to depth n of first treebank; optionally, limit number of substitution sites to m (default is unlimited).
--twoterms=x only extract fragments with at least two lexical terminals, one of which has a POS tag which matches the given regex. For example, to match POS tags of content words in the Penn treebank: ^(?:NN(?:[PS]|PS)?|(?:JJ|RB)[RS]?|VB[DGNPZ])$
--adjacent only compare pairs of adjacent trees (i.e., sent no. n, n + 1).
--debin debinarize fragments. Since fragments may contain incomplete binarized constituents, the result may still contain artificial nodes from the binarization in the root or frontier non-terminals of the fragments.
--alt alternative output format: (NP (DT "a") NN) default: (NP (DT a) (NN ))
--numproc=n use n independent processes, to enable multi-core usage (default: 1); use 0 to detect the number of CPUs.
--debug extra debug information, ignored when numproc > 1.
--quiet disable all messages.