discodop.fragments

Extract recurring tree fragments from constituency treebanks.

NB: a known bug in multiprocessing makes it impossible to detect Ctrl-C or fatal errors such as segmentation faults in child processes, so the master program may wait forever for output from its children. To abort, kill the program manually (e.g., press Ctrl-Z and issue 'kill %1'). If the program seems stuck, re-run without multiprocessing (pass --numproc 1) to see whether a bug is the cause.

Functions

allfragments(trees, sents, maxdepth[, …]) Return all fragments up to a given depth and number of frontier nodes.
altrepr(a) Rewrite bracketed tree to alternative format.
batch(outputdir, filenames, limit, encoding, …) Batch processing: three or more treebanks specified.
cpu_count() Return number of CPUs or 1.
debinarize(fragments) Debinarize fragments; fragments that fail to debinarize left as-is.
exactcountworker(args) Worker function for counting of fragments.
initworker(filename1, filename2, limit, encoding) Read treebanks for this worker.
initworkersimple(trees, sents[, trees2, sents2]) Initialization for a worker in which a treebank was already loaded.
main([argv]) Command line interface to fragment extraction.
mpexactcountworker(args) Worker function for counts (multiprocessing wrapper).
mpworker(interval) Worker function for fragment extraction (multiprocessing wrapper).
printfragments(fragments, counts[, out]) Dump fragments to standard output or some other file object.
read2ndtreebank(filename2, vocab[, fmt, …]) Read a second treebank.
readtreebanks(filename1[, filename2, fmt, …]) Read one or two treebanks.
recurringfragments(trees, sents[, numproc, …]) Get recurring fragments with exact counts in a single treebank.
regular(filenames, numproc, limit, encoding) Non-batch processing.
test() Demonstration of fragment extractor.
worker(interval) Worker function for fragment extraction.
workload(numtrees, mult, numproc) Calculate an even workload.
discodop.fragments.main(argv=None)

Command line interface to fragment extraction.

discodop.fragments.regular(filenames, numproc, limit, encoding)

Non-batch processing; multiprocessing is optional.

discodop.fragments.batch(outputdir, filenames, limit, encoding, debin)

Batch processing: three or more treebanks specified.

Compares the first treebank to all others, and writes the results to outputdir/A_B where A and B are the respective filenames. Counts/indices are from the other (B) treebanks. There are at least 2 use cases for this:

  1. Comparing one treebank to a series of others. The first treebank will
    only be loaded once.
  2. In combination with --complete, the first treebank is a set of
    fragments used as queries on the other treebanks specified.
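
The outputdir/A_B naming described above can be sketched as follows. This is only an illustration: the helper `batch_output_paths` and the example filenames are hypothetical, and the sketch assumes that A and B are the base names of the treebank files, which the description does not state explicitly.

```python
import os


def batch_output_paths(outputdir, filenames):
    """Pair the first treebank (A) with each other treebank (B) as outputdir/A_B."""
    first, others = filenames[0], filenames[1:]
    a = os.path.basename(first)  # assumption: only the base name is used
    return [os.path.join(outputdir, '%s_%s' % (a, os.path.basename(b)))
            for b in others]


# Hypothetical filenames: one result file per (first, other) pair.
for path in batch_output_paths('results', ['queries.mrg', 'tb1.mrg', 'tb2.mrg']):
    print(path)
```

Under these assumptions the first treebank appears in every output name, matching the description that it is compared to all others.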
discodop.fragments.readtreebanks(filename1, filename2=None, fmt='bracket', limit=None, encoding='utf8')

Read one or two treebanks.

discodop.fragments.read2ndtreebank(filename2, vocab, fmt='bracket', limit=None, encoding='utf8')

Read a second treebank.

discodop.fragments.initworker(filename1, filename2, limit, encoding)

Read treebanks for this worker.

We do this separately for each process under the assumption that this is advantageous with a NUMA architecture.

discodop.fragments.initworkersimple(trees, sents, trees2=None, sents2=None)

Initialization for a worker in which a treebank was already loaded.

discodop.fragments.worker(interval)

Worker function for fragment extraction.

discodop.fragments.exactcountworker(args)

Worker function for counting of fragments.

discodop.fragments.workload(numtrees, mult, numproc)

Calculate an even workload.

When n trees are compared against themselves, n * (n - 1) / 2 total comparisons are made. Each tree m has to be compared to all trees x such that m < x <= n (meaning there are more comparisons for lower m).

Returns: a sequence of (start, end) intervals such that the number of comparisons is approximately balanced.
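
The balancing idea can be sketched in a few lines. This is not the library's implementation: `balanced_intervals` and its `numchunks` parameter are hypothetical, and how the real function combines mult and numproc is not shown here. The sketch uses 0-based tree numbers and only relies on the fact that each tree is compared with every later tree.

```python
def balanced_intervals(numtrees, numchunks):
    """Split range(numtrees) into contiguous (start, end) intervals
    with roughly equal numbers of pairwise comparisons."""
    if numchunks <= 1 or numtrees < 2:
        return [(0, numtrees)]
    total = numtrees * (numtrees - 1) // 2  # all unordered pairs
    target = total / numchunks              # comparisons per interval
    intervals = []
    start = 0
    done = 0  # comparisons assigned to the current interval
    for m in range(numtrees):
        done += numtrees - 1 - m  # tree m is compared with all later trees
        if done >= target and len(intervals) < numchunks - 1:
            intervals.append((start, m + 1))
            start = m + 1
            done = 0
    if start < numtrees:
        intervals.append((start, numtrees))
    return intervals


print(balanced_intervals(10, 3))
```

For 10 trees and 3 chunks this yields [(0, 2), (2, 5), (5, 10)], i.e. 17, 18 and 10 comparisons: early intervals contain fewer trees because low-numbered trees take part in more comparisons.
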
discodop.fragments.recurringfragments(trees, sents, numproc=1, disc=True, indices=True, maxdepth=1, maxfrontier=999)

Get recurring fragments with exact counts in a single treebank.

Returns:

a dictionary whose keys are fragments and whose values are indices. When disc is True, keys are of the form (frag, sent) where frag is a unicode string, and sent is a list of words as unicode strings; when disc is False, keys are of the form frag where frag is a unicode string.

Parameters:
  • trees – a sequence of binarized Tree objects, with indices as leaves.
  • sents – the corresponding sentences (lists of strings).
  • numproc – number of processes to use; pass 0 to use the detected number of CPUs.
  • disc – when disc=True, assume trees with discontinuous constituents; resulting fragments will be of the form (frag, sent); otherwise fragments will be strings with words as leaves.
  • indices – when False, return integer counts instead of indices.
  • maxdepth – when > 0, add ‘cover’ fragments to result, corresponding to all fragments up to given depth; pass 0 to disable.
  • maxfrontier – maximum number of frontier non-terminals (substitution sites) in cover fragments; a limit of 0 only gives fragments that bottom out in terminals; the default 999 is unlimited for practical purposes.
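
To make the two key shapes concrete, the sketch below uses hand-written stand-in data rather than real extractor output. The fragments, the sentence numbers, and the helper `to_counts` are all hypothetical. It also assumes that indices are sequences of sentence numbers whose length equals the count that indices=False would report, and it stores sent as a tuple because dictionary keys must be hashable.

```python
# Stand-in for a result with disc=True, indices=True (hypothetical data):
# keys are (frag, sent) pairs, values are sentence numbers.
disc_result = {
    ('(NP (DT 0) (NN 1))', ('the', 'dog')): [0, 2, 5],
    ('(VP (VB 0) (NP (DT 1) (NN 2)))', ('saw', 'the', 'dog')): [2, 5],
}

# Stand-in for disc=False: keys are plain strings with words as leaves.
nodisc_result = {
    '(NP (DT the) (NN dog))': [0, 2, 5],
}


def to_counts(result):
    """Map each fragment to the number of indices recorded for it."""
    return {fragment: len(indices) for fragment, indices in result.items()}


for (frag, sent), count in to_counts(disc_result).items():
    print(count, frag, ' '.join(sent))
```
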
discodop.fragments.allfragments(trees, sents, maxdepth, maxfrontier=999)

Return all fragments up to a given depth and number of frontier nodes.

discodop.fragments.altrepr(a)

Rewrite bracketed tree to alternative format.

  • Replace double quotes with double single quotes: " -> ''
  • Quote terminals with double quotes: terminal -> "terminal"
  • Remove parentheses around frontier nodes: (NN ) -> NN

>>> print(altrepr('(NP (DT a) (NN ))'))
(NP (DT "a") NN)
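
The three rewrites can be approximated with regular expressions. The sketch below is an independent illustration, not the library's code: the function name `altrepr_sketch` and both patterns are assumptions, and it only handles simple bracketed trees in which labels and terminals contain no spaces or parentheses.

```python
import re

# Hypothetical patterns for this sketch; not taken from the library source.
TERMINAL = re.compile(r'\(([^ ()]+) ([^ ()]+)\)')  # e.g. (DT a)
FRONTIER = re.compile(r'\(([^ ()]+) \)')           # e.g. (NN )


def altrepr_sketch(tree):
    """Apply the three documented rewrites to a bracketed tree string."""
    tree = tree.replace('"', "''")           # " -> ''
    tree = TERMINAL.sub(r'(\1 "\2")', tree)  # (DT a) -> (DT "a")
    return FRONTIER.sub(r'\1', tree)         # (NN ) -> NN


print(altrepr_sketch('(NP (DT a) (NN ))'))
```

On the documented example this sketch produces the same string as the doctest, (NP (DT "a") NN). The quote replacement runs first so that the double quotes added around terminals are not themselves rewritten.
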
discodop.fragments.debinarize(fragments)

Debinarize fragments; fragments that fail to debinarize left as-is.

discodop.fragments.printfragments(fragments, counts, out=None)

Dump fragments to standard output or some other file object.

discodop.fragments.cpu_count()

Return number of CPUs or 1.
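
One plausible way to obtain this behaviour with the standard library is shown below. It is a sketch under the assumption that the fallback to 1 covers platforms where the CPU count cannot be determined; the name `cpu_count_sketch` is hypothetical and the module's actual implementation may differ.

```python
import multiprocessing


def cpu_count_sketch():
    """Return the number of CPUs, or 1 if it cannot be determined."""
    try:
        return multiprocessing.cpu_count()
    except NotImplementedError:
        # multiprocessing.cpu_count() raises this when the count is unknown.
        return 1


print(cpu_count_sketch())
```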