discodop._fragments

Fragment extraction with tree kernels.

Implements:

Functions

allfragments(Ctrees trees, Vocabulary vocab, …) Return all fragments of trees up to maxdepth.
completebitsets(Ctrees trees, …[, start, …]) Generate bitsets corresponding to whole trees in the input.
exactcounts(list bitsets, Ctrees trees1, …) Get exact counts or indices of occurrence for fragments.
exactcountsslice(list bitsets, …[, …]) Get counts of fragments in a slice of the treebank.
extractfragments(Ctrees trees1, int start1, …) Find the largest fragments in treebank(s) with the fast tree kernel.
getctrees(items1[, items2]) Convert binarized Tree objects to Ctrees object.
pygetsent(unicode frag) Wrapper of getsent() to make doctests possible.
readtreebank(treebankfile, Vocabulary vocab) Read a treebank from a given filename.
repl(d) A function for use with re.sub that looks up numeric IDs in a dict.
discodop._fragments.extractfragments(Ctrees trees1, int start1, int end1, Vocabulary vocab, Ctrees trees2=None, int start2=0, int end2=0, bool approx=True, bool debug=False, bool disc=False, unicode twoterms=None, bool adjacent=False, maxnodes=None)

Find the largest fragments in treebank(s) with the fast tree kernel.

  • scenario 1: recurring fragments in a single treebank:
    extractfragments(trees1, start1, end1, vocab)
  • scenario 2: common fragments in two treebanks:
    extractfragments(trees1, start1, end1, vocab, trees2)
Parameters:
  • start1, end1 – indices into trees1 that specify the slice of the treebank to work on; can be used to divide the work over multiple processes (pass 0 for both to use all trees).
  • start2, end2 – idem for trees2.
  • approx – return approximate counts instead of bitsets.
  • debug – if True, print a table of common productions for each pair of trees.
  • disc – if True, return trees with indices as leaves.
  • twoterms – only return fragments with at least two terminals, one of which has a POS tag matching the given regex.
  • adjacent – only extract fragments from sentences with adjacent indices.
  • maxnodes – the maximum number of nodes in a single tree to fix the bitset size. Set this manually when combining results from different sets of trees to ensure a consistent bitset size. By default it is the maximum value across both treebanks.
Returns:

a dictionary; keys are fragments as strings; values are either counts (if approx=True), or bitsets describing fragments of trees1.
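The start1 and end1 parameters allow the work to be divided over several processes. The following is a minimal sketch of how such slices could be computed. It uses no disco-dop code; the helper name slices is hypothetical, and it assumes half-open intervals (start inclusive, end exclusive), which this page does not state.

```python
def slices(numtrees, numchunks):
    """Split range(numtrees) into at most numchunks contiguous (start, end) pairs.

    Assumption: end is exclusive. Empty chunks are dropped, because
    passing 0 for both indices is documented to mean "use all trees".
    """
    size, extra = divmod(numtrees, numchunks)
    result, start = [], 0
    for n in range(numchunks):
        # The first `extra` chunks get one additional tree.
        end = start + size + (1 if n < extra else 0)
        if end > start:
            result.append((start, end))
        start = end
    return result


print(slices(10, 3))  # [(0, 4), (4, 7), (7, 10)]
```

Each pair would then be passed as start1 and end1 in a separate worker. When results from several workers are combined, the description of maxnodes above suggests setting it explicitly so that all workers use the same bitset size.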

discodop._fragments.exactcounts(list bitsets, Ctrees trees1, Ctrees trees2, int indices=False, maxnodes=None)

Get exact counts or indices of occurrence for fragments.

Parameters:
  • bitsets, trees1 – bitsets defines fragments of trees in trees1 to search for (the needles).
  • trees2 – the trees to search in (haystack); may be equal to trees1. The returned counts are occurrences in these trees.
  • indices – whether to collect indices or counts of fragments:

    0: return a single count per fragment.
    1: collect the indices (sentence numbers) in which fragments occur.
    2: collect both sentence numbers and node numbers of fragments.
  • maxnodes – the maximum number of nodes in a single tree to fix the bitset size; use the same value as the function that generated these bitsets. For extractfragments, it is the maximum value across both treebanks, which is also the default here.
Returns:

depending on indices:

0: an array of counts, corresponding to bitsets.
1: a list of arrays, each array being a sorted sequence of indices for the corresponding bitset; multiple occurrences of a fragment in the same tree are reflected as multiple occurrences of the same index.
2: a list of pairs of arrays, tree indices paired with node numbers. The node number is the index in the tree of the root of the matching fragment.
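To illustrate the shape of these return values, here is a small sketch that post-processes them. Plain Python lists stand in for the arrays, and the helper names summarize and occurrences are hypothetical; nothing here calls disco-dop itself.

```python
from collections import Counter


def summarize(indices):
    """Summarize one result for indices=1.

    indices is a sorted sequence of tree indices in which a repeated
    index means several matches in the same tree.
    """
    per_tree = Counter(indices)
    return {
        'total': len(indices),       # all occurrences
        'trees': len(per_tree),      # distinct trees with a match
        'per_tree': dict(per_tree),  # occurrences per tree
    }


def occurrences(tree_indices, node_numbers):
    """Pair up one result for indices=2 as (tree index, root node) tuples."""
    return list(zip(tree_indices, node_numbers))


print(summarize([0, 0, 3, 7]))
# {'total': 4, 'trees': 3, 'per_tree': {0: 2, 3: 1, 7: 1}}
print(occurrences([0, 0, 3], [1, 5, 2]))
# [(0, 1), (0, 5), (3, 2)]
```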

discodop._fragments.completebitsets(Ctrees trees, Vocabulary vocab, short maxnodes, bool disc=False, start=None, end=None, tostring=True)

Generate bitsets corresponding to whole trees in the input.

Parameters:
  • tostring – when False, do not create the list of trees as strings.
Returns:

a pair of lists with trees as strings and their bitsets, respectively.

A tree with a discontinuous substitution site is expected to be binarized with rightmostunary=True:

>>> from discodop.treebank import brackettree
>>> tree, sent = brackettree(u'(S (X 0= 2= 4=))')
>>> print(handledisc(tree))
(S (X 0 (X|<> 2 (X|<> 4))))

These auxiliary nodes will not be part of the returned tree / bitset:

>>> tmp = getctrees([(tree, sent)])
>>> print(completebitsets(tmp['trees1'], tmp['vocab'], 2, disc=True)[0][0])
(S (X 0= 2= 4=))
discodop._fragments.allfragments(Ctrees trees, Vocabulary vocab, unsigned int maxdepth, unsigned int maxfrontier=999, bool disc=True, bool indices=False, start=None, end=None)

Return all fragments of trees up to maxdepth.

Parameters:
  • maxdepth – maximum depth of fragments; depth 1 gives fragments that are equivalent to a treebank grammar.
  • maxfrontier – maximum number of frontier non-terminals (substitution sites) in fragments; a limit of 0 only gives fragments that bottom out in terminals; 999 is unlimited for practical purposes.
  • start, end – only consider this interval of trees (default is all).
Returns:

a dictionary of fragments, with tree strings as keys and integer counts as values (or arrays if indices is True).
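The remark that depth 1 corresponds to a treebank grammar can be made concrete with a small sketch. It reads productions off continuous bracketed trees in pure Python and counts them. This only illustrates the concept: it is not how allfragments is implemented, it ignores discontinuity, and parse, productions and treebank_grammar are hypothetical names.

```python
import re
from collections import Counter

TOKEN = re.compile(r'\(|\)|[^\s()]+')


def parse(text):
    """Parse a bracketed tree into nested (label, children) tuples."""
    tokens = TOKEN.findall(text)
    pos = 0

    def node():
        nonlocal pos
        label = tokens[pos + 1]  # tokens[pos] is the opening bracket
        pos += 2
        children = []
        while tokens[pos] != ')':
            if tokens[pos] == '(':
                children.append(node())
            else:  # a terminal
                children.append(tokens[pos])
                pos += 1
        pos += 1  # skip the closing bracket
        return label, children

    return node()


def productions(tree):
    """Yield (lhs, rhs) for every node: the fragments of depth 1."""
    label, children = tree
    yield label, tuple(c if isinstance(c, str) else c[0] for c in children)
    for child in children:
        if not isinstance(child, str):
            yield from productions(child)


def treebank_grammar(trees):
    """Count the productions in an iterable of bracketed tree strings."""
    return Counter(p for t in trees for p in productions(parse(t)))


grammar = treebank_grammar([
    '(S (NP (DT the) (NN cat)) (VP (VBD slept)))',
    '(S (NP (DT the) (NN dog)) (VP (VBD barked)))',
])
print(grammar[('S', ('NP', 'VP'))])  # 2
```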

discodop._fragments.repl(d)

A function for use with re.sub that looks up numeric IDs in a dict.
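The pattern described here, a callback for re.sub that is bound to a lookup table, can be sketched as follows. The name make_repl is hypothetical, and the use of integer keys is an assumption; the actual repl may differ in these details.

```python
import re


def make_repl(table):
    """Return a callback for re.sub that looks up each numeric match in table.

    Assumption: the keys of table are integers.
    """
    def lookup(match):
        return table[int(match.group())]
    return lookup


labels = {0: 'NP', 1: 'VP', 2: 'S'}
print(re.sub(r'\d+', make_repl(labels), '(2 (0 ) (1 ))'))
# (S (NP ) (VP ))
```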

discodop._fragments.pygetsent(unicode frag)

Wrapper of getsent() to make doctests possible.

>>> print(pygetsent(u'(S (NP 2=man) (VP 4=walks))'))
(S (NP 0=man) (VP 2=walks))
>>> print(pygetsent(u'(VP (VB 0=Wake) (PRT 3=up))'))
(VP (VB 0=Wake) (PRT 2=up))
>>> print(pygetsent(u'(S (NP 2:2 4:4) (VP 1:1 3:3))'))
(S (NP 1= 3=) (VP 0= 2=))
>>> print(pygetsent(u'(ROOT (S 0:2) ($. 3=.))'))
(ROOT (S 0=) ($. 1=.))
>>> print(pygetsent(u'(ROOT (S 0=Foo) ($. 3=.))'))
(ROOT (S 0=Foo) ($. 2=.))
>>> print(pygetsent(
... u'(S|<VP>_2 (VP_3 0:1 3:3 16:16) (VAFIN 2=wird))'))
(S|<VP>_2 (VP_3 0= 2= 4=) (VAFIN 1=wird))
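The renumbering visible in these examples can be reproduced with a short pure-Python sketch. The rule is inferred from the doctests above and is not taken from the implementation of getsent(): leaves are sorted by their start index, the first leaf becomes 0, a leaf directly adjacent to the previous one gets the next index, and any larger gap is collapsed to a gap of one. The function name renumber is hypothetical.

```python
import re

# A leaf is either "N=word" (terminal) or "N:M" (frontier span);
# the lookbehind keeps digits inside labels such as VP_3 from matching.
LEAF = re.compile(r'(?<= )(\d+)(?:=([^ ()]*)|:(\d+))')


def renumber(frag):
    """Renumber the leaf indices of a fragment string to a canonical form."""
    spans = []
    for match in LEAF.finditer(frag):
        start = int(match.group(1))
        end = int(match.group(3)) if match.group(3) is not None else start
        spans.append((start, end))
    mapping, new, prevend = {}, 0, None
    for start, end in sorted(spans):
        if prevend is not None:
            # Adjacent leaves stay adjacent; any gap shrinks to one position.
            new += 1 if start == prevend + 1 else 2
        mapping[start] = new
        prevend = end

    def rewrite(match):
        return '%d=%s' % (mapping[int(match.group(1))], match.group(2) or '')

    return LEAF.sub(rewrite, frag)


print(renumber('(S (NP 2=man) (VP 4=walks))'))
# (S (NP 0=man) (VP 2=walks))
```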
discodop._fragments.getctrees(items1, items2=None, Vocabulary vocab=None, bool index=True)

Convert binarized Tree objects to Ctrees object.

Parameters:
  • items1 – an iterable with tuples of the form (tree, sent).
  • items2 – optionally, a second iterable of trees.
  • index – whether to create production index of trees.
Returns:

a dictionary with keys ‘trees1’, ‘trees2’, and ‘vocab’, where trees1 and trees2 are Ctrees objects for discontinuous binary trees and sentences.

discodop._fragments.readtreebank(treebankfile, Vocabulary vocab, fmt=u'bracket', limit=None, encoding=u'utf8')

Read a treebank from a given filename.

vocab should be re-used when reading multiple treebanks.

Returns:

a tuple of a Ctrees object and a list of sentences.
discodop._fragments.exactcountsslice(list bitsets, Ctrees trees1, Ctrees trees2, int indices=0, maxnodes=None, start=None, end=None, maxresults=None)

Get counts of fragments in a slice of the treebank.

Variant of exactcounts() that releases the GIL in the inner loop and is intended for searching in subsets of trees2.

Parameters:
  • start, end – only search through this interval of trees from trees2 (defaults to all trees).
  • maxresults – stop searching after this number of matches.
Returns:

depending on indices:

0: an array of counts, corresponding to bitsets.
1: a list of arrays, each array being a sequence of indices for the corresponding bitset.
2: a list of pairs of arrays, tree indices paired with node numbers.
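Because the inner loop releases the GIL, searches over different slices of trees2 can plausibly run in parallel threads. The sketch below shows one way to organize that for indices=0 and to merge the per-slice counts. It is an illustration under assumptions: count_in_slice is a hypothetical stand-in that counts substrings in a list of strings, where a real program would call exactcountsslice with start and end, and parallel_counts is likewise a hypothetical name.

```python
from concurrent.futures import ThreadPoolExecutor


def count_in_slice(needles, haystack, start, end):
    """Stand-in for a per-slice search: one count per needle."""
    return [sum(tree.count(needle) for tree in haystack[start:end])
            for needle in needles]


def parallel_counts(needles, haystack, chunk=2, workers=2):
    """Search contiguous slices in a thread pool and sum the counts."""
    bounds = [(start, min(start + chunk, len(haystack)))
              for start in range(0, len(haystack), chunk)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial = list(pool.map(
            lambda b: count_in_slice(needles, haystack, *b), bounds))
    # Element-wise sum: position i holds the total for needles[i].
    return [sum(column) for column in zip(*partial)]


trees = ['(S NP VP)', '(S NP NP)', '(VP NP)']
print(parallel_counts(['NP', 'VP'], trees))  # [4, 2]
```

Only plain counts are merged here. Whether the indices returned for indices=1 or indices=2 are relative to the slice or to the whole of trees2 is not stated on this page, so merging those results would need to be checked first.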