discodop._fragments
Fragment extraction with tree kernels.
Implements:
- van Cranenburgh. 2014. Extraction of Phrase-Structure Fragments with a Linear Average Time Tree-Kernel. http://www.clinjournal.org/sites/default/files/01-Cranenburgh-CLIN2014.pdf
- Sangati et al. 2010. Efficiently extract recurring tree fragments from large treebanks. http://lrec-conf.org/proceedings/lrec2010/pdf/613_Paper.pdf
- Moschitti. 2006. Making Tree Kernels practical for Natural Language Learning. http://aclweb.org/anthology/E06-1015
Functions
allfragments(Ctrees trees, Vocabulary vocab, …) | Return all fragments of trees up to maxdepth.
completebitsets(Ctrees trees, …[, start, …]) | Generate bitsets corresponding to whole trees in the input.
exactcounts(list bitsets, Ctrees trees1, …) | Get exact counts or indices of occurrence for fragments.
exactcountsslice(list bitsets, …[, …]) | Get counts of fragments in a slice of the treebank.
extractfragments(Ctrees trees1, int start1, …) | Find the largest fragments in treebank(s) with the fast tree kernel.
getctrees(items1[, items2]) | Convert binarized Tree objects to Ctrees object.
pygetsent(unicode frag) | Wrapper of getsent() to make doctests possible.
readtreebank(treebankfile, Vocabulary vocab) | Read a treebank from a given filename.
repl(d) | A function for use with re.sub that looks up numeric IDs in a dict.
-
discodop._fragments.extractfragments(Ctrees trees1, int start1, int end1, Vocabulary vocab, Ctrees trees2=None, int start2=0, int end2=0, bool approx=True, bool debug=False, bool disc=False, unicode twoterms=None, bool adjacent=False, maxnodes=None)
Find the largest fragments in treebank(s) with the fast tree kernel.
- scenario 1: recurring fragments in a single treebank, use: extractfragments(trees1, start1, end1, vocab)
- scenario 2: common fragments in two treebanks, use: extractfragments(trees1, start1, end1, vocab, trees2)
Parameters:
- start1, end1 – specify the slice of the treebank to be used; can be used to divide the work over multiple processes; they are indices of trees1 to work on (pass 0 for both to use all trees).
- start2, end2 – idem for trees2.
- approx – return approximate counts instead of bitsets.
- debug – if True, a table of common productions is printed for each pair of trees.
- disc – if True, return trees with indices as leaves.
- twoterms – only return fragments with at least two terminals, one of which has a POS tag matching the given regex.
- adjacent – only extract fragments from sentences with adjacent indices.
- maxnodes – the maximum number of nodes in a single tree to fix the bitset size. Set this manually when combining results from different sets of trees to ensure a consistent bitset size. By default it is the maximum value across both treebanks.
Returns: a dictionary; keys are fragments as strings; values are either counts (if approx=True), or bitsets describing fragments of trees1.
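The start1 and end1 parameters are described as a way to divide the work over multiple processes. As a rough sketch of how such slices could be computed, the helper below splits a number of trees into contiguous (start, end) pairs. The name slice_bounds is hypothetical and not part of discodop, and the sketch assumes that end is exclusive, as in Python slices; check this against the library before relying on it.

```python
def slice_bounds(numtrees, numproc):
    """Split range(numtrees) into at most numproc contiguous (start, end) pairs.

    Hypothetical helper, not part of discodop; assumes `end` is exclusive.
    """
    if numtrees <= 0:
        return []
    size = -(-numtrees // numproc)  # ceiling division
    return [(start, min(start + size, numtrees))
            for start in range(0, numtrees, size)]


print(slice_bounds(10, 3))  # [(0, 4), (4, 8), (8, 10)]
```

Each pair could then be passed as start1 and end1 in a separate worker process. Note that a pair of (0, 0) would select all trees according to the parameter description, so the helper never emits it.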
-
discodop._fragments.exactcounts(list bitsets, Ctrees trees1, Ctrees trees2, int indices=False, maxnodes=None)
Get exact counts or indices of occurrence for fragments.
Parameters:
- bitsets, trees1 – bitsets defines fragments of trees in trees1 to search for (the needles).
- trees2 – the trees to search in (haystack); may be equal to trees1. The returned counts are occurrences in these trees.
- indices – whether to collect indices or counts of fragments.
  0: return a single count per fragment.
  1: collect the indices (sentence numbers) in which fragments occur.
  2: collect both sentence numbers and node numbers of fragments.
- maxnodes – the maximum number of nodes in a single tree to fix the bitset size; use the same value as the function that generated these bitsets. For extractfragments, it is the maximum value across both treebanks, which is also the default here.
Returns: depending on indices:
  0: an array of counts, corresponding to bitsets.
  1: a list of arrays, each array being a sorted sequence of indices for the corresponding bitset; multiple occurrences of a fragment in the same tree are reflected as multiple occurrences of the same index.
  2: a list of pairs of arrays, tree indices paired with node numbers. The node number is the index in the tree of the root of the matching fragment.
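To illustrate how the result for indices=1 relates to plain counts, the sketch below takes one sorted index sequence with repeated entries and derives a total and a per-tree count from it. The input data and the name summarize are made up for illustration; a plain Python list stands in for the array that the function is described as returning.

```python
from collections import Counter


def summarize(indices):
    """Derive the total and per-tree counts from one sorted index sequence.

    Hypothetical helper; `indices` stands in for one array of the
    indices=1 result, where repeated entries mean repeated occurrences.
    """
    return len(indices), dict(Counter(indices))


# Made-up result for a single fragment: twice in tree 0, once in trees 3 and 7.
occurrences = [0, 0, 3, 7]
total, pertree = summarize(occurrences)
print(total)    # 4
print(pertree)  # {0: 2, 3: 1, 7: 1}
```

Under the description above, the total should correspond to the single count that indices=0 reports for the same fragment.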
-
discodop._fragments.completebitsets(Ctrees trees, Vocabulary vocab, short maxnodes, bool disc=False, start=None, end=None, tostring=True)
Generate bitsets corresponding to whole trees in the input.
Parameters: tostring – when False, do not create list of trees as strings.
Returns: a pair of lists with trees as strings and their bitsets, respectively.
A tree with a discontinuous substitution site is expected to be binarized with rightmostunary=True:
>>> from discodop.tree import discbrackettree
>>> tree, sent = discbrackettree('(S (X 0= 2= 4=))')
>>> print(handledisc(tree))
(S (X 0 (X|<> 2 (X|<> 4))))
These auxiliary nodes will not be part of the returned tree / bitset:
>>> tmp = getctrees([(tree, sent)])
>>> print(completebitsets(tmp['trees1'], tmp['vocab'], 2, disc=True)[0][0])
(S (X 0= 2= 4=))
-
discodop._fragments.allfragments(Ctrees trees, Vocabulary vocab, unsigned int maxdepth, unsigned int maxfrontier=999, bool disc=True, bool indices=False, start=None, end=None)
Return all fragments of trees up to maxdepth.
Parameters:
- maxdepth – maximum depth of fragments; depth 1 gives fragments that are equivalent to a treebank grammar.
- maxfrontier – maximum number of frontier non-terminals (substitution sites) in fragments; a limit of 0 only gives fragments that bottom out in terminals; 999 is unlimited for practical purposes.
- start, end – only consider this interval of trees (default is all).
Returns: dictionary fragments with tree strings as keys and integer counts as values (or arrays if indices is True).
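The remark that depth 1 gives fragments equivalent to a treebank grammar can be made concrete with a toy example: the depth-1 fragments of a tree are its productions (a node together with the labels of its children). The sketch below counts them over nested tuples. It does not use Ctrees or the fragment string format of discodop; the tuple representation and the name productions are assumptions made for illustration only.

```python
from collections import Counter


def productions(tree):
    """Yield (label, children) pairs for every node of a toy tree.

    A tree is a tuple (label, child, ...); a child is either another
    tuple (a subtree) or a string (a terminal). Illustration only.
    """
    label, children = tree[0], tree[1:]
    yield label, tuple(c[0] if isinstance(c, tuple) else c for c in children)
    for child in children:
        if isinstance(child, tuple):
            yield from productions(child)


trees = [
    ('S', ('NP', 'Mary'), ('VP', ('VB', 'walks'))),
    ('S', ('NP', 'John'), ('VP', ('VB', 'walks'))),
]
counts = Counter(p for t in trees for p in productions(t))
print(counts[('S', ('NP', 'VP'))])  # 2
print(counts[('VB', ('walks',))])   # 2
print(counts[('NP', ('Mary',))])    # 1
```

Larger values of maxdepth would additionally combine such productions into deeper fragments, which this sketch does not attempt.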
-
discodop._fragments.repl(d)
A function for use with re.sub that looks up numeric IDs in a dict.
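The general pattern behind such a function is that re.sub accepts a callable as its replacement argument and calls it with each match object. The sketch below shows one way this could look; make_repl is a hypothetical stand-in, and the assumptions that the dict is keyed by int and that the whole match is the ID are not confirmed by the documentation.

```python
import re


def make_repl(d):
    """Return a callable for re.sub that maps matched numeric IDs via `d`.

    Hypothetical stand-in, not the library's implementation; assumes
    the whole match is a decimal ID and that `d` has int keys.
    """
    def replace(match):
        return d[int(match.group())]
    return replace


labels = {0: 'NP', 1: 'VP'}
print(re.sub(r'\d+', make_repl(labels), '(0 (1 x))'))  # (NP (VP x))
```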
-
discodop._fragments.pygetsent(unicode frag)
Wrapper of getsent() to make doctests possible.
>>> print(pygetsent(u'(S (NP 2=man) (VP 4=walks))'))
(S (NP 0=man) (VP 2=walks))
>>> print(pygetsent(u'(VP (VB 0=Wake) (PRT 3=up))'))
(VP (VB 0=Wake) (PRT 2=up))
>>> print(pygetsent(u'(S (NP 2:2 4:4) (VP 1:1 3:3))'))
(S (NP 1= 3=) (VP 0= 2=))
>>> print(pygetsent(u'(ROOT (S 0:2) ($. 3=.))'))
(ROOT (S 0=) ($. 1=.))
>>> print(pygetsent(u'(ROOT (S 0=Foo) ($. 3=.))'))
(ROOT (S 0=Foo) ($. 2=.))
>>> print(pygetsent(
...     u'(S|<VP>_2 (VP_3 0:1 3:3 16:16) (VAFIN 2=wird))'))
(S|<VP>_2 (VP_3 0= 2= 4=) (VAFIN 1=wird))
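Judging from the pygetsent doctests alone, leaf indices appear to be renumbered so that the lowest index becomes 0, each span such as 0:2 collapses to a single index, adjacent leaves stay adjacent, and any larger gap shrinks to a gap of exactly one. The pure-Python sketch below reproduces the documented outputs under that reading. It is inferred from the examples only and is not the library's implementation; the names LEAF and renumber are hypothetical.

```python
import re

# A leaf is either "index=word" or a frontier span "start:end".
LEAF = re.compile(r'(?<= )(\d+)(?:=([^\s()]*)|:(\d+))')


def renumber(frag):
    """Renumber leaf indices of a fragment string (sketch, see lead-in)."""
    spans = []
    for match in LEAF.finditer(frag):
        start = int(match.group(1))
        end = int(match.group(3)) if match.group(3) is not None else start
        spans.append((start, end))
    mapping = {}
    new, prev_end = 0, None
    for start, end in sorted(spans):
        if prev_end is not None:
            # Adjacent leaves advance by one; any gap is reduced to one.
            new += 1 if start == prev_end + 1 else 2
        mapping[start] = new
        prev_end = end

    def substitute(match):
        word = match.group(2) or ''
        return '%d=%s' % (mapping[int(match.group(1))], word)

    return LEAF.sub(substitute, frag)


print(renumber('(S (NP 2=man) (VP 4=walks))'))  # (S (NP 0=man) (VP 2=walks))
print(renumber('(ROOT (S 0:2) ($. 3=.))'))      # (ROOT (S 0=) ($. 1=.))
```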
-
discodop._fragments.getctrees(items1, items2=None, Vocabulary vocab=None, bool index=True)
Convert binarized Tree objects to Ctrees object.
Parameters:
- items1 – an iterable with tuples of the form (tree, sent).
- items2 – optionally, a second iterable of trees.
- index – whether to create production index of trees.
Returns: dictionary with keys ‘trees1’, ‘trees2’, and ‘vocab’, where trees1 and trees2 are Ctrees objects for discontinuous binary trees and sentences.
-
discodop._fragments.readtreebank(treebankfile, Vocabulary vocab, fmt=u'bracket', limit=None, encoding=u'utf8')
Read a treebank from a given filename.
vocab should be re-used when reading multiple treebanks.
Returns: tuple of Ctrees object and list of sentences.
-
discodop._fragments.exactcountsslice(list bitsets, Ctrees trees1, Ctrees trees2, int indices=0, maxnodes=None, start=None, end=None, maxresults=None)
Get counts of fragments in a slice of the treebank.
Variant of exactcounts() that releases the GIL in the inner loop and is intended for searching in subsets of trees2.
Parameters:
- start, end – only search through this interval of trees from trees2 (defaults to all trees).
- maxresults – stop searching after this number of matches.
Returns: depending on indices:
  0: an array of counts, corresponding to bitsets.
  1: a list of arrays, each array being a sequence of indices for the corresponding bitset.
  2: a list of pairs of arrays, tree indices paired with node numbers.
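Because the function is described as releasing the GIL and searching a slice of trees2, one plausible use is to search several slices from separate threads and add up the per-slice counts. The sketch below shows only that fan-out and merge pattern. The function search_slice is a stand-in that counts substrings in strings; it is not tree matching and not a discodop call, and all names here are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def search_slice(needles, haystack, start, end):
    """Stand-in for a per-slice search: one count per needle.

    Counts substring occurrences in haystack[start:end]; the real
    function matches fragments against trees instead.
    """
    return [sum(tree.count(needle) for tree in haystack[start:end])
            for needle in needles]


def parallel_counts(needles, haystack, bounds):
    """Search each (start, end) slice in a thread and sum the counts."""
    with ThreadPoolExecutor() as pool:
        parts = pool.map(
            lambda b: search_slice(needles, haystack, *b), bounds)
        # Elementwise sum: one total per needle across all slices.
        return [sum(column) for column in zip(*parts)]


haystack = [
    '(S (NP ) (VP ))',
    '(NP (DT ) (NN ))',
    '(S (NP ) (VP (VB ) (NP )))',
]
print(parallel_counts(['(NP ', '(VP '], haystack, [(0, 2), (2, 3)]))  # [4, 2]
```

With indices set to 1 or 2, the per-slice results would be sequences of indices and would need to be concatenated per fragment instead of summed.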