discodop.treetransforms

Treebank-indenpendent tree transformations.

This file contains three main transformations:

  • A straightforward binarization: binarize(), based on NLTK code. Provides some additional Markovization options.
  • An optimal binarization for LCFRS: optimalbinarize() Cf. Gildea (2010): Optimal parsing strategies for linear context-free rewriting systems. http://aclweb.org/anthology/N10-1118
  • Converting discontinuous trees to continuous trees and back: splitdiscnodes(). Cf. Boyd (2007): Discontinuity revisited. http://aclweb.org/anthology/W07-1506

Functions

addbitsets(tree) Turn tree into an ImmutableTree and add bitset attribute.
addfanoutmarkers(tree) Mark discontinuous constituents with ‘_n’ where n = # gaps + 1.
binarize(tree[, factor, horzmarkov, …]) Binarize a Tree object.
binarizetree(tree, binarization, …) Binarize a single tree.
canonicalize(tree) Restore canonical linear precedence order; tree is modified in-place.
collapseunary(tree[, collapsepos, …]) Collapse unary nodes into a new node indicated by ‘joinchar’.
complexity(tree) The degree of the time complexity of parsing with this rule.
complexityfanout(tree) Return a tuple with the complexity and fan-out of a subtree.
contsets(nodes) Partition children into continuous subsets.
factorconstituent(node[, sep, h, factor, …]) Binarize one constituent with a left/right factored binarization.
fanout(tree) Return fan-out of constituent.
fanoutcomplexity(tree) Return a tuple with the fan-out and complexity of a subtree.
getbits(bitset) Iterate over the indices of set bits in a bitset.
getyf(left, right) Return the yield function for two subtrees with bitsets.
handledisc(tree) Binarize discontinuous substitution sites.
introducepreterminals(tree, sent[, ids]) Add preterminals with artificial POS-tags for terminals with siblings.
markovthreshold(trees, n, horzmarkov, vertmarkov) Reduce Markov order of binarization labels occurring < n times.
mergediscnodes(tree) Reverse transformation of splitdiscnodes().
minimalbinarization(tree, score[, sep, …]) Find optimal binarization according to a scoring function.
optimalbinarize(tree[, sep, headdriven, h, …]) Recursively binarize a tree, optimizing for given function.
removeemptynodes(tree, sent) Remove any empty nodes, and any empty ancestors.
removefanoutmarkers(tree) Remove fanout marks.
removeterminals(tree, sent, func) Remove any terminal for which func is True, and any empty ancestors.
splitdiscnodes(tree[, markorigin]) Return a continuous version of tree by splitting discontinuous nodes.
treebankfanout(trees) Get maximal fan-out of a list of trees.
unbinarize(tree[, _sent, expandunary, …]) Restore a binarized tree to the original n-ary tree.
discodop.treetransforms.binarize(tree, factor='right', horzmarkov=999, vertmarkov=1, revhorzmarkov=0, markhead=False, headoutward=False, childchar='|', parentchar='^', tailmarker='', leftmostunary=False, rightmostunary=False, threshold=2, artpa=True, ids=None, filterlabels=(), labelfun=None, dot=False, direction=False)[source]

Binarize a Tree object.

Parameters:
  • factor – “left” or “right”. Determines whether binarization proceeds from left to right or vice versa.
  • horzmarkov – amount of horizontal context in labels. Default is infinity, such that now new generalization are introduced by the binarization.
  • vertmarkov – number of ancestors to include in labels. NB: 1 means only the direct parent, as in a normal tree.
  • revhorzmarkov – like horzmarkov, but looks backwards.
  • headoutward – nodes are marked as head in their function tags; the direction of binarization will be switched when it is encountered, to enable a head-outward binarization.
  • markhead – include label of the head child in all auxiliary labels.
  • rightmostunary (leftmostunary,) – introduce a unary production for the first/last child. When h=1, this enables the same generalizations for the first & last non-terminals as for other siblings.
  • tailmarker – when given a non-empty string, add this to artificial nodes introducing the last symbol. This is useful when the last symbol is the head node, ensuring that it is not exchangeable with other non-terminals.
  • dot – if True, horizontal context will include all siblings not yet generated, separated with a dot from the siblings that have been.
  • artpa – whether to add parent annotation to the artificial nodes introduced by the binarization.
  • ids – abbreviate artificial node labels using numeric IDs from this object; must have dictionary-like interface.
  • threshold – constituents with more than this number of children are factored; i.e., for a value of 2, do a normal binarization; for a value of 1, also factor binary productions to include an artificial node, etc.
  • filterlabels – filter any labels matching this sequence from the horizontal markovization context. If labels are of the form A/B, only A is used to match against this sequence. Also, labelfun is first applied to the label, if given. Can be used to filter out modifiers, s.t. the context contains only required elements.
  • labelfun – a function to derive a label from a node to be used for the horizontal markovization context; the default is to use child.label for a given child node.
  • direction – if True, mark the the direction of the binarization with l, r, or m; l is everything before the head, r to the right, and m just before introducing the head.
>>> tree = Tree('(S (VP (PDS 0) (ADV 3) (VVINF 4)) (VMFIN 1) (PIS 2))')
>>> tree[1].type = HEAD
>>> sent = 'das muss man jetzt machen'.split()
>>> print(binarize(tree, horzmarkov=1, headoutward=True))
(S (VP (PDS 0) (VP|<ADV> (ADV 3) (VVINF 4))) (S|<VMFIN> (VMFIN 1) (PIS 2)))
>>> tree = Tree('(S (X (A 0) (B 3) (C 4)) (D 1) (E 2))')
>>> tree[1].type = HEAD
>>> print(binarize(tree, headoutward=True, leftmostunary=True,
... rightmostunary=True))  # doctest: +NORMALIZE_WHITESPACE
(S (S|<X,D,E> (X (X|<A,B,C> (A 0) (X|<B,C> (B 3) (X|<C> (C 4)))))
        (S|<D,E> (S|<D,E> (D 1)) (E 2))))
discodop.treetransforms.unbinarize(tree, _sent=None, expandunary=True, childchar='|', parentchar='^', unarychar='+')[source]

Restore a binarized tree to the original n-ary tree.

Modifies tree in-place. NB: a malformed node such as (X|<Y> ) which is not supposed to be empty will be silently discarded.

discodop.treetransforms.collapseunary(tree, collapsepos=False, collapseroot=False, joinchar='+')[source]

Collapse unary nodes into a new node indicated by ‘joinchar’.

For example``(NP (NN John))`` becomes (NP+NN John). The tree is modified in-place.

Parameters:
  • collapsepos – when False (default), do not collapse preterminals
  • collapseroot – when False (default) do not modify the root production if it is unary; e.g., TOP -> productions for the Penn WSJ treebank
  • joinchar – A string used to connect collapsed node values
discodop.treetransforms.introducepreterminals(tree, sent, ids=None)[source]

Add preterminals with artificial POS-tags for terminals with siblings.

Parameters:ids – by default, artificial labels have the form parent_label/terminal. When an iterator is passed, its values are used in place of terminal.
>>> tree = Tree('(S (X 0 1 (CD 2 3) 4))')
>>> print(introducepreterminals(tree, ['a', 'b', 'c', 'd', 'e']))
(S (X (X/a 0) (X/b 1) (CD (CD/c 2) (CD/d 3)) (X/e 4)))
>>> tree = Tree('(S (X 0 1 2))')
>>> print(introducepreterminals(tree, [None, None, None], ids=iter('abc')))
(S (X (X/a 0) (X/b 1) (X/c 2)))
discodop.treetransforms.handledisc(tree)[source]

Binarize discontinuous substitution sites.

>>> print(handledisc(Tree('(S (X 0 2 4))')))
(S (X 0 (X|<> 2 (X|<> 4))))
>>> print(handledisc(Tree('(S (X 0 2))')))
(S (X 0 (X|<> 2)))
discodop.treetransforms.factorconstituent(node, sep='|', h=999, factor='right', markfanout=False, markyf=False, ids=None, threshold=2, filterlabels=(), labelfun=operator.attrgetter('label'))[source]

Binarize one constituent with a left/right factored binarization.

Children remain unmodified. Nodes must be immutable and contain bitsets; use addbitsets(). By default construct artificial labels using labels of child nodes. When markyf is True, each artificial label will include the yield function; this is necessary for a ‘normal form’ binarization that is equivalent to the original. When ids is given, it is used both as an interator (for new unique labels) and as a dictionary (to re-use labels). The first ID in a binarization will always be unique, while the others will be re-used for the same combination of labels and yield function.

discodop.treetransforms.markovthreshold(trees, n, horzmarkov, vertmarkov)[source]

Reduce Markov order of binarization labels occurring < n times.

discodop.treetransforms.splitdiscnodes(tree, markorigin=False)[source]

Return a continuous version of tree by splitting discontinuous nodes.

Boyd (2007): Discontinuity revisited. http://aclweb.org/anthology/W07-1506

Markorigin=False:
 VP* (bare label)
Markorigin=True:
 VP*1 (add index)
>>> tree = Tree('(S (VP (VP (PP (APPR 0) (ART 1) (NN 2)) (CARD 4)'
... '(VVPP 5)) (VAINF 6)) (VMFIN 3))')
>>> print(splitdiscnodes(tree.copy(True)))
...  # doctest: +NORMALIZE_WHITESPACE
(S (VP* (VP* (PP (APPR 0) (ART 1) (NN 2)))) (VMFIN 3) (VP* (VP* (CARD 4)
        (VVPP 5)) (VAINF 6)))
>>> print(splitdiscnodes(tree, markorigin=True))
...  # doctest: +NORMALIZE_WHITESPACE
(S (VP*0 (VP*0 (PP (APPR 0) (ART 1) (NN 2)))) (VMFIN 3) (VP*1 (VP*1
        (CARD 4) (VVPP 5)) (VAINF 6)))
discodop.treetransforms.mergediscnodes(tree)[source]

Reverse transformation of splitdiscnodes().

discodop.treetransforms.addfanoutmarkers(tree)[source]

Mark discontinuous constituents with ‘_n’ where n = # gaps + 1.

discodop.treetransforms.removefanoutmarkers(tree)[source]

Remove fanout marks.

discodop.treetransforms.removeterminals(tree, sent, func)[source]

Remove any terminal for which func is True, and any empty ancestors.

Parameters:
  • tree – a ParentedTree.
  • func – a function with the signature (word, node) -> bool.
discodop.treetransforms.removeemptynodes(tree, sent)[source]

Remove any empty nodes, and any empty ancestors.

discodop.treetransforms.treebankfanout(trees)[source]

Get maximal fan-out of a list of trees.

discodop.treetransforms.canonicalize(tree)[source]

Restore canonical linear precedence order; tree is modified in-place.

discodop.treetransforms.binarizetree(tree, binarization, relationalrealizational)[source]

Binarize a single tree.

discodop.treetransforms.optimalbinarize(tree, sep='|', headdriven=False, h=None, v=1, fun=None)[source]

Recursively binarize a tree, optimizing for given function.

v=0 is not implemented. Setting h to a nonzero integer restricts the possible binarizations to head driven binarizations.

discodop.treetransforms.minimalbinarization(tree, score, sep='|', head=None, parentstr='', h=999)[source]

Find optimal binarization according to a scoring function.

Implementation of Gildea (2010): Optimal parsing strategies for linear context-free rewriting systems. http://aclweb.org/anthology/N10-1118

Parameters:
  • tree – ImmutableTree for which the optimal binarization of its top production will be searched. Nodes need to have a .bitset attribute, as produced by addbitsets().
  • score – a function from binarized trees to scores, where lower is better (the scores can be anything else which supports comparisons).
  • head – an optional index of the head node, specifying it enables head-driven binarization (which constrains the possible binarizations).
>>> tree = '(X (A 0) (B 1) (C 2) (D 3) (E 4))'
>>> tree2 = binarize(Tree(tree))
>>> minimalbinarization(addbitsets(tree), complexityfanout, head=2) == tree2
True
>>> tree = addbitsets('(A (B1 (t 6) (t 13)) (B2 (t 3) (t 7) (t 10)) '
... '(B3 (t 1) (t 9) (t 11) (t 14) (t 16)) (B4 (t 0) (t 5) (t 8)))')
>>> a = minimalbinarization(tree, complexityfanout)
>>> b = minimalbinarization(tree, fanoutcomplexity)
>>> print(max(map(complexityfanout, a.subtrees())))
(14, 6)
>>> print(max(map(complexityfanout, b.subtrees())))
(15, 5)
discodop.treetransforms.fanout(tree)[source]

Return fan-out of constituent. Requires bitset attribute.

discodop.treetransforms.complexity(tree)[source]

The degree of the time complexity of parsing with this rule.

Cf. Gildea (2010). http://aclweb.org/anthology/N10-1118

discodop.treetransforms.complexityfanout(tree)[source]

Return a tuple with the complexity and fan-out of a subtree.

discodop.treetransforms.fanoutcomplexity(tree)[source]

Return a tuple with the fan-out and complexity of a subtree.

discodop.treetransforms.contsets(nodes)[source]

Partition children into continuous subsets.

>>> tree = Tree('(VP (PP (APPR 0) (ART 1) (NN 2)) (CARD 4) (VVPP 5))')
>>> for a in contsets(tree):
...             print(' / '.join('%s' % b for b in a))
(PP (APPR 0) (ART 1) (NN 2))
(CARD 4) / (VVPP 5)
discodop.treetransforms.getbits(bitset)[source]

Iterate over the indices of set bits in a bitset.

discodop.treetransforms.addbitsets(tree)[source]

Turn tree into an ImmutableTree and add bitset attribute.

The bitset attribute is a Python integer corresponding to the information that leaves() would return for that node.

discodop.treetransforms.getyf(left, right)[source]

Return the yield function for two subtrees with bitsets.

Returns:string representation of yield function; e.g., ‘;01,10’.