discodop.grammar

Assorted functions to read off grammars from treebanks.

Functions

addindices(frag) Convert a fragment in bracket format to discbracket format.
cartpi(seq) The cartesian product of a sequence of iterables.
compiletsg(fragments) Compile a set of weighted fragments (i.e., a TSG) into a grammar.
convertweight(weight) Convert a weight in a string to a float.
defaultparse(wordstags[, rightbranching]) A default parse to generate when parsing fails.
dop1(trees, sents[, maxdepth, maxfrontier, …]) Return an all-fragments DOP1 model with relative frequencies.
dopgrammar(trees, fragments[, extrarules, …]) Create a DOP grammar from a set of fragments and occurrences.
dopreduction(trees, sents[, packedgraph, …]) Induce a reduction of DOP to an LCFRS.
doubledop(trees, sents[, debug, maxdepth, …]) Extract a Double-DOP grammar from a treebank.
doubleostagfromtsg(trees, sents[, numproc, …]) Extract recurring fragments from a treebank and create osTAG.
extractaux(node, foot) Return copy of tree where foot is made a foot node.
extractinit(tree, root, foot) Return copy of tree where the subtree at foot is replaced by a substitution site.
flatten(frag, ids, backtransform) Auxiliary function for Double-DOP.
footnode(node, sent)
grammarinfo(grammar[, dump]) Print some statistics on a grammar, before it goes through Grammar().
grammarstats(filename) Print statistics for PLCFRS/bitpar grammar (sorted by LHS).
inducetagfromtsg(elemtrees) Given elementary trees from a TSG, heuristically extract a TAG.
lcfrsproductions(tree, sent[, frontiers]) Read off LCFRS productions from a tree with indices and a sentence.
mean(seq) Arithmetic mean.
merge(filenames, outfilename, sumfunc, key) Interpolate weights of given files.
nodefreq(tree, dectree, subtreefd, nonterminalfd) Auxiliary function for DOP reduction.
ostagreduction(inittrees, auxtrees[, …]) A reduction of a tree-adjoining grammar with the off-spine constraint.
preterminal(node, sent)
printrule(r, yf[, w]) Return a string representation of a rule.
rangeheads(s) Return first element of each range in a sorted sequence of numbers.
ranges(s) Partition s into a sequence of lists corresponding to contiguous ranges.
renumber(tree, root, foot, sent) Renumber tree when removing indices dominated by foot.
sortgrammar(grammar[, altweights]) Sort grammar productions in three clusters: phrasal, binarized, lexical.
spinal(node)
spinallabel(node, sent, adjsite)
splitweight(weight) Convert a weight / fraction in a string to a float / tuple.
stripweight(line) Extract rule without weight.
subsetgrammar(a, b) Test whether grammar a is a subset of b.
substitutionsite(node, sent)
sumfrags(iterable, n) Sum weights for runs of identical fragments.
sumlex(iterable, n) Given a sorted lexicon iterable, sum weights of word/tag pairs.
sumrules(iterable, n) Given a sorted iterable of rules, sum weights of identical rules.
treebankgrammar(trees, sents[, extrarules]) Induce a probabilistic LCFRS with relative frequencies of productions.
writegrammar(grammar[, bitpar]) Write a grammar in a simple text file format.

Classes

SpinedTreeDecorator([memoize, n]) TreeDecorator that adds address to root and can mark spine nodes ‘$’.
TreeDecorator([memoize, n]) Auxiliary class for DOP reduction.
UniqueIDs([prefix]) Produce strings with numeric IDs.
discodop.grammar.lcfrsproductions(tree, sent, frontiers=False)[source]

Read off LCFRS productions from a tree with indices and a sentence.

Tree should contain integer indices as terminals, and a sentence with the corresponding words for these indices. Always produces monotone LCFRS rules. For best results, the tree should be canonicalized. When frontiers is True, frontier nodes will generate empty productions; by default they are ignored.

>>> tree = Tree("(S (VP_2 (V 0) (ADJ 2)) (NP 1))")
>>> sent = "is Mary happy".split()
>>> print('\n'.join(printrule(r, yf)  
...             for r, yf in lcfrsproductions(tree, sent)))
010     S VP_2 NP
0,1     VP_2 V ADJ
is      V Epsilon
happy   ADJ Epsilon
Mary    NP Epsilon
discodop.grammar.treebankgrammar(trees, sents, extrarules=None)[source]

Induce a probabilistic LCFRS with relative frequencies of productions.

When trees contain no discontinuities, the result is equivalent to a treebank PCFG.

Parameters:extrarules – A dictionary of productions that will be merged with the grammar, with (pseudo)frequencies as values.
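The relative-frequency estimate itself is simple to illustrate. The following is a minimal sketch (not the actual implementation, which reads productions off trees with lcfrsproductions()): count each production, then divide by the total count of productions sharing its left-hand side.

```python
from collections import Counter

def relativefrequencies(productions):
    """Toy relative-frequency estimator over (lhs, rhs) production tuples."""
    prodcounts = Counter(productions)
    lhscounts = Counter(lhs for lhs, _ in productions)
    # probability of a production = its count / count of productions with same lhs
    return {prod: count / lhscounts[prod[0]]
            for prod, count in prodcounts.items()}
```

For example, if 'S -> NP VP' occurs twice and 'S -> VP' once, their probabilities are 2/3 and 1/3.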
discodop.grammar.dopreduction(trees, sents, packedgraph=False, decorator=None, extrarules=None)[source]

Induce a reduction of DOP to an LCFRS.

Based on how Goodman (1996, 2003) reduces DOP to a PCFG. http://aclweb.org/anthology/W96-0214

Parameters:
  • packedgraph – packed graph encoding (Bansal & Klein 2010, sec 4.2). http://aclweb.org/anthology/P10-1112
  • decorator – a TreeDecorator instance (packedgraph is ignored if this is passed).
Returns:

a set of rules with the relative frequency estimate as probabilities, and a dictionary with alternate weights.

discodop.grammar.doubledop(trees, sents, debug=False, maxdepth=1, maxfrontier=999, numproc=None, extrarules=None)[source]

Extract a Double-DOP grammar from a treebank.

That is, a fragment grammar containing fragments that occur at least twice, plus all individual productions needed to obtain full coverage. Input trees need to be binarized.

Parameters:
  • maxdepth – add non-maximal/non-recurring fragments with 1 < depth < maxdepth.
  • maxfrontier – limit number of frontier non-terminals; not yet implemented.
Returns:

a tuple (grammar, altweights, backtransform, fragments):
  • grammar – a sequence of productions.
  • altweights – a dictionary containing alternate weights.
  • backtransform – needed to recover trees from compressed derivations.
  • fragments – a sequence of the fragments used to build the grammar, in the same order as they appear in grammar.

discodop.grammar.dop1(trees, sents, maxdepth=4, maxfrontier=999, extrarules=None)[source]

Return an all-fragments DOP1 model with relative frequencies.

Parameters:
  • maxdepth – restrict fragments to 1 < depth < maxdepth.
  • maxfrontier – limit number of frontier non-terminals; not yet implemented.
Returns:

a tuple (grammar, altweights, backtransform, fragments):
  • grammar – a sequence of productions.
  • altweights – a dictionary containing alternate weights.
  • backtransform – needed to recover trees from compressed derivations.
  • fragments – a sequence of the fragments used to build the grammar, in the same order as they appear in grammar.

discodop.grammar.dopgrammar(trees, fragments, extrarules=None, debug=False, ids=None)[source]

Create a DOP grammar from a set of fragments and occurrences.

A second level of binarization (a normal form) is needed when fragments are converted to individual grammar rules, which occurs through the removal of internal nodes. The binarization adds unique identifiers so that each grammar rule can be mapped back to its fragment. In fragments with terminals, we replace their POS tags with a tag uniquely identifying that terminal and tag: tag@word.

Parameters:
  • fragments – a dictionary of fragments from binarized trees, with occurrences as values (a sequence of sentence numbers with repetitions).
  • extrarules – Additional rules to add to the grammar.
Returns:

a tuple (grammar, altweights, backtransform, fragments) altweights is a dictionary containing alternate weights.
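The tag@word replacement described above can be illustrated with a small sketch (lexicalizetags is a hypothetical helper, not part of the module): each preterminal label in a bracketed fragment is rewritten so that the resulting rule uniquely identifies its terminal.

```python
import re

def lexicalizetags(frag):
    """Replace each POS tag in a bracketed fragment by tag@word (illustrative only)."""
    # match a preterminal: '(' label ' ' terminal ')', where the terminal
    # contains no parentheses or whitespace
    return re.sub(r'\((\S+) ([^()\s]+)\)', r'(\1@\2 \2)', frag)
```

E.g., '(NP (DT the) (NN dog))' becomes '(NP (DT@the the) (NN@dog dog))'.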

discodop.grammar.compiletsg(fragments)[source]

Compile a set of weighted fragments (i.e., a TSG) into a grammar.

Similar to dopgrammar(), but the supplied fragments have fixed weights
and no alternative weights are returned.
Parameters:fragments – a dictionary of fragments mapped to weights. The fragments must consist of strings in discbracket format.
Returns:a (grammar, backtransform, altweights) tuple similar to what doubledop() returns; altweights will be empty.
discodop.grammar.sortgrammar(grammar, altweights=None)[source]

Sort grammar productions in three clusters: phrasal, binarized, lexical.

  1. normal phrasal rules, ordered by lhs symbol
  2. non-initial binarized 2dop rules (to align the 2dop backtransform with
    the rules in cluster 1 which introduce a new fragment)
  3. lexical rules sorted by word
discodop.grammar.flatten(frag, ids, backtransform)[source]

Auxiliary function for Double-DOP.

Remove internal nodes from a fragment and read off the (binarized) productions of the resulting flattened fragment. Aside from returning productions, also return fragment with lexical and frontier nodes replaced by a templating symbol ‘{n}’ where n is an index. Trees are in the form of strings.

Parameters:
  • frag – a tree fragment
  • ids – an iterator which yields unique IDs for non-terminals introduced by the binarization
Returns:

a tuple (prods, template).

>>> ids = UniqueIDs()
>>> frag = "(ROOT (S_2 0= 2=) (ROOT|<$,>_2 ($, 1=,) ($. 3=.)))"
>>> prods, template = flatten(frag, ids, {})
>>> print('\n'.join(printrule(r, yf) for r, yf in prods))
01      ROOT ROOT}<0> $.@.
010     ROOT}<0> S_2 $,@,
,       $,@, Epsilon
.       $.@. Epsilon
>>> print(template)
(ROOT {0} (ROOT|<$,>_2 {1} {2}))
discodop.grammar.nodefreq(tree, dectree, subtreefd, nonterminalfd)[source]

Auxiliary function for DOP reduction.

Count frequencies of nodes and calculate the number of subtrees headed by each node. Updates subtreefd and nonterminalfd as a side effect. Expects a normal tree and a tree with IDs.

Parameters:
  • subtreefd – the Counter to store the counts of subtrees
  • nonterminalfd – the Counter to store the counts of non-terminals
>>> fd = Counter()
>>> d = TreeDecorator()
>>> tree = Tree("(S (NP 0) (VP 1))")
>>> dectree = d.decorate(tree, ['mary', 'walks'])
>>> nodefreq(tree, dectree, fd, Counter())
4
>>> fd == Counter({'S': 4, 'NP': 1, 'VP': 1, 'NP@1-0': 1, 'VP@1-1': 1})
True
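The subtree count returned above follows the standard DOP recurrence: a node heads the product over its children of (count(child) + 1), where the +1 covers cutting the child off as a frontier node. A self-contained sketch over nested tuples (not the module's Tree class):

```python
def subtreecount(tree):
    """Number of subtrees headed by the root of a (label, child, ...) tuple tree."""
    children = [child for child in tree[1:] if isinstance(child, tuple)]
    if not children:  # preterminal: heads exactly one subtree
        return 1
    result = 1
    for child in children:
        # each child contributes its own subtrees, plus one for a frontier cut
        result *= subtreecount(child) + 1
    return result
```

For the doctest tree (S (NP 0) (VP 1)), this gives (1 + 1) * (1 + 1) = 4.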
class discodop.grammar.TreeDecorator(memoize=False, n=1)[source]

Auxiliary class for DOP reduction.

Adds unique identifiers to each internal non-terminal of a tree.

Parameters:
  • memoize – if True, identifiers will be reused for equivalent subtrees (including all terminals).
  • n – the initial sentence number.
decorate(tree, sent)[source]

Return a copy of tree with labels decorated with IDs.

>>> d = TreeDecorator()
>>> tree = Tree('(S (NP (DT 0) (N 1)) (VP 2))')
>>> print(d.decorate(tree, ['the', 'dog', 'walks']))
(S (NP@1-0 (DT@1-1 0) (N@1-2 1)) (VP@1-3 2))
>>> d = TreeDecorator(memoize=True)
>>> print(d.decorate(Tree('(S (NP (DT 0) (N 1)) (VP 2))'),
...             ['the', 'dog', 'walks']))
(S (NP@1-1 (DT@1-2 0) (N@1-3 1)) (VP@1-4 2))
>>> print(d.decorate(Tree('(S (NP (DT 0) (N 1)) (VP 2))'),
...             ['the', 'dog', 'barks']))
(S (NP@1-1 (DT@1-2 0) (N@1-3 1)) (VP@2-4 2))
class discodop.grammar.UniqueIDs(prefix='')[source]

Produce strings with numeric IDs.

Can be used as iterator (IDs will never be re-used) and dictionary (IDs will be re-used for same key).

>>> ids = UniqueIDs()
>>> print(next(ids))
0
>>> print(ids['foo'], ids['bar'], ids['foo'])
1 2 1
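The dual iterator/dictionary behavior can be sketched as follows (a simplified reimplementation for illustration, not the module's class):

```python
from itertools import count

class SimpleUniqueIDs:
    """next() always yields a fresh ID; subscripting re-uses the ID for a key."""
    def __init__(self, prefix=''):
        self.prefix = prefix
        self.counter = count()
        self.cache = {}

    def __iter__(self):
        return self

    def __next__(self):
        # iterator protocol: every call produces a new ID
        return '%s%d' % (self.prefix, next(self.counter))

    def __getitem__(self, key):
        # mapping protocol: the first lookup assigns an ID, later lookups re-use it
        if key not in self.cache:
            self.cache[key] = next(self)
        return self.cache[key]
```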
discodop.grammar.rangeheads(s)[source]

Return first element of each range in a sorted sequence of numbers.

>>> rangeheads( (0, 1, 3, 4, 6) )
[0, 3, 6]
discodop.grammar.ranges(s)[source]

Partition s into a sequence of lists corresponding to contiguous ranges.

>>> list(ranges( (0, 1, 3, 4, 6) ))
[[0, 1], [3, 4], [6]]
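Both functions can be expressed with the classic index-minus-value grouping trick; a sketch equivalent to the doctests above:

```python
from itertools import groupby

def ranges(s):
    """Split a sorted sequence of numbers into contiguous runs."""
    # within a contiguous run, index minus value is constant
    for _, group in groupby(enumerate(s), lambda pair: pair[0] - pair[1]):
        yield [value for _, value in group]

def rangeheads(s):
    """First element of each contiguous range."""
    return [rng[0] for rng in ranges(s)]
```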
discodop.grammar.defaultparse(wordstags, rightbranching=False)[source]

A default parse to generate when parsing fails.

Parameters:rightbranching – when True, return a right branching tree with NPs, otherwise return all words under a single constituent ‘NOPARSE’.
>>> print(defaultparse([('like','X'), ('this','X'), ('example', 'NN'),
... ('here','X')]))
(NOPARSE (X like) (X this) (NN example) (X here))
>>> print(defaultparse([('like','X'), ('this','X'), ('example', 'NN'),
... ('here','X')], True))
(NP (X like) (NP (X this) (NP (NN example) (NP (X here)))))
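A behavior-equivalent sketch reproducing the two doctests above, using plain strings (a simplification; the actual implementation may build tree objects):

```python
def defaultparse(wordstags, rightbranching=False):
    """Flat NOPARSE tree, or a right-branching chain of NPs."""
    if rightbranching:
        word, tag = wordstags[0]
        if wordstags[1:]:
            # nest the remaining words under a further NP
            return '(NP (%s %s) %s)' % (tag, word,
                    defaultparse(wordstags[1:], True))
        return '(NP (%s %s))' % (tag, word)
    return '(NOPARSE %s)' % ' '.join(
            '(%s %s)' % (tag, word) for word, tag in wordstags)
```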
discodop.grammar.printrule(r, yf, w=None)[source]
Returns:a string representation of a rule.
discodop.grammar.cartpi(seq)[source]

The cartesian product of a sequence of iterables.

itertools.product doesn’t support infinite sequences!

>>> list(islice(cartpi([count(), count(0)]), 9))
[(0, 0), (0, 1), (0, 2), (0, 3), (0, 4), (0, 5), (0, 6), (0, 7), (0, 8)]
discodop.grammar.writegrammar(grammar, bitpar=False)[source]

Write a grammar in a simple text file format.

Rules are written in the order in which they appear in the sequence grammar, except that the lexicon file lists words in sorted order (with tags for each word in the order of grammar). For a description of the file format, see docs/fileformats.rst.

Parameters:
  • grammar – a sequence of rule tuples, as produced by treebankgrammar(), dopreduction(), or doubledop().
  • bitpar – when True, use bitpar format: for rules, put weight first and leave out the yield function. By default, a format that supports LCFRS is used.
Returns:

a tuple of strings (rules, lexicon)

discodop.grammar.subsetgrammar(a, b)[source]

Test whether grammar a is a subset of b.

discodop.grammar.mean(seq)[source]

Arithmetic mean.

discodop.grammar.grammarinfo(grammar, dump=None)[source]

Print some statistics on a grammar, before it goes through Grammar().

Parameters:dump – if given a filename, will dump distribution of parsing complexity to a file (i.e., p.c. 3 occurs 234 times, 4 occurs 120 times, etc.)
discodop.grammar.grammarstats(filename)[source]

Print statistics for PLCFRS/bitpar grammar (sorted by LHS).

discodop.grammar.splitweight(weight)[source]

Convert a weight / fraction in a string to a float / tuple.

>>> [splitweight(a) for a in ('0.5', '0x1.0000000000000p-1', '1/2')]
[0.5, 0.5, (1.0, 2.0)]
discodop.grammar.convertweight(weight)[source]

Convert a weight in a string to a float.

>>> [convertweight(a) for a in ('0.5', '0x1.0000000000000p-1', '1/2')]
[0.5, 0.5, 0.5]
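Both weight parsers handle the same three notations: decimal, C99 hexadecimal float, and fraction. A sketch consistent with the doctests above (the actual implementations may differ):

```python
def splitweight(weight):
    """'1/2' => (1.0, 2.0); hexadecimal and decimal floats => float."""
    if '/' in weight:
        num, denom = weight.split('/')
        return (float(num), float(denom))
    if weight.startswith('0x') or weight.startswith('-0x'):
        # float() does not accept hexadecimal notation; float.fromhex() does
        return float.fromhex(weight)
    return float(weight)

def convertweight(weight):
    """Like splitweight(), but reduce fractions to a single float."""
    result = splitweight(weight)
    if isinstance(result, tuple):
        return result[0] / result[1]
    return result
```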
discodop.grammar.stripweight(line)[source]

Extract rule without weight.

discodop.grammar.sumrules(iterable, n)[source]

Given a sorted iterable of rules, sum weights of identical rules.

discodop.grammar.sumlex(iterable, n)[source]

Given a sorted lexicon iterable, sum weights of word/tag pairs.

discodop.grammar.sumfrags(iterable, n)[source]

Sum weights for runs of identical fragments.
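All three sum functions follow the same pattern: walk a sorted iterable and sum the weights of each run of identical items. Schematically, over (key, weight) pairs (an assumption for illustration; the real functions operate on grammar file lines, and the n parameter is not modeled here):

```python
from itertools import groupby

def sumruns(pairs):
    """Sum weights over runs of identical keys in a sorted (key, weight) iterable."""
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        yield key, sum(weight for _, weight in group)
```

Sorting the input first is essential: groupby only merges adjacent runs.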

discodop.grammar.merge(filenames, outfilename, sumfunc, key)[source]

Interpolate weights of given files.

discodop.grammar.addindices(frag)[source]

Convert a fragment in bracket format to discbracket format.