discodop.containers

Data types for chart items, edges, etc.

Classes

Chart – Base class for charts.
ChartItem
Ctrees() – An indexed, binarized treebank stored as array.
Edges – Object with a linked list of Edges.
FatChartItem – Item where bitvector is a fixed-width static array.
FixedVocabulary
Grammar(rule_tuples_or_str[, lexicon, ...]) – A grammar object which stores rules compactly, indexed in various ways.
LexicalRule(uint32_t lhs, unicode word, ...) – A weighted rule of the form 'non-terminal --> word'.
RankedEdge – A derivation with backpointers.
SmallChartItem(label, vec) – Item with word-sized bitvector.
Vocabulary() – A mapping of productions, labels, words to integers.

class discodop.containers.Chart

Base class for charts. Provides methods available on all charts.

The subclass hierarchy for charts has three levels:

  1. base class, methods for chart traversal.
  2. formalism, methods specific to CFG vs. LCFRS parsers.
  3. data structures optimized for short/long sentences, small/large
    grammars.

Levels 1 and 2 define a type for labeled spans, referred to as item.

filter(self)

Drop entries not part of a derivation headed by root of chart.
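
As a rough illustration of what such filtering amounts to, the sketch below keeps only the items reachable from the root through backpointers. It is not discodop's implementation: the chart is a plain dict (a hypothetical stand-in) that maps each item to a list of edges, and each edge is a tuple of child items.

```python
def filter_chart(chart, root):
    """Keep only items reachable from root via edge backpointers.

    chart: dict mapping item -> list of edges; each edge is a tuple of
    child items (hypothetical representation, for illustration only).
    """
    reachable = set()
    agenda = [root]
    while agenda:
        item = agenda.pop()
        if item in reachable or item not in chart:
            continue
        reachable.add(item)
        for edge in chart[item]:
            agenda.extend(edge)
    return {item: edges for item, edges in chart.items()
            if item in reachable}


chart = {
    ('S', 0, 2): [(('NP', 0, 1), ('VP', 1, 2))],
    ('NP', 0, 1): [()],  # lexical item: edge without children
    ('VP', 1, 2): [()],
    ('PP', 0, 1): [()],  # not part of any derivation of S
}
filtered = filter_chart(chart, ('S', 0, 2))
```

After the call, the PP item is gone while S, NP and VP remain.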

indices(self, item)

Return a list of indices dominated by item.

root(self)

Return item with root label spanning the whole sentence.

stats(self)

Return a short string with counts of items, edges.

toitem(self, node, item)

Convert Tree node with integer indices as terminals to a ChartItem.

Return type is determined by item.

class discodop.containers.Ctrees

An indexed, binarized treebank stored as array.

Indexing depends on an external Vocabulary object that maps productions and labels to unique integers across different sets of trees. First call the alloc() method with the (estimated) number of nodes and trees. Then add trees one by one using the addnodes() method.

alloc(self, int numtrees, long numnodes)

Initialize an array of trees of nodes structs.

extract(self, int n, Vocabulary vocab, bool disc=True, int node=-1)

Return given tree in discbracket format.

Parameters:
  • node – if given, extract specific subtree instead of whole tree.

extractsent(self, int n, Vocabulary vocab)

Return sentence as a list for given tree.

fromfile(type cls, filename)

indextrees(self, Vocabulary vocab)

Create index from productions to trees containing that production.

Productions are represented as integer IDs, trees are given as sets of integer indices.
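
The kind of index described here can be pictured as an inverted index. The following is a minimal sketch in plain Python, not the actual array-based implementation; the input format (one list of production IDs per tree) and the function name index_trees are assumptions made for illustration.

```python
from collections import defaultdict


def index_trees(trees):
    """Map each production ID to the set of tree indices containing it.

    trees: list in which element n holds the production IDs occurring
    in tree n (hypothetical input format).
    """
    index = defaultdict(set)
    for n, prods in enumerate(trees):
        for prod in prods:
            index[prod].add(n)
    return dict(index)


trees = [[0, 1, 2], [0, 3], [1, 3, 3]]
index = index_trees(trees)
```

Looking up a production then yields the candidate trees directly, without scanning the whole treebank.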

printrepr(self, int n, Vocabulary vocab)

Print repr of a tree for debugging purposes.

tofile(self, filename)

class discodop.containers.Edges

Object with a linked list of Edges.

class discodop.containers.FatChartItem

Item where bitvector is a fixed-width static array.

binrepr(self, lensent=0)

lexidx(self)

class discodop.containers.Grammar(rule_tuples_or_str, lexicon=None, start=u'ROOT', binarized=True)

A grammar object which stores rules compactly, indexed in various ways.

Parameters:
  • rule_tuples_or_str – either a sequence of tuples containing both phrasal & lexical rules, or a string containing the phrasal rules in text format; in the latter case lexicon should be given. The text format allows for more efficient loading and is used internally.
  • start – a string identifying the unique start symbol of this grammar, which will be used by default when parsing with this grammar
  • binarized – whether to require a binarized grammar; a non-binarized grammar can only be used by bitpar.

By default the grammar is in logprob mode; invoke grammar.switch('default', logprob=False) to switch. If the grammar only contains integral weights (frequencies), they will be normalized into relative frequencies; if the grammar contains any non-integral weights, weights will be left unchanged.
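
The normalization rule can be sketched as follows. This is a simplified stand-in, not the Grammar code itself: rules are hypothetical (labels, weight) pairs, and the conversion to negative log probabilities at the end is an assumption about what logprob mode stores.

```python
from collections import defaultdict
from math import log


def normalize(rules):
    """Turn frequencies into relative frequencies per left-hand side.

    rules: list of ((lhs, rhs...), weight) pairs (hypothetical format).
    Weights are left unchanged if any of them is non-integral.
    """
    if any(weight != int(weight) for _, weight in rules):
        return list(rules)
    totals = defaultdict(int)
    for rule, weight in rules:
        totals[rule[0]] += weight
    return [(rule, weight / totals[rule[0]]) for rule, weight in rules]


rules = [(('NP', 'DT', 'NN'), 3), (('NP', 'NN'), 1),
         (('VP', 'VB', 'NP'), 2)]
probs = normalize(rules)
# Assumption: in logprob mode a weight is a negative log probability,
# so that lower values are better.
logprobs = [(rule, -log(p)) for rule, p in probs]
```

Here the two NP rules receive 0.75 and 0.25, and the single VP rule receives 1.0.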

buildchainvec(self)

Build a boolean matrix representing the unary (chain) rules.

getmapping(self, Grammar coarse, striplabelre=None, neverblockre=None, bool splitprune=False, bool markorigin=False, dict mapping=None)

Construct mapping of this grammar’s non-terminal labels to another.

Parameters:
  • coarse – the grammar to which this grammar’s labels will be mapped. May be None; useful when neverblockre needs to be applied.
  • striplabelre – if not None, a compiled regex used to form the coarse label for a given fine label. This regex is applied with a substitution to the empty string.
  • neverblockre – labels that match this regex will never be pruned. Also used to identify auxiliary labels of Double-DOP grammars.

    • use |< to ignore nodes introduced by binarization; useful if coarse and fine stages employ different kinds of markovization; e.g., NP and VP may be blocked, but not NP|<DT-NN>.
    • use _[0-9]+ to ignore discontinuous nodes X_n where X is a label and n is a fanout.
  • mapping – a dictionary with strings of fine labels mapped to coarse labels. striplabelre, if given, is applied first.

The regexes should be compiled objects, i.e., re.compile(regex), or None to leave labels unchanged.
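
To make the role of the two regexes concrete, here is a small sketch using only the standard re module. The label convention (a state split suffix such as @12) and the helper names coarselabel and neverblocked are hypothetical; whether the library applies the pattern with search() or match() is not stated above, and this sketch uses search().

```python
import re

# Hypothetical convention: fine labels carry a suffix such as '@12'.
striplabelre = re.compile(r'@[0-9]+$')
# Exempt binarization nodes (|<) and discontinuous nodes (_n).
neverblockre = re.compile(r'\|<|_[0-9]+')


def coarselabel(finelabel, striplabelre=None):
    """Form the coarse label by substituting matches with ''."""
    if striplabelre is None:
        return finelabel  # None leaves labels unchanged
    return striplabelre.sub('', finelabel)


def neverblocked(label, neverblockre=None):
    """Return True if the label is exempt from pruning."""
    return (neverblockre is not None
            and neverblockre.search(label) is not None)


coarse = coarselabel('NP@12', striplabelre)
exempt = [label for label in ('NP', 'VP_2', 'NP|<DT-NN>')
          if neverblocked(label, neverblockre)]
```

With these patterns, NP@12 maps to NP, and only VP_2 and NP|<DT-NN> are exempt from pruning.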

getrulemapping(self, Grammar coarse, striplabelre)

Produce a mapping of coarse rules to sets of fine rules.

A coarse rule for a given fine rule is found by applying the regex striplabelre to labels. NB: this regex is applied to strings with multiple non-terminal labels at once, so it should not match on the end of string ($). The mapping uses the rule numbers (rule.no) derived from the original order of the rules when the Grammar object was created; e.g., self.rulemapping[12] == [34, 56, 78, ...] where 12 refers to a rule in the given coarse grammar, and the other IDs to rules in this grammar.
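
A minimal sketch of such a mapping, assuming rules are represented as strings of space-separated labels and numbered by their position in a list (both are simplifications for illustration; rule_mapping is a hypothetical helper, not part of the library):

```python
import re
from collections import defaultdict


def rule_mapping(coarserules, finerules, striplabelre):
    """Map each coarse rule number to a list of fine rule numbers."""
    coarseno = {rule: n for n, rule in enumerate(coarserules)}
    mapping = defaultdict(list)
    for n, rule in enumerate(finerules):
        # The regex sees the whole rule, i.e., several labels at once.
        stripped = striplabelre.sub('', rule)
        mapping[coarseno[stripped]].append(n)
    return dict(mapping)


coarse = ['S NP VP', 'NP DT NN']
fine = ['S@1 NP@1 VP@2', 'NP@1 DT@1 NN@1', 'NP@2 DT@1 NN@3']
# No '$' anchor: with one, only the last label would be stripped.
mapping = rule_mapping(coarse, fine, re.compile(r'@[0-9]+'))
anchored = re.sub(r'@[0-9]+$', '', fine[0])
```

The value of anchored shows why the end-of-string anchor is a problem: only the suffix of the final label is removed, so the result does not match any coarse rule.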

register(self, name, weights)

Register a probabilistic model given a name and a sequence of floats (weights), with the weights in the same order as self.origrules and self.origlexicon (which is an arbitrary order, except that tags for each word are clustered together).

rulestr(self, int n)

Return a string representation of a specific rule in this grammar.

setmask(self, seq)

Given a sequence of rule numbers, store a mask so that any phrasal rules not in the sequence are deactivated. If sequence is None, the mask is cleared.
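
The masking behavior can be pictured with a toy container. MaskedRules and its active() method are hypothetical names for this sketch and say nothing about how the mask is stored internally.

```python
class MaskedRules:
    """Toy rule container with an optional mask (not discodop's class)."""

    def __init__(self, numrules):
        self.numrules = numrules
        self.mask = None  # None means all rules are active

    def setmask(self, seq):
        """Activate only the rules in seq; None clears the mask."""
        self.mask = None if seq is None else frozenset(seq)

    def active(self):
        """Return the numbers of the currently active rules."""
        return [n for n in range(self.numrules)
                if self.mask is None or n in self.mask]


rules = MaskedRules(5)
rules.setmask([1, 3])
masked = rules.active()
rules.setmask(None)
unmasked = rules.active()
```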

switch(self, unicode name, bool logprob=True)

Switch to a different probabilistic model; use u'default' to switch back to the model given during initialization.

testgrammar(self, epsilon=<???>)

Test whether all left-hand sides sum to 1 +/-epsilon for the currently selected weights.
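
The check itself is simple to state in code. The sketch below assumes rules given as (lhs, probability) pairs and takes the tolerance as an explicit argument; sums_to_one is a hypothetical helper, and the tolerance used in the example is an arbitrary choice.

```python
from collections import defaultdict


def sums_to_one(rules, epsilon):
    """Check that weights for each left-hand side sum to 1 +/- epsilon.

    rules: list of (lhs, probability) pairs (hypothetical format).
    """
    totals = defaultdict(float)
    for lhs, prob in rules:
        totals[lhs] += prob
    return all(abs(total - 1.0) <= epsilon for total in totals.values())


# 1e-6 is an arbitrary tolerance picked for this example.
ok = sums_to_one([('NP', 0.75), ('NP', 0.25), ('VP', 1.0)], 1e-6)
bad = sums_to_one([('NP', 0.75), ('NP', 0.2)], 1e-6)
```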

class discodop.containers.LexicalRule(uint32_t lhs, unicode word, double prob)

A weighted rule of the form 'non-terminal --> word'.

class discodop.containers.RankedEdge

A derivation with backpointers.

Denotes a k-best derivation defined by an edge, including the chart item (head) to which it points, along with ranks for its children.

class discodop.containers.SmallChartItem(label, vec)

Item with word-sized bitvector.
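
For readers unfamiliar with bitvector spans, the sketch below shows how a set of word positions can be packed into a single integer, which is what makes discontinuous spans cheap to represent. It uses Python's unbounded ints; the assumption is that a word-sized bitvector holds as many positions as a machine word has bits, while the fixed-width array variant covers longer sentences. The helper names are hypothetical.

```python
def tobitvector(indices):
    """Encode a set of word positions as an integer bitvector."""
    vec = 0
    for i in indices:
        vec |= 1 << i
    return vec


def toindices(vec):
    """Decode a bitvector into a sorted list of word positions."""
    return [i for i in range(vec.bit_length()) if (vec >> i) & 1]


# A discontinuous constituent covering words 0, 1 and 3.
vec = tobitvector([0, 1, 3])
positions = toindices(vec)
```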

binrepr(self, int lensent=0)

lexidx(self)

class discodop.containers.Vocabulary

A mapping of productions, labels, words to integers.

  • Vocabulary.getprod(): get prod no and add to index (mutating).
  • FixedVocabulary.getprod(): lookup prod no given labels/words.
    (no mutation, but requires makeindex())
  • .getlabel(): lookup label/word given prod no (no mutation, arrays only)
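
The difference between the mutating and the read-only lookup can be sketched with two toy classes. These are plain-Python stand-ins with hypothetical names (ToyVocabulary, ToyFixedVocabulary); productions are simplified to tuples of labels, and the real classes use arrays rather than lists.

```python
class ToyVocabulary:
    """Mutating mapping: unseen productions receive a new number."""

    def __init__(self):
        self.prods = {}   # production -> number
        self.labels = []  # number -> production

    def getprod(self, prod):
        if prod not in self.prods:
            self.prods[prod] = len(self.labels)
            self.labels.append(prod)
        return self.prods[prod]

    def getlabel(self, prodno):
        return self.labels[prodno]


class ToyFixedVocabulary(ToyVocabulary):
    """Read-only variant: looking up an unseen production fails."""

    def __init__(self, labels):
        self.labels = list(labels)
        self.prods = None  # built by makeindex()

    def makeindex(self):
        self.prods = {prod: n for n, prod in enumerate(self.labels)}

    def getprod(self, prod):
        return self.prods[prod]  # KeyError if unseen; nothing is added


vocab = ToyVocabulary()
first = vocab.getprod(('NP', 'DT', 'NN'))
second = vocab.getprod(('VP', 'VB', 'NP'))
fixed = ToyFixedVocabulary(vocab.labels)
fixed.makeindex()
```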

prodrepr(self, int prodno)

tofile(self, unicode filename)

Helper function for pickling.