discodop.containers

Data types for chart items, edges, &c.

Classes

Chart Base class for charts.
Ctrees() An indexed, binarized treebank stored as array.
FixedVocabulary
Grammar(rule_tuples_or_filename[, …]) A grammar object which stores rules compactly, indexed in various ways.
StringIntDict Proxy class to expose sparse_hash_map with read-only dict interface.
StringList Proxy class to expose vector<string> with read-only list interface.
Vocabulary() A mapping of productions, labels, words to integers.
Whitelist
class discodop.containers.Grammar(rule_tuples_or_filename, lexiconfile=None, start=u'ROOT', altweights=None, backtransform=None)

A grammar object which stores rules compactly, indexed in various ways.

Parameters:
  • rule_tuples_or_filename – either a sequence of tuples containing both phrasal & lexical rules, or the name of a file with the phrasal rules in text format; in the latter case the filename of the lexicon (lexiconfile) should also be given. The text format allows for more efficient loading and is used internally.
  • start – a string identifying the unique start symbol of this grammar, which will be used by default when parsing with this grammar
  • altweights – a dictionary or filename with numpy arrays of alternative weights.

By default the grammar is in logprob mode; invoke grammar.switch('default', logprob=False) to switch to plain probabilities. If the grammar contains only integral weights (frequencies), they will be normalized into relative frequencies; if the grammar contains any non-integral weights, the weights will be left unchanged.
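The normalization rule described above can be illustrated with a minimal pure-Python sketch. It does not call discodop; the function name `normalize_weights` and the `(lhs, rhs, weight)` tuple layout are assumptions made for illustration, and the actual Grammar class may differ in detail.

```python
from collections import defaultdict


def normalize_weights(rules):
    """Turn integral frequencies into relative frequencies per left-hand side.

    `rules` is a list of (lhs, rhs, weight) tuples (a hypothetical layout).
    If any weight is non-integral, the rules are returned unchanged.
    """
    if any(not float(w).is_integer() for _, _, w in rules):
        return list(rules)
    totals = defaultdict(int)
    for lhs, _, w in rules:
        totals[lhs] += w
    return [(lhs, rhs, w / totals[lhs]) for lhs, rhs, w in rules]


# Frequencies only: normalized so that each left-hand side sums to 1.
counts = [('S', ('NP', 'VP'), 3), ('S', ('VP',), 1), ('NP', ('DT', 'NN'), 2)]
relfreq = normalize_weights(counts)

# A single non-integral weight: everything is left as given.
mixed = [('S', ('NP', 'VP'), 0.5), ('S', ('VP',), 2)]
unchanged = normalize_weights(mixed)
```

Here `relfreq` gives the two S rules the weights 0.75 and 0.25, while `unchanged` is equal to `mixed`.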

addrules(self, bytes rules, bytes lexicon, backtransform=None, init=False)

Update weights and add new rules.

frombinfile(type cls, filename, rulesfile, lexiconfile, backtransform=None)

Load grammar from cached binary file.

Parameters:
  • filename – file produced by tobinfile() method; format subject to change, recreate as needed.
  • rulesfile – original grammar file, used only when pickling.

getlabels(self)

Return grammar labels as list.

getlexprobs(self, unicode word)

Return the list of probabilities of rules for a word.

getmapping(self, Grammar coarse, striplabelre=None, neverblockre=None, bool splitprune=False, bool markorigin=False, dict mapping=None, int startidx=0, bool debug=True)

Construct mapping of this grammar’s non-terminal labels to another.

Parameters:
  • coarse – the grammar to which this grammar’s labels will be mapped. May be None to establish a separate mapping to own labels.
  • striplabelre – if not None, a compiled regex used to form the coarse label for a given fine label. This regex is applied with a substitution to the empty string.
  • neverblockre – labels that match this regex will never be pruned. Also used to identify auxiliary labels of Double-DOP grammars.

    • use |< to ignore nodes introduced by binarization; useful if coarse and fine stages employ different kinds of markovization; e.g., NP and VP may be blocked, but not NP|<DT-NN>.
    • use _[0-9]+ to ignore discontinuous nodes X_n where X is a label and n is a fanout.
  • mapping – a dictionary with strings of fine labels mapped to coarse labels. striplabelre, if given, is applied first.
  • startidx – when running getmapping after new rules have been added, pass the value of grammar.nonterminals before they were added to avoid rebuilding the mapping completely.
  • debug – whether to return a debug message.

The regexes should be compiled objects, i.e., re.compile(regex), or None to leave labels unchanged.
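
How the two regexes and the explicit mapping interact can be sketched with the standard `re` module alone. The helper names `coarse_label` and `never_blocked`, the `@1`-style annotations, and the use of `search()` rather than `match()` are assumptions for illustration, not discodop's actual implementation.

```python
import re


def coarse_label(fine, striplabelre=None, mapping=None):
    """Derive a coarse label from a fine label.

    striplabelre is applied first (matches are replaced by the empty string),
    then the optional dictionary of fine-to-coarse labels is consulted.
    """
    label = fine if striplabelre is None else striplabelre.sub('', fine)
    if mapping is not None:
        label = mapping.get(label, label)
    return label


def never_blocked(label, neverblockre=None):
    """True if the label matches neverblockre and is thus exempt from pruning."""
    return neverblockre is not None and neverblockre.search(label) is not None


# Hypothetical fine labels; '@1' and '@2' stand for some fine-grained annotation.
striplabelre = re.compile(r'@[0-9]+$')
neverblockre = re.compile(r'\|<')  # nodes introduced by binarization

labels = ['NP@1', 'VP@2', 'NP|<DT-NN>', 'VP_2']
coarse = [coarse_label(a, striplabelre) for a in labels]
exempt = [a for a in labels if never_blocked(a, neverblockre)]
```

With these inputs, `NP@1` and `VP@2` are reduced to `NP` and `VP`, and only `NP|<DT-NN>` is exempt from pruning.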

getpos(self)

Return POS tags in lexicon as list.

getrulemapping(self, Grammar coarse, striplabelre)

Produce a mapping of coarse rules to sets of fine rules.

A coarse rule for a given fine rule is found by applying the label mapping to rules. The rule mapping uses the rule numbers (rule.no) derived from the original order of the rules when the Grammar object was created; e.g., self.rulemapping[12] == [34, 56, 78, ...] where 12 refers to a rule in the given coarse grammar, and the other IDs to rules in this grammar.
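
The shape of such a rule mapping can be sketched as follows. This is a simplified illustration under stated assumptions: rules are plain tuples of labels, rule numbers are list positions, yield functions are ignored, and `rule_mapping` is a hypothetical helper, not part of discodop.

```python
from collections import defaultdict


def rule_mapping(coarse_rules, fine_rules, label_map):
    """Map each coarse rule number to the fine rule numbers that project onto it.

    Rules are tuples of labels (lhs, rhs1, ...); rule numbers are list indices.
    label_map is a function from a fine label to its coarse label.
    """
    coarse_index = {rule: no for no, rule in enumerate(coarse_rules)}
    result = defaultdict(list)
    for fine_no, rule in enumerate(fine_rules):
        projected = tuple(label_map(label) for label in rule)
        if projected in coarse_index:
            result[coarse_index[projected]].append(fine_no)
    return dict(result)


coarse_rules = [('S', 'NP', 'VP'), ('NP', 'DT', 'NN')]
fine_rules = [('S', 'NP@1', 'VP@1'), ('S', 'NP@2', 'VP@1'), ('NP@1', 'DT', 'NN')]
# Assumed label mapping: strip everything from '@' onwards.
mapping = rule_mapping(coarse_rules, fine_rules, lambda a: a.split('@')[0])
```

In this toy example coarse rule 0 corresponds to fine rules 0 and 1, and coarse rule 1 to fine rule 2.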

getruleno(self, tuple r, tuple yf)

Get the rule number for a given (discontinuous) production.

getwords(self)

Return words in lexicon as list.

incrementrulecount(self, int ruleno, int freq)

Add freq to the observed count of a rule. NB: the weights need to be re-normalized afterwards; alternative weights are not affected.

noderuleno(self, node)

Get the rule number for a given node of a continuous tree.

rulestr(self, int n)

Return a string representation of a specific rule in this grammar.

setmask(self, seq)

Given a sequence of rule numbers, store a mask so that any phrasal rules not in the sequence are deactivated. If sequence is None, the mask is cleared (all rules are active).
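
The masking semantics can be pictured with a small stand-in class. `RuleMask` and its `active()` method are hypothetical and only mirror the behaviour described above; they say nothing about how the Grammar object stores its mask internally.

```python
class RuleMask:
    """Toy stand-in for the masking behaviour described for setmask()."""

    def __init__(self, numrules):
        self.numrules = numrules
        self.mask = None  # None means that every rule is active

    def setmask(self, seq):
        """Activate only the rule numbers in seq; None clears the mask."""
        self.mask = None if seq is None else frozenset(seq)

    def active(self):
        """Return the active rule numbers in ascending order."""
        if self.mask is None:
            return list(range(self.numrules))
        return [n for n in range(self.numrules) if n in self.mask]


rules = RuleMask(5)
rules.setmask([1, 3])
masked = rules.active()     # only rules 1 and 3 remain active
rules.setmask(None)
unmasked = rules.active()   # the mask is cleared, all rules are active again
```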

switch(self, unicode name, bool logprob=True)

testgrammar(self, epsilon=1e-16)

Test whether, for every left-hand side, the weights of its rules sum to 1 ± epsilon under the currently selected weights.
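
A check of this kind can be sketched in a few lines. The function `sums_to_one`, its return value, and the `(lhs, rhs, prob)` tuple layout are assumptions for illustration; the sketch works on plain probabilities, not log probabilities.

```python
from collections import defaultdict


def sums_to_one(rules, epsilon=1e-16):
    """Check that rule probabilities sum to 1 +/- epsilon for every left-hand side.

    Returns (ok, offenders), where offenders maps each failing
    left-hand side to the sum that was found.
    """
    totals = defaultdict(float)
    for lhs, _, prob in rules:
        totals[lhs] += prob
    offenders = {lhs: total for lhs, total in totals.items()
                 if abs(total - 1.0) > epsilon}
    return not offenders, offenders


rules = [('S', ('NP', 'VP'), 0.75), ('S', ('VP',), 0.25),
         ('NP', ('DT', 'NN'), 0.9)]
ok, offenders = sums_to_one(rules)
```

Here the S rules pass, while NP is reported because its single rule has probability 0.9.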

tobinfile(self, filename)

Store grammar in a binary format for faster loading.

class discodop.containers.Chart

Base class for charts. Provides methods available on all charts.

The subclass hierarchy for charts has three levels:

  1. base class, methods for chart traversal.
  2. formalism, methods specific to CFG vs. LCFRS parsers.
  3. data structures optimized for short/long sentences, small/large grammars.

Level 1/2 defines a type for labeled spans referred to as item.

bestsubtree(self, start, end)

Return item with most probable subtree for given (continuous) span.

filter(self)

Drop edges not part of a derivation headed by root of chart.
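
One way to picture this filtering is a reachability search from the root item. The sketch below is not the discodop implementation: the chart is modelled as a plain dictionary from items to lists of edges, each edge a tuple of child items, and it assumes that every edge in the chart already has complete children, so reachability from the root is sufficient.

```python
def filter_chart(chart, root):
    """Keep only the items (and their edges) reachable from root.

    chart maps each item to a list of edges; an edge is a tuple of child
    items (an empty tuple for a lexical edge). This layout is hypothetical.
    """
    reachable = set()
    agenda = [root]
    while agenda:
        item = agenda.pop()
        if item in reachable:
            continue
        reachable.add(item)
        for edge in chart.get(item, ()):
            agenda.extend(edge)
    return {item: edges for item, edges in chart.items() if item in reachable}


# Items are (label, start, end) triples in this toy example.
chart = {
    ('S', 0, 2): [(('NP', 0, 1), ('VP', 1, 2))],
    ('NP', 0, 1): [()],
    ('VP', 1, 2): [()],
    ('PP', 0, 1): [()],  # never used in a derivation of S
}
filtered = filter_chart(chart, ('S', 0, 2))
```

After filtering, the unused PP item is gone and the three items that take part in the derivation of S remain.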

indices(self, item)

Return a list of indices dominated by item.

itemid(self, unicode label, indices, Whitelist whitelist=None)

Get integer ID for labeled span in the chart (0 if non-existent).

numitems(self)

Number of items in chart.

root(self)

Return item with root label spanning the whole sentence.

stats(self)

Return a short string with counts of items, edges.

class discodop.containers.Ctrees

An indexed, binarized treebank stored as array.

Indexing depends on an external Vocabulary object that maps productions and labels to unique integers across different sets of trees. First call the alloc() method with (estimated) number of nodes & trees. Then add trees one by one using the addnodes() method.

addtrees(self, items, Vocabulary vocab, index=True)

Add binarized Tree objects.

Parameters:
  • items – an iterable with tuples of the form (tree, sent).
  • index – whether to create production index of trees.
Returns:
  a dictionary with keys ‘trees1’, ‘trees2’, and ‘vocab’, where trees1 and trees2 are Ctrees objects for discontinuous binary trees and sentences.

alloc(self, int numtrees, long numnodes)

Initialize an array of trees of nodes structs.

close(self)

Close any open files and free memory.

extract(self, int n, Vocabulary vocab, bool disc=True, int node=-1)

Return given tree in discbracket format.

Parameters:
  • node – if given, extract specific subtree instead of whole tree.

extractsent(self, int n, Vocabulary vocab)

Return sentence as a list for given tree.

fromfile(type cls, filename)

Load read-only version of Ctrees object using mmap.

fromfilemut(type cls, filename)

Mutable version of fromfile(); changes not stored to disk.

indextrees(self, Vocabulary vocab, int start=0, freeze=False)

Create index from productions to trees containing that production.

Productions are represented as integer IDs, trees are given as sets of integer indices.
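
The index described here is essentially an inverted index. A minimal sketch, assuming each tree is given as a list of integer production IDs (a simplification of the array-based storage; `index_trees` is a hypothetical helper):

```python
from collections import defaultdict


def index_trees(trees, start=0):
    """Build an index from production ID to the set of tree indices containing it.

    trees is a list in which each tree is an iterable of integer production IDs.
    Only trees from position start onwards are indexed.
    """
    index = defaultdict(set)
    for n in range(start, len(trees)):
        for prodno in trees[n]:
            index[prodno].add(n)
    return dict(index)


trees = [[0, 1, 2], [0, 3], [1, 3, 3]]
prodindex = index_trees(trees)
# Trees containing both production 0 and production 3:
both = prodindex[0] & prodindex[3]
```

Intersecting the sets for several productions narrows down the candidate trees without scanning the whole treebank.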

printrepr(self, int n, Vocabulary vocab)

Print repr of a tree for debugging purposes.

tofile(self, filename)

class discodop.containers.Vocabulary

A mapping of productions, labels, words to integers.

  • Vocabulary.getprod(): get production number and add to index (mutating).
  • FixedVocabulary.getprod(): look up production number given labels/words (no mutation, but requires makeindex()).
  • .getlabel(): look up label/word given production number (no mutation, arrays only).

fromfile(type cls, filename)

Create a mutable Vocabulary object from a file.

prodrepr(self, int prodno)

tofile(self, unicode filename)

Helper function for pickling.

class discodop.containers.FixedVocabulary

close(self)

Close open files, if any.

fromfile(type cls, filename)

Return an immutable Vocabulary object from a file.

makeindex(self)

Build dictionaries; necessary for getprod().
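
The contrast between the mutating Vocabulary.getprod() and the lookup-only FixedVocabulary.getprod() can be sketched with two toy classes. `ToyVocabulary` and `ToyFixedVocabulary` are hypothetical stand-ins; in particular, the exceptions raised here for a missing index or an unknown production are assumptions, not documented discodop behaviour.

```python
class ToyVocabulary:
    """Mutable mapping: getprod() assigns a new number to unseen productions."""

    def __init__(self):
        self.prods = []   # production number -> production
        self.index = {}   # production -> production number

    def getprod(self, prod):
        if prod not in self.index:
            self.index[prod] = len(self.prods)
            self.prods.append(prod)
        return self.index[prod]


class ToyFixedVocabulary:
    """Read-only mapping: getprod() only looks up, and only after makeindex()."""

    def __init__(self, prods):
        self.prods = tuple(prods)
        self.index = None  # built on demand by makeindex()

    def makeindex(self):
        self.index = {prod: n for n, prod in enumerate(self.prods)}

    def getprod(self, prod):
        if self.index is None:
            raise ValueError('call makeindex() first')
        return self.index[prod]  # KeyError for unknown productions


vocab = ToyVocabulary()
first = vocab.getprod(('S', 'NP', 'VP'))    # new production, gets number 0
second = vocab.getprod(('NP', 'DT', 'NN'))  # new production, gets number 1
again = vocab.getprod(('S', 'NP', 'VP'))    # already known, number unchanged

fixed = ToyFixedVocabulary(vocab.prods)
fixed.makeindex()
looked_up = fixed.getprod(('NP', 'DT', 'NN'))
```

Keeping the dictionaries optional in the fixed variant reflects the note above that label lookup by number needs only the arrays, while lookup by production requires the index.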