discodop.containers¶
Data types for chart items, edges, &c.
Classes
Chart | Base class for charts.
ChartItem |
Ctrees() | An indexed, binarized treebank stored as array.
Edges | Object with a linked list of Edges.
FatChartItem | Item where bitvector is a fixed-width static array.
FixedVocabulary |
Grammar(rule_tuples_or_str[, lexicon, ...]) | A grammar object which stores rules compactly, indexed in various ways.
LexicalRule(uint32_t lhs, unicode word, ...) | A weighted rule of the form ‘non-terminal –> word’.
RankedEdge | A derivation with backpointers.
SmallChartItem(label, vec) | Item with word-sized bitvector.
Vocabulary() | A mapping of productions, labels, words to integers.
-
class
discodop.containers.Chart¶ Base class for charts. Provides methods available on all charts.
The subclass hierarchy for charts has three levels:
- base class, methods for chart traversal.
- formalism, methods specific to CFG vs. LCFRS parsers.
- data structures optimized for short/long sentences, small/large grammars.
Level 1/2 defines a type for labeled spans referred to as item.
-
filter(self)¶ Drop entries not part of a derivation headed by root of chart.
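As a rough illustration of what filtering amounts to, the sketch below keeps only the items that can be reached from the root by following edges. The dictionary-based chart, the string items, and the name filter_chart are hypothetical stand-ins chosen for this example; they are not discodop's internal data structures.

```python
def filter_chart(chart, root):
    """Return a copy of chart restricted to items reachable from root.

    chart is a hypothetical dict mapping each item to a list of edges;
    each edge is a tuple of child items (empty for lexical edges).
    """
    reachable = set()
    agenda = [root]
    while agenda:
        item = agenda.pop()
        if item in reachable or item not in chart:
            continue
        reachable.add(item)
        for edge in chart[item]:
            agenda.extend(edge)  # follow backpointers to the children
    return {item: edges for item, edges in chart.items()
            if item in reachable}


chart = {
    'S[0-2]': [('NP[0-1]', 'VP[1-2]')],
    'NP[0-1]': [()],
    'VP[1-2]': [()],
    'PP[0-1]': [()],  # built during parsing, never used under the root
}
filtered = filter_chart(chart, 'S[0-2]')
```

In this toy chart the PP item is dropped because no derivation headed by the root uses it.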
-
indices(self, item)¶ Return a list of indices dominated by item.
-
root(self)¶ Return item with root label spanning the whole sentence.
-
stats(self)¶ Return a short string with counts of items, edges.
-
toitem(self, node, item)¶ Convert Tree node with integer indices as terminals to a ChartItem.
Return type is determined by item.
-
class
discodop.containers.Ctrees¶ An indexed, binarized treebank stored as array.
Indexing depends on an external Vocabulary object that maps productions and labels to unique integers across different sets of trees. First call the alloc() method with (estimated) number of nodes & trees. Then add trees one by one using the addnodes() method.
-
alloc(self, int numtrees, long numnodes)¶ Initialize an array of trees of nodes structs.
-
extract(self, int n, Vocabulary vocab, bool disc=True, int node=-1)¶ Return given tree in discbracket format.
Parameters: node – if given, extract specific subtree instead of whole tree.
-
extractsent(self, int n, Vocabulary vocab)¶ Return sentence as a list for given tree.
-
fromfile(type cls, filename)¶
-
indextrees(self, Vocabulary vocab)¶ Create index from productions to trees containing that production.
Productions are represented as integer IDs, trees are given as sets of integer indices.
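The idea of such an index can be sketched in plain Python as an inverted index. The input format (one list of production IDs per tree) and the name index_trees are assumptions made for this example; the actual Ctrees object stores its trees in arrays.

```python
from collections import defaultdict


def index_trees(trees):
    """Map each production ID to the set of tree indices containing it.

    trees is a hypothetical list in which tree n is given as a list of
    integer production IDs, one per node.
    """
    index = defaultdict(set)
    for n, prods in enumerate(trees):
        for prod in prods:
            index[prod].add(n)
    return dict(index)


trees = [[0, 1, 2], [0, 3], [1, 3, 3]]
index = index_trees(trees)
```

Looking up a production ID in the resulting dictionary gives the candidate trees that need to be inspected, without scanning the whole treebank.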
-
printrepr(self, int n, Vocabulary vocab)¶ Print repr of a tree for debugging purposes.
-
tofile(self, filename)¶
-
-
class
discodop.containers.Edges¶ Object with a linked list of Edges.
-
class
discodop.containers.FatChartItem¶ Item where bitvector is a fixed-width static array.
-
binrepr(self, lensent=0)¶
-
lexidx(self)¶
-
-
class
discodop.containers.Grammar(rule_tuples_or_str, lexicon=None, start=u'ROOT', binarized=True)¶ A grammar object which stores rules compactly, indexed in various ways.
Parameters:
- rule_tuples_or_str – either a sequence of tuples containing both phrasal & lexical rules, or a string containing the phrasal rules in text format; in the latter case lexicon should be given. The text format allows for more efficient loading and is used internally.
- start – a string identifying the unique start symbol of this grammar, which will be used by default when parsing with this grammar
- binarized – whether to require a binarized grammar; a non-binarized grammar can only be used by bitpar.
By default the grammar is in logprob mode; invoke grammar.switch('default', logprob=False) to switch. If the grammar only contains integral weights (frequencies), they will be normalized into relative frequencies; if the grammar contains any non-integral weights, weights will be left unchanged.
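The weight handling described above can be sketched as follows. This is a minimal illustration, not discodop code: the rule tuples are simplified to (lhs, rhs...) with a weight, and normalize_weights is a hypothetical helper name.

```python
from collections import defaultdict


def normalize_weights(rules):
    """Turn frequencies into relative frequencies per left-hand side.

    rules is a hypothetical list of ((lhs, child, ...), weight) pairs.
    If any weight is non-integral, the weights are returned unchanged.
    """
    if any(weight != int(weight) for _, weight in rules):
        return list(rules)
    totals = defaultdict(int)
    for rule, weight in rules:
        totals[rule[0]] += weight  # rule[0] is the left-hand side
    return [(rule, weight / totals[rule[0]]) for rule, weight in rules]


counts = [(('NP', 'DT', 'NN'), 3), (('NP', 'NN'), 1), (('VP', 'VB', 'NP'), 2)]
normalized = normalize_weights(counts)

mixed = [(('NP', 'DT', 'NN'), 0.6), (('NP', 'NN'), 1)]
unchanged = normalize_weights(mixed)
```

With the counts above, the two NP rules receive 0.75 and 0.25, while the mixed list is passed through as is because it contains a non-integral weight.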
buildchainvec(self)¶ Build a boolean matrix representing the unary (chain) rules.
-
getmapping(self, Grammar coarse, striplabelre=None, neverblockre=None, bool splitprune=False, bool markorigin=False, dict mapping=None)¶ Construct mapping of this grammar’s non-terminal labels to another.
Parameters:
- coarse – the grammar to which this grammar’s labels will be mapped. May be None; useful when neverblockre needs to be applied.
- striplabelre – if not None, a compiled regex used to form the coarse label for a given fine label. This regex is applied with a substitution to the empty string.
- neverblockre – labels that match this regex will never be pruned. Also used to identify auxiliary labels of Double-DOP grammars.
  - use |< to ignore nodes introduced by binarization; useful if coarse and fine stages employ different kinds of markovization; e.g., NP and VP may be blocked, but not NP|<DT-NN>.
  - use _[0-9]+ to ignore discontinuous nodes X_n where X is a label and n is a fanout.
- mapping – a dictionary with strings of fine labels mapped to coarse labels. striplabelre, if given, is applied first.
The regexes should be compiled objects, i.e., re.compile(regex), or None to leave labels unchanged.
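How these parameters interact can be sketched with the standard re module. The helpers coarse_label and may_be_blocked, the example labels, and the use of search() rather than match() are assumptions made for illustration; they are not part of the Grammar API.

```python
import re


def coarse_label(fine, striplabelre=None, mapping=None):
    """Derive a coarse label: strip with the regex first, then map."""
    label = fine
    if striplabelre is not None:
        label = striplabelre.sub('', label)  # substitute the empty string
    if mapping is not None:
        label = mapping.get(label, label)
    return label


def may_be_blocked(label, neverblockre=None):
    """Labels matching neverblockre are exempt from pruning."""
    return neverblockre is None or neverblockre.search(label) is None


striplabelre = re.compile(r'@[0-9]+$')  # hypothetical state-split suffix
mapping = {'NNS': 'NN'}                 # hypothetical fine-to-coarse map
binarization = re.compile(r'\|<')
discontinuous = re.compile(r'_[0-9]+')

examples = [coarse_label(label, striplabelre, mapping)
            for label in ('NP@3', 'NNS@2', 'VP')]
```

With neverblockre set to the binarization pattern, NP and VP remain candidates for pruning, whereas NP|<DT-NN> is never pruned.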
-
getrulemapping(self, Grammar coarse, striplabelre)¶ Produce a mapping of coarse rules to sets of fine rules.
A coarse rule for a given fine rule is found by applying the regex striplabelre to labels. NB: this regex is applied to strings with multiple non-terminal labels at once, so it should not match on the end of string ($). The mapping uses the rule numbers (rule.no) derived from the original order of the rules when the Grammar object was created; e.g., self.rulemapping[12] == [34, 56, 78, ...] where 12 refers to a rule in the given coarse grammar, and the other IDs to rules in this grammar.
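The reason for avoiding an end-of-string anchor can be seen in a small sketch. Rules are represented here as space-separated label strings and rule_mapping is a hypothetical helper; the real method works on the grammar's internal rule arrays.

```python
import re
from collections import defaultdict


def rule_mapping(coarse_rules, fine_rules, striplabelre):
    """Map coarse rule numbers to lists of fine rule numbers.

    Each rule is a hypothetical string of labels, e.g. 'NP@0 DT NN@1';
    the regex is applied once to the whole string.
    """
    coarse_no = {rule: n for n, rule in enumerate(coarse_rules)}
    mapping = defaultdict(list)
    for n, rule in enumerate(fine_rules):
        stripped = striplabelre.sub('', rule)
        if stripped in coarse_no:
            mapping[coarse_no[stripped]].append(n)
    return dict(mapping)


coarse_rules = ['S NP VP', 'NP DT NN']
fine_rules = ['S NP@1 VP@0', 'NP@0 DT NN@1', 'NP@1 DT NN@0', 'S NP@0 VP@1']

# Without '$', every label in the string is stripped.
good = rule_mapping(coarse_rules, fine_rules, re.compile(r'@[0-9]+'))
# With '$', only the last label is stripped, so nothing lines up.
bad = rule_mapping(coarse_rules, fine_rules, re.compile(r'@[0-9]+$'))
```

In this sketch the unanchored regex maps every fine rule to its coarse counterpart, while the anchored one yields an empty mapping.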
-
register(self, name, weights)¶ Register a probabilistic model given a name and a sequence of floats weights, with weights in the same order as self.origrules and self.origlexicon (which is an arbitrary order except that tags for each word are clustered together).
-
rulestr(self, int n)¶ Return a string representation of a specific rule in this grammar.
-
setmask(self, seq)¶ Given a sequence of rule numbers, store a mask so that any phrasal rules not in the sequence are deactivated. If sequence is None, the mask is cleared.
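The masking semantics can be sketched as below. The list-of-booleans representation and the name make_mask are assumptions for illustration; how the Grammar object stores its mask internally is not described here.

```python
def make_mask(numrules, seq=None):
    """Return one boolean per rule; True means the rule is deactivated.

    With seq=None the mask is cleared, i.e., all rules are active.
    """
    if seq is None:
        return [False] * numrules
    active = set(seq)
    return [n not in active for n in range(numrules)]


mask = make_mask(5, [0, 3])  # only rules 0 and 3 stay active
cleared = make_mask(5)       # None clears the mask
```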
-
switch(self, unicode name, bool logprob=True)¶ Switch to a different probabilistic model; use u'default' to switch back to model given during initialization.
-
testgrammar(self, epsilon=<???>)¶ Test whether all left-hand sides sum to 1 +/- epsilon for the currently selected weights.
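A check of this kind can be sketched as follows. The (lhs, rhs, prob) triples and the name check_lhs_sums are hypothetical, and the tolerance of 1e-6 is an arbitrary placeholder, since the actual default of epsilon is not shown above.

```python
from collections import defaultdict


def check_lhs_sums(rules, epsilon=1e-6):
    """Check that probabilities for each left-hand side sum to 1.

    rules is a hypothetical list of (lhs, rhs, prob) triples. Returns
    (ok, offenders), where offenders maps each failing left-hand side
    to its total probability mass.
    """
    totals = defaultdict(float)
    for lhs, _rhs, prob in rules:
        totals[lhs] += prob
    offenders = {lhs: total for lhs, total in totals.items()
                 if abs(total - 1.0) > epsilon}
    return not offenders, offenders


rules = [
    ('NP', ('DT', 'NN'), 0.75),
    ('NP', ('NN',), 0.25),
    ('VP', ('VB', 'NP'), 0.9),  # mass for VP is incomplete
]
ok, offenders = check_lhs_sums(rules)
```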
-
class
discodop.containers.LexicalRule(uint32_t lhs, unicode word, double prob)¶ A weighted rule of the form ‘non-terminal –> word’.
-
class
discodop.containers.RankedEdge¶ A derivation with backpointers.
Denotes a k-best derivation defined by an edge, including the chart item (head) to which it points, along with ranks for its children.
-
class
discodop.containers.SmallChartItem(label, vec)¶ Item with word-sized bitvector.
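A bitvector of this kind can be sketched with a plain Python integer, assuming that bit i stands for the word at position i; that convention and the helper names make_vec and covered are assumptions for this example rather than documented behavior.

```python
def make_vec(positions):
    """Encode a set of word positions as an integer bitvector."""
    vec = 0
    for pos in positions:
        vec |= 1 << pos
    return vec


def covered(vec):
    """Return the positions of the set bits, lowest first."""
    result = []
    pos = 0
    while vec:
        if vec & 1:
            result.append(pos)
        vec >>= 1
        pos += 1
    return result


vec = make_vec([0, 1, 4])  # a discontinuous span: words 0-1 and word 4
positions = covered(vec)
```

A single machine word is enough as long as the sentence has no more words than the word has bits; longer sentences call for a wider representation.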
-
binrepr(self, int lensent=0)¶
-
lexidx(self)¶
-
-
class
discodop.containers.Vocabulary¶ A mapping of productions, labels, words to integers.
- Vocabulary.getprod(): get prod no and add to index (mutating).
- FixedVocabulary.getprod(): lookup prod no given labels/words (no mutation, but requires makeindex()).
- .getlabel(): lookup label/word given prod no (no mutation, arrays only)
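The difference between the mutating and the fixed lookup can be sketched as follows. The classes MutableVocab and FixedVocab are hypothetical simplifications that store productions as tuples; they only mirror the method names mentioned above.

```python
class MutableVocab:
    """getprod() assigns a new number to unseen productions."""

    def __init__(self):
        self.index = {}
        self.prods = []

    def getprod(self, prod):
        if prod not in self.index:
            self.index[prod] = len(self.prods)  # mutation happens here
            self.prods.append(prod)
        return self.index[prod]


class FixedVocab:
    """getprod() only looks up; the index must be built beforehand."""

    def __init__(self, prods):
        self.prods = list(prods)
        self.index = None

    def makeindex(self):
        self.index = {prod: n for n, prod in enumerate(self.prods)}

    def getprod(self, prod):
        if self.index is None:
            raise RuntimeError('call makeindex() first')
        return self.index[prod]  # KeyError for unknown productions


vocab = MutableVocab()
first = vocab.getprod(('NP', 'DT', 'NN'))
second = vocab.getprod(('VP', 'VB', 'NP'))
again = vocab.getprod(('NP', 'DT', 'NN'))

fixed = FixedVocab(vocab.prods)
fixed.makeindex()
looked_up = fixed.getprod(('VP', 'VB', 'NP'))
```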
-
prodrepr(self, int prodno)¶
-
tofile(self, unicode filename)¶ Helper function for pickling.