discodop.lexicon

Add rules to handle unknown words and smooth lexical probabilities.

Rare words in the training set are replaced with word signatures, so that unknown words at test time can receive similar tags. Given a function that produces such signatures from words, the flow is as follows:

Simple lexical smoothing:
1. getunknownwordmodel (get statistics)
2. replaceraretrainwords (adjust trees)
3. [ read off grammar ]
4. simplesmoothlexicon (add extra lexical productions)

Sophisticated smoothing (untested):
1. getunknownwordmodel
2. getlexmodel
3. replaceraretrainwords
4. [ read off grammar ]
5. smoothlexicon

During parsing:
1. replaceraretestwords (only give known words and signatures to parser)
2. restore original words in derivations
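
The idea behind this flow can be illustrated with a small self-contained sketch. It does not call discodop; `toy_signature`, `replace_rare` and `prepare_test_sentence` are hypothetical helpers, and the signature scheme is invented for illustration rather than taken from any of the unknown word models listed below.

```python
from collections import Counter


def toy_signature(word):
    """Hypothetical signature function (not one of discodop's models)."""
    sig = '_UNK'
    if word[:1].isupper():
        sig += '-C'  # capitalized
    if any(ch.isdigit() for ch in word):
        sig += '-N'  # contains a digit
    if word.endswith('ed'):
        sig += '-ed'  # suffix feature
    return sig


def replace_rare(tagged_sents, threshold):
    """Training side: words with frequency <= threshold become signatures."""
    freq = Counter(word for sent in tagged_sents for word, _ in sent)
    lexicon = {word for word, n in freq.items() if n > threshold}
    replaced = [
        [(word if word in lexicon else toy_signature(word), tag)
         for word, tag in sent]
        for sent in tagged_sents]
    return replaced, lexicon


def prepare_test_sentence(sent, lexicon):
    """Parsing side: pass only known words and signatures to the parser,
    and keep the original words so they can be restored afterwards."""
    parser_input = [word if word in lexicon else toy_signature(word)
                    for word in sent]
    return parser_input, list(sent)


train = [
    [('the', 'DT'), ('dog', 'NN'), ('barked', 'VBD')],
    [('the', 'DT'), ('dog', 'NN'), ('slept', 'VBD')],
]
train_replaced, lexicon = replace_rare(train, threshold=1)
parser_input, original = prepare_test_sentence(
    ['the', 'cat', 'eschewed'], lexicon)
print(parser_input)  # ['the', '_UNK', '_UNK-ed']
```

Because "barked" was replaced by the same signature during training, the unseen word "eschewed" maps onto a terminal that the grammar already knows.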

Functions

`accuracy`(gold, cand)	Compute fraction of equivalent pairs in two sequences.
`externaltagging`(usetagger, model, sents, ...)	Use an external tool to tag a list of sentences.
`getlexmodel`(sigs, words, _lexicon, ...[, ...])	Compute a smoothed lexical model.
`getunknownwordmodel`(tagged_sents, ...)	Collect statistics for an unknown word model.
`replaceraretestwords`(sent, unknownword, ...)	Replace test set words not in lexicon with a signature from unknownword().
`replaceraretrainwords`(tagged_sents, ...)	Replace train set words not in lexicon with a signature from unknownword().
`simplesmoothlexicon`(lexmodel[, epsilon])	Collect new lexical productions.
`smoothlexicon`(grammar, P_wordtag)	Replace lexical probabilities using given unknown word model.
`tagmangle`(a, splitchar, overridetag, tagmap)	Function to filter tags after they are produced by the tagger.
`unknownword4`(word, loc, _lexicon)	Model 4 of the Stanford parser.
`unknownword6`(word, loc, lexicon)	Model 6 of the Stanford parser (for WSJ treebank).
`unknownwordbase`(word, _loc, _lexicon)	BaseUnknownWordModel of the Stanford parser.
`unknownwordftb`(word, loc, _lexicon)	Model 2 for French of the Stanford parser.

discodop.lexicon.getunknownwordmodel(tagged_sents, unknownword, unknownthreshold, openclassthreshold)

Collect statistics for an unknown word model.

Parameters:

tagged_sents – the sentences from the training set with the gold POS tags from the treebank.
unknownword – a function that returns a signature for a given word; e.g., “eschewed” => “_UNK-L-d”.
unknownthreshold – words with frequency lower than or equal to this are replaced by their signature.
openclassthreshold – tags that rewrite to at least this many word types are considered to be open class categories.
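
How the two thresholds partition words and tags can be sketched as follows. This is a simplified stand-in, not discodop's code: `collect_statistics` is a hypothetical name, and the return value of the real function is not documented here, so the pair returned below is an assumption made for illustration.

```python
from collections import Counter, defaultdict


def collect_statistics(tagged_sents, unknownthreshold, openclassthreshold):
    """Return (rare words, open class tags) for a list of tagged sentences."""
    wordfreq = Counter()
    wordsfortag = defaultdict(set)
    for sent in tagged_sents:
        for word, tag in sent:
            wordfreq[word] += 1
            wordsfortag[tag].add(word)
    # Words at or below the frequency threshold will be replaced
    # by their signature.
    rare = {word for word, n in wordfreq.items() if n <= unknownthreshold}
    # Tags that rewrite to enough distinct word types count as open class.
    openclasstags = {tag for tag, words in wordsfortag.items()
                     if len(words) >= openclassthreshold}
    return rare, openclasstags


tagged_sents = [
    [('the', 'DT'), ('dog', 'NN'), ('saw', 'VBD'), ('the', 'DT'),
     ('cat', 'NN')],
    [('the', 'DT'), ('bird', 'NN'), ('saw', 'VBD'), ('the', 'DT'),
     ('dog', 'NN')],
]
rare, openclasstags = collect_statistics(
    tagged_sents, unknownthreshold=1, openclassthreshold=3)
```

In this toy corpus, "cat" and "bird" occur once and are therefore rare, while NN is the only tag with three distinct word types and so the only open class category.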

discodop.lexicon.replaceraretrainwords(tagged_sents, unknownword, lexicon)

Replace train set words not in lexicon with a signature from unknownword().

discodop.lexicon.replaceraretestwords(sent, unknownword, lexicon, sigs)

Replace test set words not in lexicon with a signature from unknownword().

If only a lowercase version of a word is in the grammar, that will be used instead. If the returned signature is not part of the grammar, a default one is returned.
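
The fallback order described above can be sketched as follows. This is a minimal illustration, not the library's implementation: `replace_test_words` and `toy_unknownword` are hypothetical, and the default signature `'_UNK'` is an assumption.

```python
def toy_unknownword(word, loc, lexicon):
    """Hypothetical signature function with a (word, loc, lexicon) shape."""
    if word.endswith('ed'):
        return '_UNK-ed'
    if word[:1].isupper():
        return '_UNK-C'
    return '_UNK'


def replace_test_words(sent, unknownword, lexicon, sigs, default='_UNK'):
    """Map each word to a terminal that the grammar is known to contain."""
    result = []
    for loc, word in enumerate(sent):
        if word in lexicon:
            result.append(word)  # known word: keep as-is
        elif word.lower() in lexicon:
            result.append(word.lower())  # only the lowercase form is known
        else:
            sig = unknownword(word, loc, lexicon)
            # Use the signature only if the grammar has seen it;
            # otherwise fall back to the (assumed) default.
            result.append(sig if sig in sigs else default)
    return result


lexicon = {'the', 'dog', 'barked'}
sigs = {'_UNK-ed'}
replaced = replace_test_words(
    ['The', 'dog', 'eschewed', 'Zorgs'], toy_unknownword, lexicon, sigs)
print(replaced)  # ['the', 'dog', '_UNK-ed', '_UNK']
```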

discodop.lexicon.simplesmoothlexicon(lexmodel, epsilon=0.01)

Collect new lexical productions for two cases:

1. unobserved combinations of tags with known open class words;
2. unobserved signatures, which are mapped to '_UNK'.

Parameters:	epsilon – ‘frequency’ of productions for unseen tag, word pair.
Returns:	a dictionary of lexical rules, with pseudofrequencies as values.
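
A rough sketch of this kind of smoothing, under several assumptions: `toy_simplesmooth` is a hypothetical function, rules are keyed by plain (tag, word) tuples rather than whatever representation the grammar uses, and attaching the `'_UNK'` production to every open class tag is a guess made for illustration.

```python
def toy_simplesmooth(observed, openclasstags, openclasswords, epsilon=0.01):
    """Return new (tag, word) productions with pseudofrequency epsilon."""
    newrules = {}
    for tag in sorted(openclasstags):
        # Unobserved combinations of a tag with a known open class word.
        for word in sorted(openclasswords):
            if (tag, word) not in observed:
                newrules[tag, word] = epsilon
        # Catch-all terminal for signatures that were never observed
        # (assumed here to be added for each open class tag).
        newrules[tag, '_UNK'] = epsilon
    return newrules


observed = {('NN', 'dog'): 2, ('VB', 'walk'): 1, ('NN', 'walk'): 1}
newrules = toy_simplesmooth(
    observed, openclasstags={'NN', 'VB'}, openclasswords={'dog', 'walk'})
```

Here the only unseen combination is ('VB', 'dog'), so the result holds that pair plus one `'_UNK'` production per open class tag, each with pseudofrequency 0.01.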

discodop.lexicon.getlexmodel(sigs, words, _lexicon, wordsfortag, openclasstags, openclasswords, tags, wordtags, wordsig, sigtag, openclassoffset=1, kappa=1)

Compute a smoothed lexical model.

Parameters:

openclassoffset – for words that only appear with open class tags, add unseen combinations of open class (tag, word) with this count.
kappa – FIXME; cf. Klein & Manning (2003), footnote 5. http://aclweb.org/anthology/P03-1054

Returns:	a dictionary giving P(word_or_sig | tag).
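
The effect of openclassoffset can be sketched with a toy model that leaves out kappa and signatures entirely. `toy_lexmodel` is hypothetical, and treating `wordtags` as a mapping from (word, tag) to counts is an assumption; only the offset rule and the conditional probability come from the description above.

```python
from collections import Counter, defaultdict


def toy_lexmodel(wordtags, openclasstags, openclassoffset=1):
    """Return a dict mapping (word, tag) to P(word | tag)."""
    counts = Counter(wordtags)
    tagsforword = defaultdict(set)
    for word, tag in wordtags:
        tagsforword[word].add(tag)
    # Words seen only with open class tags get the offset added for
    # every open class tag they were not observed with.
    for word, tags in tagsforword.items():
        if tags <= openclasstags:
            for tag in openclasstags - tags:
                counts[word, tag] += openclassoffset
    # Normalize per tag to obtain P(word | tag).
    tagtotals = Counter()
    for (word, tag), n in counts.items():
        tagtotals[tag] += n
    return {(word, tag): n / tagtotals[tag]
            for (word, tag), n in counts.items()}


wordtags = {('dog', 'NN'): 3, ('walk', 'VB'): 2, ('walk', 'NN'): 1,
            ('the', 'DT'): 5}
model = toy_lexmodel(wordtags, openclasstags={'NN', 'VB'})
```

In this example "dog" is seen only as NN, so the unseen pair ("dog", VB) receives a count of 1 and ends up with probability 1/3 under VB, while "the" is left alone because DT is not an open class tag.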

discodop.lexicon.smoothlexicon(grammar, P_wordtag)

Replace lexical probabilities using given unknown word model. Ignores lexical productions of known subtrees (tag contains ‘@’) introduced by DOP, i.e., we only modify lexical depth 1 subtrees.

discodop.lexicon.unknownword6(word, loc, lexicon)

Model 6 of the Stanford parser (for WSJ treebank).

discodop.lexicon.unknownword4(word, loc, _lexicon)

Model 4 of the Stanford parser. Relatively language agnostic.

discodop.lexicon.unknownwordbase(word, _loc, _lexicon)

BaseUnknownWordModel of the Stanford parser. Relatively language agnostic.

discodop.lexicon.unknownwordftb(word, loc, _lexicon)

Model 2 for French of the Stanford parser.

discodop.lexicon.externaltagging(usetagger, model, sents, overridetag, tagmap)

Use an external tool to tag a list of sentences.

discodop.lexicon.tagmangle(a, splitchar, overridetag, tagmap)

Function to filter tags after they are produced by the tagger.