discodop.lexicon
Add rules to handle unknown words and smooth lexical probabilities.
Rare words in the training set are replaced with word signatures, such that unknown words can receive similar tags. Given a function to produce such signatures from words, the flow is as follows:
- Simple lexical smoothing:
  - getunknownwordmodel (get statistics)
  - replaceraretrainwords (adjust trees)
  - [ read off grammar ]
  - simplesmoothlexicon (add extra lexical productions)
- Sophisticated smoothing (untested):
  - getunknownwordmodel
  - getlexmodel
  - replaceraretrainwords
  - [ read off grammar ]
  - smoothlexicon
- During parsing:
  - replaceraretestwords (only give known words and signatures to parser)
  - restore original words in derivations
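As a rough illustration of the training-side steps above, the following sketch counts word frequencies and replaces rare training words with signatures. It is self-contained and does not call discodop; `toy_signature`, `replace_rare`, and the signature suffixes are hypothetical names chosen for this example, and the signature functions documented below produce different, richer signatures.

```python
from collections import Counter


def toy_signature(word):
    """Hypothetical signature function (not one of discodop's models)."""
    sig = '_UNK'
    if word[:1].isupper():
        sig += '-CAP'  # word starts with a capital letter
    if any(ch.isdigit() for ch in word):
        sig += '-NUM'  # word contains a digit
    if word.endswith('ed'):
        sig += '-ed'  # crude suffix feature
    return sig


def replace_rare(tagged_sents, threshold):
    """Replace words with frequency <= threshold by their signature."""
    counts = Counter(word for sent in tagged_sents for word, _tag in sent)
    return [
        [(word if counts[word] > threshold else toy_signature(word), tag)
         for word, tag in sent]
        for sent in tagged_sents
    ]


train = [
    [('the', 'DT'), ('cat', 'NN'), ('slept', 'VBD')],
    [('the', 'DT'), ('cat', 'NN'), ('eschewed', 'VBD'), ('Paris', 'NNP')],
]
result = replace_rare(train, threshold=1)
print(result[1])
# [('the', 'DT'), ('cat', 'NN'), ('_UNK-ed', 'VBD'), ('_UNK-CAP', 'NNP')]
```

Because the rare words now share signatures, an unknown test word with the same signature can receive the tags that those rare training words had.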
Functions
accuracy(gold, cand) | Compute fraction of equivalent pairs in two sequences.
externaltagging(usetagger, model, sents, ...) | Use an external tool to tag a list of sentences.
getlexmodel(sigs, words, _lexicon, ...[, ...]) | Compute a smoothed lexical model.
getunknownwordmodel(tagged_sents, ...) | Collect statistics for an unknown word model.
replaceraretestwords(sent, unknownword, ...) | Replace test set words not in lexicon with the signature from unknownword().
replaceraretrainwords(tagged_sents, ...) | Replace train set words not in lexicon with the signature from unknownword().
simplesmoothlexicon(lexmodel[, epsilon]) | Collect new lexical productions.
smoothlexicon(grammar, P_wordtag) | Replace lexical probabilities using given unknown word model.
tagmangle(a, splitchar, overridetag, tagmap) | Function to filter tags after they are produced by the tagger.
unknownword4(word, loc, _lexicon) | Model 4 of the Stanford parser.
unknownword6(word, loc, lexicon) | Model 6 of the Stanford parser (for WSJ treebank).
unknownwordbase(word, _loc, _lexicon) | BaseUnknownWordModel of the Stanford parser.
unknownwordftb(word, loc, _lexicon) | Model 2 for French of the Stanford parser.
-
discodop.lexicon.getunknownwordmodel(tagged_sents, unknownword, unknownthreshold, openclassthreshold)
Collect statistics for an unknown word model.
Parameters:
- tagged_sents – the sentences from the training set with the gold POS tags from the treebank.
- unknownword – a function that returns a signature for a given word; e.g., “eschewed” => “_UNK-L-d”.
- unknownthreshold – words with frequency lower than or equal to this are replaced by their signature.
- openclassthreshold – tags that rewrite to at least this many word types are considered to be open class categories.
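The role of openclassthreshold can be illustrated with a small, self-contained sketch. It is not discodop's implementation: `open_class_tags` is a hypothetical helper that only shows the counting rule described above, namely that a tag counts as open class when it rewrites to at least the given number of distinct word types.

```python
from collections import defaultdict


def open_class_tags(tagged_sents, openclassthreshold):
    """Return tags that occur with at least `openclassthreshold` word types."""
    types_per_tag = defaultdict(set)
    for sent in tagged_sents:
        for word, tag in sent:
            types_per_tag[tag].add(word)  # sets count types, not tokens
    return {tag for tag, words in types_per_tag.items()
            if len(words) >= openclassthreshold}


train = [
    [('the', 'DT'), ('cat', 'NN')],
    [('a', 'DT'), ('dog', 'NN')],
    [('the', 'DT'), ('bird', 'NN')],
]
open_tags = open_class_tags(train, openclassthreshold=3)
print(open_tags)  # {'NN'}
```

Here DT occurs with only two word types ("the", "a"), so it is treated as a closed class, while NN occurs with three.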
-
discodop.lexicon.replaceraretrainwords(tagged_sents, unknownword, lexicon)
Replace train set words not in lexicon with the signature from unknownword().
-
discodop.lexicon.replaceraretestwords(sent, unknownword, lexicon, sigs)
Replace test set words not in lexicon with the signature from unknownword().
If only a lowercase version of a word is in the grammar, that version is used instead. If the returned signature is not part of the grammar, a default signature is returned.
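The fallback order described above can be sketched as follows. This is a simplified, self-contained illustration rather than discodop's code: `replace_test_words` and `toy_signature` are hypothetical, the signature function here takes only the word, and the use of '_UNK' as the default signature is an assumption of this sketch.

```python
def toy_signature(word):
    """Hypothetical signature function (an assumption for this sketch)."""
    return '_UNK-CAP' if word[:1].isupper() else '_UNK-LC'


def replace_test_words(sent, signature, lexicon, sigs, default='_UNK'):
    """Map each test word to a known word or a known signature."""
    result = []
    for word in sent:
        if word in lexicon:
            result.append(word)  # known word: keep as is
        elif word.lower() in lexicon:
            result.append(word.lower())  # only the lowercase form is known
        else:
            sig = signature(word)
            # fall back to the default if this signature was never seen
            result.append(sig if sig in sigs else default)
    return result


lexicon = {'the', 'cat', 'slept'}
sigs = {'_UNK-LC'}
sent = ['The', 'cat', 'eschewed', 'Paris']
replaced = replace_test_words(sent, toy_signature, lexicon, sigs)
print(replaced)  # ['the', 'cat', '_UNK-LC', '_UNK']
```

Keeping the original sentence alongside the replaced one makes it straightforward to restore the original words in the derivations after parsing.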
-
discodop.lexicon.simplesmoothlexicon(lexmodel, epsilon=0.01)
Collect new lexical productions:
- unobserved combinations of tags with known open class words.
- unobserved signatures, which are mapped to '_UNK'.
Parameters: epsilon – ‘frequency’ of productions for an unseen (tag, word) pair.
Returns: a dictionary of lexical rules, with pseudofrequencies as values.
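The first kind of production, unobserved combinations of open class tags and words, can be illustrated with a minimal sketch. It is an assumption-laden simplification: `unseen_lexical_rules` is a hypothetical helper, and representing a lexical rule as a plain (tag, word) tuple is a choice made for this example, not necessarily the rule format that discodop uses.

```python
def unseen_lexical_rules(seen, openclasstags, openclasswords, epsilon=0.01):
    """Give every unobserved open class (tag, word) pair a pseudofrequency."""
    newrules = {}
    for tag in sorted(openclasstags):
        for word in sorted(openclasswords):
            if (tag, word) not in seen:
                newrules[tag, word] = epsilon  # small pseudo-count
    return newrules


seen = {('NN', 'cat'), ('VB', 'run')}
newrules = unseen_lexical_rules(seen, {'NN', 'VB'}, {'cat', 'run'})
print(newrules)  # {('NN', 'run'): 0.01, ('VB', 'cat'): 0.01}
```

A small epsilon keeps observed productions dominant while still allowing the parser to consider tag and word combinations that the training set happened not to contain.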
-
discodop.lexicon.getlexmodel(sigs, words, _lexicon, wordsfortag, openclasstags, openclasswords, tags, wordtags, wordsig, sigtag, openclassoffset=1, kappa=1)
Compute a smoothed lexical model.
Parameters:
- openclassoffset – for words that only appear with open class tags, add unseen combinations of open class (tag, word) with this count.
- kappa – FIXME; cf. Klein & Manning (2003), footnote 5. http://aclweb.org/anthology/P03-1054
Returns: a dictionary giving P(word_or_sig | tag).
-
discodop.lexicon.smoothlexicon(grammar, P_wordtag)
Replace lexical probabilities using given unknown word model.
Ignores lexical productions of known subtrees (tag contains ‘@’) introduced by DOP, i.e., we only modify lexical depth 1 subtrees.
-
discodop.lexicon.unknownword6(word, loc, lexicon)
Model 6 of the Stanford parser (for WSJ treebank).
-
discodop.lexicon.unknownword4(word, loc, _lexicon)
Model 4 of the Stanford parser. Relatively language agnostic.
-
discodop.lexicon.unknownwordbase(word, _loc, _lexicon)
BaseUnknownWordModel of the Stanford parser. Relatively language agnostic.
-
discodop.lexicon.unknownwordftb(word, loc, _lexicon)
Model 2 for French of the Stanford parser.