discodop.lexicon
Add rules to handle unknown words and smooth lexical probabilities.
Rare words in the training set are replaced with word signatures, such that unknown words can receive similar tags. Given a function to produce such signatures from words, the flow is as follows:
- Simple lexical smoothing:
  - getunknownwordmodel (get statistics)
  - replaceraretrainwords (adjust trees)
  - [ read off grammar ]
  - simplesmoothlexicon (add extra lexical productions)
- During parsing:
  - replaceraretestwords (only give known words and signatures to parser)
  - restore original words in derivations
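The flow above can be sketched in plain Python. This is a toy illustration that does not call discodop: `toy_signature`, the example sentences, and the threshold value are all made up for the example, and tagged sentences stand in for treebank trees.

```python
from collections import Counter


def toy_signature(word, loc, lexicon):
    """Hypothetical signature function: case marker plus last letter."""
    case = 'C' if word[0].isupper() else 'L'
    return '_UNK-%s-%s' % (case, word[-1].lower())


train = [
    [('the', 'DT'), ('cat', 'NN'), ('slept', 'VBD')],
    [('the', 'DT'), ('dog', 'NN'), ('barked', 'VBD')],
    [('the', 'DT'), ('cat', 'NN'), ('barked', 'VBD')],
]

# 1. Get statistics: only words seen more than `threshold` times are kept.
threshold = 1
counts = Counter(word for sent in train for word, _tag in sent)
lexicon = {word for word, n in counts.items() if n > threshold}

# 2. Adjust the training data: rare words become signatures.
smoothed_train = [
    [(word if word in lexicon else toy_signature(word, loc, lexicon), tag)
     for loc, (word, tag) in enumerate(sent)]
    for sent in train
]

# 3. During parsing: the parser only sees known words and signatures.
test = ['the', 'cat', 'purred']
parser_input = [
    word if word in lexicon else toy_signature(word, loc, lexicon)
    for loc, word in enumerate(test)
]

# 4. Restore the original words by position once a derivation is found.
restored = [test[loc] for loc, _ in enumerate(parser_input)]

print(smoothed_train[1])
print(parser_input)
print(restored)
```

A grammar read off `smoothed_train` contains lexical rules for signatures such as `_UNK-L-g`, which is what lets an unseen test word with a similar shape receive a tag.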
Functions
accuracy(gold, cand) | Compute fraction of equivalent pairs in two sequences.
externaltagging(usetagger, model, sents, …) | Use an external tool to tag a list of sentences.
getunknownwordmodel(tagged_sents, …) | Collect statistics for an unknown word model.
replaceraretestwords(sent, unknownword, …) | Replace test set words not in lexicon w/signature from unknownword().
replaceraretrainwords(tagged_sents, …) | Replace train set words not in lexicon w/signature from unknownword().
simplesmoothlexicon(lexmodel[, epsilon]) | Collect new lexical productions.
tagmangle(a, splitchar, overridetag, tagmap) | Function to filter tags after they are produced by the tagger.
unknownword4(word, loc, _lexicon) | Model 4 of the Stanford parser.
unknownword6(word, loc, lexicon) | Model 6 of the Stanford parser (for WSJ treebank).
unknownwordbase(word, _loc, _lexicon) | BaseUnknownWordModel of the Stanford parser.
unknownwordftb(word, loc, _lexicon) | Model 2 for French of the Stanford parser.
discodop.lexicon.getunknownwordmodel(tagged_sents, unknownword, unknownthreshold, openclassthreshold)
Collect statistics for an unknown word model.
Parameters:
- tagged_sents – the sentences from the training set with the gold POS tags from the treebank.
- unknownword – a function that returns a signature for a given word; e.g., “eschewed” => “_UNK-L-d”.
- unknownthreshold – words with frequency lower than or equal to this are replaced by their signature.
- openclassthreshold – tags that rewrite to at least this many word types are considered open class categories, so that open class words and tags can be identified.
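To make the two thresholds concrete, here is a minimal sketch of the kind of statistics described by the parameters. It is not discodop's implementation: `sketch_unknownword_stats` and the three values it returns are assumptions for illustration, and the real function's return value is not documented here.

```python
from collections import Counter, defaultdict


def sketch_unknownword_stats(tagged_sents, unknownword,
                             unknownthreshold, openclassthreshold):
    """Hypothetical helper; mirrors the documented parameters only."""
    wordcounts = Counter(
        word for sent in tagged_sents for word, _tag in sent)
    # Words with frequency <= unknownthreshold are treated as rare.
    lexicon = {w for w, n in wordcounts.items() if n > unknownthreshold}
    typesfortag = defaultdict(set)  # tag -> distinct word types
    sigtag = Counter()              # (signature, tag) -> frequency
    for sent in tagged_sents:
        for loc, (word, tag) in enumerate(sent):
            typesfortag[tag].add(word)
            if word not in lexicon:
                sigtag[unknownword(word, loc, lexicon), tag] += 1
    # Tags with at least openclassthreshold word types are open class.
    openclasstags = {tag for tag, types in typesfortag.items()
                     if len(types) >= openclassthreshold}
    return lexicon, sigtag, openclasstags


sents = [
    [('the', 'DT'), ('cat', 'NN')],
    [('the', 'DT'), ('dog', 'NN')],
    [('the', 'DT'), ('eschewed', 'VBD')],
]
lexicon, sigtag, openclasstags = sketch_unknownword_stats(
    sents, lambda word, loc, lexicon: '_UNK',
    unknownthreshold=1, openclassthreshold=2)
print(lexicon, dict(sigtag), openclasstags)
```

With these toy sentences only `the` is frequent enough to stay in the lexicon, and `NN` is the only tag with two distinct word types, so it is the only open class tag.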
discodop.lexicon.replaceraretrainwords(tagged_sents, unknownword, lexicon)
Replace train set words not in lexicon w/signature from unknownword().
discodop.lexicon.replaceraretestwords(sent, unknownword, lexicon, sigs)
Replace test set words not in lexicon w/signature from unknownword().
If only a lowercase version of a word is in the grammar, that will be used instead. If the returned signature is not part of the grammar, a default one is returned.
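The fallback order described here can be sketched as follows. This is an illustrative re-implementation, not discodop's code: the function name, the `default` parameter, and the choice of `_UNK` as the default signature are assumptions.

```python
def sketch_replace_test_words(sent, unknownword, lexicon, sigs,
                              default='_UNK'):
    """Hypothetical helper showing the lookup order for test words."""
    result = []
    for loc, word in enumerate(sent):
        if word in lexicon:
            result.append(word)          # known word: keep as-is
        elif word.lower() in lexicon:
            result.append(word.lower())  # only the lowercase form is known
        else:
            sig = unknownword(word, loc, lexicon)
            # Fall back to the default if the signature was never observed.
            result.append(sig if sig in sigs else default)
    return result


def last_letter_signature(word, loc, lexicon):
    """Made-up signature function for the example."""
    return '_UNK-' + word[-1].lower()


print(sketch_replace_test_words(
    ['The', 'cat', 'purred', 'quickly'],
    last_letter_signature,
    lexicon={'the', 'cat'},
    sigs={'_UNK-d'}))
```

Here `The` falls back to its lowercase form, `purred` maps to an observed signature, and `quickly` yields a signature that was never observed, so it receives the default.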
discodop.lexicon.simplesmoothlexicon(lexmodel, epsilon=0.01)
Collect new lexical productions.
- for rare words, include productions with words in addition to signatures.
- map unobserved signatures to _UNK and associate w/all potential tags.
- (unobserved combinations of open class (word, tag) handled in parser).
Parameters: epsilon – pseudo-frequency of unseen productions tag => word.
Returns: a dictionary of lexical rules, with pseudo-frequencies as values.
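A rough sketch of the first two points, under stated assumptions: the structure of `lexmodel` is not documented here, so this hypothetical helper takes the relevant pieces as separate arguments, keys rules as `(tag, word)` pairs, keeps observed counts for rare words, and treats "all potential tags" as the open class tags. None of this is confirmed as discodop's behavior.

```python
def sketch_smooth_lexicon(rare_word_tags, open_class_tags, epsilon=0.01):
    """Hypothetical helper; returns {(tag, word): pseudo-frequency}.

    rare_word_tags: dict mapping (word, tag) pairs that were replaced by
        signatures in training to their observed frequency (assumed input).
    open_class_tags: tags allowed to rewrite to the generic _UNK token
        (assumed to be the open class tags).
    """
    newrules = {}
    # Rare words: add a production for the word itself, next to the
    # production for its signature that the grammar already contains.
    for (word, tag), freq in rare_word_tags.items():
        newrules[tag, word] = freq
    # Unobserved signatures map to _UNK, available under every
    # candidate tag with a small pseudo-frequency.
    for tag in open_class_tags:
        newrules[tag, '_UNK'] = epsilon
    return newrules


rules = sketch_smooth_lexicon(
    {('slept', 'VBD'): 1}, {'NN', 'VBD'}, epsilon=0.01)
for rule in sorted(rules):
    print(rule, rules[rule])
```

The pseudo-frequencies would still need to be normalized together with the counts read off from the treebank before they can serve as probabilities.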
discodop.lexicon.unknownword6(word, loc, lexicon)
Model 6 of the Stanford parser (for WSJ treebank).

discodop.lexicon.unknownword4(word, loc, _lexicon)
Model 4 of the Stanford parser. Relatively language agnostic.

discodop.lexicon.unknownwordbase(word, _loc, _lexicon)
BaseUnknownWordModel of the Stanford parser. Relatively language agnostic.

discodop.lexicon.unknownwordftb(word, loc, _lexicon)
Model 2 for French of the Stanford parser.
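The signature functions above share the call shape `(word, loc, lexicon)`. The following toy function has the same shape and reproduces the “eschewed” => “_UNK-L-d” example, but it is not a port of any of the Stanford models: the feature set and the marker names are invented for illustration, and it assumes `loc` is the position of the word in the sentence.

```python
def toy_unknownword(word, loc, _lexicon):
    """Toy signature function (hypothetical; not a discodop model)."""
    sig = '_UNK'
    if word[:1].isupper():
        # Distinguish sentence-initial capitals from other capitals.
        sig += '-INITC' if loc == 0 else '-C'
    elif any(ch.islower() for ch in word):
        sig += '-L'
    if any(ch.isdigit() for ch in word):
        sig += '-N'  # contains a digit
    if '-' in word:
        sig += '-H'  # contains a hyphen
    if len(word) > 3 and word[-1].isalpha():
        sig += '-' + word[-1].lower()  # one-letter suffix
    return sig


print(toy_unknownword('eschewed', 2, set()))
print(toy_unknownword('Paris', 3, set()))
print(toy_unknownword('X-ray', 0, set()))
```

Coarser signatures group more unknown words together, which gives more reliable tag statistics per signature at the cost of less specific ones.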