discodop.lexicon
Add rules to handle unknown words and smooth lexical probabilities.
Rare words in the training set are replaced with word signatures, such that unknown words can receive similar tags. Given a function to produce such signatures from words, the flow is as follows:
- Simple lexical smoothing:
  - getunknownwordmodel (get statistics)
  - replaceraretrainwords (adjust trees)
  - [ read off grammar ]
  - simplesmoothlexicon (add extra lexical productions)
- During parsing:
  - replaceraretestwords (only give known words and signatures to parser)
  - restore original words in derivations
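The flow above can be illustrated with a small, self-contained sketch. It does not call discodop itself: `toy_signature`, `replace_rare_train`, and `replace_rare_test` are hypothetical stand-ins written for this example, and the signature scheme is far simpler than the Stanford-style models listed below.

```python
from collections import Counter

UNK = "_UNK"


def toy_signature(word):
    """Hypothetical signature function; not one of discodop's models."""
    sig = UNK
    if word[:1].isupper():
        sig += "-C"  # starts with a capital letter
    elif word.islower():
        sig += "-L"  # entirely lowercase
    if any(ch.isdigit() for ch in word):
        sig += "-N"  # contains a digit
    return sig


def replace_rare_train(tagged_sents, threshold):
    """Replace words occurring at most `threshold` times by a signature."""
    counts = Counter(word for sent in tagged_sents for word, _ in sent)
    lexicon = {word for word, n in counts.items() if n > threshold}
    replaced = [
        [(word if word in lexicon else toy_signature(word), tag)
         for word, tag in sent]
        for sent in tagged_sents]
    return replaced, lexicon


def replace_rare_test(sent, lexicon):
    """Give the parser only known words and signatures."""
    return [word if word in lexicon else toy_signature(word) for word in sent]


train = [
    [("the", "DT"), ("cat", "NN"), ("sleeps", "VBZ")],
    [("the", "DT"), ("dog", "NN"), ("sleeps", "VBZ")],
]
train_replaced, lexicon = replace_rare_train(train, threshold=1)

test_sent = ["the", "Zebra", "sleeps"]
parser_input = replace_rare_test(test_sent, lexicon)
# The original test sentence is kept so that the words can be put back
# into the derivation by position after parsing.
```

In this sketch the rare training words "cat" and "dog" both become `_UNK-L`, so a grammar read off the adjusted trees has a lexical entry that other unseen lowercase words can share at parse time.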
Functions
accuracy(gold, cand) | Compute fraction of equivalent pairs in two sequences.
externaltagging(usetagger, model, sents, …) | Use an external tool to tag a list of sentences.
getunknownwordmodel(tagged_sents, …) | Collect statistics for an unknown word model.
replaceraretestwords(sent, unknownword, …) | Replace test set words not in lexicon w/signature from unknownword().
replaceraretrainwords(tagged_sents, …) | Replace train set words not in lexicon w/signature from unknownword().
simplesmoothlexicon(lexmodel[, epsilon]) | Collect new lexical productions.
tagmangle(a, splitchar, overridetag, tagmap) | Function to filter tags after they are produced by the tagger.
unknownword4(word, loc, _lexicon) | Model 4 of the Stanford parser.
unknownword6(word, loc, lexicon) | Model 6 of the Stanford parser (for WSJ treebank).
unknownwordbase(word, _loc, _lexicon) | BaseUnknownWordModel of the Stanford parser.
unknownwordftb(word, loc, _lexicon) | Model 2 for French of the Stanford parser.
discodop.lexicon.getunknownwordmodel(tagged_sents, unknownword, unknownthreshold, openclassthreshold)
Collect statistics for an unknown word model.
Parameters:
- tagged_sents – the sentences from the training set with the gold POS tags from the treebank.
- unknownword – a function that returns a signature for a given word; e.g., “eschewed” => “_UNK-L-d”.
- unknownthreshold – words with frequency lower than or equal to this are replaced by their signature.
- openclassthreshold – tags that rewrite to at least this many word types are considered to be open class categories, so that open class words and tags can be identified.
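To make the two thresholds concrete, here is a minimal sketch of how such statistics could be gathered. `sketch_signature` and `collect_stats` are hypothetical names for this example; the signature function is a simplification that happens to map "eschewed" to `_UNK-L-d`, and the returned tuple is an assumption, not the structure that getunknownwordmodel actually returns.

```python
from collections import Counter, defaultdict


def sketch_signature(word):
    """Simplified, hypothetical signature: case class plus final letter."""
    sig = "_UNK"
    if word.islower():
        sig += "-L"
    elif word[:1].isupper():
        sig += "-C"
    if len(word) > 3 and word[-1].isalpha():
        sig += "-" + word[-1].lower()
    return sig


def collect_stats(tagged_sents, unknownword, unknownthreshold,
                  openclassthreshold):
    """Return (lexicon, signature counts, open class tags)."""
    words = Counter(word for sent in tagged_sents for word, _ in sent)
    # Words with frequency <= unknownthreshold are treated as rare.
    lexicon = {word for word, n in words.items() if n > unknownthreshold}
    wordsfortag = defaultdict(set)
    sigs = Counter()
    for sent in tagged_sents:
        for word, tag in sent:
            wordsfortag[tag].add(word)
            if word not in lexicon:
                sigs[unknownword(word)] += 1
    # A tag is open class if it rewrites to enough distinct word types.
    openclasstags = {tag for tag, types in wordsfortag.items()
                     if len(types) >= openclassthreshold}
    return lexicon, sigs, openclasstags


sents = [
    [("he", "PRP"), ("eschewed", "VBD"), ("it", "PRP")],
    [("he", "PRP"), ("liked", "VBD"), ("it", "PRP")],
    [("he", "PRP"), ("walked", "VBD"), ("it", "PRP")],
]
lexicon, sigs, openclasstags = collect_stats(
    sents, sketch_signature, unknownthreshold=1, openclassthreshold=3)
```

With these toy sentences, only "he" and "it" are frequent enough to stay in the lexicon, the three rare verbs share one signature, and VBD is the only tag with at least three word types.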
discodop.lexicon.replaceraretrainwords(tagged_sents, unknownword, lexicon)
Replace train set words not in lexicon w/signature from unknownword().
discodop.lexicon.replaceraretestwords(sent, unknownword, lexicon, sigs)
Replace test set words not in lexicon w/signature from unknownword().
If only a lowercase version of a word is in the grammar, that will be used instead. If the returned signature is not part of the grammar, a default one is returned.
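The fallback order described here can be sketched as follows. This is an illustrative reimplementation under stated assumptions, not discodop's code: `replace_test_words` and `sig_fn` are hypothetical, the signature function is assumed to take `(word, loc, lexicon)` like the unknownword* functions on this page, and the default signature is assumed to be `_UNK`.

```python
UNK = "_UNK"  # assumed default signature


def replace_test_words(sent, unknownword, lexicon, sigs):
    """Map each test word to a known word, a known signature, or UNK."""
    result = []
    for loc, word in enumerate(sent):
        if word in lexicon:
            result.append(word)
        elif word.lower() in lexicon:
            # Only the lowercase form is known; use that instead.
            result.append(word.lower())
        else:
            sig = unknownword(word, loc, lexicon)
            # Signatures never seen in training fall back to the default.
            result.append(sig if sig in sigs else UNK)
    return result


def sig_fn(word, loc, lexicon):
    """Hypothetical two-way signature: capitalized or not."""
    return "_UNK-C" if word[:1].isupper() else "_UNK-L"


lexicon = {"the", "cat"}
sigs = {"_UNK-L"}  # signatures observed in training
mapped = replace_test_words(
    ["The", "cat", "Purred", "softly"], sig_fn, lexicon, sigs)
```

Here "The" is lowercased to a known word, "softly" receives a signature that was seen in training, and "Purred" produces an unseen signature and therefore gets the default.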
discodop.lexicon.simplesmoothlexicon(lexmodel, epsilon=0.01)
Collect new lexical productions.
- include productions for rare words with the words themselves, in addition to their signatures.
- map unobserved signatures to _UNK and associate w/all potential tags.
- (unobserved combinations of open class (word, tag) handled in parser).
Parameters: epsilon – pseudo-frequency of unseen productions tag => word.
Returns: a dictionary of lexical rules, with pseudo-frequencies as values.
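As a rough illustration of the first two points, the sketch below builds a dictionary of extra lexical rules. Everything in it is an assumption made for the example: the function name `smooth_lexicon`, its separate arguments (the real function takes a single lexmodel object), the `(tag, word)` key format, and the choice to give rare words their observed count and `_UNK` a flat epsilon.

```python
UNK = "_UNK"


def smooth_lexicon(wordtags, lexicon, potential_tags, epsilon=0.01):
    """Return {(tag, word): pseudo-frequency} for extra lexical rules.

    wordtags: observed (word, tag) counts from the training set.
    lexicon: words frequent enough to be kept as-is.
    potential_tags: tags that an unknown word may receive (assumed input).
    """
    newrules = {}
    # Rare words also get a production under their own form,
    # in addition to the one under their signature.
    for (word, tag), freq in wordtags.items():
        if word not in lexicon:
            newrules[tag, word] = freq
    # Unobserved signatures are mapped to UNK, which is associated
    # with every potential tag at a small pseudo-frequency.
    for tag in potential_tags:
        newrules[tag, UNK] = epsilon
    return newrules


wordtags = {("cat", "NN"): 1, ("the", "DT"): 5}
rules = smooth_lexicon(wordtags, lexicon={"the"},
                       potential_tags={"NN", "VB"})
```

The values are pseudo-frequencies rather than probabilities, so they would still need to be normalized together with the counts from the grammar.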
discodop.lexicon.unknownword6(word, loc, lexicon)
Model 6 of the Stanford parser (for WSJ treebank).
discodop.lexicon.unknownword4(word, loc, _lexicon)
Model 4 of the Stanford parser. Relatively language agnostic.
discodop.lexicon.unknownwordbase(word, _loc, _lexicon)
BaseUnknownWordModel of the Stanford parser. Relatively language agnostic.
discodop.lexicon.unknownwordftb(word, loc, _lexicon)
Model 2 for French of the Stanford parser.