discodop.lexicon

Add rules to handle unknown words and smooth lexical probabilities.

Rare words in the training set are replaced with word signatures, so that unknown words encountered during parsing can receive tags similar to those of training words with the same signature. Given a function to produce such signatures from words, the flow is as follows:

Simple lexical smoothing:
1. getunknownwordmodel (get statistics)
2. replaceraretrainwords (adjust trees)
3. [ read off grammar ]
4. simplesmoothlexicon (add extra lexical productions)

During parsing:
1. replaceraretestwords (pass only known words and signatures to the parser)
2. restore the original words in the derivations
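
The flow above can be illustrated with a small, self-contained sketch. It does not call discodop; `toy_signature`, `replace_rare_train_words`, `replace_rare_test_words`, and `restore_words` are hypothetical stand-ins written for this example, and the signature scheme is a simplification rather than one of the Stanford parser models.

```python
from collections import Counter


def toy_signature(word):
    """Hypothetical signature: capitalization, digits, and last letter."""
    sig = '_UNK'
    if word[:1].isupper():
        sig += '-C'
    elif word.islower():
        sig += '-L'
    if any(ch.isdigit() for ch in word):
        sig += '-N'
    if len(word) > 3 and word[-1].isalpha():
        sig += '-' + word[-1].lower()
    return sig


def replace_rare_train_words(tagged_sents, threshold):
    """Training: count words, then rewrite those seen <= threshold times."""
    counts = Counter(word for sent in tagged_sents for word, _tag in sent)
    lexicon = {word for word, n in counts.items() if n > threshold}
    replaced = [
        [(word if word in lexicon else toy_signature(word), tag)
         for word, tag in sent]
        for sent in tagged_sents]
    return replaced, lexicon


def replace_rare_test_words(sent, lexicon):
    """Parsing step 1: the parser only sees known words and signatures."""
    return [word if word in lexicon else toy_signature(word) for word in sent]


def restore_words(tagged_tokens, original_sent):
    """Parsing step 2: put the original words back, position by position."""
    return [(tag, word)
            for (tag, _token), word in zip(tagged_tokens, original_sent)]


train = [
    [('the', 'DT'), ('dog', 'NN'), ('barked', 'VBD')],
    [('the', 'DT'), ('cat', 'NN'), ('eschewed', 'VBD')],
]
train_replaced, lexicon = replace_rare_train_words(train, threshold=1)
test_sent = ['the', 'Zebra', 'jumped']
parser_input = replace_rare_test_words(test_sent, lexicon)
```

Here only `the` occurs more than once, so every other training word is rewritten to its signature before the grammar would be read off, and the test sentence is mapped into the same vocabulary.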

Functions

`accuracy`(gold, cand)	Compute fraction of equivalent pairs in two sequences.
`externaltagging`(usetagger, model, sents, …)	Use an external tool to tag a list of sentences.
`getunknownwordmodel`(tagged_sents, …)	Collect statistics for an unknown word model.
`replaceraretestwords`(sent, unknownword, …)	Replace test set words not in the lexicon with the signature from unknownword().
`replaceraretrainwords`(tagged_sents, …)	Replace train set words not in the lexicon with the signature from unknownword().
`simplesmoothlexicon`(lexmodel[, epsilon])	Collect new lexical productions.
`tagmangle`(a, splitchar, overridetag, tagmap)	Function to filter tags after they are produced by the tagger.
`unknownword4`(word, loc, _lexicon)	Model 4 of the Stanford parser.
`unknownword6`(word, loc, lexicon)	Model 6 of the Stanford parser (for WSJ treebank).
`unknownwordbase`(word, _loc, _lexicon)	BaseUnknownWordModel of the Stanford parser.
`unknownwordftb`(word, loc, _lexicon)	Model 2 for French of the Stanford parser.

discodop.lexicon.getunknownwordmodel(tagged_sents, unknownword, unknownthreshold, openclassthreshold)

Collect statistics for an unknown word model.

Parameters:

tagged_sents – the sentences from the training set with the gold POS tags from the treebank.
unknownword – a function that returns a signature for a given word; e.g., “eschewed” => “_UNK-L-d”.
unknownthreshold – words with frequency lower than or equal to this are replaced by their signature.
openclassthreshold – tags that rewrite to at least this many word types are considered to be open class categories, so that open class words and tags can be identified.
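
As a rough illustration of how the two thresholds could be applied, the sketch below collects word counts and the set of word types per tag. The function name `collect_unknown_word_stats` and the dictionary it returns are assumptions made for this example; the return value of `getunknownwordmodel` is not documented here and may be structured differently.

```python
from collections import Counter, defaultdict


def collect_unknown_word_stats(tagged_sents, unknownthreshold,
                               openclassthreshold):
    """Hypothetical helper: derive rare words and open class tags."""
    wordcounts = Counter()
    typesfortag = defaultdict(set)
    for sent in tagged_sents:
        for word, tag in sent:
            wordcounts[word] += 1
            typesfortag[tag].add(word)
    # Words with frequency <= unknownthreshold are treated as rare.
    rarewords = {w for w, n in wordcounts.items() if n <= unknownthreshold}
    # Tags that rewrite to >= openclassthreshold word types are open class.
    openclasstags = {tag for tag, words in typesfortag.items()
                     if len(words) >= openclassthreshold}
    return {
        'lexicon': set(wordcounts) - rarewords,
        'rarewords': rarewords,
        'openclasstags': openclasstags,
    }


train = [
    [('the', 'DT'), ('dog', 'NN'), ('barked', 'VBD')],
    [('the', 'DT'), ('cat', 'NN'), ('slept', 'VBD')],
    [('the', 'DT'), ('dog', 'NN'), ('ran', 'VBD')],
]
stats = collect_unknown_word_stats(train, unknownthreshold=1,
                                   openclassthreshold=2)
```

In this toy corpus the determiner tag rewrites to a single word type and so stays closed class, while the noun and verb tags cross the threshold.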

discodop.lexicon.replaceraretrainwords(tagged_sents, unknownword, lexicon)

Replace train set words not in the lexicon with the signature from unknownword().

discodop.lexicon.replaceraretestwords(sent, unknownword, lexicon, sigs)

Replace test set words not in the lexicon with the signature from unknownword().

If only a lowercase version of a word is in the grammar, that will be used instead. If the returned signature is not part of the grammar, a default one is returned.
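
The fallback order described above (known word, then lowercase form, then signature, then default) might look like the following sketch. It is not discodop's implementation: `replace_test_words` and `toy_signature` are hypothetical, and using `_UNK` as the default signature is an assumption for the example.

```python
def toy_signature(word, loc, lexicon):
    """Hypothetical signature function taking (word, loc, lexicon)."""
    if word[:1].isupper():
        # A capital at the start of a sentence is less informative.
        return '_UNK-INITC' if loc == 0 else '_UNK-C'
    return '_UNK-L'


def replace_test_words(sent, unknownword, lexicon, sigs, default='_UNK'):
    """Map each test word to a token the grammar is assumed to know."""
    result = []
    for loc, word in enumerate(sent):
        if word in lexicon:
            result.append(word)
        elif word.lower() in lexicon:
            # Only the lowercase version is known: use that instead.
            result.append(word.lower())
        else:
            sig = unknownword(word, loc, lexicon)
            # Fall back to the default if this signature was never observed.
            result.append(sig if sig in sigs else default)
    return result


lexicon = {'the', 'dog', 'barks'}
sigs = {'_UNK-L', '_UNK-C'}
first = replace_test_words(
    ['The', 'dog', 'Barks', 'at', 'Zoe'], toy_signature, lexicon, sigs)
second = replace_test_words(['Zoe', 'runs'], toy_signature, lexicon, sigs)
```

In the second sentence the signature `_UNK-INITC` is not among the observed signatures, so the default is used for `Zoe`.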

discodop.lexicon.simplesmoothlexicon(lexmodel, epsilon=0.01)

Collect new lexical productions.

For rare words, include productions with the words themselves in addition to their signatures.
Map unobserved signatures to _UNK and associate it with all potential tags.
Unobserved combinations of open class (word, tag) pairs are handled in the parser.

Parameters:	epsilon – pseudo-frequency of unseen productions `tag => word`.
Returns:	a dictionary of lexical rules, with pseudo-frequencies as values.
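
A simplified sketch of this kind of smoothing is given below. The function `smooth_lexicon`, its arguments, and the `(tag, word)` key format are assumptions for illustration; the real `lexmodel` structure and rule format are not shown on this page. Keeping the observed count for rare words and giving `_UNK` a flat `epsilon` per tag is likewise a simplification chosen for the example.

```python
def smooth_lexicon(wordtagcounts, lexicon, tags, epsilon=0.01):
    """Hypothetical helper: return extra lexical rules with pseudo-counts."""
    newrules = {}
    # Rare words were replaced by signatures in the trees; also keep a
    # production that rewrites the tag to the word itself.
    for (word, tag), count in wordtagcounts.items():
        if word not in lexicon:
            newrules[tag, word] = count
    # Catch-all: any tag may rewrite to _UNK with a small pseudo-frequency.
    for tag in tags:
        newrules[tag, '_UNK'] = epsilon
    return newrules


wordtagcounts = {
    ('the', 'DT'): 3, ('dog', 'NN'): 2, ('cat', 'NN'): 1, ('ran', 'VBD'): 1,
}
rules = smooth_lexicon(
    wordtagcounts, lexicon={'the', 'dog'}, tags={'DT', 'NN', 'VBD'})
```

The resulting dictionary would still need to be merged with the grammar's existing lexical rules and normalized before it yields probabilities.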

discodop.lexicon.unknownword6(word, loc, lexicon)

Model 6 of the Stanford parser (for WSJ treebank).

discodop.lexicon.unknownword4(word, loc, _lexicon)

Model 4 of the Stanford parser. Relatively language agnostic.

discodop.lexicon.unknownwordbase(word, _loc, _lexicon)

BaseUnknownWordModel of the Stanford parser. Relatively language agnostic.

discodop.lexicon.unknownwordftb(word, loc, _lexicon)

Model 2 for French of the Stanford parser.

discodop.lexicon.externaltagging(usetagger, model, sents, overridetag, tagmap)

Use an external tool to tag a list of sentences.

discodop.lexicon.tagmangle(a, splitchar, overridetag, tagmap)

Function to filter tags after they are produced by the tagger.