discodop.lexicon

Add rules to handle unknown words and smooth lexical probabilities.

Rare words in the training set are replaced with word signatures, so that unknown words encountered at test time can receive the same tags as rare training words with a matching signature. Given a function to produce such signatures from words, the flow is as follows:

  • Simple lexical smoothing:
    1. getunknownwordmodel (get statistics)
    2. replaceraretrainwords (adjust trees)
    3. [ read off grammar ]
    4. simplesmoothlexicon (add extra lexical productions)
  • During parsing:
    1. replaceraretestwords (only give known words and signatures to parser)
    2. restore original words in derivations
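
The flow above can be illustrated with a small, self-contained sketch. It does not call discodop; `toy_signature`, `replace_rare_train`, `replace_rare_test`, and `restore_words` are hypothetical stand-ins that only mimic the idea of each step, under the assumption that a tagged sentence is a list of (word, tag) pairs.

```python
from collections import Counter


def toy_signature(word):
    """Hypothetical signature function; not one of discodop's models."""
    return '_UNK-U' if word[:1].isupper() else '_UNK-L'


def replace_rare_train(tagged_sents, threshold):
    """Replace words seen at most `threshold` times with their signature."""
    counts = Counter(word for sent in tagged_sents for word, _ in sent)
    lexicon = {word for word, n in counts.items() if n > threshold}
    new_sents = [
        [(word if word in lexicon else toy_signature(word), tag)
         for word, tag in sent]
        for sent in tagged_sents]
    return new_sents, lexicon


def replace_rare_test(sent, lexicon):
    """Keep known words; map every other word to its signature."""
    return [word if word in lexicon else toy_signature(word)
            for word in sent]


def restore_words(tagged_signatures, original_sent):
    """Put the original words back, position by position, after parsing."""
    return [(word, tag)
            for word, (_, tag) in zip(original_sent, tagged_signatures)]


train = [
    [('the', 'DT'), ('cat', 'NN'), ('sleeps', 'VBZ')],
    [('the', 'DT'), ('dog', 'NN'), ('sleeps', 'VBZ')],
]
new_train, lexicon = replace_rare_train(train, threshold=1)
test_sent = ['the', 'Zebra', 'sleeps']
parser_input = replace_rare_test(test_sent, lexicon)
```

Here "cat" and "dog" each occur once, so both collapse to the same signature in training, and the unseen test word "Zebra" is handed to the parser as a signature as well.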

Functions

accuracy(gold, cand) Compute fraction of equivalent pairs in two sequences.
externaltagging(usetagger, model, sents, …) Use an external tool to tag a list of sentences.
getunknownwordmodel(tagged_sents, …) Collect statistics for an unknown word model.
replaceraretestwords(sent, unknownword, …) Replace test set words not in lexicon with a signature from unknownword().
replaceraretrainwords(tagged_sents, …) Replace train set words not in lexicon with a signature from unknownword().
simplesmoothlexicon(lexmodel[, epsilon]) Collect new lexical productions.
tagmangle(a, splitchar, overridetag, tagmap) Function to filter tags after they are produced by the tagger.
unknownword4(word, loc, _lexicon) Model 4 of the Stanford parser.
unknownword6(word, loc, lexicon) Model 6 of the Stanford parser (for WSJ treebank).
unknownwordbase(word, _loc, _lexicon) BaseUnknownWordModel of the Stanford parser.
unknownwordftb(word, loc, _lexicon) Model 2 for French of the Stanford parser.
discodop.lexicon.getunknownwordmodel(tagged_sents, unknownword, unknownthreshold, openclassthreshold)

Collect statistics for an unknown word model.

Parameters:
  • tagged_sents – the sentences from the training set with the gold POS tags from the treebank.
  • unknownword – a function that returns a signature for a given word; e.g., “eschewed” => “_UNK-L-d”.
  • unknownthreshold – words with frequency lower than or equal to this are replaced by their signature.
  • openclassthreshold – tags that rewrite to at least this many word types are considered to be open class categories, so that open class words and tags can be identified.
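
As a rough illustration of how the two thresholds partition the data, the sketch below counts word frequencies and word types per tag. It is not discodop's implementation: the function name `toy_unknown_word_stats` and its two-set return value are assumptions made for this example, and the real return value of getunknownwordmodel is not documented on this page.

```python
from collections import Counter, defaultdict


def toy_unknown_word_stats(tagged_sents, unknownthreshold, openclassthreshold):
    """Return (rare words, open class tags) for a list of tagged sentences.

    Hypothetical helper; each sentence is a list of (word, tag) pairs.
    """
    wordcounts = Counter(word for sent in tagged_sents for word, _ in sent)
    types_per_tag = defaultdict(set)
    for sent in tagged_sents:
        for word, tag in sent:
            types_per_tag[tag].add(word)
    # Words at or below the frequency threshold get a signature.
    rare = {word for word, n in wordcounts.items() if n <= unknownthreshold}
    # Tags seen with enough distinct word types count as open class.
    openclass = {tag for tag, words in types_per_tag.items()
                 if len(words) >= openclassthreshold}
    return rare, openclass


sents = [
    [('the', 'DT'), ('cat', 'NN')],
    [('the', 'DT'), ('dog', 'NN')],
    [('the', 'DT'), ('bird', 'NN')],
]
rare, openclass = toy_unknown_word_stats(
    sents, unknownthreshold=1, openclassthreshold=3)
```

In this toy corpus the tag NN rewrites to three word types and so qualifies as open class, while DT rewrites to a single type and does not.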
discodop.lexicon.replaceraretrainwords(tagged_sents, unknownword, lexicon)

Replace train set words not in lexicon with a signature from unknownword().

discodop.lexicon.replaceraretestwords(sent, unknownword, lexicon, sigs)

Replace test set words not in lexicon with a signature from unknownword().

If only a lowercase version of a word is in the grammar, that will be used instead. If the returned signature is not part of the grammar, a default one is returned.
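
The fallback order described here (known word, then lowercase form, then signature, then a default) can be sketched as follows. This is a simplified illustration rather than discodop's code: `toy_replace_test_words` is a hypothetical name, `sigs` is assumed to be the set of signatures known to the grammar, and the default signature `'_UNK'` is an assumption.

```python
def toy_replace_test_words(sent, signature, lexicon, sigs, default='_UNK'):
    """Map each word of `sent` to something the grammar knows.

    Hypothetical helper: `signature` is any function from word to
    signature string, `lexicon` the set of known words, and `sigs`
    the set of signatures that occur in the grammar (assumed).
    """
    result = []
    for word in sent:
        if word in lexicon:
            result.append(word)
        elif word.lower() in lexicon:
            # Only the lowercase version is known: use that instead.
            result.append(word.lower())
        else:
            sig = signature(word)
            # Signatures the grammar has never seen fall back to a default.
            result.append(sig if sig in sigs else default)
    return result


def simple_sig(word):
    """Toy signature function used only for this example."""
    return '_UNK-U' if word[:1].isupper() else '_UNK-L'


replaced = toy_replace_test_words(
    ['The', 'zebra', 'Okapi', 'sleeps'],
    simple_sig,
    lexicon={'the', 'sleeps'},
    sigs={'_UNK-L'})
```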

discodop.lexicon.simplesmoothlexicon(lexmodel, epsilon=0.01)

Collect new lexical productions.

  • Include productions that rewrite to the rare words themselves, in addition to the productions for their signatures.
  • Map unobserved signatures to _UNK and associate it with all potential tags.
  • Unobserved combinations of open class (word, tag) pairs are handled in the parser.
Parameters:
  • epsilon – pseudo-frequency of unseen productions tag => word.
Returns: a dictionary of lexical rules, with pseudo-frequencies as values.
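
A minimal sketch of the smoothing idea, assuming lexical rules are keyed by (tag, word) pairs. The name `toy_smooth_lexicon`, its arguments, and the choice to reuse observed counts as pseudo-frequencies for rare words are all assumptions for illustration; the actual rule format and weights used by simplesmoothlexicon are not specified on this page.

```python
def toy_smooth_lexicon(rare_word_tags, potential_tags, epsilon=0.01):
    """Collect extra lexical productions as {(tag, word): pseudo-frequency}.

    Hypothetical helper. `rare_word_tags` maps (tag, word) to the count
    observed before the word was replaced by its signature;
    `potential_tags` lists the tags an unknown word may receive.
    """
    newrules = {}
    # Rare words keep a production of their own next to the signature.
    for (tag, word), freq in rare_word_tags.items():
        newrules[tag, word] = freq
    # Unobserved signatures are mapped to _UNK under every potential tag.
    for tag in potential_tags:
        newrules.setdefault((tag, '_UNK'), epsilon)
    return newrules


rules = toy_smooth_lexicon({('NN', 'cat'): 1}, ['NN', 'VB'])
```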
discodop.lexicon.unknownword6(word, loc, lexicon)

Model 6 of the Stanford parser (for WSJ treebank).

discodop.lexicon.unknownword4(word, loc, _lexicon)

Model 4 of the Stanford parser. Relatively language agnostic.

discodop.lexicon.unknownwordbase(word, _loc, _lexicon)

BaseUnknownWordModel of the Stanford parser. Relatively language agnostic.

discodop.lexicon.unknownwordftb(word, loc, _lexicon)

Model 2 for French of the Stanford parser.
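
The signature functions above all take a word, its position in the sentence, and a lexicon. The toy function below has the same call shape and shows the kind of surface features a signature can encode. Its feature set is invented for illustration and is not a port of any Stanford parser model; it was only chosen so that "eschewed" yields "_UNK-L-d".

```python
def toy_unknownword(word, loc, lexicon):
    """Build a signature string from simple surface features.

    Hypothetical example; `loc` is the word's index in the sentence and
    `lexicon` a set of known words.
    """
    sig = '_UNK'
    if any(char.isdigit() for char in word):
        sig += '-N'  # contains a digit
    elif word.islower():
        sig += '-L'  # all lowercase
    elif word[:1].isupper():
        # Sentence-initial capitals say less about the word itself.
        sig += '-INITC' if loc == 0 else '-C'
        if word.lower() in lexicon:
            sig += '-KNOWNLC'  # lowercase form is a known word
    if '-' in word:
        sig += '-H'  # contains a hyphen
    if len(word) > 3 and word[-1].isalpha():
        sig += '-' + word[-1].lower()  # final letter as a crude suffix
    return sig
```

For example, `toy_unknownword('eschewed', 1, set())` returns `'_UNK-L-d'`.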

discodop.lexicon.externaltagging(usetagger, model, sents, overridetag, tagmap)

Use an external tool to tag a list of sentences.

discodop.lexicon.tagmangle(a, splitchar, overridetag, tagmap)

Function to filter tags after they are produced by the tagger.