discodop.treebank

Read and write treebanks.

Functions

alpinotree(block[, functions, morphology, …]) Get tree, sent from tree in Alpino format given as etree XML object.
dependencies(root) Lin (1995): A Dependency-based Method for Evaluating […] Parsers.
deplen(deps) Compute dependency length from result of dependencies().
exportsplit(line) Take a line in export format and split into fields.
exporttree(block[, functions, morphology, …]) Get tree, sentence from tree in export format given as list of lines.
ftbtree(block[, functions, morphology, lemmas]) Get tree, sent from tree in FTB format given as etree XML object.
handlefunctions(action, tree[, pos, root, …]) Add function tags to phrasal labels e.g., ‘VP’ => ‘VP-HD’.
handlemorphology(action, lemmaaction, …[, …]) Augment/replace preterminal label with morphological information.
incrementaltreereader(treeinput[, …]) Incremental corpus reader.
numbase(key) Split file name in numeric and string components to use as sort key.
segmentalpino(morphology, functions) Co-routine that accepts one line at a time.
segmentbrackets([strict, robust]) Co-routine that accepts one line at a time.
segmentexport(morphology, functions[, strict]) Co-routine that accepts one line at a time.
writealpinotree(tree, sent, key, commentstr) Return XML string with tree in AlpinoXML format.
writedependencies(tree, sent, fmt) Convert tree to dependencies in mst or conll format.
writeexporttree(tree, sent, key, comment, …) Return string with given tree in Negra’s export format.
writetree(tree, sent, key, fmt[, comment, …]) Convert a tree to a string representation in the given treebank format.

Classes

AlpinoCompactCorpusReader(path[, encoding, …]) Corpus reader for the Alpino compact treebank format (Indexed Corpus).
AlpinoCorpusReader(path[, encoding, …]) Corpus reader for the Dutch Alpino treebank in XML format.
BracketCorpusReader(path[, encoding, …]) Corpus reader for phrase-structures in bracket notation.
CorpusReader(path[, encoding, ensureroot, …]) Abstract corpus reader.
DiscBracketCorpusReader(path[, encoding, …]) A corpus reader for discontinuous trees in bracket notation.
FTBXMLCorpusReader(*args, **kwargs) Corpus reader for the French treebank (FTB) in XML format.
Item(tree, sent, comment, block) A treebank item.
NegraCorpusReader(path[, encoding, …]) Read a corpus in the Negra export format.
TigerXMLCorpusReader(path[, encoding, …]) Corpus reader for the Tiger XML format.
class discodop.treebank.Item(tree, sent, comment, block)[source]

A treebank item.

class discodop.treebank.CorpusReader(path, encoding='utf8', ensureroot=None, punct=None, headrules=None, removeempty=False, functions=None, morphology=None, lemmas=None, modifierrules=None)[source]

Abstract corpus reader.

Parameters:
  • path – filename or pattern of corpus files; e.g., wsj*.mrg.
  • ensureroot – add root node with given label if necessary.
  • removeempty – remove empty nodes and any empty ancestors; a terminal is empty if it is equal to None, ‘’, or ‘-NONE-’.
  • headrules – if given, read rules for assigning heads and apply them by ordering constituents according to their heads.
  • punct

    one of …

    None:leave punctuation as is [default].
    ’move’:move punctuation to appropriate constituents using heuristics.
    ’moveall’:same as ‘move’, but moves all preterminals under root, instead of only recognized punctuation.
    ’prune’:prune away leading & ending quotes & periods, then move.
    ’remove’:eliminate punctuation.
    ’removeall’:eliminate all preterminals directly under root.
    ’root’:attach punctuation directly to root (as in original Negra/Tiger treebanks).
  • functions

    one of …

    None, ‘leave’:leave syntactic labels as is [default].
    ’add’:concatenate grammatical function to syntactic label, separated by a hyphen: e.g., NP => NP-SBJ.
    ’remove’:strip away hyphen-separated grammatical function, e.g., NP-SBJ => NP.
    ’replace’:replace syntactic label with grammatical function, e.g., NP => SBJ.
  • morphology

    one of …

    None, ‘no’:use POS tags as preterminals [default].
    ’add’:concatenate morphological information to POS tags, e.g., DET/sg.def.
    ’replace’:use morphological information as preterminal label
    ’between’:add node with morphological information between POS tag and word, e.g., (DET (sg.def the)).
  • lemmas

    one of …

    None:ignore lemmas [default].
    ’add’:concatenate lemma to terminals, e.g., men/man.
    ’replace’:use lemmas as terminals.
    ’between’:insert lemma as node between POS tag and word.
itertrees(start=None, end=None)[source]
Returns:an iterator returning tuples (key, item) of sentences in corpus, where item is an Item instance with tree, sent, and comment attributes. Useful when the dictionary of all trees in corpus would not fit in memory.
trees()[source]
Returns:an ordered dictionary of parse trees (Tree objects with integer indices as leaves).
sents()[source]
Returns:an ordered dictionary of sentences, each sentence being a list of words.
tagged_sents()[source]
Returns:an ordered dictionary of tagged sentences, each tagged sentence being a list of (word, tag) pairs.
blocks()[source]
Returns:a list of strings containing the raw representation of trees in the original treebank.
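
The effect of the morphology options listed above can be illustrated on a single preterminal. The following is a minimal sketch, not the library's implementation; the function name apply_morphology and its string-based output are assumptions made purely for illustration.

```python
def apply_morphology(action, tag, morph, word):
    """Return a bracketed preterminal for one word (illustrative only).

    The helper name and the string representation are hypothetical;
    they mirror the examples given for the ``morphology`` parameter.
    """
    if action in (None, 'no'):
        # Use the POS tag as preterminal label.
        return '(%s %s)' % (tag, word)
    if action == 'add':
        # Concatenate morphological information to the POS tag.
        return '(%s/%s %s)' % (tag, morph, word)
    if action == 'replace':
        # Use the morphological information as preterminal label.
        return '(%s %s)' % (morph, word)
    if action == 'between':
        # Add a node between the POS tag and the word.
        return '(%s (%s %s))' % (tag, morph, word)
    raise ValueError('unknown morphology action: %r' % (action,))
```

For instance, apply_morphology('between', 'DET', 'sg.def', 'the') produces the string (DET (sg.def the)), matching the example given for ’between’.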
class discodop.treebank.BracketCorpusReader(path, encoding='utf8', ensureroot=None, punct=None, headrules=None, removeempty=False, functions=None, morphology=None, lemmas=None, modifierrules=None)[source]

Corpus reader for phrase-structures in bracket notation.

For example:

(S (NP John) (VP (VB is) (JJ rich)) (. .))
Parameters:
  • path – filename or pattern of corpus files; e.g., wsj*.mrg.
  • ensureroot – add root node with given label if necessary.
  • removeempty – remove empty nodes and any empty ancestors; a terminal is empty if it is equal to None, ‘’, or ‘-NONE-’.
  • headrules – if given, read rules for assigning heads and apply them by ordering constituents according to their heads.
  • punct

    one of …

    None:leave punctuation as is [default].
    ’move’:move punctuation to appropriate constituents using heuristics.
    ’moveall’:same as ‘move’, but moves all preterminals under root, instead of only recognized punctuation.
    ’prune’:prune away leading & ending quotes & periods, then move.
    ’remove’:eliminate punctuation.
    ’removeall’:eliminate all preterminals directly under root.
    ’root’:attach punctuation directly to root (as in original Negra/Tiger treebanks).
  • functions

    one of …

    None, ‘leave’:leave syntactic labels as is [default].
    ’add’:concatenate grammatical function to syntactic label, separated by a hyphen: e.g., NP => NP-SBJ.
    ’remove’:strip away hyphen-separated grammatical function, e.g., NP-SBJ => NP.
    ’replace’:replace syntactic label with grammatical function, e.g., NP => SBJ.
  • morphology

    one of …

    None, ‘no’:use POS tags as preterminals [default].
    ’add’:concatenate morphological information to POS tags, e.g., DET/sg.def.
    ’replace’:use morphological information as preterminal label
    ’between’:add node with morphological information between POS tag and word, e.g., (DET (sg.def the)).
  • lemmas

    one of …

    None:ignore lemmas [default].
    ’add’:concatenate lemma to terminals, e.g., men/man.
    ’replace’:use lemmas as terminals.
    ’between’:insert lemma as node between POS tag and word.
blocks()[source]
Returns:a list of strings containing the raw representation of trees in the original treebank.
class discodop.treebank.DiscBracketCorpusReader(path, encoding='utf8', ensureroot=None, punct=None, headrules=None, removeempty=False, functions=None, morphology=None, lemmas=None, modifierrules=None)[source]

A corpus reader for discontinuous trees in bracket notation.

Leaves consist of an index and a word, with the indices indicating the word order of the sentence. For example:

(S (NP 1=John) (VP (VB 0=is) (JJ 2=rich)) (? 3=?))

There is one tree per line. Optionally, the tree may be followed by a comment, separated by a TAB. Compared to Negra’s export format, this format lacks morphology, lemmas and functional edges. On the other hand, it is close to the internal representation employed here, so it can be read efficiently.
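
To make the format concrete, here is a minimal sketch of a parser for a single line in this notation. It is not the reader's actual implementation; the name parse_discbracket and the nested-tuple tree representation are assumptions chosen for illustration.

```python
import re

# A token is an opening bracket, a closing bracket, or a run of other
# non-space characters (a label or an "index=word" leaf).
TOKEN = re.compile(r'\(|\)|[^\s()]+')


def parse_discbracket(line):
    """Parse one line into (tree, sent, comment); illustrative only.

    tree is a nested tuple (label, child, ...) whose leaves are integer
    indices; sent lists the words in sentence order; comment is the text
    after the first TAB, or None if there is none.
    """
    treepart, _, comment = line.rstrip('\n').partition('\t')
    tokens = TOKEN.findall(treepart)
    words = {}
    pos = 0

    def node():
        nonlocal pos
        if tokens[pos] != '(':
            raise ValueError('expected "(" at token %d' % pos)
        label = tokens[pos + 1]
        pos += 2
        children = []
        while tokens[pos] != ')':
            if tokens[pos] == '(':
                children.append(node())
            else:
                # A leaf has the form index=word.
                index, _, word = tokens[pos].partition('=')
                words[int(index)] = word
                children.append(int(index))
                pos += 1
        pos += 1  # skip the closing bracket
        return (label,) + tuple(children)

    tree = node()
    sent = [words[i] for i in sorted(words)]
    return tree, sent, comment or None
```

Applied to the example tree above, the sketch recovers the sentence in surface order (is, John, rich, ?) even though the leaves appear in a different order in the bracketing.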

Parameters:
  • path – filename or pattern of corpus files; e.g., wsj*.mrg.
  • ensureroot – add root node with given label if necessary.
  • removeempty – remove empty nodes and any empty ancestors; a terminal is empty if it is equal to None, ‘’, or ‘-NONE-’.
  • headrules – if given, read rules for assigning heads and apply them by ordering constituents according to their heads.
  • punct

    one of …

    None:leave punctuation as is [default].
    ’move’:move punctuation to appropriate constituents using heuristics.
    ’moveall’:same as ‘move’, but moves all preterminals under root, instead of only recognized punctuation.
    ’prune’:prune away leading & ending quotes & periods, then move.
    ’remove’:eliminate punctuation.
    ’removeall’:eliminate all preterminals directly under root.
    ’root’:attach punctuation directly to root (as in original Negra/Tiger treebanks).
  • functions

    one of …

    None, ‘leave’:leave syntactic labels as is [default].
    ’add’:concatenate grammatical function to syntactic label, separated by a hyphen: e.g., NP => NP-SBJ.
    ’remove’:strip away hyphen-separated grammatical function, e.g., NP-SBJ => NP.
    ’replace’:replace syntactic label with grammatical function, e.g., NP => SBJ.
  • morphology

    one of …

    None, ‘no’:use POS tags as preterminals [default].
    ’add’:concatenate morphological information to POS tags, e.g., DET/sg.def.
    ’replace’:use morphological information as preterminal label
    ’between’:add node with morphological information between POS tag and word, e.g., (DET (sg.def the)).
  • lemmas

    one of …

    None:ignore lemmas [default].
    ’add’:concatenate lemma to terminals, e.g., men/man.
    ’replace’:use lemmas as terminals.
    ’between’:insert lemma as node between POS tag and word.
class discodop.treebank.NegraCorpusReader(path, encoding='utf8', ensureroot=None, punct=None, headrules=None, removeempty=False, functions=None, morphology=None, lemmas=None, modifierrules=None)[source]

Read a corpus in the Negra export format.

Parameters:
  • path – filename or pattern of corpus files; e.g., wsj*.mrg.
  • ensureroot – add root node with given label if necessary.
  • removeempty – remove empty nodes and any empty ancestors; a terminal is empty if it is equal to None, ‘’, or ‘-NONE-’.
  • headrules – if given, read rules for assigning heads and apply them by ordering constituents according to their heads.
  • punct

    one of …

    None:leave punctuation as is [default].
    ’move’:move punctuation to appropriate constituents using heuristics.
    ’moveall’:same as ‘move’, but moves all preterminals under root, instead of only recognized punctuation.
    ’prune’:prune away leading & ending quotes & periods, then move.
    ’remove’:eliminate punctuation.
    ’removeall’:eliminate all preterminals directly under root.
    ’root’:attach punctuation directly to root (as in original Negra/Tiger treebanks).
  • functions

    one of …

    None, ‘leave’:leave syntactic labels as is [default].
    ’add’:concatenate grammatical function to syntactic label, separated by a hyphen: e.g., NP => NP-SBJ.
    ’remove’:strip away hyphen-separated grammatical function, e.g., NP-SBJ => NP.
    ’replace’:replace syntactic label with grammatical function, e.g., NP => SBJ.
  • morphology

    one of …

    None, ‘no’:use POS tags as preterminals [default].
    ’add’:concatenate morphological information to POS tags, e.g., DET/sg.def.
    ’replace’:use morphological information as preterminal label
    ’between’:add node with morphological information between POS tag and word, e.g., (DET (sg.def the)).
  • lemmas

    one of …

    None:ignore lemmas [default].
    ’add’:concatenate lemma to terminals, e.g., men/man.
    ’replace’:use lemmas as terminals.
    ’between’:insert lemma as node between POS tag and word.
blocks()[source]
Returns:a list of strings containing the raw representation of trees in the original treebank.
class discodop.treebank.TigerXMLCorpusReader(path, encoding='utf8', ensureroot=None, punct=None, headrules=None, removeempty=False, functions=None, morphology=None, lemmas=None, modifierrules=None)[source]

Corpus reader for the Tiger XML format.

Parameters:
  • path – filename or pattern of corpus files; e.g., wsj*.mrg.
  • ensureroot – add root node with given label if necessary.
  • removeempty – remove empty nodes and any empty ancestors; a terminal is empty if it is equal to None, ‘’, or ‘-NONE-’.
  • headrules – if given, read rules for assigning heads and apply them by ordering constituents according to their heads.
  • punct

    one of …

    None:leave punctuation as is [default].
    ’move’:move punctuation to appropriate constituents using heuristics.
    ’moveall’:same as ‘move’, but moves all preterminals under root, instead of only recognized punctuation.
    ’prune’:prune away leading & ending quotes & periods, then move.
    ’remove’:eliminate punctuation.
    ’removeall’:eliminate all preterminals directly under root.
    ’root’:attach punctuation directly to root (as in original Negra/Tiger treebanks).
  • functions

    one of …

    None, ‘leave’:leave syntactic labels as is [default].
    ’add’:concatenate grammatical function to syntactic label, separated by a hyphen: e.g., NP => NP-SBJ.
    ’remove’:strip away hyphen-separated grammatical function, e.g., NP-SBJ => NP.
    ’replace’:replace syntactic label with grammatical function, e.g., NP => SBJ.
  • morphology

    one of …

    None, ‘no’:use POS tags as preterminals [default].
    ’add’:concatenate morphological information to POS tags, e.g., DET/sg.def.
    ’replace’:use morphological information as preterminal label
    ’between’:add node with morphological information between POS tag and word, e.g., (DET (sg.def the)).
  • lemmas

    one of …

    None:ignore lemmas [default].
    ’add’:concatenate lemma to terminals, e.g., men/man.
    ’replace’:use lemmas as terminals.
    ’between’:insert lemma as node between POS tag and word.
blocks()[source]
Returns:a list of strings containing the raw representation of trees in the treebank.
class discodop.treebank.AlpinoCorpusReader(path, encoding='utf8', ensureroot=None, punct=None, headrules=None, removeempty=False, functions=None, morphology=None, lemmas=None, modifierrules=None)[source]

Corpus reader for the Dutch Alpino treebank in XML format.

Expects a corpus in directory format, where every sentence is in a single .xml file.

Parameters:
  • path – filename or pattern of corpus files; e.g., wsj*.mrg.
  • ensureroot – add root node with given label if necessary.
  • removeempty – remove empty nodes and any empty ancestors; a terminal is empty if it is equal to None, ‘’, or ‘-NONE-’.
  • headrules – if given, read rules for assigning heads and apply them by ordering constituents according to their heads.
  • punct

    one of …

    None:leave punctuation as is [default].
    ’move’:move punctuation to appropriate constituents using heuristics.
    ’moveall’:same as ‘move’, but moves all preterminals under root, instead of only recognized punctuation.
    ’prune’:prune away leading & ending quotes & periods, then move.
    ’remove’:eliminate punctuation.
    ’removeall’:eliminate all preterminals directly under root.
    ’root’:attach punctuation directly to root (as in original Negra/Tiger treebanks).
  • functions

    one of …

    None, ‘leave’:leave syntactic labels as is [default].
    ’add’:concatenate grammatical function to syntactic label, separated by a hyphen: e.g., NP => NP-SBJ.
    ’remove’:strip away hyphen-separated grammatical function, e.g., NP-SBJ => NP.
    ’replace’:replace syntactic label with grammatical function, e.g., NP => SBJ.
  • morphology

    one of …

    None, ‘no’:use POS tags as preterminals [default].
    ’add’:concatenate morphological information to POS tags, e.g., DET/sg.def.
    ’replace’:use morphological information as preterminal label
    ’between’:add node with morphological information between POS tag and word, e.g., (DET (sg.def the)).
  • lemmas

    one of …

    None:ignore lemmas [default].
    ’add’:concatenate lemma to terminals, e.g., men/man.
    ’replace’:use lemmas as terminals.
    ’between’:insert lemma as node between POS tag and word.
blocks()[source]
Returns:a list of strings containing the raw representation of trees in the treebank.
class discodop.treebank.FTBXMLCorpusReader(*args, **kwargs)[source]

Corpus reader for the French treebank (FTB) in XML format.

blocks()[source]
Returns:a list of strings containing the raw representation of trees in the treebank.
discodop.treebank.exporttree(block, functions=None, morphology=None, lemmas=None)[source]

Get tree, sentence from tree in export format given as list of lines.

Parameters:block – a list of lines
Returns:Item object, with tree, sent, comment, block fields.
discodop.treebank.exportsplit(line)[source]

Take a line in export format and split into fields.

Strip comments. Add dummy field for lemma if absent.

Returns:a list with >= 6 elements; if > 6, length is even since secondary edges are defined by pairs of (label, parentid) fields.
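
The behaviour described above can be sketched as follows. This is an illustration rather than the library's code: the name split_export_line is hypothetical, and two details are assumptions, namely that a comment starts at '%%' and that the dummy lemma is an empty string. The lemma test relies on the parity argument from the description: with a lemma column the field count is even, so an odd count signals that the lemma is absent.

```python
def split_export_line(line):
    """Split one export-format line into fields (illustrative sketch)."""
    # Assumption: a comment starts at '%%' and runs to the end of the line.
    if '%%' in line:
        line = line[:line.index('%%')]
    fields = line.split()
    if len(fields) < 5:
        raise ValueError('expected at least 5 fields: %r' % line)
    if len(fields) % 2 == 1:
        # Odd count: the lemma column is missing, so insert a dummy
        # (assumed here to be an empty string) right after the word.
        fields.insert(1, '')
    return fields
```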
discodop.treebank.alpinotree(block, functions=None, morphology=None, lemmas=None)[source]

Get tree, sent from tree in Alpino format given as etree XML object.

discodop.treebank.ftbtree(block, functions=None, morphology=None, lemmas=None)[source]

Get tree, sent from tree in FTB format given as etree XML object.

discodop.treebank.writetree(tree, sent, key, fmt, comment=None, morphology=None, sentid=False)[source]

Convert a tree to a string representation in the given treebank format.

Parameters:
  • tree – should have indices as terminals
  • sent – contains the words corresponding to the indices in tree
  • key – an identifier for this tree; part of the output with some formats or when sentid is True.
  • fmt – Formats are bracket, discbracket, Negra’s export format, and alpino XML format, as well as unlabeled dependency conversion into mst or conll format (requires head rules). The formats tokens and wordpos strip away tree structure and leave only lines with space-separated tokens or token/POS. When using bracket, make sure tree is canonicalized.
  • comment – optionally, a string that will go in the format’s comment field (supported by export and alpino), or at the end of the line preceded by a tab (discbracket); ignored by other formats. Should be a single line.
  • sentid – for line-based formats, prefix output by key|.

Lemmas, functions, and morphology information will be empty unless nodes contain a ‘source’ attribute with such information.
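
As an illustration of the tokens and wordpos formats and of the sentid prefix, here is a small sketch that works on plain lists instead of Tree objects. The function format_line and its tags argument are hypothetical, and whether the real output ends with a newline is not specified here.

```python
def format_line(sent, tags, key, fmt, sentid=False):
    """Render a sentence as a tokens or wordpos line (illustrative only)."""
    if fmt == 'tokens':
        # Space-separated tokens.
        body = ' '.join(sent)
    elif fmt == 'wordpos':
        # Space-separated token/POS pairs.
        body = ' '.join('%s/%s' % (word, tag) for word, tag in zip(sent, tags))
    else:
        raise ValueError('format not covered by this sketch: %r' % (fmt,))
    # With sentid, the line is prefixed by the key and a vertical bar.
    prefix = '%s|' % key if sentid else ''
    return prefix + body
```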

discodop.treebank.writeexporttree(tree, sent, key, comment, morphology)[source]

Return string with given tree in Negra’s export format.

discodop.treebank.writealpinotree(tree, sent, key, commentstr)[source]

Return XML string with tree in AlpinoXML format.

discodop.treebank.writedependencies(tree, sent, fmt)[source]

Convert tree to dependencies in mst or conll format.

discodop.treebank.dependencies(root)[source]

Lin (1995): A Dependency-based Method for Evaluating […] Parsers.

http://ijcai.org/Proceedings/95-2/Papers/052.pdf

Returns:list of tuples of the form (headidx, label, depidx).
discodop.treebank.deplen(deps)[source]

Compute dependency length from result of dependencies().

Returns:tuple (totaldeplen, numdeps).
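
A minimal sketch of such a computation over (headidx, label, depidx) tuples is given below. It assumes that a head index of 0 marks an artificial root and that such dependencies are skipped; whether the library counts them is not stated here, and the name dependency_length is hypothetical.

```python
def dependency_length(deps):
    """Return (total length, number of dependencies); illustrative only."""
    # Assumption: head index 0 denotes the artificial root, which has no
    # position in the sentence, so these dependencies are left out.
    pairs = [(head, dep) for head, _label, dep in deps if head != 0]
    # The length of a dependency is the distance between head and dependent.
    total = sum(abs(head - dep) for head, dep in pairs)
    return total, len(pairs)
```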
discodop.treebank.handlefunctions(action, tree, pos=True, root=False, morphology=None)[source]

Add function tags to phrasal labels e.g., ‘VP’ => ‘VP-HD’.

Parameters:
  • action – one of {None, ‘add’, ‘replace’, ‘remove’}
  • pos – whether to add function tags to POS tags.
  • root – whether to add function tags to the root node.
  • morphology – if morphology=’between’, skip those nodes.
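
The actions can be illustrated on plain label strings. This is a sketch under the assumption that the function tag is available as a separate string; the helper apply_function is hypothetical and does not operate on Tree objects as handlefunctions does.

```python
def apply_function(action, label, function):
    """Apply a function-tag action to one label (illustrative only)."""
    if action in (None, 'leave'):
        return label
    if action == 'add':
        # 'VP' with function 'HD' becomes 'VP-HD'.
        return '%s-%s' % (label, function) if function else label
    if action == 'remove':
        # 'NP-SBJ' becomes 'NP'; keep labels such as '-NONE-' intact.
        base = label.split('-', 1)[0]
        return base or label
    if action == 'replace':
        # 'NP' with function 'SBJ' becomes 'SBJ'.
        return function or label
    raise ValueError('unknown action: %r' % (action,))
```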
discodop.treebank.handlemorphology(action, lemmaaction, preterminal, source, sent=None)[source]

Augment/replace preterminal label with morphological information.

discodop.treebank.incrementaltreereader(treeinput, morphology=None, functions=None, strict=False, robust=True, othertext=False)[source]

Incremental corpus reader.

Supports brackets, discbrackets, export and alpino-xml format. The format is autodetected.

Parameters:
  • treeinput – an iterator giving one line at a time.
  • strict – if True, raise ValueError on malformed data.
  • robust – if True, only return trees with more than 2 brackets; e.g., (DT the) is not recognized as a tree.
  • othertext – if True, yield non-tree data as (None, None, line). By default, text in lines without trees is ignored.
Yields:

tuples (tree, sent, comment) with a Tree object, a separate list of terminals, and a string with any other data following the tree.

discodop.treebank.segmentbrackets(strict=False, robust=True)[source]

Co-routine that accepts one line at a time.

Yields tuples (result, status) where …

  • result is None or one or more S-expressions as a list of tuples (tree, sent, rest), where rest is the string outside of brackets between this S-expression and the next.
  • status is 1 if the line was consumed, else 0.
Parameters:
  • strict – if True, raise ValueError for improperly nested brackets.
  • robust – if True, only return trees with at least 2 brackets; e.g., (DT the) is not recognized as a tree.
discodop.treebank.segmentalpino(morphology, functions)[source]

Co-routine that accepts one line at a time. Yields tuples (result, status) where …

  • result is None or a segment delimited by <alpino_ds> and </alpino_ds> as a list of lines;
  • status is 1 if the line was consumed, else 0.
discodop.treebank.segmentexport(morphology, functions, strict=False)[source]

Co-routine that accepts one line at a time. Yields tuples (result, status) where …

  • result is None or a segment delimited by #BOS and #EOS as a list of lines;
  • status is 1 if the line was consumed, else 0.
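
The co-routine protocol described above (send one line, receive a (result, status) tuple) can be sketched with a Python generator. This is an illustration of the pattern, not the library's implementation: segment_export_sketch takes no morphology or functions arguments, and the driver collect_segments is hypothetical.

```python
def segment_export_sketch():
    """Generator-based co-routine that groups #BOS ... #EOS segments."""
    block = []
    inblock = False
    # Priming yield: the caller calls next() once before sending lines.
    line = yield (None, 1)
    while True:
        if line.startswith('#BOS'):
            inblock = True
            block = [line]
            line = yield (None, 1)
        elif inblock and line.startswith('#EOS'):
            block.append(line)
            inblock = False
            # Segment complete: hand it over as a list of lines.
            result, block = block, []
            line = yield (result, 1)
        elif inblock:
            block.append(line)
            line = yield (None, 1)
        else:
            # Line is outside any segment: report it as not consumed.
            line = yield (None, 0)


def collect_segments(lines):
    """Feed lines to the co-routine and gather the completed segments."""
    coroutine = segment_export_sketch()
    next(coroutine)  # advance to the first yield
    segments = []
    for line in lines:
        result, _status = coroutine.send(line)
        if result is not None:
            segments.append(result)
    return segments
```

A status of 0 lets the caller offer the same line to a different segmenter, which is one way format autodetection could be built on top of several such co-routines.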
discodop.treebank.numbase(key)[source]

Split file name in numeric and string components to use as sort key.
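
The idea of a sort key with numeric and string components can be sketched as below. This is a generic natural-sort key, not necessarily how numbase splits file names; the function natural_key is hypothetical.

```python
import re


def natural_key(filename):
    """Split a file name into string and integer parts (illustrative only)."""
    # re.split with a capturing group alternates between non-digit text
    # (even positions, possibly empty) and digit runs (odd positions).
    parts = re.split(r'(\d+)', filename)
    return [int(part) if i % 2 else part for i, part in enumerate(parts)]
```

With such a key, sorted(['10.xml', '2.xml', '1.xml'], key=natural_key) orders the files numerically rather than lexicographically, so that 2.xml precedes 10.xml.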