discodop.treebank
Read and write treebanks.
Functions
alpinotree(block[, functions, morphology, ...]) – Get tree, sent from tree in Alpino format given as etree XML object.
dependencies(root) – Lin (1995): A Dependency-based Method for Evaluating [...] Parsers.
deplen(deps) – Compute dependency length from result of dependencies().
exportsplit(line) – Take a line in export format and split into fields.
exporttree(block[, functions, morphology, ...]) – Get tree, sentence from tree in export format given as list of lines.
handlefunctions(action, tree[, pos, top, ...]) – Add function tags to phrasal labels, e.g., 'VP' => 'VP-HD'.
handlemorphology(action, lemmaaction, ...[, ...]) – Augment/replace preterminal label with morphological information.
incrementaltreereader(treeinput[, ...]) – Incremental corpus reader.
numbase(key) – Split file name in numeric and string components to use as sort key.
segmentalpino(morphology, functions) – Co-routine that accepts one line at a time.
segmentbrackets([strict, robust]) – Co-routine that accepts one line at a time.
segmentexport(morphology, functions[, strict]) – Co-routine that accepts one line at a time.
writealpinotree(tree, sent, key, commentstr) – Return XML string with tree in AlpinoXML format.
writedependencies(tree, sent, fmt) – Convert tree to unlabeled dependencies in mst or conll format.
writeexporttree(tree, sent, key, comment, ...) – Return string with given tree in Negra's export format.
writetree(tree, sent, key, fmt[, comment, ...]) – Convert a tree to a string representation in the given treebank format.
Classes
AlpinoCorpusReader(path[, encoding, ...]) – Corpus reader for the Dutch Alpino treebank in XML format.
BracketCorpusReader(path[, encoding, ...]) – Corpus reader for phrase-structures in bracket notation.
CorpusReader(path[, encoding, ensureroot, ...]) – Abstract corpus reader.
DactCorpusReader(path[, encoding, ...]) – Corpus reader for Alpino trees in Dact format (DB XML).
DiscBracketCorpusReader(path[, encoding, ...]) – A corpus reader for discontinuous trees in bracket notation.
Item(tree, sent, comment, block) – A treebank item.
NegraCorpusReader(path[, encoding, ...]) – Read a corpus in the Negra export format.
TigerXMLCorpusReader(path[, encoding, ...]) – Corpus reader for the Tiger XML format.
class discodop.treebank.CorpusReader(path, encoding='utf8', ensureroot=None, punct=None, headrules=None, removeempty=False, functions=None, morphology=None, lemmas=None)
Abstract corpus reader.
Parameters:
- path – filename or pattern of corpus files; e.g., wsj*.mrg.
- ensureroot – add root node with given label if necessary.
- removeempty – remove empty nodes and any empty ancestors; a terminal is empty if it is equal to None, '', or '-NONE-'.
- headrules – if given, read rules for assigning heads and apply them by ordering constituents according to their heads.
- punct – one of:
  - None: leave punctuation as is [default].
  - 'move': move punctuation to appropriate constituents using heuristics.
  - 'moveall': same as 'move', but moves all preterminals under root, instead of only recognized punctuation.
  - 'prune': prune away leading & ending quotes & periods, then move.
  - 'remove': eliminate punctuation.
  - 'removeall': eliminate all preterminals directly under root.
  - 'root': attach punctuation directly to root (as in original Negra/Tiger treebanks).
- functions – one of:
  - None, 'leave': leave syntactic labels as is [default].
  - 'add': concatenate grammatical function to syntactic label, separated by a hyphen; e.g., NP => NP-SBJ.
  - 'remove': strip away hyphen-separated grammatical function; e.g., NP-SBJ => NP.
  - 'replace': replace syntactic label with grammatical function; e.g., NP => SBJ.
- morphology – one of:
  - None, 'no': use POS tags as preterminals [default].
  - 'add': concatenate morphological information to POS tags; e.g., DET/sg.def.
  - 'replace': use morphological information as preterminal label.
  - 'between': add node with morphological information between POS tag and word; e.g., (DET (sg.def the)).
- lemmas – one of:
  - None: ignore lemmas [default].
  - 'add': concatenate lemma to terminals; e.g., men/man.
  - 'replace': use lemmas as terminals.
  - 'between': insert lemma as node between POS tag and word.
itertrees(start=None, end=None)
Returns: an iterator returning tuples (key, item) of sentences in corpus, where item is an Item instance with tree, sent, and comment attributes. Useful when the dictionary of all trees in corpus would not fit in memory.
trees()
Returns: an ordered dictionary of parse trees (Tree objects with integer indices as leaves).
class discodop.treebank.BracketCorpusReader(path, encoding='utf8', ensureroot=None, punct=None, headrules=None, removeempty=False, functions=None, morphology=None, lemmas=None)
Corpus reader for phrase-structures in bracket notation.
For example:
(S (NP John) (VP (VB is) (JJ rich)) (. .))
Parameters: same as for CorpusReader.
class discodop.treebank.DiscBracketCorpusReader(path, encoding='utf8', ensureroot=None, punct=None, headrules=None, removeempty=False, functions=None, morphology=None, lemmas=None)
A corpus reader for discontinuous trees in bracket notation.
Leaves consist of an index and a word, with the indices indicating the word order of the sentence. For example:
(S (NP 1=John) (VP (VB 0=is) (JJ 2=rich)) (? 3=?))
There is one tree per line. Optionally, the tree may be followed by a comment, separated by a TAB. Compared to Negra’s export format, this format lacks morphology, lemmas and functional edges. On the other hand, it is close to the internal representation employed here, so it can be read efficiently.
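To make the leaf convention concrete, the following minimal sketch recovers the sentence and the optional comment from one line in this notation. It does not use discodop's own parser; the helper name `sentence_from_discbracket` and the regular expression are assumptions made for illustration, and words that themselves contain brackets or whitespace are not handled.

```python
import re

# Hypothetical pattern: a leaf is "<index>=<word>" with no spaces or brackets.
LEAF = re.compile(r'(\d+)=([^\s()]+)')


def sentence_from_discbracket(line):
    """Return (words in sentence order, comment) for one discbracket line.

    Illustration only; not the parser used by discodop.
    """
    # The tree may be followed by a TAB and a comment.
    tree, _, comment = line.rstrip('\n').partition('\t')
    # Sort leaves by their index to restore the word order of the sentence.
    leaves = sorted((int(idx), word) for idx, word in LEAF.findall(tree))
    return [word for _, word in leaves], comment


line = '(S (NP 1=John) (VP (VB 0=is) (JJ 2=rich)) (? 3=?))\tan example'
sent, comment = sentence_from_discbracket(line)
print(sent, comment)
```

Because the indices carry the word order, the order of the leaves in the bracketing is free to follow constituent structure instead, which is how discontinuous constituents can be written down.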
Parameters: same as for CorpusReader.
class discodop.treebank.NegraCorpusReader(path, encoding='utf8', ensureroot=None, punct=None, headrules=None, removeempty=False, functions=None, morphology=None, lemmas=None)
Read a corpus in the Negra export format.
Parameters: same as for CorpusReader.
class discodop.treebank.TigerXMLCorpusReader(path, encoding='utf8', ensureroot=None, punct=None, headrules=None, removeempty=False, functions=None, morphology=None, lemmas=None)
Corpus reader for the Tiger XML format.
Parameters: same as for CorpusReader.
class discodop.treebank.AlpinoCorpusReader(path, encoding='utf8', ensureroot=None, punct=None, headrules=None, removeempty=False, functions=None, morphology=None, lemmas=None)
Corpus reader for the Dutch Alpino treebank in XML format.
Expects a corpus in directory format, where every sentence is in a single .xml file.
Parameters: same as for CorpusReader.
class discodop.treebank.DactCorpusReader(path, encoding='utf8', ensureroot=None, punct=None, headrules=None, removeempty=False, functions=None, morphology=None, lemmas=None)
Corpus reader for Alpino trees in Dact format (DB XML).
Parameters: same as for CorpusReader.
discodop.treebank.exporttree(block, functions=None, morphology=None, lemmas=None)
Get tree, sentence from tree in export format given as list of lines.
discodop.treebank.exportsplit(line)
Take a line in export format and split into fields.
Adds dummy fields for the lemma and the secondary edge if those fields are absent.
discodop.treebank.alpinotree(block, functions=None, morphology=None, lemmas=None)
Get tree, sent from tree in Alpino format given as etree XML object.
discodop.treebank.writetree(tree, sent, key, fmt, comment=None, morphology=None, sentid=False)
Convert a tree to a string representation in the given treebank format.
Parameters:
- tree – should have indices as terminals.
- sent – contains the words corresponding to the indices in tree.
- key – an identifier for this tree; part of the output with some formats or when sentid is True.
- fmt – formats are bracket, discbracket, Negra's export format, and alpino XML format, as well as unlabeled dependency conversion into mst or conll format (requires head rules). The formats tokens and wordpos strip away tree structure and leave only lines with space-separated tokens or token/POS pairs. When using bracket, make sure the tree is canonicalized.
- comment – optionally, a string that will go in the format's comment field (supported by export and alpino), or at the end of the line preceded by a tab (discbracket); ignored by other formats. Should be a single line.
- sentid – for line-based formats, prefix output by key|.
Lemmas, functions, and morphology information will be empty unless nodes contain a 'source' attribute with such information.
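As a rough illustration of the two flat formats and the sentid prefix described above, the sketch below renders a sentence as a single line. It is a stand-in written for this page, not a call into discodop: the helper `format_flat` and its `tags` argument are hypothetical, and the real function takes a tree rather than a list of tags.

```python
def format_flat(words, tags, key, fmt, sentid=False):
    """Hypothetical helper: render a sentence as one flat line.

    fmt='tokens' gives space-separated tokens; fmt='wordpos' gives
    space-separated token/POS pairs. With sentid=True the line is
    prefixed by "key|".
    """
    if fmt == 'tokens':
        body = ' '.join(words)
    elif fmt == 'wordpos':
        body = ' '.join('%s/%s' % (word, tag) for word, tag in zip(words, tags))
    else:
        raise ValueError('unsupported format: %r' % fmt)
    prefix = '%s|' % key if sentid else ''
    return prefix + body


words = ['John', 'is', 'rich', '.']
tags = ['NP', 'VB', 'JJ', '.']
print(format_flat(words, tags, 1, 'tokens'))
print(format_flat(words, tags, 1, 'wordpos', sentid=True))
```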
discodop.treebank.writeexporttree(tree, sent, key, comment, morphology)
Return string with given tree in Negra's export format.
discodop.treebank.writealpinotree(tree, sent, key, commentstr)
Return XML string with tree in AlpinoXML format.
discodop.treebank.writedependencies(tree, sent, fmt)
Convert tree to unlabeled dependencies in mst or conll format.
discodop.treebank.dependencies(root)
Lin (1995): A Dependency-based Method for Evaluating [...] Parsers.
http://ijcai.org/Proceedings/95-2/Papers/052.pdf
Returns: list of tuples of the form (headidx, label, depidx).
discodop.treebank.deplen(deps)
Compute dependency length from result of dependencies().
Returns: tuple (totaldeplen, numdeps).
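A plausible reading of this return value can be sketched as follows, assuming that the length of a single dependency is the absolute distance between the head index and the dependent index. That definition, the function name `dependency_length`, and the labels in the example are assumptions; the sketch also counts every tuple it is given, so whether a root dependency is discounted by the actual function is not settled here.

```python
def dependency_length(deps):
    """Hypothetical re-implementation for illustration.

    deps is a list of (headidx, label, depidx) tuples, the shape
    returned by dependencies(). Returns (total length, count).
    """
    # Assumption: length of one dependency = distance between the indices.
    total = sum(abs(headidx - depidx) for headidx, _label, depidx in deps)
    return total, len(deps)


# Made-up dependencies over a 4-word sentence (1-based word indices).
deps = [(2, 'sbj', 1), (2, 'obj', 4), (4, 'det', 3)]
print(dependency_length(deps))
```

Dividing the first element by the second gives a mean dependency length for the sentence.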
discodop.treebank.handlefunctions(action, tree, pos=True, top=False, morphology=None)
Add function tags to phrasal labels, e.g., 'VP' => 'VP-HD'.
Parameters:
- action – one of {None, 'add', 'replace', 'remove'}.
- pos – whether to add function tags to POS tags.
- top – whether to add function tags to the top node.
- morphology – if morphology='between', skip those nodes.
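The effect of each action on a single label can be sketched as below. This operates on plain strings rather than on tree nodes, and the helper `apply_function_tag` is hypothetical; it only mirrors the documented examples (NP => NP-SBJ, NP-SBJ => NP, NP => SBJ), not the traversal logic controlled by pos, top, and morphology.

```python
def apply_function_tag(label, function, action):
    """Hypothetical helper: transform one label according to action."""
    if action is None or action == 'leave':
        return label
    if action == 'add':
        # Concatenate the function to the label, separated by a hyphen.
        return '%s-%s' % (label, function) if function else label
    if action == 'remove':
        # Strip a hyphen-separated function; keep labels such as
        # '-NONE-' that start with a hyphen unchanged.
        head, _, _ = label.partition('-')
        return head if head else label
    if action == 'replace':
        # Use the function instead of the label when one is available.
        return function if function else label
    raise ValueError('unknown action: %r' % action)


print(apply_function_tag('NP', 'SBJ', 'add'))
print(apply_function_tag('NP-SBJ', None, 'remove'))
print(apply_function_tag('NP', 'SBJ', 'replace'))
```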
discodop.treebank.handlemorphology(action, lemmaaction, preterminal, source, sent=None)
Augment/replace preterminal label with morphological information.
discodop.treebank.incrementaltreereader(treeinput, morphology=None, functions=None, strict=False, robust=True, othertext=False)
Incremental corpus reader.
Supports brackets, discbrackets, export and alpino-xml format. The format is autodetected.
Parameters:
- treeinput – an iterator giving one line at a time.
- strict – if True, raise ValueError on malformed data.
- robust – if True, only return trees with more than 2 brackets; e.g., (DT the) is not recognized as a tree.
- othertext – if True, yield non-tree data as (None, None, line). By default, text in lines without trees is ignored.
Yields: tuples (tree, sent, comment) with a Tree object, a separate list of terminals, and a string with any other data following the tree.
discodop.treebank.segmentbrackets(strict=False, robust=True)
Co-routine that accepts one line at a time.
Yields tuples (result, status) where:
- result is None or one or more S-expressions as a list of tuples (tree, sent, rest), where rest is the string outside of brackets between this S-expression and the next.
- status is 1 if the line was consumed, else 0.
Parameters:
- strict – if True, raise ValueError for improperly nested brackets.
- robust – if True, only return trees with at least 2 brackets; e.g., (DT the) is not recognized as a tree.
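The line-at-a-time protocol can be illustrated with a small generator-based coroutine. This is a simplified sketch under stated assumptions, not the library's code: the name `segment_balanced` is hypothetical, the results are raw strings rather than (tree, sent, rest) tuples, every line is always consumed (so status is always 1), and the strict and robust options are not modelled.

```python
def segment_balanced():
    """Hypothetical coroutine: send() one line at a time.

    Yields (result, status); result is None until the brackets seen so
    far are balanced, then a list with the completed bracketing.
    """
    buffer = []
    depth = 0
    result = None
    while True:
        line = yield (result, 1)  # in this sketch the line is always consumed
        result = None
        buffer.append(line)
        depth += line.count('(') - line.count(')')
        if depth < 0:
            # Stray closing bracket: drop the malformed buffer.
            buffer, depth = [], 0
        elif depth == 0:
            text = ' '.join(part.strip() for part in buffer).strip()
            if text:
                result = [text]
            buffer = []


seg = segment_balanced()
next(seg)  # prime the coroutine before sending the first line
for line in ['(S (NP John)', '(VP sleeps))']:
    result, status = seg.send(line)
    print(result, status)
```

Driving the coroutine with send() lets the caller interleave several segmenters over the same input, which is one way a reader could autodetect the format line by line.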
discodop.treebank.segmentalpino(morphology, functions)
Co-routine that accepts one line at a time.
Yields tuples (result, status) where:
- result is None or a segment delimited by <alpino_ds> and </alpino_ds> as a list of lines;
- status is 1 if the line was consumed, else 0.