discodop.treesearch

Objects for searching through collections of trees.

Functions

applyhighlight(sent, high1, high2[, reset, …]) Highlight character indices high1 & high2 in sent with ANSI colors.
charindices(sent, indices[, indices2]) Project token indices to character indices.
cpu_count() Return number of CPUs or 1.
filterlabels(line, nofunc, nomorph) Remove morphological and/or grammatical function labels from tree(s).
main()
writecounts(results[, flat, columns]) Write a dictionary of dictionaries to stdout as CSV or in a flat format.

Classes

CorpusInfo(len, numwords, numnodes, maxnodes) Create new instance of CorpusInfo(len, numwords, numnodes, maxnodes)
CorpusSearcher(files[, macros, numproc]) Abstract base class to wrap corpus files that can be queried.
FIFOOrederedDict(limit) FIFO cache with maximum number of elements based on OrderedDict.
FragmentSearcher(files[, macros, numproc, …]) Search for fragments in a bracket treebank.
NoFuture(func, *args, **kwargs) A non-asynchronous version of concurrent.futures.Future.
RegexSearcher(files[, macros, numproc, …]) Search a plain text file in UTF-8 with regular expressions.
TgrepSearcher(files[, macros, numproc]) Search a corpus with tgrep2.
class discodop.treesearch.CorpusSearcher(files, macros=None, numproc=None)[source]

Abstract base class to wrap corpus files that can be queried.

Parameters:
  • files – a sequence of filenames of corpora
  • macros – a filename with macros that can be used in queries.
  • numproc – the number of concurrent threads / processes to use; pass 1 to use a single core.
counts(query, subset=None, start=None, end=None, indices=False, breakdown=False)[source]

Run query and return a dict of the form {corpus1: nummatches, …}.

Parameters:
  • query – the search query
  • subset – an iterable of filenames to run the query on; by default all filenames are used.
  • start, end – the interval of sentences to query in each corpus (1-based, inclusive); by default, all sentences are queried.
  • indices – if True, return a sequence of indices of matching occurrences, instead of an integer count.
  • breakdown – if True, return a Counter mapping matches to counts.
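The three return shapes described above (integer count, list of indices, Counter) can be illustrated with a minimal sketch over an in-memory list of sentences. This is not the library's implementation: toy_counts and its use of re are hypothetical, and the sketch assumes that indices are 1-based sentence numbers with one entry per match.

```python
import re
from collections import Counter
from itertools import islice


def toy_counts(sentences, query, start=None, end=None,
               indices=False, breakdown=False):
    """Hypothetical stand-in for counts() on a single in-memory corpus."""
    # start and end are 1-based and inclusive; None means "no bound".
    lo = 0 if start is None else start - 1
    selected = islice(enumerate(sentences, 1), lo, end)
    pattern = re.compile(query)
    if breakdown:
        result = Counter()
    elif indices:
        result = []
    else:
        result = 0
    for sentno, sent in selected:
        for match in pattern.finditer(sent):
            if breakdown:
                result[match.group()] += 1   # matched text -> frequency
            elif indices:
                result.append(sentno)        # one entry per match
            else:
                result += 1
    return result


corpus = ['the cat sat', 'a dog ran', 'the dog and the cat']
print(toy_counts(corpus, r'\bthe\b'))                 # plain count
print(toy_counts(corpus, r'\bthe\b', indices=True))   # sentence numbers
print(toy_counts(corpus, r'\b(cat|dog)\b', start=2, end=3, breakdown=True))
```

Because the interval is inclusive, start=2, end=3 selects exactly the second and third sentences.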
trees(query, subset=None, start=None, end=None, maxresults=10, nofunc=False, nomorph=False)[source]

Run query and return list of matching trees.

Parameters:
  • start, end – the interval of sentences to query in each corpus (1-based, inclusive); by default, all sentences are queried.
  • maxresults – the maximum number of matches to return.
  • nofunc, nomorph – whether to remove / add function tags and morphological features from trees.
Returns:

list of tuples of the form (corpus, sentno, tree, sent, highlight), where highlight is a list of matched Tree nodes from tree.

sents(query, subset=None, start=None, end=None, maxresults=100, brackets=False)[source]

Run query and return matching sentences.

Parameters:
  • start, end – the interval of sentences to query in each corpus (1-based, inclusive); by default, all sentences are queried.
  • maxresults – the maximum number of matches to return; pass None for no limit.
  • brackets – if True, return trees as they appear in the treebank; match1 and match2 are then strings with the matching subtree. If False (default), sentences are returned as a sequence of tokens.
Returns:

list of tuples of the form (corpus, sentno, sent, match1, match2), where sent is a single string with space-separated tokens, and match1 and match2 are iterables of integer indices of characters matched by the query. If the distinction is applicable, match2 contains the complete subtree, of which match1 is a subset.
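As an illustration of how the character indices in such a tuple can be consumed, the sketch below groups consecutive indices into runs and slices the matching substrings out of sent. The helper matched_text and the example tuple are hypothetical; the values are made up and do not come from a real corpus.

```python
def matched_text(sent, charidx):
    """Return the substrings of sent covered by runs of consecutive indices."""
    spans = []
    for idx in sorted(charidx):
        if spans and idx == spans[-1][1]:
            spans[-1][1] = idx + 1        # extend the current run
        else:
            spans.append([idx, idx + 1])  # start a new run
    return [sent[a:b] for a, b in spans]


# Hypothetical result tuple in the documented shape.
corpus, sentno, sent, match1, match2 = (
    'example.mrg', 1, 'The cat is on the mat',
    {4, 5, 6}, {0, 1, 2, 3, 4, 5, 6})
print(matched_text(sent, match1))
print(matched_text(sent, match2))
```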

batchcounts(queries, subset=None, start=None, end=None)[source]

Like counts(), but executes multiple queries on multiple files.

Useful in combination with pandas.DataFrame; e.g.:

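import glob
import pandas
from discodop import treesearch
queries = ['NP < PP', 'VP < PP']
corpus = treesearch.TgrepSearcher(glob.glob('*.mrg'))
# DataFrame.from_items was removed in pandas 1.0; from_dict replaces it.
pandas.DataFrame.from_dict(dict(corpus.batchcounts(queries)),
                orient='index', columns=queries)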
Parameters:
  • queries – an iterable of strings.
  • start, end – the interval of sentences to query in each corpus (1-based, inclusive); by default, all sentences are queried.
Yields:

tuples of the form (corpus1, [count1, count2, ...]), where count1, count2, ... correspond to queries. The order of queries and corpora is preserved.

batchsents(queries, subset=None, start=None, end=None, maxresults=100, brackets=False)[source]

Variant of sents() to run a batch of queries.

Yields:

tuples of the form (corpus1, matches), where matches is in the same format returned by sents(), excluding the filename, with the results of different patterns merged together.

extract(filename, indices, nofunc=False, nomorph=False, sents=False)[source]

Extract a range of trees / sentences.

Parameters:
  • filename – one of the filenames in self.files
  • indices – iterable of indices of sentences to extract (1-based, excluding empty lines)
  • sents – if True, return sentences instead of trees. Sentences are strings with space-separated tokens.
  • nofunc, nomorph – same as for trees() method.
Returns:

a list of Tree objects or sentences.

getinfo(filename)[source]

Return named tuple with members len, numnodes, and numwords.

close()[source]

Close files and free memory.

class discodop.treesearch.TgrepSearcher(files, macros=None, numproc=None)[source]

Search a corpus with tgrep2.

counts(query, subset=None, start=None, end=None, indices=False, breakdown=False)[source]

Run query and return a dict of the form {corpus1: nummatches, …}.

Parameters:
  • query – the search query
  • subset – an iterable of filenames to run the query on; by default all filenames are used.
  • start, end – the interval of sentences to query in each corpus (1-based, inclusive); by default, all sentences are queried.
  • indices – if True, return a sequence of indices of matching occurrences, instead of an integer count.
  • breakdown – if True, return a Counter mapping matches to counts.
batchcounts(queries, subset=None, start=None, end=None)[source]

Like counts(), but executes multiple queries on multiple files.

Useful in combination with pandas.DataFrame; e.g.:

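import glob
import pandas
from discodop import treesearch
queries = ['NP < PP', 'VP < PP']
corpus = treesearch.TgrepSearcher(glob.glob('*.mrg'))
# DataFrame.from_items was removed in pandas 1.0; from_dict replaces it.
pandas.DataFrame.from_dict(dict(corpus.batchcounts(queries)),
                orient='index', columns=queries)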
Parameters:
  • queries – an iterable of strings.
  • start, end – the interval of sentences to query in each corpus (1-based, inclusive); by default, all sentences are queried.
Yields:

tuples of the form (corpus1, [count1, count2, ...]), where count1, count2, ... correspond to queries. The order of queries and corpora is preserved.

batchsents(queries, subset=None, start=None, end=None, maxresults=100, brackets=False)[source]

Variant of sents() to run a batch of queries.

Yields:

tuples of the form (corpus1, matches), where matches is in the same format returned by sents(), excluding the filename, with the results of different patterns merged together.

trees(query, subset=None, start=None, end=None, maxresults=10, nofunc=False, nomorph=False)[source]

Run query and return list of matching trees.

Parameters:
  • start, end – the interval of sentences to query in each corpus (1-based, inclusive); by default, all sentences are queried.
  • maxresults – the maximum number of matches to return.
  • nofunc, nomorph – whether to remove / add function tags and morphological features from trees.
Returns:

list of tuples of the form (corpus, sentno, tree, sent, highlight), where highlight is a list of matched Tree nodes from tree.

sents(query, subset=None, start=None, end=None, maxresults=100, brackets=False)[source]

Run query and return matching sentences.

Parameters:
  • start, end – the interval of sentences to query in each corpus (1-based, inclusive); by default, all sentences are queried.
  • maxresults – the maximum number of matches to return; pass None for no limit.
  • brackets – if True, return trees as they appear in the treebank; match1 and match2 are then strings with the matching subtree. If False (default), sentences are returned as a sequence of tokens.
Returns:

list of tuples of the form (corpus, sentno, sent, match1, match2), where sent is a single string with space-separated tokens, and match1 and match2 are iterables of integer indices of characters matched by the query. If the distinction is applicable, match2 contains the complete subtree, of which match1 is a subset.

extract(filename, indices, nofunc=False, nomorph=False, sents=False)[source]

Extract a range of trees / sentences.

Parameters:
  • filename – one of the filenames in self.files
  • indices – iterable of indices of sentences to extract (1-based, excluding empty lines)
  • sents – if True, return sentences instead of trees. Sentences are strings with space-separated tokens.
  • nofunc, nomorph – same as for trees() method.
Returns:

a list of Tree objects or sentences.

getinfo(filename)[source]

Return named tuple with members len, numnodes, and numwords.

class discodop.treesearch.FragmentSearcher(files, macros=None, numproc=None, inmemory=True)[source]

Search for fragments in a bracket treebank.

Format of treebanks and queries can be bracket, discbracket, or export (autodetected). Each query consists of one or more tree fragments, and the results will be merged together, except with batchcounts(), which returns the results for each fragment separately.

Example queries:

(S (NP (DT The) (NN )) (VP ))
(NP (DT 0=The) (NN 1=queen))
Parameters:
  • macros – a file containing lines of the form 'name=fragment'; an occurrence of '{name}' will be replaced with fragment when it appears in a query.
  • inmemory – if True, keep all corpora in memory; otherwise, load them from disk with each query.
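The macro substitution described above can be sketched as follows. The helpers read_macros and expand are hypothetical and only illustrate the 'name=fragment' and '{name}' conventions; in particular, leaving unknown names untouched is a choice made for this sketch, not documented behavior.

```python
import re


def read_macros(lines):
    """Parse lines of the form 'name=fragment' into a dict."""
    macros = {}
    for line in lines:
        line = line.strip()
        if line:
            # Split on the first '=' only, so fragments may contain '='.
            name, _, body = line.partition('=')
            macros[name.strip()] = body.strip()
    return macros


def expand(query, macros):
    """Replace each '{name}' with its definition; leave unknown names alone."""
    return re.sub(r'\{(\w+)\}',
                  lambda m: macros.get(m.group(1), m.group(0)), query)


macros = read_macros(['np=(NP (DT The) (NN ))', '', 'vp=(VP )'])
print(expand('(S {np} {vp})', macros))
```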
close()[source]

Close files and free memory.

counts(query, subset=None, start=None, end=None, indices=False, breakdown=False)[source]

Run query and return a dict of the form {corpus1: nummatches, …}.

Parameters:
  • query – the search query
  • subset – an iterable of filenames to run the query on; by default all filenames are used.
  • start, end – the interval of sentences to query in each corpus (1-based, inclusive); by default, all sentences are queried.
  • indices – if True, return a sequence of indices of matching occurrences, instead of an integer count.
  • breakdown – if True, return a Counter mapping matches to counts.
batchcounts(queries, subset=None, start=None, end=None)[source]

Like counts(), but executes multiple queries on multiple files.

Useful in combination with pandas.DataFrame; e.g.:

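import glob
import pandas
from discodop import treesearch
queries = ['NP < PP', 'VP < PP']
corpus = treesearch.TgrepSearcher(glob.glob('*.mrg'))
# DataFrame.from_items was removed in pandas 1.0; from_dict replaces it.
pandas.DataFrame.from_dict(dict(corpus.batchcounts(queries)),
                orient='index', columns=queries)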
Parameters:
  • queries – an iterable of strings.
  • start, end – the interval of sentences to query in each corpus (1-based, inclusive); by default, all sentences are queried.
Yields:

tuples of the form (corpus1, [count1, count2, ...]), where count1, count2, ... correspond to queries. The order of queries and corpora is preserved.

trees(query, subset=None, start=None, end=None, maxresults=10, nofunc=False, nomorph=False)[source]

Run query and return list of matching trees.

Parameters:
  • start, end – the interval of sentences to query in each corpus (1-based, inclusive); by default, all sentences are queried.
  • maxresults – the maximum number of matches to return.
  • nofunc, nomorph – whether to remove / add function tags and morphological features from trees.
Returns:

list of tuples of the form (corpus, sentno, tree, sent, highlight), where highlight is a list of matched Tree nodes from tree.

sents(query, subset=None, start=None, end=None, maxresults=100, brackets=False)[source]

Run query and return matching sentences.

Parameters:
  • start, end – the interval of sentences to query in each corpus (1-based, inclusive); by default, all sentences are queried.
  • maxresults – the maximum number of matches to return; pass None for no limit.
  • brackets – if True, return trees as they appear in the treebank; match1 and match2 are then strings with the matching subtree. If False (default), sentences are returned as a sequence of tokens.
Returns:

list of tuples of the form (corpus, sentno, sent, match1, match2), where sent is a single string with space-separated tokens, and match1 and match2 are iterables of integer indices of characters matched by the query. If the distinction is applicable, match2 contains the complete subtree, of which match1 is a subset.

extract(filename, indices, nofunc=False, nomorph=False, sents=False)[source]

Extract a range of trees / sentences.

Parameters:
  • filename – one of the filenames in self.files
  • indices – iterable of indices of sentences to extract (1-based, excluding empty lines)
  • sents – if True, return sentences instead of trees. Sentences are strings with space-separated tokens.
  • nofunc, nomorph – same as for trees() method.
Returns:

a list of Tree objects or sentences.

getinfo(filename)[source]

Return named tuple with members len, numnodes, and numwords.

class discodop.treesearch.RegexSearcher(files, macros=None, numproc=None, ignorecase=False, inmemory=False)[source]

Search a plain text file in UTF-8 with regular expressions.

Assumes that non-empty lines correspond to sentences; empty lines do not count towards line numbers (e.g., when used as paragraph breaks).

Parameters:
  • macros – a file containing lines of the form 'name=regex'; an occurrence of '{name}' will be replaced with regex when it appears in a query.
  • ignorecase – ignore case in all queries.
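The sentence numbering convention (empty lines are skipped and do not advance the line number) and the effect of ignorecase can be illustrated with a small sketch. The function toy_regex_sents is hypothetical and works on an in-memory string rather than on files; treating whitespace-only lines as empty is an assumption of this sketch.

```python
import re


def toy_regex_sents(text, query, ignorecase=False):
    """Return (sentno, line) pairs for non-empty lines matching query."""
    flags = re.IGNORECASE if ignorecase else 0
    pattern = re.compile(query, flags)
    results = []
    sentno = 0
    for line in text.splitlines():
        if not line.strip():
            continue          # empty lines do not advance the sentence number
        sentno += 1
        if pattern.search(line):
            results.append((sentno, line))
    return results


text = 'The cat sat.\n\nA dog ran.\nthe end.\n'
print(toy_regex_sents(text, r'\bthe\b'))
print(toy_regex_sents(text, r'\bthe\b', ignorecase=True))
```

Note that 'the end.' is reported as sentence 3, not 4, because the blank line used as a paragraph break is not counted.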
close()[source]

Close files and free memory.

counts(query, subset=None, start=None, end=None, indices=False, breakdown=False)[source]

Run query and return a dict of the form {corpus1: nummatches, …}.

Parameters:
  • query – the search query
  • subset – an iterable of filenames to run the query on; by default all filenames are used.
  • start, end – the interval of sentences to query in each corpus (1-based, inclusive); by default, all sentences are queried.
  • indices – if True, return a sequence of indices of matching occurrences, instead of an integer count.
  • breakdown – if True, return a Counter mapping matches to counts.
sents(query, subset=None, start=None, end=None, maxresults=100, brackets=False)[source]

Run query and return matching sentences.

Parameters:
  • start, end – the interval of sentences to query in each corpus (1-based, inclusive); by default, all sentences are queried.
  • maxresults – the maximum number of matches to return; pass None for no limit.
  • brackets – if True, return trees as they appear in the treebank; match1 and match2 are then strings with the matching subtree. If False (default), sentences are returned as a sequence of tokens.
Returns:

list of tuples of the form (corpus, sentno, sent, match1, match2), where sent is a single string with space-separated tokens, and match1 and match2 are iterables of integer indices of characters matched by the query. If the distinction is applicable, match2 contains the complete subtree, of which match1 is a subset.

trees(query, subset=None, start=None, end=None, maxresults=10, nofunc=False, nomorph=False)[source]

Run query and return list of matching trees.

Parameters:
  • start, end – the interval of sentences to query in each corpus (1-based, inclusive); by default, all sentences are queried.
  • maxresults – the maximum number of matches to return.
  • nofunc, nomorph – whether to remove / add function tags and morphological features from trees.
Returns:

list of tuples of the form (corpus, sentno, tree, sent, highlight), where highlight is a list of matched Tree nodes from tree.

batchcounts(queries, subset=None, start=None, end=None)[source]

Like counts(), but executes multiple queries on multiple files.

Useful in combination with pandas.DataFrame; e.g.:

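import glob
import pandas
from discodop import treesearch
queries = ['NP < PP', 'VP < PP']
corpus = treesearch.TgrepSearcher(glob.glob('*.mrg'))
# DataFrame.from_items was removed in pandas 1.0; from_dict replaces it.
pandas.DataFrame.from_dict(dict(corpus.batchcounts(queries)),
                orient='index', columns=queries)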
Parameters:
  • queries – an iterable of strings.
  • start, end – the interval of sentences to query in each corpus (1-based, inclusive); by default, all sentences are queried.
Yields:

tuples of the form (corpus1, [count1, count2, ...]), where count1, count2, ... correspond to queries. The order of queries and corpora is preserved.

batchsents(queries, subset=None, start=None, end=None, maxresults=100, brackets=False)[source]

Variant of sents() to run a batch of queries.

extract(filename, indices, nofunc=False, nomorph=False, sents=True)[source]

Extract a range of trees / sentences.

Parameters:
  • filename – one of the filenames in self.files
  • indices – iterable of indices of sentences to extract (1-based, excluding empty lines)
  • sents – if True, return sentences instead of trees. Sentences are strings with space-separated tokens.
  • nofunc, nomorph – same as for trees() method.
Returns:

a list of Tree objects or sentences.

getinfo(filename)[source]

Return named tuple with members len, numnodes, and numwords.

class discodop.treesearch.NoFuture(func, *args, **kwargs)[source]

A non-asynchronous version of concurrent.futures.Future.

result(timeout=None)[source]

Return the precomputed result.
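The idea of a non-asynchronous future can be sketched in a few lines: the function is called immediately in the constructor, and result() simply hands back the stored value. The class name EagerFuture is hypothetical, and this is a sketch of the concept rather than the library's code.

```python
class EagerFuture:
    """Hypothetical synchronous stand-in for concurrent.futures.Future."""

    def __init__(self, func, *args, **kwargs):
        # The call happens right here, in the calling thread.
        self._result = func(*args, **kwargs)

    def result(self, timeout=None):
        # timeout is accepted for interface compatibility and ignored.
        return self._result


future = EagerFuture(pow, 2, 10)
print(future.result())
```

Such an object lets the same calling code work whether or not a process or thread pool is in use.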

class discodop.treesearch.FIFOOrederedDict(limit)[source]

FIFO cache with maximum number of elements based on OrderedDict.
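A minimal sketch of a size-limited FIFO cache built on OrderedDict is shown below. The class name BoundedFIFODict is hypothetical and the eviction details are assumptions; the sketch only illustrates the first-in, first-out policy.

```python
from collections import OrderedDict


class BoundedFIFODict(OrderedDict):
    """Hypothetical FIFO cache: evicts the oldest key once limit is exceeded."""

    def __init__(self, limit):
        super().__init__()
        self.limit = limit

    def __setitem__(self, key, value):
        super().__setitem__(key, value)
        # Updating an existing key does not move it; order is insertion order.
        while len(self) > self.limit:
            self.popitem(last=False)   # drop the oldest entry


cache = BoundedFIFODict(2)
cache['a'] = 1
cache['b'] = 2
cache['c'] = 3   # 'a' is evicted here
print(list(cache))
```

Unlike an LRU cache, reading or updating a key does not protect it from eviction.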

discodop.treesearch.filterlabels(line, nofunc, nomorph)[source]

Remove morphological and/or grammatical function labels from tree(s).
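As a rough illustration of label filtering, the sketch below strips function labels from a bracketed tree with a regular expression. It assumes, for illustration only, that function labels follow the category after a hyphen (as in NP-SBJ); the helper strip_function_labels is hypothetical and does not handle morphological features.

```python
import re

# Assumed convention: a function label follows the category after a hyphen,
# e.g. "NP-SBJ". Labels that start with a hyphen ("-NONE-") are left alone.
FUNC_LABEL = re.compile(r'\((\w+)-[^\s()]+')


def strip_function_labels(line):
    """Hypothetical helper: drop '-FUNC' suffixes from node labels."""
    return FUNC_LABEL.sub(r'(\1', line)


tree = '(S (NP-SBJ (DT The) (NN cat)) (VP (VBZ sleeps)))'
print(strip_function_labels(tree))
```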

discodop.treesearch.charindices(sent, indices, indices2=None)[source]

Project token indices to character indices.

>>> sorted(charindices(['The', 'cat', 'is', 'on', 'the', 'mat'], {0, 2, 4}))
[0, 1, 2, 3, 8, 9, 10, 14, 15, 16, 17]
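The projection in the doctest above can be reproduced with a short sketch. The function token_to_char_indices is a hypothetical re-implementation: it assumes tokens are joined by single spaces and that a selected token also covers the space that follows it, except at the end of the sentence. It ignores the optional indices2 argument.

```python
def token_to_char_indices(sent, indices):
    """Hypothetical re-implementation of the projection shown above."""
    result = set()
    offset = 0
    last = len(sent) - 1
    for n, token in enumerate(sent):
        if n in indices:
            # Cover the token plus one separating space, except at the end.
            width = len(token) + (1 if n < last else 0)
            result.update(range(offset, offset + width))
        offset += len(token) + 1
    return result


print(sorted(token_to_char_indices(
    ['The', 'cat', 'is', 'on', 'the', 'mat'], {0, 2, 4})))
```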
discodop.treesearch.cpu_count()[source]

Return number of CPUs or 1.
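A fallback of this kind can be written with the standard library alone; the following is a sketch under that assumption, and safe_cpu_count is a hypothetical name.

```python
import os


def safe_cpu_count():
    """Hypothetical equivalent: never returns None or 0."""
    # os.cpu_count() returns None when the count cannot be determined.
    return os.cpu_count() or 1


print(safe_cpu_count())
```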

discodop.treesearch.applyhighlight(sent, high1, high2, reset=False, high1color='red', high2color='blue')[source]

Highlight character indices high1 & high2 in sent with ANSI colors.

Parameters:
  • reset – if True, reset to normal color before every change (useful in IPython notebook).
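To show how character indices translate into colored output, here is a minimal sketch that wraps runs of highlighted characters in ANSI escape codes. The function toy_highlight is hypothetical: it omits the reset parameter, supports only two colors, and lets high1 take precedence where the two index sets overlap, which is an assumption.

```python
ANSI = {'red': '\x1b[31m', 'blue': '\x1b[34m', 'reset': '\x1b[0m'}


def toy_highlight(sent, high1, high2, high1color='red', high2color='blue'):
    """Hypothetical sketch: wrap runs of highlighted characters in ANSI codes."""
    out = []
    current = None                  # color active at the previous character
    for n, char in enumerate(sent):
        if n in high1:
            color = high1color
        elif n in high2:
            color = high2color
        else:
            color = None
        if color != current:
            # Emit an escape code only where the color changes.
            out.append(ANSI[color] if color else ANSI['reset'])
            current = color
        out.append(char)
    if current is not None:
        out.append(ANSI['reset'])   # do not leak color past the sentence
    return ''.join(out)


print(toy_highlight('The cat is on the mat', {4, 5, 6}, {0, 1, 2, 3}))
```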