polyglot package

Submodules

polyglot.base module

Basic data types.

class polyglot.base.Sequence(text)[source]

Bases: object

Text with indices that indicate boundaries.

empty()[source]
split(sequence)[source]

Split into subsequences according to sequence.

text
tokens()[source]

Returns segmented text after stripping whitespace.

class polyglot.base.TextFile(file, delimiter=u'\n')[source]

Bases: object

Wrapper around text files.

It uses io.open to guarantee reading text files with unicode encoding. It has an iterator that supports an arbitrary delimiter instead of only newlines.
delimiter

string – A string that defines the limit of each chunk.

file

string – A path to a file.

buf

StringIO – a buffer to store the results of peeking into the file.

apply(func, workers=1, job_size=10000)[source]

Apply func to lines of text, either in parallel or sequentially.

Parameters: func – a function that takes a list of lines.
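
The batching behaviour described above can be sketched as follows. This is a minimal illustration and not polyglot's implementation: the helper names iter_jobs and apply_to_lines are hypothetical, and the use of a thread pool for workers > 1 is an assumption (the library may use processes instead).

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import islice


def iter_jobs(lines, job_size):
    """Yield successive lists of at most job_size lines."""
    iterator = iter(lines)
    while True:
        job = list(islice(iterator, job_size))
        if not job:
            return
        yield job


def apply_to_lines(func, lines, workers=1, job_size=10000):
    """Call func on each batch of lines and return the results in order."""
    jobs = iter_jobs(lines, job_size)
    if workers <= 1:
        # Sequential path: process one batch after another.
        return [func(job) for job in jobs]
    # Assumption: a thread pool stands in for whatever pool the library uses.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(func, jobs))


def count_words(batch):
    """Example func: takes a list of lines, returns the total word count."""
    return sum(len(line.split()) for line in batch)


lines = ["a b", "c", "d e f", "g"]
sequential = apply_to_lines(count_words, lines, workers=1, job_size=2)
parallel = apply_to_lines(count_words, lines, workers=2, job_size=2)
```

Because func receives a whole list of lines, per-call overhead is paid once per job rather than once per line.
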
iter_chunks(chunksize)[source]
iter_delimiter(byte_size=8192)[source]

Generalization of the default file iteration, which is delimited by ‘\n’.

Note:
The newline string can be arbitrarily long; it need not be restricted to a single character. You can also set the read size and control whether or not the newline string is left on the end of the iterated lines. Setting newline to ‘\0’ is particularly good for use with an input file created with something like “os.popen(‘find -print0’)”.
Args:
byte_size (integer): Number of bytes to read at a time.
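
A minimal sketch of delimiter-based iteration, assuming a text stream with a read(size) method. The function name iter_delimited is hypothetical, and this version always drops the delimiter from the yielded chunks; it is not the library's code.

```python
import io


def iter_delimited(stream, delimiter=u"\n", byte_size=8192):
    """Yield chunks of stream separated by delimiter, without the delimiter."""
    pending = u""
    while True:
        block = stream.read(byte_size)
        if not block:
            break
        pending += block
        parts = pending.split(delimiter)
        # The last part may be incomplete, so keep it for the next read.
        pending = parts.pop()
        for part in parts:
            yield part
    if pending:
        # Trailing text that was not terminated by a delimiter.
        yield pending


# NUL-separated names, as produced by something like `find -print0`.
stream = io.StringIO(u"a.txt\0b.txt\0c.txt\0")
records = list(iter_delimited(stream, delimiter=u"\0", byte_size=4))
```

Keeping the unfinished tail in pending is what lets a delimiter longer than one character straddle two reads without being missed.
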
peek(size)[source]
read(size=None)[source]

Read size bytes.

readline()[source]
class polyglot.base.TextFiles(files, delimiter=u'\n')[source]

Bases: polyglot.base.TextFile

Interface for a sequence of files.

names
peek(size)[source]
read(size=None)[source]
readline()[source]
class polyglot.base.TokenSequence[source]

Bases: list

A list of tokens.

Parameters: tokens (list) – list of symbols.
sliding_window(width=2, padding=None)[source]
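
The method is undocumented, so the following is only a sketch of what a sliding window over a token list usually means. The padding behaviour (width // 2 copies of the padding symbol on each side) is an assumption, and the function is written standalone rather than as a TokenSequence method.

```python
from itertools import chain, islice


def sliding_window(tokens, width=2, padding=None):
    """Return an iterator over tuples of width consecutive tokens."""
    seq = list(tokens)
    if padding is not None:
        # Assumption: pad both ends so edge tokens also appear mid-window.
        pad = [padding] * (width // 2)
        seq = list(chain(pad, seq, pad))
    # One iterator per window position, each starting one token later.
    shifted = [islice(seq, i, None) for i in range(width)]
    return zip(*shifted)


bigrams = list(sliding_window(["a", "b", "c"]))
padded = list(sliding_window(["a", "b", "c"], width=3, padding="<pad>"))
```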

polyglot.decorators module

class polyglot.decorators.cached_property(func)[source]

Bases: object

A property that is only computed once per instance and then replaces itself with an ordinary attribute. Deleting the attribute resets the property. Credit to Marcel Hellkamp, author of bottle.py.
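
The caching and reset behaviour can be illustrated with a small non-data descriptor in the style of bottle.py. This is a sketch for explanation; the Document class and its attributes are hypothetical.

```python
class cached_property(object):
    """Compute the value once, then store it on the instance."""

    def __init__(self, func):
        self.__doc__ = getattr(func, "__doc__")
        self.func = func

    def __get__(self, obj, cls):
        if obj is None:
            return self
        # Storing under the same name shadows this descriptor on later reads.
        value = obj.__dict__[self.func.__name__] = self.func(obj)
        return value


class Document(object):
    def __init__(self, text):
        self.text = text
        self.calls = 0

    @cached_property
    def words(self):
        self.calls += 1
        return self.text.split()


doc = Document("one two three")
first = doc.words          # computed
second = doc.words         # read from the instance __dict__
calls_before_reset = doc.calls
del doc.words              # removes the cached attribute
third = doc.words          # computed again
calls_after_reset = doc.calls
```

The reset works because the descriptor defines no __set__ or __delete__, so the instance attribute takes precedence and can be deleted like any other.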

polyglot.decorators.memoize(obj)[source]

polyglot.downloader module

The Polyglot corpus and module downloader. This module defines several interfaces which can be used to download corpora, models, and other data packages that can be used with polyglot.

Downloading Packages

If called with no arguments, download() will display an interactive interface which can be used to download and install new packages. If Tkinter is available, a graphical interface will be shown; otherwise, a simple text interface will be provided.

Individual packages can be downloaded by calling the download() function with a single argument, giving the package identifier for the package that should be downloaded:

>>> download('treebank') 
[polyglot_data] Downloading package 'treebank'...
[polyglot_data]   Unzipping corpora/treebank.zip.

Polyglot also provides a number of “package collections”, each consisting of a group of related packages. To download all packages in a collection, simply call download() with the collection’s identifier:

>>> download('all-corpora') 
[polyglot_data] Downloading package 'abc'...
[polyglot_data]   Unzipping corpora/abc.zip.
[polyglot_data] Downloading package 'alpino'...
[polyglot_data]   Unzipping corpora/alpino.zip.
  ...
[polyglot_data] Downloading package 'words'...
[polyglot_data]   Unzipping corpora/words.zip.

Download Directory

By default, packages are installed either in a system-wide directory (if Python has sufficient access to write to it) or in the current user’s home directory. However, the download_dir argument may be used to specify a different installation target, if desired.

See Downloader.default_download_dir() for a more detailed description of how the default download directory is chosen.

Polyglot Download Server

Before downloading any packages, the corpus and module downloader contacts the Polyglot download server, to retrieve an index file describing the available packages. By default, this index file is loaded from http://nltk.googlecode.com/svn/trunk/polyglot_data/index.xml. If necessary, it is possible to create a new Downloader object, specifying a different URL for the package index file.

Usage:

python polyglot/downloader.py [-d DATADIR] [-q] [-f] [-k] PACKAGE_IDS

or:

python -m polyglot.downloader [-d DATADIR] [-q] [-f] [-k] PACKAGE_IDS
class polyglot.downloader.Collection(id, children, name=None, **kw)[source]

Bases: object

A directory entry for a collection of downloadable packages. These entries are extracted from the XML index file that is downloaded by Downloader.

children = None

A list of the Collections or Packages directly contained by this collection.

id = None

A unique identifier for this collection.

name = None

A string name for this collection.

packages = None

A list of Packages contained by this collection or any collections it recursively contains.

class polyglot.downloader.Downloader(server_index_url=None, source=None, download_dir=None)[source]

Bases: object

A class used to access the Polyglot data server, which can be used to download corpora and other data packages.

DEFAULT_SOURCE = u'mirror'

The source for index and other data files. Two values are supported: ‘mirror’ or ‘google’.

For ‘mirror’, the DEFAULT_URL should be set as a prefix of the mirrored directory, like ‘http://address.of.mirror/dir/’, and the downloader expects a file named ‘index.json’ as the index file.

For ‘google’, the DEFAULT_URL should be the Google Cloud bucket, and the downloader expects the index from the Google API.

So set the following DEFAULT_URL properly.

DEFAULT_URL = u'http://polyglot.cs.stonybrook.edu/~polyglot/'

The default URL for the Polyglot data server’s index. An alternative URL can be specified when creating a new Downloader object.

For ‘google’ as DEFAULT_SOURCE, ‘polyglot-models’ is the default place. For ‘mirror’ as DEFAULT_SOURCE, use a proper mirror.

INDEX_TIMEOUT = 3600

The amount of time after which the cached copy of the data server index will be considered ‘stale,’ and will be re-downloaded.
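
A staleness check of this kind can be sketched as below. The unit is assumed to be seconds (3600 being one hour), and the function name index_is_stale and its arguments are hypothetical.

```python
import time

INDEX_TIMEOUT = 3600  # assumed to be seconds, i.e. one hour


def index_is_stale(last_checked, now=None, timeout=INDEX_TIMEOUT):
    """Return True if the cached index should be downloaded again."""
    if last_checked is None:
        # Nothing cached yet, so the index must be fetched.
        return True
    if now is None:
        now = time.time()
    return now - last_checked > timeout
```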

INSTALLED = u'installed'

A status string indicating that a package or collection is installed and up-to-date.

LANG_PREFIX = u'LANG:'

Collection ID prefix for collections that gather models of a specific language.

NOT_INSTALLED = u'not installed'

A status string indicating that a package or collection is not installed.

PARTIAL = u'partial'

A status string indicating that a collection is partially installed (i.e., only some of its packages are installed.)

STALE = u'out of date'

A status string indicating that a package or collection is corrupt or out-of-date.

TASK_PREFIX = u'TASK:'

Collection ID prefix for collections that gather models of a specific task.

clear_status_cache(id=None)[source]
collections()[source]
corpora()[source]
default_download_dir()[source]

Return the directory to which packages will be downloaded by default. This value can be overridden using the constructor, or on a case-by-case basis using the download_dir argument when calling download().

On all other platforms, the default directory is ~/polyglot_data.

download(info_or_id=None, download_dir=None, quiet=False, force=False, prefix=u'[polyglot_data] ', halt_on_error=True, raise_on_error=False)[source]
download_dir

The default directory to which packages will be downloaded. This defaults to the value returned by default_download_dir(). To override this default on a case-by-case basis, use the download_dir argument when calling download().

get_collection(lang=None, task=None)[source]

Return the collection that represents a specific language or task.

Parameters:
  • lang (string) – Language code.
  • task (string) – Task name.
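
One plausible way the LANG_PREFIX and TASK_PREFIX constants combine with these parameters is sketched below. The helper collection_id is hypothetical, the precedence of lang over task is an assumption, and the example values 'en' and 'ner2' are illustrative only.

```python
LANG_PREFIX = u"LANG:"
TASK_PREFIX = u"TASK:"


def collection_id(lang=None, task=None):
    """Build the identifier of a language or task collection."""
    if lang:
        # Assumption: lang takes precedence when both are given.
        return LANG_PREFIX + lang
    if task:
        return TASK_PREFIX + task
    raise ValueError("pass either lang or task")


by_language = collection_id(lang=u"en")
by_task = collection_id(task=u"ner2")
```
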
incr_download(info_or_id, download_dir=None, force=False)[source]
index()[source]

Return the XML index describing the packages available from the data server. If necessary, this index will be downloaded from the data server.

info(id)[source]

Return the Package or Collection record for the given item.

is_installed(info_or_id, download_dir=None)[source]
is_stale(info_or_id, download_dir=None)[source]
list(download_dir=None, show_packages=False, show_collections=True, header=True, more_prompt=False, skip_installed=False)[source]
models()[source]
packages()[source]
status(info_or_id, download_dir=None)[source]

Return a constant describing the status of the given package or collection. Status can be one of INSTALLED, NOT_INSTALLED, STALE, or PARTIAL.
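
For a collection, the status has to be derived from the statuses of its packages. The rule below is an assumption made for illustration (stale wins, then a mixture counts as partial); collection_status is a hypothetical helper, not a Downloader method.

```python
INSTALLED = u"installed"
NOT_INSTALLED = u"not installed"
STALE = u"out of date"
PARTIAL = u"partial"


def collection_status(package_statuses):
    """Combine the statuses of a collection's packages into one status."""
    statuses = set(package_statuses)
    if STALE in statuses:
        # Any out-of-date package makes the whole collection stale.
        return STALE
    if statuses == {INSTALLED}:
        return INSTALLED
    if INSTALLED in statuses or PARTIAL in statuses:
        # Some, but not all, packages are installed.
        return PARTIAL
    return NOT_INSTALLED
```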

supported_language(lang)[source]

Return True if polyglot supports the language.

Parameters: lang (string) – Language code.
supported_languages(task=None)[source]

Languages that are covered by a specific task.

Parameters: task (string) – Task name.
supported_languages_table(task, cols=3)[source]
supported_tasks(lang=None)[source]

Tasks that are available for a specific language.

Parameters: lang (string) – Language code name.
update(quiet=False, prefix=u'[polyglot_data] ')[source]

Re-download any packages whose status is STALE.

url

The URL for the data server’s index file.

xmlinfo(id)[source]

Return the XML info record for the given item.

class polyglot.downloader.DownloaderMessage[source]

Bases: object

A status message object, used by incr_download to communicate its progress.

class polyglot.downloader.DownloaderShell(dataserver)[source]

Bases: object

run()[source]
class polyglot.downloader.ErrorMessage(package, message)[source]

Bases: polyglot.downloader.DownloaderMessage

Data server encountered an error.

exception polyglot.downloader.ExceptionBase[source]

Bases: exceptions.Exception

General base exception for the downloader module.

class polyglot.downloader.FinishCollectionMessage(collection)[source]

Bases: polyglot.downloader.DownloaderMessage

Data server has finished working on a collection of packages.

class polyglot.downloader.FinishDownloadMessage(package)[source]

Bases: polyglot.downloader.DownloaderMessage

Data server has finished downloading a package.

class polyglot.downloader.FinishPackageMessage(package)[source]

Bases: polyglot.downloader.DownloaderMessage

Data server has finished working on a package.

class polyglot.downloader.FinishUnzipMessage(package)[source]

Bases: polyglot.downloader.DownloaderMessage

Data server has finished unzipping a package.

exception polyglot.downloader.LanguageNotSupported[source]

Bases: polyglot.downloader.ExceptionBase

Raised if the language is not covered by polyglot.

class polyglot.downloader.Package(id, url, name=None, subdir=u'', size=None, filename=u'', task=u'', language=u'', attrs=None, **kw)[source]

Bases: object

A directory entry for a downloadable package. These entries are extracted from the XML index file that is downloaded by Downloader. Each package consists of a single file; but if that file is a zip file, then it can be automatically decompressed when the package is installed.

attrs = None

Extra attributes generated by Google Cloud Storage.

filename = None

The filename that should be used for this package’s file.

static fromcsobj(csobj)[source]
id = None

A unique identifier for this package.

language = None

The language code this package belongs to.

name = None

A string name for this package.

size = None

The filesize (in bytes) of the package file.

subdir = None

The subdirectory where this package should be installed. E.g., 'corpora' or 'taggers'.

task = None

The task this package is serving.

url = None

A URL that can be used to download this package’s file.

class polyglot.downloader.ProgressMessage(progress)[source]

Bases: polyglot.downloader.DownloaderMessage

Indicates how much progress the data server has made.

class polyglot.downloader.SelectDownloadDirMessage(download_dir)[source]

Bases: polyglot.downloader.DownloaderMessage

Indicates what download directory the data server is using.

class polyglot.downloader.StaleMessage(package)[source]

Bases: polyglot.downloader.DownloaderMessage

The package download file is out-of-date or corrupt.

class polyglot.downloader.StartCollectionMessage(collection)[source]

Bases: polyglot.downloader.DownloaderMessage

Data server has started working on a collection of packages.

class polyglot.downloader.StartDownloadMessage(package)[source]

Bases: polyglot.downloader.DownloaderMessage

Data server has started downloading a package.

class polyglot.downloader.StartPackageMessage(package)[source]

Bases: polyglot.downloader.DownloaderMessage

Data server has started working on a package.

class polyglot.downloader.StartUnzipMessage(package)[source]

Bases: polyglot.downloader.DownloaderMessage

Data server has started unzipping a package.

exception polyglot.downloader.TaskNotSupported[source]

Bases: polyglot.downloader.ExceptionBase

Raised if the task is not covered by polyglot.

class polyglot.downloader.UpToDateMessage(package)[source]

Bases: polyglot.downloader.DownloaderMessage

The package download file is already up-to-date.

polyglot.downloader.build_index(root, base_url)[source]

Create a new data.xml index file, by combining the xml description files for various packages and collections. root should be the path to a directory containing the package xml and zip files; and the collection xml files. The root directory is expected to have the following subdirectories:

root/
  packages/ .................. subdirectory for packages
    corpora/ ................. zip & xml files for corpora
    grammars/ ................ zip & xml files for grammars
    taggers/ ................. zip & xml files for taggers
    tokenizers/ .............. zip & xml files for tokenizers
    etc.
  collections/ ............... xml files for collections

For each package, there should be two files: package.zip (where package is the package name) which contains the package itself as a compressed zip file; and package.xml, which is an xml description of the package. The zipfile package.zip should expand to a single subdirectory named package/. The base filename package must match the identifier given in the package’s xml file.

For each collection, there should be a single file collection.xml describing the collection, where collection is the name of the collection.

All identifiers (for both packages and collections) must be unique.

polyglot.downloader.download_gui()[source]
polyglot.downloader.download_shell()[source]
polyglot.downloader.is_writable(path)[source]
polyglot.downloader.unzip(filename, root, verbose=True)[source]

Extract the contents of the zip file filename into the directory root.
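
The same operation can be expressed with the standard zipfile module. The sketch below is not the library's code; unzip_sketch and the demo archive it builds are hypothetical, and the verbose message format is an assumption.

```python
import os
import tempfile
import zipfile


def unzip_sketch(filename, root, verbose=True):
    """Extract every member of the zip file filename into directory root."""
    if verbose:
        print("Unzipping %s." % os.path.basename(filename))
    with zipfile.ZipFile(filename) as archive:
        # extractall creates root and any subdirectories as needed.
        archive.extractall(root)


# Build a tiny archive that expands to a single subdirectory, demo/.
workdir = tempfile.mkdtemp()
archive_path = os.path.join(workdir, "demo.zip")
with zipfile.ZipFile(archive_path, "w") as archive:
    archive.writestr("demo/readme.txt", "hello")

unzip_sketch(archive_path, os.path.join(workdir, "out"), verbose=False)
extracted = os.path.join(workdir, "out", "demo", "readme.txt")
```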

polyglot.downloader.update()[source]

polyglot.load module

polyglot.mixins module

class polyglot.mixins.BlobComparableMixin[source]

Bases: polyglot.mixins.ComparableMixin

Allow blob objects to be comparable with both strings and blobs.

class polyglot.mixins.ComparableMixin[source]

Bases: object

Implements rich operators for an object.
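
A common way to implement such a mixin is to derive all six comparison operators from one key method. The sketch below assumes a hook named _cmpkey(), which is a hypothetical name here; ComparableSketch and Version are illustrative classes, not part of polyglot.

```python
import operator


class ComparableSketch(object):
    """Derive all six rich comparisons from a single _cmpkey() method."""

    def _compare(self, other, op):
        try:
            return op(self._cmpkey(), other._cmpkey())
        except (AttributeError, TypeError):
            # other has no compatible key, so let Python try the reverse.
            return NotImplemented

    def __lt__(self, other):
        return self._compare(other, operator.lt)

    def __le__(self, other):
        return self._compare(other, operator.le)

    def __eq__(self, other):
        return self._compare(other, operator.eq)

    def __ne__(self, other):
        return self._compare(other, operator.ne)

    def __gt__(self, other):
        return self._compare(other, operator.gt)

    def __ge__(self, other):
        return self._compare(other, operator.ge)


class Version(ComparableSketch):
    def __init__(self, major, minor):
        self.major, self.minor = major, minor

    def _cmpkey(self):
        return (self.major, self.minor)


ordered = sorted([Version(1, 10), Version(1, 2), Version(0, 9)])
ordered_keys = [version._cmpkey() for version in ordered]
```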

class polyglot.mixins.StringlikeMixin[source]

Bases: object

Make blob objects behave like Python strings.

Expects classes that use this mixin to have a _strkey() method that returns the string to apply string methods to. Using _strkey() instead of __str__ ensures consistent behavior between Python 2 and 3.
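
The delegation pattern can be sketched as follows: each string method forwards to the string returned by _strkey(), and methods that produce text wrap the result in the same class. StringlikeSketch and Blob are hypothetical classes that implement only a few of the methods listed below.

```python
import sys


class StringlikeSketch(object):
    """Forward a few string methods to the value returned by _strkey()."""

    def _strkey(self):
        raise NotImplementedError("subclasses must return their string")

    def __len__(self):
        return len(self._strkey())

    def find(self, sub, start=0, end=sys.maxsize):
        return self._strkey().find(sub, start, end)

    def startswith(self, prefix, start=0, end=sys.maxsize):
        return self._strkey().startswith(prefix, start, end)

    def upper(self):
        # Return an object of the same class rather than a plain string.
        return self.__class__(self._strkey().upper())


class Blob(StringlikeSketch):
    def __init__(self, raw):
        self.raw = raw

    def _strkey(self):
        return self.raw


blob = Blob(u"Hello World")
shouted = blob.upper()
```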

ends_with(suffix, start=0, end=9223372036854775807)[source]

Returns True if the blob ends with the given suffix.

endswith(suffix, start=0, end=9223372036854775807)[source]

Returns True if the blob ends with the given suffix.

find(sub, start=0, end=9223372036854775807)[source]

Behaves like the built-in str.find() method. Returns an integer, the index of the first occurrence of the substring argument sub in the sub-string given by [start:end].

format(*args, **kwargs)[source]

Perform a string formatting operation, like the built-in str.format(*args, **kwargs). Returns a blob object.

index(sub, start=0, end=9223372036854775807)[source]

Like blob.find() but raise ValueError when the substring is not found.

join(iterable)[source]

Behaves like the built-in str.join(iterable) method, except returns a blob object.

Returns a blob which is the concatenation of the strings or blobs in the iterable.

lower()[source]

Like str.lower(), returns new object with all lower-cased characters.

replace(old, new, count=9223372036854775807)[source]

Return a new blob object with all occurrences of old replaced by new.

rfind(sub, start=0, end=9223372036854775807)[source]

Behaves like the built-in str.rfind() method. Returns an integer, the index of the last (right-most) occurrence of the substring argument sub in the sub-sequence given by [start:end].

rindex(sub, start=0, end=9223372036854775807)[source]

Like blob.rfind() but raise ValueError when substring is not found.

split(sep=None, maxsplit=9223372036854775807)[source]

Behaves like the built-in str.split().

starts_with(prefix, start=0, end=9223372036854775807)[source]

Returns True if the blob starts with the given prefix.

startswith(prefix, start=0, end=9223372036854775807)[source]

Returns True if the blob starts with the given prefix.

strip(chars=None)[source]

Behaves like the built-in str.strip([chars]) method. Returns an object with leading and trailing whitespace removed.

title()[source]

Returns a blob object with the text in title-case.

upper()[source]

Like str.upper(), returns new object with all upper-cased characters.

polyglot.mixins.implements_to_string(cls)[source]

Class decorator that renames __str__ to __unicode__ and replaces __str__ with a version that returns UTF-8.

polyglot.text module

polyglot.utils module

Collection of general utilities.

polyglot.utils.pretty_list(items, cols=3)[source]

Module contents

class polyglot.Sequence(text)[source]

Bases: object

Text with indices that indicate boundaries.

empty()[source]
split(sequence)[source]

Split into subsequences according to sequence.

text
tokens()[source]

Returns segmented text after stripping whitespace.

class polyglot.TokenSequence[source]

Bases: list

A list of tokens.

Parameters: tokens (list) – list of symbols.
sliding_window(width=2, padding=None)[source]