polyglot package¶
Subpackages¶
Submodules¶
polyglot.base module¶
Basic data types.
-
class
polyglot.base.
Sequence
(text)[source]¶ Bases:
object
Text with indices that indicate boundaries.
-
text
¶
-
-
class
polyglot.base.
TextFile
(file, delimiter=u'\n')[source]¶ Bases:
object
Wrapper around text files.
It uses io.open to ensure that text files are read as unicode. It has an iterator that supports an arbitrary delimiter instead of only newlines.
delimiter
¶ string – A string that defines the limit of each chunk.
-
file
¶ string – A path to a file.
-
buf
¶ StringIO – a buffer to store the results of peeking into the file.
-
apply
(func, workers=1, job_size=10000)[source]¶ Apply func to lines of text, either in parallel or sequentially.
Parameters: func – a function that takes a list of lines.
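The batching behaviour described for apply() can be sketched as follows. This is a minimal illustration, not polyglot's implementation: the helper names batches and apply_to_lines are hypothetical, and the use of a thread pool for workers > 1 is an assumption (the library may use processes instead).

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import islice


def batches(lines, job_size):
    """Group an iterable of lines into lists of at most job_size items."""
    it = iter(lines)
    while True:
        batch = list(islice(it, job_size))
        if not batch:
            return
        yield batch


def apply_to_lines(func, lines, workers=1, job_size=10000):
    """Call func on each batch of lines; use a thread pool when workers > 1.

    Hypothetical stand-in for TextFile.apply; func receives a list of lines.
    """
    jobs = batches(lines, job_size)
    if workers <= 1:
        return [func(batch) for batch in jobs]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in the order the batches were submitted.
        return list(pool.map(func, jobs))


# Count the lines in each batch of two.
counts = apply_to_lines(len, ["a", "b", "c", "d", "e"], workers=2, job_size=2)
```

Here func is the built-in len, so each result is the size of one batch.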
-
iter_delimiter
(byte_size=8192)[source]¶ Generalization of the default file iterator, which is delimited by '\n'.
- Note:
- The newline string can be arbitrarily long; it need not be restricted to a single character. You can also set the read size and control whether or not the newline string is left on the end of the iterated lines. Setting newline to '\0' is particularly good for use with an input file created with something like "os.popen('find -print0')".
- Args:
- byte_size (integer): Number of bytes to be read at each time.
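Delimiter-based iteration of this kind can be sketched as below. The function name iter_delimited is hypothetical and the code is an illustration under stated assumptions, not the library's source: it reads byte_size characters at a time, keeps the unfinished tail between reads, and strips the delimiter from each chunk.

```python
import io


def iter_delimited(stream, delimiter=u"\n", byte_size=8192):
    """Yield the chunks of stream that are separated by delimiter."""
    partial = u""
    while True:
        block = stream.read(byte_size)
        if not block:
            break
        partial += block
        pieces = partial.split(delimiter)
        # The last piece may be incomplete, so keep it for the next read.
        partial = pieces.pop()
        for piece in pieces:
            yield piece
    if partial:
        # Trailing text that was not followed by a delimiter.
        yield partial


# NUL-delimited names, in the style of `find -print0` output.
sample = io.StringIO(u"a.txt\0dir/b.txt\0c.txt\0")
names = list(iter_delimited(sample, delimiter=u"\0", byte_size=4))
```

Because the tail is carried over between reads, a delimiter longer than one character is still recognised when it straddles two reads.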
-
-
class
polyglot.base.
TextFiles
(files, delimiter=u'\n')[source]¶ Bases:
polyglot.base.TextFile
Interface for a sequence of files.
-
names
¶
-
polyglot.decorators module¶
polyglot.downloader module¶
The Polyglot corpus and module downloader. This module defines several interfaces which can be used to download corpora, models, and other data packages that can be used with polyglot.
Downloading Packages¶
If called with no arguments, download()
will display an interactive
interface which can be used to download and install new packages.
If Tkinter is available, then a graphical interface will be shown,
otherwise a simple text interface will be provided.
Individual packages can be downloaded by calling the download()
function with a single argument, giving the package identifier for the
package that should be downloaded:
>>> download('treebank')
[polyglot_data] Downloading package 'treebank'...
[polyglot_data] Unzipping corpora/treebank.zip.
Polyglot also provides a number of “package collections”, consisting of
a group of related packages. To download all packages in a
collection, simply call download()
with the collection’s
identifier:
>>> download('all-corpora')
[polyglot_data] Downloading package 'abc'...
[polyglot_data] Unzipping corpora/abc.zip.
[polyglot_data] Downloading package 'alpino'...
[polyglot_data] Unzipping corpora/alpino.zip.
...
[polyglot_data] Downloading package 'words'...
[polyglot_data] Unzipping corpora/words.zip.
Download Directory¶
By default, packages are installed in either a system-wide directory
(if Python has sufficient access to write to it); or in the current
user’s home directory. However, the download_dir
argument may be
used to specify a different installation target, if desired.
See Downloader.default_download_dir()
for a more detailed
description of how the default download directory is chosen.
Polyglot Download Server¶
Before downloading any packages, the corpus and module downloader
contacts the Polyglot download server, to retrieve an index file
describing the available packages. By default, this index file is
loaded from http://nltk.googlecode.com/svn/trunk/polyglot_data/index.xml
.
If necessary, it is possible to create a new Downloader
object,
specifying a different URL for the package index file.
Usage:
python polyglot/downloader.py [-d DATADIR] [-q] [-f] [-k] PACKAGE_IDS
or:
python -m polyglot.downloader [-d DATADIR] [-q] [-f] [-k] PACKAGE_IDS
-
class
polyglot.downloader.
Collection
(id, children, name=None, **kw)[source]¶ Bases:
object
A directory entry for a collection of downloadable packages. These entries are extracted from the XML index file that is downloaded by
Downloader
.
children
= None¶ A list of the
Collections
or Packages
directly contained by this collection.
-
id
= None¶ A unique identifier for this collection.
-
name
= None¶ A string name for this collection.
-
packages
= None¶ A list of
Packages
contained by this collection or any collections it recursively contains.
-
-
class
polyglot.downloader.
Downloader
(server_index_url=None, source=None, download_dir=None)[source]¶ Bases:
object
A class used to access the Polyglot data server, which can be used to download corpora and other data packages.
-
DEFAULT_SOURCE
= u'mirror'¶ The source for index and other data files. Two values are supported: ‘mirror’ or ‘google’.
For ‘mirror’, DEFAULT_URL should be the URL prefix of the mirrored directory, such as ‘http://address.of.mirror/dir/’, and the downloader expects an index file named ‘index.json’ there.
For ‘google’, DEFAULT_URL should be the Google Cloud bucket, and the downloader expects the index from the Google API.
Set DEFAULT_URL accordingly.
-
DEFAULT_URL
= u'http://whoisbigger.com/polyglot/'¶ The default URL for the Polyglot data server’s index. An alternative URL can be specified when creating a new
Downloader
object. For ‘google’ as DEFAULT_SOURCE, ‘polyglot-models’ is the default place. For ‘mirror’ as DEFAULT_SOURCE, use a proper mirror.
-
INDEX_TIMEOUT
= 3600¶ The amount of time after which the cached copy of the data server index will be considered ‘stale,’ and will be re-downloaded.
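The staleness rule for the cached index can be sketched as a simple age check. This assumes INDEX_TIMEOUT is measured in seconds, which the documentation does not state; index_is_stale is a hypothetical helper, not part of the Downloader API.

```python
import time

INDEX_TIMEOUT = 3600  # documented default; assumed here to be in seconds


def index_is_stale(fetched_at, now=None, timeout=INDEX_TIMEOUT):
    """Return True when the cached index is older than timeout."""
    now = time.time() if now is None else now
    return (now - fetched_at) > timeout
```

A downloader following this rule would fetch the index again whenever index_is_stale returns True for the time of the last fetch.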
-
INSTALLED
= u'installed'¶ A status string indicating that a package or collection is installed and up-to-date.
-
LANG_PREFIX
= u'LANG:'¶ Collection ID prefix for collections that gather models of a specific language.
-
NOT_INSTALLED
= u'not installed'¶ A status string indicating that a package or collection is not installed.
-
PARTIAL
= u'partial'¶ A status string indicating that a collection is partially installed (i.e., only some of its packages are installed.)
-
STALE
= u'out of date'¶ A status string indicating that a package or collection is corrupt or out-of-date.
-
TASK_PREFIX
= u'TASK:'¶ Collection ID prefix for collections that gather models of a specific task.
-
default_download_dir
()[source]¶ Return the directory to which packages will be downloaded by default. This value can be overridden using the constructor, or on a case-by-case basis using the download_dir argument when calling download().
On Windows, the default download directory is PYTHONHOME/lib/nltk, where PYTHONHOME is the directory containing Python, e.g. C:\Python25.
On all other platforms, the default directory is the first of the following which exists or which can be created with write permission: /usr/share/polyglot_data, /usr/local/share/polyglot_data, /usr/lib/polyglot_data, /usr/local/lib/polyglot_data, ~/polyglot_data.
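The "first directory that exists or can be created" rule for non-Windows platforms can be sketched as follows. The function first_usable_dir is hypothetical and only approximates the description above; in particular, it treats "can be created" as "the parent directory exists and is writable", which is an assumption.

```python
import os
import tempfile


def first_usable_dir(candidates):
    """Return the first candidate that exists or could be created, else None."""
    for path in candidates:
        path = os.path.expanduser(path)
        if os.path.isdir(path):
            return path
        parent = os.path.dirname(path.rstrip(os.sep)) or os.curdir
        # Assumption: a directory can be created if its parent is writable.
        if os.path.isdir(parent) and os.access(parent, os.W_OK):
            return path
    return None


# Demonstration in a scratch directory instead of the real system paths.
base = tempfile.mkdtemp()
chosen = first_usable_dir([
    os.path.join(base, "missing", "polyglot_data"),  # parent does not exist
    os.path.join(base, "polyglot_data"),             # parent is writable
])
```

With the documented candidates, the list would start at /usr/share/polyglot_data and end with ~/polyglot_data.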
-
download
(info_or_id=None, download_dir=None, quiet=False, force=False, prefix=u'[polyglot_data] ', halt_on_error=True, raise_on_error=False)[source]¶
-
download_dir
¶ The default directory to which packages will be downloaded. This defaults to the value returned by default_download_dir(). To override this default on a case-by-case basis, use the download_dir argument when calling download().
-
get_collection
(lang=None, task=None)[source]¶ Return the collection that represents a specific language or task.
Parameters: - lang (string) – Language code.
- task (string) – Task name.
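One plausible way to relate lang and task to collection identifiers is to prepend LANG_PREFIX or TASK_PREFIX. The sketch below is an assumption about how such identifiers are formed, not the library's code; collection_id is a hypothetical helper, the task name used is only an example, and giving lang priority when both arguments are passed is an arbitrary choice.

```python
LANG_PREFIX = u"LANG:"  # documented Downloader constant
TASK_PREFIX = u"TASK:"  # documented Downloader constant


def collection_id(lang=None, task=None):
    """Build a collection identifier from a language code or a task name."""
    if lang is not None:
        return LANG_PREFIX + lang
    if task is not None:
        return TASK_PREFIX + task
    raise ValueError("either lang or task is required")


english = collection_id(lang=u"en")
tagging = collection_id(task=u"pos2")  # hypothetical task name
```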
-
index
()[source]¶ Return the XML index describing the packages available from the data server. If necessary, this index will be downloaded from the data server.
-
list
(download_dir=None, show_packages=False, show_collections=True, header=True, more_prompt=False, skip_installed=False)[source]¶
-
status
(info_or_id, download_dir=None)[source]¶ Return a constant describing the status of the given package or collection. Status can be one of INSTALLED, NOT_INSTALLED, STALE, or PARTIAL.
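For a collection, a status has to be derived from the statuses of the packages it contains. The sketch below shows one way to combine them using the documented constant values; the precedence rules (STALE wins, a mix gives PARTIAL) are an assumption, and collection_status is a hypothetical helper.

```python
INSTALLED = u"installed"
NOT_INSTALLED = u"not installed"
PARTIAL = u"partial"
STALE = u"out of date"


def collection_status(package_statuses):
    """Combine the statuses of a collection's children into one status."""
    statuses = set(package_statuses)
    if STALE in statuses:
        return STALE  # any stale child makes the whole collection stale
    if statuses == {INSTALLED}:
        return INSTALLED
    if INSTALLED in statuses or PARTIAL in statuses:
        return PARTIAL  # only some of the children are installed
    return NOT_INSTALLED


mixed = collection_status([INSTALLED, NOT_INSTALLED])
```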
-
supported_language
(lang)[source]¶ Return True if polyglot supports the language.
Parameters: lang (string) – Language code.
-
supported_languages
(task=None)[source]¶ Languages that are covered by a specific task.
Parameters: task (string) – Task name.
-
supported_tasks
(lang=None)[source]¶ Tasks that are covered for a specific language.
Parameters: lang (string) – Language code name.
-
update
(quiet=False, prefix=u'[polyglot_data] ')[source]¶ Re-download any packages whose status is STALE.
-
url
¶ The URL for the data server’s index file.
-
-
class
polyglot.downloader.
DownloaderMessage
[source]¶ Bases:
object
A status message object, used by
incr_download
to communicate its progress.
-
class
polyglot.downloader.
ErrorMessage
(package, message)[source]¶ Bases:
polyglot.downloader.DownloaderMessage
Data server encountered an error
-
exception
polyglot.downloader.
ExceptionBase
[source]¶ Bases:
exceptions.Exception
General base exception for the downloader module.
-
class
polyglot.downloader.
FinishCollectionMessage
(collection)[source]¶ Bases:
polyglot.downloader.DownloaderMessage
Data server has finished working on a collection of packages.
-
class
polyglot.downloader.
FinishDownloadMessage
(package)[source]¶ Bases:
polyglot.downloader.DownloaderMessage
Data server has finished downloading a package.
-
class
polyglot.downloader.
FinishPackageMessage
(package)[source]¶ Bases:
polyglot.downloader.DownloaderMessage
Data server has finished working on a package.
-
class
polyglot.downloader.
FinishUnzipMessage
(package)[source]¶ Bases:
polyglot.downloader.DownloaderMessage
Data server has finished unzipping a package.
-
exception
polyglot.downloader.
LanguageNotSupported
[source]¶ Bases:
polyglot.downloader.ExceptionBase
Raised if the language is not covered by polyglot.
-
class
polyglot.downloader.
Package
(id, url, name=None, subdir=u'', size=None, filename=u'', task=u'', language=u'', attrs=None, **kw)[source]¶ Bases:
object
A directory entry for a downloadable package. These entries are extracted from the XML index file that is downloaded by
Downloader
. Each package consists of a single file; but if that file is a zip file, then it can be automatically decompressed when the package is installed.
attrs
= None¶ Extra attributes generated by Google Cloud Storage.
-
filename
= None¶ The filename that should be used for this package’s file.
-
id
= None¶ A unique identifier for this package.
-
language
= None¶ The language code this package belongs to.
-
name
= None¶ A string name for this package.
-
size
= None¶ The filesize (in bytes) of the package file.
-
subdir
= None¶ The subdirectory where this package should be installed. E.g.,
'corpora'
or 'taggers'
.
-
task
= None¶ The task this package is serving.
-
url
= None¶ A URL that can be used to download this package’s file.
-
-
class
polyglot.downloader.
ProgressMessage
(progress)[source]¶ Bases:
polyglot.downloader.DownloaderMessage
Indicates how much progress the data server has made
-
class
polyglot.downloader.
SelectDownloadDirMessage
(download_dir)[source]¶ Bases:
polyglot.downloader.DownloaderMessage
Indicates what download directory the data server is using
-
class
polyglot.downloader.
StaleMessage
(package)[source]¶ Bases:
polyglot.downloader.DownloaderMessage
The package download file is out-of-date or corrupt
-
class
polyglot.downloader.
StartCollectionMessage
(collection)[source]¶ Bases:
polyglot.downloader.DownloaderMessage
Data server has started working on a collection of packages.
-
class
polyglot.downloader.
StartDownloadMessage
(package)[source]¶ Bases:
polyglot.downloader.DownloaderMessage
Data server has started downloading a package.
-
class
polyglot.downloader.
StartPackageMessage
(package)[source]¶ Bases:
polyglot.downloader.DownloaderMessage
Data server has started working on a package.
-
class
polyglot.downloader.
StartUnzipMessage
(package)[source]¶ Bases:
polyglot.downloader.DownloaderMessage
Data server has started unzipping a package.
-
exception
polyglot.downloader.
TaskNotSupported
[source]¶ Bases:
polyglot.downloader.ExceptionBase
Raised if the task is not covered by polyglot.
-
class
polyglot.downloader.
UpToDateMessage
(package)[source]¶ Bases:
polyglot.downloader.DownloaderMessage
The package download file is already up-to-date
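The message classes above suggest a consumer that dispatches on message type. The sketch below uses local stand-in classes that mirror the documented constructor signatures; the attribute names, the fake_download generator, and the treatment of progress as a percentage are all assumptions made for illustration, not the behaviour of polyglot.downloader.

```python
class DownloaderMessage(object):
    """Stand-in for the documented base message class."""


class StartDownloadMessage(DownloaderMessage):
    def __init__(self, package):
        self.package = package


class ProgressMessage(DownloaderMessage):
    def __init__(self, progress):
        self.progress = progress


class FinishDownloadMessage(DownloaderMessage):
    def __init__(self, package):
        self.package = package


class ErrorMessage(DownloaderMessage):
    def __init__(self, package, message):
        self.package = package
        self.message = message


def fake_download(package):
    """Hypothetical generator that yields messages as a download proceeds."""
    yield StartDownloadMessage(package)
    for percent in (50, 100):
        yield ProgressMessage(percent)
    yield FinishDownloadMessage(package)


def summarize(messages):
    """Turn a stream of messages into readable log lines."""
    log = []
    for msg in messages:
        if isinstance(msg, ProgressMessage):
            log.append("progress %d" % msg.progress)
        elif isinstance(msg, ErrorMessage):
            log.append("error: %s" % msg.message)
        else:
            log.append(type(msg).__name__)
    return log


log = summarize(fake_download("treebank"))
```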
-
polyglot.downloader.
build_index
(root, base_url)[source]¶ Create a new data.xml index file, by combining the xml description files for various packages and collections.
root should be the path to a directory containing the package xml and zip files, and the collection xml files. The root directory is expected to have the following subdirectories:
    root/
      packages/ .................. subdirectory for packages
        corpora/ ................. zip & xml files for corpora
        grammars/ ................ zip & xml files for grammars
        taggers/ ................. zip & xml files for taggers
        tokenizers/ .............. zip & xml files for tokenizers
        etc.
      collections/ ............... xml files for collections
For each package, there should be two files: package.zip (where package is the package name), which contains the package itself as a compressed zip file; and package.xml, which is an xml description of the package. The zipfile package.zip should expand to a single subdirectory named package/. The base filename package must match the identifier given in the package’s xml file.
For each collection, there should be a single file collection.xml describing the collection, where collection is the name of the collection.
All identifiers (for both packages and collections) must be unique.
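The layout rules for packages can be checked with a short directory walk. This is a sketch under stated assumptions rather than what build_index does internally: find_packages is a hypothetical helper that pairs each package xml file with its zip file and rejects duplicate identifiers, taking the identifier to be the base filename.

```python
import os
import tempfile


def find_packages(root):
    """Return {package_id: subdir} for each xml/zip pair under root/packages."""
    found = {}
    packages_dir = os.path.join(root, "packages")
    for subdir in sorted(os.listdir(packages_dir)):
        full = os.path.join(packages_dir, subdir)
        if not os.path.isdir(full):
            continue
        for name in sorted(os.listdir(full)):
            base, ext = os.path.splitext(name)
            if ext != ".xml":
                continue
            if not os.path.exists(os.path.join(full, base + ".zip")):
                raise ValueError("missing zip for package %r" % base)
            if base in found:
                raise ValueError("duplicate identifier %r" % base)
            found[base] = subdir
    return found


# Build a tiny example tree with one corpus package.
root = tempfile.mkdtemp()
corpora = os.path.join(root, "packages", "corpora")
os.makedirs(corpora)
for filename in ("treebank.xml", "treebank.zip"):
    open(os.path.join(corpora, filename), "w").close()

layout = find_packages(root)
```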
polyglot.load module¶
polyglot.mixins module¶
-
class
polyglot.mixins.
BlobComparableMixin
[source]¶ Bases:
polyglot.mixins.ComparableMixin
Allow blob objects to be comparable with both strings and blobs.
-
class
polyglot.mixins.
ComparableMixin
[source]¶ Bases:
object
Implements rich operators for an object.
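A mixin of this kind typically routes every rich comparison through a single key. The sketch below assumes a hook method named _cmpkey(), which is not confirmed by this page; ComparableSketch and Version are hypothetical names used only for illustration.

```python
import operator


class ComparableSketch(object):
    """Derive all rich comparison operators from one comparison key."""

    def _cmpkey(self):
        # Assumed hook: subclasses return the value to compare on.
        raise NotImplementedError

    def _compare(self, other, op):
        try:
            return op(self._cmpkey(), other._cmpkey())
        except (AttributeError, TypeError):
            # Let Python try the reflected operation or fall back.
            return NotImplemented

    def __eq__(self, other):
        return self._compare(other, operator.eq)

    def __ne__(self, other):
        return self._compare(other, operator.ne)

    def __lt__(self, other):
        return self._compare(other, operator.lt)

    def __le__(self, other):
        return self._compare(other, operator.le)

    def __gt__(self, other):
        return self._compare(other, operator.gt)

    def __ge__(self, other):
        return self._compare(other, operator.ge)


class Version(ComparableSketch):
    def __init__(self, major, minor):
        self.major, self.minor = major, minor

    def _cmpkey(self):
        return (self.major, self.minor)


older_first = sorted([Version(1, 10), Version(1, 2)])
```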
-
class
polyglot.mixins.
StringlikeMixin
[source]¶ Bases:
object
Make blob objects behave like Python strings.
Expects classes that use this mixin to have a _strkey() method that returns the string to apply string methods to. Using _strkey() instead of __str__ ensures consistent behavior between Python 2 and 3.
-
ends_with
(suffix, start=0, end=9223372036854775807)[source]¶ Returns True if the blob ends with the given suffix.
-
endswith
(suffix, start=0, end=9223372036854775807)[source]¶ Returns True if the blob ends with the given suffix.
-
find
(sub, start=0, end=9223372036854775807)[source]¶ Behaves like the built-in str.find() method. Returns an integer, the index of the first occurrence of the substring argument sub in the sub-string given by [start:end].
-
format
(*args, **kwargs)[source]¶ Perform a string formatting operation, like the built-in str.format(*args, **kwargs). Returns a blob object.
-
index
(sub, start=0, end=9223372036854775807)[source]¶ Like blob.find() but raises ValueError when the substring is not found.
-
join
(iterable)[source]¶ Behaves like the built-in str.join(iterable) method, except returns a blob object.
Returns a blob which is the concatenation of the strings or blobs in the iterable.
-
replace
(old, new, count=9223372036854775807)[source]¶ Return a new blob object with all occurrences of old replaced by new.
-
rfind
(sub, start=0, end=9223372036854775807)[source]¶ Behaves like the built-in str.rfind() method. Returns an integer, the index of the last (right-most) occurrence of the substring argument sub in the sub-sequence given by [start:end].
-
rindex
(sub, start=0, end=9223372036854775807)[source]¶ Like blob.rfind() but raises ValueError when the substring is not found.
-
starts_with
(prefix, start=0, end=9223372036854775807)[source]¶ Returns True if the blob starts with the given prefix.
-
startswith
(prefix, start=0, end=9223372036854775807)[source]¶ Returns True if the blob starts with the given prefix.
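The delegation pattern behind these methods can be sketched with a small stand-in class. StringlikeSketch and Blob are hypothetical names, and only a few of the documented methods are shown; the default end value 9223372036854775807 in the signatures above equals sys.maxsize on a 64-bit build, which is what the sketch uses.

```python
import sys


class StringlikeSketch(object):
    """Forward string methods to the value returned by _strkey()."""

    def _strkey(self):
        raise NotImplementedError

    def find(self, sub, start=0, end=sys.maxsize):
        return self._strkey().find(sub, start, end)

    def index(self, sub, start=0, end=sys.maxsize):
        return self._strkey().index(sub, start, end)

    def startswith(self, prefix, start=0, end=sys.maxsize):
        return self._strkey().startswith(prefix, start, end)

    # Alias, mirroring the starts_with/startswith pair documented above.
    starts_with = startswith

    def replace(self, old, new, count=sys.maxsize):
        # Wrap the result so callers get a blob back, not a plain string.
        return self.__class__(self._strkey().replace(old, new, count))


class Blob(StringlikeSketch):
    def __init__(self, text):
        self.text = text

    def _strkey(self):
        return self.text


blob = Blob(u"hello world")
position = blob.find(u"world")
replaced = blob.replace(u"world", u"there")
```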
-
polyglot.text module¶
polyglot.utils module¶
Collection of general utilities.