polyglot.mapping package¶

Subpackages¶

Submodules¶

polyglot.mapping.base module¶

Supports word embeddings.

class polyglot.mapping.base.CountedVocabulary(word_count=None)[source]¶

Bases: polyglot.mapping.base.OrderedVocabulary

List of words and counts sorted according to word count.

classmethod from_textfile(textfile, workers=1, job_size=1000)[source]¶

Count the words that appear in a text file.

Parameters:

textfile (string) – The name of the text file, or a TextFile object.
min_count (integer) – Minimum number of times a word/token must appear in the document to be considered part of the vocabulary.
workers (integer) – Number of parallel workers that read the file simultaneously.
job_size (integer) – Size of the batch sent to each worker.
most_frequent (integer) – If no min_count is specified, consider the k most frequent words for the vocabulary.

Returns:

A vocabulary of the most frequent words that appear in the document.
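
The batching described by workers and job_size can be illustrated with a small standalone sketch. It is not polyglot's implementation: the helper names (count_batch, batches, count_words), the whitespace tokenization, and the use of a thread pool are all assumptions made for illustration.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from itertools import islice


def count_batch(lines):
    """Count whitespace-separated tokens in one batch of lines."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts


def batches(lines, job_size):
    """Yield successive lists of at most job_size lines."""
    it = iter(lines)
    while True:
        batch = list(islice(it, job_size))
        if not batch:
            return
        yield batch


def count_words(lines, workers=1, job_size=1000):
    """Count tokens across all lines, handing one batch to each worker task."""
    total = Counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Each partial result is a Counter; merging adds the counts together.
        for partial in pool.map(count_batch, batches(lines, job_size)):
            total.update(partial)
    return total


counts = count_words(["the cat sat", "the dog sat", "the end"], workers=2, job_size=2)
```

A smaller job_size spreads work more evenly across workers at the cost of more merge steps.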

static from_textfiles(files, workers=1, job_size=1000)[source]¶

static from_vocabfile(filename)[source]¶

Construct a CountedVocabulary out of a vocabulary file.

Note

The file has one entry per line, each a word followed by its count: "word1 count1" on the first line, "word2 count2" on the next.
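
As a rough illustration of reading such a file, assuming each line holds one word and one integer count separated by whitespace, the parsing step could look like this. The function read_counted_vocab is a hypothetical helper, not part of polyglot.

```python
import io


def read_counted_vocab(handle):
    """Parse 'word count' lines into a dict mapping word -> integer count."""
    word_count = {}
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        word, count = line.split()
        word_count[word] = int(count)
    return word_count


# An in-memory stand-in for a vocabulary file.
sample = io.StringIO("the 3\nsat 2\ncat 1\n")
word_count = read_counted_vocab(sample)
```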

getstate()[source]¶

min_count(n=1)[source]¶

Returns a vocabulary after eliminating the words that appear fewer than n times.

Parameters:	n (integer) – specifies the minimum word frequency allowed.

most_frequent(k)[source]¶

Returns a vocabulary with the most frequent k words.

Parameters:	k (integer) – specifies the top k most frequent words to be returned.
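
The two filters above can be sketched on a plain dictionary of counts. These standalone functions only mirror the documented behavior; they are not polyglot's code, and the alphabetical tie-break between equally frequent words is an assumption.

```python
def min_count(word_count, n=1):
    """Keep only the words that appear at least n times."""
    return {w: c for w, c in word_count.items() if c >= n}


def most_frequent(word_count, k):
    """Keep the k words with the highest counts, most frequent first."""
    # Sort by descending count; ties are broken alphabetically (an assumption).
    ranked = sorted(word_count.items(), key=lambda wc: (-wc[1], wc[0]))
    return dict(ranked[:k])


word_count = {"the": 3, "sat": 2, "cat": 1, "dog": 1}
frequent = min_count(word_count, n=2)
top3 = most_frequent(word_count, k=3)
```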

class polyglot.mapping.base.OrderedVocabulary(words=None)[source]¶

Bases: polyglot.mapping.base.VocabularyBase

An ordered list of words/tokens according to their frequency.

Note

The words are assumed to be sorted by frequency, with the most frequent words first in the list.

word_id¶: dictionary – Mapping from words to IDs.

id_word¶: dictionary – A reverse map of word_id.
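
A minimal sketch of how the two maps relate, assuming IDs are zero-based positions in the ordered word list (the numbering scheme is an assumption here, not something this page states):

```python
# Words assumed to be already ordered, most frequent first.
words = ["the", "sat", "cat"]

# Forward map: word -> ID (its position in the list).
word_id = {w: i for i, w in enumerate(words)}

# Reverse map: ID -> word.
id_word = {i: w for w, i in word_id.items()}
```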

most_frequent(k)[source]¶

Returns a vocabulary with the most frequent k words.

Parameters:	k (integer) – specifies the top k most frequent words to be returned.

class polyglot.mapping.base.VocabularyBase(words=None)[source]¶

Bases: object

A set of words/tokens that have consistent IDs.

Note

Words will be sorted according to their lexicographic order.

word_id¶: dictionary – Mapping from words to IDs.

id_word¶: dictionary – A reverse map of word_id.

classmethod from_vocabfile(filename)[source]¶

Construct a vocabulary out of a vocabulary file.

Note

The file has one word per line: word1 on the first line, word2 on the next.

get(k, default=None)[source]¶

getstate()[source]¶

sanitize_words(words)[source]¶: Guarantees that all textual symbols are unicode.

Note

We do not convert numbers, only strings to unicode. We assume that the strings are encoded in utf-8.
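
The rule in the note can be sketched as follows. This standalone function shares the method's name for readability but is only an illustration under Python 3 semantics (bytes versus str); it is not the library's implementation.

```python
def sanitize_words(words):
    """Decode byte strings as UTF-8; leave text and non-string items untouched."""
    cleaned = []
    for w in words:
        if isinstance(w, bytes):
            cleaned.append(w.decode("utf-8"))
        else:
            cleaned.append(w)  # str values and numbers pass through unchanged
    return cleaned


result = sanitize_words([b"caf\xc3\xa9", "word", 42])
```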

words¶: Ordered list of words according to their IDs.

polyglot.mapping.base.count(lines)[source]¶: Counts the word frequencies in a list of sentences.

Note

This is a helper function for parallel execution of the Vocabulary.from_text method.

polyglot.mapping.embeddings module¶

Defines classes related to mapping vocabulary to n-dimensional points.

class polyglot.mapping.embeddings.Embedding(vocabulary, vectors)[source]¶

Bases: object

Maps a vocabulary to d-dimensional points.

apply_expansion(expansion)[source]¶: Apply a vocabulary expansion to the current embeddings.

distances(word, words)[source]¶

Calculate pairwise Euclidean distances between word and each entry of words.

Parameters:	word (string) – single word. words (list) – list of strings.
Returns:	numpy array of the distances.

Note

L2 metric is used to calculate distances.
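
For reference, the L2 distance between vectors u and v is the square root of the sum of squared coordinate differences. The sketch below computes it with the standard library only; the vectors dictionary and the l2_distances helper are hypothetical stand-ins for an embedding, and the result is a plain list rather than a numpy array.

```python
import math


def l2_distances(vectors, word, words):
    """Return the Euclidean distance from `word` to each entry of `words`."""
    target = vectors[word]
    return [math.dist(target, vectors[w]) for w in words]


# Toy two-dimensional embedding (values chosen for easy arithmetic).
vectors = {"king": [0.0, 0.0], "queen": [3.0, 4.0], "apple": [6.0, 8.0]}
dists = l2_distances(vectors, "king", ["queen", "apple"])
```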

static from_gensim(model)[source]¶

static from_glove(fname)[source]¶

static from_word2vec(fname, fvocab=None, binary=False)[source]¶

Load the input-hidden weight matrix from the original C word2vec-tool format.

Note that the information stored in the file is incomplete (the binary tree is missing), so while you can query for word similarity etc., you cannot continue training with a model loaded this way.

binary is a boolean indicating whether the data is in binary word2vec format. Word counts are read from fvocab filename, if set (this is the file generated by -save-vocab flag of the original C tool).
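
For orientation, the non-binary word2vec format is plain text: a header line with the vocabulary size and vector dimension, followed by one line per word holding the word and its vector components. The reader below is a simplified sketch of that layout, not the loader used by this method; read_word2vec_text is a hypothetical name.

```python
import io


def read_word2vec_text(handle):
    """Read the text word2vec layout: 'vocab_size dim', then 'word v1 ... vd'."""
    vocab_size, dim = (int(x) for x in handle.readline().split())
    words, vectors = [], []
    for _ in range(vocab_size):
        parts = handle.readline().rstrip().split(" ")
        words.append(parts[0])
        vectors.append([float(x) for x in parts[1:dim + 1]])
    return words, vectors


# An in-memory stand-in for a small text-format file.
sample = io.StringIO("2 3\nthe 0.1 0.2 0.3\ncat 1.0 2.0 3.0\n")
words, vectors = read_word2vec_text(sample)
```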

static from_word2vec_vocab(fvocab)[source]¶

get(k, default=None)[source]¶

static load(fname)[source]¶: Load an embedding dump generated by save.

most_frequent(k, inplace=False)[source]¶: Keep only the k most frequent words in the embeddings.

nearest_neighbors(word, top_k=10)[source]¶

Return the top_k words nearest to the given word.

Parameters:	word (string) – single word. top_k (integer) – decides how many neighbors to report.
Returns:	A list of words sorted by distance, closest first.

Note

L2 metric is used to calculate distances.
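
The lookup can be pictured as a sort by L2 distance followed by a cut at top_k. The standalone sketch below uses a toy dictionary of vectors; excluding the query word from its own neighbor list is an assumption, and the function is not polyglot's implementation.

```python
import math


def nearest_neighbors(vectors, word, top_k=10):
    """Return up to top_k other words, ordered by L2 distance to `word`."""
    target = vectors[word]
    others = [w for w in vectors if w != word]  # assumption: skip the query itself
    others.sort(key=lambda w: math.dist(target, vectors[w]))
    return others[:top_k]


vectors = {
    "king": [0.0, 0.0],
    "queen": [1.0, 0.0],
    "prince": [0.0, 2.0],
    "apple": [5.0, 5.0],
}
neighbors = nearest_neighbors(vectors, "king", top_k=2)
```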

normalize_words(ord=2, inplace=False)[source]¶

Normalize embeddings matrix row-wise.

Parameters:	ord – normalization order. Possible values {1, 2, ‘inf’, ‘-inf’}
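
To make the ord values concrete: order 1 is the sum of absolute values, order 2 is the Euclidean length, 'inf' is the largest absolute component, and '-inf' is the smallest. The sketch below applies these to each row of a small matrix held as nested lists; vector_norm and normalize_rows are hypothetical helpers, and rows with zero norm are not handled.

```python
import math


def vector_norm(vec, ord=2):
    """Norm of a vector for the orders listed above."""
    if ord == 1:
        return sum(abs(x) for x in vec)
    if ord == 2:
        return math.sqrt(sum(x * x for x in vec))
    if ord == "inf":
        return max(abs(x) for x in vec)
    if ord == "-inf":
        return min(abs(x) for x in vec)
    raise ValueError("unsupported order: %r" % (ord,))


def normalize_rows(matrix, ord=2):
    """Divide each row by its own norm (assumes no row has zero norm)."""
    normalized = []
    for row in matrix:
        norm = vector_norm(row, ord)
        normalized.append([x / norm for x in row])
    return normalized


rows = normalize_rows([[3.0, 4.0], [0.0, 2.0]], ord=2)
```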

save(fname)[source]¶: Save a pickled version of the embedding into fname.
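
The save and load pair amounts to a pickle round trip. The sketch below shows that pattern on a plain dictionary standing in for an embedding; the helper names and the stored structure are assumptions, not the actual on-disk layout. As with any pickle, only load files from a trusted source.

```python
import os
import pickle
import tempfile


def save_pickle(obj, fname):
    """Write a pickled copy of obj to fname."""
    with open(fname, "wb") as f:
        pickle.dump(obj, f)


def load_pickle(fname):
    """Read back an object written by save_pickle."""
    with open(fname, "rb") as f:
        return pickle.load(f)


# Hypothetical stand-in for an embedding: words plus their vectors.
embedding = {"words": ["the", "cat"], "vectors": [[0.1, 0.2], [0.3, 0.4]]}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "embedding.pkl")
    save_pickle(embedding, path)
    restored = load_pickle(path)
```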

shape¶

words¶

zero_vector()[source]¶: Returns a zero vector of embedding dimension.

polyglot.mapping.expansion module¶

class polyglot.mapping.expansion.CaseExpander(vocabulary, strategy='most_frequent')[source]¶: Bases: polyglot.mapping.expansion.VocabExpander

class polyglot.mapping.expansion.DigitExpander(vocabulary, strategy='most_frequent')[source]¶: Bases: polyglot.mapping.expansion.VocabExpander

class polyglot.mapping.expansion.VocabExpander(vocabulary, formatters, strategy)[source]¶

Bases: polyglot.mapping.base.OrderedVocabulary

approximate(w)[source]¶

approximate_ids(key)[source]¶

expand(formatters)[source]¶

format(w)[source]¶

Module contents¶

class polyglot.mapping.CountedVocabulary(word_count=None)[source]¶

Bases: polyglot.mapping.base.OrderedVocabulary

List of words and counts sorted according to word count.

classmethod from_textfile(textfile, workers=1, job_size=1000)[source]¶

Count the words that appear in a text file.

Parameters:

textfile (string) – The name of the text file, or a TextFile object.
min_count (integer) – Minimum number of times a word/token must appear in the document to be considered part of the vocabulary.
workers (integer) – Number of parallel workers that read the file simultaneously.
job_size (integer) – Size of the batch sent to each worker.
most_frequent (integer) – If no min_count is specified, consider the k most frequent words for the vocabulary.

Returns:

A vocabulary of the most frequent words that appear in the document.

static from_textfiles(files, workers=1, job_size=1000)[source]¶

static from_vocabfile(filename)[source]¶

Construct a CountedVocabulary out of a vocabulary file.

Note

The file has one entry per line, each a word followed by its count: "word1 count1" on the first line, "word2 count2" on the next.

getstate()[source]¶

min_count(n=1)[source]¶

Returns a vocabulary after eliminating the words that appear fewer than n times.

Parameters:	n (integer) – specifies the minimum word frequency allowed.

most_frequent(k)[source]¶

Returns a vocabulary with the most frequent k words.

Parameters:	k (integer) – specifies the top k most frequent words to be returned.

class polyglot.mapping.OrderedVocabulary(words=None)[source]¶

Bases: polyglot.mapping.base.VocabularyBase

An ordered list of words/tokens according to their frequency.

Note

The words are assumed to be sorted by frequency, with the most frequent words first in the list.

word_id¶: dictionary – Mapping from words to IDs.

id_word¶: dictionary – A reverse map of word_id.

most_frequent(k)[source]¶

Returns a vocabulary with the most frequent k words.

Parameters:	k (integer) – specifies the top k most frequent words to be returned.

class polyglot.mapping.VocabularyBase(words=None)[source]¶

Bases: object

A set of words/tokens that have consistent IDs.

Note

Words will be sorted according to their lexicographic order.

word_id¶: dictionary – Mapping from words to IDs.

id_word¶: dictionary – A reverse map of word_id.

classmethod from_vocabfile(filename)[source]¶

Construct a vocabulary out of a vocabulary file.

Note

The file has one word per line: word1 on the first line, word2 on the next.

get(k, default=None)[source]¶

getstate()[source]¶

sanitize_words(words)[source]¶: Guarantees that all textual symbols are unicode.

Note

We do not convert numbers, only strings to unicode. We assume that the strings are encoded in utf-8.

words¶: Ordered list of words according to their IDs.

class polyglot.mapping.Embedding(vocabulary, vectors)[source]¶

Bases: object

Maps a vocabulary to d-dimensional points.

apply_expansion(expansion)[source]¶: Apply a vocabulary expansion to the current embeddings.

distances(word, words)[source]¶

Calculate pairwise Euclidean distances between word and each entry of words.

Parameters:	word (string) – single word. words (list) – list of strings.
Returns:	numpy array of the distances.

Note

L2 metric is used to calculate distances.

static from_gensim(model)[source]¶

static from_glove(fname)[source]¶

static from_word2vec(fname, fvocab=None, binary=False)[source]¶

Load the input-hidden weight matrix from the original C word2vec-tool format.

Note that the information stored in the file is incomplete (the binary tree is missing), so while you can query for word similarity etc., you cannot continue training with a model loaded this way.

binary is a boolean indicating whether the data is in binary word2vec format. Word counts are read from fvocab filename, if set (this is the file generated by -save-vocab flag of the original C tool).

static from_word2vec_vocab(fvocab)[source]¶

get(k, default=None)[source]¶

static load(fname)[source]¶: Load an embedding dump generated by save.

most_frequent(k, inplace=False)[source]¶: Keep only the k most frequent words in the embeddings.

nearest_neighbors(word, top_k=10)[source]¶

Return the top_k words nearest to the given word.

Parameters:	word (string) – single word. top_k (integer) – decides how many neighbors to report.
Returns:	A list of words sorted by distance, closest first.

Note

L2 metric is used to calculate distances.

normalize_words(ord=2, inplace=False)[source]¶

Normalize embeddings matrix row-wise.

Parameters:	ord – normalization order. Possible values {1, 2, ‘inf’, ‘-inf’}

save(fname)[source]¶: Save a pickled version of the embedding into fname.

shape¶

words¶

zero_vector()[source]¶: Returns a zero vector of embedding dimension.

class polyglot.mapping.CaseExpander(vocabulary, strategy='most_frequent')[source]¶: Bases: polyglot.mapping.expansion.VocabExpander

class polyglot.mapping.DigitExpander(vocabulary, strategy='most_frequent')[source]¶: Bases: polyglot.mapping.expansion.VocabExpander