polyglot.mapping package¶
Subpackages¶
Submodules¶
polyglot.mapping.base module¶
Supports word embeddings.
class polyglot.mapping.base.CountedVocabulary(word_count=None)[source]¶
Bases: polyglot.mapping.base.OrderedVocabulary
List of words and counts sorted according to word count.
classmethod from_textfile(textfile, workers=1, job_size=1000)[source]¶
Count the set of words that appear in a text file.
Parameters: - textfile (string) – The name of the text file or a TextFile object.
- min_count (integer) – Minimum number of times a word/token must appear in the document to be considered part of the vocabulary.
- workers (integer) – Number of parallel workers that read the file simultaneously.
- job_size (integer) – Size of the batch sent to each worker.
- most_frequent (integer) – If no min_count is specified, consider only the most frequent k words for the vocabulary.
Returns: A vocabulary of the most frequent words that appear in the document.
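The counting and filtering behavior described above can be sketched with the standard library. This is a simplified, single-worker illustration (the `count_words` helper is hypothetical); `CountedVocabulary.from_textfile` itself distributes the work across parallel workers in batches:

```python
from collections import Counter

def count_words(lines, min_count=None, most_frequent=None):
    """Count tokens and keep them sorted by descending count,
    mimicking the min_count / most_frequent options documented above."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    items = counts.most_common()  # sorted by count, descending
    if min_count is not None:
        items = [(w, c) for w, c in items if c >= min_count]
    elif most_frequent is not None:
        items = items[:most_frequent]
    return items

# Example: count words in a tiny in-memory "file".
text = ["the cat sat", "the cat ran", "the dog"]
print(count_words(text, min_count=2))  # [('the', 3), ('cat', 2)]
```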
static from_vocabfile(filename)[source]¶
Construct a CountedVocabulary out of a vocabulary file.
Note
The file has the following format, one entry per line:
word1 count1
word2 count2
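The two-column format in the note can be parsed as follows (a minimal sketch with a hypothetical `parse_vocabfile` helper; the real `from_vocabfile` returns a `CountedVocabulary` instance):

```python
def parse_vocabfile(lines):
    """Parse 'word count' lines into (word, count) pairs."""
    pairs = []
    for line in lines:
        word, count = line.strip().split(maxsplit=1)
        pairs.append((word, int(count)))
    return pairs

# Example with in-memory file contents.
content = ["the 120", "cat 42"]
print(parse_vocabfile(content))  # [('the', 120), ('cat', 42)]
```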
class polyglot.mapping.base.OrderedVocabulary(words=None)[source]¶
Bases: polyglot.mapping.base.VocabularyBase
An ordered list of words/tokens according to their frequency.
Note
The word order is assumed to follow word frequency: the most frequent words appear first in the list.
word_id¶
dictionary – Mapping from words to IDs.
id_word¶
dictionary – A reverse map of word_id.
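A sketch of how the `word_id` and `id_word` attributes relate, assuming a frequency-sorted word list as the note above describes:

```python
# Words already sorted by descending frequency, per the note above.
words = ["the", "of", "cat"]

word_id = {w: i for i, w in enumerate(words)}   # word -> ID
id_word = {i: w for w, i in word_id.items()}    # reverse map: ID -> word

print(word_id["the"])  # 0: the most frequent word gets the lowest ID
print(id_word[2])      # 'cat'
```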
class polyglot.mapping.base.VocabularyBase(words=None)[source]¶
Bases: object
A set of words/tokens that have consistent IDs.
Note
Words will be sorted according to their lexicographic order.
word_id¶
dictionary – Mapping from words to IDs.
id_word¶
dictionary – A reverse map of word_id.
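For `VocabularyBase`, the note above says words are sorted lexicographically; a sketch of building consistent IDs under that assumption (the `build_ids` helper is hypothetical):

```python
def build_ids(words):
    """Assign IDs by lexicographic order, deduplicating first."""
    ordered = sorted(set(words))
    word_id = {w: i for i, w in enumerate(ordered)}
    id_word = {i: w for w, i in word_id.items()}
    return word_id, id_word

word_id, id_word = build_ids(["cat", "apple", "bee", "cat"])
print(word_id)  # {'apple': 0, 'bee': 1, 'cat': 2}
```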
classmethod from_vocabfile(filename)[source]¶
Construct a vocabulary out of a vocabulary file.
Note
The file has the following format, one word per line:
word1
word2
sanitize_words(words)[source]¶
Guarantees that all textual symbols are unicode.
Note
We convert only strings to unicode, not numbers. We assume the strings are encoded in utf-8.
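In Python 3 terms, the guarantee described above amounts to decoding any byte strings as utf-8 while leaving numbers untouched (a sketch with a hypothetical `sanitize` helper, not polyglot's implementation):

```python
def sanitize(words):
    """Decode utf-8 byte strings to str; leave everything else as-is."""
    return [w.decode("utf-8") if isinstance(w, bytes) else w for w in words]

print(sanitize([b"caf\xc3\xa9", "plain", 42]))  # ['café', 'plain', 42]
```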
words¶
Ordered list of words according to their IDs.
polyglot.mapping.embeddings module¶
Defines classes related to mapping vocabulary to n-dimensional points.
class polyglot.mapping.embeddings.Embedding(vocabulary, vectors)[source]¶
Bases: object
Maps a vocabulary to d-dimensional points.
distances(word, words)[source]¶
Calculate Euclidean pairwise distances between word and words.
Parameters: - word (string) – A single word.
- words (list) – A list of strings.
Returns: A numpy array of the distances.
Note
The L2 metric is used to calculate distances.
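The L2 distance computation described above can be sketched in plain Python (polyglot operates on a numpy matrix internally; the vectors here are hypothetical toy embeddings):

```python
import math

# Hypothetical 2-d embedding vectors for illustration.
vectors = {"cat": [1.0, 0.0], "dog": [0.0, 1.0], "car": [3.0, 4.0]}

def distances(word, words):
    """Euclidean (L2) distance from `word` to each word in `words`."""
    v = vectors[word]
    return [math.dist(v, vectors[w]) for w in words]

print(distances("cat", ["dog", "car"]))  # ≈ [1.414, 4.472]
```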
static from_word2vec(fname, fvocab=None, binary=False)[source]¶
Load the input-hidden weight matrix from the original C word2vec-tool format.
Note that the information stored in the file is incomplete (the binary tree is missing), so while you can query for word similarity etc., you cannot continue training with a model loaded this way.
binary is a boolean indicating whether the data is in the binary word2vec format. Word counts are read from the fvocab filename, if set (this is the file generated by the -save-vocab flag of the original C tool).
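The non-binary word2vec format read when binary=False begins with a "vocab_size dimension" header line, followed by one word and its vector per line. A minimal parsing sketch (the `parse_word2vec_text` helper is hypothetical; the real loader also handles the binary layout and builds an `Embedding`):

```python
def parse_word2vec_text(lines):
    """Parse the text word2vec format: a 'count dim' header,
    then 'word v1 v2 ... vdim' per line."""
    it = iter(lines)
    vocab_size, dim = map(int, next(it).split())
    vectors = {}
    for line in it:
        parts = line.rstrip().split(" ")
        word, vec = parts[0], [float(x) for x in parts[1:]]
        assert len(vec) == dim
        vectors[word] = vec
    assert len(vectors) == vocab_size
    return vectors

sample = ["2 3", "cat 0.1 0.2 0.3", "dog 0.4 0.5 0.6"]
print(parse_word2vec_text(sample)["dog"])  # [0.4, 0.5, 0.6]
```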
most_frequent(k, inplace=False)[source]¶
Keep only the most frequent k words in the embeddings.
nearest_neighbors(word, top_k=10)[source]¶
Return the nearest top_k words to the given word.
Parameters: - word (string) – A single word.
- top_k (integer) – How many neighbors to report.
Returns: A list of words sorted by distance; the closest word is first.
Note
The L2 metric is used to calculate distances.
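The k-nearest-neighbor lookup described here can be sketched by sorting the vocabulary by L2 distance (hypothetical toy vectors again; polyglot does this with vectorized numpy operations):

```python
import math

# Hypothetical toy embedding for illustration.
vectors = {"cat": [1.0, 0.0], "dog": [0.9, 0.1], "car": [0.0, 1.0]}

def nearest_neighbors(word, top_k=10):
    """Return up to top_k words sorted by L2 distance to `word`, closest first."""
    v = vectors[word]
    others = [w for w in vectors if w != word]
    others.sort(key=lambda w: math.dist(v, vectors[w]))
    return others[:top_k]

print(nearest_neighbors("cat", top_k=2))  # ['dog', 'car']
```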
normalize_words(ord=2, inplace=False)[source]¶
Normalize the embeddings matrix row-wise.
Parameters: ord – Normalization order. Possible values: {1, 2, 'inf', '-inf'}
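Row-wise normalization with the listed ord values can be sketched in plain Python (the `row_norm` and `normalize_rows` helpers are hypothetical; polyglot operates on a numpy matrix):

```python
def row_norm(row, ord=2):
    """Vector norm for ord in {1, 2, 'inf', '-inf'}."""
    if ord == 1:
        return sum(abs(x) for x in row)
    if ord == 2:
        return sum(x * x for x in row) ** 0.5
    if ord == "inf":
        return max(abs(x) for x in row)
    if ord == "-inf":
        return min(abs(x) for x in row)
    raise ValueError(ord)

def normalize_rows(matrix, ord=2):
    """Divide each row by its norm, as normalize_words does row-wise."""
    return [[x / row_norm(row, ord) for x in row] for row in matrix]

print(normalize_rows([[3.0, 4.0]]))  # [[0.6, 0.8]] under the L2 norm
```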
shape¶
words¶
polyglot.mapping.expansion module¶
Module contents¶
The polyglot.mapping package re-exports CountedVocabulary, OrderedVocabulary, VocabularyBase, and Embedding from its submodules; see the polyglot.mapping.base and polyglot.mapping.embeddings entries above for details.