Part of Speech Tagging¶

Part of speech (POS) tagging assigns every word or token in plain text a category that identifies the syntactic function of that occurrence.

Polyglot recognizes 17 parts of speech. This set is called the universal part of speech tag set:

ADJ: adjective
ADP: adposition
ADV: adverb
AUX: auxiliary verb
CONJ: coordinating conjunction
DET: determiner
INTJ: interjection
NOUN: noun
NUM: numeral
PART: particle
PRON: pronoun
PROPN: proper noun
PUNCT: punctuation
SCONJ: subordinating conjunction
SYM: symbol
VERB: verb
X: other
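
When post-processing tagger output, it can be convenient to keep the tag set above in a lookup table. The following is a minimal sketch; the names `UNIVERSAL_POS_TAGS` and `describe_tag` are illustrative assumptions and are not part of the Polyglot API.

```python
# Illustrative lookup table for the 17 universal POS tags listed above.
# UNIVERSAL_POS_TAGS and describe_tag are hypothetical names, not Polyglot API.
UNIVERSAL_POS_TAGS = {
    "ADJ": "adjective",
    "ADP": "adposition",
    "ADV": "adverb",
    "AUX": "auxiliary verb",
    "CONJ": "coordinating conjunction",
    "DET": "determiner",
    "INTJ": "interjection",
    "NOUN": "noun",
    "NUM": "numeral",
    "PART": "particle",
    "PRON": "pronoun",
    "PROPN": "proper noun",
    "PUNCT": "punctuation",
    "SCONJ": "subordinating conjunction",
    "SYM": "symbol",
    "VERB": "verb",
    "X": "other",
}


def describe_tag(tag):
    """Return a readable name for a universal POS tag, or a fallback string."""
    return UNIVERSAL_POS_TAGS.get(tag, "unknown tag")


print(len(UNIVERSAL_POS_TAGS))
print(describe_tag("PROPN"))
```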

Languages Coverage¶

The models were trained on a combination of:

Original CoNLL datasets after the tags were converted using the universal POS tables.
Universal Dependencies 1.0 corpora whenever they are available.

from polyglot.downloader import downloader
print(downloader.supported_languages_table("pos2"))

 1. German                     2. Italian                    3. Danish
 4. Czech                      5. Slovene                    6. French
 7. English                    8. Swedish                    9. Bulgarian
10. Spanish; Castilian        11. Indonesian                12. Portuguese
13. Finnish                   14. Irish                     15. Hungarian
16. Dutch

Download Necessary Models¶

%%bash
polyglot download embeddings2.en pos2.en

[polyglot_data] Downloading package embeddings2.en to
[polyglot_data]     /home/rmyeid/polyglot_data...
[polyglot_data]   Package embeddings2.en is already up-to-date!
[polyglot_data] Downloading package pos2.en to
[polyglot_data]     /home/rmyeid/polyglot_data...
[polyglot_data]   Package pos2.en is already up-to-date!

Example¶

We tag each word in the text with one part of speech.

from polyglot.text import Text

blob = """We will meet at eight o'clock on Thursday morning."""
text = Text(blob)

We can query all the tagged words:

text.pos_tags

[(u'We', u'PRON'),
 (u'will', u'AUX'),
 (u'meet', u'VERB'),
 (u'at', u'ADP'),
 (u'eight', u'NUM'),
 (u"o'clock", u'NOUN'),
 (u'on', u'ADP'),
 (u'Thursday', u'PROPN'),
 (u'morning', u'NOUN'),
 (u'.', u'PUNCT')]
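
Because the result is a plain list of (word, tag) pairs, it can be processed with ordinary Python. The sketch below groups words by tag; it copies the pairs from the output above as literal strings so that it runs without Polyglot installed, and `group_by_tag` is a hypothetical helper, not a library function.

```python
from collections import defaultdict

# Pairs copied from the example output above, so no model download is needed.
tagged = [
    ("We", "PRON"), ("will", "AUX"), ("meet", "VERB"), ("at", "ADP"),
    ("eight", "NUM"), ("o'clock", "NOUN"), ("on", "ADP"),
    ("Thursday", "PROPN"), ("morning", "NOUN"), (".", "PUNCT"),
]


def group_by_tag(pairs):
    """Map each POS tag to the words carrying it, in order of appearance."""
    groups = defaultdict(list)
    for word, tag in pairs:
        groups[tag].append(word)
    return dict(groups)


groups = group_by_tag(tagged)
print(groups["NOUN"])
print(groups["ADP"])
```

With a real `Text` object, the same helper could presumably be called as `group_by_tag(text.pos_tags)`.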

After the pos_tags property has been accessed once, the word objects carry their POS tags.

text.words[0].pos_tag

u'PRON'

!polyglot --lang en tokenize --input testdata/cricket.txt |  polyglot --lang en pos | tail -n 30

which           DET
India           PROPN
beat            VERB
Bermuda         PROPN
in              ADP
Port            PROPN
of              ADP
Spain           PROPN
in              ADP
2007            NUM
,               PUNCT
which           DET
was             AUX
equalled        VERB
five            NUM
days            NOUN
ago             ADV
by              ADP
South           PROPN
Africa          PROPN
in              ADP
their           PRON
victory         NOUN
over            ADP
West            PROPN
Indies          PROPN
in              ADP
Sydney          PROPN
.               PUNCT
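
The command-line output above can be read back into Python for further analysis. This is a minimal sketch that assumes each line holds a token followed by whitespace and a tag, as in the listing shown; `parse_tagged_lines` is a hypothetical helper and the sample text is the first few lines of that listing.

```python
from collections import Counter

# First lines of the listing above, used here as sample input.
sample_output = """\
which           DET
India           PROPN
beat            VERB
Bermuda         PROPN
in              ADP
"""


def parse_tagged_lines(text):
    """Split each non-empty line into a (token, tag) pair on the last whitespace run."""
    pairs = []
    for line in text.splitlines():
        parts = line.rsplit(None, 1)
        if len(parts) == 2:  # skip blank or malformed lines
            pairs.append((parts[0], parts[1]))
    return pairs


pairs = parse_tagged_lines(sample_output)
tag_counts = Counter(tag for _, tag in pairs)
print(tag_counts.most_common(1))
```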

This work is a direct implementation of the research described in the paper Polyglot: Distributed Word Representations for Multilingual NLP. The author of this library strongly encourages you to cite the following paper if you use this software.

@InProceedings{polyglot:2013:ACL-CoNLL,
  author    = {Al-Rfou, Rami  and  Perozzi, Bryan  and  Skiena, Steven},
  title     = {Polyglot: Distributed Word Representations for Multilingual NLP},
  booktitle = {Proceedings of the Seventeenth Conference on Computational Natural Language Learning},
  month     = {August},
  year      = {2013},
  address   = {Sofia, Bulgaria},
  publisher = {Association for Computational Linguistics},
  pages     = {183--192},
  url       = {http://www.aclweb.org/anthology/W13-3520}
}
