Named Entity Extraction

Named entity extraction task aims to extract phrases from plain text that correpond to entities. Polyglot recognizes 3 categories of entities:

  • Locations (Tag: I-LOC): cities, countries, regions, continents, neighborhoods, administrative divisions ...
  • Organizations (Tag: I-ORG): sports teams, newspapers, banks, universities, schools, non-profits, companies, ...
  • Persons (Tag: I-PER): politicians, scientists, artists, atheletes ...

Languages Coverage

The models were trained on datasets extracted automatically from Wikipedia. Polyglot currently supports 40 major languages.

from polyglot.downloader import downloader
print(downloader.supported_languages_table("ner2", 3))
 1. Polish                     2. Turkish                    3. Russian
 4. Indonesian                 5. Czech                      6. Arabic
 7. Korean                     8. Catalan; Valencian         9. Italian
10. Thai                      11. Romanian, Moldavian, ...  12. Tagalog
13. Danish                    14. Finnish                   15. German
16. Persian                   17. Dutch                     18. Chinese
19. French                    20. Portuguese                21. Slovak
22. Hebrew (modern)           23. Malay                     24. Slovene
25. Bulgarian                 26. Hindi                     27. Japanese
28. Hungarian                 29. Croatian                  30. Ukrainian
31. Serbian                   32. Lithuanian                33. Norwegian
34. Latvian                   35. Swedish                   36. English
37. Greek, Modern             38. Spanish; Castilian        39. Vietnamese
40. Estonian

Download Necessary Models

%%bash
polyglot download embeddings2.en ner2.en
[polyglot_data] Downloading package embeddings2.en to
[polyglot_data]     /home/rmyeid/polyglot_data...
[polyglot_data]   Package embeddings2.en is already up-to-date!
[polyglot_data] Downloading package ner2.en to
[polyglot_data]     /home/rmyeid/polyglot_data...
[polyglot_data]   Package ner2.en is already up-to-date!

Example

Entities inside a text object or a sentence are represented as chunks. Each chunk identifies the start and the end indices of the word subsequence within the text.

from polyglot.text import Text
blob = """The Israeli Prime Minister Benjamin Netanyahu has warned that Iran poses a "threat to the entire world"."""
text = Text(blob)

We can query all entities mentioned in a text.

text.entities
[I-ORG([u'Israeli']), I-PER([u'Benjamin', u'Netanyahu']), I-LOC([u'Iran'])]

Or, we can query entites per sentence

for sent in text.sentences:
  print(sent, "\n")
  for entity in sent.entities:
    print(entity.tag, entity)
The Israeli Prime Minister Benjamin Netanyahu has warned that Iran poses a "threat to the entire world".

I-ORG [u'Israeli']
I-PER [u'Benjamin', u'Netanyahu']
I-LOC [u'Iran']

By doing more careful inspection of the second entity Benjamin Netanyahu, we can locate the position of the entity within the sentence.

benjamin = sent.entities[1]
sent.words[benjamin.start: benjamin.end]
WordList([u'Benjamin', u'Netanyahu'])
!polyglot --lang en tokenize --input testdata/cricket.txt |  polyglot --lang en ner | tail -n 20
,               O
which           O
was             O
equalled        O
five            O
days            O
ago             O
by              O
South           I-LOC
Africa          I-LOC
in              O
their           O
victory         O
over            O
West            I-ORG
Indies          I-ORG
in              O
Sydney          I-LOC
.               O

Demo

This work is a direct implementation of the research being described in the Polyglot-NER: Multilingual Named Entity Recognition paper. The author of this library strongly encourage you to cite the following paper if you are using this software.

@article{polyglotner,
        author = {Al-Rfou, Rami and Kulkarni, Vivek and Perozzi, Bryan and Skiena, Steven},
        title = {{Polyglot-NER}: Massive Multilingual Named Entity Recognition},
        journal = {{Proceedings of the 2015 {SIAM} International Conference on Data Mining, Vancouver, British Columbia, Canada, April 30 - May 2, 2015}},
        month     = {April},
        year      = {2015},
        publisher = {SIAM}
}