Morphological Analysis

Polyglot offers trained Morfessor models that generate morphemes from words. The goal of the Morpho project is to develop unsupervised, data-driven methods that discover the regularities behind word formation in natural languages. In particular, the Morpho project focuses on the discovery of morphemes, which are the primitive units of syntax and the smallest individually meaningful elements in the utterances of a language. Morphemes are important in the automatic generation and recognition of a language, especially in languages in which words may have many different inflected forms.

Languages Coverage

Using polyglot vocabulary dictionaries, we trained Morfessor models on the 50,000 most frequent words of each language.
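The vocabulary pipeline itself is not shown here, so the following is only a sketch of what "most frequent words" selection can look like, assuming a plain list of tokens as input. The helper name `most_frequent_words` is hypothetical and not part of polyglot.

```python
from collections import Counter


def most_frequent_words(tokens, limit=50000):
    """Return up to `limit` distinct words, ordered from most to least frequent."""
    counts = Counter(tokens)
    # most_common() sorts by count in descending order and truncates to `limit`.
    return [word for word, _ in counts.most_common(limit)]


# Tiny illustrative corpus; a real vocabulary would come from a large text collection.
sample_tokens = "the cat saw the dog and the dog barked".split()
print(most_frequent_words(sample_tokens, limit=2))
```

A list like this would then serve as the training vocabulary for a segmentation model.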

from polyglot.downloader import downloader
print(downloader.supported_languages_table("morph2"))

  1. Piedmontese language       2. Lombard language           3. Gan Chinese
  4. Sicilian                   5. Scots                      6. Kirghiz, Kyrgyz
  7. Pashto, Pushto             8. Kurdish                    9. Portuguese
 10. Kannada                   11. Korean                    12. Khmer
 13. Kazakh                    14. Ilokano                   15. Polish
 16. Panjabi, Punjabi          17. Georgian                  18. Chuvash
 19. Alemannic                 20. Czech                     21. Welsh
 22. Chechen                   23. Catalan; Valencian        24. Northern Sami
 25. Sanskrit (Saṁskṛta)       26. Slovene                   27. Javanese
 28. Slovak                    29. Bosnian-Croatian-Serbian  30. Bavarian
 31. Swedish                   32. Swahili                   33. Sundanese
 34. Serbian                   35. Albanian                  36. Japanese
 37. Western Frisian           38. French                    39. Finnish
 40. Upper Sorbian             41. Faroese                   42. Persian
 43. Sinhala, Sinhalese        44. Italian                   45. Amharic
 46. Aragonese                 47. Volapük                   48. Icelandic
 49. Sakha                     50. Afrikaans                 51. Indonesian
 52. Interlingua               53. Azerbaijani               54. Ido
 55. Arabic                    56. Assamese                  57. Yoruba
 58. Yiddish                   59. Waray-Waray               60. Croatian
 61. Hungarian                 62. Haitian; Haitian Creole   63. Quechua
 64. Armenian                  65. Hebrew (modern)           66. Silesian
 67. Hindi                     68. Divehi; Dhivehi; Mald...  69. German
 70. Danish                    71. Occitan                   72. Tagalog
 73. Turkmen                   74. Thai                      75. Tajik
 76. Greek, Modern             77. Telugu                    78. Tamil
 79. Oriya                     80. Ossetian, Ossetic         81. Tatar
 82. Turkish                   83. Kapampangan               84. Venetian
 85. Manx                      86. Gujarati                  87. Galician
 88. Irish                     89. Scottish Gaelic; Gaelic   90. Nepali
 91. Cebuano                   92. Zazaki                    93. Walloon
 94. Dutch                     95. Norwegian                 96. Norwegian Nynorsk
 97. West Flemish              98. Chinese                   99. Bosnian
100. Breton                   101. Belarusian               102. Bulgarian
103. Bashkir                  104. Egyptian Arabic          105. Tibetan Standard, Tib...
106. Bengali                  107. Burmese                  108. Romansh
109. Marathi (Marāṭhī)        110. Malay                    111. Maltese
112. Russian                  113. Macedonian               114. Malayalam
115. Mongolian                116. Malagasy                 117. Vietnamese
118. Spanish; Castilian       119. Estonian                 120. Basque
121. Bishnupriya Manipuri     122. Asturian                 123. English
124. Esperanto                125. Luxembourgish, Letzeb... 126. Latin
127. Uighur, Uyghur           128. Ukrainian                129. Limburgish, Limburgan...
130. Latvian                  131. Urdu                     132. Lithuanian
133. Fiji Hindi               134. Uzbek                    135. Romanian, Moldavian, ...

Download Necessary Models

%%bash
polyglot download morph2.en morph2.ar

[polyglot_data] Downloading package morph2.en to
[polyglot_data]     /home/rmyeid/polyglot_data...
[polyglot_data]   Package morph2.en is already up-to-date!
[polyglot_data] Downloading package morph2.ar to
[polyglot_data]     /home/rmyeid/polyglot_data...
[polyglot_data]   Package morph2.ar is already up-to-date!
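The same download can be scripted from Python. The sketch below follows the `morph2.<language code>` naming seen in the command above and assumes that `downloader.download` accepts such a package identifier. `morph_package_names` and `download_morph_models` are hypothetical helper names introduced only for illustration.

```python
def morph_package_names(language_codes):
    """Build package identifiers following the morph2.<code> pattern."""
    return ["morph2.{}".format(code) for code in language_codes]


def download_morph_models(language_codes):
    """Download each morphology package; requires polyglot to be installed."""
    # Imported lazily so the naming helper above works without polyglot.
    from polyglot.downloader import downloader

    for package in morph_package_names(language_codes):
        # Assumption: download() takes the package identifier as its first argument.
        downloader.download(package)


print(morph_package_names(["en", "ar"]))
```

Calling `download_morph_models(["en", "ar"])` would then be the scripted counterpart of the shell command, provided the assumption about `download` holds.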

Example

from polyglot.text import Text, Word

words = ["preprocessing", "processor", "invaluable", "thankful", "crossed"]
for raw_word in words:
    # Wrap each string in a Word so the English morphology model is applied.
    word = Word(raw_word, language="en")
    print("{:<20}{}".format(word, word.morphemes))

preprocessing       ['pre', 'process', 'ing']
processor           ['process', 'or']
invaluable          ['in', 'valuable']
thankful            ['thank', 'ful']
crossed             ['cross', 'ed']
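One quick sanity check on output like this: the segments shown are contiguous pieces of the surface form, so joining them should give back the original word. Here is a minimal sketch that uses the output above as plain data and does not call polyglot. The helper name `reconstructs` is illustrative and not part of the library.

```python
def reconstructs(word, morphemes):
    """Return True when the morphemes concatenate back to the original word."""
    return "".join(morphemes) == word


# Segmentations copied from the example output above.
segmentations = {
    "preprocessing": ["pre", "process", "ing"],
    "processor": ["process", "or"],
    "invaluable": ["in", "valuable"],
    "thankful": ["thank", "ful"],
    "crossed": ["cross", "ed"],
}

for word, morphemes in segmentations.items():
    print("{:<20}{}".format(word, reconstructs(word, morphemes)))
```

A check of this kind can flag encoding or normalization problems when segmenting text in scripts other than Latin.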

If the text is not tokenized properly, morphological analysis can offer a smart way of splitting it into its original units. Here is an example:

blob = "Wewillmeettoday."
text = Text(blob)
text.language = "en"

text.morphemes

WordList([u'We', u'will', u'meet', u'to', u'day', u'.'])

!polyglot --lang en tokenize --input testdata/cricket.txt |  polyglot --lang en morph | tail -n 30

which           which
India           In_dia
beat            beat
Bermuda         Ber_mud_a
in              in
Port            Port
of              of
Spain           Spa_in
in              in
2007            2007
,               ,
which           which
was             wa_s
equalled        equal_led
five            five
days            day_s
ago             ago
by              by
South           South
Africa          Africa
in              in
their           t_heir
victory         victor_y
over            over
West            West
Indies          In_dies
in              in
Sydney          Syd_ney
.               .
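The two-column layout above can be reproduced from Python once a word has been segmented. The sketch below assumes a 16-character word column and `_` as the morpheme separator. Both are inferred from the printed output, not taken from polyglot's source, and `format_segmentation` is a hypothetical helper.

```python
def format_segmentation(word, morphemes, width=16, separator="_"):
    """Left-align the word in a fixed-width column, then append its joined morphemes."""
    return "{:<{width}}{}".format(word, separator.join(morphemes), width=width)


# Segmentations taken from the command-line output above.
rows = [
    ("India", ["In", "dia"]),
    ("days", ["day", "s"]),
    ("over", ["over"]),
]
for word, morphemes in rows:
    print(format_segmentation(word, morphemes))
```

With polyglot installed, the second argument could instead come from `Word(word, language="en").morphemes`, as in the earlier example.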

Demo

This demo does not reflect the models supplied by polyglot; however, we think it is indicative of what you should expect from Morfessor.

Demo

This is an interface to the implementation described in the technical report "Morfessor 2.0: Python Implementation and Extensions for Morfessor Baseline".

@InProceedings{morfessor2,
  title     = {Morfessor 2.0: Python Implementation and Extensions for Morfessor Baseline},
  author    = {Virpioja, Sami and Smit, Peter and Grönroos, Stig-Arne and Kurimo, Mikko},
  year      = {2013},
  publisher = {Department of Signal Processing and Acoustics, Aalto University},
  booktitle = {Aalto University publication series}
}
