Tokenization is the process of identifying the boundaries of words and sentences in text. We can identify sentence boundaries first and then tokenize each sentence into the words that compose it. Alternatively, we can tokenize words first and then segment the token sequence into sentences. Tokenization in polyglot relies on the Unicode Text Segmentation algorithm as implemented by the ICU Project.
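To make the two orderings concrete, here is a minimal sketch in plain Python. The regular expressions and the function names (split_sentences, split_words, sentences_then_words, words_then_sentences) are hypothetical stand-ins chosen for illustration; they are far cruder than the ICU rules polyglot actually relies on and only handle simple space-delimited text.

```python
import re

# Naive stand-ins for illustration only. polyglot delegates to ICU's
# Unicode Text Segmentation rules, which these regexes do not reproduce.
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")
WORD_OR_PUNCT = re.compile(r"\w+|[^\w\s]")


def split_sentences(text):
    """Split on whitespace that follows sentence-final punctuation."""
    return [s for s in SENTENCE_END.split(text.strip()) if s]


def split_words(sentence):
    """Return runs of word characters and single punctuation marks."""
    return WORD_OR_PUNCT.findall(sentence)


def sentences_then_words(text):
    """Order 1: find sentence boundaries, then tokenize each sentence."""
    return [split_words(s) for s in split_sentences(text)]


def words_then_sentences(text):
    """Order 2: tokenize everything, then cut the token sequence after . ! ?"""
    sentences, current = [], []
    for token in split_words(text):
        current.append(token)
        if token in {".", "!", "?"}:
            sentences.append(current)
            current = []
    if current:
        # Keep trailing tokens that were not closed by punctuation.
        sentences.append(current)
    return sentences


sample = "Australia won. Afghanistan were dismissed for 142!"
print(sentences_then_words(sample))
print(words_then_sentences(sample))
```

On this simple input both orderings produce the same nested lists. On real text they can diverge, which is one reason to rely on a tested segmentation library rather than hand-written rules.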

You can use the C/C++ ICU library by installing the required package, libicu-dev. For example, on Ubuntu/Debian systems, use the apt-get utility as follows:

sudo apt-get install libicu-dev

from polyglot.text import Text

Word Tokenization

To call our word tokenizer, first we need to construct a Text object.

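blob = u"""
两个月前遭受恐怖袭击的法国巴黎的犹太超市在装修之后周日重新开放,法国内政部长以及超市的管理者都表示,这显示了生命力要比野蛮行为更强大。
该超市1月9日遭受枪手袭击,导致4人死亡,据悉这起事件与法国《查理周刊》杂志社恐怖袭击案有关。
"""
text = Text(blob)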

The words property calls the word tokenizer.

text.words

WordList(['两', '个', '月', '前', '遭受', '恐怖', '袭击', '的', '法国', '巴黎', '的', '犹太', '超市', '在', '装修', '之后', '周日', '重新', '开放', ',', '法国', '内政', '部长', '以及', '超市', '的', '管理者', '都', '表示', ',', '这', '显示', '了', '生命力', '要', '比', '野蛮', '行为', '更', '强大', '。', '该', '超市', '1', '月', '9', '日', '遭受', '枪手', '袭击', ',', '导致', '4', '人', '死亡', ',', '据悉', '这', '起', '事件', '与', '法国', '《', '查理', '周刊', '》', '杂志', '社', '恐怖', '袭击', '案', '有关', '。'])

Since ICU's boundary-break algorithms are language-aware, polyglot detects the language of the text before calling the tokenizer.

print(text.language)

name:             code: zh       confidence:  99.0 read bytes:  1920

Sentence Segmentation

If we want to segment the text into sentences first, we can query the sentences property.

text.sentences

The Sentence class inherits from Text; therefore, we can tokenize each sentence into words using the same words property.

first_sentence = text.sentences[0]
first_sentence.words
WordList(['两', '个', '月', '前', '遭受', '恐怖', '袭击', '的', '法国', '巴黎', '的', '犹太', '超市', '在', '装修', '之后', '周日', '重新', '开放', ',', '法国', '内政', '部长', '以及', '超市', '的', '管理者', '都', '表示', ',', '这', '显示', '了', '生命力', '要', '比', '野蛮', '行为', '更', '强大', '。'])
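Because every Sentence exposes words, the two properties can be combined to get the words of each sentence in one pass. The following is a small sketch under that assumption; the helper names words_by_sentence and demo are hypothetical and not part of polyglot. The helper only needs an object with a sentences sequence whose items have a words sequence.

```python
def words_by_sentence(text):
    """Return one list of word strings per sentence.

    `text` is expected to behave like polyglot's Text: a `sentences`
    sequence whose items each expose a `words` sequence.
    """
    return [[str(word) for word in sentence.words] for sentence in text.sentences]


def demo(blob):
    """Print the words of each sentence in `blob` (requires polyglot)."""
    # Imported here so the helper above stays usable without polyglot installed.
    from polyglot.text import Text

    text = Text(blob)
    for index, words in enumerate(words_by_sentence(text)):
        print(index, words)
```

Calling demo with any sufficiently long string, such as the Chinese passage used in this section, should print one numbered word list per detected sentence.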

Command Line

By default, the tokenize subcommand performs both sentence segmentation and word tokenization.

!polyglot tokenize --help
usage: polyglot tokenize [-h] [--only-sent | --only-word] [--input [INPUT [INPUT ...]]]

optional arguments:
  -h, --help            show this help message and exit
  --only-sent           Segment sentences without word tokenization
  --only-word           Tokenize words without sentence segmentation
  --input [INPUT [INPUT ...]]
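If you prefer to drive the command-line tool from a script, the options listed in the help text can be assembled programmatically. This is only a sketch: build_tokenize_command and run_tokenize are hypothetical helper names, the flags are the ones shown in the help output, and the placement of --lang before the subcommand mirrors the example invocation in this section. It assumes the polyglot executable is on your PATH.

```python
import subprocess


def build_tokenize_command(paths, lang=None, only=None):
    """Build the argument list for `polyglot tokenize`.

    paths: input files passed to --input.
    lang:  optional language code passed via the global --lang option.
    only:  None for the default (sentences and words), "sent" for
           --only-sent, or "word" for --only-word.
    """
    command = ["polyglot"]
    if lang:
        command += ["--lang", lang]
    command.append("tokenize")
    if only == "sent":
        command.append("--only-sent")
    elif only == "word":
        command.append("--only-word")
    elif only is not None:
        raise ValueError("only must be None, 'sent', or 'word'")
    command += ["--input", *paths]
    return command


def run_tokenize(paths, **options):
    """Run the command and return its standard output as text."""
    result = subprocess.run(
        build_tokenize_command(paths, **options),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit status
    )
    return result.stdout
```

Keeping command construction separate from execution makes it easy to inspect the exact arguments before anything is run.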

Each line of the output represents one sentence, with its words separated by spaces.

!polyglot --lang en tokenize --input testdata/cricket.txt
Australia posted a World Cup record total of 417 - 6 as they beat Afghanistan by 275 runs .
David Warner hit 178 off 133 balls , Steve Smith scored 95 while Glenn Maxwell struck 88 in 39 deliveries in the Pool A encounter in Perth .
Afghanistan were then dismissed for 142 , with Mitchell Johnson and Mitchell Starc taking six wickets between them .
Australia's score surpassed the 413 - 5 India made against Bermuda in 2007 .
It continues the pattern of bat dominating ball in this tournament as the third 400 plus score achieved in the pool stages , following South Africa's 408 - 5 and 411 - 4 against West Indies and Ireland respectively .
The winning margin beats the 257 - run amount by which India beat Bermuda in Port of Spain in 2007 , which was equalled five days ago by South Africa in their victory over West Indies in Sydney .
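Since the output format is one sentence per line with space-separated tokens, it can be read back into nested lists with a few lines of Python. The function name parse_tokenized_output is a hypothetical helper for illustration; the sample string reuses two lines from the output above.

```python
def parse_tokenized_output(output):
    """Convert 'one sentence per line, tokens split by spaces' into lists."""
    return [line.split() for line in output.splitlines() if line.strip()]


sample_output = (
    "Afghanistan were then dismissed for 142 , with Mitchell Johnson and "
    "Mitchell Starc taking six wickets between them .\n"
    "Australia's score surpassed the 413 - 5 India made against Bermuda in 2007 .\n"
)

for tokens in parse_tokenized_output(sample_output):
    print(len(tokens), tokens)
```

Note that splitting on spaces is only safe here because the tokenizer has already separated punctuation from words; applying it to raw text would leave commas and periods attached to neighboring words.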