Ukrainian tokenization bug: words with internal apostrophe #1265

glebm · 2022-04-06T10:41:53Z

The apostrophe (') is a normal letter in Ukrainian (https://en.wikipedia.org/wiki/Rules_for_using_the_apostrophe_in_the_Ukrainian_language)

Example word: прив'язана

The tokenizer used by fastText appears to split this single word into three tokens: ["прив", "'", "язана"]


whysage · 2022-04-18T06:41:24Z

Hi, @glebm
Can't reproduce.

test.py

import fasttext
model = fasttext.train_unsupervised('data.txt', model='skipgram', minCount=1)
print(model.words)

data.txt

Перші археоантропи на території сучасної України з'явилися в епоху раннього палеоліту, понад 900—800 тис. років тому.
Слово прив'язана - як приклад.
1199 р. Роман Великий об'єднав Галичину і Волинь у єдину Галицько-Волинську державу.

out

Read 0M words
Number of words: 34
Number of labels: 0
Progress: 100.0% words/sec/thread: 69646 lr: 0.000000 avg.loss: 4.149282 ETA: 0h 0m 0s
['</s>', 'Великий', "прив'язана", '-', 'як', 'приклад.', '1199', 'р.', 'Роман', 'Слово', "об'єднав", 'Галичину', 'і', 'Волинь', 'у', 'єдину', 'Галицько-Волинську', 'державу.', 'Перші', 'тому.', 'років', 'тис.', '900—800', 'понад', 'палеоліту,', 'раннього', 'епоху', 'в', "з'явилися", 'України', 'сучасної', 'території', 'на', 'археоантропи']
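
For context, here is a minimal sketch of why these words come through intact, assuming fastText splits raw text only on ASCII whitespace (my reading of its documentation, not checked against the C++ source). The helper name `whitespace_tokens` is hypothetical, and this is an approximation rather than fastText's own code:

```python
import re
from typing import List

# Assumed separator set: space, newline, carriage return, tab,
# vertical tab, form feed and the null character.
_SEPARATORS = re.compile(r"[ \n\r\t\v\f\0]+")


def whitespace_tokens(line: str) -> List[str]:
    """Split a line on whitespace only; punctuation stays attached to words."""
    return [tok for tok in _SEPARATORS.split(line) if tok]


tokens = whitespace_tokens("Слово прив'язана - як приклад.")
print(tokens)
```

Under that assumption the apostrophe is never a split point, which is consistent with "прив'язана" and 'приклад.' appearing as single vocabulary entries above.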

glebm · 2022-04-18T08:29:42Z

Looking at the pretrained Ukrainian embeddings -- there are no words with an internal ' in them. Perhaps the published pretrained embeddings were trained with an older/different tokenizer?

wget https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.uk.300.vec.gz
gunzip cc.uk.300.vec.gz
grep "'" cc.uk.300.vec

' -0.0292 0.0219 0.3533 ...
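
The grep above matches any line containing an apostrophe, including the stand-alone ' token. A slightly stricter check, sketched below, lists only vocabulary entries that have an apostrophe inside the word. It assumes the usual `.vec` text layout (a header line with vocabulary size and dimension, then one `word v1 v2 ...` line per entry); the function name is hypothetical:

```python
from typing import Iterable, Iterator


def words_with_internal_apostrophe(lines: Iterable[str]) -> Iterator[str]:
    """Yield vocabulary words that contain an apostrophe not at either edge."""
    it = iter(lines)
    next(it, None)  # skip the "<vocab_size> <dim>" header line
    for line in it:
        word = line.split(" ", 1)[0]
        # Dropping leading/trailing apostrophes leaves only internal ones.
        if "'" in word.strip("'"):
            yield word


# Tiny in-memory sample in the assumed .vec layout (values are made up).
sample = [
    "3 2\n",
    "' 0.1 0.2\n",
    "зявилися 0.1 0.2\n",
    "прив'язана 0.3 0.4\n",
]
print(list(words_with_internal_apostrophe(sample)))
```

To run it on the real file, pass an open file handle, for example `open("cc.uk.300.vec", encoding="utf-8")`; the file is read line by line, so it does not need to fit in memory.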

whysage · 2022-04-18T08:51:35Z

As far as I can see, apostrophes are simply omitted:

grep "зобовязання" cc.uk.300.vec

зобовязання ...
зобовязаннями ...

grep "з'явилися" cc.uk.300.vec

grep "зявилися" cc.uk.300.vec
зявилися 0.0721 -0.0460 0.0400 -0.0369 ....

I can't think of any Ukrainian words whose meaning changes depending on whether the apostrophe is present.

So maybe you can just remove apostrophes from your text in a preprocessing step.
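
A minimal sketch of such a preprocessing step, assuming the goal is to match the pretrained vocabulary, where "зявилися" appears without the apostrophe. The function name and the set of apostrophe characters (ASCII ', U+2019 and U+02BC) are my assumptions; extend the set if your text uses other variants:

```python
import re

# Characters assumed to be used as the Ukrainian apostrophe in the input.
APOSTROPHES = "'\u2019\u02bc"

# An apostrophe with a letter directly before and after it (word-internal).
# [^\W\d_] matches any Unicode letter in Python's re module.
_INTERNAL = re.compile(rf"(?<=[^\W\d_])[{APOSTROPHES}](?=[^\W\d_])")


def strip_internal_apostrophes(text: str) -> str:
    """Remove apostrophes that sit between two letters; leave others alone."""
    return _INTERNAL.sub("", text)


print(strip_internal_apostrophes("Слово прив'язана - як приклад."))
```

Only word-internal apostrophes are removed, so quotation marks at the edges of a word are left alone. Note that this also rewrites English contractions such as "don't", so apply it only to Ukrainian text.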

P.S. #StandWithUkraine

glebm · 2022-04-18T10:53:07Z

Ah, that makes sense. Do you know where the apostrophe-omitting code is?
Also, perhaps this caveat should be documented? Thanks!

#StandWithUkraine!

whysage · 2022-04-18T13:49:34Z

Do you know where the apostrophe-omitting code is?

Maybe it is outside fastText.

https://github.com/facebookresearch/fastText/blob/main/docs/crawl-vectors.md

Tokenization
We used the Stanford word segmenter for Chinese, Mecab for Japanese and UETsegmenter for Vietnamese. For languages using the Latin, Cyrillic, Hebrew or Greek scripts, we used the tokenizer from the Europarl preprocessing tools. For the remaining languages, we used the ICU tokenizer.

Also, perhaps this caveat should be documented?

I created pull request #1268
Maybe some day it will be merged.

glebm changed the title ~~Ukrainian tokenization splits words with internal apostrophe~~ Ukrainian tokenization bug: words with internal apostrophe Apr 6, 2022

whysage mentioned this issue Apr 18, 2022

Issue: 1265 Doc about apostrophes in Ukrainian language #1268

Open
