This repository has been archived by the owner on Mar 19, 2024. It is now read-only.

Ukrainian tokenization bug: words with internal apostrophe #1265

Open
glebm opened this issue Apr 6, 2022 · 5 comments

glebm commented Apr 6, 2022

The apostrophe (') is a regular letter in Ukrainian (https://en.wikipedia.org/wiki/Rules_for_using_the_apostrophe_in_the_Ukrainian_language).

Example word: прив'язана

The tokenizer used by fastText appears to split this single word in 3: ["прив", "'", "язана"]
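
To make the reported behaviour concrete, here is a minimal sketch of two regex tokenizers: one that treats every punctuation mark as a separate token, and one that keeps an apostrophe between letters inside the word. Both functions and the `APOSTROPHES` set are hypothetical illustrations, not the tokenizer that fastText or its preprocessing tools actually use.

```python
import re

# Assumed set of apostrophe characters seen in Ukrainian text:
# ASCII apostrophe, right single quotation mark, modifier letter apostrophe.
APOSTROPHES = "'\u2019\u02bc"

def split_on_punctuation(text):
    """Naive tokenizer: runs of word characters, or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

def keep_internal_apostrophes(text):
    """Same idea, but an apostrophe between word characters stays in the word."""
    pattern = r"\w+(?:[" + APOSTROPHES + r"]\w+)*|[^\w\s]"
    return re.findall(pattern, text)

if __name__ == "__main__":
    print(split_on_punctuation("прив'язана"))
    print(keep_internal_apostrophes("прив'язана"))
```

The first function reproduces the three-way split described above; the second returns the word as a single token.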

glebm changed the title from "Ukrainian tokenization splits words with internal apostrophe" to "Ukrainian tokenization bug: words with internal apostrophe" on Apr 6, 2022

whysage commented Apr 18, 2022

Hi, @glebm
Can't reproduce.

test.py

import fasttext
model = fasttext.train_unsupervised('data.txt', model='skipgram', minCount=1)
print(model.words)

data.txt

Перші археоантропи на території сучасної України з'явилися в епоху раннього палеоліту, понад 900—800 тис. років тому.
Слово прив'язана - як приклад.
1199 р. Роман Великий об'єднав Галичину і Волинь у єдину Галицько-Волинську державу.

out

Read 0M words
Number of words: 34
Number of labels: 0
Progress: 100.0% words/sec/thread: 69646 lr: 0.000000 avg.loss: 4.149282 ETA: 0h 0m 0s
['</s>', 'Великий', "прив'язана", '-', 'як', 'приклад.', '1199', 'р.', 'Роман', 'Слово', "об'єднав", 'Галичину', 'і', 'Волинь', 'у', 'єдину', 'Галицько-Волинську', 'державу.', 'Перші', 'тому.', 'років', 'тис.', '900—800', 'понад', 'палеоліту,', 'раннього', 'епоху', 'в', "з'явилися", 'України', 'сучасної', 'території', 'на', 'археоантропи']


glebm commented Apr 18, 2022

Looking at the pretrained Ukrainian embeddings -- there are no words with an internal ' in them. Perhaps the published pretrained embeddings were trained with an older/different tokenizer?

wget https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.uk.300.vec.gz
gunzip cc.uk.300.vec.gz
grep "'" cc.uk.300.vec
' -0.0292 0.0219 0.3533 ...
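
The `grep` above matches any line containing an apostrophe, including the bare `'` token. A minimal Python sketch of a stricter check follows: it lists only tokens that have an apostrophe strictly inside the word. It assumes the usual `.vec` text layout (a header line with the vocabulary size and dimension, then one token followed by its vector per line). The function name and the `APOSTROPHES` set are hypothetical.

```python
# Assumed set of apostrophe characters to look for.
APOSTROPHES = "'\u2019\u02bc"

def tokens_with_internal_apostrophe(lines):
    """Return tokens from .vec-style lines that contain an apostrophe
    somewhere other than the first or last character."""
    it = iter(lines)
    next(it, None)  # skip the "<vocab size> <dimension>" header line
    found = []
    for line in it:
        token = line.split(" ", 1)[0]
        if any(ch in token[1:-1] for ch in APOSTROPHES):
            found.append(token)
    return found

if __name__ == "__main__":
    # Tiny made-up sample in the same layout, for illustration only.
    sample = [
        "3 2",
        "' -0.02 0.02",
        "зявилися 0.07 -0.04",
        "прив'язана 0.10 0.20",
    ]
    print(tokens_with_internal_apostrophe(sample))
```

To run it on the real file, the lines could be streamed with `gzip.open("cc.uk.300.vec.gz", "rt", encoding="utf-8")` instead of the in-memory sample.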


whysage commented Apr 18, 2022

As far as I can see, the apostrophes are simply omitted:

grep "зобовязання" cc.uk.300.vec

зобовязання ...
зобовязаннями ...

grep "з'явилися" cc.uk.300.vec

grep "зявилися" cc.uk.300.vec
зявилися 0.0721 -0.0460 0.0400 -0.0369 ....

I can't think of any Ukrainian words whose meaning changes depending on whether the apostrophe is present.

So you could probably just remove apostrophes from your text in a preprocessing step.
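
A minimal sketch of such a preprocessing step, assuming the goal is to match the apostrophe-free spellings found in `cc.uk.300.vec`. The function name and the set of apostrophe characters are assumptions. Only apostrophes that sit between two word characters are removed, so quotation marks around a word are left alone.

```python
import re

# Assumed apostrophe variants: ASCII ', right single quote, modifier letter apostrophe.
_INTERNAL_APOSTROPHE = re.compile("(?<=\\w)['\u2019\u02bc](?=\\w)")

def strip_internal_apostrophes(text):
    """Remove apostrophes that occur between two word characters."""
    return _INTERNAL_APOSTROPHE.sub("", text)

if __name__ == "__main__":
    print(strip_internal_apostrophes("з'явилися"))
    print(strip_internal_apostrophes("Слово прив'язана - як приклад."))
```

The normalized words can then be looked up in the pretrained vectors as usual. Whether this matches the preprocessing actually used for the published embeddings is not confirmed in this thread.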

P.S. #StandWithUkraine


glebm commented Apr 18, 2022

Ah, that makes sense. Do you know where the apostrophe-omitting code is?
Also, perhaps this caveat should be documented? Thanks!

#StandWithUkraine!


whysage commented Apr 18, 2022

Do you know where the apostrophe-omitting code is?

It may be outside fastText itself.

https://github.com/facebookresearch/fastText/blob/main/docs/crawl-vectors.md

Tokenization
We used the Stanford word segmenter for Chinese, Mecab for Japanese and UETsegmenter for Vietnamese. For languages using the Latin, Cyrillic, Hebrew or Greek scripts, we used the tokenizer from the Europarl preprocessing tools. For the remaining languages, we used the ICU tokenizer.

Also, perhaps this caveat should be documented?

I created pull request #1268.
Maybe it will be merged some day.
