Ukrainian tokenization bug: words with internal apostrophe #1265
Comments
Hi, @glebm
test.py:
import fasttext
data.txt:
Перші археоантропи на території сучасної України з'явилися в епоху раннього палеоліту, понад 900—800 тис. років тому.
out:
Read 0M words
Looking at the pretrained Ukrainian embeddings, there are no words with an internal apostrophe:
wget https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.uk.300.vec.gz
gunzip cc.uk.300.vec.gz
grep "'" cc.uk.300.vec
As far as I can see, apostrophes are simply omitted:
grep "зобовязання" cc.uk.300.vec
зобовязання ...
grep "з'явилися" cc.uk.300.vec
grep "зявилися" cc.uk.300.vec
I can't think of any Ukrainian words whose meaning differs with and without the apostrophe. So maybe you can just remove apostrophes from your text in a preprocessing step.
P.S. #StandWithUkraine
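For illustration, the preprocessing step suggested above could look roughly like this. It is only a sketch, not anything taken from fastText: the function name `strip_internal_apostrophes` and the set of apostrophe-like characters are my own assumptions, and it only drops an apostrophe that sits between two letters, so surrounding quotation marks are left alone.

```python
import re

# Assumed set of apostrophe-like characters: ASCII apostrophe,
# right single quotation mark, and modifier letter apostrophe.
APOSTROPHES = "'\u2019\u02bc"

# Match an apostrophe only when there is a letter on both sides of it.
# [^\W\d_] means "a word character that is not a digit or underscore".
_INTERNAL_APOSTROPHE = re.compile(
    rf"(?<=[^\W\d_])[{APOSTROPHES}](?=[^\W\d_])"
)


def strip_internal_apostrophes(text: str) -> str:
    """Remove word-internal apostrophes (hypothetical helper)."""
    return _INTERNAL_APOSTROPHE.sub("", text)
```

Running the training text through this before tokenization would turn з'явилися into зявилися. Note that it also affects other languages in the same text (English "don't" becomes "dont").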
Ah, that makes sense. Do you know where the apostrophe-omitting code is? #StandWithUkraine!
Maybe it happens outside fastText. See the Tokenization section of https://github.com/facebookresearch/fastText/blob/main/docs/crawl-vectors.md
I created pull request #1268
The apostrophe (') is a normal letter in Ukrainian (https://en.wikipedia.org/wiki/Rules_for_using_the_apostrophe_in_the_Ukrainian_language).
Example word: прив'язана
The tokenizer used by fastText appears to split this single word into 3 tokens:
["прив", "'", "язана"]
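To make the expected behaviour concrete, here is a minimal sketch of a tokenizer that treats a word-internal apostrophe as part of the word. This is not fastText's tokenizer or the one used to build the pretrained vectors; the `tokenize` function, the regex, and the set of apostrophe characters are all assumptions made for illustration.

```python
import re

# Assumed set of apostrophe-like characters: ASCII apostrophe,
# right single quotation mark, and modifier letter apostrophe.
APOSTROPHES = "'\u2019\u02bc"

# A word is a run of letters, optionally joined by single internal
# apostrophes. Digit runs and any other non-space character become
# separate tokens.
TOKEN_RE = re.compile(
    rf"[^\W\d_]+(?:[{APOSTROPHES}][^\W\d_]+)*|\d+|\S"
)


def tokenize(text: str) -> list[str]:
    """Split text into tokens, keeping word-internal apostrophes."""
    return TOKEN_RE.findall(text)
```

With this sketch, прив'язана stays a single token, while an apostrophe that is not between two letters (for example a quotation mark) is still split off on its own.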