Fix word wise for stressed Russian epubs #192
Does only one character need to be removed? That could be added here: Lines 137 to 144 in 460db47
Yes, that'll change the book text; I forgot you want to keep the stress markers... This issue should be fixed in spaCy's Russian lemmatizer, change the text using …
There is an issue on the spaCy repo related to fixing the lemmatizer: explosion/spaCy#12530. It seems terribly complicated.
I think it's better to wait for spaCy's PR, sorry. This patch runs the model pipelines again on words that have stress marks...
Only on Russian words with stress marks, so it doesn't affect non-Russian books or normal Russian books. It probably won't even affect normal Russian books that contain French quotations like "À mauvais ouvrier point de bon outil.", because it only detects combining accent marks. So the performance impact should be negligible in every case except stressed Russian books, where the program is currently broken. The problem with the spaCy PR is that the original one has been sitting for almost a year and only fixes lemmatisation, not POS detection. To fix POS detection we would apparently have to host our own …
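To illustrate why precomposed letters such as "À" would not trigger the extra pass, here is a minimal sketch of that kind of check. It assumes the stress mark is the combining acute accent (U+0301); the function names are hypothetical and are not taken from the patch.

```python
# Assumption: Russian stress is encoded as the combining acute accent, U+0301.
COMBINING_ACUTE = "\u0301"


def has_stress_mark(text: str) -> bool:
    """Return True if the text contains a combining acute accent."""
    return COMBINING_ACUTE in text


def strip_stress_marks(text: str) -> str:
    """Remove combining acute accents, leaving every other character untouched."""
    return text.replace(COMBINING_ACUTE, "")


stressed = "молоко\u0301"            # "молоко" with a stress mark on the last vowel
french = "\u00c0 mauvais ouvrier"    # precomposed "À" (U+00C0), no combining mark

print(has_stress_mark(stressed))     # stressed Russian word is detected
print(has_stress_mark(french))       # precomposed accented letter is not
print(strip_stress_marks(stressed))  # plain form suitable for lookup
```

Text that has been decomposed (NFD) would spell "é" as "e" plus U+0301, so a real implementation might also want to check that the base letter is Cyrillic.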
Doesn't the English Wiktionary have forms that have stress marks? Have you tried disabling the "Use POS type" feature? Stressed forms should be matched if they are in the Word Wise db.
True, this works.
Another way to fix this is to add Word Wise notes first and then add stress marks...
It is possible, although implementing the processing of Word Wise files would be hard. 🗿
I think this is fixed in the latest release; all enabled forms can be matched.
I think it should be, thanks!
On Fri, 9 Aug 2024 at 16:11, xxyzz ***@***.***> wrote:
… I think this is fixed in the latest release, all enabled forms can be matched.
I fixed the code for stressed EPUBs by using two spaCy docs (only in this special case): one containing the original text and one used for lemmatization and POS detection. I have tested it on one Russian and one non-Russian book so far, and it seemed to work.
Should I add the same for Kindle? (I just can't test it.)
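The two-view idea can be sketched roughly as follows. This is not the code from the patch: a plain regex tokenizer stands in for the spaCy pipeline so the example runs without a model, U+0301 is assumed to be the stress mark, and all names are hypothetical. The point is only the offset map that lets results computed on the stripped text be attached to spans of the original, stressed text.

```python
import re

# Assumption: Russian stress is encoded as the combining acute accent, U+0301.
COMBINING_ACUTE = "\u0301"


def strip_with_offsets(text):
    """Return the text without stress marks, plus the original index of each kept char."""
    kept, offsets = [], []
    for i, ch in enumerate(text):
        if ch != COMBINING_ACUTE:
            kept.append(ch)
            offsets.append(i)
    return "".join(kept), offsets


def original_span(text, offsets, start, end):
    """Map [start, end) in the stripped text back to a span of the original text."""
    orig_start = offsets[start]
    orig_end = offsets[end - 1] + 1
    # Include a stress mark that directly follows the last letter of the token.
    while orig_end < len(text) and text[orig_end] == COMBINING_ACUTE:
        orig_end += 1
    return orig_start, orig_end


def analyze(text):
    """Tokenize the stripped view and pair each token with its original, stressed form."""
    stripped, offsets = strip_with_offsets(text)
    pairs = []
    for match in re.finditer(r"\w+", stripped):  # stand-in for the NLP pipeline
        start, end = original_span(text, offsets, match.start(), match.end())
        pairs.append((match.group(), text[start:end]))
    return pairs


for plain, stressed in analyze("Я пью\u0301 молоко\u0301."):
    print(plain, "->", stressed)
```

In a real pipeline the stripped token would be what gets lemmatized and POS-tagged, while the original span is where the Word Wise note gets inserted, so the book text keeps its stress marks.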