Fix word wise for stressed Russian epubs #192

Vuizur · 2024-02-23T22:29:21Z

I fixed the code for stressed epubs by using (only in this special case) two spaCy docs: one containing the text and one for lemmatization and POS detection. I have tested it on one Russian and one non-Russian book so far, and it seems to work.
Should I add the same for Kindle? (I just can't test it.)

xxyzz · 2024-02-24T00:11:16Z

Does only one character need to be removed? It could be added here:

WordDumb/epub.py

Lines 137 to 144 in 460db47

    
    with xhtml_path.open("r", encoding="utf-8") as f:
        # remove soft hyphen, byte order mark, word joiner
        xhtml_text = re.sub(
            r"\xad|&shy;|&#xad;|&#173;|\ufeff|\u2060|&NoBreak;",
            "",
            f.read(),
            flags=re.I,
        )

Vuizur · 2024-02-24T08:37:38Z

Removing that one character works fine for books created by my program. For general-purpose use, one should maybe use the more sophisticated remove_accents function, as in Proficiency, which can also remove grave accents.
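As a rough illustration of what stripping both accent types could look like, here is a minimal sketch. It is not the remove_accents function from Proficiency; the name `strip_stress_marks` is hypothetical, and it assumes the stress marks are stored as the combining characters U+0301 (acute) and U+0300 (grave).

```python
# Hypothetical helper, not the Proficiency implementation.
# Assumption: stress is encoded with combining marks, not precomposed letters.
STRESS_MARKS = {
    0x0300: None,  # combining grave accent
    0x0301: None,  # combining acute accent
}


def strip_stress_marks(text: str) -> str:
    """Drop combining acute and grave accents, leaving all other characters alone."""
    return text.translate(STRESS_MARKS)


stressed = "молоко\u0301"
plain = strip_stress_marks(stressed)
```

Because only the two combining code points are removed, precomposed letters such as й, ё or À pass through unchanged.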

I don't 100 percent understand the code, but removing it from the place you suggested will also remove the character from the output epub, right? I would want to keep it. So that at the end it looks like this, but currently spacy can't perform lemmatization and POS analysis with stressed text:

xxyzz · 2024-02-24T09:12:03Z

Yes, that'll change the book text. I forgot that you want to keep the stress marks...

This issue should be fixed in spaCy's Russian lemmatizer. Changing the text with str.replace breaks the word locations and would cause the footnotes to be added in the wrong place.
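The offset problem can be shown with plain strings. This is only a sketch with a made-up sentence: character offsets computed on the stripped text no longer point at the same place in the original stressed text.

```python
ACUTE = "\u0301"  # combining acute accent used as the stress mark

stressed = "Он пи\u0301л молоко\u0301."
plain = stressed.replace(ACUTE, "")

# Offsets of the same word in the two versions of the text.
start_in_plain = plain.index("молоко")
start_in_stressed = stressed.index("молоко\u0301")

# End offset found on the stripped text, applied to the original text:
end_in_plain = start_in_plain + len("молоко")
wrong_cut = stressed[:end_in_plain]  # stops in the middle of the word
```

Every stress mark before a word shifts its offsets by one, so a note inserted at `end_in_plain` would land inside the word instead of after it.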

Vuizur · 2024-02-24T09:36:37Z

There is an issue on the spacy repo that is related to the fixing of the lemmatizer: explosion/spaCy#12530. It seems terribly complicated.
I think my workaround works fine. It uses the token indices, which don't change between the stressed and unstressed texts, so the alignment is kept.
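The idea of aligning by token index rather than by character offset can be sketched without spaCy. The regex tokenizer below is a stand-in chosen for illustration, not spaCy's tokenizer and not the code in this PR; `TOKEN_RE`, `tokenize` and `align_by_token_index` are hypothetical names.

```python
import re

ACUTE = "\u0301"

# Stand-in tokenizer (assumption): a run of word characters and combining
# accents is one token, any other non-space character is its own token.
TOKEN_RE = re.compile(r"[\w\u0300\u0301]+|[^\w\s]")


def tokenize(text: str) -> list[tuple[str, int]]:
    """Return (token text, start offset) pairs."""
    return [(m.group(), m.start()) for m in TOKEN_RE.finditer(text)]


def align_by_token_index(stressed_text: str) -> list[tuple[str, int, str]]:
    """Pair each stressed token with the unstressed token at the same index."""
    stressed_tokens = tokenize(stressed_text)
    plain_tokens = tokenize(stressed_text.replace(ACUTE, ""))
    if len(stressed_tokens) != len(plain_tokens):
        raise ValueError("token counts differ, cannot align by index")
    return [
        (s_tok, s_start, p_tok)
        for (s_tok, s_start), (p_tok, _) in zip(stressed_tokens, plain_tokens)
    ]


aligned = align_by_token_index("Он пи\u0301л молоко\u0301.")
```

The unstressed token is what would be passed to lemmatization and POS tagging, while the start offset from the stressed text says where the note belongs in the book.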

xxyzz · 2024-02-24T09:59:22Z

I think it's better to wait for spaCy's PR, sorry. This patch runs the model pipelines a second time on words that have stress marks...

Vuizur · 2024-02-24T10:26:02Z

Only on Russian words with stress marks. So it doesn't affect any non-Russian books or any normal Russian books. It probably won't even affect normal Russian books that have French citations like "À mauvais ouvrier point de bon outil.", because it only detects combining accent marks. So the performance impact should be negligible in all cases except stressed Russian books, where the program is currently broken.
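A check of this kind might look like the following sketch. The function name `has_stress_marks` is hypothetical and the PR's actual check may differ; the assumption is that only the combining acute accent U+0301 is tested.

```python
COMBINING_ACUTE = "\u0301"


def has_stress_marks(text: str) -> bool:
    """True only if the text contains a combining acute accent."""
    return COMBINING_ACUTE in text


russian_stressed = has_stress_marks("молоко\u0301")
russian_plain = has_stress_marks("молоко")
# "À" written as the single precomposed code point U+00C0
french_quote = has_stress_marks("\u00c0 mauvais ouvrier point de bon outil.")
```

Text stored in decomposed (NFD) form is a caveat: there a French "é" is written as "e" followed by U+0301 and would trigger the check.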

The problem with the spacy PR is that the original one has been sitting for almost a year, and only fixes lemmatisation, but not POS detection. For fixing POS detection we would apparently have to host our own unstressed_core_news_* models and implement a custom language, which would probably result in more convoluted code changes than in this PR.

xxyzz · 2024-02-24T10:48:22Z

Doesn't the English Wiktionary have forms that have stress marks? Have you tried disabling the "Use POS type" feature? Stressed forms should be matched if they are in the Word Wise db.

Vuizur · 2024-02-24T12:27:38Z

> Doesn't the English Wiktionary have forms that have stress marks? Have you tried disabling the "Use POS type" feature? Stressed forms should be matched if they are in the Word Wise db.

True, this works.

xxyzz · 2024-02-24T15:33:38Z

Another way to fix this is to add the Word Wise notes first and then add the stress marks...

Vuizur · 2024-02-24T16:30:57Z

> Another way to fix this is to add the Word Wise notes first and then add the stress marks...

It is possible, although implementing the processing of Word Wise files would be hard. 🗿

xxyzz · 2024-08-09T14:11:36Z

I think this is fixed in the latest release: all enabled forms can be matched.

Vuizur · 2024-08-09T15:12:29Z

I think it should be, thanks!

Fix word wise for stressed Russian epubs

6f43e46

xxyzz closed this Aug 9, 2024
