Recognition model on Hebrew data #2
Comments
Any idea?
On 25/01/13 07:50AM, johnlockejrr wrote:
Any idea?
There isn't a lot of Hebrew in the training data for the base model, so
it performs astonishingly badly on it. Same with Syriac and, to a lesser
extent, Arabic script.
The character accuracy on the validation set (total characters, errors, accuracy, script):
1806 0 100.00% Hiragana
119259 1590 98.67% Han
611 13 97.87% Katakana
29431 1640 94.43% Cyrillic
69462 6191 91.09% Common
221855 21773 90.19% Latin
24992 2740 89.04% Arabic
135 20 85.19% Greek
4092 1047 74.41% Inherited
2066 718 65.25% Georgian
201 97 51.74% Unknown
599 325 45.74% Syriac
641 465 27.46% Newa
51 38 25.49% Hebrew
I'm working on it although my boss is likely to embargo the result of
the second round of base training to make sure we can fine-tune a Hebrew
model that follows the transcription guidelines properly.
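In case it helps to reproduce this kind of breakdown on other validation data, here is a minimal sketch of how a per-script character accuracy table like the one above could be computed. The range-based script classifier and the difflib alignment are simplified stand-ins, not the project's actual evaluation code.

```python
import difflib
from collections import defaultdict

# Hypothetical, highly simplified script classifier based on a few
# code-point ranges; real tooling would use full Unicode script data.
SCRIPT_RANGES = {
    "Hebrew": (0x0590, 0x05FF),
    "Arabic": (0x0600, 0x06FF),
    "Syriac": (0x0700, 0x074F),
    "Latin":  (0x0041, 0x024F),
}

def script_of(ch):
    cp = ord(ch)
    for name, (lo, hi) in SCRIPT_RANGES.items():
        if lo <= cp <= hi:
            return name
    return "Common/Other"

def per_script_accuracy(ground_truth, prediction):
    """Per script: (total GT chars, errors, accuracy %) from a
    character-level alignment of the prediction against the ground truth."""
    total = defaultdict(int)
    correct = defaultdict(int)
    matcher = difflib.SequenceMatcher(None, ground_truth, prediction)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        for ch in ground_truth[i1:i2]:
            total[script_of(ch)] += 1
            if tag == "equal":
                correct[script_of(ch)] += 1
    return {s: (total[s], total[s] - correct[s], 100.0 * correct[s] / total[s])
            for s in total}

if __name__ == "__main__":
    gt = "שלום עולם"      # ground-truth line
    pred = "שלוט עולם"    # model output with one substitution error
    for script, (chars, errs, acc) in per_script_accuracy(gt, pred).items():
        print(f"{chars}\t{errs}\t{acc:.2f}%\t{script}")
```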
I'm sure that is the case. Even so, it's weird that it comes up with the same phrase for a page that is in the training data, I mean in the Sofer Mahir dataset. Speaking of the Sofer Mahir dataset, it isn't gold: in some manuscripts the transcriptions haven't been properly corrected by a human hand and are really inexact and faulty. I heard a big effort to transcribe about 24M pages is ongoing at the NLI; maybe that will change things. Anyway, I'm 99% happy with the current model type.
There is much stronger language modelling in party than in the current kraken, so it tends to output "valid" text even when the recognition is absolute garbage, although it is weird that it's the same phrase over and over again. When you put in completely random data it outputs German words, for example. Sofer Mahir wasn't in the training data due to a silly mistake; that's one of the things we're fixing right now.
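To make the point about a strong language model swamping garbage recognition concrete, here is a purely illustrative toy greedy decoder (not party's or kraken's actual decoding code): when the recognizer distribution is near-uniform noise, a heavily weighted character LM pulls the output toward whatever phrase the LM prefers, regardless of the input.

```python
# Toy illustration: an LM-dominated greedy decoder drifts toward
# LM-preferred text even when the "recognizer" sees only noise.
import math
import random

VOCAB = list("abcdefghijklmnopqrstuvwxyz ")

# A hypothetical character bigram LM that strongly prefers one phrase.
PHRASE = "the quick brown fox "

def lm_logprob(prev, nxt):
    for i in range(len(PHRASE) - 1):
        if PHRASE[i] == prev and PHRASE[i + 1] == nxt:
            return math.log(0.9)
    return math.log(0.1 / len(VOCAB))

def recognizer_logprob(_frame, _ch):
    # "Garbage" recognition: essentially uniform over the vocabulary,
    # with tiny random jitter standing in for meaningless scores.
    return math.log(1.0 / len(VOCAB)) + random.uniform(-0.01, 0.01)

def greedy_decode(frames, lm_weight):
    out = "t"  # seed character
    for frame in frames:
        prev = out[-1]
        out += max(VOCAB, key=lambda c: recognizer_logprob(frame, c)
                                        + lm_weight * lm_logprob(prev, c))
    return out

random.seed(0)
frames = range(40)                              # 40 frames of random "input"
print(greedy_decode(frames, lm_weight=0.0))     # noise-driven output
print(greedy_decode(frames, lm_weight=5.0))     # LM-driven, phrase-like output
```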
I just tried the new recognition model on Hebrew data from Sofer Mahir; it doesn't look like it OCRs anything, it just outputs almost the same phrase with some variants. Something wrong with the tokenizer or the decoder?
Original:
New model output:
Data:
data.zip