Recognition model on Hebrew data #2

Open

johnlockejrr opened this issue Jan 10, 2025 · 4 comments
johnlockejrr commented Jan 10, 2025

I just tried the new recognition model on Hebrew data from Sofer Mahir. It doesn't seem to OCR anything; it just outputs almost the same phrase over and over, with some variants. Is something wrong with the tokenizer or the decoder?

Original:

[image]

New model output:

[image]

Data:

data.zip
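
For reference, one way to quantify the "almost the same phrase" symptom is to compare the recognized lines pairwise. A minimal sketch, assuming the model's output has been dumped to a plain-text file with one recognized line per row (the filename `model_output.txt` is a placeholder, not anything the model produces itself):

```python
from difflib import SequenceMatcher
from itertools import combinations
from pathlib import Path

# Hypothetical dump of the model's recognized text, one line per row.
lines = [
    l for l in Path("model_output.txt").read_text(encoding="utf-8").splitlines()
    if l.strip()
]

# Pairwise similarity ratios: values close to 1.0 for almost every pair would
# confirm the model is emitting variants of a single phrase instead of
# actually reading the page.
for a, b in combinations(lines, 2):
    ratio = SequenceMatcher(None, a, b).ratio()
    print(f"{ratio:.2f}\t{a[:40]!r} vs {b[:40]!r}")
```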

@johnlockejrr
Author

Any idea?


mittagessen commented Jan 13, 2025 via email


johnlockejrr commented Jan 13, 2025

I'm sure that is the case. Even so, it's weird that it comes up with the same phrase for a page that is present in the training data, I mean in the Sofer Mahir dataset. Speaking of the Sofer Mahir dataset: it isn't gold. In some manuscripts the transcriptions haven't been properly corrected by a human and they are really inexact and faulty. I heard a big effort to transcribe about 24M pages is ongoing at the NLI; maybe that will change things.

Anyway, I'm 99% happy with the current model type kraken has right now; all the problems I have are with the segmentation model, which is trickier. When the segmentation is good, the recognition is good too, provided the recognition model is trained on good data.

@mittagessen
Owner

There is much stronger language modelling in party than in current kraken, so it tends to output "valid" text even when the recognition is absolute garbage, although it is weird that it's the same phrase over and over again. When you feed it completely random data it outputs German words, for example.

Sofer Mahir wasn't in the training data due to a silly mistake. That's one of the things we're fixing right now.
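
A minimal sketch of the random-data sanity check described above, assuming grayscale line images as input. The `recognize(img)` call is a hypothetical placeholder for whatever inference entry point the model exposes, not a real party or kraken function:

```python
import numpy as np
from PIL import Image

def make_noise_line(width=1200, height=64, seed=0):
    """Generate a pure-noise grayscale line image."""
    rng = np.random.default_rng(seed)
    arr = rng.integers(0, 256, size=(height, width), dtype=np.uint8)
    return Image.fromarray(arr, mode="L")

# With a strong language model in the decoder, a recognizer will still emit
# plausible-looking words (e.g. German) for these noise images, which is the
# behaviour described above.
for i in range(5):
    img = make_noise_line(seed=i)
    img.save(f"noise_{i}.png")
    # print(recognize(img))  # placeholder: expect fluent-looking but meaningless text
```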
