Recognition model on Hebrew data #2
Comments
Any idea?
On 25/01/13 07:50AM, johnlockejrr wrote:
Any idea?
There isn't a lot of Hebrew in the training data for the base model, so
it performs astonishingly badly on it. Same with Syriac and, to a lesser
extent, Arabic script.
The character accuracy on the validation set (total characters, errors, accuracy, script):
1806 0 100.00% Hiragana
119259 1590 98.67% Han
611 13 97.87% Katakana
29431 1640 94.43% Cyrillic
69462 6191 91.09% Common
221855 21773 90.19% Latin
24992 2740 89.04% Arabic
135 20 85.19% Greek
4092 1047 74.41% Inherited
2066 718 65.25% Georgian
201 97 51.74% Unknown
599 325 45.74% Syriac
641 465 27.46% Newa
51 38 25.49% Hebrew
I'm working on it although my boss is likely to embargo the result of
the second round of base training to make sure we can fine-tune a Hebrew
model that follows the transcription guidelines properly.
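In case it helps to reproduce this kind of breakdown on other validation data, here is a minimal sketch of how a per-script character accuracy table like the one above could be computed. The range-based script classifier and the difflib alignment are simplified stand-ins, not the project's actual evaluation code.

```python
import difflib
from collections import defaultdict

# Hypothetical, highly simplified script classifier based on a few
# code-point ranges; real tooling would use full Unicode script data.
SCRIPT_RANGES = {
    "Hebrew": (0x0590, 0x05FF),
    "Arabic": (0x0600, 0x06FF),
    "Syriac": (0x0700, 0x074F),
    "Latin":  (0x0041, 0x024F),
}

def script_of(ch):
    cp = ord(ch)
    for name, (lo, hi) in SCRIPT_RANGES.items():
        if lo <= cp <= hi:
            return name
    return "Common/Other"

def per_script_accuracy(ground_truth, prediction):
    """Per script: (total GT chars, errors, accuracy %) from a
    character-level alignment of the prediction against the ground truth."""
    total = defaultdict(int)
    correct = defaultdict(int)
    matcher = difflib.SequenceMatcher(None, ground_truth, prediction)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        for ch in ground_truth[i1:i2]:
            total[script_of(ch)] += 1
            if tag == "equal":
                correct[script_of(ch)] += 1
    return {s: (total[s], total[s] - correct[s], 100.0 * correct[s] / total[s])
            for s in total}

if __name__ == "__main__":
    gt = "שלום עולם"      # ground-truth line
    pred = "שלוט עולם"    # model output with one substitution error
    for script, (chars, errs, acc) in per_script_accuracy(gt, pred).items():
        print(f"{chars}\t{errs}\t{acc:.2f}%\t{script}")
```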
I'm sure that is the case. Even so, it's weird that it comes up with the same phrase for a page that is in the training data, I mean in the Sofer Mahir dataset. Speaking of the Sofer Mahir dataset, it isn't gold: in some manuscripts the transcriptions haven't been properly corrected by a human hand and are really inexact and faulty. I heard a big effort to transcribe about 24M pages is ongoing at the NLI; maybe that will change things. Anyway, I'm 99% happy with the current model type.
There is much stronger language modelling in party than in the current kraken, so it tends to output "valid" text even when the recognition is absolute garbage, although it is weird that it's the same phrase over and over again. When you put in completely random data it outputs German words, for example. Sofer Mahir wasn't in the training data due to a silly mistake; that's one of the things we're fixing right now.
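To make the point about a strong language model swamping garbage recognition concrete, here is a purely illustrative toy greedy decoder (not party's or kraken's actual decoding code): when the recognizer distribution is near-uniform noise, a heavily weighted character LM pulls the output toward whatever phrase the LM prefers, regardless of the input.

```python
# Toy illustration: an LM-dominated greedy decoder drifts toward
# LM-preferred text even when the "recognizer" sees only noise.
import math
import random

VOCAB = list("abcdefghijklmnopqrstuvwxyz ")

# A hypothetical character bigram LM that strongly prefers one phrase.
PHRASE = "the quick brown fox "

def lm_logprob(prev, nxt):
    for i in range(len(PHRASE) - 1):
        if PHRASE[i] == prev and PHRASE[i + 1] == nxt:
            return math.log(0.9)
    return math.log(0.1 / len(VOCAB))

def recognizer_logprob(_frame, _ch):
    # "Garbage" recognition: essentially uniform over the vocabulary,
    # with tiny random jitter standing in for meaningless scores.
    return math.log(1.0 / len(VOCAB)) + random.uniform(-0.01, 0.01)

def greedy_decode(frames, lm_weight):
    out = "t"  # seed character
    for frame in frames:
        prev = out[-1]
        out += max(VOCAB, key=lambda c: recognizer_logprob(frame, c)
                                        + lm_weight * lm_logprob(prev, c))
    return out

random.seed(0)
frames = range(40)                              # 40 frames of random "input"
print(greedy_decode(frames, lm_weight=0.0))     # noise-driven output
print(greedy_decode(frames, lm_weight=5.0))     # LM-driven, phrase-like output
```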
I just tried the new recognition model on Hebrew data from Sofer Mahir; it doesn't look like it OCRs anything, it just outputs almost the same phrase with some variants. Something wrong with the tokenizer or the decoder?
Original:
New model output:
Data:
data.zip