Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Update TextLine.transcription_confidence when logits are available
Showing 1 changed file with 11 additions and 4 deletions.
ecbbd7a
@ibenes
Karel, this is breaking our API, where we use a confidence threshold of 0.66. Results are missing text lines, and page confidence is much lower. The second is not a problem; the first is. The API should behave as before. Should we change the threshold, the API code, or this code?
run_client.py from line 156:
`alto_xml = page_layout.to_altoxml_string(ocr_processing=ocr_processing,
                                          min_line_confidence=args.min_confidence)`
ecbbd7a
From my perspective, the issue runs deep. I see its core in the fact that `Layout.to_altoxml_string()` does more than just produce the string: it computes confidences of its own, stores them in the respective `TextLine`s, filters those lines, and, finally, this behavior is relied upon.

I think we should sit down and decide which confidence measure we want, and the computation should not be a side product of producing some string from `Layout`. I think that then `to_altoxml_string()` will even be able to enjoy a bit of cleanup. If we want both the ALTO thing and the "worst of best", they cannot live in the same member variable.
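
To make the proposed separation concrete, here is a minimal sketch, not the actual code of this repository. `Line`, `compute_confidences`, `filter_lines`, and `to_string` are hypothetical stand-ins for `TextLine` and the `Layout` methods, and the mean of per-character probabilities is only an assumed placeholder for whatever the ALTO export computes. The point is that each measure gets its own field, and that computing, filtering, and serializing are separate steps.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class Line:
    """Hypothetical stand-in for TextLine, not the real class."""
    text: str
    char_probs: Sequence[float]  # assumed input: best-path probability per character
    alto_confidence: Optional[float] = None           # placeholder measure: mean
    worst_of_best_confidence: Optional[float] = None  # minimum of the best-path probabilities


def compute_confidences(lines: List[Line]) -> None:
    """Compute both measures explicitly and store each in its own field."""
    for line in lines:
        if not line.char_probs:
            continue  # nothing to compute from; leave both fields as None
        line.alto_confidence = sum(line.char_probs) / len(line.char_probs)
        line.worst_of_best_confidence = min(line.char_probs)


def filter_lines(lines: List[Line], min_confidence: float,
                 measure: str = "alto_confidence") -> List[Line]:
    """Return lines whose chosen measure passes the threshold; unscored lines are kept."""
    kept = []
    for line in lines:
        value = getattr(line, measure)
        if value is None or value >= min_confidence:
            kept.append(line)
    return kept


def to_string(lines: List[Line]) -> str:
    """Pure serialization: no confidence computation, no filtering, no mutation."""
    return "\n".join(line.text for line in lines)
```

With this split, a caller such as the API client would pick the measure explicitly when filtering, so a threshold like 0.66 always refers to a known quantity. A line with one badly recognized character can pass under the mean and fail under the minimum, which is why the two values should not share one member variable.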