Add script to evaluate performance on SROIE dataset #44
Conversation
Thanks for this. I left some feedback that should be easy to address.
Regarding Tesseract's performance, I found a paper on the Papers With Code leaderboard for SROIE that quoted F1=54 (https://arxiv.org/pdf/2109.10282v5.pdf), but I couldn't see what preprocessing or other settings were used. I wouldn't be surprised if the Tesseract results can be bumped up a bit with some preprocessing adjustments, although part of the point of this project is that users shouldn't need to spend effort on that.
Regarding Ocrs's runtime performance, indeed it should be possible to make it quite a bit faster. This script can serve as a benchmark for that too.
tools/evaluate-sroie.py
Outdated
# Use a tempfs if available (Linux, MacOS) to reduce disk I/O overhead
TMP_DIR = Path("/dev/shm")
if not TMP_DIR.exists():
    TMP_DIR = None
`/dev/shm` is a Linux-specific thing as far as I'm aware. How much of a difference does it make?
You are right, it doesn't work on Mac. It's a few milliseconds or less depending on SSD performance; it was more to avoid unnecessarily writing stuff to disk. Will remove.
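(If the optimization were worth keeping, a portable fallback is possible; a minimal sketch, assuming the rest of the script treats `TMP_DIR` as a `Path`:)

```python
import tempfile
from pathlib import Path

# Prefer /dev/shm (a tmpfs on most Linux systems) to reduce disk I/O;
# otherwise fall back to the platform's default temp dir, which also
# works on macOS and Windows.
TMP_DIR = Path("/dev/shm")
if not TMP_DIR.exists():
    TMP_DIR = Path(tempfile.gettempdir())
```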
from tqdm import tqdm

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import f1_score, precision_score, recall_score
Can you sort the imports using isort conventions? You can do this with isort (or ruff).
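(For reference, isort conventions group standard-library imports first, then third-party packages, alphabetized within each group. A sketch based on the modules visible in this diff:)

```python
# Standard library first, then a blank line, then third-party packages,
# alphabetized within each group per isort conventions.
import argparse
from pathlib import Path
from subprocess import run

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import f1_score, precision_score, recall_score
from tqdm import tqdm
```

Running `isort tools/evaluate-sroie.py` or `ruff check --select I --fix tools/evaluate-sroie.py` produces this ordering automatically.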
tools/evaluate-sroie.py
Outdated
    return result.stdout


def run_global_retrieval_eval(max_samples: int) -> bool:
The type annotation says this returns a bool, but the body doesn't return anything.
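(Presumably the annotation should be `-> None`, unless the computed metrics are meant to be returned:)

```python
def run_global_retrieval_eval(max_samples: int) -> None:
    ...
```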
def run_global_retrieval_eval(max_samples: int) -> bool:
    """
    Evaluate OCR performance, by computing precision, recall and F1 score
    for the detected tokens globally on the whole document
To clarify, is this intended to follow the evaluation protocol from "Scanned Receipt OCR" in the SROIE paper - https://arxiv.org/pdf/2103.10213.pdf? Are there any differences between the tokenization mentioned there and what scikit-learn does as far as you know?
So initially I wanted to benchmark multiple datasets, in which case matching exactly the same evaluation procedure is a bit harder. For instance, https://huggingface.co/datasets/naver-clova-ix/cord-v2 should also be easy to add to this script using the same loader. However, since SROIE already takes up to 5 min (ocrs + tesseract), in the end I didn't.
In SROIE they use whitespace tokenization, which is a bit less forgiving than scikit-learn's. Here is an example:
>>> import re
>>> line = "l'b about 22.33 10/28"
>>> re.findall(r'[^\s]+', line) # whitespace tokenization
["l'b", 'about', '22.33', '10/28']
>>> re.findall(r'(?u)\b\w\w+\b', line) # scikit-learn tokenization
['about', '22', '33', '10', '28']
So you are right, let's revert back to whitespace tokenization as SROIE does, in which case the scores are a bit worse:
Evaluating on SROIE 2019 dataset...
- Ocrs: 1.57 s / image, precision 0.88, recall 0.72, F1 0.79
- Tesseract: 0.86 s / image, precision 0.28, recall 0.26, F1 0.27
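(For reference, whitespace tokenization can be plugged into the same scikit-learn machinery via `token_pattern`. A minimal sketch; the `token_scores` helper and its presence/absence binarization are illustrative, not necessarily the script's exact implementation:)

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import f1_score, precision_score, recall_score


def token_scores(true_text: str, pred_text: str) -> tuple[float, float, float]:
    # \S+ reproduces SROIE-style whitespace tokenization; the default
    # token_pattern (\b\w\w+\b) would split tokens like "22.33" or "10/28".
    vectorizer = CountVectorizer(token_pattern=r"\S+", lowercase=False)
    counts = vectorizer.fit_transform([true_text, pred_text])
    # Binarize presence/absence of each vocabulary token and score globally.
    y_true, y_pred = (counts.toarray() > 0).astype(int)
    return (
        precision_score(y_true, y_pred),
        recall_score(y_true, y_pred),
        f1_score(y_true, y_pred),
    )
```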
Hmm... the sensitivity to spacing is unfortunate. Using the example given in the paper:
For example the string “Date: 12/3/56” should be tokenised “Date:”, “12/3/56”. While the string “Date: 12 / 3 / 56” should be tokenised “Date:” “12”, “/”, “3”, “/”, “56”.
I can imagine that in some cases it might be ambiguous to humans whether spaces should appear in the transcription. If the annotators were given specific instructions which they followed, then a model could learn these conventions from the training split. This will make models that weren't trained specifically on this dataset appear worse, though.
run("cargo build --release -p ocrs-cli", shell=True, check=True, text=True) | ||
|
||
|
||
def extract_text(image_path: str) -> str: |
This is fine for now. In future I think we can introduce some optimizations for running ocrs on many images in succession.
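(A sketch of what `extract_text` presumably does: shell out to the freshly built CLI and capture its stdout. The binary path assumes cargo's default target directory; the actual flags may differ:)

```python
from subprocess import run


def extract_text(image_path: str) -> str:
    # Invoke the ocrs CLI on a single image and return the recognized
    # text printed to stdout.
    result = run(
        ["target/release/ocrs", image_path],
        check=True, capture_output=True, text=True,
    )
    return result.stdout
```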
Co-authored-by: Robert Knight <[email protected]>
Thanks for the review. Addressed review comments.
Thanks for the update. The sensitivity to spacing differences is unfortunate, although it could well indicate an area where the model needs improvement - you'd have to dig into the actual vs expected outputs in more detail to understand this. Perhaps for this PR it would make sense to report metrics under both kinds of tokenization.
pip install scikit-learn datasets tqdm

Optionally, you can install pytesseract to compare with tesseract.
"""
It seems that `ArgumentParser` doesn't preserve newlines in the text when using `--help`. The dumbest solution I can see is just to write the description in such a way that it can be read either as multiple lines in the code or a single paragraph.
description="""
Evaluate ocrs on the benchmark datasets.
To run this script, you need, to install dependencies: pip install scikit-learn datasets tqdm.
Optionally, you can install pytesseract to compare with tesseract.
"""
Related to the discussion in #43, this adds a script to evaluate on the SROIE 2019 dataset (scanned receipts). I wanted to do end-to-end eval and needed the executable, so it seemed easier to put it here rather than in https://github.com/robertknight/ocrs-models.
But feel free to close, I was mostly curious about the results.
To run this script:
(I saw there are some metrics in ocrs-models, but for text vectorization it seemed easier to use scikit-learn.)
which produces (on the first 100 of ~230 images):
The precision and recall scores are computed globally on the text extracted from the image, after tokenizing with scikit-learn's vectorizer.
So overall the scores look quite good! I'm not sure, maybe I'm not using tesseract right; its performance looks pretty bad on this dataset. Or maybe it needs some pre-processing.
Run time is a bit slower than tesseract, but I imagine that could always be improved somewhat.
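(For what it's worth, the Tesseract side presumably boils down to a call like the following; a sketch only, the actual script may pass different options:)

```python
import pytesseract
from PIL import Image


def extract_text_tesseract(image_path: str) -> str:
    # Default settings, no preprocessing; SROIE results might improve
    # with binarization/rescaling or a different --psm segmentation mode.
    return pytesseract.image_to_string(Image.open(image_path))
```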