Add script to evaluate performance on SROIE dataset #44
Conversation
Thanks for this. I left some feedback that should be easy to address.
Regarding Tesseract's performance, I found a paper on the Papers With Code leaderboard for SROIE that quoted F1=54 (https://arxiv.org/pdf/2109.10282v5.pdf), but I couldn't see what preprocessing or other settings were used. I wouldn't be surprised if the Tesseract results can be bumped up a bit with some preprocessing adjustments, although part of the point of this project is that users shouldn't need to spend effort on that.
Regarding Ocrs's runtime performance, indeed it should be possible to make it quite a bit faster. This script can serve as a benchmark for that too.
tools/evaluate-sroie.py
Outdated
# Use a tempfs if available (Linux, MacOS) to reduce disk I/O overhead
TMP_DIR = Path("/dev/shm")
if not TMP_DIR.exists():
    TMP_DIR = None
`/dev/shm` is a Linux-specific thing as far as I'm aware. How much of a difference does it make?
You are right, it doesn't work on Mac. It's a few milliseconds or less depending on SSD performance; it was more to avoid unnecessarily writing stuff to disk. Will remove.
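(If the optimization were worth keeping, a portable fallback is possible; a minimal sketch, assuming the rest of the script treats `TMP_DIR` as a `Path`:)

```python
import tempfile
from pathlib import Path

# Prefer /dev/shm (a tmpfs on most Linux systems) to reduce disk I/O;
# otherwise fall back to the platform's default temp dir, which also
# works on macOS and Windows.
TMP_DIR = Path("/dev/shm")
if not TMP_DIR.exists():
    TMP_DIR = Path(tempfile.gettempdir())
```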
from tqdm import tqdm

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import f1_score, precision_score, recall_score
Can you sort the imports using isort conventions? You can do this with isort (or ruff).
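(For reference, isort conventions group standard-library imports first, then third-party packages, alphabetized within each group. A sketch based on the modules visible in this diff:)

```python
# Standard library first, then a blank line, then third-party packages,
# alphabetized within each group per isort conventions.
import argparse
from pathlib import Path
from subprocess import run

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import f1_score, precision_score, recall_score
from tqdm import tqdm
```

Running `isort tools/evaluate-sroie.py` or `ruff check --select I --fix tools/evaluate-sroie.py` produces this ordering automatically.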
tools/evaluate-sroie.py
Outdated
    return result.stdout


def run_global_retrieval_eval(max_samples: int) -> bool:
The type annotation says this returns a bool, but the body doesn't return anything.
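(Presumably the annotation should be `-> None`, unless the computed metrics are meant to be returned:)

```python
def run_global_retrieval_eval(max_samples: int) -> None:
    ...
```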
def run_global_retrieval_eval(max_samples: int) -> bool:
    """
    Evaluate OCR performance, by computing precision, recall and F1 score
    for the detected tokens globally on the whole document
To clarify, is this intended to follow the evaluation protocol from "Scanned Receipt OCR" in the SROIE paper - https://arxiv.org/pdf/2103.10213.pdf? Are there any differences between the tokenization mentioned there and what scikit-learn does as far as you know?
So initially I wanted to benchmark multiple datasets, in which case matching exactly the same evaluation procedure is a bit harder. For instance, https://huggingface.co/datasets/naver-clova-ix/cord-v2 should also be easy to add to this script using the same loader. However, since SROIE already takes up to 5 min (ocrs + tesseract), in the end I didn't.
In SROIE they use whitespace tokenization, which is a bit less forgiving than scikit-learn's. Here is an example:
>>> import re
>>> line = "l'b about 22.33 10/28"
>>> re.findall(r'[^\s]+', line) # whitespace tokenization
["l'b", 'about', '22.33', '10/28']
>>> re.findall(r'(?u)\b\w\w+\b', line) # scikit-learn tokenization
['about', '22', '33', '10', '28']
So you are right, let's revert back to whitespace tokenization as SROIE does, in which case the scores are a bit worse:
Evaluating on SROIE 2019 dataset...
- Ocrs: 1.57 s / image, precision 0.88, recall 0.72, F1 0.79
- Tesseract: 0.86 s / image, precision 0.28, recall 0.26, F1 0.27
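(For reference, whitespace tokenization can be plugged into the same scikit-learn machinery via `token_pattern`. A minimal sketch; the `token_scores` helper and its presence/absence binarization are illustrative, not necessarily the script's exact implementation:)

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import f1_score, precision_score, recall_score


def token_scores(true_text: str, pred_text: str) -> tuple[float, float, float]:
    # \S+ reproduces SROIE-style whitespace tokenization; the default
    # token_pattern (\b\w\w+\b) would split tokens like "22.33" or "10/28".
    vectorizer = CountVectorizer(token_pattern=r"\S+", lowercase=False)
    counts = vectorizer.fit_transform([true_text, pred_text])
    # Binarize presence/absence of each vocabulary token and score globally.
    y_true, y_pred = (counts.toarray() > 0).astype(int)
    return (
        precision_score(y_true, y_pred),
        recall_score(y_true, y_pred),
        f1_score(y_true, y_pred),
    )
```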
Hmm... the sensitivity to spacing is unfortunate. Using the example given in the paper:
For example the string “Date: 12/3/56” should be tokenised “Date:”, “12/3/56”. While the string “Date: 12 / 3 / 56” should be tokenised “Date:” “12”, “/”, “3”, “/”, “56”.
I can imagine that in some cases it might be ambiguous to humans whether spaces should appear in the transcription. If the annotators were given specific instructions which they followed, then a model could learn these conventions from the training split. This will make models that weren't trained specifically on this dataset appear worse, though.
run("cargo build --release -p ocrs-cli", shell=True, check=True, text=True) | ||
|
||
|
||
def extract_text(image_path: str) -> str: |
This is fine for now. In future I think we can introduce some optimizations for running ocrs on many images in succession.
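(A sketch of what `extract_text` presumably does: shell out to the freshly built CLI and capture its stdout. The binary path assumes cargo's default target directory; the actual flags may differ:)

```python
from subprocess import run


def extract_text(image_path: str) -> str:
    # Invoke the ocrs CLI on a single image and return the recognized
    # text printed to stdout.
    result = run(
        ["target/release/ocrs", image_path],
        check=True, capture_output=True, text=True,
    )
    return result.stdout
```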
Co-authored-by: Robert Knight <[email protected]>
Thanks for the review. Addressed review comments.
Thanks for the update. The sensitivity to spacing differences is unfortunate, although it could well indicate an area where the model needs improvement - you'd have to dig into the actual vs expected outputs in more detail to understand this. Perhaps for this PR it would make sense to report metrics under both kinds of tokenization.
pip install scikit-learn datasets tqdm

Optionally, you can install pytesseract to compare with tesseract.
"""
It seems that `ArgumentParser` doesn't preserve newlines in the text when using `--help`. The dumbest solution I can see is just to write the description in such a way that it can be read either as multiple lines in the code or a single paragraph.
description="""
Evaluate ocrs on the benchmark datasets.
To run this script, you need, to install dependencies: pip install scikit-learn datasets tqdm.
Optionally, you can install pytesseract to compare with tesseract.
"""
Related to the discussion in #43, this adds a script to evaluate on the SROIE 2019 dataset (scanned receipts). I wanted to do end-to-end eval and needed the executable, so it seemed easier to put it here rather than in https://github.com/robertknight/ocrs-models.
But feel free to close, I was mostly curious about the results.
To run this script:
(I saw there are some metrics in ocrs-models, but for text vectorization it seemed easier to use scikit-learn.)
which produces (on the first 100 of ~230 images):
The precision and recall scores are computed globally on the text extracted from the image, after tokenizing with scikit-learn's vectorizer.
So overall the scores look quite good! I'm not sure, maybe I'm not using tesseract right; its performance looks pretty bad on this dataset. Or maybe it needs some pre-processing.
Run time is a bit slower than tesseract, but I imagine that could always be improved somewhat.
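(For what it's worth, the Tesseract side presumably boils down to a call like the following; a sketch only, the actual script may pass different options:)

```python
import pytesseract
from PIL import Image


def extract_text_tesseract(image_path: str) -> str:
    # Default settings, no preprocessing; SROIE results might improve
    # with binarization/rescaling or a different --psm segmentation mode.
    return pytesseract.image_to_string(Image.open(image_path))
```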