
Add script to evaluate performance on SROIE dataset #44

Open · wants to merge 6 commits into main
Conversation

@rth commented Mar 30, 2024

Related to the discussion in #43, this adds a script to evaluate on the SROIE 2019 dataset (scanned receipts). I wanted to do an end-to-end eval and needed the executable, so it seemed easier to put it here rather than in https://github.com/robertknight/ocrs-models.

But feel free to close, I was mostly curious about the results.

To run this script:

  1. Install dependencies:
    • pip install scikit-learn datasets tqdm
      (I saw there are some metrics in ocrs-models, but for text vectorization it seemed easier to use scikit-learn)
  2. Optionally install pytesseract + tesseract
  3. Run:
    python tools/evaluate-sroie.py
    

which produces (on the first 100 of ~230 images):

Evaluating on SROIE 2019 dataset...
 - Ocrs: 1.45 s / image, precision 0.96, recall 0.84, F1 0.90
 - Tesseract: 0.84 s / image, precision 0.36, recall 0.34, F1 0.35

The precision and recall scores are computed globally on the text extracted from the image, after tokenizing with scikit-learn's vectorizer.
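Roughly, that computation could look like the following sketch (token_scores is a hypothetical per-document helper; the actual script aggregates over the whole dataset, so details may differ):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import f1_score, precision_score, recall_score

def token_scores(true_text: str, pred_text: str) -> tuple[float, float, float]:
    # Build a shared vocabulary over both texts and count token occurrences.
    counts = CountVectorizer().fit_transform([true_text, pred_text]).toarray()
    # Binary presence of each token in the ground truth vs. the OCR output.
    y_true, y_pred = (counts > 0).astype(int)
    return (
        precision_score(y_true, y_pred),
        recall_score(y_true, y_pred),
        f1_score(y_true, y_pred),
    )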

So overall the scores look quite good! I'm not sure whether I'm using Tesseract right; its performance looks pretty bad on this dataset. Or maybe it needs some pre-processing.

Run time is a bit slower than Tesseract's, but I imagine that could always be improved somewhat.

@robertknight (Owner) left a comment

Thanks for this. I left some feedback that should be easy to address.

Regarding Tesseract's performance, I found a paper on the Papers With Code leaderboard for SROIE that quoted F1=54 (https://arxiv.org/pdf/2109.10282v5.pdf), but I couldn't see what preprocessing or other settings were used. I wouldn't be surprised if the Tesseract results can be bumped up a bit with some preprocessing adjustments, although part of the point of this project is that users shouldn't need to spend effort on that.

Regarding Ocrs's runtime performance, indeed it should be possible to make it quite a bit faster. This script can serve as a benchmark for that too.

# Use a tmpfs if available (Linux, MacOS) to reduce disk I/O overhead
TMP_DIR = Path("/dev/shm")
if not TMP_DIR.exists():
    TMP_DIR = None
@robertknight (Owner) commented Mar 31, 2024

/dev/shm is a Linux-specific thing as far as I'm aware. How much of a difference does it make?

@rth (Author) commented

You are right, it doesn't work on Mac. It's a few milliseconds or less depending on SSD performance; it was more to avoid unnecessarily writing stuff to disk. Will remove.
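For reference, a portable fallback could look like the sketch below (purely illustrative, not part of the PR):

import tempfile
from pathlib import Path

# Use /dev/shm when present (Linux), otherwise fall back to the
# platform's default temporary directory.
_shm = Path("/dev/shm")
TMP_DIR = _shm if _shm.is_dir() else Path(tempfile.gettempdir())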

from tqdm import tqdm

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import f1_score, precision_score, recall_score
@robertknight (Owner) commented

Can you sort the imports using isort conventions? You can do this with isort (or ruff).
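For illustration, with isort's default settings these third-party imports would end up in a single block, sorted by module name:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import f1_score, precision_score, recall_score
from tqdm import tqdm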

    return result.stdout


def run_global_retrieval_eval(max_samples: int) -> bool:
@robertknight (Owner) commented

The type annotation says this returns a bool, but the body doesn't return anything.
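A minimal fix (sketch) would be to annotate the return type as None, since the body doesn't return anything:

def run_global_retrieval_eval(max_samples: int) -> None:
    ...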

def run_global_retrieval_eval(max_samples: int) -> bool:
"""
Evaluate OCR performance, by computing precision, recall and F1 score
for the detected tokens globally on the whole document
@robertknight (Owner) commented

To clarify, is this intended to follow the evaluation protocol from "Scanned Receipt OCR" in the SROIE paper - https://arxiv.org/pdf/2103.10213.pdf? Are there any differences between the tokenization mentioned there and what scikit-learn does as far as you know?

@rth (Author) commented Apr 1, 2024

So initially I wanted to benchmark multiple datasets, in which case matching exactly the same evaluation procedure is a bit harder. For instance, https://huggingface.co/datasets/naver-clova-ix/cord-v2 should also be easy to add to this script using the same loader. However, since SROIE already takes up to 5 min (ocrs + tesseract), in the end I didn't.

In SROIE they use whitespace tokenization, which is a bit less forgiving than scikit-learn's. Here is an example:

>>> import re
>>> line = "l'b about 22.33 10/28"
>>> re.findall(r'[^\s]+', line)  # whitespace tokenization
["l'b", 'about', '22.33', '10/28']
>>> re.findall(r'(?u)\b\w\w+\b', line)  # scikit-learn tokenization
['about', '22', '33', '10', '28'] 

So you are right, let's revert to whitespace tokenization as SROIE does, in which case the scores are a bit worse:

Evaluating on SROIE 2019 dataset...
 - Ocrs: 1.57 s / image, precision 0.88, recall 0.72, F1 0.79
 - Tesseract: 0.86 s / image, precision 0.28, recall 0.26, F1 0.27

@robertknight (Owner) commented

Hmm... the sensitivity to spacing is unfortunate. Using the example given in the paper:

For example the string “Date: 12/3/56” should be tokenised “Date:”, “12/3/56”. While the string “Date: 12 / 3 / 56” should be tokenised “Date:” “12”, “/”, “3”, “/”, “56”.

I can imagine that in some cases it might be ambiguous to humans whether spaces should appear in the transcription. If the annotators were given specific instructions which they followed, then a model could learn these conventions from the training split. This will make models that weren't trained specifically on this dataset appear worse, though.

run("cargo build --release -p ocrs-cli", shell=True, check=True, text=True)


def extract_text(image_path: str) -> str:
@robertknight (Owner) commented

This is fine for now. In future I think we can introduce some optimizations for running ocrs on many images in succession.
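For context, such a helper might look like the sketch below (hypothetical: the path to the built binary and the way the ocrs CLI is invoked here are assumptions, not the PR's actual code):

import subprocess
from pathlib import Path

# Assumed location of the binary produced by `cargo build --release -p ocrs-cli`.
OCRS_BIN = Path("target/release/ocrs")

def extract_text(image_path: str) -> str:
    # Assumption: the CLI takes an image path and prints the recognized text to stdout.
    result = subprocess.run(
        [str(OCRS_BIN), image_path],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout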

@rth (Author) commented Apr 1, 2024

Thanks for the review. Addressed review comments.

@robertknight (Owner) left a comment

Thanks for the update. The sensitivity to spacing differences is unfortunate, although it could well indicate an area where the model needs improvement - you'd have to dig into the actual vs expected outputs in more detail to understand this. Perhaps for this PR it would make sense to report metrics under both kinds of tokenization.
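A sketch of what reporting both could look like (hypothetical names; a CountVectorizer with a \S+ token pattern approximates SROIE's whitespace tokenization, though it still lowercases by default):

from sklearn.feature_extraction.text import CountVectorizer

TOKENIZERS = {
    # scikit-learn default token pattern: (?u)\b\w\w+\b
    "sklearn": CountVectorizer(),
    # SROIE-style whitespace tokens
    "whitespace": CountVectorizer(token_pattern=r"\S+"),
}

for name, vectorizer in TOKENIZERS.items():
    # ... run the evaluation with this vectorizer ...
    print(f"[{name} tokenization] precision=..., recall=..., F1=...")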


pip install scikit-learn datasets tqdm

Optionally, you can install pytesseract to compare with tesseract.
"""
@robertknight (Owner) commented

It seems that ArgumentParser doesn't preserve newlines in the text when using --help. The dumbest solution I can see is just to write the description in such a way that it can be read either as multiple lines in the code or a single paragraph.

    description="""
Evaluate ocrs on the benchmark datasets.

To run this script, you need to install the dependencies: pip install scikit-learn datasets tqdm.

Optionally, you can install pytesseract to compare with tesseract.
"""
