Generating OCR documents seems very complex #373

Open
lava opened this issue Jan 19, 2025 · 0 comments

lava commented Jan 19, 2025

I'm trying to use Document AI to generate a searchable PDF from an input document. Given the marketing around Document AI and the availability of a pretrained "Document OCR" processor, I'm assuming this is one of the intended use cases.

So I'm uploading the file to GCS, running a Document OCR batch job, and getting back a document; so far, so good.
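
For context, the batch job itself is the unproblematic part. Roughly this, using the standard google-cloud-documentai client (project_id, location, processor_id, input_uri, and output_uri are placeholders):

    from google.cloud import documentai

    client = documentai.DocumentProcessorServiceClient()
    request = documentai.BatchProcessRequest(
        name=client.processor_path(project_id, location, processor_id),
        input_documents=documentai.BatchDocumentsInputConfig(
            gcs_documents=documentai.GcsDocuments(
                documents=[
                    documentai.GcsDocument(
                        gcs_uri=input_uri, mime_type="application/pdf"
                    )
                ]
            )
        ),
        document_output_config=documentai.DocumentOutputConfig(
            gcs_output_config=documentai.DocumentOutputConfig.GcsOutputConfig(
                gcs_uri=output_uri
            )
        ),
    )
    operation = client.batch_process_documents(request=request)
    operation.result(timeout=600)  # wait for the batch job to finish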

However, the subsequent workflow becomes messy very quickly:

  1. As far as I understand, Document AI may internally deskew, convert, and/or downscale the submitted document before performing OCR. So in order to have the overlay text match the displayed image, I'm writing out the data stored in document["pages"][i]["image"]["content"]. However, the document.pages helper only yields wrapped pages that don't expose the content as far as I can tell, so I'm forced to first export the whole document as JSON (in order to handle shards, which aren't documented anywhere, btw!) and then re-parse that JSON to get at the content:
    from google.cloud import documentai
    from google.cloud.documentai_toolbox import document
    import base64
    import json

    # Load the sharded batch output from GCS and merge it into one Document proto.
    wrapped_document = document.Document.from_gcs(
        gcs_bucket_name=output_bucket, gcs_prefix=output_prefix
    )
    merged_document = wrapped_document.to_merged_documentai_document()

    # Round-trip through JSON just to get at pages[i].image.content.
    document_json_string = documentai.Document.to_json(merged_document)
    document_json = json.loads(document_json_string)

    for i, page in enumerate(document_json["pages"]):
        raw_content = base64.b64decode(page["image"]["content"])
        # ...
  2. In order to get the OCR layer, I'm using document.export_hocr_str(). However, that returns a single multi-page .hocr file, while all the downstream tooling I could find expects one .hocr file plus one image as input and produces one PDF page. So I have to split the returned .hocr by page, generate a lot of single-page PDFs, and finally use pikepdf to merge them into one output document (a rough sketch of that step follows below).
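
This is roughly what the splitting and merging ends up looking like. It's only a sketch under my assumptions: the hOCR from export_hocr_str() parses as well-formed XML with one ocr_page div per page, and some external tool (e.g. hocr-tools) converts each page's hOCR plus its image into a single-page PDF; that conversion step is not shown here.

    import xml.etree.ElementTree as ET
    import pikepdf

    def split_hocr_pages(hocr_str: str) -> list[str]:
        """Split a multi-page hOCR document into one hOCR string per page."""
        root = ET.fromstring(hocr_str)
        pages = [
            el for el in root.iter()
            if el.tag.endswith("div") and "ocr_page" in el.get("class", "")
        ]
        # Minimal re-wrapping; real hOCR consumers may need the full head/meta
        # skeleton and namespace handling from the original document.
        return [
            "<html><body>" + ET.tostring(page, encoding="unicode") + "</body></html>"
            for page in pages
        ]

    def merge_pdfs(page_pdf_paths: list[str], out_path: str) -> None:
        """Concatenate the per-page PDFs into the final searchable document."""
        merged = pikepdf.Pdf.new()
        sources = [pikepdf.open(p) for p in page_pdf_paths]
        try:
            for src in sources:
                merged.pages.extend(src.pages)
            merged.save(out_path)  # sources must stay open until save()
        finally:
            for src in sources:
                src.close()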

Both of these feel very cumbersome, given that I'm already using the toolbox library that is supposed to make working with the API painless.

Am I missing some companion library that would make this workflow easier? I assume Google has internal libraries that handle these steps; would it make sense to include them in the toolbox?
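
To make the ask concrete, a single helper along these lines (purely hypothetical, nothing like it exists in the toolbox today as far as I can tell) would cover my whole use case:

    # Hypothetical API, not part of documentai_toolbox:
    wrapped_document = document.Document.from_gcs(
        gcs_bucket_name=output_bucket, gcs_prefix=output_prefix
    )
    # Write a searchable PDF from the processed page images plus the OCR text layer.
    wrapped_document.export_searchable_pdf("output.pdf")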

rosiezou removed their assignment Jan 29, 2025