Generating OCR documents seems very complex #373

Open
lava opened this issue Jan 19, 2025 · 0 comments

lava commented Jan 19, 2025

I'm trying to use Document AI to generate a searchable PDF from an input document. Given the marketing around Document AI and the availability of a pretrained "Document OCR" processor, I'm assuming this is one of the intended use cases.

So I'm uploading the file to GCS, running a Document OCR batch job, and getting back a document; so far, so good.
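
For context, the batch job itself is the unproblematic part. Roughly this, using the standard google-cloud-documentai client (project_id, location, processor_id, input_uri, and output_uri are placeholders):

    from google.cloud import documentai

    client = documentai.DocumentProcessorServiceClient()
    request = documentai.BatchProcessRequest(
        name=client.processor_path(project_id, location, processor_id),
        input_documents=documentai.BatchDocumentsInputConfig(
            gcs_documents=documentai.GcsDocuments(
                documents=[
                    documentai.GcsDocument(
                        gcs_uri=input_uri, mime_type="application/pdf"
                    )
                ]
            )
        ),
        document_output_config=documentai.DocumentOutputConfig(
            gcs_output_config=documentai.DocumentOutputConfig.GcsOutputConfig(
                gcs_uri=output_uri
            )
        ),
    )
    operation = client.batch_process_documents(request=request)
    operation.result(timeout=600)  # wait for the batch job to finish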

However, the subsequent workflow becomes messy very quickly:

  1. As far as I understand, Document AI may internally deskew, convert, and/or downscale the submitted document before performing OCR. So in order to have the overlay text match the displayed image, I'm writing out the data stored in document["pages"][i]["image"]["content"]. However, the document.pages helper only yields wrapped pages that don't expose the content as far as I can tell, so I'm forced to first export the whole document as JSON (in order to handle shards, which aren't documented anywhere, btw!) and then re-parse that JSON to get at the content:
    from google.cloud import documentai
    from google.cloud.documentai_toolbox import document
    import base64
    import json

    # Load the sharded batch output from GCS and merge it into one Document proto.
    wrapped_document = document.Document.from_gcs(
        gcs_bucket_name=output_bucket, gcs_prefix=output_prefix
    )
    merged_document = wrapped_document.to_merged_documentai_document()

    # Round-trip through JSON just to get at pages[i].image.content.
    document_json_string = documentai.Document.to_json(merged_document)
    document_json = json.loads(document_json_string)

    for i, page in enumerate(document_json["pages"]):
        raw_content = base64.b64decode(page["image"]["content"])
        # ...
  2. In order to get the OCR layer, I'm using document.export_hocr_str(). However, that returns a single multi-page .hocr file, while all the downstream tooling I could find expects one .hocr file plus one image as input and produces one PDF page. So I have to split the returned .hocr by page, generate a lot of single-page PDFs, and finally use pikepdf to merge them into one output document (a rough sketch of that step follows below).
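
This is roughly what the splitting and merging ends up looking like. It's only a sketch under my assumptions: the hOCR from export_hocr_str() parses as well-formed XML with one ocr_page div per page, and some external tool (e.g. hocr-tools) converts each page's hOCR plus its image into a single-page PDF; that conversion step is not shown here.

    import xml.etree.ElementTree as ET
    import pikepdf

    def split_hocr_pages(hocr_str: str) -> list[str]:
        """Split a multi-page hOCR document into one hOCR string per page."""
        root = ET.fromstring(hocr_str)
        pages = [
            el for el in root.iter()
            if el.tag.endswith("div") and "ocr_page" in el.get("class", "")
        ]
        # Minimal re-wrapping; real hOCR consumers may need the full head/meta
        # skeleton and namespace handling from the original document.
        return [
            "<html><body>" + ET.tostring(page, encoding="unicode") + "</body></html>"
            for page in pages
        ]

    def merge_pdfs(page_pdf_paths: list[str], out_path: str) -> None:
        """Concatenate the per-page PDFs into the final searchable document."""
        merged = pikepdf.Pdf.new()
        sources = [pikepdf.open(p) for p in page_pdf_paths]
        try:
            for src in sources:
                merged.pages.extend(src.pages)
            merged.save(out_path)  # sources must stay open until save()
        finally:
            for src in sources:
                src.close()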

Both of these feel very cumbersome, given that I'm already using the toolbox library that is supposed to make working with the API painless.

Am I missing some companion library that would make this workflow easier? I assume Google has internal libraries that handle these steps; would it make sense to include them in the toolbox?
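
To make the ask concrete, a single helper along these lines (purely hypothetical, nothing like it exists in the toolbox today as far as I can tell) would cover my whole use case:

    # Hypothetical API, not part of documentai_toolbox:
    wrapped_document = document.Document.from_gcs(
        gcs_bucket_name=output_bucket, gcs_prefix=output_prefix
    )
    # Write a searchable PDF from the processed page images plus the OCR text layer.
    wrapped_document.export_searchable_pdf("output.pdf")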

rosiezou removed their assignment Jan 29, 2025