I'm trying to use Document AI to generate a searchable PDF out of an input document. Given the marketing around Document AI and the availability of a pretrained "Document OCR" processor, I'm assuming this is one of the intended use cases.
So I'm uploading the file to GCS, running a Document OCR batch job, and getting back a document — so far so good.
However, the subsequent workflow then becomes messy very quickly:
As far as I understand, Document AI may internally deskew, convert, and/or downscale the submitted document before performing OCR. So in order to have the overlay text match the displayed image, I'm writing out the data stored in document["pages"][i]["image"]["content"]. However, the document.pages helper only yields wrapped pages which don't expose that content as far as I can tell, so I'm forced to first export the whole document as JSON (in order to handle shards, which aren't documented anywhere, btw!), and then re-parse that JSON to get at the content:
import base64
import json

from google.cloud import documentai
from google.cloud.documentai_toolbox import document

# Load the (possibly sharded) batch OCR output from GCS.
wrapped_document = document.Document.from_gcs(
    gcs_bucket_name=output_bucket, gcs_prefix=output_prefix
)
# Merge the shards into one documentai.Document proto, round-trip it
# through JSON, and pull the per-page images out of the resulting dict.
merged_document = wrapped_document.to_merged_documentai_document()
document_json_string = documentai.Document.to_json(merged_document)
document_json = json.loads(document_json_string)
for i, page in enumerate(document_json["pages"]):
    raw_content = base64.b64decode(page["image"]["content"])
    # ...
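For completeness, that loop then just dumps the page images to disk so they can be paired with the hOCR later. This is only a sketch: the page_{i} file names are arbitrary, and I'm assuming the camelCase mimeType key that Document.to_json emits by default.

import mimetypes

for i, page in enumerate(document_json["pages"]):
    raw_content = base64.b64decode(page["image"]["content"])
    # Pick a file extension from the image's mime type (usually image/png);
    # the key is camelCase because to_json uses the proto JSON mapping.
    ext = mimetypes.guess_extension(page["image"]["mimeType"]) or ".png"
    with open(f"page_{i}{ext}", "wb") as f:
        f.write(raw_content)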
In order to get the OCR layer, I'm using document.export_hocr_str(). However, that returns a multi-page .hocr file, and all the downstream tooling I could find expects one .hocr file and one image as input to produce one PDF page. So I have to split the returned .hocr by page, generate a lot of individual PDFs, and finally use pikepdf to merge them into a single output document.
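The post-processing I ended up with looks roughly like this — again just a sketch: hocr_page_to_pdf() is a placeholder for whatever tool turns one .hocr page plus one image into a single-page PDF (hocr-tools, ocrmypdf's HocrTransform, etc.), and the page_{i} image files are the ones written out above.

import lxml.html
import pikepdf

hocr = wrapped_document.export_hocr_str()
tree = lxml.html.fromstring(hocr)

merged = pikepdf.Pdf.new()
for i, page_div in enumerate(tree.find_class("ocr_page")):
    # Write out a single-page .hocr containing just this ocr_page div.
    # (Real tools may want it wrapped in a full html/head/body skeleton.)
    with open(f"page_{i}.hocr", "w") as f:
        f.write(lxml.html.tostring(page_div, encoding="unicode"))

    # Placeholder: produce page_{i}.pdf from the page's .hocr + image.
    hocr_page_to_pdf(f"page_{i}.hocr", f"page_{i}.png", f"page_{i}.pdf")

    # Append the single-page PDF to the merged output.
    with pikepdf.open(f"page_{i}.pdf") as single:
        merged.pages.extend(single.pages)

merged.save("searchable.pdf")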
Both of these feel very cumbersome, given that I'm already using the toolbox library that is supposed to make working with the API painless.
Am I missing some companion library that would make this workflow easier? I assume Google has internal libraries that handle these steps; would it make sense to include them in the toolbox?