Integrate PDF extraction to sem_extract #69

BitLegion · 2025-01-05T18:45:25Z

With the infer_pdfs flag enabled, df2multimodal_info detects which columns contain PDF file paths and parses those files through PyMuPDF into raw text, which sem_extract then uses. The flag defaults to False; to enable it, pass infer_pdfs=True when calling sem_extract on a DataFrame.
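
As a rough illustration of the detection step, the sketch below treats a column as a PDF column when every non-null value is a string ending in .pdf. The function names and the dict-of-lists stand-in for a DataFrame are assumptions made for this example; the actual logic in df2multimodal_info may differ.

```python
from typing import Any


def looks_like_pdf_path(value: Any) -> bool:
    """Heuristic (assumed): a string whose extension is .pdf, case-insensitive."""
    return isinstance(value, str) and value.strip().lower().endswith(".pdf")


def infer_pdf_columns(columns: dict[str, list[Any]]) -> list[str]:
    """Return the names of columns whose non-null values all look like PDF paths."""
    pdf_cols = []
    for name, values in columns.items():
        non_null = [v for v in values if v is not None]
        # An all-null column is not treated as a PDF column.
        if non_null and all(looks_like_pdf_path(v) for v in non_null):
            pdf_cols.append(name)
    return pdf_cols


# Hypothetical table: one column of PDF paths, one column of plain text.
table = {
    "paper": ["docs/a.pdf", "docs/b.PDF", None],
    "title": ["Attention", "LOTUS", "Notes"],
}
print(infer_pdf_columns(table))  # ['paper']
```

A suffix check like this is cheap but cannot tell a real file path from a string that merely ends in .pdf, which is why an explicit opt-in flag is useful.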

sidjha1 · 2025-01-05T19:35:57Z

lotus/templates/task_instructions.py

    image_cols = [col for col in cols if isinstance(df[col].dtype, ImageDtype)]
+    if infer_pdfs:
+        pdf_cols = [


Do you think we should have a PDFDtype? To support operating over images we added ImageDtype, so I wonder if it makes sense to add a custom pandas type here too.

In this case we may not need the infer_pdfs flag, just as we do not have an infer_images flag.

I think there are a few potential problems with implementing a PDFDtype. ImageDtype was built on the Image object from Pillow, but there is no equivalent standard we can use for PDFs, except maybe PyMuPDF's Document object. Also, we can tell more objectively whether a value is an image than whether it is a path to a PDF. For instance, a column might simply contain the titles of PDF files, and the user may want those interpreted as plain text. It's less clear-cut than it is with images.
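
One way to make the "is this really a PDF?" question more objective is to check the file's leading bytes rather than its name, since PDF files conventionally start with the header `%PDF-`. The sketch below is only an illustration of that idea; `is_pdf_file` is a hypothetical helper, not part of this PR, and lenient readers may accept files that a strict check like this rejects.

```python
import os
import tempfile


def is_pdf_file(path: str) -> bool:
    """Return True if `path` is a readable file that starts with the PDF header."""
    try:
        with open(path, "rb") as f:
            return f.read(5) == b"%PDF-"
    except OSError:
        # Missing or unreadable paths are treated as "not a PDF".
        return False


with tempfile.TemporaryDirectory() as tmp:
    real = os.path.join(tmp, "report.pdf")
    with open(real, "wb") as f:
        f.write(b"%PDF-1.7\n")  # minimal header, enough for this check

    fake = os.path.join(tmp, "notes.pdf")
    with open(fake, "w") as f:
        f.write("just text with a .pdf name")

    missing = os.path.join(tmp, "Annual Report.pdf")  # a title, not a file

    results = (is_pdf_file(real), is_pdf_file(fake), is_pdf_file(missing))

print(results)  # (True, False, False)
```

This still costs a file open per cell, and it does not settle what a dedicated dtype should wrap, so it complements rather than replaces the discussion above.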

sidjha1 · 2025-01-05T19:38:43Z

lotus/templates/task_instructions.py

+        with fitz.open(file_path) as doc:
+            return " ".join(page.get_text() for page in doc)
+    except Exception as e:
+        lotus.logger.debug(f"Error while processing pdf at file path {file_path}: {e}")


I think we should use lotus.logger.error rather than lotus.logger.debug here.
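
For context on why the level matters: with Python's default logging configuration, debug records are dropped while error records reach stderr, so a failed PDF would otherwise be skipped silently. The sketch below uses the standard logging module and a hypothetical `safe_extract` wrapper with an injected extractor; how lotus.logger is actually configured is an assumption not confirmed here.

```python
import logging
from typing import Callable, Optional

# Stand-in logger for this sketch; the PR itself uses lotus.logger.
logger = logging.getLogger("pdf_extract_sketch")


def safe_extract(file_path: str, extractor: Callable[[str], str]) -> Optional[str]:
    """Run `extractor` on `file_path`, logging failures at ERROR level."""
    try:
        return extractor(file_path)
    except Exception as e:
        # ERROR is visible under the default WARNING threshold; DEBUG is not.
        logger.error("Error while processing pdf at file path %s: %s", file_path, e)
        return None


def broken_extractor(path: str) -> str:
    """Hypothetical extractor that always fails, to exercise the error path."""
    raise ValueError("not a PDF")


result = safe_extract("missing.pdf", broken_extractor)
print(result)  # None
```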

sidjha1 · 2025-01-05T19:40:52Z

requirements.txt

+tqdm==4.66.4
+PyMuPDF==1.25.1


PyMuPDF can just be an optional dependency. For reference, you can see how the lxml dependency is handled.
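
A common pattern for optional dependencies is to import the package lazily, only when the feature is used, and to raise an ImportError with an install hint if it is missing. The sketch below shows that pattern under stated assumptions: `require_optional` and `pdf_to_text` are hypothetical names, and this is not necessarily how the lxml dependency is handled in the repository.

```python
import importlib
from types import ModuleType


def require_optional(module_name: str, pip_name: str) -> ModuleType:
    """Import an optional module, or raise ImportError with an install hint."""
    try:
        return importlib.import_module(module_name)
    except ImportError as e:
        raise ImportError(
            f"{module_name} is required for this feature. "
            f"Install it with `pip install {pip_name}`."
        ) from e


def pdf_to_text(file_path: str) -> str:
    """Extract raw text from a PDF; PyMuPDF is imported only when this is called."""
    fitz = require_optional("fitz", "PyMuPDF")
    with fitz.open(file_path) as doc:
        return " ".join(page.get_text() for page in doc)
```

With this shape, users who never touch PDFs do not need PyMuPDF installed, and those who do get an actionable message instead of a bare import failure.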

sidjha1 · 2025-01-05T19:46:23Z

Also, could we add an example, like the image examples we have here?

Integrate PDF extraction to sem_extract

c28d0fa

BitLegion requested a review from sidjha1 January 5, 2025 18:45

BitLegion linked an issue Jan 5, 2025 that may be closed by this pull request

Support pdf extraction #67

Open
