Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Integrate PDF extraction to sem_extract #69

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

BitLegion
Copy link
Collaborator

With the infer_pdf flag, df2multimodal_info will detect which columns are populated with PDFs and subsequently parse those through PyMuPDF into raw text, that is then used by sem_extract. You can invoke this simply by setting the defaulted to False boolean to True when passing it to sem_extract on a DataFrame.

@BitLegion BitLegion requested a review from sidjha1 January 5, 2025 18:45
@BitLegion BitLegion linked an issue Jan 5, 2025 that may be closed by this pull request
image_cols = [col for col in cols if isinstance(df[col].dtype, ImageDtype)]
if infer_pdfs:
pdf_cols = [
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think we should have a PDFDtype? To support operating over images we added ImageDtype so I wonder if it makes sense to add a custom pandas type here too.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this case we may not need the infer_pdfs flag, just as we do not have an infer_images flag.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there are a few potential problems in implementing a PDFDtype. ImageDType worked off of the Image object from pillow, but there is no equivalent standard we can use for PDFs, except maybe the PyMuPDF.Document object. Also, we can more objectively tell if a value is an image than if it is a path for a PDF. Perhaps, for instance, a cell simply contains titles in PDF format and that is actually the way the user wants it interpreted. It's less clear than it is with images.

with fitz.open(file_path) as doc:
return " ".join(page.get_text() for page in doc)
except Exception as e:
lotus.logger.debug(f"Error while processing pdf at file path {file_path}: {e}")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should use lotus.logger.error rather than lotus.logger.debug here.

tqdm==4.66.4
PyMuPDF==1.25.1
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PyMuPDF can just be an optional dependency. For reference, you can see how lxml dependency is handled.

@sidjha1
Copy link
Collaborator

sidjha1 commented Jan 5, 2025

Also could we add an example, like how we have image examples here

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support pdf extraction
2 participants