-
Notifications
You must be signed in to change notification settings - Fork 80
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Integrate PDF extraction to sem_extract #69
base: main
Are you sure you want to change the base?
Conversation
image_cols = [col for col in cols if isinstance(df[col].dtype, ImageDtype)] | ||
if infer_pdfs: | ||
pdf_cols = [ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you think we should have a PDFDtype
? To support operating over images we added ImageDtype
so I wonder if it makes sense to add a custom pandas
type here too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this case we may not need the infer_pdfs
flag, just as we do not have an infer_images
flag.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think there are a few potential problems in implementing a PDFDtype
. ImageDType
worked off of the Image
object from pillow
, but there is no equivalent standard we can use for PDFs, except maybe the PyMuPDF.Document
object. Also, we can more objectively tell if a value is an image than if it is a path for a PDF. Perhaps, for instance, a cell simply contains titles in PDF format and that is actually the way the user wants it interpreted. It's less clear than it is with images.
with fitz.open(file_path) as doc: | ||
return " ".join(page.get_text() for page in doc) | ||
except Exception as e: | ||
lotus.logger.debug(f"Error while processing pdf at file path {file_path}: {e}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should use lotus.logger.error
rather than lotus.logger.debug
here.
tqdm==4.66.4 | ||
PyMuPDF==1.25.1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PyMuPDF
can just be an optional dependency. For reference, you can see how lxml
dependency is handled.
Also could we add an example, like how we have image examples here |
With the
infer_pdf
flag,df2multimodal_info
will detect which columns are populated with PDFs and subsequently parse those throughPyMuPDF
into raw text, that is then used bysem_extract
. You can invoke this simply by setting the defaulted to False boolean to True when passing it tosem_extract
on a DataFrame.