I encountered an IndexError: list index out of range while processing a PDF containing mathematical formulas using the pymupdf4llm library (version 0.0.17). The full traceback is as follows:
File "/opt/miniconda3/envs/mg-genai-core/lib/python3.11/site-packages/pymupdf4llm/helpers/pymupdf_rag.py", line 936, in to_markdown
parms = get_page_output(doc, pno, margins, textflags)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/miniconda3/envs/mg-genai-core/lib/python3.11/site-packages/pymupdf4llm/helpers/pymupdf_rag.py", line 883, in get_page_output
parms.md_string += write_text(parms, text_rect, force_text=force_text)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/miniconda3/envs/mg-genai-core/lib/python3.11/site-packages/pymupdf4llm/helpers/pymupdf_rag.py", line 429, in write_text
for l in get_raw_lines(parms.textpage, clip=clip, tolerance=3)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/miniconda3/envs/mg-genai-core/lib/python3.11/site-packages/pymupdf4llm/helpers/get_text_lines.py", line 114, in get_raw_lines
neighbor = line["spans"][i]
~~~~~~~~~~~~~^^^
IndexError: list index out of range
Debugging revealed that the issue appears to stem from a specific type of PDF formatting. In the problematic PDF, I observed instances of lone "superscript" elements within the text structure, without a preceding or following span. This condition causes the IndexError within the get_raw_lines function.
Unfortunately, I do not have the ability to modify or sanitize the PDF files, nor can I guarantee that other PDFs of this type won't be encountered. The current behavior of the library is to completely halt text extraction upon encountering this exception.
I suggest adding a try/except block to the get_raw_lines function to handle the IndexError or, alternatively, a conditional check that the preceding or following span exists. When no neighboring span exists, the code could skip updating the bounding box for the affected span rather than aborting the entire extraction. The library could then continue processing the rest of the document, and other PDFs with the same structure would no longer break.
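The try/except variant could look roughly like the sketch below. It is not the library's code: the helper name `superscript_bottom_edge` is hypothetical, and spans are modeled as plain dicts with a `bbox` of `(x0, y0, x1, y1)` so that it runs without a PDF.

```python
def superscript_bottom_edge(spans, sno):
    """Return the bottom edge (y1) to use for the superscript span spans[sno].

    Hypothetical helper: borrows y1 from the neighboring span, and falls
    back to the span's own y1 when no neighbor exists.
    """
    own_y1 = spans[sno]["bbox"][3]
    # Same neighbor choice as in the traceback: following span for the
    # first span on the line, preceding span otherwise.
    i = 1 if sno == 0 else sno - 1
    try:
        return spans[i]["bbox"][3]
    except IndexError:
        # Lone superscript: nothing to borrow from, keep the original bbox.
        return own_y1


lone = [{"text": "2", "flags": 1, "bbox": (10, 5, 14, 9)}]
print(superscript_bottom_edge(lone, 0))
```

With a single-span line the index `1` is out of range, so the fallback branch is the one that runs.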
    if s["flags"] & 1 == 1:  # if a superscript, modify bbox
        # with that of the preceding or following span
        if len(line["spans"]) > 1:  # <<<======================= ADD THIS!!
            i = 1 if sno == 0 else sno - 1
            neighbor = line["spans"][i]
            sbbox.y1 = neighbor["bbox"][3]
        s["text"] = f"[{s['text']}]"
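To check the conditional guard without the problematic PDF, the same logic can be exercised on hand-built span dicts. This is only a sketch under assumptions: `adjust_superscript` is a hypothetical stand-in for the relevant lines of get_raw_lines, and the bounding box is a plain list rather than the Rect object the library uses.

```python
def adjust_superscript(spans, sno):
    """Return (text, bbox) for spans[sno] after the superscript handling.

    Hypothetical stand-in for the snippet above; bbox is [x0, y0, x1, y1].
    """
    s = spans[sno]
    bbox = list(s["bbox"])
    text = s["text"]
    if s["flags"] & 1 == 1:  # bit 0 set: superscript span
        if len(spans) > 1:  # the proposed guard: only borrow if a neighbor exists
            i = 1 if sno == 0 else sno - 1
            bbox[3] = spans[i]["bbox"][3]
        text = f"[{text}]"  # the bracket marker is applied either way
    return text, bbox


# A line consisting of a single superscript span, as seen in the failing PDF.
lone = [{"text": "2", "flags": 1, "bbox": (10, 5, 14, 9)}]
print(adjust_superscript(lone, 0))
```

For the lone span the bounding box is left unchanged and the text is still wrapped in brackets, so the rest of the line processing can continue.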