IndexError in get_raw_lines when processing PDFs with formulas #218

Open
ccasadei opened this issue Jan 17, 2025 · 0 comments

I encountered an IndexError: list index out of range while processing a PDF containing mathematical formulas using the pymupdf4llm library (version 0.0.17). The full traceback is as follows:

File "/opt/miniconda3/envs/mg-genai-core/lib/python3.11/site-packages/pymupdf4llm/helpers/pymupdf_rag.py", line 936, in to_markdown
  parms = get_page_output(doc, pno, margins, textflags)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/miniconda3/envs/mg-genai-core/lib/python3.11/site-packages/pymupdf4llm/helpers/pymupdf_rag.py", line 883, in get_page_output
  parms.md_string += write_text(parms, text_rect, force_text=force_text)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/miniconda3/envs/mg-genai-core/lib/python3.11/site-packages/pymupdf4llm/helpers/pymupdf_rag.py", line 429, in write_text
  for l in get_raw_lines(parms.textpage, clip=clip, tolerance=3)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/miniconda3/envs/mg-genai-core/lib/python3.11/site-packages/pymupdf4llm/helpers/get_text_lines.py", line 114, in get_raw_lines
  neighbor = line["spans"][i]
             ~~~~~~~~~~~~~^^^
IndexError: list index out of range

Debugging showed that the issue stems from a specific kind of PDF formatting. In the problematic PDF, some text lines consist of a single "superscript" span, with no preceding or following span in the same line. When get_raw_lines looks up the neighboring span for such a line, the index is out of range and the IndexError is raised.
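The failure can be illustrated without a PDF. The following is a minimal sketch that assumes the neighbor index is computed as in the snippet quoted later in this issue (`i = 1 if sno == 0 else sno - 1`); the `neighbor_index` helper and the sample span dictionary are hypothetical and are not part of pymupdf4llm.

```python
def neighbor_index(sno: int) -> int:
    # Same index arithmetic as the snippet quoted in this issue:
    # the first span looks at its successor, every other span at its predecessor.
    return 1 if sno == 0 else sno - 1


# Hypothetical line holding one lone superscript span (flags bit 0 set).
line = {"spans": [{"text": "2", "flags": 1, "bbox": (10.0, 5.0, 14.0, 9.0)}]}

i = neighbor_index(0)  # the lone span has span number 0, so i == 1
try:
    neighbor = line["spans"][i]
    raised = False
except IndexError:
    # A one-element list has no index 1, so the lookup fails here.
    raised = True
```

With only one span in the line, the computed index always points past the end of the list, which matches the last frame of the traceback.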

Unfortunately, I do not have the ability to modify or sanitize the PDF files, nor can I guarantee that other PDFs of this type won't be encountered. The current behavior of the library is to completely halt text extraction upon encountering this exception.

I suggest either wrapping the neighbor lookup in get_raw_lines in a try/except block that handles the IndexError or, alternatively, adding a conditional check that a preceding or following span exists. When no neighbor exists, the code could skip updating the bounding box for the affected span instead of terminating the entire extraction. The library could then continue processing the rest of the document, and other PDFs with the same formatting would no longer break. The conditional variant would look like this:

                if s["flags"] & 1 == 1:  # if a superscript, modify bbox
                    # with that of the preceding or following span
                    if len(line["spans"]) > 1:  # <<<======================= ADD THIS!!
                        i = 1 if sno == 0 else sno - 1
                        neighbor = line["spans"][i]
                        sbbox.y1 = neighbor["bbox"][3]
                    s["text"] = f"[{s['text']}]"
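The guard can also be isolated and checked on its own. This is only a sketch under assumptions: `neighbor_y1` is a hypothetical helper, spans are plain dictionaries with a 4-tuple `bbox`, and nothing here is the library's actual implementation.

```python
from typing import Optional


def neighbor_y1(spans: list, sno: int) -> Optional[float]:
    """Return the bottom coordinate (y1) of the span next to spans[sno].

    Returns None when the line has no other span, so the caller can
    leave the superscript's bounding box unchanged instead of failing.
    """
    if len(spans) < 2:  # lone span: there is no neighbor to borrow from
        return None
    i = 1 if sno == 0 else sno - 1  # successor for the first span, else predecessor
    return spans[i]["bbox"][3]


# Hypothetical sample data: a lone superscript, and a base span followed by one.
lone = [{"text": "2", "flags": 1, "bbox": (10.0, 5.0, 14.0, 9.0)}]
pair = [
    {"text": "x", "flags": 0, "bbox": (0.0, 5.0, 10.0, 17.0)},
    {"text": "2", "flags": 1, "bbox": (10.0, 5.0, 14.0, 9.0)},
]

lone_result = neighbor_y1(lone, 0)  # no neighbor available
pair_result = neighbor_y1(pair, 1)  # borrows y1 from the preceding span
```

Returning None instead of raising keeps the decision with the caller, which mirrors the suggestion to skip the bounding box update and continue extraction.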