IndexError in get_raw_lines when processing PDFs with formulas #218

ccasadei · 2025-01-17T09:00:30Z

I encountered an IndexError: list index out of range while processing a PDF containing mathematical formulas using the pymupdf4llm library (version 0.0.17). The full traceback is as follows:

File "/opt/miniconda3/envs/mg-genai-core/lib/python3.11/site-packages/pymupdf4llm/helpers/pymupdf_rag.py", line 936, in to_markdown
  parms = get_page_output(doc, pno, margins, textflags)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/miniconda3/envs/mg-genai-core/lib/python3.11/site-packages/pymupdf4llm/helpers/pymupdf_rag.py", line 883, in get_page_output
  parms.md_string += write_text(parms, text_rect, force_text=force_text)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/miniconda3/envs/mg-genai-core/lib/python3.11/site-packages/pymupdf4llm/helpers/pymupdf_rag.py", line 429, in write_text
  for l in get_raw_lines(parms.textpage, clip=clip, tolerance=3)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/miniconda3/envs/mg-genai-core/lib/python3.11/site-packages/pymupdf4llm/helpers/get_text_lines.py", line 114, in get_raw_lines
  neighbor = line["spans"][i]
             ~~~~~~~~~~~~~^^^
IndexError: list index out of range

Debugging revealed that the issue appears to stem from a specific type of PDF formatting. In the problematic PDF, I observed instances of lone "superscript" elements within the text structure, without a preceding or following span. This condition causes the IndexError within the get_raw_lines function.
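To illustrate the failure, here is a minimal sketch that mirrors only the index arithmetic from the patch quoted in this report (`i = 1 if sno == 0 else sno - 1`). The `line` dictionary is a hand-written stand-in for one line of extracted text, and the helper names `neighbor_index` and `pick_neighbor` are hypothetical; none of this is the library's actual code.

```python
# Hand-written stand-in for a line that holds a single superscript span.
# The keys ("spans", "text", "flags", "bbox") follow the names used in the
# snippet from this report; the values are made up for illustration.
line = {"spans": [{"text": "2", "flags": 1, "bbox": (10.0, 5.0, 14.0, 9.0)}]}


def neighbor_index(sno):
    """Pick the following span for the first span, else the preceding one."""
    return 1 if sno == 0 else sno - 1


def pick_neighbor(line, sno):
    """Look up the neighbor span without checking that it exists."""
    return line["spans"][neighbor_index(sno)]


# With one span, sno is 0, so the lookup asks for index 1 of a 1-item list.
try:
    pick_neighbor(line, 0)
    failed = False
except IndexError:
    failed = True
```

With a single span in the line, the first span has no following neighbor, so the lookup at index 1 raises the same `IndexError` shown in the traceback.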

Unfortunately, I do not have the ability to modify or sanitize the PDF files, nor can I guarantee that other PDFs of this type won't be encountered. The current behavior of the library is to completely halt text extraction upon encountering this exception.

I suggest handling this case in get_raw_lines, either with a try/except block around the lookup that catches the IndexError or, alternatively, with a conditional check that a preceding or following span exists. When no neighboring span exists, the code could skip updating the bounding box for the affected span instead of terminating the entire extraction. The library could then continue processing the rest of the document, and other PDFs with the same layout would no longer break extraction.

                if s["flags"] & 1 == 1:  # if a superscript, modify bbox
                    # with that of the preceding or following span
                    if len(line["spans"]) > 1:  # <<<======================= ADD THIS!!
                        i = 1 if sno == 0 else sno - 1
                        neighbor = line["spans"][i]
                        sbbox.y1 = neighbor["bbox"][3]
                    s["text"] = f"[{s['text']}]"
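The guard can also be tried outside the library. The sketch below is a standalone approximation, assuming spans are plain dictionaries and bounding boxes are plain `(x0, y0, x1, y1)` tuples rather than the library's own objects; `adjust_superscripts` and `SUPERSCRIPT` are hypothetical names introduced for illustration only.

```python
SUPERSCRIPT = 1  # bit 0 of the span flags, as tested in the snippet above


def adjust_superscripts(line):
    """Bracket superscript spans; borrow a neighbor's bottom edge if one exists."""
    spans = line["spans"]
    for sno, s in enumerate(spans):
        if s["flags"] & SUPERSCRIPT:
            if len(spans) > 1:  # the proposed guard: skip when no neighbor
                i = 1 if sno == 0 else sno - 1
                neighbor = spans[i]
                x0, y0, x1, _ = s["bbox"]
                s["bbox"] = (x0, y0, x1, neighbor["bbox"][3])
            s["text"] = f"[{s['text']}]"
    return line


# A lone superscript: bbox is left untouched, text is still bracketed.
lone = adjust_superscripts(
    {"spans": [{"text": "2", "flags": 1, "bbox": (10.0, 5.0, 14.0, 9.0)}]}
)

# A superscript after a normal span: the bottom edge comes from the neighbor.
pair = adjust_superscripts(
    {
        "spans": [
            {"text": "x", "flags": 0, "bbox": (0.0, 0.0, 10.0, 12.0)},
            {"text": "2", "flags": 1, "bbox": (10.0, 0.0, 14.0, 6.0)},
        ]
    }
)
```

A length check was chosen here over try/except because it states the precondition directly; a try/except around the lookup would presumably work as well, but it could also hide unrelated indexing mistakes.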
