document extractor takes too long to parse or fails to parse PDFs #12516

Sakura4036 · 2025-01-09T01:20:54Z

Self Checks

This is only for bug report, if you would like to ask a question, please head to Discussions.
I have searched for existing issues, including closed ones.
I confirm that I am using English to submit this report (I have read and agree to the Language Policy).
[FOR CHINESE USERS] Please be sure to submit issues in English, otherwise they will be closed. Thank you! :)
Please do not modify this template :) and fill in all the required fields.

Dify version

0.15.0

Cloud or Self Hosted

Self Hosted (Source)

Steps to reproduce

Upload multiple files in a chatflow workflow and process them in an iteration node with parallel mode enabled.

✔️ Expected Behavior

Document extraction should be very fast (millisecond level).
All PDF files should be extracted successfully.

❌ Actual Behavior

The document extractor sometimes fails to extract the PDF files (each up to a few MB) or takes a very long time (on the order of minutes). The problem is intermittent, so it is hard to tell where it lies.

dosubot · 2025-01-09T01:22:29Z

The document extractor's performance issues, such as taking too long to parse or failing to parse PDFs, can be attributed to several factors:

Complexity of PDF Content: The extraction process might struggle with PDFs that contain complex structures, such as images, non-standard fonts, or embedded objects, which can lead to failures or longer processing times ^[1].
Error Handling: Specific errors like "Failed to load document (PDFium: Data format error)" indicate issues with the PDF file itself or limitations in the pypdfium2 library used for extraction ^[1].
Resource Intensive Operations: Processing large or complex PDFs can be resource-intensive, especially in parallel mode, which might lead to delays ^[1].
Parallel Mode Limitations: Running in parallel mode can introduce exceptions and affect performance, particularly with answer nodes and continuous read/write operations ^[1].
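Since the failures are intermittent, one way to narrow down which of these factors applies is to record the duration and outcome of each extraction separately. The following is a minimal sketch, not Dify's implementation: `timed_extract`, `ExtractionResult`, and `fake_extract` are hypothetical names, and `fake_extract` is only a stand-in for the real PDF extraction call.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ExtractionResult:
    name: str
    seconds: float
    ok: bool
    error: Optional[str] = None


def timed_extract(name: str, data: bytes, extract: Callable[[bytes], str]) -> ExtractionResult:
    """Run `extract` on one file and record how long it took and whether it failed."""
    start = time.perf_counter()
    try:
        extract(data)
        return ExtractionResult(name, time.perf_counter() - start, ok=True)
    except Exception as exc:
        # Record the failure instead of aborting the whole batch.
        elapsed = time.perf_counter() - start
        return ExtractionResult(name, elapsed, ok=False, error=f"{type(exc).__name__}: {exc}")


def fake_extract(data: bytes) -> str:
    # Hypothetical stand-in for the real PDF extraction call (assumption).
    if not data.startswith(b"%PDF"):
        raise ValueError("Data format error")
    return data.decode("latin-1")


if __name__ == "__main__":
    files = {"good.pdf": b"%PDF-1.7 hello", "bad.pdf": b"not a pdf"}
    for name, data in files.items():
        print(timed_extract(name, data, fake_extract))
```

Logging one such record per file would show whether the slow or failing cases cluster on particular PDFs (pointing at file content) or appear only under parallel load (pointing at resources or concurrency).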

To address these issues, consider the following:

Optimize PDF Content: Simplify the PDF content if possible, removing unnecessary images or complex formatting.
Check Resource Allocation: Ensure that your self-hosted environment has sufficient resources (CPU, memory) to handle the workload.
Review Parallel Mode Usage: Evaluate whether parallel mode is necessary for your workflow or if sequential processing might be more reliable for certain tasks.

These steps might help mitigate the performance issues you're experiencing with the document extractor.
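To compare sequential and parallel processing outside the workflow, the same batch can be run with different concurrency limits. This is a rough standalone sketch using Python's standard `concurrent.futures`; it does not reflect how Dify's iteration node schedules work, and `extract_all` and `fake_extract` are hypothetical names introduced for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict


def extract_all(
    files: Dict[str, bytes],
    extract: Callable[[bytes], str],
    max_workers: int = 1,
) -> Dict[str, str]:
    """Extract every file, running at most `max_workers` extractions at once.

    With max_workers=1 the files are processed one after another.
    """
    results: Dict[str, str] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {name: pool.submit(extract, data) for name, data in files.items()}
        for name, future in futures.items():
            try:
                results[name] = future.result()
            except Exception as exc:
                # Keep going so one bad file does not hide the others.
                results[name] = f"ERROR: {exc}"
    return results


def fake_extract(data: bytes) -> str:
    # Hypothetical stand-in for the real PDF extraction call (assumption).
    return data.decode("ascii").upper()


if __name__ == "__main__":
    batch = {"a.pdf": b"one", "b.pdf": b"two"}
    print(extract_all(batch, fake_extract, max_workers=1))
    print(extract_all(batch, fake_extract, max_workers=2))
```

If the batch is reliable with `max_workers=1` but degrades as the limit rises, that would suggest the problem is tied to concurrency or resource contention rather than to the PDF files themselves.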


Sakura4036 · 2025-01-13T01:37:55Z

need help

dosubot bot added the 🐞 bug Something isn't working label Jan 9, 2025
