document extractor takes too long to parse or fails to parse PDFs #12516

Open
5 tasks done
Sakura4036 opened this issue Jan 9, 2025 · 2 comments
Labels
🐞 bug Something isn't working

Comments

@Sakura4036
Contributor

Self Checks

  • This is only for bug reports; if you would like to ask a question, please head to Discussions.
  • I have searched for existing issues, including closed ones.
  • I confirm that I am using English to submit this report (I have read and agree to the Language Policy).
  • [FOR CHINESE USERS] Please be sure to submit the issue in English, otherwise it will be closed. Thank you! :)
  • Please do not modify this template :) and fill in all the required fields.

Dify version

0.15.0

Cloud or Self Hosted

Self Hosted (Source)

Steps to reproduce

Upload multiple files in a chatflow workflow and process them in an iteration node with parallel mode enabled.

image

✔️ Expected Behavior

  1. The document extractor should finish very quickly (on the order of milliseconds).
  2. All PDF files should be extracted successfully.

❌ Actual Behavior

The document extractor sometimes fails to extract the PDF files (each up to a few MB) or takes a very long time (on the order of minutes). The problem is intermittent, so it is hard to tell where it comes from.

image
image
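
Since the failure is intermittent, per-file timings may help narrow it down. The following is a minimal sketch, not Dify code: `time_extractions` is a hypothetical helper, and `extract` stands for whatever function turns PDF bytes into text in your setup. It records how long each file takes and whether it raised, without aborting the run on the first failure.

```python
import time
from typing import Callable, Dict, Iterable, List, Tuple


def time_extractions(
    extract: Callable[[bytes], str],
    files: Iterable[Tuple[str, bytes]],
) -> List[Dict]:
    """Run `extract` on each (name, content) pair and record duration and outcome.

    `extract` is a placeholder for the real extraction function (an assumption).
    """
    results = []
    for name, content in files:
        start = time.perf_counter()
        try:
            text = extract(content)
            error = None
        except Exception as exc:
            # Record the failure instead of stopping, so every file gets a row.
            text = ""
            error = f"{type(exc).__name__}: {exc}"
        elapsed = time.perf_counter() - start
        results.append(
            {"file": name, "seconds": elapsed, "chars": len(text), "error": error}
        )
    return results
```

Running this once sequentially and once under the same load as the parallel iteration would show whether specific files are slow or whether the slowdown only appears under concurrency.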

dosubot bot commented Jan 9, 2025

The document extractor's performance issues, such as taking too long to parse or failing to parse PDFs, can be attributed to several factors:

  1. Complexity of PDF Content: The extraction process might struggle with PDFs that contain complex structures, such as images, non-standard fonts, or embedded objects, which can lead to failures or longer processing times [1].

  2. Error Handling: Specific errors like "Failed to load document (PDFium: Data format error)" indicate issues with the PDF file itself or limitations in the pypdfium2 library used for extraction [1].

  3. Resource Intensive Operations: Processing large or complex PDFs can be resource-intensive, especially in parallel mode, which might lead to delays [1].

  4. Parallel Mode Limitations: Running in parallel mode can introduce exceptions and affect performance, particularly with answer nodes and continuous read/write operations [1].
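
For the "Data format error" case in point 2, a cheap pre-check can tell a truncated or non-PDF upload apart from a file the parser genuinely cannot handle. The sketch below is an assumption-laden illustration: `looks_like_pdf` is a hypothetical helper, and the 1024-byte windows are a tolerance chosen here, not something taken from Dify or pypdfium2. It only checks for the `%PDF-` header and the `%%EOF` trailer marker; passing it does not guarantee the file will parse.

```python
def looks_like_pdf(data: bytes) -> bool:
    """Cheap structural check: header near the start, EOF marker near the end.

    The 1024-byte windows are an assumed tolerance, not a library setting.
    """
    has_header = b"%PDF-" in data[:1024]
    has_eof = b"%%EOF" in data[-1024:]
    return has_header and has_eof
```

If a file that fails in the workflow also fails this check, the upload or storage step is a more likely culprit than the extractor itself.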

To address these issues, consider the following:

  • Optimize PDF Content: Simplify the PDF content if possible, removing unnecessary images or complex formatting.
  • Check Resource Allocation: Ensure that your self-hosted environment has sufficient resources (CPU, memory) to handle the workload.
  • Review Parallel Mode Usage: Evaluate whether parallel mode is necessary for your workflow or if sequential processing might be more reliable for certain tasks.

These steps might help mitigate the performance issues you're experiencing with the document extractor.
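
To compare sequential and parallel processing outside the workflow, something like the following could be used. It is a sketch under stated assumptions: `extract_all` is a hypothetical helper, `extract` stands for the real extraction function, and a thread pool is only a stand-in for however the iteration node actually schedules its parallel runs.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Optional


def extract_all(
    extract: Callable[[bytes], str],
    files: Dict[str, bytes],
    max_workers: int = 1,
) -> Dict[str, Optional[str]]:
    """Extract every file with at most `max_workers` concurrent calls.

    A value of None marks a file whose extraction raised an exception.
    `extract` is a placeholder for the real extraction function (an assumption).
    """
    results: Dict[str, Optional[str]] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            name: pool.submit(extract, content) for name, content in files.items()
        }
        for name, future in futures.items():
            try:
                results[name] = future.result()
            except Exception:
                results[name] = None
    return results
```

With `max_workers=1` the files are processed one at a time; raising it step by step shows at which level of concurrency failures or slowdowns begin, if they depend on concurrency at all.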

@dosubot dosubot bot added the 🐞 bug Something isn't working label Jan 9, 2025
@Sakura4036
Contributor Author

need help
image
