I confirm that I am using English to submit this report (I have read and agree to the Language Policy).
[FOR CHINESE USERS] Please be sure to submit issues in English, otherwise they will be closed. Thank you! :)
Please do not modify this template :) and fill in all the required fields.
Dify version
0.15.0
Cloud or Self Hosted
Self Hosted (Source)
Steps to reproduce
Upload multiple files in a Chatflow workflow and process them in an Iteration node with parallel mode enabled.
✔️ Expected Behavior
The document extractor should finish very quickly (on the order of milliseconds).
All PDF files should be extracted successfully.
❌ Actual Behavior
The document extractor sometimes fails to extract the PDF files (each up to a few MB) or takes a very long time (on the order of minutes). The problem does not occur every time, so it is hard to tell where it originates.
The document extractor's performance issues, such as taking too long to parse or failing to parse PDFs, can be attributed to several factors:
Complexity of PDF Content: The extraction process might struggle with PDFs that contain complex structures, such as images, non-standard fonts, or embedded objects, which can lead to failures or longer processing times [1].
Error Handling: Specific errors like "Failed to load document (PDFium: Data format error)" indicate issues with the PDF file itself or limitations in the pypdfium2 library used for extraction [1].
Resource Intensive Operations: Processing large or complex PDFs can be resource-intensive, especially in parallel mode, which might lead to delays [1].
Parallel Mode Limitations: Running in parallel mode can introduce exceptions and affect performance, particularly with answer nodes and continuous read/write operations [1].
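Because the failure is intermittent, it may help to first record how long each file takes and which error, if any, it raises. The following is a minimal sketch, not Dify's implementation: `time_extraction` and `ExtractionRecord` are hypothetical names, and `extract` is a stand-in for whatever function performs the PDF extraction in your setup (for example a wrapper around pypdfium2).

```python
import time
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass
class ExtractionRecord:
    name: str
    seconds: float
    error: Optional[str]  # None when extraction succeeded


def time_extraction(
    extract: Callable[[bytes], str],
    files: Iterable[tuple[str, bytes]],
) -> list[ExtractionRecord]:
    """Run `extract` on each (name, content) pair, recording duration and any error."""
    records = []
    for name, content in files:
        start = time.perf_counter()
        error = None
        try:
            extract(content)
        except Exception as exc:  # broad on purpose: every failure type should be logged
            error = f"{type(exc).__name__}: {exc}"
        records.append(ExtractionRecord(name, time.perf_counter() - start, error))
    return records
```

If the same file is always slow or always fails, the PDF content is the likely cause; if the slow file changes from run to run, resource contention or parallel execution is the more plausible suspect.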
To address these issues, consider the following:
Optimize PDF Content: Simplify the PDF content if possible, removing unnecessary images or complex formatting.
Check Resource Allocation: Ensure that your self-hosted environment has sufficient resources (CPU, memory) to handle the workload.
Review Parallel Mode Usage: Evaluate whether parallel mode is necessary for your workflow or if sequential processing might be more reliable for certain tasks.
These steps might help mitigate the performance issues you're experiencing with the document extractor.
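One way to evaluate whether parallel execution is the cause is to run the same batch sequentially and with a bounded number of workers, then compare total time and failures. This is a sketch under assumptions: `run_batch` is a hypothetical helper, `extract` stands in for your extraction function, and a thread pool is used only for illustration; Dify's parallel iteration may schedule work differently.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Sequence


def run_batch(
    extract: Callable[[bytes], str],
    contents: Sequence[bytes],
    max_workers: int,
) -> tuple[float, list[str]]:
    """Extract every item with at most `max_workers` threads.

    Returns (elapsed seconds, extracted texts in input order).
    """
    start = time.perf_counter()
    if max_workers <= 1:
        # Sequential baseline: no thread pool involved.
        texts = [extract(c) for c in contents]
    else:
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            # pool.map preserves input order and re-raises the first exception.
            texts = list(pool.map(extract, contents))
    return time.perf_counter() - start, texts
```

Calling `run_batch(extract, contents, 1)` and then `run_batch(extract, contents, 4)` on the same files gives a rough comparison. If only the parallel run is slow or fails intermittently, lowering the parallelism of the Iteration node or switching to sequential mode is worth trying.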