Using multithreading to extract tables from a large PDF #347
Comments
I later learned that multithreading isn't the best solution for me, so I tried multiprocessing.
Note that I am using only one process to test it out. The output is:
If I remove the Camelot line, it executes perfectly. Please help with this.
I'm interested in making Camelot faster for large files...
I tried as much as possible, but nothing worked. I ended up using bash to multiprocess the parsing.
@ash-perfect Let me try and reproduce this.
@anakin87 Using multiprocessing?
I managed to extract a 100-page PDF using Camelot on 4 processes, which yielded good results: 157 seconds with multiprocessing against 374 seconds without.
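A setup along these lines can be sketched as follows. This is a hedged illustration, not the commenter's actual code: the helper names (`chunk_pages`, `extract_chunk`, `extract_parallel`) and the 4-worker default are assumptions; only `camelot.read_pdf(path, pages=...)` is the real library call.

```python
from concurrent.futures import ProcessPoolExecutor


def chunk_pages(total_pages, n_chunks):
    """Split 1..total_pages into Camelot-style page-range strings like '1-25'."""
    size = -(-total_pages // n_chunks)  # ceiling division
    return [
        f"{start}-{min(start + size - 1, total_pages)}"
        for start in range(1, total_pages + 1, size)
    ]


def extract_chunk(args):
    """Worker: parse one page range; runs in its own process."""
    path, pages = args
    import camelot  # imported inside the worker so each process loads it independently
    tables = camelot.read_pdf(path, pages=pages)
    return [t.df for t in tables]  # DataFrames pickle cleanly back to the parent


def extract_parallel(path, total_pages, workers=4):
    """Fan page ranges out across processes and flatten the results."""
    ranges = chunk_pages(total_pages, workers)
    with ProcessPoolExecutor(max_workers=workers) as pool:
        results = pool.map(extract_chunk, [(path, r) for r in ranges])
    return [df for chunk in results for df in chunk]
```

For a 100-page PDF on 4 processes, `chunk_pages(100, 4)` yields `["1-25", "26-50", "51-75", "76-100"]`, one range per worker, which matches the split described above.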
Closed in favor of camelot-dev/camelot#20. |
Hey everyone, I am also encountering an issue when processing many PDFs concurrently on a server. However, when I set max_workers of the thread pool to 1, which means tasks run in a single queue,
@maximeboun Can you share a code snippet for multiprocessing?
@mlbrothers Take a look at this for inspiration: https://camelot-py.readthedocs.io/en/master/user/faq.html#how-to-reduce-memory-usage-for-long-pdfs I am not using multiprocessing, but dividing the extraction into chunks. Maybe it can be a starting point...
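The chunked, sequential approach that FAQ entry describes can be sketched roughly like this. The helper names (`pages_in_batches`, `extract_in_batches`) and the batch size of 50 are illustrative assumptions; `camelot.read_pdf` and `Table.to_csv` are the real API.

```python
import gc


def pages_in_batches(total_pages, batch_size):
    """Yield Camelot-style page-range strings, e.g. '1-50', '51-100'."""
    for start in range(1, total_pages + 1, batch_size):
        yield f"{start}-{min(start + batch_size - 1, total_pages)}"


def extract_in_batches(path, total_pages, batch_size=50):
    """Parse the PDF in fixed-size page batches, writing each batch to disk
    and releasing its tables before moving on, to keep memory bounded."""
    import camelot  # deferred so the pure helper above stays importable without camelot
    for pages in pages_in_batches(total_pages, batch_size):
        tables = camelot.read_pdf(path, pages=pages)
        for i, table in enumerate(tables):
            table.to_csv(f"tables_{pages}_{i}.csv")
        del tables
        gc.collect()  # free the batch's memory before parsing the next chunk
```

Unlike the multiprocessing variant, this stays single-process; the win is memory, not wall-clock time.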
As the title says, I have a 200-page PDF, and it takes around 4 minutes to extract the tables from all the pages, so I decided to use multiple threads to extract them faster.
I am using a Jupyter Notebook, and all the code below is in a single cell.
Here is my code:
When I run this, the kernel in my Jupyter Notebook either dies completely, or an exception occurs in one of the threads while the other thread runs properly.
The exception goes like this:
Please help me with this. Am I missing something?
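The original snippet is not preserved in this thread, but a minimal reconstruction of the kind of threaded setup described might look like the following. Everything here is hypothetical (the function names and the injectable `reader` parameter are for illustration only); the symptom reported above is consistent with the underlying PDF-parsing libraries not being thread-safe.

```python
from concurrent.futures import ThreadPoolExecutor


def extract_threaded(path, page_ranges, reader=None):
    """Parse several page ranges of one PDF concurrently in threads.

    `reader` defaults to camelot.read_pdf; it is injectable here purely so the
    threading plumbing can be exercised without a PDF on disk. Note that running
    Camelot from multiple threads like this is exactly the pattern the issue
    reports as crashing the kernel, since the PDF backends are not guaranteed
    to be thread-safe.
    """
    if reader is None:
        import camelot
        reader = camelot.read_pdf
    with ThreadPoolExecutor(max_workers=len(page_ranges)) as pool:
        return list(pool.map(lambda pages: reader(path, pages=pages), page_ranges))
```

Given the later comments in this thread, splitting the same page ranges across processes rather than threads is the approach that actually yielded a speedup.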