
Using multithreading to extract tables from a large PDF #347

Closed
ash-perfect opened this issue Jun 25, 2019 · 9 comments

Comments

@ash-perfect

As the title says, I have a 200-page PDF and it takes around 4 minutes to extract the tables from all the pages, so I decided to use multiple threads to speed up extraction.

I am using a Jupyter Notebook and all the code below is in a single cell.

Here is my code:

import threading
import time

import camelot

def getTables(start, end):
    # Extract tables one page at a time from this thread's page range.
    for i in range(start, end):
        tables = camelot.read_pdf('Stryker_SPD.pdf', pages=str(i + 1))
        time.sleep(2)
        print(i)

if __name__ == "__main__":
    t1 = threading.Thread(target=getTables, name='t1', args=(10, 20))
    t2 = threading.Thread(target=getTables, name='t2', args=(30, 40))

    t1.start()
    t2.start()
    t1.join()
    t2.join()

When I run this, the kernel in my Jupyter Notebook either dies completely, or an exception occurs in one of the threads while the other runs properly.

The exception looks like this:

Exception in thread t2:
Traceback (most recent call last):
  File "/anaconda3/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/anaconda3/lib/python3.6/threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "<ipython-input-1-6180219f4c48>", line 18, in getTables
    tables = camelot.read_pdf('Stryker_SPD.pdf',pages=str(i+1))
  File "/anaconda3/lib/python3.6/site-packages/camelot/io.py", line 106, in read_pdf
    layout_kwargs=layout_kwargs, **kwargs)
  File "/anaconda3/lib/python3.6/site-packages/camelot/handlers.py", line 162, in parse
    layout_kwargs=layout_kwargs)
  File "/anaconda3/lib/python3.6/site-packages/camelot/parsers/lattice.py", line 351, in extract_tables
    self._generate_image()
  File "/anaconda3/lib/python3.6/site-packages/camelot/parsers/lattice.py", line 193, in _generate_image
    with Ghostscript(*gs_call, stdout=null) as gs:
  File "/anaconda3/lib/python3.6/site-packages/camelot/ext/ghostscript/__init__.py", line 89, in Ghostscript
    __instance__ = gs.new_instance()
  File "/anaconda3/lib/python3.6/site-packages/camelot/ext/ghostscript/_gsprint.py", line 71, in new_instance
    raise GhostscriptError(rc)
camelot.ext.ghostscript._gsprint.GhostscriptError: -100

Please help me with this. Am I missing something?

@ash-perfect
Author

I later learned that multithreading isn't the best solution for me, so I tried multiprocessing.
When I use it, the Camelot code isn't even executed; the process just stops when it reaches the camelot.read_pdf line, and no error is given.

import time
import timeit
from multiprocessing import Process

import camelot

start = 0
stop = 0

def tstart():
    # Record the start time in a module-level variable.
    global start
    start = timeit.default_timer()

def tend():
    # Record the stop time and print the elapsed duration.
    global stop
    stop = timeit.default_timer()
    execution_time = stop - start
    print("Program Executed in " + str(round(execution_time, 4)), " seconds")

def getTables(start, end):
    # Note: start and end are currently unused; the loop always covers pages 1-10.
    print("is it working")
    for i in range(10):
        print('once')
        tables = camelot.read_pdf('Stryker_SPD.pdf', pages=str(i + 1))
        time.sleep(2)
        print(i)

if __name__ == "__main__":
    p1 = Process(target=getTables, args=(10, 20))
    # p2 = Process(target=getTables, args=(20, 30))
    tstart()
    p1.start()
    # p2.start()

    p1.join()
    # p2.join()
    tend()

Note that I am using only one process to test it out. The output is:

is it working
once
Program Executed in 0.7326  seconds

If I remove the camelot.read_pdf line, it executes perfectly.

Please help with this.

@anakin87

I'm interested in making Camelot faster for large files...

@ash-perfect
Author

I tried as much as possible, but nothing worked. I ended up using bash to run the parsing in multiple processes.
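
For anyone curious, here is a minimal sketch of that shell-style approach, written in Python for consistency: each page range runs in a completely separate interpreter, so every worker gets its own Ghostscript instance. The worker.py file, the page ranges, and the output names are all hypothetical.

# worker.py (hypothetical): extract one page range in its own process
import sys

import camelot

start, end = int(sys.argv[1]), int(sys.argv[2])
tables = camelot.read_pdf('Stryker_SPD.pdf', pages='{}-{}'.format(start, end))
tables.export('tables_{}_{}.csv'.format(start, end), f='csv')

# launcher (hypothetical): start one OS process per page range and wait
import subprocess

ranges = [(1, 50), (51, 100), (101, 150), (151, 200)]
procs = [subprocess.Popen(['python', 'worker.py', str(s), str(e)]) for s, e in ranges]
for p in procs:
    p.wait()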

@vinayak-mehta
Contributor

@ash-perfect Let me try and reproduce this.

I'm interested in making Camelot faster for large files...

@anakin87 Using multiprocessing?

@maximeboun

I managed to extract a 100-page PDF using Camelot on 4 processes, which yielded good results: 157 seconds with multiprocessing against 374 seconds without.
Also, you are passing start and end to getTables but you're not using these parameters in the function. Try setting the pages argument to str(start) + '-' + str(end).
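
For a concrete starting point, here is a minimal sketch of that setup, assuming a 100-page file named Stryker_SPD.pdf and 4 workers. Note that on macOS/Windows (and from notebooks) the worker function must live in an importable module so the spawned processes can pickle it:

import camelot
from multiprocessing import Pool

def extract_range(page_range):
    # Each worker process gets its own Ghostscript instance.
    tables = camelot.read_pdf('Stryker_SPD.pdf', pages=page_range)
    # Return plain DataFrames; they pickle cleanly across processes.
    return [t.df for t in tables]

if __name__ == '__main__':
    ranges = ['1-25', '26-50', '51-75', '76-100']
    with Pool(processes=4) as pool:
        results = pool.map(extract_range, ranges)
    dfs = [df for chunk in results for df in chunk]
    print(len(dfs), 'tables extracted')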

@vinayak-mehta
Contributor

Closed in favor of camelot-dev/camelot#20.

@LinanYaooo

Hey everyone, I am also encountering an issue when processing many PDFs concurrently on a server.
I've built an async Flask service that extracts tables for users, with Camelot as the core technique. For better performance, I created a thread pool to submit the extraction tasks. What I've found is that Camelot may not handle many PDFs well in a thread pool, because of Ghostscript: when I throw about 3-5 files at a Docker container, some of the tasks just get stuck and the others hit the GhostscriptError: -100 issue.

However, when I set max_workers of the thread pool to 1, meaning tasks run in a single queue, the issue never occurs. But as a tradeoff, I may have to deploy many instances to handle high concurrency.

@vinayak-mehta
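
A minimal sketch of that single-queue workaround (all names are hypothetical): every Camelot call is funneled through a one-worker executor, so Ghostscript is never entered from two threads at once, while the rest of the service stays concurrent.

from concurrent.futures import ThreadPoolExecutor

import camelot

# One worker serializes every Ghostscript-backed call.
camelot_executor = ThreadPoolExecutor(max_workers=1)

def extract_tables(path, pages):
    return camelot.read_pdf(path, pages=pages)

# Inside a request handler (hypothetical usage):
future = camelot_executor.submit(extract_tables, 'some.pdf', '1-10')
tables = future.result()  # blocks until the queued extraction finishes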

@mlbrothers

@maximeboun can you share code snippet for multiprocessing?

@anakin87

@mlbrothers take a look at this for inspiration: https://camelot-py.readthedocs.io/en/master/user/faq.html#how-to-reduce-memory-usage-for-long-pdfs

I am not using multiprocessing, just dividing the extraction into chunks. Maybe it can be a starting point...
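
A minimal sketch of that chunked approach, assuming the page count is known up front (the file name and chunk size are placeholders):

import camelot

filepath = 'large.pdf'  # placeholder
total_pages = 200       # assumed known; in practice read it from the PDF
chunk_size = 20

dfs = []
for first in range(1, total_pages + 1, chunk_size):
    last = min(first + chunk_size - 1, total_pages)
    tables = camelot.read_pdf(filepath, pages='{}-{}'.format(first, last))
    # Keep only the DataFrames so the heavier Table objects can be
    # garbage-collected between chunks, bounding peak memory.
    dfs.extend(t.df for t in tables)

print(len(dfs), 'tables extracted')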
