
Using multithreading to extract tables from a large PDF #347

Closed
ash-perfect opened this issue Jun 25, 2019 · 9 comments

Comments

@ash-perfect

As the title says, I have a 200-page PDF and it takes around 4 minutes to extract the tables from all the pages, so I decided to use multiple threads to speed up extraction.

I am using a Jupyter Notebook and all the code below is in a single cell.

Here is my code:

import threading
import time

import camelot

def getTables(start, end):
    # Extract tables one page at a time from this thread's page range.
    for i in range(start, end):
        tables = camelot.read_pdf('Stryker_SPD.pdf', pages=str(i + 1))
        time.sleep(2)
        print(i)

if __name__ == "__main__":
    t1 = threading.Thread(target=getTables, name='t1', args=(10, 20))
    t2 = threading.Thread(target=getTables, name='t2', args=(30, 40))

    t1.start()
    t2.start()
    t1.join()
    t2.join()

When I run this, the kernel in my Jupyter Notebook either dies completely, or an exception occurs in one of the threads while the other runs properly.

The exception looks like this:

Exception in thread t2:
Traceback (most recent call last):
  File "/anaconda3/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/anaconda3/lib/python3.6/threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "<ipython-input-1-6180219f4c48>", line 18, in getTables
    tables = camelot.read_pdf('Stryker_SPD.pdf',pages=str(i+1))
  File "/anaconda3/lib/python3.6/site-packages/camelot/io.py", line 106, in read_pdf
    layout_kwargs=layout_kwargs, **kwargs)
  File "/anaconda3/lib/python3.6/site-packages/camelot/handlers.py", line 162, in parse
    layout_kwargs=layout_kwargs)
  File "/anaconda3/lib/python3.6/site-packages/camelot/parsers/lattice.py", line 351, in extract_tables
    self._generate_image()
  File "/anaconda3/lib/python3.6/site-packages/camelot/parsers/lattice.py", line 193, in _generate_image
    with Ghostscript(*gs_call, stdout=null) as gs:
  File "/anaconda3/lib/python3.6/site-packages/camelot/ext/ghostscript/__init__.py", line 89, in Ghostscript
    __instance__ = gs.new_instance()
  File "/anaconda3/lib/python3.6/site-packages/camelot/ext/ghostscript/_gsprint.py", line 71, in new_instance
    raise GhostscriptError(rc)
camelot.ext.ghostscript._gsprint.GhostscriptError: -100

Please help me with this. Am I missing something?

@ash-perfect
Author

I later learned that multithreading isn't the best solution for me, so I tried multiprocessing.
When I use it, the Camelot code isn't even executed; the process just stops when it reaches the camelot.read_pdf line, and no error is given.

import time
import timeit
from multiprocessing import Process

import camelot

start = 0
stop = 0

def tstart():
    # Record the start time in a module-level variable.
    global start
    start = timeit.default_timer()

def tend():
    # Record the stop time and print the elapsed duration.
    global stop
    stop = timeit.default_timer()
    execution_time = stop - start
    print("Program Executed in " + str(round(execution_time, 4)), " seconds")

def getTables(start, end):
    # Note: start and end are currently unused; the loop always covers pages 1-10.
    print("is it working")
    for i in range(10):
        print('once')
        tables = camelot.read_pdf('Stryker_SPD.pdf', pages=str(i + 1))
        time.sleep(2)
        print(i)

if __name__ == "__main__":
    p1 = Process(target=getTables, args=(10, 20))
    # p2 = Process(target=getTables, args=(20, 30))
    tstart()
    p1.start()
    # p2.start()

    p1.join()
    # p2.join()
    tend()

Note that I am using only one process to test it out. The output is:

is it working
once
Program Executed in 0.7326  seconds

If I remove the camelot.read_pdf line, it executes perfectly.

Please help with this.

@anakin87

I'm interested in making Camelot faster for large files...

@ash-perfect
Author

I tried as much as possible, but nothing worked. I ended up using bash to run the parsing in multiple processes.
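
For anyone curious, here is a minimal sketch of that shell-style approach, written in Python for consistency: each page range runs in a completely separate interpreter, so every worker gets its own Ghostscript instance. The worker.py file, the page ranges, and the output names are all hypothetical.

# worker.py (hypothetical): extract one page range in its own process
import sys

import camelot

start, end = int(sys.argv[1]), int(sys.argv[2])
tables = camelot.read_pdf('Stryker_SPD.pdf', pages='{}-{}'.format(start, end))
tables.export('tables_{}_{}.csv'.format(start, end), f='csv')

# launcher (hypothetical): start one OS process per page range and wait
import subprocess

ranges = [(1, 50), (51, 100), (101, 150), (151, 200)]
procs = [subprocess.Popen(['python', 'worker.py', str(s), str(e)]) for s, e in ranges]
for p in procs:
    p.wait()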

@vinayak-mehta
Contributor

@ash-perfect Let me try and reproduce this.

I'm interested in making Camelot faster for large files...

@anakin87 Using multiprocessing?

@maximeboun

I managed to extract a 100-page PDF using Camelot on 4 processes, which yielded good results: 157 seconds with multiprocessing against 374 seconds without.
Also, you are passing start and end to getTables but you're not using these parameters in the function. Try setting the pages argument to str(start) + '-' + str(end).
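
For a concrete starting point, here is a minimal sketch of that setup, assuming a 100-page file named Stryker_SPD.pdf and 4 workers. Note that on macOS/Windows (and from notebooks) the worker function must live in an importable module so the spawned processes can pickle it:

import camelot
from multiprocessing import Pool

def extract_range(page_range):
    # Each worker process gets its own Ghostscript instance.
    tables = camelot.read_pdf('Stryker_SPD.pdf', pages=page_range)
    # Return plain DataFrames; they pickle cleanly across processes.
    return [t.df for t in tables]

if __name__ == '__main__':
    ranges = ['1-25', '26-50', '51-75', '76-100']
    with Pool(processes=4) as pool:
        results = pool.map(extract_range, ranges)
    dfs = [df for chunk in results for df in chunk]
    print(len(dfs), 'tables extracted')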

@vinayak-mehta
Contributor

Closed in favor of camelot-dev/camelot#20.

@LinanYaooo

Hey everyone, I am also encountering an issue when processing many PDFs concurrently on a server.
I've built an async Flask service that extracts tables for users, with Camelot as the core technique. For better performance, I created a thread pool to submit the extraction tasks. What I've found is that Camelot may not handle many PDFs well in a thread pool, because of Ghostscript: when I throw about 3-5 files at a Docker container, some of the tasks just get stuck and the others hit the GhostscriptError: -100 issue.

However, when I set max_workers of the thread pool to 1, meaning tasks run in a single queue, the issue never occurs. But as a tradeoff, I may have to deploy many instances to handle high concurrency.

@vinayak-mehta
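
A minimal sketch of that single-queue workaround (all names are hypothetical): every Camelot call is funneled through a one-worker executor, so Ghostscript is never entered from two threads at once, while the rest of the service stays concurrent.

from concurrent.futures import ThreadPoolExecutor

import camelot

# One worker serializes every Ghostscript-backed call.
camelot_executor = ThreadPoolExecutor(max_workers=1)

def extract_tables(path, pages):
    return camelot.read_pdf(path, pages=pages)

# Inside a request handler (hypothetical usage):
future = camelot_executor.submit(extract_tables, 'some.pdf', '1-10')
tables = future.result()  # blocks until the queued extraction finishes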

@mlbrothers

@maximeboun can you share code snippet for multiprocessing?

@anakin87

@mlbrothers take a look at this for inspiration: https://camelot-py.readthedocs.io/en/master/user/faq.html#how-to-reduce-memory-usage-for-long-pdfs

I am not using multiprocessing, just dividing the extraction into chunks. Maybe it can be a starting point...
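
A minimal sketch of that chunked approach, assuming the page count is known up front (the file name and chunk size are placeholders):

import camelot

filepath = 'large.pdf'  # placeholder
total_pages = 200       # assumed known; in practice read it from the PDF
chunk_size = 20

dfs = []
for first in range(1, total_pages + 1, chunk_size):
    last = min(first + chunk_size - 1, total_pages)
    tables = camelot.read_pdf(filepath, pages='{}-{}'.format(first, last))
    # Keep only the DataFrames so the heavier Table objects can be
    # garbage-collected between chunks, bounding peak memory.
    dfs.extend(t.df for t in tables)

print(len(dfs), 'tables extracted')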
