Performance issue - why? #278
-
PS to PS: job with default dask configuration for a single workstation had this result:
Only a small difference from the explicit 8-thread run. The default, as I read the documentation, is equivalent to scheduler='processes' when using bags, and it sets the number of workers to the number of CPUs. This machine has 8, so I think this test is effectively identical to the previous one, which finished within about 1 s of 10,000 s. Apparently the overhead in this workflow is not large even for the processes scheduler, which appears to require pickling the data (exactly when it does so is still unclear to me). So the questions above are still the main ones: why is dask slower than serial? Answering that question can improve MsPASS.
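To spell out the equivalence I'm assuming, here is a toy comparison; the work function is just a sleep standing in for the real per-ensemble processing, so only the relative behavior matters:

```python
import os
import time
import dask.bag as db

def work(x):
    time.sleep(0.1)   # stand-in for the real per-ensemble processing
    return x

if __name__ == "__main__":
    bag = db.from_sequence(range(32), npartitions=os.cpu_count())

    # Default for bags: the "processes" (multiprocessing) scheduler with
    # num_workers equal to the CPU count (8 on this workstation).
    t0 = time.time()
    bag.map(work).compute()
    print("default :", round(time.time() - t0, 2), "s")

    # The explicit equivalent, which is why I expect the two runs to match.
    t0 = time.time()
    bag.map(work).compute(scheduler="processes", num_workers=os.cpu_count())
    print("explicit:", round(time.time() - t0, 2), "s")
```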
-
We definitely need to look into this, but I think the problem is in the database. Since you are running a local MongoDB instance, I don't think it can handle parallel writes very well, especially because you are saving all the waveforms into GridFS here by calling the save with default arguments. Also, because Dask is using all 8 cores available on the machine, MongoDB has to compete with Dask for CPU. There might be other issues, but I think those two are the main contributors to the slowness. I don't think we've ever tested the database performance, and there are a lot of potential optimizations needed.
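One quick experiment that might separate the CPU-contention part from the GridFS part is to leave some cores for mongod. This is only a sketch, and the reduced worker count is a guess, not a tested recommendation:

```python
import os
import dask

# Leave a couple of cores free for the local mongod process instead of
# letting the dask workers claim all 8.  If the run time improves noticeably,
# CPU contention with MongoDB is at least part of the problem.
dask.config.set(scheduler="processes", num_workers=max(1, os.cpu_count() - 2))
```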
-
Finished another run of pwstack with this config:
I ran it to see what happened if I had more workers than cores (8 on this machine). I'm not 100% sure this is directly comparable to the previous run because of a mistake I made in the data assembly, but here is the timing output:
That is slightly less than the first run, but not by much.
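For anyone following along, the general form of an oversubscribed run looks something like the sketch below (toy workload again; the worker count of 16 is only an example of "more workers than cores," not necessarily what this config used):

```python
import time
import dask
import dask.bag as db

def work(x):
    time.sleep(0.1)   # toy stand-in for the per-ensemble processing
    return x

if __name__ == "__main__":
    # Ask for more workers than the 8 physical cores; the extra workers mostly
    # wait their turn, so a nearly unchanged run time is not surprising.
    with dask.config.set(scheduler="processes", num_workers=16):
        bag = db.from_sequence(range(64), npartitions=16)
        t0 = time.time()
        bag.map(work).compute()
        print("elapsed:", round(time.time() - t0, 2), "s")
```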
-
I have learned more about this issue and think an important conclusion will follow. The new data are runs of pwmig, which is a completely different algorithm than the one referenced above. However, because of some unrelated issues with pickle, both implementations currently cannot run anything but the "threaded" scheduler of dask. The pwmig runs were even more unambiguous. I ran my test data set through the prototype using: (1) 'single-threaded', (2) threaded with the default number of workers (8 for this machine), (3) 16 workers, and (4) 4 workers. All ran within 20 or 30 s of 10,090 s. Obviously they were all actually running single threaded. Why, I think, is revealed by this quote I take from the dask documentation found here:
The algorithms I'm developing here use bags, and pwmig also uses delayed. We'll see whether this hypothesis is true once we get all the currently pending pull requests merged and I sort out the pickle problems I have with these two algorithms, but I think the quote above likely defines this problem. I anticipate two actions once we confirm the truth of my hypothesis.
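To make the single-threaded behavior concrete, here is a small self-contained experiment with a toy pure-Python workload (not pwmig) that should show the same pattern: with work that never releases the GIL, the threaded scheduler gives essentially the same wall time no matter how many workers are requested:

```python
import time
import dask.bag as db

def pure_python_work(x):
    # CPU-bound pure Python: it holds the GIL, so the threaded scheduler
    # cannot run copies of this in parallel.
    total = 0
    for i in range(2_000_000):
        total += i % (x + 1)
    return total

bag = db.from_sequence(range(16), npartitions=16)

for label, kwargs in [
    ("single-threaded", {"scheduler": "single-threaded"}),
    ("4 threads", {"scheduler": "threads", "num_workers": 4}),
    ("8 threads", {"scheduler": "threads", "num_workers": 8}),
    ("16 threads", {"scheduler": "threads", "num_workers": 16}),
]:
    t0 = time.time()
    bag.map(pure_python_work).compute(**kwargs)
    print(label, "->", round(time.time() - t0, 1), "s")
```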
-
This is an interesting discovery. It may be a usage error or it may point to something we need to address for MsPASS.
Here is a summary of what I found that needs some detective work. I ran a rewrite of a program from the pwmig package I've been working on called pwstack. My README file says the original all-in-one C++ code took just over 1 hour to run on a test data set. A serial version of the same algorithm, using python bindings and driver code in python, took about 6000 s to process the same data. That didn't surprise me because the new code has to do database transactions with MongoDB, while the old code was heavily optimized and wrote its output to a custom binary file. The puzzle is that running the same code with dask using this scheduler line:
and constructs I'll show below took 10,000 s to complete.
I know from running in debug mode many times that the overhead in the initialization of this code is tiny, about 20 s. The parallel section that follows the initialization is this:
Noting a few things to help you understand what is going on here: read_ensemble is a python function that takes a query line, makes the query, and calls the read_ensemble method of Database to build the input SeismogramEnsemble it pushes into the bag. Note that for these test data the individual ensembles are fairly small (always fewer than about 15).
I need to verify that the parallel and serial algorithms produced the same answer, but I would be surprised if they did not. Any speculation on what could be causing this performance issue? To you experts: what tools can we use to sort out what is causing this?
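For readers who have not seen the code, here is a stripped-down sketch of the shape of that parallel section. Every function and the query list in it are stand-ins, so it only shows the bag structure, not the actual pwstack code:

```python
import dask.bag as db

# Stand-ins so the shape of the pipeline is runnable on its own; in the real
# job these are the Database query/reader, pwstack_ensemble, and the save call.
def read_ensemble(query):
    return list(range(query["n"]))      # pretend SeismogramEnsemble (< ~15 members)

def process_ensemble(ensemble):
    return [2 * x for x in ensemble]    # pretend pwstack_ensemble

def save_ensemble(ensemble):
    return len(ensemble)                # pretend database save

# One query dict per ensemble, built in the cheap (~20 s) initialization step.
queries = [{"n": k % 15 + 1} for k in range(100)]

if __name__ == "__main__":
    results = (
        db.from_sequence(queries)
          .map(read_ensemble)
          .map(process_ensemble)
          .map(save_ensemble)
          .compute()
    )
    print(len(results), "ensembles processed")
```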
A postscript before closing. This is the code I could not make work with the normal scheduler. That is, it wouldn't run before without the dask.config.set call as shown above. I had been getting a mysterious pickle error that I asked @wangyinz to look into a few weeks ago. Something has changed, because a new test running with the default dask configuration now seems to run fine. I suspect the pickle error was a red herring created by a version skew between the different mspass libraries I was linking this code against. These runs used a merge with the still-pending branch that added serialization for the "TopMute" objects used in the argument list to pwstack_ensemble (control.data_mute and control.stack_mute). I suspect I was not using the link libraries I thought I was when I was getting that error before. So that problem seems to have gone away, and pwstack is working but has this performance issue. I'll give you an update on the new test when it finishes, presumably a couple of hours from now.
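An aside on the pickle question: a quick way to check objects like these is a round trip through the standard pickle module. The control.data_mute / control.stack_mute names in the comment below are just where I would point such a check; the helper itself is not part of mspass:

```python
import pickle

def pickles_ok(obj):
    """Round-trip an object through pickle the way the process-based
    schedulers would, and report the failure if there is one."""
    try:
        pickle.loads(pickle.dumps(obj))
        return True
    except Exception as err:
        print(type(obj).__name__, "failed to pickle:", err)
        return False

# e.g. pickles_ok(control.data_mute) and pickles_ok(control.stack_mute)
```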