Performance issue - why? #278
-
PS to PS: job with default dask configuration for a single workstation had this result:
Only a small difference from the explicit 8-thread run. The default, as I read the documentation, is equivalent to scheduler='processes' when using bags, and it sets the number of workers to the number of CPUs. This machine has 8, so I think this test is effectively identical to the previous one, which finished within about 1 s of 10,000 s. Apparently the overhead in this workflow is not large even for the processes scheduler, which appears to require pickling the data (exactly when it does so is still unclear to me). So the questions above are still the main ones: why is dask slower than serial? Answering that question can improve MsPASS.
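To spell out the equivalence I'm assuming, here is a toy comparison; the work function is just a sleep standing in for the real per-ensemble processing, so only the relative behavior matters:

```python
import os
import time
import dask.bag as db

def work(x):
    time.sleep(0.1)   # stand-in for the real per-ensemble processing
    return x

if __name__ == "__main__":
    bag = db.from_sequence(range(32), npartitions=os.cpu_count())

    # Default for bags: the "processes" (multiprocessing) scheduler with
    # num_workers equal to the CPU count (8 on this workstation).
    t0 = time.time()
    bag.map(work).compute()
    print("default :", round(time.time() - t0, 2), "s")

    # The explicit equivalent, which is why I expect the two runs to match.
    t0 = time.time()
    bag.map(work).compute(scheduler="processes", num_workers=os.cpu_count())
    print("explicit:", round(time.time() - t0, 2), "s")
```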
-
We definitely need to look into this, but I think the problem is in the database. Since you are running a local MongoDB instance, I don't think it can handle parallel writes very well, especially because you are saving all the waveforms into GridFS here by calling the save with default arguments. Also, because Dask is using all 8 cores available on the machine, MongoDB has to compete with Dask for CPU. There might be other issues, but I think those two are the main contributors to the slowness. I don't think we've ever tested the database performance, and there are a lot of potential optimizations needed.
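One quick experiment that might separate the CPU-contention part from the GridFS part is to leave some cores for mongod. This is only a sketch, and the reduced worker count is a guess, not a tested recommendation:

```python
import os
import dask

# Leave a couple of cores free for the local mongod process instead of
# letting the dask workers claim all 8.  If the run time improves noticeably,
# CPU contention with MongoDB is at least part of the problem.
dask.config.set(scheduler="processes", num_workers=max(1, os.cpu_count() - 2))
```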
-
Finished another run of pwstack with this config:
I ran it to see what happened if I had more workers than cores (8 on this machine). I'm not 100% sure this is directly comparable to the previous run because of a mistake I made in the data assembly, but here is the timing output:
That is slightly less than the first run, but not by much.
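For anyone following along, the general form of an oversubscribed run looks something like the sketch below (toy workload again; the worker count of 16 is only an example of "more workers than cores," not necessarily what this config used):

```python
import time
import dask
import dask.bag as db

def work(x):
    time.sleep(0.1)   # toy stand-in for the per-ensemble processing
    return x

if __name__ == "__main__":
    # Ask for more workers than the 8 physical cores; the extra workers mostly
    # wait their turn, so a nearly unchanged run time is not surprising.
    with dask.config.set(scheduler="processes", num_workers=16):
        bag = db.from_sequence(range(64), npartitions=16)
        t0 = time.time()
        bag.map(work).compute()
        print("elapsed:", round(time.time() - t0, 2), "s")
```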
-
I have learned more about this issue and think an important conclusion will follow. The new data are runs of pwmig, which is a completely different algorithm than the one referenced above. However, because of some unrelated issues with pickle, both implementations currently cannot run anything but the "threaded" scheduler of dask. The pwmig runs were even more unambiguous. I ran my test data set through the prototype using: (1) 'single-threaded', (2) threaded with the default number of workers (8 for this machine), (3) 16 workers, and (4) 4 workers. All ran within 20 or 30 s of 10,090 s. Obviously they were all actually running single threaded. Why, I think, is revealed by this quote I take from the dask documentation found here:
The algorithms I'm developing here use bags, and pwmig also uses delayed. We'll see whether this hypothesis is true once we get all the currently pending pull requests merged and I sort out the pickle problems I have with these two algorithms, but I think the quote above likely defines this problem. I anticipate two actions once we confirm the truth of my hypothesis.
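To make the single-threaded behavior concrete, here is a small self-contained experiment with a toy pure-Python workload (not pwmig) that should show the same pattern: with work that never releases the GIL, the threaded scheduler gives essentially the same wall time no matter how many workers are requested:

```python
import time
import dask.bag as db

def pure_python_work(x):
    # CPU-bound pure Python: it holds the GIL, so the threaded scheduler
    # cannot run copies of this in parallel.
    total = 0
    for i in range(2_000_000):
        total += i % (x + 1)
    return total

bag = db.from_sequence(range(16), npartitions=16)

for label, kwargs in [
    ("single-threaded", {"scheduler": "single-threaded"}),
    ("4 threads", {"scheduler": "threads", "num_workers": 4}),
    ("8 threads", {"scheduler": "threads", "num_workers": 8}),
    ("16 threads", {"scheduler": "threads", "num_workers": 16}),
]:
    t0 = time.time()
    bag.map(pure_python_work).compute(**kwargs)
    print(label, "->", round(time.time() - t0, 1), "s")
```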
-
This is an interesting discovery. It may be a usage error or it may point to something we need to address for MsPASS.
Here is a summary of what I found that needs some detective work. I ran a rewrite of a program from the pwmig package I've been working on called pwstack. My README file says the original all-in-one C++ code took just over 1 hour to run on a test data set. A serial version of the same algorithm, using python bindings and driver code in python, took about 6000 s to process the same data. That didn't surprise me because the new code has to do database transactions with MongoDB, while the old code was heavily optimized and wrote its output to a custom binary file. The puzzle is that running the same code with dask using this scheduler line:
and constructs I'll show below took 10,000 s to complete.
I know from running in debug mode many times that the overhead in the initialization of this code is tiny, about 20 s. The parallel section that follows the initialization is this:
Noting a few things to help you understand what is going on here: read_ensemble is a python function that takes a query line, makes the query, and calls the read_ensemble method of Database to build the input SeismogramEnsemble it pushes into the bag. Note that for these test data the individual ensembles are fairly small (always fewer than about 15).
I need to verify that the parallel and serial algorithms produced the same answer, but I would be surprised if they did not. Any speculation on what could be causing this performance issue? To you experts: what tools can we use to sort out what is causing this?
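For readers who have not seen the code, here is a stripped-down sketch of the shape of that parallel section. Every function and the query list in it are stand-ins, so it only shows the bag structure, not the actual pwstack code:

```python
import dask.bag as db

# Stand-ins so the shape of the pipeline is runnable on its own; in the real
# job these are the Database query/reader, pwstack_ensemble, and the save call.
def read_ensemble(query):
    return list(range(query["n"]))      # pretend SeismogramEnsemble (< ~15 members)

def process_ensemble(ensemble):
    return [2 * x for x in ensemble]    # pretend pwstack_ensemble

def save_ensemble(ensemble):
    return len(ensemble)                # pretend database save

# One query dict per ensemble, built in the cheap (~20 s) initialization step.
queries = [{"n": k % 15 + 1} for k in range(100)]

if __name__ == "__main__":
    results = (
        db.from_sequence(queries)
          .map(read_ensemble)
          .map(process_ensemble)
          .map(save_ensemble)
          .compute()
    )
    print(len(results), "ensembles processed")
```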
A postscript before closing. This is the code I could not make work with the normal scheduler. That is, it wouldn't run before without the dask.config.set call as shown above. I had been getting a mysterious pickle error that I asked @wangyinz to look into a few weeks ago. Something has changed, because a new test running with the default dask configuration now seems to run fine. I suspect the pickle error was a red herring created by a version skew between the different mspass libraries I was linking this code against. These runs used a merge with the still-pending branch that added serialization for the "TopMute" objects used in the argument list to pwstack_ensemble (control.data_mute and control.stack_mute). I suspect I was not using the link libraries I thought I was when I was getting that error before. So that problem seems to have gone away, and pwstack is working but has this performance issue. I'll give you an update on the new test when it finishes, presumably a couple of hours from now.
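An aside on the pickle question: a quick way to check objects like these is a round trip through the standard pickle module. The control.data_mute / control.stack_mute names in the comment below are just where I would point such a check; the helper itself is not part of mspass:

```python
import pickle

def pickles_ok(obj):
    """Round-trip an object through pickle the way the process-based
    schedulers would, and report the failure if there is one."""
    try:
        pickle.loads(pickle.dumps(obj))
        return True
    except Exception as err:
        print(type(obj).__name__, "failed to pickle:", err)
        return False

# e.g. pickles_ok(control.data_mute) and pickles_ok(control.stack_mute)
```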