MongoDB database handles in parallel constructs #298
Replies: 3 comments 2 replies
-
Hmmm... what is this? (mspass/python/mspasspy/db/database.py, lines 193 to 205 in 3fe0c52.) Therefore, I am not quite sure where the problem would come from. Compare mspass/cxx/python/utility/utility_py.cc, lines 201 to 209 in 85c8f46, where you can clearly see that the relevant code is already commented out in favor of a new numpy implementation. I guess the only way to sort out the problem is to have me run and debug the code. We attempted this before but were blocked by a separate issue. Maybe it is time to reinitiate that process.
-
Seems my hypothesis may well be false then. As noted, I was guessing because I couldn't untangle the messy data structure dask was passing through the scheduler. I learned a lot about how dask works by stepping through that process with spyder, but it was complicated enough that I got lost in the woods, stepped back, looked at what was being done, and conjectured that Database was to blame even though I couldn't actually see any references to Database in the data I was viewing with spyder. It seems my conjecture was likely wrong. Based on what you are saying I will dig back into this before punting it to you.

FYI, I did learn a key thing while working on this that makes the "PyCapsule" reference a bit less confusing. One source I read (I can no longer easily find it) described a PyCapsule as a way to push around an opaque pointer. In C that approach is abused far too often; it is used even in the stdlib for things like malloc, where a void* pointer has to be cast into whatever type the code wants it to be. That, by the way, is a source of a large number of software bugs. What is mysterious is: what in this code is being miscast into a PyCapsule? I certainly didn't sort it out yesterday, but let me have another go at it before I punt it to you.

Putting aside the need to fix pwstack/pwmig, this still raises design and documentation issues for MsPASS. We really, really need a clear set of documentation on the whole serialization issue. It seems to me a huge weakness of both dask and spark that neither has a common tool for testing whether an arbitrary python object can be serialized. I suspect strongly that is because the dominant use of these packages is for things like dataframes, where the atomic unit is very rigidly defined, in contrast to a bag/rdd whose contents can be anything. Our use, I think, is particularly fringe because we have all these custom classes bound to python with pybind11.
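On the point about testing serializability: the check the comment above wishes dask and spark shipped can be approximated in a few lines of stdlib code. This is a hypothetical helper, not anything either package provides; note that plain pickle is only a conservative proxy, since dask actually ships objects with cloudpickle, which accepts somewhat more.

```python
import pickle

def is_serializable(obj):
    """Return (True, None) if obj survives a pickle round trip,
    otherwise (False, error message).  Anything that fails plain
    pickle is almost certainly trouble for dask or spark too."""
    try:
        pickle.loads(pickle.dumps(obj))
        return True, None
    except Exception as err:  # pickle raises several exception types
        return False, "{}: {}".format(type(err).__name__, err)

# Plain data passes; objects tied to live runtime state do not.
ok, _ = is_serializable({"sta": "AAK", "chan": "BHZ"})
bad, msg = is_serializable((x for x in range(3)))  # a generator
print(ok, bad)  # True False
```

Running a check like this on every object that will cross the scheduler boundary, before submitting a workflow, would catch this class of error early instead of deep inside dask internals.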
Not sure what is realistic to support in MsPASS, but an absolute minimum is more extensive documentation on parallelizing a workflow. The one draft page we have is far from sufficient, especially since parallel processing is one of the key innovations we are advertising for this package.
-
It is starting to look like my hypothesis that obspy's travel time calculator is a problem may be correct. It turned out my original idea was not going to be easy to implement and would not have been very definitive anyway. I realized it was easier to write a pure test program that did more or less what the large function I suspected of being the trouble was doing. So, I wrote this little test program; I would like both of you to see if you can run it:
When I run this as above I get this chain of errors:
Note what you get if you change the dask.config.set line. What happens then is a completely different problem from what I was getting with the pwstack program, but it seems to suggest an issue with the TauPyModel class. Could you guys verify this behaves the same for you? I ran this on my local standalone machine, not from the docker container. We need to rule out the issue being something with my local version of dask. (No reason to think that, but this is not our standard container.)
-
I am not 100% sure of this, but I think I've figured out why my parallel version of pwstack (part of pwmig, which I've been testing using mspass) is failing with an obscure error about not being able to pickle a PyCapsule. My current working hypothesis is that the error is created by a somewhat hidden (from a rookie perspective, anyway) attempt to serialize a MongoDB Database class.
Let me break this algorithm down in stages so you can see how this is happening. The algorithm first creates a list of strings that are queries for MongoDB find:
Pretty stock python code.
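That query-building code is not captured in this copy of the thread. A minimal sketch of the kind of thing described (all names hypothetical), with each find() query serialized as a JSON string, could be:

```python
import json

# Hypothetical sketch: one MongoDB find() query per source event,
# serialized to a JSON string.  Strings are trivially picklable, so
# a bag built from this list cannot itself be the serialization problem.
source_ids = ["evid_001", "evid_002", "evid_003"]
queries = [json.dumps({"source_id": sid}) for sid in source_ids]
print(queries[0])  # {"source_id": "evid_001"}
```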
Next I take that list of strings and turn it into a dask bag:
which is also pretty standard stuff.
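The actual line is also missing here; a hedged sketch of turning such a list into a bag, assuming the standard dask.bag API:

```python
import dask
import dask.bag as dbag

# Hypothetical continuation: turn the list of query strings into a bag,
# one element per partition so each worker can drive one ensemble read.
queries = ['{"source_id": "evid_001"}', '{"source_id": "evid_002"}']
dask.config.set(scheduler="synchronous")  # deterministic for this sketch
bag = dbag.from_sequence(queries, npartitions=len(queries))
print(bag.count().compute())  # 2
```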
I think the problem is right here with the next line:
db is a mspasspy.db.Database handle and, I believe, is not serializable. I am pretty sure it would need to be serializable since it appears in the arg list to read_ensemble. Can you, @wangyinz, confirm that hypothesis?
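I cannot test the real mspasspy.db.Database class here, but the suspected failure mode is easy to demonstrate with any object that wraps an OS-level resource, which is what a live MongoDB connection is. In this stdlib-only sketch (the class is invented for illustration), a thread lock stands in for the connection:

```python
import pickle
import threading

class FakeDatabaseHandle:
    """Hypothetical stand-in for a database handle: it owns a live
    OS-level resource, the way a real Database wraps a MongoDB
    socket connection.  A thread lock plays that role here."""
    def __init__(self):
        self._connection = threading.Lock()

db = FakeDatabaseHandle()
try:
    pickle.dumps(db)
except TypeError as err:
    # Typically: "cannot pickle '_thread.lock' object"
    print("not serializable:", err)
```

If Database behaves like this, the usual workaround is to pass connection parameters (host, port, database name) to the workers and construct the handle inside the mapped function, rather than shipping the handle itself.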
The reason I'm not positive about this issue is that tracking the error chain really "leads down a rathole", to again use a cliche. The error message python posts is totally obscure:
Furthermore, when I followed the call sequence with spyder I got into a very confusing chain of results I preserved [here](pavlis/parallel_pwmig#2) (the rathole in the cliche above). A simple summary of what I found: the data structure dask passes around is pretty opaque, and finding the problem from the internals is a "needle in a haystack" problem, which I guess is yet another cliche.
If, as I strongly suspect, my hypothesis is true, there are some followup issues we need to deal with: