asyncio support #554
Comments
Your question is, unfortunately, a bit vague, so I have to guess somewhat at the concern. As I understand it, the Python coroutines you refer to are a method of doing asynchronous I/O in Python. There is no question that balancing I/O and CPU time is critical to optimizing the performance of a workflow handling large volumes of data. I first recommend you read this section of the MsPASS User Manual.

A second point I'd make is that because MsPASS is (currently at least) focused on the map-reduce model of parallel computing, I/O is mostly treated as a scheduling problem for the parallel scheduler and the operating system. It is our experience that most seismic processing workflows reduce to parallel pipelines with reads at the head and writes at the tail of the pipeline. If the compute tasks are lightweight, such jobs become I/O bound quickly. However, the more work you put in the pipeline, the more negligible I/O becomes. As the manual section above notes, another key point is to avoid database transactions within any workflow: if a job blocks on a database transaction, it is easy to create a completely I/O-bound job from even a lightweight task.

Finally, you might want to read the new documentation for what dask calls "Futures" in dask distributed, found here. Futures are an abstraction of compute tasks submitted asynchronously to a cluster. They are, in fact, used under the hood in the map-reduce approach we have focused on in MsPASS.

Long answer; I hope it was helpful. Happy to continue this dialogue.
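To make the Futures abstraction concrete, here is a minimal sketch using the standard library's `concurrent.futures`, whose API dask.distributed's Futures deliberately mirror (`lightweight_task` is a hypothetical stand-in for a per-datum compute step; with dask you would call `client.submit` on a `Client` instead of `executor.submit`):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def lightweight_task(x):
    # Hypothetical stand-in for one compute step in a parallel pipeline
    return x * x

if __name__ == "__main__":
    # Each submit returns a Future immediately; work proceeds asynchronously
    with ThreadPoolExecutor(max_workers=4) as executor:
        futures = [executor.submit(lightweight_task, i) for i in range(8)]
        # Results become available as tasks finish, not in submission order
        results = sorted(f.result() for f in as_completed(futures))
    print(results)  # [0, 1, 4, 9, 16, 25, 36, 49]
```

The key point is that the caller never blocks at submission time; blocking happens only when a result is actually needed, which is how a scheduler can overlap I/O and compute.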
Looking at the pipelines, I see the parallel handling is done through high-level processing pipelines. I would like to make you aware that …
Thanks for letting us know about this likely future standard in Python. That could help a lot of pure Python code. Two points I'd like to make as a followup, though:
Postscript to previous: In MsPASS you should realize we treat indexing of miniseed files as a fundamentally different operation than reading them. We, in fact, recommend that normal use with large data sets should minimize the number of files and organize data into files grouped in a way appropriate for the data being analyzed; e.g., event-based processing should have all the data for an event in one file. One of the worst things you can do on an HPC system is create the equivalent of a set of SAC files with one signal per file. You can crash an entire HPC cluster if you try to read millions of files in parallel when the files are stored in a Lustre file system. Read the tutorials from the Earthscope short course I pointed to above, where we discuss some of these issues further. The final point is that our reader called …
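The grouping advice above can be sketched in plain Python: bundle all signals for one event together so they can be written to a single file, instead of the one-file-per-signal pattern that overwhelms parallel file systems (the `records` list and its `(event_id, station, samples)` layout here are hypothetical illustration, not a MsPASS data structure):

```python
from collections import defaultdict

# Hypothetical waveform records: (event_id, station, samples)
records = [
    ("ev1", "sta1", [0.1, 0.2]),
    ("ev2", "sta1", [0.3]),
    ("ev1", "sta2", [0.4, 0.5]),
]

# Group by event so each event's signals land in one file,
# rather than creating one tiny file per signal
by_event = defaultdict(list)
for event_id, station, samples in records:
    by_event[event_id].append((station, samples))

for event_id, signals in sorted(by_event.items()):
    print(event_id, len(signals))  # ev1 2, then ev2 1
```

Opening one file per event means a cluster of N workers issues N metadata operations per event instead of N per signal, which is the difference that matters on Lustre.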
Dear MsPASS team,

To leverage true HPC computing and juggle I/O, CPU, and GPU work, asyncio support is required. What are your plans to incorporate native Python coroutines?

Best,
M
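For context, the coroutine-based I/O overlap the question refers to can be sketched with stdlib asyncio; `read_waveform` below is a hypothetical stand-in for a non-blocking reader, with `asyncio.sleep` modeling I/O latency:

```python
import asyncio

async def read_waveform(name, delay):
    # Hypothetical stand-in for a non-blocking read; sleep models I/O wait
    await asyncio.sleep(delay)
    return f"{name}: done"

async def main():
    # Launch several "reads" concurrently; total wall time is roughly
    # that of the slowest read, not the sum of all of them
    return await asyncio.gather(
        read_waveform("sta1", 0.01),
        read_waveform("sta2", 0.02),
        read_waveform("sta3", 0.01),
    )

if __name__ == "__main__":
    print(asyncio.run(main()))  # ['sta1: done', 'sta2: done', 'sta3: done']
```

This is single-threaded cooperative concurrency for I/O waits, which is a different mechanism from the multi-process task scheduling that dask provides; the two address overlapping but distinct bottlenecks.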