Parallel batch submission #172
Conversation
Codecov Report
Attention: Patch coverage is …

Additional details and impacted files

```diff
@@             Coverage Diff             @@
##           develop     #172      +/-   ##
===========================================
- Coverage    73.65%   72.82%   -0.84%
===========================================
  Files           46       47       +1
  Lines         6070     6207     +137
  Branches       953      979      +26
===========================================
+ Hits          4471     4520      +49
- Misses        1252     1337      +85
- Partials       347      350       +3
```
Thank you, @gpetretto, for this very exciting contribution!! Unfortunately, I won't be able to provide much input at this time, but I will be monitoring the progress here.
I am surely very interested. I have to see when I/we can test it. I have no real experience with point 1.
One question I have regarding implementation: a common scenario on MP's side is to request a huge allocation (e.g. 1024 nodes) and run concurrent VASP jobs on that allocation (e.g. 4 nodes per VASP job, so 256 max concurrent jobs in one Slurm allocation). When one of those jobs finishes, we don't necessarily (in this use case) want to leave the nodes empty. We want to pull in a new job so all resources are fully utilized until the walltime is hit. Based on the description, I wasn't sure if the "continue pulling in new jobs" aspect was being covered or not. Is this something within scope for this PR?
In principle this should allow more or less what is available in fireworks. After one Job finishes, another can be started. You can specify a maximum number of jobs to be executed, or a timeout after which no more jobs will be started, to avoid hitting the walltime.

Having collected a few more comments, I think that at this point the best strategy would be to proceed with the review and merge of the PR. It would be easier for people to try it and give feedback once it is in an official release. In the documentation it is still specified that this should be considered an experimental option, as it has not been extensively tested.
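To make the behaviour described above more concrete, the following is a minimal, purely illustrative sketch of the kind of loop a batch runner could use to keep pulling in new Jobs until a maximum number of jobs or a timeout is reached. The option names (`max_jobs`, `timeout`) and the helper callables are assumptions for illustration, not the actual jobflow-remote implementation; in the parallel case, several such loops would effectively run concurrently within the same SLURM allocation.

```python
import time


def run_batch(fetch_next_job, run_job, max_jobs=None, timeout=None):
    """Illustrative batch loop: keep pulling Jobs until a limit is hit.

    fetch_next_job: callable returning the next available Job, or None.
    run_job: callable that executes a single Job.
    max_jobs: hypothetical option, stop after this many Jobs have run.
    timeout: hypothetical option, stop starting new Jobs after this many
        seconds, to avoid hitting the walltime of the SLURM allocation.
    """
    start = time.monotonic()
    n_run = 0
    while True:
        if max_jobs is not None and n_run >= max_jobs:
            break  # reached the maximum number of jobs for this batch submission
        if timeout is not None and time.monotonic() - start > timeout:
            break  # too close to the walltime, do not start new jobs
        job = fetch_next_job()
        if job is None:
            break  # nothing left to run
        run_job(job)
        n_run += 1
```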
That sounds wonderful! I think releasing as experimental is probably the best way to get input. We will definitely try it out here but sadly not for a few months. Waiting on graduate students to join the group and our dedicated MongoDB server to be installed this month. But we will for sure be stress testing things once that's taken care of.
@gpetretto I agree! We will test this as soon as it is in the main branch or in an official version.
As discussed, I think we can go ahead with this, mentioning that it is currently experimental, as we need feedback from users. The only thing I would still change is renaming the max_jobs option of a given batch submission to max_jobs_per_batch, as we also discussed.
As a long-standing request (see #86), I have started implementing the option to allow the execution of parallel jobs in batch mode. Up to now, only one job at a time could run in a batch job (see docs).
With this implementation I managed to run multiple atomate2 VASP Jobs in parallel in a single job submitted to SLURM.
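For context, here is a sketch of how submitting such a flow could look from the user side, assuming a worker named `batch_worker` has been configured with the new parallel batch options (the worker name and its configuration are assumptions made for this example):

```python
from jobflow import Flow
from pymatgen.core import Structure

from atomate2.vasp.jobs.core import RelaxMaker
from jobflow_remote import submit_flow

# Build a few independent VASP relaxations; with this PR they could run in
# parallel inside a single SLURM job handled by the batch worker.
structures = [Structure.from_file(f) for f in ("POSCAR_1", "POSCAR_2", "POSCAR_3")]
relax_jobs = [RelaxMaker().make(s) for s in structures]
flow = Flow(relax_jobs)

# "batch_worker" is a hypothetical worker assumed to be configured for
# parallel batch execution in the project configuration.
submit_flow(flow, worker="batch_worker")
```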
I have two main doubts about the implementation, and I would be interested in comments from people who have used this functionality with fireworks.
1. In the fireworks implementation it is possible to pass a nodefile, so that a list of nodes/cores can be generated (see https://github.com/materialsproject/fireworks/blob/96300617c8a3471f83e1b8aed978552e6ac6a15d/fireworks/scripts/mlaunch_run.py#L57 and https://github.com/materialsproject/fireworks/blob/e9f2004d3d861e2de288f0023ecdb5ccd9114939/fireworks/features/multi_launcher.py#L235). However, it seems that this information is never used in atomate/atomate2. So, if this is not really useful, I would avoid adding it and possibly revisit it if an explicit request comes in in the future. Can anybody confirm that this is not used for typical fireworks multi launch executions? (A small illustrative sketch is given after this list.)
2. At the moment there is no record of the batch processes (i.e. the SLURM jobs executing the jobflow Jobs) that have been submitted and have finished running. These do not fit in the regular jobs documents, since there is not a 1-to-1 correspondence between the SLURM jobs and the jobflow Jobs. It is instead possible to get a list of those that are currently running. This could mainly be a problem if some issue happens inside the batch execution and one has to track down its logs (the logs of the SLURM job, not the jobflow one; the latter will still be easily accessible), or if one wants to check that the batch submission is actually working fine. The questions here are: how important would it be to have the full history of all the batch jobs submitted to SLURM/PBS? The most straightforward way would be to add a new collection to contain these data; is it worth adding this? Any alternative suggestions? (A hypothetical document sketch is given after this list.)

I think @Andrew-S-Rosen, @JaGeo, @utf may be interested in the feature.
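Regarding point 1, a rough sketch of what generating per-job node lists from a nodefile could look like; this is not the fireworks implementation, it just assumes the usual one-hostname-per-line format and a hypothetical nodes_per_job setting:

```python
def read_node_groups(nodefile, nodes_per_job):
    """Read a nodefile (one hostname per line) and split it into groups,
    one group per concurrently running sub-job. Illustrative only."""
    with open(nodefile) as f:
        nodes = [line.strip() for line in f if line.strip()]
    return [
        nodes[i : i + nodes_per_job]
        for i in range(0, len(nodes), nodes_per_job)
    ]
```

Regarding point 2, a hypothetical example of what a document in a dedicated batch-jobs collection could contain, just to make the question more concrete (all field names are made up for this illustration):

```python
batch_job_doc = {
    "batch_uuid": "<uuid>",     # identifier of the batch process
    "worker": "batch_worker",   # worker that handled the batch submission
    "queue_job_id": "1234567",  # SLURM/PBS job id, to track down its logs
    "state": "FINISHED",        # e.g. SUBMITTED / RUNNING / FINISHED
    "start_time": None,         # datetime when the batch process started
    "end_time": None,           # datetime when it finished
    "job_uuids": [],            # jobflow Job uuids executed in this batch job
}
```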
Fixes #86
Fixes #96
TODO