Parallel batch submission #172

Merged (5 commits into develop on Oct 24, 2024)

Conversation

@gpetretto (Contributor) commented Sep 5, 2024

Following a long-standing request (see #86), I have started implementing an option to allow the execution of parallel jobs in batch mode. Until now, only one job at a time could run in a batch job (see the docs).
With this implementation I managed to run multiple atomate2 VASP Jobs in parallel in a single job submitted to SLURM.
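For orientation, here is a minimal sketch of what a worker configuration enabling this could look like. The nested batch section with jobs_handle_dir and work_dir follows the existing batch-mode setup; the parallel_jobs key, the worker name, the host and all paths are assumptions for illustration and should be checked against the documentation once released.

```yaml
workers:
  batch_worker:                        # hypothetical worker name
    type: remote
    host: cluster.example.edu          # hypothetical host
    scheduler_type: slurm
    work_dir: /scratch/user/jfr/run
    pre_run: source activate atomate2  # environment providing atomate2/VASP
    batch:
      # directories used to exchange information with the batch processes
      jobs_handle_dir: /scratch/user/jfr/batch_handle
      work_dir: /scratch/user/jfr/batch_work
      # number of jobflow Jobs executed concurrently inside one SLURM job
      # (the option introduced by this PR; exact name assumed here)
      parallel_jobs: 4
```

With a setup along these lines, the runner submits a single SLURM job whose process pulls jobflow Jobs from the queue and, with several parallel slots available, keeps them running side by side until no more Jobs are available.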

I have two main doubts about the implementation, and I would be interested in comments from people who have used this functionality with FireWorks.

  1. In FireWorks it is possible to specify the name of a nodefile, so that a list of nodes/cores can be generated:
    https://github.com/materialsproject/fireworks/blob/96300617c8a3471f83e1b8aed978552e6ac6a15d/fireworks/scripts/mlaunch_run.py#L57
    https://github.com/materialsproject/fireworks/blob/e9f2004d3d861e2de288f0023ecdb5ccd9114939/fireworks/features/multi_launcher.py#L235
    However, it seems that this information is never used in atomate/atomate2. So, if it is not really useful, I would rather not add it now and keep it in mind in case an explicit request comes in in the future. Can anybody confirm that it is not used in typical FireWorks multi-launch executions?
  2. At the moment there is no way to monitor the SLURM/PBS/... jobs (i.e., the jobs that run multiple jobflow Jobs) that have been submitted and have finished running. These do not fit in the regular job documents, since there is no 1-to-1 correspondence between the SLURM jobs and the jobflow Jobs. It is, however, possible to get a list of those that are currently running. This is mainly a problem if something goes wrong inside the batch execution and one has to track down its logs (the logs of the SLURM job, not of the jobflow Job; the latter will still be easily accessible), or if one wants to check that the batch submission is actually working fine. The questions here are: how important would it be to have the full history of all the batch jobs submitted to SLURM/PBS? The most straightforward way would be to add a new collection to contain these data (see the sketch after this list); is it worth adding it? Any alternative suggestions?
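Purely as an illustration of point 2, a record in such a dedicated collection might contain something along these lines; every field name and value below is hypothetical and not part of the current implementation:

```yaml
# hypothetical entry of a "batch_processes" collection (sketch only)
batch_uuid: 8f1c2d3e-0000-0000-0000-000000000000   # internal id of the batch process
worker: batch_worker                               # worker that submitted it
queue_job_id: "4815162"                            # id assigned by SLURM/PBS
state: FINISHED                                    # e.g. SUBMITTED / RUNNING / FINISHED
submitted_on: 2024-09-05T10:00:00
finished_on: 2024-09-05T16:00:00
log_dir: /scratch/user/jfr/batch_work/8f1c2d3e     # where the SLURM job logs live
executed_jobs:                                     # uuids of the jobflow Jobs it ran
  - 1b2c3d4e-0000-0000-0000-000000000001
  - 1b2c3d4e-0000-0000-0000-000000000002
```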

I think @Andrew-S-Rosen, @JaGeo, @utf may be interested in the feature.

Fixes #86
Fixes #96

TODO

  • Update the documentation.
  • Tests? I still have to figure out whether it is possible to effectively test that the jobs really run in parallel. Maybe just add tests so that the code is exercised and trivial regressions are avoided.

@codecov-commenter commented Sep 5, 2024

Codecov Report

Attention: Patch coverage is 39.15663% with 101 lines in your changes missing coverage. Please review.

Project coverage is 72.82%. Comparing base (6d6ed9d) to head (5e88519).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/jobflow_remote/jobs/run.py | 5.40% | 34 Missing and 1 partial ⚠️ |
| src/jobflow_remote/jobs/runner.py | 64.15% | 18 Missing and 1 partial ⚠️ |
| src/jobflow_remote/cli/batch.py | 41.37% | 17 Missing ⚠️ |
| src/jobflow_remote/cli/formatting.py | 10.52% | 17 Missing ⚠️ |
| src/jobflow_remote/jobs/batch.py | 42.10% | 9 Missing and 2 partials ⚠️ |
| src/jobflow_remote/jobs/jobcontroller.py | 66.66% | 1 Missing and 1 partial ⚠️ |
Additional details and impacted files
@@             Coverage Diff             @@
##           develop     #172      +/-   ##
===========================================
- Coverage    73.65%   72.82%   -0.84%     
===========================================
  Files           46       47       +1     
  Lines         6070     6207     +137     
  Branches       953      979      +26     
===========================================
+ Hits          4471     4520      +49     
- Misses        1252     1337      +85     
- Partials       347      350       +3     
| Files with missing lines | Coverage Δ |
|---|---|
| src/jobflow_remote/cli/__init__.py | 100.00% <100.00%> (ø) |
| src/jobflow_remote/cli/execution.py | 92.30% <ø> (ø) |
| src/jobflow_remote/config/base.py | 83.71% <100.00%> (+0.07%) ⬆️ |
| src/jobflow_remote/jobs/jobcontroller.py | 78.31% <66.66%> (-0.04%) ⬇️ |
| src/jobflow_remote/jobs/batch.py | 63.73% <42.10%> (-5.60%) ⬇️ |
| src/jobflow_remote/cli/batch.py | 41.37% <41.37%> (ø) |
| src/jobflow_remote/cli/formatting.py | 80.83% <10.52%> (-4.99%) ⬇️ |
| src/jobflow_remote/jobs/runner.py | 77.79% <64.15%> (-1.06%) ⬇️ |
| src/jobflow_remote/jobs/run.py | 42.30% <5.40%> (-11.78%) ⬇️ |

@Andrew-S-Rosen (Collaborator)

Thank you, @gpetretto, for this very exciting contribution!! Unfortunately, I won't be able to provide much input at this time, but I will be monitoring the progress here.

@JaGeo (Collaborator) commented Sep 6, 2024

I am certainly very interested. I will have to see when I/we can test it.

I have no real experience with point 1. I think point 2 is likely only important for debugging; I don't think we would need a full history of the SLURM jobs that ran.

@Andrew-S-Rosen (Collaborator) commented Sep 19, 2024

> With this implementation I managed to run multiple atomate2 VASP Jobs in parallel in a single job submitted to SLURM.

One question I have regarding the implementation: a common scenario on MP's side is to request a huge allocation (e.g. 1024 nodes) and run concurrent VASP jobs on that allocation (e.g. 4 nodes per VASP job, so at most 256 concurrent jobs in one SLURM allocation). When one of those jobs finishes, we don't necessarily (in this use case) want to leave the nodes empty; we want to pull in a new job so that all resources are fully utilized until the walltime is hit. Based on the description, I wasn't sure whether the "continue pulling in new jobs" aspect was covered or not.

Is this something within scope for this PR?

@gpetretto (Contributor, Author)

In principle this should allow more or less what is available in FireWorks: after one Job finishes, another one can be started. You can specify a maximum number of jobs to be executed, or a timeout after which no new jobs will be started, to avoid hitting the walltime.
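As a rough sketch of these two knobs in the batch section of the worker config (max_jobs is the name mentioned later in the review discussion; the name and unit of the timeout option are assumptions here):

```yaml
    batch:
      jobs_handle_dir: /scratch/user/jfr/batch_handle
      work_dir: /scratch/user/jfr/batch_work
      parallel_jobs: 4
      # stop after this many jobflow Jobs have been executed in one batch job
      max_jobs: 10
      # do not start new Jobs after this many seconds, to leave room
      # before the walltime of the SLURM allocation (name assumed)
      max_time: 82800
```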

Having collected a few more comments, I think that at this point the best strategy would be to proceed with the review and merge of the PR. It will be easier for people to try it and give feedback once it is in an official release. The documentation still specifies that this should be considered an experimental option, as it has not been extensively tested.

@Andrew-S-Rosen (Collaborator)

That sounds wonderful! I think releasing as experimental is probably the best way to get input.

We will definitely try it out here, but sadly not for a few months: we are waiting on graduate students to join the group and on our dedicated MongoDB server to be installed this month. But we will for sure be stress-testing things once that's taken care of.

@JaGeo (Collaborator) commented Sep 20, 2024

@gpetretto I agree! We will test this as soon as it is in the main branch or in an official version.

@gpetretto changed the title from "[WIP] Parallel batch submission" to "Parallel batch submission" on Sep 20, 2024
@davidwaroquiers (Member) left a comment

As discussed, I think we can go ahead with this, mentioning that it is currently experimental, as we need feedback from users. The only thing I would still change is the name of the max_jobs option of a given batch submission to max_jobs_per_batch, as we also discussed.

Successfully merging this pull request may close these issues:

  • Missing docs: batch mode
  • Does jobflow-remote support the pilot job model?