Parallel batch submission #172

Merged (5 commits into develop on Oct 24, 2024)

Conversation

@gpetretto (Contributor) commented Sep 5, 2024

Following a long-standing request (see #86), I have started implementing an option to allow the execution of parallel jobs in batch mode. Until now, only one job at a time could run in a batch job (see the docs).
With this implementation I managed to run multiple atomate2 VASP Jobs in parallel in a single job submitted to SLURM.
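For orientation, here is a minimal sketch of what a worker configuration enabling this could look like. The nested batch section with jobs_handle_dir and work_dir follows the existing batch-mode setup; the parallel_jobs key, the worker name, the host and all paths are assumptions for illustration and should be checked against the documentation once released.

```yaml
workers:
  batch_worker:                        # hypothetical worker name
    type: remote
    host: cluster.example.edu          # hypothetical host
    scheduler_type: slurm
    work_dir: /scratch/user/jfr/run
    pre_run: source activate atomate2  # environment providing atomate2/VASP
    batch:
      # directories used to exchange information with the batch processes
      jobs_handle_dir: /scratch/user/jfr/batch_handle
      work_dir: /scratch/user/jfr/batch_work
      # number of jobflow Jobs executed concurrently inside one SLURM job
      # (the option introduced by this PR; exact name assumed here)
      parallel_jobs: 4
```

With a setup along these lines, the runner submits a single SLURM job whose process pulls jobflow Jobs from the queue and, with several parallel slots available, keeps them running side by side until no more Jobs are available.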

I have two main doubts about the implementation, and I would be interested in comments from people who have used this functionality with FireWorks.

  1. In FireWorks it is possible to specify the name of a nodefile, so that a list of nodes/cores can be generated:
    https://github.com/materialsproject/fireworks/blob/96300617c8a3471f83e1b8aed978552e6ac6a15d/fireworks/scripts/mlaunch_run.py#L57
    https://github.com/materialsproject/fireworks/blob/e9f2004d3d861e2de288f0023ecdb5ccd9114939/fireworks/features/multi_launcher.py#L235
    However, it seems that this information is never used in atomate/atomate2. So, if it is not really useful, I would rather not add it now and keep it in mind in case an explicit request comes in in the future. Can anybody confirm that it is not used in typical FireWorks multi-launch executions?
  2. At the moment there is no way to monitor the SLURM/PBS/... jobs (i.e., the jobs that run multiple jobflow Jobs) that have been submitted and have finished running. These do not fit in the regular job documents, since there is no 1-to-1 correspondence between the SLURM jobs and the jobflow Jobs. It is, however, possible to get a list of those that are currently running. This is mainly a problem if something goes wrong inside the batch execution and one has to track down its logs (the logs of the SLURM job, not of the jobflow Job; the latter will still be easily accessible), or if one wants to check that the batch submission is actually working fine. The questions here are: how important would it be to have the full history of all the batch jobs submitted to SLURM/PBS? The most straightforward way would be to add a new collection to contain these data (see the sketch after this list); is it worth adding it? Any alternative suggestions?
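Purely as an illustration of point 2, a record in such a dedicated collection might contain something along these lines; every field name and value below is hypothetical and not part of the current implementation:

```yaml
# hypothetical entry of a "batch_processes" collection (sketch only)
batch_uuid: 8f1c2d3e-0000-0000-0000-000000000000   # internal id of the batch process
worker: batch_worker                               # worker that submitted it
queue_job_id: "4815162"                            # id assigned by SLURM/PBS
state: FINISHED                                    # e.g. SUBMITTED / RUNNING / FINISHED
submitted_on: 2024-09-05T10:00:00
finished_on: 2024-09-05T16:00:00
log_dir: /scratch/user/jfr/batch_work/8f1c2d3e     # where the SLURM job logs live
executed_jobs:                                     # uuids of the jobflow Jobs it ran
  - 1b2c3d4e-0000-0000-0000-000000000001
  - 1b2c3d4e-0000-0000-0000-000000000002
```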

I think @Andrew-S-Rosen, @JaGeo, @utf may be interested in the feature.

Fixes #86
Fixes #96

TODO

  • Update the documentation.
  • Tests? I still have to figure out whether it is possible to effectively test that the jobs really run in parallel. Maybe just add tests so that the code is exercised and trivial regressions are avoided.

@codecov-commenter commented Sep 5, 2024

Codecov Report

Attention: Patch coverage is 39.15663% with 101 lines in your changes missing coverage. Please review.

Project coverage is 72.82%. Comparing base (6d6ed9d) to head (5e88519).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/jobflow_remote/jobs/run.py | 5.40% | 34 Missing and 1 partial ⚠️ |
| src/jobflow_remote/jobs/runner.py | 64.15% | 18 Missing and 1 partial ⚠️ |
| src/jobflow_remote/cli/batch.py | 41.37% | 17 Missing ⚠️ |
| src/jobflow_remote/cli/formatting.py | 10.52% | 17 Missing ⚠️ |
| src/jobflow_remote/jobs/batch.py | 42.10% | 9 Missing and 2 partials ⚠️ |
| src/jobflow_remote/jobs/jobcontroller.py | 66.66% | 1 Missing and 1 partial ⚠️ |
Additional details and impacted files
@@             Coverage Diff             @@
##           develop     #172      +/-   ##
===========================================
- Coverage    73.65%   72.82%   -0.84%     
===========================================
  Files           46       47       +1     
  Lines         6070     6207     +137     
  Branches       953      979      +26     
===========================================
+ Hits          4471     4520      +49     
- Misses        1252     1337      +85     
- Partials       347      350       +3     
| Files with missing lines | Coverage Δ |
|---|---|
| src/jobflow_remote/cli/__init__.py | 100.00% <100.00%> (ø) |
| src/jobflow_remote/cli/execution.py | 92.30% <ø> (ø) |
| src/jobflow_remote/config/base.py | 83.71% <100.00%> (+0.07%) ⬆️ |
| src/jobflow_remote/jobs/jobcontroller.py | 78.31% <66.66%> (-0.04%) ⬇️ |
| src/jobflow_remote/jobs/batch.py | 63.73% <42.10%> (-5.60%) ⬇️ |
| src/jobflow_remote/cli/batch.py | 41.37% <41.37%> (ø) |
| src/jobflow_remote/cli/formatting.py | 80.83% <10.52%> (-4.99%) ⬇️ |
| src/jobflow_remote/jobs/runner.py | 77.79% <64.15%> (-1.06%) ⬇️ |
| src/jobflow_remote/jobs/run.py | 42.30% <5.40%> (-11.78%) ⬇️ |

@Andrew-S-Rosen (Collaborator)

Thank you, @gpetretto, for this very exciting contribution!! Unfortunately, I won't be able to provide much input at this time, but I will be monitoring the progress here.

@JaGeo (Collaborator) commented Sep 6, 2024

I am certainly very interested. I will have to see when I/we can test it.

I have no real experience with point 1. I think point 2 is likely only important for debugging; I don't think we would need a full history of the SLURM jobs that ran.

@Andrew-S-Rosen (Collaborator) commented Sep 19, 2024

> With this implementation I managed to run multiple atomate2 VASP Jobs in parallel in a single job submitted to SLURM.

One question I have regarding the implementation: a common scenario on MP's side is to request a huge allocation (e.g. 1024 nodes) and run concurrent VASP jobs on that allocation (e.g. 4 nodes per VASP job, so at most 256 concurrent jobs in one SLURM allocation). When one of those jobs finishes, we don't necessarily (in this use case) want to leave the nodes empty; we want to pull in a new job so that all resources are fully utilized until the walltime is hit. Based on the description, I wasn't sure whether the "continue pulling in new jobs" aspect was covered or not.

Is this something within scope for this PR?

@gpetretto (Contributor, Author)

In principle this should allow more or less what is available in FireWorks: after one Job finishes, another one can be started. You can specify a maximum number of jobs to be executed, or a timeout after which no new jobs will be started, to avoid hitting the walltime.
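As a rough sketch of these two knobs in the batch section of the worker config (max_jobs is the name mentioned later in the review discussion; the name and unit of the timeout option are assumptions here):

```yaml
    batch:
      jobs_handle_dir: /scratch/user/jfr/batch_handle
      work_dir: /scratch/user/jfr/batch_work
      parallel_jobs: 4
      # stop after this many jobflow Jobs have been executed in one batch job
      max_jobs: 10
      # do not start new Jobs after this many seconds, to leave room
      # before the walltime of the SLURM allocation (name assumed)
      max_time: 82800
```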

Having collected a few more comments, I think that at this point the best strategy would be to proceed with the review and merge of the PR. It will be easier for people to try it and give feedback once it is in an official release. The documentation still specifies that this should be considered an experimental option, as it has not been extensively tested.

@Andrew-S-Rosen (Collaborator)

That sounds wonderful! I think releasing as experimental is probably the best way to get input.

We will definitely try it out here, but sadly not for a few months: we are waiting on graduate students to join the group and on our dedicated MongoDB server to be installed this month. But we will for sure be stress-testing things once that's taken care of.

@JaGeo (Collaborator) commented Sep 20, 2024

@gpetretto I agree! We will test this as soon as it is in the main branch or in an official version.

@gpetretto changed the title from "[WIP] Parallel batch submission" to "Parallel batch submission" on Sep 20, 2024
@davidwaroquiers (Member) left a comment

As discussed, I think we can go ahead with this, mentioning that it is currently experimental, as we need feedback from users. The only thing I would still change is the name of the max_jobs option of a given batch submission to max_jobs_per_batch, as we also discussed.

Successfully merging this pull request may close these issues:

  • Missing docs: batch mode
  • Does jobflow-remote support the pilot job model?