Dynamic partitions #1855

Open
matt-chan opened this issue Feb 22, 2024 · 4 comments
Labels
kind/feature New feature request

Comments

@matt-chan
Contributor

In what area(s)?

/area administration
/area ansible
/area autoscaling
/area configuration
/area cyclecloud
/area documentation
/area image
/area job-scheduling
/area monitoring
/area ood
/area remote-visualization
/area user-management

Describe the feature

Do we expose the dynamic partitions that CycleCloud (CC) adds in 8.4? I think it would be useful if we could allocate smaller nodes when the job is smaller, e.g. running a 4-CPU job on an HB16 rather than an HB120.

@matt-chan matt-chan added the kind/feature New feature request label Feb 22, 2024
@matt-chan
Contributor Author

cc @ltalirz

@xpillons
Collaborator

I'm not sure about the exact scenario. It adds a lot of complexity, and I'm not sure of the value it provides.

@ltalirz
Contributor

ltalirz commented Feb 22, 2024

I think what Matt is saying here is:

For those VM series where Azure provides breakdowns into different sizes (e.g. NC24ads A100 v4, NC48ads A100 v4, NC96ads A100 v4), bundle those in one partition and then, based on the number of CPUs/GPUs requested, have Slurm request the smallest one that fulfils the requirements of the job.

It does not really apply to the HB series, since the smaller sizes there are just the same node with a restricted number of CPUs at the same price, but it would e.g. also apply to the F series.
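
To make the "smallest size that fits" idea concrete, here is a minimal sketch of the selection logic in Python. It is an illustration only, not how CycleCloud or Slurm implement it: the `VmSize` class, the `NC_A100_V4` table and the `smallest_fit` function are hypothetical names, and the table assumes the three NC A100 v4 sizes mentioned above with their vCPU and GPU counts.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class VmSize:
    """Hypothetical description of one VM size inside a shared partition."""
    name: str
    vcpus: int
    gpus: int


# Illustrative table: the NC A100 v4 sizes bundled into one partition.
NC_A100_V4 = [
    VmSize("Standard_NC24ads_A100_v4", vcpus=24, gpus=1),
    VmSize("Standard_NC48ads_A100_v4", vcpus=48, gpus=2),
    VmSize("Standard_NC96ads_A100_v4", vcpus=96, gpus=4),
]


def smallest_fit(sizes: Sequence[VmSize], cpus: int, gpus: int = 0) -> Optional[VmSize]:
    """Return the smallest size that satisfies the request, or None if none fits."""
    candidates = [s for s in sizes if s.vcpus >= cpus and s.gpus >= gpus]
    # "Smallest" is taken to mean fewest vCPUs, with GPU count as a tie-breaker.
    return min(candidates, key=lambda s: (s.vcpus, s.gpus), default=None)


if __name__ == "__main__":
    for cpus, gpus in [(4, 1), (30, 1), (8, 3), (200, 0)]:
        choice = smallest_fit(NC_A100_V4, cpus, gpus)
        print(cpus, gpus, choice.name if choice else "no size fits")
```

A 4-CPU, 1-GPU job would land on the NC24ads size here, while a job asking for 3 GPUs is pushed to NC96ads even though it needs few CPUs.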

@matt-chan
Contributor Author

Ah, I forgot that the HB series carries the same price across all sizes. Yes, for scenarios where you only want part of a node, I think this might be useful. Under heavy load, though, I think the cost savings will shrink or disappear. It can still provide better isolation between jobs (e.g. one bad job can no longer fill up /tmp).
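
A rough way to see the load effect is sketched below. It is a toy model under loud assumptions: the prices are made-up placeholders (not Azure prices), pricing is assumed to scale linearly with node size, and jobs are assumed to pack perfectly onto shared large nodes. The function names are hypothetical.

```python
import math


def cost_shared_large(jobs: int, cpus_per_job: int, node_cpus: int, node_price: float) -> float:
    """Hourly cost when jobs are packed onto shared large nodes."""
    jobs_per_node = node_cpus // cpus_per_job
    nodes = math.ceil(jobs / jobs_per_node)
    return nodes * node_price


def cost_dedicated_small(jobs: int, small_price: float) -> float:
    """Hourly cost when every job gets its own small node."""
    return jobs * small_price


if __name__ == "__main__":
    # Placeholder prices: a 96-CPU node at 4.0/h, a 24-CPU node at 1.0/h.
    LARGE_CPUS, LARGE_PRICE, SMALL_PRICE = 96, 4.0, 1.0
    for jobs in (1, 2, 4, 8):
        shared = cost_shared_large(jobs, 24, LARGE_CPUS, LARGE_PRICE)
        dedicated = cost_dedicated_small(jobs, SMALL_PRICE)
        print(f"{jobs} jobs: shared={shared}, dedicated={dedicated}")
```

With a single 24-CPU job the small node is four times cheaper in this model; once enough jobs are queued to fill the large nodes, both approaches cost the same and only the isolation benefit remains.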
