plugin: enforce max resource limits across an association's running jobs #559

Open
3 of 7 tasks
cmoussa1 opened this issue Jan 7, 2025 · 1 comment
Labels
feature tracking Tracking issue for larger feature made up of smaller issues plugin related to the multi-factor priority plugin

cmoussa1 commented Jan 7, 2025

Creating a tracking issue here to outline the idea for enforcing a max number of resources used across an association's set of running jobs. I already have a couple of open issues similar to this, but it would be useful to reorganize some thoughts after some helpful offline discussion.

The need here is to limit how many resources (e.g., nodes, cores) an association can have allocated at any given time across all of their running jobs. As noted in flux-config-policy(5), the limit checks take place before the scheduler sees the request because [the plugin] does not have detailed resource information.

So, it seems a realistic solution here would be to configure a max resources limit that is both a max nodes and a max cores limit. The priority plugin should be able to keep track of both by looking at the jobspec when a job enters RUN state. It can increment/decrement current node and core counts per-association across all of their running jobs. Then, when a submitted job enters DEPEND state, the job's size can be checked to see if adding its resources to the association's currently allocated resources would put them over the max (i.e., over either the nodes or the cores limit). If so, the job can be held until a currently running job exits.
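
A minimal sketch of the limit check described above, written in Python for brevity and not taken from the plugin. The field names follow the ones used in this issue; the `Association` class and the `would_exceed` helper are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Association:
    """Per-association resource accounting (hypothetical container)."""
    max_nodes: int
    max_cores: int
    cur_nodes: int = 0
    cur_cores: int = 0


def would_exceed(assoc: Association, nnodes: int, ncores: int) -> bool:
    """Return True if adding this job would pass either limit."""
    return (assoc.cur_nodes + nnodes > assoc.max_nodes
            or assoc.cur_cores + ncores > assoc.max_cores)


assoc = Association(max_nodes=4, max_cores=64, cur_nodes=3, cur_cores=16)
print(would_exceed(assoc, nnodes=1, ncores=8))  # False: fits under both limits
print(would_exceed(assoc, nnodes=2, ncores=8))  # True: would pass max_nodes
```

Checking both counts with a single `or` means a job is held as soon as either limit would be passed, which matches the "either over the nodes or cores limits" condition above.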

There are a couple of prerequisites to get this kind of support into flux-accounting:

Tasks

  1. database merge-when-passing new feature plugin

I've done some playing around today with a rough sketch and it looks like the first four tasks listed are pretty straightforward. After copying over the jj code from flux-core, I'm able to extract job size counts and add them to or subtract them from an association's cur_nodes and cur_cores attributes as jobs enter the RUN and INACTIVE states, respectively.
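
To illustrate what "extracting job size counts" could look like, here is a simplified Python stand-in for the counting step. It is not the jj code: it assumes a V1-style jobspec whose `resources` section nests `node`, `slot`, and `core` vertices with integer counts, and the function name `count_resources` is hypothetical.

```python
def count_resources(jobspec: dict) -> tuple[int, int]:
    """Return (nnodes, ncores) for a V1-style jobspec.

    Simplified sketch: only integer counts are handled, and each
    vertex's count is multiplied by the counts of its ancestors.
    """
    nnodes = 0
    ncores = 0

    def walk(vertices, multiplier):
        nonlocal nnodes, ncores
        for vertex in vertices:
            count = vertex["count"]
            if not isinstance(count, int):
                raise ValueError("only integer counts are handled here")
            total = multiplier * count
            if vertex["type"] == "node":
                nnodes += total
            elif vertex["type"] == "core":
                ncores += total
            walk(vertex.get("with", []), total)

    walk(jobspec["resources"], 1)
    return nnodes, ncores


# 2 nodes, each with 4 slots of 2 cores
by_node = {"resources": [{"type": "node", "count": 2, "with": [
    {"type": "slot", "count": 4, "label": "task", "with": [
        {"type": "core", "count": 2}]}]}]}
print(count_resources(by_node))  # (2, 16)
```

A request that names only slots and cores yields `nnodes == 0` in this sketch, so a node limit alone would never see such a job.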

I'll plan to start opening incremental PRs to add this kind of support into flux-accounting.

@cmoussa1 cmoussa1 added feature tracking Tracking issue for larger feature made up of smaller issues plugin related to the multi-factor priority plugin labels Jan 7, 2025

cmoussa1 commented Jan 8, 2025

Had a helpful offline discussion with @ryanday36 about a possible implementation plan for how this might work in the priority plugin:

The priority plugin will store max_nodes, max_cores, cur_nodes, and cur_cores per-association in its internal map. This information can be queried with flux jobtap query to see where an association stands at any given time.
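
As a rough picture of that per-association state, the sketch below models the map in Python. The userid-to-bank layout, the `Association` class, and the `query` helper are assumptions for illustration; the JSON it produces is not the actual output of `flux jobtap query`.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class Association:
    max_nodes: int
    max_cores: int
    cur_nodes: int = 0
    cur_cores: int = 0


# Hypothetical layout: userid -> bank name -> Association
associations = {
    5001: {"bankA": Association(max_nodes=4, max_cores=64)},
}


def query(userid: int) -> str:
    """Serialize one user's associations as JSON (illustrative shape only)."""
    banks = associations.get(userid, {})
    return json.dumps({bank: asdict(a) for bank, a in banks.items()},
                      sort_keys=True)


print(query(5001))
```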

When a job proceeds to job.state.run, its resource information will be extracted from jobspec. The plugin will use the jj code to count both nnodes and ncores and increment the association's cur_nodes and cur_cores counts accordingly.

While an association has jobs running, newly submitted jobs will have their resource counts checked in job.state.depend. If a job's resource counts (nnodes or ncores) would put the association over either their max_nodes or max_cores limit, the job will have an accounting-specific dependency added to it describing that the association has hit their max resources limit, and the job will be held.
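
The hold step could be sketched as follows. This is a Python model of the logic only, not jobtap code: the dependency name `max-resources-limit`, the `Job` and `Association` classes, and `on_depend` are all hypothetical, since the issue does not name them.

```python
from dataclasses import dataclass, field

# Hypothetical dependency name; the issue does not specify one.
MAX_RESOURCES_DEP = "max-resources-limit"


@dataclass
class Association:
    max_nodes: int
    max_cores: int
    cur_nodes: int = 0
    cur_cores: int = 0
    held_jobs: list = field(default_factory=list)


@dataclass
class Job:
    jobid: int
    nnodes: int
    ncores: int
    dependencies: set = field(default_factory=set)


def on_depend(assoc: Association, job: Job) -> bool:
    """Hold the job if it would push the association past either limit.

    Returns True if the job may proceed, False if it was held.
    """
    over_nodes = assoc.cur_nodes + job.nnodes > assoc.max_nodes
    over_cores = assoc.cur_cores + job.ncores > assoc.max_cores
    if over_nodes or over_cores:
        job.dependencies.add(MAX_RESOURCES_DEP)
        assoc.held_jobs.append(job.jobid)
        return False
    return True
```

Recording the held job IDs on the association gives the release path something to walk later without scanning every pending job.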

Jobs will be held until a currently running job transitions to INACTIVE. When the running job transitions to INACTIVE, its resources will again be extracted from jobspec and subtracted from the association's cur_nodes and cur_cores counts. Then, when the association's cur_running_jobs count is checked to ensure that they are allowed to have a running job at this moment, the held job's resource counts will also be checked to ensure that the association would not go over their max (I still need to see if I can retrieve a jobspec in a jobtap plugin with just the jobid). If the job fits, it can be released and proceed to RUN.
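
Putting the RUN and INACTIVE steps together, a minimal Python model of the lifecycle might look like this. It is a sketch under assumptions rather than the plugin's implementation: `max_running_jobs` is a hypothetical name for the existing running-jobs limit, held jobs are released oldest first, and at most one job is released per INACTIVE event.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Job:
    jobid: int
    nnodes: int
    ncores: int


@dataclass
class Association:
    max_nodes: int
    max_cores: int
    max_running_jobs: int  # hypothetical name for the running-jobs limit
    cur_nodes: int = 0
    cur_cores: int = 0
    cur_running_jobs: int = 0
    held: deque = field(default_factory=deque)


def on_run(assoc: Association, job: Job) -> None:
    """Charge the job's resources to the association when it enters RUN."""
    assoc.cur_nodes += job.nnodes
    assoc.cur_cores += job.ncores
    assoc.cur_running_jobs += 1


def on_inactive(assoc: Association, job: Job):
    """Return a finished job's resources, then release the oldest held job
    if it now fits. Assumes the finished job previously passed through RUN.
    """
    assoc.cur_nodes -= job.nnodes
    assoc.cur_cores -= job.ncores
    assoc.cur_running_jobs -= 1
    if not assoc.held:
        return None
    nxt = assoc.held[0]
    fits = (assoc.cur_running_jobs < assoc.max_running_jobs
            and assoc.cur_nodes + nxt.nnodes <= assoc.max_nodes
            and assoc.cur_cores + nxt.ncores <= assoc.max_cores)
    return assoc.held.popleft() if fits else None
```

The sketch does not model the window between a job being released and that job reaching RUN, so it says nothing about how counts should be handled during that gap.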
