plugin: enforce max resource limits across an association's running jobs #559
Labels: feature tracking (tracking issue for larger feature made up of smaller issues), plugin (related to the multi-factor priority plugin)
Creating a tracking issue here to outline the idea for enforcing a max number of resources used across an association's set of running jobs. I already have a couple of open issues similar to this, but it would probably be useful to reorganize some thoughts after some helpful offline discussion.
The need here is to be able to limit how many resources (e.g., nodes, cores) an association can have at any given time across all of their running jobs. As noted in flux-config-policy(5), the limit checks take place before the scheduler sees the request because [the plugin] does not have detailed resource information.
So, it seems a realistic solution here would be to configure a max resources limit that is both a max nodes and a max cores limit. The priority plugin should be able to keep track of both when a job enters RUN state by looking at the jobspec. It can increment/decrement the current node and core counts per-association across all of their running jobs. Then, when a submitted job enters DEPEND state, the job's size can be checked to see if adding its resources to the association's currently allocated resources would put them over the max (i.e., over either the nodes limit or the cores limit). If so, the job can be held until a currently running job exits.
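The bookkeeping described above can be sketched roughly as follows. This is only an illustration in Python, not the plugin's actual code: the `Association` class, the `held` list, and the `on_depend`/`on_run`/`on_inactive` function names are all hypothetical, and the job sizes are assumed to have already been extracted from the jobspec.

```python
from dataclasses import dataclass, field


@dataclass
class Association:
    """Per-association limits and running totals (all names hypothetical)."""
    max_nodes: int
    max_cores: int
    cur_nodes: int = 0
    cur_cores: int = 0
    # Jobs held in DEPEND state, stored as (job_id, nodes, cores).
    held: list = field(default_factory=list)

    def fits(self, nodes: int, cores: int) -> bool:
        # A job fits only if it stays within BOTH the node and core limits.
        return (self.cur_nodes + nodes <= self.max_nodes
                and self.cur_cores + cores <= self.max_cores)


def on_depend(assoc: Association, job_id: int, nodes: int, cores: int) -> bool:
    """Return True if the job may proceed, False if it is held."""
    if assoc.fits(nodes, cores):
        return True
    assoc.held.append((job_id, nodes, cores))
    return False


def on_run(assoc: Association, nodes: int, cores: int) -> None:
    """Add a job's size to the association's running totals."""
    assoc.cur_nodes += nodes
    assoc.cur_cores += cores


def on_inactive(assoc: Association, nodes: int, cores: int) -> list:
    """Subtract a finished job's size and return IDs of held jobs that now fit."""
    assoc.cur_nodes -= nodes
    assoc.cur_cores -= cores
    released, still_held = [], []
    # Reserve room for jobs released in this pass so they are not double-counted.
    pending_nodes = pending_cores = 0
    for job_id, n, c in assoc.held:
        if assoc.fits(pending_nodes + n, pending_cores + c):
            released.append(job_id)
            pending_nodes += n
            pending_cores += c
        else:
            still_held.append((job_id, n, c))
    assoc.held = still_held
    return released
```

Checking both limits in a single `fits()` call mirrors the idea that the max resources limit is a pair of limits, so exceeding either one is enough to hold the job.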
There are a couple of prerequisites to get this kind of support into flux-accounting:
Tasks
- `association_table`: add `max_cores` attribute, send information to plugin #560

I've done some playing around today with a rough sketch and it looks like the first four tasks listed are pretty straightforward; copying over the jj code from flux-core, I'm able to extract job size counts and add/subtract them from an association's `cur_nodes` and `cur_cores` attributes as jobs enter RUN and INACTIVE states.

I'll plan to start opening incremental PRs to add this kind of support into flux-accounting.
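For illustration, extracting job size counts from a jobspec might look something like the sketch below. This is not the jj code from flux-core; it is a simplified Python stand-in that assumes the jobspec's `resources` section is a nested list of `type`/`count`/`with` entries with integer counts. The `count_resources` function name is hypothetical.

```python
def count_resources(resources, multiplier=1):
    """Return (nodes, cores) requested by a jobspec-style resource list.

    Assumes every "count" is a plain integer; each nested level is
    multiplied by the counts of the levels above it.
    """
    nodes = cores = 0
    for res in resources:
        total = multiplier * res["count"]
        if res["type"] == "node":
            nodes += total
        elif res["type"] == "core":
            cores += total
        # Recurse into children (e.g. node -> slot -> core).
        sub_nodes, sub_cores = count_resources(res.get("with", []), total)
        nodes += sub_nodes
        cores += sub_cores
    return nodes, cores


# Example: 2 nodes, each with 1 slot of 4 cores.
example = [
    {"type": "node", "count": 2, "with": [
        {"type": "slot", "count": 1, "label": "task", "with": [
            {"type": "core", "count": 4},
        ]},
    ]},
]
nodes, cores = count_resources(example)
```

A request that names only cores yields a node count of 0 here, so a real implementation would need to decide how such jobs count against the nodes limit.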