Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

support partially allocated jobs across scheduler reload #6445

Merged
merged 6 commits into from
Dec 17, 2024

Conversation

garlick
Copy link
Member

@garlick garlick commented Nov 16, 2024

This is a proof of concept implementation of the changes to the scheduler hello protocol proposed in flux-framework/rfc#433, in which support is added for reloading the scheduler with housekeeping running and some nodes of job(s) already released.

The scheduler indicates it supports this by calling schedutil_create() with the SCHEDUTIL_HELLO_PARTIAL_OK flag. When it sets that, it agrees to parse an optional allocated key in each hello response. The allocated key is set to an idset representing the subset of ranks of R that are actually allocated. If missing, all ranks are assumed to be allocated.

We need to get some feedback from @milroy, @trws, et al to make sure this approach works for fluxion. I thought working through this with sched-simple would be helpful to illustrate the idea.

@trws
Copy link
Member

trws commented Nov 18, 2024

I think this would work, but as I look over it I find myself wondering if we might be better off either with a list of things that have been released, or actually leaving this out of hello entirely and using a partial release after the hello response from the scheduler. The main reason is that this approach sends what is allocated as R, then we have to work out what to release by taking the difference between what is allocated and what needs to be removed before we can act on it. If instead the released portion is sent, either in hello or as a partial release, we can reuse the same code we already have for partial release without introducing a new code path into what's already proven somewhat complex or fragile.

That said, working through this as I write, it looks like the job-manager has the allocated resources in this format as pending. That makes me wonder if we should send that pending R as the allocation R as part of the base protocol rather than sending the initial R. As it is in this PR we're sending both right? We certainly need the whole R in job tracking, but I'm not sure the scheduler actually needs it, and it might make the whole thing a bit easier on both ends.

@garlick
Copy link
Member Author

garlick commented Nov 18, 2024

leaving this out of hello entirely and using a partial release after the hello response from the scheduler

Duh! Why didn't I think of that? Let me explore that one and see how it goes.

In general, I'm in favor of making the job manager offload work from scheduler(s). It seems like your idea would make it "just work" wtihout any changes.

@garlick
Copy link
Member Author

garlick commented Nov 18, 2024

That makes me wonder if we should send that pending R as the allocation R as part of the base protocol rather than sending the initial R. As it is in this PR we're sending both right? We certainly need the whole R in job tracking, but I'm not sure the scheduler actually needs it, and it might make the whole thing a bit easier on both ends.

No, in the current proposal we are just sending the allocated idset (ranks) and some basic job metadata like the id in the RPC responses, then libschedutil looks up the job's original R (including JGF) from the KVS and passes it to the scheduler in its callback.

@garlick
Copy link
Member Author

garlick commented Nov 18, 2024

OK, actually, it's quite the pain to send a free after the hello is finished, because the scheduler has to send the ready request before we can do that, which is handled in a different part of the job manager and pretty inconvenient to synchronize with.

Would sending a free idset make things a little easier than the current proposed allocated set? I'd be fine with that!

@grondo
Copy link
Contributor

grondo commented Nov 18, 2024

Could libschedutil do the work of creating the partial R from the original R and free idset, then directly call the scheduler's free callback if necessary after each hello callback?

Edit: (sorry if this is a naive suggestion, I haven't gone back to look at the actual implementation before suggesting it)

@garlick
Copy link
Member Author

garlick commented Nov 19, 2024

Could libschedutil do the work of creating the partial R from the original R and free idset, then directly call the scheduler's free callback if necessary after each hello callback?

Oh good idea, that makes sense to me. That gets around the fact that the R fragment in the job manager is missing the JGF.

@trws
Copy link
Member

trws commented Nov 19, 2024

Yeah I think having the "free" idset and possibly also doing as @grondo suggested and having schedutil do the translation into a free call would work really well if it's not too difficult to factor that way.

@garlick
Copy link
Member Author

garlick commented Nov 19, 2024

Yeah I think having the "free" idset and possibly also doing as @grondo suggested and having schedutil do the translation into a free call would work really well if it's not too difficult to factor that way.

Great! I'll push a rework of this PR with those changes shortly.

@garlick
Copy link
Member Author

garlick commented Nov 19, 2024

ok, pushed those changes. This is built on top of #6450 currently - will rebase when that gets merged.

The RFC PR will need a small rework for free instead of allocated.

@garlick garlick force-pushed the issue#6089 branch 2 times, most recently from 5566515 to 24436e3 Compare November 20, 2024 15:41
@garlick garlick changed the title WIP: support partially allocated jobs across scheduler reload support partially allocated jobs across scheduler reload Nov 20, 2024
@garlick
Copy link
Member Author

garlick commented Nov 20, 2024

flux-framework/rfc#433 was updated to specify the (optional) free key instead of allocated as originally proposed.

Then schedutil was modified so that the hello callback subtracts the ranks in the free key (if present) from the R that it looks up in the KVS. R should still include the full original scheduling object.

The current behavior is preserved unless the scheduler sets the SCHEDUTIL_HELLO_PARTIAL_OK flag.

@garlick
Copy link
Member Author

garlick commented Dec 17, 2024

Was just prepping a PR for the scheduler side, and it does look like there might be a bit of work to do over there. Could we perhaps get this merged on the core side? It should behave as before until the scheduler sets the new flag.

Copy link
Contributor

@grondo grondo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, seems fine to merge now as long as no issues with the implementation were discovered on the sched side while working support over there!

@garlick
Copy link
Member Author

garlick commented Dec 17, 2024

I don't think we have any design issues over there. It didn't work in my testing but I think it's likely something that needs fixing on the fluxion side. I'll set MWP, thanks.

Problem: RFC 27 allows the scheduler to send a partial-ok flag
in the hello request, and then receive partially allocated jobs
in hello responses.

If the hello request includes this flag, pass it on to housekeeping.
For each partially released housekeeping job, include the 'free'
idset in the response per RFC 27.
Problem: libschedutil provides no way for the scheduler to
indicate that the partial-ok flag should be set in the hello
request.

Add the SCHEDUTIL_HELLO_PARTIAL_OK flag which is passed to
schedutil_create().
Problem: when processing hello responses, all schedulers now need
to process R - free for partial releases.

As a convenience, change the libschedutil hello callback to subtract
the free idset from the R it fetched from the KVS.

Note that the scheduling key, if present, remains the full object
which is opaque to flux-core.
Problem: sched-simple does not support partial hello responses.

Set the SCHEDUTIL_HELLO_PARTIAL_OK flag.
Add a 'test-hello-nopartial' module option to get the old behavior.

Set test-hello-nopartial in the current test of partial housekeeping
release.
Problem: there is no coverage of reloading the scheduler with
partially released jobs in housekeeping.

Add a test.
Problem: when the hello protocol cannot process a job, it logs
the name of the wrong rlist function.

Make the log message a little more high level.
@mergify mergify bot merged commit 09d70fa into flux-framework:master Dec 17, 2024
34 of 35 checks passed
Copy link

codecov bot commented Dec 17, 2024

Codecov Report

Attention: Patch coverage is 80.88235% with 13 lines in your changes missing coverage. Please review.

Project coverage is 83.63%. Comparing base (280e509) to head (f580637).
Report is 7 commits behind head on master.

Files with missing lines Patch % Lines
src/modules/job-manager/housekeeping.c 72.72% 9 Missing ⚠️
src/common/libschedutil/hello.c 86.95% 3 Missing ⚠️
src/modules/job-manager/alloc.c 83.33% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #6445      +/-   ##
==========================================
+ Coverage   83.61%   83.63%   +0.02%     
==========================================
  Files         522      522              
  Lines       87680    87734      +54     
==========================================
+ Hits        73310    73380      +70     
+ Misses      14370    14354      -16     
Files with missing lines Coverage Δ
src/modules/sched-simple/sched.c 77.34% <100.00%> (+0.21%) ⬆️
src/modules/job-manager/alloc.c 75.26% <83.33%> (-0.01%) ⬇️
src/common/libschedutil/hello.c 74.64% <86.95%> (+4.64%) ⬆️
src/modules/job-manager/housekeeping.c 82.46% <72.72%> (-0.21%) ⬇️

... and 11 files with indirect coverage changes

@garlick garlick deleted the issue#6089 branch December 17, 2024 14:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants