Implement rv1exec reader update to facilitate Fluxion reload #1176
Conversation
Fantastic! I'm going to do a little manual testing and see if I can come up with some ideas for automated testing. So, the expected shortcoming here is that a node-exclusive allocation may not remain exclusive after a scheduler reload? Does that matter? My understanding is that if you ask for, say, one core of a node with 32 cores but set the exclusive flag (or if the scheduler is configured for node exclusivity), you get all 32 cores. So on reload, the scheduler would just see R with the whole node allocated, and it shouldn't matter if the node was marked exclusive or not, because the "unused" cores are not really unused as far as the scheduler is concerned.
@grondo and I both gave this a quick sanity check where we submitted a
but not when queues are configured:
This same error was noted with
Hmm, my job that failed with queues configured did not set
See also: I wonder why it is important to put the job into the correct queue after it's already been allocated resources? Assuming that's just the container that is available, what if we just always put these jobs in the default queue?
Actually, this PR should handle that case. I am concerned that there are some edge cases where the exclusivity flags per edge aren't set correctly, and then the traversal here:
doesn't set the filters correctly. The implication is probably just loss of traversal optimization rather than a problem with the allocation.
That makes sense to me. Should I add logic to this PR to insert the job in the default queue regardless of whatever
I guess if it makes sense to you! Since I guess the default queue name isn't defined when named queues are being used, I just did this and it seems to work.
diff --git a/qmanager/modules/qmanager_callbacks.cpp b/qmanager/modules/qmanager_callbacks.cpp
index ae8b68f7..205e1101 100644
--- a/qmanager/modules/qmanager_callbacks.cpp
+++ b/qmanager/modules/qmanager_callbacks.cpp
@@ -125,10 +125,7 @@ int qmanager_cb_t::jobmanager_hello_cb (flux_t *h, const flux_msg_t *msg,
{
int rc = 0;
- json_t *o = NULL;
- json_error_t err;
std::string R_out;
- char *qn_attr = NULL;
std::string queue_name;
std::shared_ptr<queue_policy_base_t> queue;
std::shared_ptr<job_t> running_job = nullptr;
@@ -148,28 +145,8 @@ int qmanager_cb_t::jobmanager_hello_cb (flux_t *h, const flux_msg_t *msg,
goto out;
}
- if ( (o = json_loads (R, 0, &err)) == NULL) {
- rc = -1;
- errno = EPROTO;
- flux_log (h, LOG_ERR, "%s: parsing R for job (id=%jd): %s %s@%d:%d",
- __FUNCTION__, static_cast<intmax_t> (id),
- err.text, err.source, err.line, err.column);
- goto out;
- }
- if ( (rc = json_unpack (o, "{ s?:{s?:{s?:{s?:s}}} }",
- "attributes",
- "system",
- "scheduler",
- "queue", &qn_attr)) < 0) {
- json_decref (o);
- errno = EPROTO;
- flux_log (h, LOG_ERR, "%s: json_unpack for attributes", __FUNCTION__);
- goto out;
- }
-
- queue_name = qn_attr? qn_attr : ctx->opts.get_opt ().get_default_queue_name ();
- json_decref (o);
- queue = ctx->queues.at (queue_name);
+ queue_name = ctx->queues.begin()->first;
+ queue = ctx->queues.begin()->second;
running_job = std::make_shared<job_t> (job_state_kind_t::RUNNING,
id, uid, calc_priority (prio),
ts, R);
I'll see if I can work up a test script that covers the basics here. The tests that are failing (before the patch I posted above) are
I'm a bit confused. Those tests pass for me before applying the patch above and fail after applying the patch.
Oh, hmm, maybe I wasn't testing the right thing. Will double check.
I retested with your branch and that test is not failing. Sorry about that!
@milroy - I'm out of time for today but I did push a new test script to my rv1-update-jim branch.
I guess that test breakage is mine :-( |
Thanks for the help @garlick!
This problem appears to be due to the default queue name (
It looks like the default name getter is returning garbage:
queue_it = ctx->queues.find (ctx->opts.get_opt ().get_default_queue_name ());
if (queue_it != ctx->queues.end ()) {
queue = queue_it->second;
queue_name = queue_it->first;
} else {
flux_log (h, LOG_ERR, "%s: queue %s not recognized; default is %s ",
__FUNCTION__, queue_name,
ctx->opts.get_opt ().get_default_queue_name ());
rc = -1;
goto out;
}
And the log output:
Apr 16 01:17:52.765851 sched-fluxion-qmanager.err[0]: jobmanager_hello_cb: queue ???L?? not recognized; default is ??L??
The default is set here:
flux-sched/qmanager/modules/qmanager_opts.hpp Line 137 in cb95173
and there doesn't appear to be a way to update the value. The opt getter appears to be corrupting the value somehow:
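For illustration only, one common C++ pattern that produces exactly this kind of garbage is a getter chain in which an intermediate object is returned by value, so a reference obtained from the resulting temporary dangles. This is just a guess at the mechanism here; the class names below are made up and are not the real qmanager types.
#include <iostream>
#include <string>

struct toy_opts {
    std::string default_queue_name = "default";
    // Returns a reference into *this.
    const std::string &get_default_queue_name () const { return default_queue_name; }
};

struct toy_optmgr {
    toy_opts opts;
    // Suspect pattern: returning by value hands back a temporary copy.
    toy_opts get_opt () const { return opts; }
};

int main ()
{
    toy_optmgr mgr;
    // The temporary returned by get_opt () is destroyed at the end of this
    // declaration, so 'name' refers to a destroyed std::string; reading it
    // is undefined behavior and can print garbage like the log output above.
    const std::string &name = mgr.get_opt ().get_default_queue_name ();
    std::cout << name << std::endl;
    return 0;
}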
In case it helps, here is some background on the default queue - it is confusing in this context because queues in fluxion predate queues in flux-core. (I hope my fluxion assertions are correct as I'm making them from memory.) In flux-core, when no named queues are configured, there is a single anonymous queue. The default queue is instantiated in fluxion when no named queues are configured in flux-core. When named queues are configured in flux-core, there is no anonymous queue, and only the named queues are instantiated in fluxion; the default queue is undefined. To make matters more confusing, flux-core has a concept of a default queue. However, in that context the default queue is a jobspec modification made at job ingest time. If no default queue is configured and named queues are configured, jobs that don't specify a queue are rejected at ingest.
The tests that are failing after my patch was added may indicate a problem with the approach of placing allocated jobs in a random (first) queue, particularly this test, where a job in the same queue as a job that was "recovered" is supposed to be scheduled when the recovered job is canceled:
Possibly the problem is that a queue that has reached a point where nothing in it can be scheduled is not revisited until the allocated jobs are removed from it. Thus placing a job in a random queue on restart may prevent that queue from making progress when the job is freed. This comment is relevant:
/*! Return true if this queue has become schedulable since
* its state had been reset with set_schedulability (false).
* "Being schedulable" means one or more job or resource events
* have occurred such a way that the scheduler should run the
* scheduling loop for the pending jobs: e.g., a new job was
* inserted into the pending job queue or a job was removed from
* the running job queue so that its resource was released.
*/
bool is_schedulable ();
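To illustrate the failure mode described above, here is a toy sketch (guessed names, not the real queue_policy_base_t API) of a queue that, once marked unschedulable, is only woken by an event on that same queue, so a job recovered into the wrong queue ends up waking the wrong queue when it is freed.
#include <iostream>
#include <map>
#include <string>

struct toy_queue {
    bool schedulable = true;
    void set_schedulability (bool s) { schedulable = s; }
    bool is_schedulable () const { return schedulable; }
    // A job freed from this queue releases resources, so scheduling should retry.
    void on_job_freed () { schedulable = true; }
};

int main ()
{
    std::map<std::string, toy_queue> queues = {{"batch", {}}, {"debug", {}}};
    // Nothing pending in "batch" currently fits, so it was marked unschedulable.
    queues.at ("batch").set_schedulability (false);
    // A recovered job that really belongs to "batch" was reconstructed into
    // "debug"; when it completes, only "debug" is woken and "batch" stays stuck.
    queues.at ("debug").on_job_freed ();
    for (const auto &kv : queues)
        std::cout << kv.first << " schedulable=" << kv.second.is_schedulable () << "\n";
    return 0;
}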
Since "hello" is not on a critical performance path, maybe the simplest thing would be to look up the jobspec to obtain the queue name. I'll experiment with that.
BTW I think the garbage queue name from
I think I've got a more workable solution where the queue name is picked up from the jobspec, however, my new test is failing in interesting ways that I need to understand. I'll post an update soon once I've figured that out.
I just realized several error paths lead to success in that hello callback, which could cause double booking of resources. So I've got a couple more commits:
bfc1617 qmanager: fix error handling during hello
7f11e19 qmanager: improve unknown queue hello error
The first one at least ought to be included in this PR since it's somewhat serious.
You're right. I should have seen that. |
The background does help - thanks for the additional details. Now I understand what was going on with the missing
NP! Did you want me to push those commits to your branch and then you can review them a little more conveniently?
Yikes, thanks for catching this! For the sake of clarity, prior to this PR there's one error path that can lead to a spurious success:
flux-sched/qmanager/modules/qmanager_callbacks.cpp Lines 141 to 148 in 8895bea
The part of this in qmanager that I understand looks good. Could we have another sched developer review the resource part?
I did install this on my test cluster and verified that I can now reload the scheduler with running jobs, and the resource counts look good (both in core and in sched, by setting FLUX_RESOURCE_LIST_RPC=sched-fluxion-resource).
Great work @milroy !
Reviewing now, hopefully I can get through a pass before I get to the airport.
goto out;
}
}
if (json_unpack (jobspec, |
We really need a json path wrapper for things like this. Definitely not in this PR but mentioning it to remind myself.
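For context, this is the kind of nested json_unpack call a path wrapper would replace. A standalone sketch of extracting a queue name from a jobspec-like document with jansson; the attributes.system.queue layout follows the flux-core jobspec convention, but treat the details as illustrative.
#include <cstdio>
#include <jansson.h>

int main ()
{
    const char *jobspec = "{\"attributes\":{\"system\":{\"queue\":\"batch\"}}}";
    json_error_t err;
    json_t *o = json_loads (jobspec, 0, &err);
    if (!o) {
        fprintf (stderr, "parse error: %s\n", err.text);
        return 1;
    }
    const char *queue = NULL;
    // Every level of nesting needs its own braces and key in the format string.
    if (json_unpack (o, "{s?:{s?:{s?:s}}}",
                     "attributes", "system", "queue", &queue) < 0) {
        fprintf (stderr, "unpack error\n");
        json_decref (o);
        return 1;
    }
    printf ("queue=%s\n", queue ? queue : "(unspecified)");
    json_decref (o);
    return 0;
}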
This is great. There are a couple of style or performance issues I noted, but as soon as those get tweaked I'm all good with this. Awesome work tracking all this down @milroy.
@garlick, @grondo I'm almost to the airport to hop a plane to Oregon, in case I don't have internet, feel free to dismiss this and approve without waiting for me to take another pass.
// Search graph metadata for vertex
auto vtx_iter = m.by_path.find (path);
if (vtx_iter != m.by_path.end ()) { |
This is nitpicky, but as a style note, try to have checks like this with a long span use early exits rather than wrap the whole span. For example:
if (vtx_iter == m.by_path.end()) return vtx;
Done. I changed a couple of other places where this comment applied as well.
Thanks for the review and feedback @trws! It may be worth taking a look at the changes I made to address your comment on early exits.
I updated the early exit code sections to be more aligned with how Fluxion currently handles conditional exits. For example:
const auto &vtx_iter = m.by_path.find (path);
// Not found; return null_vertex
if (vtx_iter == m.by_path.end ())
return null_vtx;
// Found in by_path
for (vtx_t v : vtx_iter->second) {
if (g[v].rank == rank
&& g[v].id == id
&& g[v].size == size
&& g[v].type == type) {
return v;
Versus:
const auto &vtx_iter = m.by_path.find (path);
// Not found; return null_vertex
if (vtx_iter == m.by_path.end ()) {
return null_vtx;
} else {
// Found in by_path
for (vtx_t v : vtx_iter->second) {
if (g[v].rank == rank
[. . .]
Do you have a preference between these?
The former. It's a style I'm pretty sure I learned from @garlick and @grondo in flux-core. We have a lot of code with a lot of potential error-propagation conditions; if we use the second style, then every time we check one we have to increase a nesting level, whereas the former leaves everything flat. To me it makes the code much more readable, because you can easily tell the difference between a condition that's an early exit to propagate an error or unknown condition and code where there are actually two different code paths during normal execution. Avoiding nesting isn't a hard and fast rule, but this case is about as close as it gets for me. If the then branch of the condition ends in either a return or a
It looks really good, thanks @milroy. Added one comment just on factoring, but it's entirely optional.
That makes sense and I agree, but I thought I'd check.
Problem: issue flux-framework#991 identified the need for an `rv1exec` implementation of update (). The need for the implementation is described in detail in the issue, but the primary motivation is to enable reloading Fluxion when using RV1 without the scheduling key and payload. The reader was not originally implemented due to the lack of information in the format. Examples include edges, exclusivity, paths with subsystems, and vertex sizes. To create a workable implementation, strong assumptions need to be made about resource exclusivity and edges. Add support for update () through helper functions that update vertex and edge metadata and infer exclusivity of node-level resources.
Problem: the sched-fluxion-resource.update RPC only supports JGF. Add a check for an empty scheduling key and fall back to rv1exec.
Problem: the `resource-query` tool currently only supports update via JGF. Add support for rv1exec, including additional options handling. This enables sharness tests for reload and updates under rv1exec.
Problem: the current tests for reloading Fluxion with rv1exec fail due to lack of update support. Now that the support is added, we need tests for correct behavior. Add new tests to check for ability to reload Fluxion and update a resource graph with rv1exec. Update existing tests to use the additional option required by resource-query for update.
Problem: when the fluxion modules are reloaded with running jobs, and the match format is "rv1_nosched", and queues are enabled, running jobs are killed with a fatal scheduler-restart exception. During the hello handshake defined by RFC 27, the scheduler is informed during its initialization of jobs that are holding resources. The hello callback in qmanager retrieves the job's queue name from the R key "attributes.system.scheduler.queue". The queue name is used to locate the proper queue for the job to be inserted into. This attribute is not being set in R when the match format is "rv1_nosched" so the default queue is assumed. Since the default queue is not instantiated when named queues are defined, a fatal job exception is raised when the queue lookup fails. There has been a proposal to deprecate the R attribute (flux-framework#1108), so rather than ensure that it is set in this case, determine the queue instead by fetching the jobspec from the KVS. Fixes flux-framework#1108
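A hedged, synchronous sketch of that approach (the real callback may do this differently, and flux_job_kvs_key () / flux_kvs_lookup () are flux-core APIs as best I recall them, so double-check the signatures): fetch the job's jobspec from the KVS during hello, then unpack the queue from attributes.system.queue as sketched earlier in the thread.
#include <cstdio>
#include <flux/core.h>

static int print_jobspec (flux_t *h, flux_jobid_t id)
{
    char key[64];
    const char *jobspec = NULL;
    flux_future_t *f = NULL;
    int rc = -1;

    // Build the KVS key for this job's jobspec, e.g. job.<dir-path>.jobspec.
    if (flux_job_kvs_key (key, sizeof (key), id, "jobspec") < 0)
        return -1;
    if (!(f = flux_kvs_lookup (h, NULL, 0, key)))
        return -1;
    if (flux_kvs_lookup_get (f, &jobspec) == 0) {
        printf ("%s\n", jobspec);
        rc = 0;
    }
    flux_future_destroy (f);
    return rc;
}

int main ()
{
    flux_t *h = flux_open (NULL, 0);   // connect to the enclosing instance
    if (!h)
        return 1;
    int rc = print_jobspec (h, 1234);  // hypothetical job id
    flux_close (h);
    return rc == 0 ? 0 : 1;
}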
Problem: a few specific use cases related to reloading fluxion when rv1_nosched is set are not fully covered by existing tests. Check the following:
- jobs are not killed when modules are reloaded (no queues configured)
- jobs are not killed when modules are reloaded (queues configured)
- no resources are double booked in those cases
- jobs submitted to the anon queue are killed when the scheduler is reloaded with queues configured
- jobs submitted to a named queue are killed when the scheduler is reloaded without queues configured
Problem: several error paths in jobmanager_hello_cb() do not result in the function returning a failure. I think the only ones that do trigger an error are a C++ exception (due to the wrapper) and queue->reconstruct () returning -1. The callback returns 0 (success) in the other cases, which could result in double-booked resources.
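A generic illustration of that bug class (not the actual callback), using the same goto-based cleanup pattern seen in the diff earlier in this thread: an error path that jumps to the cleanup label without setting rc leaves the function reporting success.
#include <cerrno>
#include <cstdio>

static int hello_cb_sketch (bool parse_ok, bool queue_known)
{
    int rc = 0;
    if (!parse_ok) {
        rc = -1;            // correct: the failure is propagated
        goto out;
    }
    if (!queue_known) {
        errno = ENOENT;     // bug: rc is still 0, so the caller sees success
        goto out;
    }
    // ... reconstruct the running job here ...
out:
    return rc;
}

int main ()
{
    printf ("unknown queue returns %d (should be -1)\n", hello_cb_sketch (true, false));
    return 0;
}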
Problem: when a hello response is received for a job with an unknown queue, the error is not super helpful. Instead of:
jobmanager_hello_cb: ENOENT: map::at: No such file or directory
make it:
jobmanager_hello_cb: unknown queue name (id=41137733632 queue=default)
This can occur when queues are reconfigured and the scheduler is reloaded, so it's nice to have the job id and the offending queue name.
Bounced the go test runner, looks like it failed to fetch the image for some reason. Feel free to set MWP @milroy, great stuff!
This PR adds support for update () with the rv1exec / rv1_nosched format. Previously, only JGF was implemented and supported for Fluxion reload. Since RV1 without the scheduling key doesn't contain information about edges, exclusivity, subsystems, sizes, etc., assumptions need to be made about the graph structure. The only supported subsystem is currently containment. Ranks are assumed to exist at the node level, and node-to-child edges are assumed to be marked exclusive. Applying the reader to a resource graph created from another format (e.g., JGF) may result in undesirable behavior due to the mismatch between graph granularities.
Handling ranks in the unit tests is different than in other resource-query-based tests. That is due to the fact that resource-query simulates the behavior of the scheduler independent of the resource manager and execution system. To enable sharness tests with resource-query, the files need to explicitly map ranks >= 0 to resources.
Note that the implementation in this PR does not update the aggregate or exclusivity filters. That functionality is performed by the traverser here:
flux-sched/resource/traversers/dfu_impl_update.cpp
Lines 613 to 644 in cb95173
Note that there is one TODO: using the find_vertex () function when creating vertices results in errors in existing unit tests. I'm reasonably sure this is due to the fact that only one rank exists (-1) for the unit tests, but I need to double-check.
Resolves issue #991.
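For reference, a simplified example of the kind of R document the rv1exec reader has to work from when the scheduling key is absent (field names follow RFC 20, but this is an illustrative sketch, not output from a real instance). Note there is no edge, exclusivity, subsystem, or vertex size information to recover, which is why the assumptions above are needed.
{
  "version": 1,
  "execution": {
    "R_lite": [
      { "rank": "0-1", "children": { "core": "0-3", "gpu": "0" } }
    ],
    "nodelist": [ "node[0-1]" ],
    "starttime": 0,
    "expiration": 0
  }
}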