
Implement rv1exec reader update to facilitate Fluxion reload #1176

Merged: 8 commits, merged Apr 24, 2024

Conversation

@milroy (Member) commented Apr 15, 2024

This PR adds support for update () with the rv1exec/rv1_nosched format. Previously, only JGF was implemented and supported for Fluxion reload. Since RV1 without the scheduling key doesn't contain information about edges, exclusivity, subsystems, sizes, etc., assumptions need to be made about the graph structure. The only supported subsystem is currently containment. Ranks are assumed to exist at the node level, and node-to-child edges are assumed to be marked exclusive. Applying the reader to a resource graph created from another format (e.g., JGF) may result in undesirable behavior due to the mismatch between graph granularities.
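
For illustration, the assumed shape can be sketched with a tiny Boost Graph Library program (the property structs and names below are made up for this example and are not the reader's actual types): one vertex per rank at the node level, with every node-to-child edge marked exclusive.

    #include <boost/graph/adjacency_list.hpp>
    #include <iostream>
    #include <string>

    // Hypothetical, simplified stand-ins for Fluxion's vertex/edge properties
    struct resource_prop {
        std::string type;
        int rank = -1;
    };
    struct relation_prop {
        bool exclusive = false;
    };

    using graph_t = boost::adjacency_list<boost::vecS, boost::vecS,
                                          boost::directedS,
                                          resource_prop, relation_prop>;

    int main ()
    {
        graph_t g;
        // rv1exec assumption: one rank maps to one node-level vertex
        resource_prop node_prop{"node", 0};
        auto node = boost::add_vertex (node_prop, g);
        // Children inferred from R_lite (e.g., cores); every node-to-child
        // edge is assumed exclusive since R carries no edge information
        for (int i = 0; i < 4; i++) {
            resource_prop core_prop{"core", 0};
            auto core = boost::add_vertex (core_prop, g);
            boost::add_edge (node, core, relation_prop{true}, g);
        }
        std::cout << boost::num_vertices (g) << " vertices, "
                  << boost::num_edges (g) << " exclusive edges\n";
        return 0;
    }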

Handling ranks in the unit tests differs from other resource-query-based tests because resource-query simulates the behavior of the scheduler independently of the resource manager and execution system. To enable sharness tests with resource-query, the files need to explicitly map ranks >= 0 to resources.

Note that the implementation in this PR does not update the aggregate or exclusivity filters. That functionality is performed by the traverser here:

    if (m_graph_db->metadata.v_rt_edges[dom].get_trav_token ()
        != m_best_k_cnt) {
        // This condition occurs when the subgraph came from a
        // traverser different from this traverser, for example,
        // a traverser whose dominant subsystem is different than this.
        return 0;
    }
    excl = m_graph_db->metadata.v_rt_edges[dom].get_exclusive ();
    x = (excl == 0)? false : true;
    needs = static_cast<unsigned int>(m_graph_db->metadata
                                          .v_rt_edges[dom].get_needs ());
    m_color.reset ();
    bool emit_shadow = modify_traversal (root, false);
    if ( (rc = upd_dfv (root, writers, needs,
                        x, jobmeta, false, dfu, emit_shadow)) > 0) {
        uint64_t starttime = jobmeta.at;
        uint64_t endtime = jobmeta.at + jobmeta.duration;
        if (writers->emit_tm (starttime, endtime) == -1) {
            m_err_msg += __FUNCTION__;
            m_err_msg += ": emit_tm returned -1.\n";
        }
        if (jobmeta.is_queue_set ()) {
            if (writers->emit_attrs ("queue", jobmeta.get_queue ()) == -1) {
                m_err_msg += __FUNCTION__;
                m_err_msg += ": emit_attrs returned -1.\n";
            }
        }
    }
    return (rc > 0)? 0: -1;
}

Note that there are a couple of TODOs:

  • calling the find_vertex () function when creating vertices results in errors in existing unit tests. I'm reasonably sure this is due to the fact that only one rank exists (-1) for the unit tests, but I need to double-check.
  • may need more test cases; reviewer feedback is welcome here

Resolves issue #991.

@milroy milroy linked an issue Apr 15, 2024 that may be closed by this pull request
@milroy (Member Author) commented Apr 15, 2024

@grondo or @garlick I don't think the tests I've added are sufficient to test the desired reload behavior. I'm going to devise more, but I'd like your input if you have suggestions.

@garlick (Member) commented Apr 15, 2024

Fantastic! I'm going to do a little manual testing and see if I can come up with some ideas for automated testing.

So, the expected shortcoming here is that a node-exclusive allocation may not remain exclusive after a scheduler reload? Does that matter? My understanding is that if you ask for, say, one core of a node with 32 cores but set the exclusive flag (or if the scheduler is configured for node exclusivity), you get all 32 cores. So on reload, the scheduler would just see R with the whole node allocated, and it shouldn't matter whether the node was marked exclusive or not because the "unused" cores are not really unused as far as the scheduler is concerned.

@garlick (Member) commented Apr 15, 2024

@grondo and I both gave this a quick sanity check where we submitted a sleep inf job then reloaded the fluxion modules. We found that it works for a simple config (yay!)

sched-fluxion-qmanager.debug[0]: requeue success (queue=default id=378547973585371136)

but not when queues are configured:

jobmanager_hello_cb: ENOENT: map::at: No such file or directory

@grondo (Contributor) commented Apr 15, 2024

This same error was noted with rv1 earlier in this comment. Possibly I should have opened a separate issue on it, sorry!

@garlick (Member) commented Apr 15, 2024

Hmm, my job that failed with queues configured did not set attributes.system.scheduler.queue in R. That appears to be necessary for fluxion to find the right queue:

https://github.com/flux-framework/flux-sched/blob/master/qmanager/modules/qmanager_callbacks.cpp#L159

@garlick (Member) commented Apr 15, 2024

See also:

I wonder why it is important to put the job into the correct queue after it's already been allocated resources? Assuming that's just the container that is available, what if we just always put these jobs in the default queue?

@milroy (Member Author) commented Apr 15, 2024

So, the expected shortcoming here is an node-exclusive allocation may not remain exclusive after a scheduler reload?

Actually, this PR should handle that case. I am concerned that there are some edge cases where the exclusivity flags per edge aren't set correctly, and then the traversal here:

if ( (rc = upd_dfv (root, writers, needs,

doesn't set the filters correctly. The implication is probably just loss of traversal optimization rather than a problem with the allocation.

@milroy (Member Author) commented Apr 15, 2024

I wonder why it is important to put the job into the correct queue after it's already been allocated resources? Assuming that's just the container that is available, what if we just always put these jobs in the default queue?

That makes sense to me. Should I add logic to this PR to insert the job in the default queue regardless of whatever attributes.system.scheduler.queue says, since it's being deprecated?

@garlick (Member) commented Apr 15, 2024

I guess if it makes sense to you! Since I guess the default queue name isn't defined when named queues are being used, I just did this and it seems to work.

diff --git a/qmanager/modules/qmanager_callbacks.cpp b/qmanager/modules/qmanager_callbacks.cpp
index ae8b68f7..205e1101 100644
--- a/qmanager/modules/qmanager_callbacks.cpp
+++ b/qmanager/modules/qmanager_callbacks.cpp
@@ -125,10 +125,7 @@ int qmanager_cb_t::jobmanager_hello_cb (flux_t *h, const flux_msg_t *msg,
 
 {
     int rc = 0;
-    json_t *o = NULL;
-    json_error_t err;
     std::string R_out;
-    char *qn_attr = NULL;
     std::string queue_name;
     std::shared_ptr<queue_policy_base_t> queue;
     std::shared_ptr<job_t> running_job = nullptr;
@@ -148,28 +145,8 @@ int qmanager_cb_t::jobmanager_hello_cb (flux_t *h, const flux_msg_t *msg,
         goto out;
     }
 
-    if ( (o = json_loads (R, 0, &err)) == NULL) {
-        rc = -1;
-        errno = EPROTO;
-        flux_log (h, LOG_ERR, "%s: parsing R for job (id=%jd): %s %s@%d:%d",
-                  __FUNCTION__, static_cast<intmax_t> (id),
-                  err.text, err.source, err.line, err.column);
-        goto out;
-    }
-    if ( (rc = json_unpack (o, "{ s?:{s?:{s?:{s?:s}}} }",
-                                   "attributes",
-                                       "system",
-                                           "scheduler",
-                                                "queue", &qn_attr)) < 0) {
-        json_decref (o);
-        errno = EPROTO;
-        flux_log (h, LOG_ERR, "%s: json_unpack for attributes", __FUNCTION__);
-        goto out;
-    }
-
-    queue_name = qn_attr? qn_attr : ctx->opts.get_opt ().get_default_queue_name ();
-    json_decref (o);
-    queue = ctx->queues.at (queue_name);
+    queue_name = ctx->queues.begin()->first;
+    queue = ctx->queues.begin()->second;
     running_job = std::make_shared<job_t> (job_state_kind_t::RUNNING,
                                                    id, uid, calc_priority (prio),
                                                    ts, R);

@garlick (Member) commented Apr 15, 2024

I'll see if I can work up a test script that covers the basics here. The tests that are failing (before the patch I posted above) are:

17/92 Test #17: t1009-recovery-multiqueue.t ...........***Failed   20.74 sec
[snip]
not ok 6 - recovery: works when both modules restart (rv1)
#	
#		flux dmesg -C &&
#		remove_qmanager &&
#		reload_resource match-format=rv1 policy=high &&
#		load_qmanager_sync &&
#		check_requeue ${jobid1} batch &&
#		check_requeue ${jobid2} debug &&
#		test_must_fail flux job wait-event -t 0.5 ${jobid3} start &&
#		test_expect_code 3 flux ion-resource info ${jobid3}
#	

expecting success: 
	flux cancel ${jobid2} &&
	flux job wait-event -t 10 ${jobid4} start

flux-job: wait-event timeout on event 'start'
not ok 7 - recovery: a cancel leads to a job schedule (rv1)
#	
#		flux cancel ${jobid2} &&
#		flux job wait-event -t 10 ${jobid4} start
#	

expecting success: 
	flux cancel ${jobid1} &&
	flux cancel ${jobid3} &&
	flux cancel ${jobid4} &&
	flux job wait-event -t 10 ${jobid4} release

1713212995.938612 release ranks="all" final=true
[snip]
# failed 2 among 10 test(s)

@milroy (Member Author) commented Apr 15, 2024

I'll see if I can work up a test script that covers the basics here. The tests that are failing (before the patch I posted above) are

I'm a bit confused. Those tests pass for me before applying the patch above and fail after applying the patch.

@garlick (Member) commented Apr 15, 2024

Oh, hmm, maybe I wasn't testing the right thing. Will double check.

@garlick (Member) commented Apr 15, 2024

I retested with your branch and that test is not failing. Sorry about that!

@garlick (Member) commented Apr 15, 2024

@milroy - I'm out of time for today but I did push a new test script to my rv1-update-jim branch.

  • b116c9a testsuite: add test for reloading with rv1_nosched
  • 88e469a qmanager: fix module reload with rv1, queues, jobs

I guess that test breakage is mine :-(

@milroy (Member Author) commented Apr 15, 2024

@milroy - I'm out of time for today but I did push a new test script to my rv1-update-jim branch.

Thanks for the help @garlick!

@milroy (Member Author) commented Apr 16, 2024

jobmanager_hello_cb: ENOENT: map::at: No such file or directory

This problem appears to be due to the default queue name ("default") key not existing in the ctx->queues map. Apparently it never gets added.

@milroy (Member Author) commented Apr 16, 2024

It looks like the default name getter is returning garbage:

    queue_it = ctx->queues.find (ctx->opts.get_opt ().get_default_queue_name ());
    if (queue_it != ctx->queues.end ()) {
        queue = queue_it->second;
        queue_name = queue_it->first;
    } else {
        flux_log (h, LOG_ERR, "%s: queue %s not recognized; default is %s ",
                  __FUNCTION__, queue_name,
                  ctx->opts.get_opt ().get_default_queue_name ());
        rc = -1;
        goto out;
    }

And the log output:

Apr 16 01:17:52.765851 sched-fluxion-qmanager.err[0]: jobmanager_hello_cb: queue ???L?? not recognized; default is ??L?? 

The default is set here:

std::string m_default_queue_name = "default";

and there doesn't appear to be a way to update the value. The opt getter appears to be corrupting the value somehow:

const T &get_opt () const {

@garlick (Member) commented Apr 16, 2024

In case it helps, here is some background on the default queue - it is confusing in this context because queues in fluxion predate queues in flux-core. (I hope my fluxion assertions are correct as I'm making them from memory)

In flux-core, when no named queues are configured, there is a single anonymous queue. The default queue is instantiated in fluxion when no named queues are configured in flux-core.

When named queues are configured in flux-core, there is no anonymous queue, and only the named queues are instantiated in fluxion. The default queue is undefined.

To make matters more confusing, flux-core has a concept of a default queue. However, in this context the default queue is a jobspec modification at job ingest time. If a default queue is not configured, then jobs that don't specify a queue when named queues are configured are rejected at ingest.

@garlick (Member) commented Apr 16, 2024

The tests that are failing after my patch was added may indicate a problem with the approach of placing allocated jobs in a random (first) queue, particularly this test, where a job in the same queue as a job that was "recovered" is supposed to be scheduled when the recovered job is canceled:

not ok 7 - recovery: a cancel leads to a job schedule (rv1)
#	
#		flux cancel ${jobid2} &&
#		flux job wait-event -t 10 ${jobid4} start
#	

Possibly the problem is that a queue that has reached a point where nothing in it can be scheduled is not revisited until the allocated jobs are removed from it. Thus placing a job in a random queue on restart may prevent that queue from making progress when the job is freed.

This comment in queue_policy_base.hpp bolsters this somewhat

    /*! Return true if this queue has become schedulable since
     *  its state had been reset with set_schedulability (false).
     *  "Being schedulable" means one or more job or resource events
     *  have occurred such a way that the scheduler should run the
     *  scheduling loop for the pending jobs: e.g., a new job was
     *  inserted into the pending job queue or a job was removed from
     *  the running job queue so that its resource was released.
     */
    bool is_schedulable ();
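
As a toy illustration of that flag-based pattern (this is not Fluxion's queue_policy_base_t, just a sketch of the behavior the comment describes): if a recovered job lands in the wrong queue on restart, cancelling it only marks that queue schedulable, so the queue whose pending jobs are actually waiting on those resources never re-runs its scheduling loop.

    #include <algorithm>
    #include <iostream>
    #include <string>
    #include <vector>

    // Hypothetical toy queue: the scheduling loop only re-runs for a queue
    // after an event (job added or removed) flips its schedulability flag.
    struct toy_queue {
        std::string name;
        bool schedulable = false;
        std::vector<int> running;

        void remove_running (int id)
        {
            running.erase (std::remove (running.begin (), running.end (), id),
                           running.end ());
            schedulable = true;     // resources released: revisit pending jobs
        }
    };

    int main ()
    {
        toy_queue batch{"batch"}, debug{"debug"};
        // Recovered job 42 really belongs to "debug", but restart placed it
        // in "batch"; cancelling it only wakes up "batch"
        batch.running.push_back (42);
        batch.remove_running (42);
        std::cout << "batch schedulable: " << batch.schedulable
                  << ", debug schedulable: " << debug.schedulable << "\n";
        return 0;
    }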

@garlick (Member) commented Apr 16, 2024

Since "hello" is not on a critical performance path, maybe the simplest thing would be to look up the jobspec to obtain the queue name. I'll experiment with that.

@garlick (Member) commented Apr 16, 2024

BTW I think the garbage queue name from flux_log() may be due to the fact that it is a std::string and needs the c_str() method called on it before being passed to flux_log().
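
For reference, a minimal standalone example of that pitfall, using std::printf in place of flux_log () since both take printf-style format strings:

    #include <cstdio>
    #include <string>

    int main ()
    {
        std::string queue_name = "batch";

        // Passing a std::string to a %s conversion is undefined behavior and
        // typically prints garbage:
        //   std::printf ("queue %s not recognized\n", queue_name);

        // Converting with c_str () first is correct:
        std::printf ("queue %s not recognized\n", queue_name.c_str ());
        return 0;
    }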

@garlick (Member) commented Apr 16, 2024

I think I've got a more workable solution where the queue name is picked up from the jobspec, however, my new test is failing in interesting ways that I need to understand. I'll post an update soon once I've figured that out.

@garlick (Member) commented Apr 16, 2024

OK, those tests are passing now. Feel free to cherry-pick these commits from my branch (instead of the originals):

caec1a4 - qmanager: look up job queue name in KVS

2e2ae23 - testsuite: add test for reloading with rv1_nosched

@garlick (Member) commented Apr 16, 2024

I just realized several error paths lead to success in that hello callback, which could cause double booking of resources. So I've got a couple more commits:

bfc1617 qmanager: fix error handling during hello

7f11e19 manager: improve unknown queue hello error

The first one at least ought to be included in this PR since it's somewhat serious.

@milroy (Member Author) commented Apr 17, 2024

BTW I think the garbage queue name from flux_log() may be due to the fact that it is a std::string and needs the c_str() method called on it before being passed to flux_log().

You're right. I should have seen that.

@milroy (Member Author) commented Apr 17, 2024

In case it helps, here is some background on the default queue - it is confusing in this context because queues in fluxion predate queues in flux-core. (I hope my fluxion assertions are correct as I'm making them from memory)

The background does help; thanks for the additional details. Now I understand what was going on with the missing "default" queue.

@garlick (Member) commented Apr 17, 2024

NP! Did you want me to push those commits to your branch and then you can review them a little more conveniently?

@milroy (Member Author) commented Apr 17, 2024

I just realized several error paths lead to success in that hello callback, which could cause double booking of resources.

Yikes, thanks for catching this! For the sake of clarity, prior to this PR there's one error path that can lead to a spurious success:

    if (flux_msg_unpack (msg,
                         "{s:I s:i s:i s:f}",
                         "id", &id,
                         "priority", &prio,
                         "userid", &uid,
                         "t_submit", &ts) < 0) {
        flux_log_error (h, "%s: flux_msg_unpack", __FUNCTION__);
        goto out;
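
A hypothetical sketch (not the actual Fluxion callback) of that class of bug and the usual fix: initialize rc to failure and only set it to 0 after every step has succeeded, so any goto out from an error path reports failure:

    #include <cstdio>

    static int hello_cb_sketch (bool unpack_ok, bool queue_ok)
    {
        int rc = -1;            // assume failure until everything succeeds

        if (!unpack_ok) {
            std::fprintf (stderr, "hello: message unpack failed\n");
            goto out;           // rc is still -1, so the error propagates
        }
        if (!queue_ok) {
            std::fprintf (stderr, "hello: unknown queue\n");
            goto out;
        }
        rc = 0;                 // success only after every check passed
    out:
        return rc;
    }

    int main ()
    {
        return hello_cb_sketch (true, false) == -1 ? 0 : 1;
    }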

@garlick (Member) left a comment

The part of this in qmanager that I understand looks good. Could we have another sched developer review the resource part?

I did install this on my test cluster and verified that I can now reload the scheduler with running jobs, and the resource counts look good (both in core and in sched, by setting FLUX_RESOURCE_LIST_RPC=sched-fluxion-resource).

Great work @milroy !

@trws (Member) commented Apr 22, 2024

Reviewing now; hopefully I can get through a pass before I get to the airport.

goto out;
}
}
if (json_unpack (jobspec,
Review comment (Member):

We really need a json path wrapper for things like this. Definitely not in this PR but mentioning it to remind myself.

@trws (Member) left a comment

This is great. There are a couple of style or performance issues I noted, but as soon as those get tweaked I'm all good with this. Awesome work tracking all this down @milroy.

@garlick, @grondo I'm almost to the airport to hop a plane to Oregon; in case I don't have internet, feel free to dismiss this and approve without waiting for me to take another pass.

resource/readers/resource_reader_rv1exec.hpp (outdated, resolved)

// Search graph metadata for vertex
auto vtx_iter = m.by_path.find (path);
if (vtx_iter != m.by_path.end ()) {
Review comment (Member):

This is nitpicky, but as a style note, try to have checks like this with a long span use early exits rather than wrap the whole span. For example, if (vtx_iter == m.by_path.end()) return vtx;.

@milroy (Member Author) replied:

Done. I changed a couple of other places where this comment applied as well.

resource/readers/resource_reader_rv1exec.cpp (outdated, resolved)
resource/readers/resource_reader_rv1exec.cpp (outdated, resolved)
@milroy (Member Author) commented Apr 23, 2024

Thanks for the review and feedback @trws! It may be worth taking a look at the changes I made to address your comment on early exits.

@milroy (Member Author) commented Apr 23, 2024

I updated the early exit code sections to be more aligned with how Fluxion currently handles conditional exits. For example:

    const auto &vtx_iter = m.by_path.find (path);
    // Not found; return null_vertex
    if (vtx_iter == m.by_path.end ())
        return null_vtx;
    // Found in by_path
    for (vtx_t v : vtx_iter->second) {
        if (g[v].rank == rank
                && g[v].id == id
                && g[v].size == size
                && g[v].type == type) {
            return v;

Versus:

    const auto &vtx_iter = m.by_path.find (path);
    // Not found; return null_vertex
    if (vtx_iter == m.by_path.end ()) {
        return null_vtx;
    } else {
        // Found in by_path
        for (vtx_t v : vtx_iter->second) {
            if (g[v].rank == rank
            [. . .]

Do you have a preference between these?

@trws (Member) commented Apr 23, 2024

Do you have a preference between these?

The former. It's a style I'm pretty sure I learned from @garlick and @grondo in flux-core. We have a lot of code that has a lot of potential error propagation conditions; if we do the second style, then every time we check one we have to increase a nesting level, whereas the former leaves everything flat. To me it makes the code much more readable, because you can easily tell the difference between a condition that's an early exit to propagate an error or unknown condition and code where there are actually two different code paths during normal execution.

Avoiding nesting isn't a hard and fast rule, but this case is about as close as it gets for me. If the then branch of the condition ends in either a return or a goto cleanup or similar, it shouldn't have an else because the execution is identical whether it nests or not.

@trws (Member) left a comment

It looks really good, thanks @milroy. Added one comment just on factoring, but it's entirely optional.

resource/readers/resource_reader_rv1exec.cpp (outdated, resolved)
@milroy (Member Author) commented Apr 23, 2024

To me it makes the code much more readable, because you can easily tell the difference between a condition that's an early exit to propagate an error or unknown condition and code where there are actually two different code-paths during normal execution.

That makes sense and I agree, but I thought I'd check.

@milroy force-pushed the rv1-update branch 3 times, most recently from 72ef733 to 1cae444 on April 24, 2024 01:03
milroy and others added 8 commits April 23, 2024 19:27
Problem: issue flux-framework#991 identified the need for an `rv1exec`
implementation of update (). The need for the implementation is
described in detail in the issue, but the primary motivation is to
enable reloading Fluxion when using RV1 without the scheduling key and
payload. The reader was not originally implemented due to the lack of
information in the format. Examples include edges, exclusivity, paths
with subsystems, and vertex sizes. To create a workable implementation,
strong assumptions need to be made about resource exclusivity and
edges.

Add support for update () through helper functions that update vertex
and edge metadata and infer exclusivity of node-level resources.

Problem: the sched-fluxion-resource.update RPC only supports JGF.

Add a check for an empty scheduling key and fall back to rv1exec.

Problem: the `resource-query` tool currently only supports update via
JGF.

Add support for rv1exec, including additional options handling. This
enables sharness tests for reload and updates under rv1exec.

Problem: the current tests for reloading Fluxion with rv1exec fail due
to lack of update support. Now that the support is added, we need tests
for correct behavior.

Add new tests to check for ability to reload Fluxion and update a
resource graph with rv1exec. Update existing tests to use the
additional option required by resource-query for update.

Problem: when the fluxion modules are reloaded with running jobs,
and the match format is "rv1_nosched", and queues are enabled,
running jobs are killed with a fatal scheduler-restart exception.

During the hello handshake defined by RFC 27, the scheduler is informed
during its initialization of jobs that are holding resources.  The hello
callback in qmanager retrieves the job's queue name from the R key
"attributes.system.scheduler.queue".  The queue name is used to locate
the proper queue for the job to be inserted into.

This attribute is not being set in R when the match format is
"rv1_nosched" so the default queue is assumed.  Since the default queue
is not instantiated when named queues are defined, a fatal job exception
is raised when the queue lookup fails.

There has been a proposal to deprecate the R attribute (flux-framework#1108), so
rather than ensure that it is set in this case, determine the queue
instead by fetching the jobspec from the KVS.

Fixes flux-framework#1108

Problem: a few specific use cases related to reloading fluxion
when rv1_nosched is set are not fully covered by existing tests.

Check the following
- jobs are not killed when modules are reloaded (no queues configured)
- jobs are not killed when modules are reloaded (queues configured)
- no resources are double booked in those cases
- jobs submitted to the anon queue are killed when the scheduler is reloaded
  with queues configured
- jobs submitted to a named queue are killed when the scheduler is reloaded
  without queues configured

Problem: several error paths in jobmanager_hello_cb() do not
result in the function returning a failure.

I think the only ones that do trigger an error are a C++ exception
(due to the wrapper) and if queue->reconstruct () returns -1. The
callback returns 0 (success) in other cases which could result in
double booked resources.

Problem: when a hello response is received for a job with an
unknown queue, the error is not super helpful.

Instead of:
  jobmanager_hello_cb: ENOENT: map::at: No such file or directory

make it:
  jobmanager_hello_cb: unknown queue name (id=41137733632 queue=default)

This can occur when queues are reconfigured and the scheduler is reloaded,
so it's nice to have the job id and the offending queue name.
@trws (Member) commented Apr 24, 2024

Bounced the go test runner; it looks like it failed to fetch the image for some reason. Feel free to set MWP @milroy, great stuff!

@milroy (Member Author) commented Apr 24, 2024

Thanks for the detailed and helpful feedback @trws and @garlick! Setting MWP.

@milroy added the merge-when-passing label (mergify.io: merge PR automatically once CI passes) on Apr 24, 2024
@mergify mergify bot merged commit 143020d into flux-framework:master Apr 24, 2024
21 checks passed
@milroy milroy deleted the rv1-update branch April 24, 2024 17:21

Successfully merging this pull request may close these issues.

Fluxion can't restart running jobs with match-format rv1_nosched