Rv1 attributes section will be deprecated #1108

Closed
grondo opened this issue Nov 8, 2023 · 2 comments

Comments

@grondo
Contributor

grondo commented Nov 8, 2023

flux-framework/rfc#402 proposes removing the optional attributes section from Rv1. Fluxion currently uses this section to store the queue in attributes.system.scheduler.queue, though it isn't clear why this is needed, since it duplicates the queue already set in the current jobspec.

Additionally, the attributes section does not appear to be set on our production systems with configured queues. Perhaps this is a holdover from before flux-core supported queues and Fluxion had its own queue implementation?

In any event, if stashing the queue in R is still necessary, the opaque scheduling key, which is already provided for the scheduler's use, is perhaps a better place for it.

@garlick
Member

garlick commented Nov 8, 2023

I can't think of why this would be required.

In the hello handshake, where only R is provided, the resources need to be re-allocated immediately, so no queueing should be needed. Also, the resources have already been selected at that point, so the queue constraint would be ignored.
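To illustrate the point above, here is a minimal C++ sketch. The `Job` and `Queue` types and the `schedule()`, `reconstruct()`, and `is_running()` methods are hypothetical and are not Fluxion's qmanager API. The normal scheduling path checks a queue constraint (modeled here as a set of allowed nodes), while the hello/reconstruct path only records a job whose resources were already chosen and listed in R, so the constraint is never consulted and the queue acts only as a container.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, simplified model for illustration only; these are not
// Fluxion's actual qmanager classes or method signatures.
struct Job {
    int id;
    std::vector<std::string> nodes;  // nodes wanted (pending) or held per R (hello)
};

class Queue {
public:
    explicit Queue(std::set<std::string> allowed_nodes)
        : allowed_(std::move(allowed_nodes)) {}

    // Normal path: a pending job is placed only on nodes the queue allows.
    bool schedule(const Job &job) {
        for (const auto &n : job.nodes)
            if (allowed_.count(n) == 0)
                return false;  // the queue constraint rejects this placement
        running_.insert(job.id);
        return true;
    }

    // Hello path: the resources were selected before the restart and are
    // listed in R, so the job is recorded as running without checking allowed_.
    void reconstruct(const Job &job) {
        assert(!job.nodes.empty());  // a job reported by hello holds resources
        running_.insert(job.id);
    }

    bool is_running(int id) const { return running_.count(id) > 0; }

private:
    std::set<std::string> allowed_;
    std::set<int> running_;
};
```

For example, with `Queue batch({"node1", "node2"})`, `batch.schedule(Job{1, {"node3"}})` returns false, whereas `batch.reconstruct(Job{2, {"node3"}})` records job 2 as running even though `node3` is outside the queue's constraint.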

@grondo
Contributor Author

grondo commented Nov 8, 2023

I agree, but when I tried removing the code that adds the queue "meta" attribute to jobs, t1009-recovery-multiqueue.t failed. I didn't look into why (all other tests appeared to pass).

diff --git a/resource/traversers/dfu_impl_update.cpp b/resource/traversers/dfu_impl_update.cpp
index 2abb2a22..3c05e325 100644
--- a/resource/traversers/dfu_impl_update.cpp
+++ b/resource/traversers/dfu_impl_update.cpp
@@ -568,12 +568,6 @@ int dfu_impl_t::update (vtx_t root, std::shared_ptr<match_writers_t> &writers,
              m_err_msg += __FUNCTION__;
              m_err_msg += ": emit_tm returned -1.\n";
          }
-         if (jobmeta.is_queue_set ()) {
-             if (writers->emit_attrs ("queue", jobmeta.get_queue ()) == -1) {
-                 m_err_msg += __FUNCTION__;
-                 m_err_msg += ": emit_attrs returned -1.\n";
-             }
-         }
      }
 
     return (rc > 0)? 0 : -1;
@@ -632,12 +626,6 @@ int dfu_impl_t::update (vtx_t root, std::shared_ptr<match_writers_t> &writers,
              m_err_msg += __FUNCTION__;
              m_err_msg += ": emit_tm returned -1.\n";
          }
-         if (jobmeta.is_queue_set ()) {
-             if (writers->emit_attrs ("queue", jobmeta.get_queue ()) == -1) {
-                 m_err_msg += __FUNCTION__;
-                 m_err_msg += ": emit_attrs returned -1.\n";
-             }
-         }
     }
 
     return (rc > 0)? 0: -1;

garlick added a commit to garlick/flux-sched that referenced this issue Apr 15, 2024
Problem: when the fluxion modules are reloaded with running jobs,
the jobs are killed if the match format is set to "rv1_nosched"
and queues are enabled.

hello_cb(), which informs the scheduler of jobs that are still holding
resources from before the reload, looks for a special key in R
(attributes.system.scheduler.queue).  If set, it calls queue->reconstruct()
on the named queue.  If not set, it calls queue->reconstruct() on the
default queue.  This key is not being set by fluxion, at least when
"rv1_nosched" is in use.

Since there has been a proposal to deprecate that key anyway (flux-framework#1108),
and since the queue serves only as a container for the job, whose resource
request has already been fulfilled, put the job in an arbitrary queue: the
first one returned by the ctx->queues iterator.

Note that there should always be at least one queue in the map.  If no
named queues are defined, a default one is instantiated.
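The two lookup strategies described in this commit message can be sketched in C++ as follows. `QueuePolicy`, `QueueMap`, `lookup_by_R_key()`, and `first_queue()` are hypothetical names, and modeling `ctx->queues` as a `std::map` keyed by queue name is an assumption of this sketch, not a statement about Fluxion's actual types.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <optional>
#include <string>

// Hypothetical stand-in for a queue policy object; Fluxion's real type differs.
struct QueuePolicy {
    std::string name;
};

// Assumption: queues are kept in a std::map keyed by queue name.
using QueueMap = std::map<std::string, std::shared_ptr<QueuePolicy>>;

// Lookup as described for hello_cb() before this change: use the queue named
// by attributes.system.scheduler.queue in R, or "default" if the key is absent.
// Returns nullptr when no queue by that name exists.
std::shared_ptr<QueuePolicy> lookup_by_R_key(
    const QueueMap &queues, const std::optional<std::string> &queue_from_R)
{
    auto it = queues.find(queue_from_R.value_or("default"));
    if (it == queues.end())
        return nullptr;
    return it->second;
}

// Lookup proposed by this commit: ignore R and take the first queue the
// iterator yields (for std::map, the lexicographically smallest name).
std::shared_ptr<QueuePolicy> first_queue(const QueueMap &queues)
{
    // Invariant from the commit message: the map is never empty, because a
    // default queue is instantiated when no named queues are defined.
    assert(!queues.empty());
    return queues.begin()->second;
}
```

With only the named queues `batch` and `debug` configured, `lookup_by_R_key()` returns nullptr when R carries no queue key, which is the failure mode the follow-up commit describes; `first_queue()` always returns `batch`.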
garlick added a commit to garlick/flux-sched that referenced this issue Apr 16, 2024
Problem: when the fluxion modules are reloaded with running jobs,
and the match format is "rv1_nosched", and queues are enabled,
running jobs are killed with a fatal scheduler-restart exception.

During the hello handshake defined by RFC 27, the scheduler is informed,
during its initialization, of jobs that are holding resources.  The hello
callback in qmanager retrieves the job's queue name from the R key
"attributes.system.scheduler.queue".  The queue name is used to locate
the queue into which the job is inserted.

This attribute is not being set in R when the match format is "rv1_nosched",
so the default queue is assumed.  Since the default queue is not instantiated
when named queues are defined, a fatal job exception is raised when the
queue lookup fails.

There has been a proposal to deprecate the R attribute (flux-framework#1108), so rather than
ensure that it is set in this case, determine the queue instead by fetching
the jobspec from the KVS.
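A minimal C++ sketch of the fix's idea is shown below: the queue for a restarted job is resolved from its jobspec instead of from R. In the actual change the jobspec is fetched from the KVS; here a hypothetical `Jobspec` struct holding only the `attributes.system.queue` value is passed in directly. `resolve_queue()` is a hypothetical name, and falling back to "default" when the jobspec names no queue is an assumption of this sketch.

```cpp
#include <cassert>
#include <optional>
#include <set>
#include <stdexcept>
#include <string>

// Hypothetical, minimal view of a jobspec: only the queue name found at
// attributes.system.queue (nullopt if the jobspec does not name a queue).
struct Jobspec {
    std::optional<std::string> queue;
};

// Resolve a restarted job's queue from its jobspec rather than from R.
// The real fix fetches the jobspec from the KVS; here it is passed in.
// Assumption: "default" is used when the jobspec names no queue.
std::string resolve_queue(const std::set<std::string> &configured,
                          const Jobspec &jobspec)
{
    assert(!configured.empty());
    const std::string name = jobspec.queue.value_or("default");
    if (configured.count(name) == 0)
        throw std::runtime_error("queue '" + name + "' is not configured");
    return name;
}
```

With the named queues `batch` and `debug` configured, a job whose jobspec names `debug` resolves to `debug`. A jobspec that names no queue still throws in this sketch, so the approach relies on the jobspec carrying the queue name, which R does not in the "rv1_nosched" case.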
garlick added a commit to garlick/flux-sched that referenced this issue Apr 16, 2024
Fixes flux-framework#1108
milroy pushed a commit to milroy/flux-sched that referenced this issue Apr 17, 2024
milroy pushed a commit to milroy/flux-sched that referenced this issue Apr 20, 2024
milroy pushed a commit to milroy/flux-sched that referenced this issue Apr 20, 2024
milroy pushed a commit to milroy/flux-sched that referenced this issue Apr 20, 2024
milroy pushed a commit to milroy/flux-sched that referenced this issue Apr 20, 2024
milroy pushed a commit to milroy/flux-sched that referenced this issue Apr 20, 2024
milroy pushed a commit to milroy/flux-sched that referenced this issue Apr 21, 2024
milroy pushed a commit to milroy/flux-sched that referenced this issue Apr 21, 2024
milroy pushed a commit to milroy/flux-sched that referenced this issue Apr 23, 2024
milroy pushed a commit to milroy/flux-sched that referenced this issue Apr 23, 2024
milroy pushed a commit to milroy/flux-sched that referenced this issue Apr 24, 2024
milroy pushed a commit to milroy/flux-sched that referenced this issue Apr 24, 2024
milroy pushed a commit to milroy/flux-sched that referenced this issue Apr 24, 2024
mergify bot closed this as completed in 2f8ada9 Apr 24, 2024