rfc: add notification service design doc
Problem: no design currently exists for the Flux email service, as noted in flux-framework/flux-core#4435. Add an RFC-style document describing one.
Showing 5 changed files with 271 additions and 0 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,7 @@ | ||
attributes: | ||
  system: | ||
    notify: | ||
      include: "{id.f58} {event} {return_code}" | ||
      service: "slack" | ||
      handle: "elvis" | ||
      events: "FINISH" |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,3 @@ | ||
attributes: | ||
  system: | ||
    notify: "default" |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,250 @@ | ||
.. github display | ||
   GitHub is NOT the preferred viewer for this file. Please visit | ||
   https://flux-framework.rtfd.io/projects/flux-rfc/en/latest/spec_44.html | ||
44/Flux Library for Adaptable Notifications Version 1 | ||
########################################################### | ||
|
||
This specification describes the Flux service that allows users to | ||
receive external notifications for events in a Flux job. | ||
|
||
.. list-table:: | ||
   :widths: 25 75 | ||
|
||
   * - **Name** | ||
     - github.com/flux-framework/rfc/spec_44.rst | ||
   * - **Editor** | ||
     - William Hobbs <[email protected]> | ||
   * - **State** | ||
     - raw | ||
|
||
Language | ||
******** | ||
|
||
.. include:: common/language.rst | ||
|
||
Related Standards | ||
***************** | ||
|
||
- :doc:`spec_19` | ||
- :doc:`spec_21` | ||
- :doc:`spec_25` | ||
|
||
Background | ||
********** | ||
|
||
To support users whose batch jobs finish at unpredictable times that depend on | ||
queue wait, runtime, and other factors, the Flux Library for Adaptable | ||
Notifications (FLAN) provides an event-driven service that sends external | ||
notifications of job events. | ||
|
||
Terminology | ||
*********** | ||
|
||
These terms may have broader meaning in other RFCs or the Flux project. To | ||
avoid confusion, below is a glossary of terms as they apply in this document. | ||
|
||
Notification | ||
  An email, Slack message, Mattermost message, etc. triggered by FLAN but | ||
  ultimately external to the FLAN service. | ||
|
||
Notification-enabled jobs | ||
  Jobs that include a jobspec attribute requesting a notification for certain | ||
  events in the job's life cycle. For a more detailed definition of job events, | ||
  refer to :doc:`spec_21`. | ||
|
||
|
||
Requirements | ||
************ | ||
|
||
- By default in a system instance, do not notify a user of any job events. | ||
  Allow the user to override this default with a jobspec attribute, | ||
  ``system.notify``. | ||
- Support notification after any event of the job, where events are defined in | ||
  :doc:`spec_21`. | ||
- Support email as the primary delivery method for end-user notifications. | ||
- Build a driver capable of sending POST requests to chat services for | ||
  notification delivery, provided they have an API capable of accepting such | ||
  requests. Examples include but are not limited to Mattermost and Slack. | ||
- Use as few resources as possible in the Flux job-manager. Under no | ||
  circumstances will a notification block any stage or event of a Flux job. | ||
- Provide configurable rate-limiting so that users are not overwhelmed | ||
  by notifications. | ||
|
||
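The rate-limiting requirement above does not prescribe a mechanism. As a non-normative illustration, the sketch below shows one possible approach, a per-user sliding-window limiter. The class name, its parameters, and the injectable clock are hypothetical and are not part of FLAN.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `limit` notifications per user in any `window` seconds.

    Hypothetical helper for illustration only; not part of FLAN.
    """

    def __init__(self, limit, window, clock=time.monotonic):
        self.limit = limit
        self.window = window
        self.clock = clock  # injectable so the behavior can be tested
        self._sent = defaultdict(deque)  # user -> timestamps of recent sends

    def allow(self, user):
        now = self.clock()
        recent = self._sent[user]
        # Drop timestamps that have aged out of the window.
        while recent and now - recent[0] >= self.window:
            recent.popleft()
        if len(recent) >= self.limit:
            return False
        recent.append(now)
        return True
```

A driver built this way would call ``allow(user)`` before each delivery and skip or defer the notification when it returns ``False``. Both ``limit`` and ``window`` would come from site configuration.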
Implementation | ||
************** | ||
|
||
FLAN SHALL be implemented in two parts: | ||
|
||
The jobtap plugin | ||
  A shared library based on the API defined in | ||
  `flux-jobtap-plugins(7) <https://flux-framework.readthedocs.io/projects/flux-core/en/latest/man7/flux-jobtap-plugins.html>`_ | ||
  that streams the jobids of notification-enabled jobs to the Python driver. | ||
|
||
The Python driver | ||
  A Python process that tracks notification-enabled jobs through the job | ||
  life cycle. Started by the flux user on the node hosting the rank 0 broker | ||
  of a cluster, it asynchronously monitors the events of all notification-enabled | ||
  jobs. It attaches callbacks to certain events and sends notifications. | ||
|
||
Initial Request | ||
--------------- | ||
|
||
After the jobtap plugin has been loaded in the job-manager, the Python driver | ||
SHALL send a ``notify.enable`` streaming RPC request at initialization. | ||
|
||
The ``notify.enable`` request has no payload. | ||
|
||
At initialization the Python driver SHALL also create a KVS subdirectory, ``notify``. | ||
|
||
Initial Response | ||
---------------- | ||
|
||
Multiple responses may be sent to the initial ``notify.enable`` RPC request. | ||
The jobtap plugin SHALL keep a hash table of jobids that are ACTIVE and | ||
notification-enabled. On initialization, all of the jobids in the hash table | ||
SHALL be sent as individual responses to the Python driver. | ||
|
||
jobid | ||
  As defined in :doc:`spec_19`, a single jobid for a notification-enabled job. | ||
|
||
.. note:: | ||
   The hash table is intended to ensure that, should the Python driver crash, | ||
   upon restart it can "catch up" with all of the jobs that have been submitted | ||
   and send users the notifications they have requested. | ||
|
||
Additional Responses | ||
-------------------- | ||
|
||
The jobtap plugin SHALL continue to send responses to the initial | ||
``notify.enable`` RPC request whenever notification-enabled jobs enter the | ||
DEPEND state. The jobtap plugin SHALL add these jobids to its hash table | ||
of ACTIVE, notification-enabled jobs. | ||
|
||
For each response received, the Python driver SHALL create a | ||
KVS subdirectory, ``notify.<jobid>``. In this directory the driver SHALL | ||
insert keys representing the job events for which the user has requested a | ||
notification. The values of these keys SHALL be empty. Each key SHALL be deleted | ||
after the corresponding notification is sent. | ||
|
||
The Python driver MUST then asynchronously monitor the job for events | ||
of interest. | ||
|
||
When the job reaches an event of interest, FLAN SHALL generate an email | ||
and send it to the user. (FLAN MAY eventually support other means of | ||
notification delivery, such as chat services.) FLAN SHALL subsequently | ||
delete the corresponding key in the KVS, ``notify.<jobid>.<state>``. | ||
|
||
The ``notify.<jobid>`` KVS subdirectory SHALL be deleted when the job reaches | ||
an INACTIVE state. If the ``notify.<jobid>`` directory is non-empty when the job | ||
reaches the INACTIVE state, this indicates some notifications have been missed. | ||
The Python driver SHALL send a final notification to the user documenting | ||
that their notification-enabled job has reached an INACTIVE state. | ||
|
||
.. note:: | ||
   This design is intended to ensure that no duplicate notifications are sent upon | ||
   the restart of the Python driver, the jobtap plugin, or the job-manager. | ||
|
||
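The key bookkeeping described in this section can be sketched as follows. This is a non-normative illustration only. A plain dictionary stands in for the ``notify`` KVS directory, the ``send`` callable is a hypothetical delivery hook, and the sketch reads the final notification as being sent only when keys are left over, which is one interpretation of the text above.

```python
class NotifyTracker:
    """Track pending notifications; a dict stands in for the notify KVS directory.

    Hypothetical helper for illustration only; not part of FLAN.
    """

    def __init__(self, send):
        self.kvs = {}     # jobid -> set of event names still awaiting a notification
        self.send = send  # callable(jobid, message); hypothetical delivery hook

    def watch(self, jobid, events):
        # Mirrors creating notify.<jobid> with one empty key per requested event.
        self.kvs.setdefault(jobid, set(events))

    def on_event(self, jobid, event):
        pending = self.kvs.get(jobid)
        if pending and event in pending:
            self.send(jobid, f"job {jobid} reached {event}")
            pending.discard(event)  # delete the key once the notification is sent

    def on_inactive(self, jobid):
        pending = self.kvs.pop(jobid, None)  # mirrors deleting notify.<jobid>
        if pending:
            # Leftover keys mean some notifications were missed.
            self.send(jobid, f"job {jobid} is now INACTIVE")
```

Because a key is removed as soon as its notification goes out, replaying the same event a second time does not produce a second notification.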
Error Response | ||
-------------- | ||
|
||
If an error response is returned to ``notify.enable``, this indicates that the | ||
jobtap plugin is not loaded in the job-manager. The Python driver SHALL exit | ||
immediately and print an appropriate error message. | ||
|
||
Disconnect Request | ||
------------------ | ||
|
||
If a disconnect request is received by the jobtap plugin, this indicates the | ||
Python driver has exited. The jobtap plugin SHALL continue to add | ||
notification-enabled jobs to its hash table as they enter the DEPEND state. When | ||
the Python driver reconnects, the jobtap plugin SHALL respond to its initial | ||
``notify.enable`` RPC request with one response for each jobid that is being watched. | ||
|
||
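The request, response, and disconnect behavior described above can be modeled in a few lines. The sketch below is a non-normative, in-memory simulation written in Python for readability. The actual plugin would be a shared library using the jobtap API, and every name here is hypothetical. Removing a jobid when its job becomes INACTIVE is an assumption implied by the hash table holding only ACTIVE jobs.

```python
class NotifyPlugin:
    """In-memory stand-in for the jobtap plugin's bookkeeping (illustrative only)."""

    def __init__(self):
        self.active = {}     # insertion-ordered "hash table" of ACTIVE jobids
        self.respond = None  # callable(jobid) while a driver is connected

    def enable(self, respond):
        # notify.enable: replay every tracked jobid, then keep streaming.
        self.respond = respond
        for jobid in self.active:
            respond(jobid)

    def disconnect(self):
        # The driver exited; keep tracking jobs so it can catch up later.
        self.respond = None

    def job_depend(self, jobid):
        self.active[jobid] = True
        if self.respond is not None:
            self.respond(jobid)

    def job_inactive(self, jobid):
        # Assumption: jobs leave the table once they are no longer ACTIVE.
        self.active.pop(jobid, None)
```

A driver that reconnects after a crash receives every jobid that is still tracked, including jobs submitted while it was away.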
User Interface | ||
************** | ||
|
||
Users SHALL create notification-enabled jobs by specifying an attribute in their | ||
job's jobspec. Jobspec attributes are defined in :doc:`spec_25`. | ||
|
||
Basic Use Case | ||
-------------- | ||
|
||
Users SHALL add the following attribute to their jobspec: | ||
|
||
.. literalinclude:: data/spec_44/example2.yaml | ||
   :language: yaml | ||
|
||
The default behavior SHALL be to send a notification to the user's primary email | ||
address, as provided by an LDAP query, when the job reaches the START and FINISH | ||
events. | ||
|
||
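As a non-normative illustration of the default behavior, the sketch below composes an email for a single job event using Python's standard ``email`` package. The function name, the sender address, and the message wording are hypothetical. The LDAP lookup that yields the recipient address is out of scope here and is represented by a plain argument.

```python
from email.message import EmailMessage

# Events that trigger a notification in the default case described above.
DEFAULT_EVENTS = ("START", "FINISH")


def build_email(jobid, event, recipient, sender="flux@localhost"):
    """Compose (but do not send) a notification email for one job event.

    Hypothetical helper for illustration only; not part of FLAN.
    """
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = f"Flux job {jobid}: {event}"
    msg.set_content(f"Job {jobid} reached the {event} event.\n")
    return msg
```

The resulting message could be handed to ``smtplib.SMTP.send_message`` for delivery. Which mail transport a site uses is left open by this document.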
Advanced Use Cases | ||
------------------ | ||
|
||
Only the basic use case SHALL be supported in v1. | ||
|
||
The ``system.notify`` jobspec attribute SHALL accept a dictionary containing some | ||
or all of the following values: | ||
|
||
.. literalinclude:: data/spec_44/example1.yaml | ||
   :language: yaml | ||
|
||
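This document does not define how the ``include`` value is expanded. As a non-normative sketch, and assuming it is treated as a Python format string, the template from the example could be rendered as shown below. The function name and the sample values are hypothetical.

```python
from types import SimpleNamespace


def render(template, jobid_f58, event, return_code):
    """Expand an include template using Python's str.format field syntax.

    Hypothetical helper for illustration only; not part of FLAN.
    """
    # "{id.f58}" is attribute access on the "id" field, so supply an object
    # that carries an f58 attribute. The values passed in are placeholders.
    return template.format(
        id=SimpleNamespace(f58=jobid_f58),
        event=event,
        return_code=return_code,
    )
```

For instance, ``render("{id.f58} {event} {return_code}", "f2oXyZ", "FINISH", 0)`` returns ``"f2oXyZ FINISH 0"``.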
For System Administrators | ||
------------------------- | ||
|
||
The webhooks and other secrets required to connect to chat services SHALL be included | ||
in a ``config.toml`` file. The path to this file MUST be provided to the FLAN | ||
Python driver on initialization. Note that webhook URLs act as credentials, so best | ||
practice is to keep them secret. | ||
|
||
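As a non-normative illustration of how a webhook might be used, the sketch below builds a JSON POST request with Python's standard library. It assumes the chat service accepts an incoming-webhook payload of the form ``{"text": ...}``, which Slack and Mattermost incoming webhooks do. The function name is hypothetical, and the URL would be read from ``config.toml`` rather than hard-coded.

```python
import json
import urllib.request


def build_webhook_request(webhook_url, text):
    """Build (but do not send) a JSON POST carrying a chat message.

    Hypothetical helper for illustration only; not part of FLAN.
    """
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        webhook_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

The request can be sent with ``urllib.request.urlopen(req, timeout=...)``. Keeping construction separate from sending makes it easy to test without network access and to avoid logging the secret URL.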
Example Life Cycle of a Notification-Enabled Job | ||
************************************************ | ||
|
||
Coming soon! | ||
|
||
Edge Cases | ||
********** | ||
|
||
These edge cases MAY be supported in FLAN v1. | ||
|
||
Restarting the job-manager | ||
-------------------------- | ||
|
||
In the event the job-manager crashes or is shut down, the Python driver SHALL exit | ||
immediately and log an error. | ||
|
||
Flux does not currently support restarting with running jobs. However, on a system | ||
restart, all events for all ACTIVE jobs are replayed. This means that when each | ||
notification-enabled job reaches the DEPEND event, the jobtap plugin SHALL | ||
send a streaming RPC response and insert the job's jobid into its hash table. The | ||
Python driver, upon receiving a new jobid, MUST check whether the jobid already has | ||
an entry in the KVS. Since the KVS is reloaded on a restart, any outstanding | ||
notifications SHALL have corresponding keys there. If a jobid received by the Python | ||
driver already has a KVS subdirectory, the Python driver SHALL ignore the job's | ||
event notification requests in the jobspec and only send notifications that | ||
correspond to the keys in the KVS. This prevents a double-notification of the user | ||
for the same job state on a restart of the job-manager or FLAN service. | ||
|
||
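The restart rule above amounts to a single lookup. The following non-normative sketch uses a dictionary in place of the ``notify`` KVS directory. The function name and arguments are hypothetical.

```python
def events_to_notify(kvs, jobid, requested):
    """Return the events still owed for jobid, creating its entry if new.

    `kvs` is a dict standing in for the notify KVS directory.
    Hypothetical helper for illustration only; not part of FLAN.
    """
    if jobid in kvs:
        # Seen before the restart: trust the surviving keys, not the jobspec.
        return set(kvs[jobid])
    kvs[jobid] = set(requested)
    return set(requested)
```

For a job whose START notification was already delivered before the restart, only the remaining keys (for example FINISH) are returned, so START is not sent again.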
Expiration of notifications | ||
--------------------------- | ||
|
||
In certain cases, a restart of the service may be delayed such that events of interest | ||
on notification-enabled jobs are long past. FLAN MAY support an "expiration" setting | ||
that suppresses delivery of a notification once a configured amount of time has | ||
passed since the event. | ||
|
||
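An expiration check of this kind could be as small as the non-normative sketch below. The function name and the ``max_age`` parameter are hypothetical, and timestamps are assumed to be in seconds.

```python
def is_expired(event_time, now, max_age):
    """Return True if the event is older than max_age seconds.

    A max_age of None disables expiration.
    Hypothetical helper for illustration only; not part of FLAN.
    """
    if max_age is None:
        return False
    return now - event_time > max_age
```

A driver would evaluate this immediately before delivery and drop the notification, along with its KVS key, when the result is ``True``.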
Subinstance notifications | ||
------------------------- | ||
|
||
Due to the recursive launch feature of Flux, users may wish to have notifications | ||
for states of batch jobs that are not at the system-instance level. Support for | ||
this is OPTIONAL in FLAN v1. | ||
|
||
Invalid jobspec attributes | ||
-------------------------- | ||
|
||
FLAN MAY eventually provide a frobnicator plugin for validating the advanced use | ||
cases detailed above. In the interim, if a user attempts the advanced use case | ||
and provides invalid keys or values, FLAN SHALL fall back to the default behavior. | ||
|
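The interim fallback described above can be sketched as follows. This is a non-normative illustration. The set of allowed keys is taken from the advanced example in this document, the requirement that values be strings is an assumption based on that example, and the function name is hypothetical.

```python
# Keys that appear in the advanced use case example.
ALLOWED_KEYS = {"include", "service", "handle", "events"}


def normalize_notify(value):
    """Return a validated notify dict, or "default" if anything looks wrong.

    Hypothetical helper for illustration only; not part of FLAN.
    """
    if value == "default":
        return "default"
    if not isinstance(value, dict) or not value:
        return "default"  # wrong type or empty dictionary
    if set(value) - ALLOWED_KEYS:
        return "default"  # unknown key
    if not all(isinstance(v, str) for v in value.values()):
        return "default"  # assumption: every value is a string
    return dict(value)
```

Falling back silently keeps a malformed request from blocking the job, at the cost of the user receiving default notifications instead of the ones they intended.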