rfc: add notification service design doc
Problem: no design currently exists for the Flux email service as noted in flux-framework/flux-core#4435. Add a RFC-style document detailing this.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,7 @@ | ||
attributes: | |
  system: | |
    notify: | |
      include: "{id.f58} {event} {return_code}" | |
      service: "slack" | |
      handle: "elvis" | |
      events: "FINISH" |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,3 @@ | ||
attributes: | |
  system: | |
    notify: "default" |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,235 @@ | ||
.. github display | |
   GitHub is NOT the preferred viewer for this file. Please visit | |
   https://flux-framework.rtfd.io/projects/flux-rfc/en/latest/spec_28.html | |
44/Flux Library for Adaptable Notifications Version 1 | ||
########################################################### | ||
|
||
This specification describes the Flux service that allows users to | ||
receive external notifications when their batch jobs enter certain | ||
events, as described in :doc:`spec_21`. | ||
|
||
.. list-table:: | |
   :widths: 25 75 | |
|
||
   * - **Name** | |
     - github.com/flux-framework/rfc/spec_44.rst | |
   * - **Editor** | |
     - William Hobbs <[email protected]> | |
   * - **State** | |
     - raw | |
|
||
Language | ||
******** | ||
|
||
.. include:: common/language.rst | ||
|
||
Related Standards | ||
***************** | ||
|
||
- :doc:`spec_19` | ||
- :doc:`spec_21` | ||
- :doc:`spec_25` | ||
|
||
Background | ||
********** | ||
|
||
To support users whose batch jobs end at times that vary with queue wait, | |
runtime, and other factors, Flux SHALL provide the | |
Flux Library for Adaptable Notifications (FLAN). FLAN SHALL be capable of | |
sending email notifications to users upon the completion of their batch job. | |
FLAN SHALL be a shared library jobtap plugin loaded in the Flux job manager, | |
with an accompanying Python driver to orchestrate job monitoring and | |
notification transmission. | |
|
||
Terminology | ||
*********** | ||
|
||
These terms may have broader meaning in other RFCs or the Flux project. To | ||
avoid confusion, below is a glossary of terms as they apply in this document. | ||
|
||
Notification | ||
  An email, Slack message, Mattermost message, etc. triggered by FLAN but | |
  ultimately external to the FLAN service. | |
|
||
Chat services | ||
  Slack, Mattermost, etc. Any service whose API can receive a POST request | |
  and retransmit it in a human-readable form to a user. | |
|
||
Notification-enabled jobs | ||
  Jobs that include a jobspec attribute requesting a notification for certain | |
  events in the job's lifecycle. | |
|
||
The Python driver | |
  A Python process used for tracking notification-enabled jobs through the job | |
  lifecycle. Started by the flux user on the node hosting the rank 0 broker | |
  of a cluster, it asynchronously monitors the events of all jobs in the cluster | |
  that request notification. It attaches callbacks to certain events and sends | |
  notifications. | |
|
||
The jobtap plugin | |
  A shared library based on the API defined in | |
  `flux-jobtap-plugins(7) <https://flux-framework.readthedocs.io/projects/flux-core/en/latest/man7/flux-jobtap-plugins.html#jobtap-plugin-names>`_ | |
  which streams the jobids of notification-enabled jobs to the Python driver. | |
|
||
Requirements | ||
************ | ||
|
||
- By default in a system-instance, do not notify a user of any job events. | |
  Allow the user to override this default with a jobspec attribute, | |
  ``system.notify``. | |
- Support notification after any event of the job, where events are defined in | |
  :doc:`spec_21`. | |
- Support email notifications, as well as a driver capable of sending POST | |
  requests to any chat service, provided they have an API capable of accepting | |
  such requests. Example services include, but are not limited to, Mattermost and | |
  Slack. | |
- Utilize as few resources as possible in the Flux job-manager. Under no | |
  circumstances should a notification block any stage or event of a Flux job. | |
- Provide configurable rate-limiting to ensure users can never be overwhelmed | |
  by a deluge of notifications, regardless of the number of jobs they submit. | |
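The rate-limiting requirement above could be met in several ways; this document does not prescribe one. As an illustration only, here is a minimal sliding-window sketch in Python. The class name, its parameters, and the per-user keying are all assumptions made for this example and are not part of FLAN.

```python
import time
from collections import defaultdict, deque


class NotificationRateLimiter:
    """Hypothetical sliding-window limiter (not part of FLAN).

    Allows at most `max_per_window` notifications per user within any
    `window_seconds` interval.
    """

    def __init__(self, max_per_window, window_seconds, clock=time.monotonic):
        self.max_per_window = max_per_window
        self.window_seconds = window_seconds
        self.clock = clock  # injectable so the limiter can be tested
        self._sent = defaultdict(deque)  # userid -> timestamps of recent sends

    def allow(self, userid):
        """Return True and record the send if the user is under the limit."""
        now = self.clock()
        recent = self._sent[userid]
        # Drop timestamps that have aged out of the window.
        while recent and now - recent[0] >= self.window_seconds:
            recent.popleft()
        if len(recent) >= self.max_per_window:
            return False
        recent.append(now)
        return True
```

A driver built this way would call `allow(userid)` before each send and drop or coalesce the notification when it returns `False`. Because the check is a constant-time in-memory operation, it would not block any job event.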
|
||
Implementation | ||
************** | ||
|
||
After the jobtap plugin has been loaded in the job-manager, the Python driver | |
SHALL send a ``notify.enable`` streaming RPC request at initialization. | |
|
||
The ``notify.enable`` request has no payload. | |
|
||
At initialization the Python driver SHALL create a KVS subdirectory, ``notify``. | |
|
||
Initial Response | ||
---------------- | ||
|
||
Multiple responses may be sent to the initial ``notify.enable`` RPC request. | ||
The jobtap plugin SHALL keep a hash table of jobids that are ACTIVE and | ||
notification-enabled. | ||
|
||
jobid | ||
  As defined in :doc:`spec_19`, a single jobid for a notification-enabled job. | |
|
||
.. note:: | |
   The hash table is intended to ensure that, should the Python driver crash, | |
   upon restart it can "catch up" with all of the jobs that have been submitted | |
   and send users the notifications they have requested. | |
|
||
Additional Responses | ||
-------------------- | ||
|
||
The jobtap plugin SHALL continue to send responses to the initial | |
``notify.enable`` RPC request whenever notification-enabled jobs enter the | |
DEPEND state. The jobtap plugin SHALL add these jobs' jobids to its hash | |
table of ACTIVE, notification-enabled jobs. | |
|
||
For each response received by the Python driver, the driver SHALL create a | |
KVS subdirectory, ``notify.<jobid>``. In this directory the driver SHALL | |
insert keys representing the job events for which users have requested a | |
notification. These keys' values SHALL be empty. Each key SHALL be deleted | |
after the corresponding notification is sent. | |
|
||
The ``notify.<jobid>`` subdirectory SHALL be deleted when the job reaches the INACTIVE state. | |
If the ``notify.<jobid>`` directory is non-empty upon reaching the INACTIVE | |
state, this indicates some notifications have been missed. The Python driver | |
SHALL send a final notification to the user documenting that their | |
notification-enabled job has reached the INACTIVE state. | |
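The key bookkeeping described above can be modeled without a running Flux instance. The following is a sketch only: a plain dictionary stands in for the KVS, and the class, method names, and `send` callback are hypothetical. It also assumes the final notification is sent only when keys remain, which is one reading of the paragraph above.

```python
class NotifyBookkeeper:
    """Hypothetical in-memory model of the driver's KVS bookkeeping."""

    def __init__(self):
        # Stand-in for the KVS: "notify.<jobid>" -> set of pending event keys.
        self.kvs = {}

    def on_response(self, jobid, requested_events):
        # Create notify.<jobid> with one empty-valued key per requested event.
        # An existing directory is left untouched so that notifications
        # already sent are not re-armed.
        self.kvs.setdefault(f"notify.{jobid}", set(requested_events))

    def on_event(self, jobid, event, send):
        pending = self.kvs.get(f"notify.{jobid}")
        if pending is not None and event in pending:
            send(jobid, event)
            pending.discard(event)  # delete the key once the notification is sent

    def on_inactive(self, jobid, send):
        # Delete the directory; leftover keys mean notifications were missed.
        pending = self.kvs.pop(f"notify.{jobid}", None)
        if pending:
            send(jobid, "INACTIVE")  # assumption: final notice only if keys remain
```

Because a key is removed as soon as its notification goes out, replaying the same event after a restart finds no key and sends nothing.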
|
||
.. note:: | |
   This design is intended to ensure that no double-notifications are sent upon | |
   the restart of the Python script, the jobtap plugin, or the job-manager. | |
|
||
Error Response | ||
-------------- | ||
|
||
If an error response is returned to ``notify.enable``, this indicates that the | |
jobtap plugin is not loaded in the job-manager. The Python driver SHALL print | |
an appropriate error message and exit immediately. | |
|
||
Disconnect Request | ||
------------------ | ||
|
||
If a disconnect request is received by the jobtap plugin, this indicates the | |
Python driver has exited. The jobtap plugin SHALL continue to add | |
notification-enabled jobs to its hash table as they enter the DEPEND state. When | |
the Python driver reconnects, the jobtap plugin SHALL respond to its initial | |
``notify.enable`` RPC request with one response for each jobid that is being watched. | |
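The plugin-side behavior across a disconnect and reconnect can be illustrated with a small Python model. This is not the jobtap plugin itself, which is a shared library; the class and method names are hypothetical, and removing a jobid when the job becomes INACTIVE is inferred from the hash table holding only ACTIVE jobs.

```python
class JobtapPluginModel:
    """Hypothetical Python model of the plugin's hash table and streaming RPC."""

    def __init__(self):
        self.active = {}    # jobid -> True; dict keeps insertion order for replay
        self.stream = None  # callable that delivers one response, or None

    def on_enable(self, respond):
        # A notify.enable request arrived: replay every watched jobid.
        self.stream = respond
        for jobid in self.active:
            respond(jobid)

    def on_disconnect(self):
        # The driver exited; keep tracking jobs but stop responding.
        self.stream = None

    def on_depend(self, jobid, notify_enabled):
        if not notify_enabled:
            return
        self.active[jobid] = True
        if self.stream is not None:
            self.stream(jobid)

    def on_inactive(self, jobid):
        # Assumption: jobs leave the table once they are no longer ACTIVE.
        self.active.pop(jobid, None)
```

In this model a job that enters DEPEND while the driver is down is not lost: it is delivered in the replay triggered by the next `notify.enable` request.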
|
||
User Interface | ||
************** | ||
|
||
Users SHALL create notification-enabled jobs by specifying an attribute in their | |
job's jobspec. Jobspec attributes are defined in :doc:`spec_25`. | |
|
||
Basic Use Case | ||
-------------- | ||
|
||
Users SHALL add the following attribute to their jobspec: | ||
|
||
.. literalinclude:: data/spec_44/example2.yaml | |
   :language: yaml | |
|
||
The default behavior SHALL be to send a notification to the user's primary email | |
address, as provided by an LDAP query, when the job reaches the START and FINISH | |
events. | |
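As an illustration of the default behavior, the sketch below builds one email per event with Python's standard `email.message` module. The function name, subject line, and body wording are assumptions made for this example; the LDAP lookup and the actual mail transport are out of scope here.

```python
from email.message import EmailMessage


def build_default_email(address, jobid_f58, event):
    """Hypothetical helper: compose the default notification for one event.

    `address` is assumed to have been resolved already (e.g. via LDAP).
    """
    msg = EmailMessage()
    msg["To"] = address
    msg["Subject"] = f"Flux job {jobid_f58}: {event}"
    msg.set_content(f"Job {jobid_f58} reached the {event} event.")
    return msg
```

Under the default described above, such a helper would be called twice per job, once for START and once for FINISH.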
|
||
Advanced Use Cases | ||
------------------ | ||
|
||
Only the basic use case SHALL be supported in v1. | ||
|
||
The ``system.notify`` jobspec attribute SHALL accept a dictionary containing some | ||
or all of the following values: | ||
|
||
.. literalinclude:: data/spec_44/example1.yaml | |
   :language: yaml | |
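The ``include`` value in the example resembles a Python format string. Assuming it is meant as a message template (this document does not say so explicitly), a driver could render it with `str.format`, which resolves attribute lookups such as `{id.f58}`. The helper name and the placeholder jobid below are hypothetical.

```python
from types import SimpleNamespace


def render_include(template, jobid, event, return_code):
    """Hypothetical helper: fill an ``include`` template for one notification."""
    return template.format(id=jobid, event=event, return_code=return_code)


# Stand-in for a job ID object exposing an F58 representation as `.f58`;
# the value is a placeholder, not a real encoded jobid.
job = SimpleNamespace(f58="fABC123")
message = render_include("{id.f58} {event} {return_code}", job, "FINISH", 0)
```

A template that names an unknown field raises `KeyError` or `AttributeError` in this sketch, which a driver could treat as an invalid attribute.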
|
||
For System Administrators | ||
------------------------- | ||
|
||
The webhooks and other secrets required to connect to chat services SHALL be included | |
in a ``config.toml`` file. The path to this file MUST be provided to the FLAN | |
Python driver on initialization. Note that best practice for managing webhooks is | |
to keep them secret. | |
|
||
Example Lifecycle of a Notification-Enabled Job | ||
*********************************************** | ||
|
||
Coming soon! | ||
|
||
Edge Cases | ||
********** | ||
|
||
These edge cases MAY be supported in FLAN v1. | ||
|
||
Restarting the job-manager | ||
-------------------------- | ||
|
||
In the event the job-manager crashes or is shut down, the Python driver SHALL log | |
an error and exit immediately. | |
|
||
Flux does not currently support restarting with running jobs. However, on a system | |
restart, all events for all ACTIVE jobs are replayed. This means that when each | |
notification-enabled ACTIVE job reaches the DEPEND state, the jobtap plugin SHALL | |
send a streaming RPC response and insert the job's jobid into its hash table. The | |
Python driver, upon receiving a jobid, MUST check whether that jobid already has | |
an entry in the KVS. Since the KVS is reloaded on a restart, any outstanding | |
notifications SHALL have corresponding keys there. If a jobid received by the Python | |
driver already has a KVS subdirectory, the Python driver SHALL ignore the event | |
notification requests in the job's jobspec and send only the notifications that | |
correspond to the keys in the KVS. This prevents double-notifying the user | |
for the same job event on a restart of the job-manager or FLAN service. | |
|
||
Subinstance notifications | ||
------------------------- | ||
|
||
Due to the recursive launch feature of Flux, users may wish to have notifications | ||
for states of batch jobs that are not at the system-instance level. This will not | ||
be supported in FLAN v1. | ||
|
||
Invalid Jobspec Attributes | ||
-------------------------- | ||
|
||
FLAN MAY eventually provide a frobnicator plugin for validating the advanced use | |
cases detailed above. In the interim, should a user attempt an advanced use case | |
and provide junk keys or values, FLAN SHALL fall back to the default behavior. | |
|