rfc: add notification service design doc
Problem: no design currently exists for the Flux
email service as noted in flux-framework/flux-core#4435.

Add a RFC-style document detailing this.
wihobbs committed May 13, 2024
1 parent f2ea6c3 commit d16577d
Showing 4 changed files with 252 additions and 0 deletions.
7 changes: 7 additions & 0 deletions data/spec_44/example1.yaml
@@ -0,0 +1,7 @@
attributes:
  system:
    notify:
      include: "{id.f58} {event} {return_code}"
      service: "slack"
      handle: "elvis"
      events: "FINISH"
3 changes: 3 additions & 0 deletions data/spec_44/example2.yaml
@@ -0,0 +1,3 @@
attributes:
  system:
    notify: "default"
7 changes: 7 additions & 0 deletions index.rst
@@ -283,6 +283,12 @@ standard I/O management of remote processes.

The Flux Job List Service provides read-only summary information for jobs.

:doc:`spec_44`
~~~~~~~~~~~~~~

The Flux Library for Adaptable Notifications (FLAN) provides a connection to
external notification services (such as email) for steps in a batch job.

.. Each file must appear in a toctree
.. toctree::
   :hidden:
@@ -328,3 +334,4 @@ The Flux Job List Service provides read-only summary information for jobs.
   spec_41
   spec_42
   spec_43
   spec_44
235 changes: 235 additions & 0 deletions spec_44.rst
@@ -0,0 +1,235 @@
.. github display
   GitHub is NOT the preferred viewer for this file. Please visit
   https://flux-framework.rtfd.io/projects/flux-rfc/en/latest/spec_44.html

44/Flux Library for Adaptable Notifications Version 1
###########################################################

This specification describes the Flux service that allows users to
receive external notifications when their batch jobs reach certain
job events, as defined in :doc:`spec_21`.

.. list-table::
  :widths: 25 75

  * - **Name**
    - github.com/flux-framework/rfc/spec_44.rst
  * - **Editor**
    - William Hobbs <[email protected]>
  * - **State**
    - raw

Language
********

.. include:: common/language.rst

Related Standards
*****************

- :doc:`spec_19`
- :doc:`spec_21`
- :doc:`spec_25`

Background
**********

To support users whose batch jobs finish at unpredictable times because of
queue wait, runtime, and other variable factors, Flux SHALL provide the
Flux Library for Adaptable Notifications (FLAN). FLAN SHALL be capable of
sending email notifications to users upon the completion of their batch job.
FLAN SHALL be a shared library jobtap plugin loaded in the Flux job manager,
with an accompanying Python driver to orchestrate job monitoring and
notification transmission.

Terminology
***********

These terms may have broader meaning in other RFCs or the Flux project. To
avoid confusion, below is a glossary of terms as they apply in this document.

Notification
  An email, Slack message, Mattermost message, etc. triggered by FLAN but
  ultimately external to the FLAN service.

Chat services
  Slack, Mattermost, etc. Any service whose API can receive a POST request
  and retransmit it in a human-readable form to a user.

Notification-enabled jobs
  Jobs that include a jobspec attribute requesting a notification for certain
  events in the job's lifecycle.

The Python driver
  A Python process used for tracking notification-enabled jobs through the job
  lifecycle. Started by the flux user on the node containing the rank 0 broker
  in a cluster, it asynchronously monitors the events for all jobs in the cluster
  requesting notification. It attaches callbacks to certain events and sends
  notifications.

The jobtap plugin
  A shared library based on the API defined in
  `flux-jobtap-plugins(7) <https://flux-framework.readthedocs.io/projects/flux-core/en/latest/man7/flux-jobtap-plugins.html#jobtap-plugin-names>`_,
  which streams the jobids of notification-enabled jobs to the Python driver.

Requirements
************

- By default in a system-instance, do not notify a user of any job events.
  Allow the user to override this default with a jobspec attribute,
  ``system.notify``.
- Support notification after any event of the job, where events are defined in
  :doc:`spec_21`.
- Support email notifications, as well as a driver capable of sending POST
  requests to any chat service that has an API capable of accepting
  such requests. Example services include, but are not limited to, Mattermost and
  Slack.
- Utilize as few resources as possible in the Flux job-manager. Under no
  circumstances should a notification block any stage or event of a Flux job.
- Provide configurable rate-limiting to ensure users can never be overwhelmed
  by a deluge of notifications, regardless of the number of jobs they submit.
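
The rate-limiting requirement could be met in several ways. The following is a
minimal sketch of one option, a per-user sliding-window limiter. The class
name, its parameters, and the injectable clock are illustrative assumptions
and are not defined by this specification.

.. code-block:: python

   import time
   from collections import defaultdict, deque


   class NotificationRateLimiter:
       """Allow at most ``max_per_window`` notifications per user per window.

       Hypothetical helper: the name and parameters are assumptions made for
       illustration only.
       """

       def __init__(self, max_per_window, window_seconds, clock=time.monotonic):
           self.max_per_window = max_per_window
           self.window_seconds = window_seconds
           self.clock = clock
           # userid -> timestamps of notifications already sent in the window
           self._sent = defaultdict(deque)

       def allow(self, userid):
           """Return True and record the send if the user is under the limit."""
           now = self.clock()
           sent = self._sent[userid]
           # Discard timestamps that have aged out of the window.
           while sent and now - sent[0] >= self.window_seconds:
               sent.popleft()
           if len(sent) >= self.max_per_window:
               return False
           sent.append(now)
           return True

A driver could consult ``allow()`` before each send. What happens to a
suppressed notification (dropped, batched, or deferred) is left open here.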

Implementation
**************

After the jobtap plugin has been loaded in the job-manager, the Python driver
SHALL send a ``notify.enable`` streaming RPC request at initialization.

The ``notify.enable`` request has no payload.

At initialization the Python driver SHALL create a KVS subdirectory, ``notify``.

Initial Response
----------------

Multiple responses may be sent to the initial ``notify.enable`` RPC request.
The jobtap plugin SHALL keep a hash table of jobids that are ACTIVE and
notification-enabled.

jobid
  As defined in :doc:`spec_19`, a single jobid for a notification-enabled job.

.. note::
   The hash table is intended to ensure that, should the Python driver crash,
   upon restart it can "catch up" with all of the jobs that have been submitted
   and send users the notifications they have requested.

Additional Responses
--------------------

The jobtap plugin SHALL continue to send responses to the initial
``notify.enable`` RPC request whenever notification-enabled jobs enter the
DEPEND state. The jobtap plugin SHALL add these jobs' jobids to its hash
table of ACTIVE, notification-enabled jobs.

For each response received by the Python driver, the driver SHALL create a
KVS subdirectory, ``notify.<jobid>``. In this directory the driver SHALL
insert keys representing the job events for which the user has requested a
notification. These keys' values SHALL be empty. Each key SHALL be deleted
after the corresponding notification is sent.

The ``notify.<jobid>`` subdirectory SHALL be deleted when the job reaches an
INACTIVE state. If the ``notify.<jobid>`` directory is non-empty upon reaching
the INACTIVE state, this indicates some notifications have been missed. The
Python driver SHALL send a final notification to the user documenting that
their notification-enabled job has reached an INACTIVE state.

.. note::
   This design is intended to ensure that no double-notifications are sent upon
   the restart of the Python driver, the jobtap plugin, or the job-manager.
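
The key lifecycle described in this section can be sketched as follows. This
is a simplified model, not the driver itself: a plain dictionary stands in for
the ``notify.<jobid>`` KVS subdirectories, and the class name, method names,
and the ``"INACTIVE"`` label passed to the hypothetical ``send`` callback are
assumptions for illustration.

.. code-block:: python

   class PendingNotifications:
       """In-memory stand-in for the ``notify.<jobid>`` KVS subdirectories."""

       def __init__(self):
           # jobid -> set of event names still awaiting a notification
           self._kvs = {}

       def register(self, jobid, events):
           """Create ``notify.<jobid>`` with one empty key per requested event."""
           self._kvs[jobid] = set(events)

       def notify(self, jobid, event, send):
           """Send the notification for ``event`` once, then delete its key."""
           pending = self._kvs.get(jobid, set())
           if event not in pending:
               return False  # not requested, or already sent
           send(jobid, event)
           pending.discard(event)
           return True

       def finalize(self, jobid, send):
           """On INACTIVE: report missed notifications, then drop the directory."""
           missed = self._kvs.pop(jobid, set())
           if missed:
               send(jobid, "INACTIVE")
           return sorted(missed)

Because a key is removed only after its notification has been handed to
``send``, a second call for the same event is a no-op, which is the property
the note above relies on.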

Error Response
--------------

If an error response is returned to ``notify.enable``, this indicates that the
jobtap plugin is not loaded in the job-manager. The Python driver SHALL print
an appropriate error message and exit immediately.

Disconnect Request
------------------

If a disconnect request is received by the jobtap plugin, this indicates the
Python driver has exited. The jobtap plugin SHALL continue to add
notification-enabled jobs to its hash table as they enter the DEPEND state.
When the Python driver reconnects, the jobtap plugin SHALL respond to its
initial ``notify.enable`` RPC request with one response for each jobid that is
being watched.
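
The plugin-side bookkeeping across a disconnect and reconnect can be modeled
as below. The real jobtap plugin is a shared library; this Python class is
only a behavioral sketch, and its method names are hypothetical. Removing a
jobid once the job is no longer ACTIVE is an assumption inferred from the
table being described as holding ACTIVE jobs.

.. code-block:: python

   class JobtapModel:
       """Python model of the jobtap plugin's hash table (illustration only)."""

       def __init__(self):
           # Insertion-ordered table of ACTIVE, notification-enabled jobids.
           self.active = {}
           # Callable delivering one response to the driver, or None.
           self.stream = None

       def on_enable(self, respond):
           """Handle ``notify.enable``: replay every tracked jobid."""
           self.stream = respond
           for jobid in self.active:
               respond(jobid)

       def on_disconnect(self):
           """The driver has exited: keep tracking jobs, stop responding."""
           self.stream = None

       def on_depend(self, jobid):
           """A notification-enabled job entered the DEPEND state."""
           self.active[jobid] = True
           if self.stream is not None:
               self.stream(jobid)

       def on_inactive(self, jobid):
           """The job is no longer ACTIVE, so stop tracking it (assumption)."""
           self.active.pop(jobid, None)

In this model a job submitted while the driver is down is still recorded, so
the next ``notify.enable`` request receives its jobid.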

User Interface
**************

Users SHALL create notification-enabled jobs by specifying an attribute in their
job's jobspec. Jobspec attributes are defined in :doc:`spec_25`.

Basic Use Case
--------------

Users SHALL add the following attribute to their jobspec:

.. literalinclude:: data/spec_44/example2.yaml
   :language: yaml

The default behavior SHALL be to send a notification to the user's primary email
address, as provided by an LDAP query, when the job reaches the START and FINISH
events.

Advanced Use Cases
------------------

Only the basic use case SHALL be supported in v1. The interface below describes
the intended behavior of later versions.

The ``system.notify`` jobspec attribute SHALL accept a dictionary containing some
or all of the following keys:

.. literalinclude:: data/spec_44/example1.yaml
   :language: yaml
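
One possible reading of the ``include`` key is a Python format string whose
fields are filled from the job. The sketch below assumes that reading; this
specification does not state the template syntax, and the function name and
the F58 string shown are placeholders invented for illustration.

.. code-block:: python

   from types import SimpleNamespace


   def render_include(template, jobid, event, return_code):
       """Expand an ``include`` template into notification text.

       ``jobid`` is any object exposing an ``f58`` attribute.
       """
       return template.format(id=jobid, event=event, return_code=return_code)


   # Placeholder job ID; the F58 string below is made up for illustration.
   jobid = SimpleNamespace(f58="f2oFbCsNd")
   message = render_include("{id.f58} {event} {return_code}", jobid, "FINISH", 0)

Because ``str.format`` follows attribute lookups, a real implementation would
likely restrict which fields a user-supplied template may reference.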

For System Administrators
-------------------------

The webhooks and other secrets required to connect to chat services SHALL be included
in a ``config.toml`` file. The path to this file MUST be provided to the FLAN
Python driver on initialization. Note that webhook URLs act as credentials, so
best practice is to keep them secret, for example by restricting read access
to this file.
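
As a rough sketch of how a driver might use a configured webhook, the code
below builds (but does not send) a JSON POST request with the standard
library. Slack and Mattermost incoming webhooks accept a JSON body with a
``text`` field; the configuration layout, key names, and URL here are
placeholders, since this specification does not define the contents of
``config.toml``.

.. code-block:: python

   import json
   import urllib.request


   def build_webhook_request(webhook_url, text):
       """Build (but do not send) a JSON POST request for a chat webhook."""
       body = json.dumps({"text": text}).encode("utf-8")
       return urllib.request.Request(
           webhook_url,
           data=body,
           headers={"Content-Type": "application/json"},
           method="POST",
       )


   # Hypothetical entry as it might look after loading the administrator's
   # config file; the key names and URL are placeholders.
   config = {"slack": {"webhook": "https://hooks.example.invalid/services/PLACEHOLDER"}}
   request = build_webhook_request(config["slack"]["webhook"], "job finished")
   # urllib.request.urlopen(request, timeout=10) would transmit it.

Keeping request construction separate from transmission lets the send happen
outside any path that could delay job events.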

Example Lifecycle of a Notification-Enabled Job
***********************************************

Coming soon!

Edge Cases
**********

These edge cases MAY be supported in FLAN v1.

Restarting the job-manager
--------------------------

In the event the job-manager crashes or is shut down, the Python driver SHALL
log an error and exit immediately.

Flux does not currently support restarting with running jobs. However, on a system
restart, all events for all ACTIVE jobs are replayed. This means that when each
notification-enabled active job reaches the DEPEND state, the jobtap plugin SHALL
send a streaming RPC response and insert the job's jobid into its hash table. The
Python driver, upon receiving a new jobid, MUST check whether the jobid already has
an entry in the KVS. Since the KVS is reloaded on a restart, any outstanding
notifications SHALL have corresponding keys there. If a jobid received by the Python
driver already has a KVS subdirectory, the Python driver SHALL ignore the job's
event notification requests in the jobspec and only send notifications that
correspond to the keys in the KVS. This prevents a double-notification of the user
for the same job state on a restart of the job-manager or FLAN service.
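
The reconciliation rule above can be sketched as a single function. A plain
dictionary stands in for the ``notify`` KVS directory, and the function name
and signature are hypothetical.

.. code-block:: python

   def events_to_watch(kvs, jobid, requested_events):
       """Return the events the driver should still notify on for ``jobid``.

       ``kvs`` is a dict standing in for the ``notify`` KVS directory, mapping
       jobid -> set of event keys with notifications outstanding.
       """
       if jobid in kvs:
           # Seen before a restart: trust the surviving keys, not the jobspec.
           return set(kvs[jobid])
       # First sighting: create the entry from the jobspec request.
       kvs[jobid] = set(requested_events)
       return set(requested_events)

For a job whose START notification was sent before the restart, only the
remaining keys are returned, so START is not sent a second time.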

Subinstance notifications
-------------------------

Due to the recursive launch feature of Flux, users may wish to have notifications
for states of batch jobs that are not at the system-instance level. This will not
be supported in FLAN v1.

Invalid Jobspec Attributes
--------------------------

FLAN MAY eventually provide a frobnicator plugin for validating the advanced use
cases detailed above. In the interim, should a user attempt the advanced use
case and provide invalid keys or values, FLAN SHALL fall back to the default mode.
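
A minimal sketch of that fallback is shown below. The accepted key names are
taken from the advanced example in this document; the function name, the
representation of the default request, and the comma-separated handling of
``events`` are assumptions for illustration.

.. code-block:: python

   KNOWN_KEYS = {"include", "service", "handle", "events"}


   def default_notify():
       """Default mode: email on START and FINISH."""
       return {"service": "email", "events": ["START", "FINISH"]}


   def normalize_notify(value):
       """Return a notification request, falling back to the default on junk."""
       if value == "default":
           return default_notify()
       # Anything that is not a non-empty dict of known keys is treated as junk.
       if not isinstance(value, dict) or not value or set(value) - KNOWN_KEYS:
           return default_notify()
       if not all(isinstance(v, str) for v in value.values()):
           return default_notify()
       request = default_notify()
       request.update(value)
       # Hypothetical convention: "events" may be a comma-separated string.
       if isinstance(request["events"], str):
           request["events"] = [e.strip() for e in request["events"].split(",")]
       return request

Falling back silently keeps job submission from failing on a malformed
attribute, at the cost of the user receiving default notifications they did
not explicitly configure.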
