Job lifecycle notifications #30

Draft
wants to merge 21 commits into
base: main

@AdrianoKF AdrianoKF commented Jul 16, 2024

The watcher Go application uses the Kubernetes client API to monitor jobs in the current namespace and send notifications for major lifecycle events.

See the README for additional details.

Original PR description follows:


Jobs need to be annotated in order to be eligible for notifications:

Use x-jobby.io/notify-channel: [slack|webhook] to determine the notification channel (HTTP webhook or Slack), and then control the destinations with x-jobby.io/slack-channel-ids (comma-separated) or x-jobby.io/webhook-urls (comma-separated) annotations.

If using the Slack notification channel, pass the Slack API token through the WATCHER_SLACK_API_TOKEN env variable.
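As a rough illustration of how these annotations could be interpreted, the sketch below parses a job's annotation map into a channel and a list of destinations. It is written in Python for brevity and is not the watcher's Go code: `NotifyTarget` and `parse_notify_annotations` are hypothetical names, and treating unknown channels or empty destination lists as "not eligible" is an assumption.

```python
from dataclasses import dataclass
from typing import Mapping, Optional, Tuple

CHANNEL_KEY = "x-jobby.io/notify-channel"
# Destination annotation for each channel named in the description above.
DESTINATION_KEYS = {
    "slack": "x-jobby.io/slack-channel-ids",
    "webhook": "x-jobby.io/webhook-urls",
}


@dataclass(frozen=True)
class NotifyTarget:
    """Hypothetical container for the parsed annotation values."""
    channel: str
    destinations: Tuple[str, ...]


def parse_notify_annotations(annotations: Mapping[str, str]) -> Optional[NotifyTarget]:
    """Return the notification target for a job, or None if it is not eligible."""
    channel = annotations.get(CHANNEL_KEY, "").strip().lower()
    if channel not in DESTINATION_KEYS:
        # Assumption: jobs without a known channel are silently skipped.
        return None
    raw = annotations.get(DESTINATION_KEYS[channel], "")
    # Split the comma-separated list and drop empty entries.
    destinations = tuple(part.strip() for part in raw.split(",") if part.strip())
    if not destinations:
        return None
    return NotifyTarget(channel, destinations)
```

A job annotated with `slack` and the channel list `"a, b"` would yield a target with two destinations under these assumptions; a job without the channel annotation yields `None`.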

Notification Channels

The code uses the notify package, which supports a variety of destinations.
So far, the following have been implemented:

  • webhook: HTTP (POST) webhooks to a URL
  • slack: Slack message to one or multiple channels, requires a Slack API token
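To make the two channels concrete, here is a minimal Python sketch of the HTTP requests each one might produce. It does not reflect how the Go notify package works internally. The `build_requests` helper and the `{"text": ...}` webhook payload are assumptions made for illustration; the Slack branch targets Slack's `chat.postMessage` Web API method and reads the token from `WATCHER_SLACK_API_TOKEN`. The requests are only constructed here, not sent.

```python
import json
import os
from typing import List, Mapping, Sequence
from urllib.request import Request

SLACK_POST_MESSAGE_URL = "https://slack.com/api/chat.postMessage"


def build_requests(
    channel: str,
    destinations: Sequence[str],
    text: str,
    env: Mapping[str, str] = os.environ,
) -> List[Request]:
    """Build (but do not send) one POST request per destination."""
    if channel == "webhook":
        # Assumption: the webhook receiver accepts a JSON body with a "text" field.
        body = json.dumps({"text": text}).encode("utf-8")
        headers = {"Content-Type": "application/json"}
        return [Request(url, data=body, headers=headers, method="POST") for url in destinations]
    if channel == "slack":
        token = env.get("WATCHER_SLACK_API_TOKEN")
        if not token:
            raise RuntimeError("WATCHER_SLACK_API_TOKEN must be set for Slack notifications")
        headers = {
            "Content-Type": "application/json; charset=utf-8",
            "Authorization": f"Bearer {token}",
        }
        return [
            Request(
                SLACK_POST_MESSAGE_URL,
                data=json.dumps({"channel": dest, "text": text}).encode("utf-8"),
                headers=headers,
                method="POST",
            )
            for dest in destinations
        ]
    raise ValueError(f"unsupported notification channel: {channel!r}")
```

Each returned request could then be passed to `urllib.request.urlopen`; keeping construction separate from sending makes the sketch easy to inspect without network access.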

Notification Types

The following notifications have been implemented (so far, only Kubernetes/Kueue Jobs are supported):

  • Job goes from being suspended to running: this implies that the workload has been admitted by Kueue
  • Job is completed: logs for all pods associated with the job are included in the message
  • Job has any failed pods: the message lists the pods and their failed containers with additional information
  • Workload preemption monitoring
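The notification types above amount to detecting transitions between two observations of the same job. The following Python sketch shows one way such transitions could be classified. It is not the watcher's implementation: `JobSnapshot`, `lifecycle_events`, and the event names are hypothetical, and treating "running job becomes suspended again" as a preemption is an assumption about how preemption surfaces on the job.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class JobSnapshot:
    """Simplified, hypothetical view of a job at one point in time."""
    suspended: bool   # mirrors the job's spec.suspend field
    complete: bool    # True once the job has finished successfully
    failed_pods: int  # number of pods that have failed so far


def lifecycle_events(old: JobSnapshot, new: JobSnapshot) -> List[str]:
    """Compare two snapshots and list the lifecycle events between them."""
    events = []
    if old.suspended and not new.suspended:
        # Suspended -> running: the workload has been admitted.
        events.append("started")
    if not old.suspended and new.suspended and not new.complete:
        # Assumption: a running job that is suspended again was preempted.
        events.append("preempted")
    if not old.complete and new.complete:
        events.append("completed")
    if new.failed_pods > old.failed_pods:
        events.append("pods_failed")
    return events
```

For example, comparing a suspended snapshot with a later running one yields `["started"]`, while comparing two identical snapshots yields an empty list, so no notification would be sent.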

Usage Example

The following job options set up Slack notifications for lifecycle events in a job:

@job(
    options=JobOptions(
        ...,
        labels={
            "x-jobby.io/notify-channel": "slack",
            "x-jobby.io/slack-channel-ids": "mlops-test",
        },
    )
)
def my_job(): ...

Deployment

See watcher/README.md for details.

@AdrianoKF AdrianoKF self-assigned this Jul 16, 2024
@AdrianoKF AdrianoKF changed the title from "feat(watcher): Add rudimentary notification tooling" to "Job lifecycle monitoring and notifications" Jul 16, 2024
@AdrianoKF AdrianoKF changed the title from "Job lifecycle monitoring and notifications" to "Job lifecycle notifications" Jul 16, 2024
The `watcher` Go application uses the Kubernetes client API
to monitor jobs in the current namespace and send notifications
for major lifecycle events.

Jobs need to be annotated in order to be eligible for notifications:

Use `x-jobby.io/notify-channel: [slack|webhook]` to determine the
notification channel (HTTP webhook or Slack), and then control the
destinations with `x-jobby.io/slack-channel-ids` (comma-separated)
or `x-jobby.io/webhook-urls` (comma-separated) annotations.

If using the Slack notification channel, pass the Slack API token through
the `WATCHER_SLACK_API_TOKEN` env variable.

This allows setting metadata for jobs submitted to
Kubernetes (i.e., Kueue `Job`, `RayJob` resources).

In particular, the feature can be used to configure
lifecycle notifications for those jobs.

This commit changes the monitoring logic to observe
the lifecycle of Kueue workloads instead of raw K8s
Job resources.

This allows the code to gracefully handle other workload
types, such as Ray jobs, without any changes.

Also, the code has been decomposed into separate
packages for improved readability (although there are
still a lot of rough edges).

codecov bot commented Sep 11, 2024

Codecov Report

Attention: Patch coverage is 2.07612% with 283 lines in your changes missing coverage. Please review.

Project coverage is 51.64%. Comparing base (4b383b3) to head (957c71f).

✅ All tests successful. No failed tests found.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| watcher/pkg/util/kueue.go | 0.00% | 90 Missing ⚠️ |
| watcher/pkg/compose/compose.go | 0.00% | 71 Missing ⚠️ |
| watcher/main.go | 0.00% | 49 Missing ⚠️ |
| watcher/pkg/notify/notify.go | 0.00% | 24 Missing ⚠️ |
| watcher/pkg/util/k8s.go | 0.00% | 34 Missing ⚠️ |
| watcher/pkg/util/util.go | 37.50% | 10 Missing ⚠️ |
| watcher/pkg/util/slices.go | 0.00% | 5 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main      #30      +/-   ##
==========================================
- Coverage   56.42%   51.64%   -4.78%     
==========================================
  Files          61       68       +7     
  Lines        2997     3286     +289     
==========================================
+ Hits         1691     1697       +6     
- Misses       1306     1589     +283     
| Flag | Coverage Δ |
|---|---|
| backend | 88.27% <ø> (ø) |
| client | 47.92% <ø> (ø) |
| watcher | 2.07% <2.07%> (?) |


@AdrianoKF AdrianoKF linked an issue Sep 11, 2024 that may be closed by this pull request
@AdrianoKF AdrianoKF added the enhancement and backend labels Sep 12, 2024
Successfully merging this pull request may close these issues.

Implement Watcher Functionality