This repository has been archived by the owner on Aug 4, 2023. It is now read-only.

Omit DAGs that are known to fail from alerts #643

Merged

15 commits merged on Aug 10, 2022

Conversation

Contributor

@stacimc stacimc commented Aug 1, 2022

Fixes

Fixes WordPress/openverse#1609

Description

Sometimes DAGs with known errors are left active in order to continue consuming partial data. In this case an issue should be opened to address the error, but we may want to temporarily turn off Slack alerts for the DAG to avoid (1) cluttering the channel and (2) wasting team members' time investigating already-tracked issues.

This PR introduces a new Airflow variable, silenced_slack_alerts, used to configure this. Example configuration:
[Screenshot: example silenced_slack_alerts configuration in the Airflow Admin > Variables UI]

Each key is the dag_id of a DAG that should have Slack alerts turned off. The value is the URL of a GitHub issue tracking the known failure.
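Since the screenshot does not reproduce here, an illustrative value for the variable might look like the following (these are the same example entries used in the testing instructions below):

{
  "finnish_museums_workflow": "https://github.com/WordPress/openverse-catalog/issues/229",
  "cleveland_museum_workflow": "https://github.com/WordPress/openverse/issues/1609"
}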

This PR:

  • Adds a check to slack#send_alert to skip sending the notification if the DAG is in this dict (see the sketch after this list)
  • Adds a new check_silenced_dags DAG which runs @weekly and verifies that, for each silenced DAG, the associated GitHub issue is still open. If the issue has been closed, it sends a Slack alert reminding developers to turn alerts back on.
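A minimal sketch of the alert guard described in the first bullet. The function signature and variable handling here are assumptions for illustration, not the exact catalog code:

# Sketch only: the real send_alert in the catalog may have a different
# signature; this just illustrates the silencing check described above.
from airflow.models import Variable

def send_alert(message, dag_id=None, **kwargs):
    # Load the silenced-DAG mapping, defaulting to an empty dict if unset.
    silenced = Variable.get(
        "silenced_slack_alerts", default_var={}, deserialize_json=True
    )
    if dag_id in silenced:
        # The failure is already tracked in the linked GitHub issue,
        # so log and skip the Slack notification entirely.
        print(f"Skipping Slack alert for {dag_id}")
        return
    # ...otherwise fall through to the normal Slack message logic.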

[Screenshot: example Slack alert from the check_silenced_dags DAG, listing silenced DAGs whose tracking issues have been closed]

I can split the new DAG out into a separate PR if folks would prefer.

Testing Instructions

  • Create an Airflow variable named silenced_slack_alerts in the Admin > Variables UI. Silence alerts for a few DAGs, making sure to link at least one to an open issue, and at least one to a closed issue. For my tests I used this configuration, where finnish_museums_workflow links to a closed issue and cleveland_museum_workflow links to an open one (the value must be valid JSON, so it cannot contain comments):
{
  "finnish_museums_workflow": "https://github.com/WordPress/openverse-catalog/issues/229",
  "cleveland_museum_workflow": "https://github.com/WordPress/openverse/issues/1609"
}
  • For the DAGs you've configured above, force an error to be thrown. I did this by editing the code to manually raise an Exception somewhere in the pull_data task.
  • Make sure you have slack_message_override set to true in your Airflow variables, to force Slack messages to be sent in your local environment.
  • Run each of the configured DAGs. You should see the error, but no Slack message should be sent. In the logs of the failed task, you should see something like "Skipping Slack alert for finnish_museums_workflow". This should be true regardless of whether the associated issue is open or closed.

Now test the check_silenced_dags DAG:

  • Run the DAG. You should see a Slack alert like the screenshot above, listing all of the DAGs linked to closed GitHub issues.
  • Edit the silenced_slack_alerts dict to remove any DAGs associated with closed GitHub issues, and run the DAG again. You should see no Slack alert. (A rough sketch of the issue-state lookup involved follows below.)
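The issue-state lookup that check_silenced_dags performs against GitHub could look roughly like the following. This is a simplified sketch: the URL parsing and auth handling are assumptions, not the DAG's actual code.

# Sketch only: determine whether a GitHub issue is still open.
import requests

def issue_is_open(issue_url: str, github_pat: str) -> bool:
    # e.g. https://github.com/WordPress/openverse/issues/1609
    #   -> https://api.github.com/repos/WordPress/openverse/issues/1609
    owner, repo, _, issue_number = issue_url.split("/")[-4:]
    response = requests.get(
        f"https://api.github.com/repos/{owner}/{repo}/issues/{issue_number}",
        headers={"Authorization": f"token {github_pat}"},
    )
    response.raise_for_status()
    return response.json()["state"] == "open"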

Checklist

  • My pull request has a descriptive title (not a vague title like Update index.md).
  • My pull request targets the default branch of the repository (main) or a parent feature branch.
  • My commit messages follow best practices.
  • My code follows the established code style of the repository.
  • I added or updated tests for the changes I made (if applicable).
  • I added or updated documentation (if applicable).
  • I tried running the project locally and verified that there are no visible errors.

Developer Certificate of Origin

Version 1.1

Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129

Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.


Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
    have the right to submit it under the open source license
    indicated in the file; or

(b) The contribution is based upon previous work that, to the best
    of my knowledge, is covered under an appropriate open source
    license and I have the right under that license to submit that
    work with modifications, whether created in whole or in part
    by me, under the same open source license (unless I am
    permitted to submit under a different license), as indicated
    in the file; or

(c) The contribution was provided directly to me by some other
    person who certified (a), (b) or (c) and I have not modified
    it.

(d) I understand and agree that this project and the contribution
    are public and that a record of the contribution (including all
    personal information I submit with it, including my sign-off) is
    maintained indefinitely and may be redistributed consistent with
    this project or the open source license(s) involved.

@stacimc stacimc added the 🟨 priority: medium, 🌟 goal: addition, and 💻 aspect: code labels on Aug 1, 2022
@stacimc stacimc requested a review from a team as a code owner August 1, 2022 21:55
@stacimc stacimc self-assigned this Aug 1, 2022
Contributor

@sarayourfriend sarayourfriend left a comment


This LGTM. I still need to test this locally, but one question that comes to mind immediately is whether DAGs that need to silence errors could export a predicate that is passed the error and decides whether to alert or not. I'm not sure how this would work with the issue checking DAG (which is cool, and a nice accountability tool) but it would allow silencing some errors from Slack without silencing all errors. Silencing all errors raised in a DAG seems like it might work against the goal of stabilizing DAGs and making DAG alerts more meaningful.
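Purely to illustrate the suggestion (nothing like this exists in the PR), such a per-DAG predicate might be shaped like:

# Illustrative only: a hypothetical per-DAG predicate, as suggested above,
# that silences one known error while still alerting on anything unexpected.
class KnownPartialDataError(Exception):
    """Hypothetical, already-tracked failure used only for this example."""

def should_alert(exception: Exception) -> bool:
    # Silence only the tracked failure; every other error still alerts.
    return not isinstance(exception, KnownPartialDataError)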

GITHUB_PAT = Variable.get("GITHUB_API_KEY", default_var="not_set")


dag = DAG(
Contributor


I know I keep saying this so apologies if this comes across as pushy, but we could definitely use the TaskFlow API for this DAG! 😄
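For context, a TaskFlow-style version of this DAG might be shaped roughly like the following. This is a sketch assuming Airflow 2's decorator API; the task name and arguments are illustrative, not the PR's code:

# Sketch only: roughly how the DAG could be expressed with the TaskFlow API.
from datetime import datetime

from airflow.decorators import dag, task
from airflow.models import Variable

@dag(start_date=datetime(2022, 8, 1), schedule_interval="@weekly", catchup=False)
def check_silenced_dags():
    @task
    def check_and_alert():
        # Load the silenced DAGs, check each linked GitHub issue, and send a
        # Slack reminder for any whose issue has been closed.
        silenced = Variable.get(
            "silenced_slack_alerts", default_var={}, deserialize_json=True
        )
        ...

    check_and_alert()

check_silenced_dags_dag = check_silenced_dags()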

Contributor

@AetherUnbound AetherUnbound left a comment


This is fantastic!! I love the extra DAG as well, that's a great check for us to have for when we use this feature. The bonus of using GitHub issue links as a check is so smart.

One note though: I had to change my variable to cleveland_museum_workflow (from cleveland_museums_workflow), otherwise the workflow actually did report the error. Other than that, I was able to test this locally and everything worked as expected! 🚀 🤖

@stacimc
Contributor Author

stacimc commented Aug 9, 2022

one question that comes to mind immediately is whether DAGs that need to silence errors could export a predicate that is passed the error and decides whether to alert or not

Yeah, I definitely agree that we'll need to add this feature long term. We can expand the configuration variable and make it possible to support the issue-checking DAG as well. I'd prefer to add it as a fast follow; I'll get to it as soon as I can.

@sarayourfriend
Contributor

That sounds good. I'll finish reviewing this PR this afternoon.

Contributor

@sarayourfriend sarayourfriend left a comment


LGTM! Just one suggestion for how to improve the accountability around this process. This is great though. I love seeing our different systems come together like this.

Comment on lines +48 to +56
message = (
    "The following DAGs have Slack messages silenced, but the associated issue is"
    f" closed. Please remove them from the `{airflow_variable}` Airflow variable"
    " or assign a new issue."
)
for (dag, issue) in dags_to_reenable:
    message += f"\n - <{issue}|{dag}>"
send_alert(message, username="Silenced DAG Check", unfurl_links=False)
return message
Contributor


It'd be nice if this created a GitHub issue or something that could actually be assigned and tracked or have the maintainers pinged. Or maybe even just left a comment on the issue itself like "Please un-silence the DAG errors". Or maybe both, a new issue and a ping on the old, with the new issue just tracking the work of actually updating the prod configuration.

Just worried a Slack ping could easily get lost (especially if lots of people are on vacation or distracted by something else, for example) in a way that a GitHub issue won't, as it acts more like a formal "todo" item.
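To sketch what that could look like (purely illustrative; the repo, title, and token handling are assumptions, not anything implemented in this PR), opening a follow-up issue via the GitHub API is a single POST:

# Illustrative only: open a follow-up issue instead of (or alongside) the
# Slack ping, per the suggestion above.
import requests

def open_reminder_issue(dag_id: str, closed_issue_url: str, github_pat: str) -> str:
    response = requests.post(
        "https://api.github.com/repos/WordPress/openverse-catalog/issues",
        headers={"Authorization": f"token {github_pat}"},
        json={
            "title": f"Un-silence Slack alerts for {dag_id}",
            "body": (
                f"The tracking issue {closed_issue_url} has been closed. Please"
                " remove this DAG from the silenced_slack_alerts variable."
            ),
        },
    )
    response.raise_for_status()
    # Return the URL of the newly created issue for logging or the Slack message.
    return response.json()["html_url"]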

Contributor Author


Oh I love this idea! I'm going to create a follow-up issue and link back, this would be fantastic.

Contributor

@obulat obulat left a comment


This is a great improvement for our Slack channels :)
It would be nice to have a more visible record of silenced DAGs, somewhere in a GitHub issue or a file in the repository.

Labels
💻 aspect: code, 🌟 goal: addition, 🟨 priority: medium
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Flag DAGs that are expected to fail to omit them from alerts
4 participants