This repository has been archived by the owner on Aug 4, 2023. It is now read-only.

Omit DAGs that are known to fail from alerts #643

Merged

15 commits merged on Aug 10, 2022

Conversation

Contributor

@stacimc stacimc commented Aug 1, 2022

Fixes

Fixes WordPress/openverse#1609

Description

Sometimes DAGs with known errors are left active in order to continue consuming partial data. In this case an issue should be opened to address the error, but we may want to temporarily turn off Slack alerts for the DAG to avoid (1) cluttering the channel and (2) wasting team members' time investigating already-tracked issues.

This PR introduces a new Airflow variable, silenced_slack_alerts, used to configure this. Example configuration:
[Screenshot: example silenced_slack_alerts configuration in the Airflow Admin > Variables UI]

Each key is the dag_id of a DAG that should have Slack alerts turned off. The value is the URL of a GitHub issue tracking the known failure.
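Since the screenshot does not reproduce here, an illustrative value for the variable might look like the following (these are the same example entries used in the testing instructions below):

{
  "finnish_museums_workflow": "https://github.com/WordPress/openverse-catalog/issues/229",
  "cleveland_museum_workflow": "https://github.com/WordPress/openverse/issues/1609"
}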

This PR:

  • Adds a check to slack#send_alert to skip sending the notification if the DAG is in this dict (see the sketch after this list)
  • Adds a new check_silenced_dags DAG which runs @weekly and verifies that, for each silenced DAG, the associated GitHub issue is still open. If the issue has been closed, it sends a Slack alert reminding developers to turn alerts back on.
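A minimal sketch of the alert guard described in the first bullet. The function signature and variable handling here are assumptions for illustration, not the exact catalog code:

# Sketch only: the real send_alert in the catalog may have a different
# signature; this just illustrates the silencing check described above.
from airflow.models import Variable

def send_alert(message, dag_id=None, **kwargs):
    # Load the silenced-DAG mapping, defaulting to an empty dict if unset.
    silenced = Variable.get(
        "silenced_slack_alerts", default_var={}, deserialize_json=True
    )
    if dag_id in silenced:
        # The failure is already tracked in the linked GitHub issue,
        # so log and skip the Slack notification entirely.
        print(f"Skipping Slack alert for {dag_id}")
        return
    # ...otherwise fall through to the normal Slack message logic.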

[Screenshot: example Slack alert from the check_silenced_dags DAG, listing silenced DAGs whose tracking issues have been closed]

I can split the new DAG out into a separate PR if folks would prefer.

Testing Instructions

  • Create an Airflow variable named silenced_slack_alerts in the Admin > Variables UI. Silence alerts for a few DAGs, making sure to link at least one to an open issue, and at least one to a closed issue. For my tests I used this configuration, where finnish_museums_workflow links to a closed issue and cleveland_museum_workflow links to an open one (the value must be valid JSON, so it cannot contain comments):
{
  "finnish_museums_workflow": "https://github.com/WordPress/openverse-catalog/issues/229",
  "cleveland_museum_workflow": "https://github.com/WordPress/openverse/issues/1609"
}
  • For the DAGs you've configured above, force an error to be thrown. I did this by editing the code to manually raise an Exception somewhere in the pull_data task.
  • Make sure you have slack_message_override set to true in your Airflow variables, to force Slack messages to be sent in your local environment.
  • Run each of the configured DAGs. You should see the error, but no Slack message should be sent. In the logs of the failed task, you should see something like "Skipping Slack alert for finnish_museums_workflow". This should be true regardless of whether the associated issue is open or closed.

Now test the check_silenced_dags DAG:

  • Run the DAG. You should see a Slack alert like the screenshot above, listing all of the DAGs linked to closed GitHub issues.
  • Edit the silenced_slack_alerts dict to remove any DAGs associated with closed GitHub issues, and run the DAG again. You should see no Slack alert. (A rough sketch of the issue-state lookup involved follows below.)
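The issue-state lookup that check_silenced_dags performs against GitHub could look roughly like the following. This is a simplified sketch: the URL parsing and auth handling are assumptions, not the DAG's actual code.

# Sketch only: determine whether a GitHub issue is still open.
import requests

def issue_is_open(issue_url: str, github_pat: str) -> bool:
    # e.g. https://github.com/WordPress/openverse/issues/1609
    #   -> https://api.github.com/repos/WordPress/openverse/issues/1609
    owner, repo, _, issue_number = issue_url.split("/")[-4:]
    response = requests.get(
        f"https://api.github.com/repos/{owner}/{repo}/issues/{issue_number}",
        headers={"Authorization": f"token {github_pat}"},
    )
    response.raise_for_status()
    return response.json()["state"] == "open"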

Checklist

  • My pull request has a descriptive title (not a vague title like Update index.md).
  • My pull request targets the default branch of the repository (main) or a parent feature branch.
  • My commit messages follow best practices.
  • My code follows the established code style of the repository.
  • I added or updated tests for the changes I made (if applicable).
  • I added or updated documentation (if applicable).
  • I tried running the project locally and verified that there are no visible errors.

Developer Certificate of Origin

Version 1.1

Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129

Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.


Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
    have the right to submit it under the open source license
    indicated in the file; or

(b) The contribution is based upon previous work that, to the best
    of my knowledge, is covered under an appropriate open source
    license and I have the right under that license to submit that
    work with modifications, whether created in whole or in part
    by me, under the same open source license (unless I am
    permitted to submit under a different license), as indicated
    in the file; or

(c) The contribution was provided directly to me by some other
    person who certified (a), (b) or (c) and I have not modified
    it.

(d) I understand and agree that this project and the contribution
    are public and that a record of the contribution (including all
    personal information I submit with it, including my sign-off) is
    maintained indefinitely and may be redistributed consistent with
    this project or the open source license(s) involved.

@stacimc stacimc added the 🟨 priority: medium, 🌟 goal: addition, and 💻 aspect: code labels on Aug 1, 2022
@stacimc stacimc requested a review from a team as a code owner August 1, 2022 21:55
@stacimc stacimc self-assigned this Aug 1, 2022
Contributor

@sarayourfriend sarayourfriend left a comment


This LGTM. I still need to test this locally, but one question that comes to mind immediately is whether DAGs that need to silence errors could export a predicate that is passed the error and decides whether to alert or not. I'm not sure how this would work with the issue checking DAG (which is cool, and a nice accountability tool) but it would allow silencing some errors from Slack without silencing all errors. Silencing all errors raised in a DAG seems like it might work against the goal of stabilizing DAGs and making DAG alerts more meaningful.
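Purely to illustrate the suggestion (nothing like this exists in the PR), such a per-DAG predicate might be shaped like:

# Illustrative only: a hypothetical per-DAG predicate, as suggested above,
# that silences one known error while still alerting on anything unexpected.
class KnownPartialDataError(Exception):
    """Hypothetical, already-tracked failure used only for this example."""

def should_alert(exception: Exception) -> bool:
    # Silence only the tracked failure; every other error still alerts.
    return not isinstance(exception, KnownPartialDataError)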

GITHUB_PAT = Variable.get("GITHUB_API_KEY", default_var="not_set")


dag = DAG(
Contributor


I know I keep saying this so apologies if this comes across as pushy, but we could definitely use the TaskFlow API for this DAG! 😄
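For context, a TaskFlow-style version of this DAG might be shaped roughly like the following. This is a sketch assuming Airflow 2's decorator API; the task name and arguments are illustrative, not the PR's code:

# Sketch only: roughly how the DAG could be expressed with the TaskFlow API.
from datetime import datetime

from airflow.decorators import dag, task
from airflow.models import Variable

@dag(start_date=datetime(2022, 8, 1), schedule_interval="@weekly", catchup=False)
def check_silenced_dags():
    @task
    def check_and_alert():
        # Load the silenced DAGs, check each linked GitHub issue, and send a
        # Slack reminder for any whose issue has been closed.
        silenced = Variable.get(
            "silenced_slack_alerts", default_var={}, deserialize_json=True
        )
        ...

    check_and_alert()

check_silenced_dags_dag = check_silenced_dags()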

Contributor

@AetherUnbound AetherUnbound left a comment


This is fantastic!! I love the extra DAG as well, that's a great check for us to have for when we use this feature. The bonus of using GitHub issue links as a check is so smart.

One note though: I had to change my variable to cleveland_museum_workflow (from cleveland_museums_workflow), otherwise the workflow actually did report the error. Other than that, I was able to test this locally and everything worked as expected! 🚀 🤖

@stacimc
Contributor Author

stacimc commented Aug 9, 2022

one question that comes to mind immediately is whether DAGs that need to silence errors could export a predicate that is passed the error and decides whether to alert or not

Yeah, I definitely agree that we'll need to add this feature long term. We can expand the configuration variable and make it possible to support the issue-checking DAG as well. I'd prefer to add it as a fast follow; I'll get to it as soon as I can.

@sarayourfriend
Contributor

That sounds good. I'll finish reviewing this PR this afternoon.

Contributor

@sarayourfriend sarayourfriend left a comment


LGTM! Just one suggestion for how to improve the accountability around this process. This is great though. I love seeing our different systems come together like this.

Comment on lines +48 to +56
message = (
    "The following DAGs have Slack messages silenced, but the associated issue is"
    f" closed. Please remove them from the `{airflow_variable}` Airflow variable"
    " or assign a new issue."
)
for (dag, issue) in dags_to_reenable:
    message += f"\n - <{issue}|{dag}>"
send_alert(message, username="Silenced DAG Check", unfurl_links=False)
return message
Contributor


It'd be nice if this created a GitHub issue or something that could actually be assigned and tracked or have the maintainers pinged. Or maybe even just left a comment on the issue itself like "Please un-silence the DAG errors". Or maybe both, a new issue and a ping on the old, with the new issue just tracking the work of actually updating the prod configuration.

Just worried a Slack ping could easily get lost (especially if lots of people are on vacation or distracted by something else, for example) in a way that a GitHub issue won't, as it acts more like a formal "todo" item.
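To sketch what that could look like (purely illustrative; the repo, title, and token handling are assumptions, not anything implemented in this PR), opening a follow-up issue via the GitHub API is a single POST:

# Illustrative only: open a follow-up issue instead of (or alongside) the
# Slack ping, per the suggestion above.
import requests

def open_reminder_issue(dag_id: str, closed_issue_url: str, github_pat: str) -> str:
    response = requests.post(
        "https://api.github.com/repos/WordPress/openverse-catalog/issues",
        headers={"Authorization": f"token {github_pat}"},
        json={
            "title": f"Un-silence Slack alerts for {dag_id}",
            "body": (
                f"The tracking issue {closed_issue_url} has been closed. Please"
                " remove this DAG from the silenced_slack_alerts variable."
            ),
        },
    )
    response.raise_for_status()
    # Return the URL of the newly created issue for logging or the Slack message.
    return response.json()["html_url"]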

Contributor Author


Oh I love this idea! I'm going to create a follow-up issue and link back, this would be fantastic.

Contributor

@obulat obulat left a comment


This is a great improvement for our Slack channels :)
It would be nice to have a more visible record of silenced DAGs, somewhere in a GitHub issue or a file in the repository.

Labels
💻 aspect: code, 🌟 goal: addition, 🟨 priority: medium
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Flag DAGs that are expected to fail to omit them from alerts
4 participants