Omit DAGs that are known to fail from alerts #643
Conversation
This LGTM. I still need to test this locally, but one question that comes to mind immediately is whether DAGs that need to silence errors could export a predicate that is passed the error and decides whether to alert or not. I'm not sure how this would work with the issue checking DAG (which is cool, and a nice accountability tool) but it would allow silencing some errors from Slack without silencing all errors. Silencing all errors raised in a DAG seems like it might work against the goal of stabilizing DAGs and making DAG alerts more meaningful.
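Roughly what I have in mind, as a sketch (the predicate and helper names here are hypothetical, not anything in this PR):

```python
# Hypothetical: a DAG module exports a predicate that receives the error and
# returns True only when the alert should still be sent.
def should_alert(error: Exception) -> bool:
    # e.g. silence the known timeout failures but surface everything else
    return not isinstance(error, TimeoutError)


def maybe_send_alert(dag_id, error, alert_predicates, send_alert):
    """Consult the DAG's predicate (if it exports one) before posting to Slack."""
    predicate = alert_predicates.get(dag_id)
    if predicate is not None and not predicate(error):
        return  # this specific error is silenced
    send_alert(f"{dag_id} failed: {error}")
```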
openverse_catalog/dags/maintenance/check_silenced_dags/check_silenced_dags.py
openverse_catalog/dags/maintenance/check_silenced_dags/check_silenced_dags_dag.py
GITHUB_PAT = Variable.get("GITHUB_API_KEY", default_var="not_set")

dag = DAG(
I know I keep saying this so apologies if this comes across as pushy, but we could definitely use the TaskFlow API for this DAG! 😄
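For instance, a rough TaskFlow sketch of this DAG could look something like this (simplified; not the actual tasks in this PR):

```python
from datetime import datetime

from airflow.decorators import dag, task
from airflow.models import Variable


@dag(schedule_interval="@weekly", start_date=datetime(2022, 6, 1), catchup=False)
def check_silenced_dags():
    @task
    def get_silenced_dags() -> dict:
        # Maps dag_id -> GitHub issue URL tracking the known failure
        return Variable.get(
            "silenced_slack_alerts", default_var={}, deserialize_json=True
        )

    @task
    def alert_closed_issues(silenced: dict):
        # For each silenced DAG, check whether the linked issue is still open
        # and alert if it has been closed (details omitted in this sketch).
        ...

    alert_closed_issues(get_silenced_dags())


check_silenced_dags()
```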
This is fantastic!! I love the extra DAG as well, that's a great check for us to have for when we use this feature. The bonus of using GitHub issue links as a check is so smart.

One note though: I had to change my variable to `cleveland_museum_workflow` (from `cleveland_museums_workflow`), otherwise the workflow actually did report the error. Other than that, I was able to test this locally and everything worked as expected! 🚀 🤖
Yeah, I definitely agree that we'll need to add this feature long term. We can expand the configuration variable and make it possible to support the issue checking DAG as well. I'd prefer to add it as a fast follow; otherwise I'll get to this as soon as I can.
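One possible shape for the expanded variable, purely as a hypothetical illustration (the `silence_only` key and the issue URL are made up):

```json
{
    "finnish_museums_workflow": {
        "issue": "https://github.com/WordPress/openverse-catalog/issues/123",
        "silence_only": ["TimeoutError"]
    }
}
```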
That sounds good. I'll finish reviewing this PR this afternoon.
LGTM! Just one suggestion for how to improve the accountability around this process. This is great though. I love seeing our different systems come together like this.
message = (
    "The following DAGs have Slack messages silenced, but the associated issue is"
    f" closed. Please remove them from the `{airflow_variable}` Airflow variable"
    " or assign a new issue."
)
for (dag, issue) in dags_to_reenable:
    message += f"\n - <{issue}|{dag}>"
send_alert(message, username="Silenced DAG Check", unfurl_links=False)
return message
It'd be nice if this created a GitHub issue or something that could actually be assigned and tracked or have the maintainers pinged. Or maybe even just left a comment on the issue itself like "Please un-silence the DAG errors". Or maybe both, a new issue and a ping on the old, with the new issue just tracking the work of actually updating the prod configuration.
Just worried a slack ping could easily get lost (especially if lots of people are on vacation or distracted by something else, for example) in a way that a GitHub issue won't, as it acts more like a formal "todo" item.
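For example, a rough sketch of opening such an issue against the GitHub REST API (the repo, token handling, and wording here are placeholders):

```python
import requests

GITHUB_API = "https://api.github.com"


def open_reminder_issue(token: str, repo: str, dag_id: str, closed_issue_url: str) -> str:
    """Open a tracking issue asking maintainers to un-silence a DAG's alerts."""
    response = requests.post(
        f"{GITHUB_API}/repos/{repo}/issues",
        headers={"Authorization": f"token {token}"},
        json={
            "title": f"Un-silence Slack alerts for `{dag_id}`",
            "body": (
                f"The linked issue ({closed_issue_url}) has been closed. Please remove"
                f" `{dag_id}` from the `silenced_slack_alerts` Airflow variable."
            ),
        },
    )
    response.raise_for_status()
    # Return the new issue's URL so it can be included in the Slack ping
    return response.json()["html_url"]
```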
Oh I love this idea! I'm going to create a follow-up issue and link back, this would be fantastic.
This is a great improvement for our Slack channels :)
It would be nice to have a more visible record of silenced DAGs, somewhere in a GitHub issue or a file in the repository.
Fixes
Fixes WordPress/openverse#1609
Description
Sometimes DAGs with known errors are left active in order to continue consuming partial data. In this case an issue should be opened to address the error, but we may want to temporarily turn off Slack alerts for the DAG to prevent (1) cluttering the channel and (2) causing time to be wasted by team members investigating already tracked issues.
This PR introduces a new Airflow variable, `silenced_slack_alerts`, used to configure this. Each key is the `dag_id` of a DAG that should have Slack alerts turned off; the value is the URL of a GitHub issue tracking the known failure.
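Example configuration (the issue URLs here are placeholders):

```json
{
    "finnish_museums_workflow": "https://github.com/WordPress/openverse-catalog/issues/123",
    "cleveland_museum_workflow": "https://github.com/WordPress/openverse-catalog/issues/456"
}
```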
This PR:

- Updates `slack#send_alert` to skip sending the notification if the DAG is in this dict (a simplified sketch of this check is included below)
- Adds a new `check_silenced_dags` DAG which runs `@weekly` and verifies that, for each silenced DAG, the associated GitHub issue is still open. If the issue has been closed, it sends a Slack alert reminding developers to turn alerts back on.

I can split the new DAG out into a separate PR if folks would prefer.
Testing Instructions
Create the `silenced_slack_alerts` variable in the Admin > Variables UI. Silence alerts for a few DAGs, making sure to link at least one to an open issue, and at least one to a closed issue.
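For my tests I used a configuration along these lines (the issue URLs are placeholders; point one at an open issue and one at a closed issue):

```json
{
    "finnish_museums_workflow": "<URL of an open GitHub issue>",
    "cleveland_museum_workflow": "<URL of a closed GitHub issue>"
}
```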
Run one of the silenced DAGs so that it fails in the `pull_data` task. You will need `slack_message_override` set to `true` in your Airflow variable, to force Slack messages to send in your local environment. Instead of a Slack alert, you should see the message `Skipping Slack alert for finnish_museums_workflow`. This should be true regardless of whether the associated issue is open or closed.

Now test the `check_silenced_dags` DAG: run it, then update the `silenced_slack_alerts` dict to remove any DAGs associated to closed GitHub issues, and run the DAG again. You should see no Slack alert.

Checklist
- My pull request has a descriptive title (not a vague title like `Update index.md`).
- My pull request targets the default branch of the repository (`main`) or a parent feature branch.

Developer Certificate of Origin