[Alerting] Gracefully handle errors when retrieving task document on rule disable #118024

Closed
ymao1 opened this issue Nov 9, 2021 · 2 comments · Fixed by #118618
Assignees
Labels
bug Fixes for quality problems that affect the customer experience estimate:medium Medium Estimated Level of Effort Feature:Alerting/RulesFramework Issues related to the Alerting Rules Framework impact:high Addressing this issue will have a high level of impact on the quality/strength of our product. resilience Issues related to Platform resilience in terms of scale, performance & backwards compatibility Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) v7.16.0

Comments

@ymao1
Contributor

ymao1 commented Nov 9, 2021

When a rule is disabled, its active alerts are auto-resolved to avoid zombie active alerts. While writing a functional test for another issue, I noticed that if the task manager document for the rule is missing, the disable call throws a 404:

{
    "statusCode": 404,
    "error": "Not Found",
    "message": "Saved object [task/329798f0-b0b0-11ea-9510-fdf248d5f2a4] not found"
}

and does not disable the rule. We may want to handle this case better: if the task document is missing and we can't disable the rule, we also can't re-enable it to create a new task document.

We should decide what the expected behavior is. Maybe catch only 404s and continue disabling the rule, while rethrowing other types of errors? Or allow the rule to be disabled regardless (possibly leading to zombie active alerts)?
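A minimal sketch of the first option (catch only 404s, rethrow everything else). The names `TaskStore`, `StatusError`, `isNotFound`, and `alertsToRecover` are hypothetical stand-ins for illustration, not the actual Kibana or task manager API, and the calls are synchronous only for brevity:

```typescript
// Hypothetical error shape: any error carrying an HTTP-like status code.
interface StatusError extends Error {
  statusCode?: number;
}

// Hypothetical stand-in for the task store; not the Kibana API.
interface TaskStore {
  get(taskId: string): { alertIds: string[] };
}

function isNotFound(err: unknown): boolean {
  return err instanceof Error && (err as StatusError).statusCode === 404;
}

// Returns the alert ids to recover, or an empty list when the task document is missing.
function alertsToRecover(
  store: TaskStore,
  taskId: string,
  warn: (msg: string) => void
): string[] {
  try {
    return store.get(taskId).alertIds;
  } catch (err) {
    if (isNotFound(err)) {
      // Missing task document: nothing to recover, but disabling can proceed.
      warn(`task ${taskId} not found; skipping alert recovery`);
      return [];
    }
    // Any other failure is unexpected, so surface it to the caller.
    throw err;
  }
}
```

With this shape, a missing task document no longer blocks the disable, while errors such as a 500 from the store still fail the call.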

@ymao1 ymao1 added Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) Feature:Alerting/RulesFramework Issues related to the Alerting Rules Framework labels Nov 9, 2021
@elasticmachine
Contributor

Pinging @elastic/kibana-alerting-services (Team:Alerting Services)

@pmuellr
Member

pmuellr commented Nov 10, 2021

For reference, the auto-recover PR #111671 was merged in 7.16.0.

This feels pretty important, as we do see cases where rules reference task documents that don't exist, and our remediation is to disable and then re-enable the rule. So if we can't do that because of this problem, the only remediation for these cases would be to delete the rule and re-create it.

In terms of behaviour, I think we'll want disable to do as much as it possibly can: delete the task, invalidate the API key, recover instances, etc. When we run into issues, log and continue. For example, if we can't recover instances because we can't get the task document, log that and continue.
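A rough sketch of that "log and continue" shape, assuming each piece of the disable work can be expressed as an independent step. `DisableStep` and `runBestEffort` are hypothetical names used for illustration, not existing Kibana functions, and the steps are synchronous only to keep the example short:

```typescript
// Hypothetical step: a named unit of disable work that may throw.
interface DisableStep {
  name: string;
  run: () => void;
}

// Runs every step in order, logging failures instead of aborting.
// Returns the names of the steps that failed.
function runBestEffort(
  steps: DisableStep[],
  log: (msg: string) => void
): string[] {
  const failed: string[] = [];
  for (const step of steps) {
    try {
      step.run();
    } catch (err) {
      const reason = err instanceof Error ? err.message : String(err);
      log(`disable: step "${step.name}" failed: ${reason}`);
      failed.push(step.name);
    }
  }
  return failed;
}
```

The caller can pass steps such as recovering instances, invalidating the API key, and deleting the task. A failure in one step is logged and the remaining steps still run, so the rule ends up disabled even when the task document is gone.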

@gmmorris gmmorris added impact:high Addressing this issue will have a high level of impact on the quality/strength of our product. resilience Issues related to Platform resilience in terms of scale, performance & backwards compatibility bug Fixes for quality problems that affect the customer experience v7.16.0 labels Nov 15, 2021
@ymao1 ymao1 self-assigned this Nov 15, 2021
@kobelb kobelb added the needs-team Issues missing a team label label Jan 31, 2022
@botelastic botelastic bot removed the needs-team Issues missing a team label label Jan 31, 2022
6 participants