Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[alerting] stop deleting / re-creating tasks when rules are disabled / re-enabled #110096

Closed
pmuellr opened this issue Aug 25, 2021 · 6 comments · Fixed by #139826
Closed

[alerting] stop deleting / re-creating tasks when rules are disabled / re-enabled #110096

pmuellr opened this issue Aug 25, 2021 · 6 comments · Fixed by #139826
Assignees
Labels
estimate:needs-research Estimated as too large and requires research to break down into workable issues Feature:Alerting/RulesFramework Issues related to the Alerting Rules Framework Feature:Alerting Feature:Task Manager Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams)

Comments

@pmuellr
Copy link
Member

pmuellr commented Aug 25, 2021

Today, when a rule is disabled, we delete the associated task manager document; and then re-create it when the rule is re-enabled.

This has a couple of problems:

I think one thing we could do is add a new capability to task manager to "disable/enable" a task. From a brief look a while back, this seems pretty easy. The claim call would add a new filter to ignore task docs that have enabled: false on them. And we'd add that new field to the task SO, as well as some way to query/set it. Then when we disable a rule, we set the enabled:false flag on the task, and when we enable the rule, we set it to enabled:true.

One downside of this is keeping the stored state of the rule in the TM index, whether the rule is disabled or not. Today, you can save disk space by disabling rules; if we do something like what's suggested above, we wouldn't.

There are some older issues that were raised about "losing state when rules are disabled" that we should look into to see if there are other downsides with maintain this state longer than we do today.

@pmuellr pmuellr added Feature:Alerting Feature:Task Manager Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) Feature:Alerting/RulesFramework Issues related to the Alerting Rules Framework labels Aug 25, 2021
@elasticmachine
Copy link
Contributor

Pinging @elastic/kibana-alerting-services (Team:Alerting Services)

@mikecote
Copy link
Contributor

mikecote commented Sep 1, 2021

We should check with solution teams if this would become breaking behaviour.

@pmuellr
Copy link
Member Author

pmuellr commented Apr 7, 2022

I believe the problem "race conditions can cause > 1 task manager documents to exist for a single rule" has been resolved in this PR: #117397 (shipped in 8.1.0). Now that tasks use the same id as the rule, it's not possible for a rule to have multiple tasks associated with it, for newly created rules.

@ymao1 ymao1 self-assigned this Aug 29, 2022
@ymao1
Copy link
Contributor

ymao1 commented Aug 31, 2022

It is straightforward to add an enabled field to the task document and update the claim query to only claim enabled tasks. This seems like a positive addition to the task manager scheduling API.

For alerting rules specifically, if update enable/disable to just update this flag on the task document instead of deleting and recreating the task document, we will see this change in behavior:

Previously, if a rule had active alerts A,B,C and gets disabled, the task document containing state about these alerts is deleted.

When the rule gets re-enabled, if it generates alerts A and B, actions would be scheduled because A and B are considered new alerts, so regardless of the notification config (onActiveAlert, onActionGroupChange or onThrottleInterval) a notification would be sent out for A and B.

By not deleting the task document, when the rule gets re-enabled, if it generates alerts A and B, a notification will only be sent out if the notification config is set to onActiveAlert or if the throttle interval has passed for onThrottleInterval. Notifications will not be issued if the config is onActionGroupChange because the task state prior to disable is now accessible so the framework would not consider these alerts new or changed.

@mikecote @kobelb Would you consider this a breaking change?

ETA or do we consider the prior behavior to be a bug and this would fix the bug?

@kobelb
Copy link
Contributor

kobelb commented Aug 31, 2022

@mikecote @kobelb Would you consider this a breaking change?

I would treat this change in behavior as a bug-fix, and not a breaking change.

@mikecote
Copy link
Contributor

mikecote commented Sep 1, 2022

@mikecote @kobelb Would you consider this a breaking change?

I would treat this change in behavior as a bug-fix, and not a breaking change.

+1 from me

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
estimate:needs-research Estimated as too large and requires research to break down into workable issues Feature:Alerting/RulesFramework Issues related to the Alerting Rules Framework Feature:Alerting Feature:Task Manager Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams)
Projects
No open projects
Development

Successfully merging a pull request may close this issue.

5 participants