Active alerts do not recover after re-enabling a rule #110080
Comments
Pinging @elastic/kibana-alerting-services (Team:Alerting Services) |
Some random thoughts:
|
In the following #106292 (comment), I noted:
I don't think we have an issue open on this, but I think this would fix this problem, since the state would persist even if the alert is disabled. The main reason we want to not delete/re-create the tasks is that, due to the race conditions mentioned in that other issue, we sometimes create > 1 task associated with a rule, and we don't really have any way to clean these up. I've not seen this happen in practice, only with some very targeted ad hoc testing, but it can happen. I'll open a new issue JUST for that - keeping the task around even if the rule is disabled (since I don't see one open on that right now). update: issue opened: #110096 |
The thing I like about the disabled task approach is it then becomes a 1:1 relationship between rule and task. We create a task at the creation of the rule and then synchronize enable/disable accordingly. |
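To illustrate, a minimal sketch of what that 1:1 approach could look like, assuming a hypothetical task store with an `enabled` flag on the task document (these names are made up, not the actual task manager API):

```ts
// Hypothetical sketch: keep one task per rule for its whole lifetime and only
// flip an enabled flag, instead of deleting/re-creating the task document.
interface TaskDoc {
  id: string;
  ruleId: string;
  enabled: boolean;
  state: Record<string, unknown>; // alert-instance state survives a disable
}

interface TaskStore {
  create(task: TaskDoc): Promise<TaskDoc>;
  update(id: string, patch: Partial<TaskDoc>): Promise<TaskDoc>;
}

async function createRuleTask(store: TaskStore, ruleId: string): Promise<TaskDoc> {
  // Created once, when the rule is created; never deleted on disable.
  return store.create({ id: `task-${ruleId}`, ruleId, enabled: true, state: {} });
}

async function setRuleEnabled(store: TaskStore, ruleId: string, enabled: boolean): Promise<TaskDoc> {
  // Enable/disable only toggles the flag, so the state and the 1:1 mapping are preserved.
  return store.update(`task-${ruleId}`, { enabled });
}
```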
May relate to #110096 |
Relates to #105046 and may resolve it at the same time? |
Yes, this issue describes the same problem.
The task was created with the state of the alert instances.
The task was deleted along with the state of the alert instances.
Here we create a new task instance with its own state. But the Event Log (which gives us the rule summary with the alert instances) was not updated after the first task was removed. |
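To make that lost-state problem concrete, a rough sketch with hypothetical document shapes (not the real task manager documents):

```ts
// Hypothetical shapes, for illustration only.
interface RuleTaskState {
  alertInstances: Record<string, { state: Record<string, unknown> }>;
}

// Task that existed while the rule was enabled: it carries the active instances.
const taskBeforeDisable: { state: RuleTaskState } = {
  state: { alertInstances: { 'host-1': { state: { firstFiredAt: '2021-08-01T00:00:00Z' } } } },
};

// On disable the task is deleted; on re-enable a brand new task is created with
// empty state, so the rule executor no longer knows 'host-1' was ever active.
const taskAfterReEnable: { state: RuleTaskState } = {
  state: { alertInstances: {} },
};

// Meanwhile the event log still holds the events written while 'host-1' was
// active, so the rule details page keeps reporting it as active.
console.log(Object.keys(taskBeforeDisable.state.alertInstances)); // ['host-1']
console.log(Object.keys(taskAfterReEnable.state.alertInstances)); // []
```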
The event log uses ILM to delete documents older than a certain amount of time - would the bug come back when the event log doc gets removed? |
My read of Yuliia's comment was to "finish up" the instances at that point, by writing a recovered event for each of them. Just checked Yuliia's POC, and that's exactly what she did :-) Super interesting that it passed all FT (functional tests), but failed some jest tests!!!

Of course, ideally it would be nice to show the state at the time the rule was disabled, with enough UI to make it obvious it's disabled and the info isn't current. Beyond our own "lost state" issues with deleted task documents, rule types lose their state as well! I think we really want to maintain this state over a disable, and we have two choices - stop deleting task documents on disable, or move the state into the rule. And we should really stop deleting task documents because of the race conditions with rapid enable/disable calls. If we need to do something sooner, Yuliia's POC would probably be fine, but we might want to think about writing some ...

I wonder if we're going to run into any problems with existing rule types that depend on the state being cleared on disable? Guessing no. But I'm guessing we may want a new method for alerts to "reset" them, since our old way of doing this via disable/enable won't work any more; of course most of the time we do that to reset the API key, which it would also be nice to persist, but that's a different problem :-) |
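A minimal sketch of that short-term idea (write a recovered event for each active instance at disable time); the event logger interface and action name here are assumptions for illustration, not the actual event log client:

```ts
// Hypothetical sketch of the short-term fix: when a rule is disabled, write a
// "recovered" event for every alert instance that is currently active, so the
// event log (and the rule details page) stops reporting them as active.
interface EventLogger {
  logEvent(event: { action: string; ruleId: string; instanceId: string }): void;
}

function recoverActiveInstancesOnDisable(
  eventLogger: EventLogger,
  ruleId: string,
  activeInstanceIds: string[]
): void {
  for (const instanceId of activeInstanceIds) {
    // Assumed action name; the real event log has its own schema and action names.
    eventLogger.logEvent({ action: 'recovered-instance', ruleId, instanceId });
  }
}
```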
Yes, as @pmuellr mentioned above, my suggestion is an alternative "easy" solution before we decide to do something more complicated, like moving the state to the rule or no longer deleting tasks. |
ya, I think implementing the short-term fix sooner than the longer-term fix makes sense to me. I would only really be concerned about the short-term fix if we showed something for a disabled rule, but we don't today. The only effect would be that, when re-enabled, you won't see those already-recovered alerts as still active - or it seems like that should be the only effect. Also, forgot to mention: presumably this only happens if the ... |
Have we thought about rules that set up actions on recovery? What should happen to PagerDuty incidents when disabling a rule? What should happen to PagerDuty incidents when enabling a rule that then detects some recoveries? I'm OK expanding the scope if we feel another approach is a bit bigger in scope but is the ultimate solution. |
Ya, I think these recovered-because-the-rule-was-disabled alerts probably deserve something other than the usual recovered story, long-term. Does that mean a new "built-in" action group? Maybe it's a recovered state with an associated reason field. How do you even describe this situation to a human? Seems like it could get complicated, so we at least need to create an issue to research / track it. |
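For illustration, a purely hypothetical event shape for that "recovered with a reason" idea; none of these field names come from the actual event log schema:

```ts
// Hypothetical: a recovered event that carries the reason the instance stopped
// firing, so the UI and actions could distinguish a "real" recovery.
type RecoveryReason = 'condition-no-longer-met' | 'rule-disabled' | 'rule-updated';

interface RecoveredInstanceEvent {
  action: 'recovered-instance';
  ruleId: string;
  instanceId: string;
  reason: RecoveryReason;
}

const example: RecoveredInstanceEvent = {
  action: 'recovered-instance',
  ruleId: 'rule-1',
  instanceId: 'host-1',
  reason: 'rule-disabled',
};
```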
The current research raised a few product questions, which should shape the next steps for solving the current "bug" behavior:
@arisonl and @alexfrancoeur, we need your help defining the requirements mentioned above. |
Here is a visual capture of how this behaves right now. Some initial thoughts (that will need validation): Increasing the threshold, and in this way causing the condition to stop being met, is not a recovery. If you change the threshold, you have reset the rule. A recovery is defined against a certain threshold; if you change the threshold to a higher value, the alert has not recovered against the previous threshold - you just changed the rule. (That said, @gmmorris has noted that admins may disable/enable a rule in order to reset an API key, in which case the state should probably not be lost - #106876.) We have two things coming together in this issue: resetting the threshold value and disabling the rule, and we probably need to think about how they are reflected in the event log and the external system independently, as well as when they are combined (like in this bug). |
Good insight! That means we should be resetting the rule state on update, which we are NOT doing today! I think we'd consider only the rule-type specific fields to count as an update where "the rule state will be reset" - setting mute/throttle/name, etc., shouldn't reset the rule state. If we were to implement this today, we'd end up having to mark those alert instances as "recovered", but of course that doesn't feel quite right. Do we need a different "alert instance no longer firing" case? No longer firing because it has actually recovered, or no longer firing for other reasons (eg, the rule was updated). So either in the event log, a new peer of ... |
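As a sketch of the kind of check this could involve (made-up names, not the rules client API):

```ts
// Hypothetical sketch: decide whether an update should reset rule state by
// comparing only the rule-type specific params; mute/throttle/name changes
// should not count.
interface RuleUpdate {
  name?: string;
  throttle?: string | null;
  muteAll?: boolean;
  params?: Record<string, unknown>; // rule-type specific fields (e.g. a threshold)
}

function updateResetsRuleState(
  currentParams: Record<string, unknown>,
  update: RuleUpdate
): boolean {
  if (update.params === undefined) {
    return false; // only name/mute/throttle etc. changed
  }
  // Naive deep comparison for illustration; a real implementation would need
  // something more robust than JSON.stringify key ordering.
  return JSON.stringify(update.params) !== JSON.stringify(currentParams);
}
```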
@arisonl will investigate the product expectations further with O11y and Security Solution |
Re-looking at this after ~3 weeks ... thought I'd summarize the short-term options we could look at:
- emit "recovered" events for the still-active alert instances when the rule is disabled
- stop deleting the task document on disable (and re-creating it on enable), so the alert-instance state survives a disable/enable cycle
I'm not sure which of these is actually easier, or what other side effects they might cause. The second is probably harder, but presumably removes some existing code (deleting and re-creating the task document on disable/enable). It will also help with the race condition where, if you enable/disable very quickly (via the HTTP API or presumably the rules client), you can end up with multiple task documents for a single rule, which are impossible to delete. From the view of the event log, the "recovered" events would end up being written either when the rule is disabled (first option) or when it is re-enabled (second option).
The other aspect of this is that the instances WILL "recover", in the sense that they won't show up in the rule details page eventually, because the query to get the instances from the event log is time-based (60 * rule interval), so things become "eventually consistent" :-). Problem is, if the interval is 24 hr, that's 60 days (link to code below).

For alerts that would have recovered while the rule was disabled (had the rule not been disabled), the alerts will end up with a "recovered" message at the "wrong time" in both cases - either when the rule is disabled (first option) or when the rule is re-enabled (second option). Neither is correct, but the first feels somewhat better.

It's also pretty likely that we will need to add a new "reset" API if we go with the second option, since "disable then re-enable" is the general way to "reset" a rule today, and aspects of that "reset" will stop working if we stop deleting the task record - we'd need a new HTTP API and likely a UX gesture to explicitly "reset" the rule. I really want to do the second, because it fixes the race condition (which we've only seen in artificial situations), but it's feeling heavier than the first option.

Here's the code that uses the 60 * rule interval date range to read the event log: kibana/x-pack/plugins/alerting/server/rules_client/rules_client.ts, lines 502 to 532 at 04497da
|
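As a rough sketch of that lookback calculation (not the actual rules_client code linked above; the interval parser here is made up):

```ts
// Rough sketch of the 60 * rule interval lookback described above; the actual
// logic lives in x-pack/plugins/alerting/server/rules_client/rules_client.ts.
function intervalToMs(interval: string): number {
  // Hypothetical tiny parser for intervals like "5m", "1h", "24h".
  const match = /^(\d+)([smhd])$/.exec(interval);
  if (!match) throw new Error(`unparsable interval: ${interval}`);
  const value = Number(match[1]);
  const unitMs = { s: 1000, m: 60_000, h: 3_600_000, d: 86_400_000 }[match[2] as 's' | 'm' | 'h' | 'd'];
  return value * unitMs;
}

function eventLogDateStart(interval: string, now: Date = new Date()): Date {
  // 60 * interval: a 1m rule looks back 1 hour, a 24h rule looks back 60 days,
  // which is why "eventual consistency" can take a very long time here.
  return new Date(now.getTime() - 60 * intervalToMs(interval));
}

console.log(eventLogDateStart('24h').toISOString()); // roughly 60 days ago
```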
Should have pointed out - it's buried in other comments - that Yuliia has a POC for the first option of emitting recovered events on disable, in PR 111671. |
We decided to move forward with the recovery option for the short term, with a solution that will not trigger resolve-on-recovery for alerts of disabled rules that are set up to resolve incidents on recovery. We also discussed flagging that somehow in the event log, if possible (I understand from Patrick's comment that it currently is not possible, but it could be added either as an event or an attribute).
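A minimal sketch of that decision, assuming a hypothetical reason field like the one sketched earlier in this thread:

```ts
// Hypothetical: only schedule recovery actions (e.g. resolving a PagerDuty
// incident) for "real" recoveries, not for recovered events that were emitted
// only because the rule was disabled.
type RecoveryReason = 'condition-no-longer-met' | 'rule-disabled';

function shouldScheduleRecoveryActions(reason: RecoveryReason): boolean {
  return reason !== 'rule-disabled';
}

console.log(shouldScheduleRecoveryActions('condition-no-longer-met')); // true
console.log(shouldScheduleRecoveryActions('rule-disabled')); // false
```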
When a rule with active alerts is disabled, those alerts will not recover after the rule is re-enabled, even if they are no longer active.
Steps to reproduce:
1. Create a rule whose condition is currently met, so it reports an active alert.
2. Disable the rule.
3. Change the data so the condition is no longer met (so the alert would normally recover).
4. Re-enable the rule and let it run.
5. Notice the previously detected alert remains active and will not recover.
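For reference, a sketch of driving the disable/re-enable cycle from the steps above through the alerting HTTP API; the Kibana URL and rule id are placeholders, and authentication is omitted:

```ts
// Sketch of the disable -> re-enable cycle; KIBANA_URL and RULE_ID are placeholders.
const KIBANA_URL = 'http://localhost:5601';
const RULE_ID = '<rule-id-with-an-active-alert>';

async function post(path: string): Promise<void> {
  const res = await fetch(`${KIBANA_URL}${path}`, {
    method: 'POST',
    headers: { 'kbn-xsrf': 'true' }, // required by Kibana HTTP APIs
  });
  if (!res.ok) throw new Error(`${path} failed: ${res.status}`);
}

async function reproduce(): Promise<void> {
  await post(`/api/alerting/rule/${RULE_ID}/_disable`);
  // ...change the data so the rule's condition is no longer met...
  await post(`/api/alerting/rule/${RULE_ID}/_enable`);
  // The previously active alert still shows as active on the rule details page.
}

reproduce().catch(console.error);
```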