-
Notifications
You must be signed in to change notification settings - Fork 8.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Discuss] Alert only after the metrics threshold is met X times - Customer request #89152
Comments
We've discussed this in the past as a parallel of throttling. A "debounce" setting would indicate how long to allow the alert to be "active" before scheduling actions - it's a way of limiting actions being run at the "start" of an alert, where "throttling" is a way of limiting actions being run "during" an alert. Currently throttling is time-based, which can be a little clunky given the other intervals in play - mainly the alert interval. We've discussed (not sure we have an issue for though), the ability to indicate some of these "time-based" values as multipliers of the alert interval. So instead of setting the throttle to 10 minutes, you might set it to 2 intervals - if the alert interval is 5 minutes, it would end up throttling for 10 minutes. But if you change the alert interval to 10 minutes, it would be 20 minutes. If we want "interval"-based debounce (per the current issue title "Alert only after the metrics threshold is met X times"), then we'll need to add something like this. It would be possible for specific alerts to add this functionality themselves, by keeping some additional |
I think to handle this, we'll need to add some data to the instance state we persist for rules. Namely, the number of times the alert has "triggered" for this instance, since it started triggering. Might be useful to have this in the event log as well, but there's already an issue to add the "start time" of these "trigger spans" to the event log, and that's probably good enough for now. The generic alert params will need to be extended to handle this, presumably at the same level as the threshold value. And we'll probably want to have design take a look at where we should squeeze this new field into the existing generic alert params form. |
Exactly, I was hoping the throttle listener would be able to easily listen for debounce scenarios as well and a setting could just be simply added to the fly outs. Notify every (throttle) vs Notify after x consecutive violations (debounce). I see your point now, though, that we use a time value for throttling. Maybe the action scheduler could be debounce-wrapped on behalf of the executors so it'd be invisible but if someone set that setting, calls wouldn't fire right away? I guess the problem is that it isn't really a debounce (it isn't meant to trigger once the calls stop long enough), it's something slightly different. How can we help with these requirements? |
For "alerts as data", the rule executor needs to know what is an alert and what isn't. If that is encapsulated in the framework, there's a disconnect, so then either the framework needs to handle "alerts as data", or it needs to tell the executor whether a violation will result in an alert. Or, we accept that the "thing gets notified on" is not what we show in the alerts table and vice-versa. In the sense that we might display an alert if the framework has not notified, or we might recover the alert before the framework sends it recovery notification (maybe it never does, and we create a new alert, even though the user only got one notification). Do they want to apply this only before the alert is activated, or also when it should recover? And would they want different values for either? It also feels similar to a recovery threshold - should that be treated differently? Or, what happens when the rule runs at a different interval than its "resolution"? E.g., you have a rule that runs every 10s, but processes buckets of 1m - in order for it to give results about a bucket of time as quickly as possible. The rule will alert every 10s, but I imagine you'd want to count one violation per bucket, not per execution. |
Yeah, the biggest difference between this functionality and, say, throttling, is that throttling is only about notification frequency once an alert has been triggered, whereas this is about holding off on triggering an alert until a more complex set of criteria are met. I still think this is the framework's job and can't be left to rule types to figure out independently, or at least not without some kind of framework-provided helper(s) to consolidate the logic.
That sounds reasonable, actually, to let the framework own the criteria-matching around whether an alert is triggered, has changed its severity, or has recovered. The alerts-as-data helper(s) would then just be responsible for indexing that data accordingly, in that case.
Yes, recovery should probably be treated the same way, and it may need its own number. We're also discussing a separate threshold for recovery, as a separate topic (but related). I think we may even have a requirement for "x1 consecutive times at y1 threshold OR x2 consecutive times at y2 threshold" for both triggering and recovery. But possibly not at first. :)
If we wanted to support something this complex, I don't think it would necessarily be the same feature as the one we're describing here, or at least not available in the first version. I think it's valuable enough if it's just counting executions/violations, and not trying to derive if those violations came from distinct buckets. We can explore whether that's simple to detect but my sense is that it isn't. |
Yea, I think the requirements might make it too complicated to handle at the framework level (as in: outside of the rule executor). What I would like to see is a super-generic, super-flexible threshold rule type that can be configured this way. But that's just me and my wild imagination. |
I think I just want to avoid having many different rule types in the system defining this logic in wildly different ways. I imagine our alert creation flyouts eventually settling into a UX for setting this, something like "Wait for n" or something along those lines to represent the most common case for this functionality, and it would be good to have that common case accounted for without custom handling. I'm sure there will be rule types that do need to do custom handling beyond the common case, but my utopian mind hopes that most rule types might get away with the basic kind (similar to basic threshold handling). |
Reading through the thread it feels like this is more likely a framework level thing than specific rule type so relabelling appropriately. |
Closing as done with #173009 |
As part of alert flapping mitigation, customers have requested the following feature: Alert only after a metrics threshold is met X times (the request is made in the context of observability metrics alerting).
Let's use this issue to determine across the observability and the alerting services team what is the best path for this feature. Is it a metrics alert type feature or something we may provide on the framework level? Is it an alert feature or an action (perhaps advanced throttling) feature? Tagging both teams.
cc @mukeshelastic @sorantis @elastic/kibana-alerting-services
The text was updated successfully, but these errors were encountered: