[Discuss] Alert only after the metrics threshold is met X times - Customer request #89152

Closed · arisonl opened this issue Jan 25, 2021 · 9 comments
Labels: discuss, enhancement, estimate:needs-research, Feature:Alerting, Feature:Alerting/RulesFramework, NeededFor:logs-metrics-ui, Team:Observability, Team:ResponseOps

arisonl (Contributor) commented Jan 25, 2021

As part of alert flapping mitigation, customers have requested the following feature: Alert only after a metrics threshold is met X times (the request is made in the context of observability metrics alerting).

Let's use this issue to determine, across the Observability and Alerting Services teams, the best path for this feature. Is it a metrics alert type feature, or something we could provide at the framework level? Is it an alert feature or an action feature (perhaps advanced throttling)? Tagging both teams.

cc @mukeshelastic @sorantis @elastic/kibana-alerting-services

arisonl added the Team:Observability, Team:ResponseOps, and discuss labels on Jan 25, 2021
pmuellr (Member) commented Jan 25, 2021

We've discussed this in the past as a parallel of throttling. A "debounce" setting would indicate how long to allow the alert to be "active" before scheduling actions - it's a way of limiting actions being run at the "start" of an alert, whereas "throttling" is a way of limiting actions being run "during" an alert.

Currently throttling is time-based, which can be a little clunky given the other intervals in play - mainly the alert interval.

We've discussed (though I'm not sure we have an issue for it) the ability to express some of these "time-based" values as multipliers of the alert interval. So instead of setting the throttle to 10 minutes, you might set it to 2 intervals - if the alert interval is 5 minutes, it would end up throttling for 10 minutes, but if you change the alert interval to 10 minutes, it would throttle for 20 minutes. If we want "interval"-based debounce (per the current issue title, "Alert only after the metrics threshold is met X times"), then we'll need to add something like this.

It would be possible for specific alerts to add this functionality themselves, by keeping some additional state around in the alert to determine when to actually schedule actions, but seems like it would be better to have this in the framework, available to all alerts.
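
As a rough sketch of the interval-multiplier idea, assuming a hypothetical `Throttle` shape rather than the framework's actual config:

```ts
// Hypothetical throttle setting: either an absolute duration or a
// multiple of the rule's check interval (neither shape is the real API).
type Throttle =
  | { type: 'duration'; millis: number }
  | { type: 'intervals'; count: number };

// Resolve a throttle to milliseconds given the rule's check interval,
// so a "2 intervals" throttle tracks the interval if it is later changed.
function resolveThrottleMillis(throttle: Throttle, ruleIntervalMillis: number): number {
  return throttle.type === 'duration'
    ? throttle.millis
    : throttle.count * ruleIntervalMillis;
}

// Example: 2 intervals on a 5-minute rule -> 10 minutes; bump the rule
// interval to 10 minutes and the same setting throttles for 20 minutes.
const tenMinutesInMillis = resolveThrottleMillis({ type: 'intervals', count: 2 }, 5 * 60 * 1000);
```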

pmuellr (Member) commented Apr 29, 2021

I think to handle this, we'll need to add some data to the instance state we persist for rules. Namely, the number of times the alert has "triggered" for this instance, since it started triggering.

Might be useful to have this in the event log as well, but there's already an issue to add the "start time" of these "trigger spans" to the event log, and that's probably good enough for now.

The generic alert params will need to be extended to handle this, presumably at the same level as the threshold value.

And we'll probably want to have design take a look at where we should squeeze this new field into the existing generic alert params form.
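
A minimal sketch of what that extra instance state and the extended params could look like; `ThresholdRuleParams`, `InstanceState`, and `shouldFire` are illustrative names, not the actual alerting types:

```ts
// Hypothetical extension of the generic rule params: the threshold plus
// how many consecutive times it must be met before the alert fires.
interface ThresholdRuleParams {
  threshold: number;
  requiredConsecutiveMatches: number; // the "X times" from this issue
}

// Hypothetical per-instance state persisted between rule executions.
interface InstanceState {
  // Number of consecutive executions the threshold has been met for
  // this instance, since it started being met.
  consecutiveMatches: number;
}

// Update the persisted count and report whether the instance should
// actually schedule actions on this execution.
function shouldFire(
  metThisExecution: boolean,
  params: ThresholdRuleParams,
  state: InstanceState = { consecutiveMatches: 0 }
): { fire: boolean; nextState: InstanceState } {
  const consecutiveMatches = metThisExecution ? state.consecutiveMatches + 1 : 0;
  return {
    fire: consecutiveMatches >= params.requiredConsecutiveMatches,
    nextState: { consecutiveMatches },
  };
}
```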

jasonrhodes (Member) commented

> It would be possible for specific alerts to add this functionality themselves, by keeping some additional state around in the alert to determine when to actually schedule actions, but seems like it would be better to have this in the framework, available to all alerts.

Exactly. I was hoping the throttle listener would be able to listen for debounce scenarios as well, and a setting could simply be added to the flyouts: Notify every (throttle) vs. Notify after x consecutive violations (debounce).

I see your point now, though, that we use a time value for throttling. Maybe the action scheduler could be debounce-wrapped on behalf of the executors, so it'd be invisible, but if someone set that setting, calls wouldn't fire right away? I guess the problem is that it isn't really a debounce (it isn't meant to trigger only once the calls stop for long enough); it's something slightly different.
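
A rough sketch of what "debounce-wrapping" the action scheduler could look like, assuming a hypothetical `ActionScheduler` interface rather than the real framework API:

```ts
// Hypothetical interface that rule executors already call into; the real
// framework API differs, this only illustrates the wrapping idea.
interface ActionScheduler {
  scheduleActions(instanceId: string, context?: Record<string, unknown>): void;
}

// Wrap a scheduler so actions only fire once an instance has been
// scheduled on N consecutive executions. Executors keep calling
// scheduleActions() as usual; the wrapper holds back the early calls.
// (In practice the counts would live in persisted instance state, not
// in memory, since each execution is a separate task.)
function withConsecutiveGate(
  inner: ActionScheduler,
  requiredConsecutiveExecutions: number
): ActionScheduler & { onExecutionComplete(): void } {
  const counts = new Map<string, number>();
  const seenThisExecution = new Set<string>();

  return {
    scheduleActions(instanceId, context) {
      if (!seenThisExecution.has(instanceId)) {
        seenThisExecution.add(instanceId);
        counts.set(instanceId, (counts.get(instanceId) ?? 0) + 1);
      }
      if ((counts.get(instanceId) ?? 0) >= requiredConsecutiveExecutions) {
        inner.scheduleActions(instanceId, context);
      }
    },
    // Called after each rule execution: any instance that was not
    // scheduled this time loses its consecutive count.
    onExecutionComplete() {
      for (const id of Array.from(counts.keys())) {
        if (!seenThisExecution.has(id)) counts.delete(id);
      }
      seenThisExecution.clear();
    },
  };
}
```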

How can we help with these requirements?

dgieselaar (Member) commented

For "alerts as data", the rule executor needs to know what is an alert and what isn't. If that is encapsulated in the framework, there's a disconnect, so then either the framework needs to handle "alerts as data", or it needs to tell the executor whether a violation will result in an alert. Or, we accept that the "thing gets notified on" is not what we show in the alerts table and vice-versa. In the sense that we might display an alert if the framework has not notified, or we might recover the alert before the framework sends it recovery notification (maybe it never does, and we create a new alert, even though the user only got one notification).

Do they want to apply this only before the alert is activated, or also when it should recover? And would they want different values for either? It also feels similar to a recovery threshold - should that be treated differently? Or, what happens when the rule runs at a different interval than its "resolution"? E.g., you have a rule that runs every 10s, but processes buckets of 1m - in order for it to give results about a bucket of time as quickly as possible. The rule will alert every 10s, but I imagine you'd want to count one violation per bucket, not per execution.
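
To make the bucket-versus-execution distinction concrete, a small sketch with made-up state shapes:

```ts
// A rule that runs every 10s but evaluates 1-minute buckets sees the same
// bucket on several executions. Counting consecutive breached *buckets*
// rather than executions avoids inflating the count.
interface BucketBreachState {
  lastSeenBucketStart: number | null; // epoch millis of the last evaluated bucket
  consecutiveBreachedBuckets: number;
}

function recordBucketEvaluation(
  bucketStart: number,
  breached: boolean,
  state: BucketBreachState = { lastSeenBucketStart: null, consecutiveBreachedBuckets: 0 }
): BucketBreachState {
  if (state.lastSeenBucketStart === bucketStart) {
    // Same bucket as a previous execution: don't count it again.
    return state;
  }
  return {
    lastSeenBucketStart: bucketStart,
    consecutiveBreachedBuckets: breached ? state.consecutiveBreachedBuckets + 1 : 0,
  };
}
```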

jasonrhodes (Member) commented Apr 30, 2021

For "alerts as data", the rule executor needs to know what is an alert and what isn't.

Yeah, the biggest difference between this functionality and, say, throttling, is that throttling is only about notification frequency once an alert has been triggered, whereas this is about holding off on triggering an alert until a more complex set of criteria are met. I still think this is the framework's job and can't be left to rule types to figure out independently, or at least not without some kind of framework-provided helper(s) to consolidate the logic.

> so then either the framework needs to handle "alerts as data", or it needs to tell the executor whether a violation will result in an alert.

That sounds reasonable, actually, to let the framework own the criteria-matching around whether an alert is triggered, has changed its severity, or has recovered. The alerts-as-data helper(s) would then just be responsible for indexing that data accordingly.

> Do they want to apply this only before the alert is activated, or also when it should recover? And would they want different values for either? It also feels similar to a recovery threshold - should that be treated differently?

Yes, recovery should probably be treated the same way, and it may need its own number. We're also discussing a separate threshold for recovery as a separate (but related) topic. I think we may even have a requirement for "x1 consecutive times at y1 threshold OR x2 consecutive times at y2 threshold" for both triggering and recovery. But possibly not at first. :)

> Or, what happens when the rule runs at a different interval than its "resolution"? ... The rule will alert every 10s, but I imagine you'd want to count one violation per bucket, not per execution.

If we wanted to support something this complex, I don't think it would necessarily be the same feature as the one we're describing here, or at least it wouldn't be available in the first version. I think it's valuable enough if it's just counting executions/violations, without trying to derive whether those violations came from distinct buckets. We can explore whether that's simple to detect, but my sense is that it isn't.
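
As a hypothetical illustration of the "x1 consecutive times at y1 threshold OR x2 consecutive times at y2 threshold" requirement, with made-up type names rather than a proposed schema:

```ts
// One condition: the metric must breach `threshold` on `consecutiveCount`
// consecutive evaluations.
interface ConsecutiveThresholdCondition {
  threshold: number;
  consecutiveCount: number;
}

// Triggering and recovery each take a list of conditions OR-ed together,
// and can use different numbers (e.g. trigger after 3 breaches but only
// recover after 5 healthy evaluations).
interface FiringCriteria {
  trigger: ConsecutiveThresholdCondition[]; // any one satisfied condition triggers
  recovery: ConsecutiveThresholdCondition[]; // any one satisfied condition recovers
}

const example: FiringCriteria = {
  trigger: [
    { threshold: 80, consecutiveCount: 3 }, // x1 consecutive times at y1
    { threshold: 95, consecutiveCount: 1 }, // x2 consecutive times at y2
  ],
  recovery: [{ threshold: 80, consecutiveCount: 5 }],
};
```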

dgieselaar (Member) commented

Yeah, I think the requirements might make it too complicated to handle at the framework level (as in: outside of the rule executor). What I would like to see is a super-generic, super-flexible threshold rule type that can be configured this way. But that's just me and my wild imagination.

jasonrhodes (Member) commented

I think I just want to avoid having many different rule types in the system defining this logic in wildly different ways. I imagine our alert creation flyouts eventually settling into a UX for setting this, something like "Wait for n", to represent the most common case for this functionality, and it would be good to have that common case accounted for without custom handling. I'm sure there will be rule types that need custom handling beyond the common case, but my utopian mind hopes that most rule types can get away with the basic kind (similar to basic threshold handling).

gmmorris added the loe:needs-research label on Jul 14, 2021
gmmorris added the Feature:Alerting/RulesFramework label and removed the Feature:Alerting/RuleTypes label on Aug 13, 2021
gmmorris (Contributor) commented

Reading through the thread, it feels like this is more likely a framework-level thing than a specific rule type, so I'm relabelling appropriately.
I suspect this issue will have to wait for the results of the research into alerts-as-data becoming a framework-level concern itself; that way we can explore addressing this duplication of effort (and logic) across the rules data service and the alerting plugin.

gmmorris added the enhancement and estimate:needs-research labels on Aug 16, 2021
gmmorris removed the loe:needs-research label on Sep 2, 2021
kobelb added the needs-team label on Jan 31, 2022
botelastic bot removed the needs-team label on Jan 31, 2022
shanisagiv1 self-assigned this on Oct 18, 2022
mikecote (Contributor) commented

Closing as done with #173009
