Alert muting #46034
Pinging @elastic/kibana-stack-services
Some quick takes:
- Seems easy to allow an alert-level mute to never expire, but allowing an alert-instance mute to never expire seems potentially problematic, depending on how alert-type implementors use alert instances. E.g., I believe alert instance data is deleted if, on a turn of the alert-type function, the instance doesn't fire.
- Is there a difference between a mute with an expiration and a throttle? What about "snooze"?
Ya, I think we need both, long term. I could totally see not having alert-instance-level muting till a little later, if we at least had alert-level muting. Hmmm ... after writing down ^^^ I'm realizing we need a pretty crisp (and hopefully concise) human-readable description of all these things (alerts/alert-instances, muting, throttling, snoozing), describing how each works at a high level anyway. Maybe we do already?
I've designed it so the muted alert instance ids get saved with the alert in an array (so far in #43712). With this approach, instance-level mutes won't be lost when the instance data itself gets cleaned up, since they live on the alert rather than on the instance.
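To make this concrete, here's a minimal sketch of that design in TypeScript; the field and function names are assumptions for illustration, not the actual #43712 implementation:

```ts
// Hypothetical shape of the alert saved object; field names are illustrative.
interface Alert {
  id: string;
  muteAll: boolean;            // alert-level mute
  mutedInstanceIds: string[];  // instance-level mutes live on the alert,
                               // so they survive instance-state cleanup
}

function muteInstance(alert: Alert, instanceId: string): Alert {
  if (alert.mutedInstanceIds.includes(instanceId)) return alert;
  return { ...alert, mutedInstanceIds: [...alert.mutedInstanceIds, instanceId] };
}

function unmuteInstance(alert: Alert, instanceId: string): Alert {
  return {
    ...alert,
    mutedInstanceIds: alert.mutedInstanceIds.filter((id) => id !== instanceId),
  };
}
```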
I'm not sure on this one. So far the design would be to unmute via the UI; otherwise the array will keep growing. I'm curious what others think on this point.
Snooze and mute are about the same; from what I hear, snooze will just be "mute for x period". The difference between mute and throttle would be that mute applies even if the alert keeps firing: no actions run at all while muted, whereas throttle still executes actions, just no more often than the throttle interval allows.
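A rough sketch of how that distinction could play out in the action-scheduling path (the function and parameter names here are assumptions, not existing code):

```ts
// Decide whether to run actions for a firing alert instance.
// Mute suppresses actions entirely; throttle only limits their frequency.
function shouldExecuteActions(opts: {
  muteAll: boolean;
  mutedInstanceIds: string[];
  instanceId: string;
  lastActionAt?: Date;  // when actions last ran for this instance
  throttleMs?: number;  // minimum interval between action runs
  now?: Date;
}): boolean {
  const now = opts.now ?? new Date();
  // 1) Mute wins unconditionally, even if the throttle window has passed.
  if (opts.muteAll || opts.mutedInstanceIds.includes(opts.instanceId)) {
    return false;
  }
  // 2) Throttle: skip only while still inside the throttle window.
  if (opts.throttleMs != null && opts.lastActionAt != null) {
    return now.getTime() - opts.lastActionAt.getTime() >= opts.throttleMs;
  }
  return true;
}
```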
This was requested by the solution teams (I can't recall which one, maybe Stack Monitoring?) where they want to mute a problematic server out of the cluster until it's fixed.
We do! I will send you the link to the glossary.
I think Clint mentioned "mute with a timeout", which I think would make sense for cleaning up the garbage automatically. We'd stick a time in there with the alertInstanceId, and can clean them up the next time they're read/written. So, alert instances would only have a snooze (mute with time), not a mute.
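A sketch of that idea, assuming snoozes are stored as an instance-id-to-expiry map on the alert (all names here are hypothetical):

```ts
// instanceId -> snooze expiry, stored as an ISO timestamp.
type SnoozedInstances = Record<string, string>;

// Lazily drop expired snoozes; called whenever the map is read or written,
// so no background cleanup job is needed.
function pruneExpiredSnoozes(
  snoozed: SnoozedInstances,
  now: Date = new Date()
): SnoozedInstances {
  const pruned: SnoozedInstances = {};
  for (const [instanceId, expiresAt] of Object.entries(snoozed)) {
    if (new Date(expiresAt) > now) pruned[instanceId] = expiresAt;
  }
  return pruned;
}

function isSnoozed(
  snoozed: SnoozedInstances,
  instanceId: string,
  now: Date = new Date()
): boolean {
  const expiresAt = snoozed[instanceId];
  return expiresAt != null && new Date(expiresAt) > now;
}
```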
Ya, that's a great use case! Thx for fixing that for me :-)
I think muting is a state, and changing the state over time should be considered separately, as it fits in with other planned enhancements.
As for alert-level vs. alert-instance-level muting, my opinion is we need to support both; there are use cases for each.
Having an unbounded list of muted instances is a concern. If it's really problematic, we could place limits on the number of instances that can be muted. I think the expected usage would be that you have a handful of instances muted that you are working on; at a certain point, enough things are firing that you mute the entire alert. Through testing we could probably find a reasonable number to limit it to. If there are 1M instances in the muted list, it feels like the system isn't being used correctly.
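If a cap were imposed, it could be a simple guard at mute time; a sketch (the limit value is a placeholder, not a tested number):

```ts
const MAX_MUTED_INSTANCES = 1000; // placeholder; a real value would come from testing

function muteInstanceWithCap(mutedInstanceIds: string[], instanceId: string): string[] {
  if (mutedInstanceIds.includes(instanceId)) return mutedInstanceIds;
  if (mutedInstanceIds.length >= MAX_MUTED_INSTANCES) {
    // Fail loudly instead of silently culling entries, so the limit is visible.
    throw new Error(
      `Cannot mute more than ${MAX_MUTED_INSTANCES} instances; ` +
        'consider muting the whole alert instead.'
    );
  }
  return [...mutedInstanceIds, instanceId];
}
```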
I think that would be very confusing and difficult to diagnose when you crossed the line. I would much prefer to say alert instances can only be snoozed (muted for some fixed amount of time) and not indefinitely muted. There's some potential for abuse there, like setting a snooze on an alert instance for a century, but at least we'd have an explicit record that they did that (the date would end up in the alert document). Issues around instances culled because of a limit would be hard to diagnose; some clues would only show up in the Kibana logs (assuming hitting the limit would log some info).
I've added a 3rd question that goes hand in hand with the unbounded-growth concerns. I'm not sure how we can bind muting to the alert instance state if that state gets cleaned up after the instance stops firing. Unless we don't clean them up because they're muted.
If the system degrades or falls over when crossing the line, though, wouldn't that also be confusing and hard to diagnose? Snoozing might help but still leaves the system vulnerable, in my opinion. Perhaps we figure out what the limits are first, then figure out the best way to address them? Are we talking 100 instances, 1k, 100k? Whether it's changing the operation from mute to snooze, imposing some hard cap, or something else, it can likely be done at a later point rather than trying to build it in up front. Alert instances in general, and the state/operations on them, feel like an area where we'll hit practical limits pretty quickly, so maybe it warrants a separate discussion. Personally I'd much rather take an alerting system into production that has safeguards in place and enforces limits; GCP alerting, for example, has an alert-instance-equivalent cap: "Simultaneously open incidents per alerting policy" is limited to 5000.
I would say no. If you've set the state on an instance, it should stay until you change the state. For example, on a noisy threshold alert where the value fluctuates above and below the threshold, you'd want the muting state to persist because you know the alert instance is going to fire again.
Added a 4th question
I think what makes sense is that it clears all muting state, similar to what happens in a table when I select a couple of rows, then click select all, then unclick select all.
++, and maybe we consider calling it "mute-all" and "unmute-all" at the alert level? That clarifies the expected behaviour.
I like that. I can rename the APIs (and all terminology) to be more explicit too.
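For illustration, the renamed endpoints might look something like this (paths are a sketch, not the final API):

```ts
// Hypothetical route table for the renamed mute operations.
const muteRoutes = [
  { method: 'POST', path: '/api/alert/{id}/_mute_all' },
  { method: 'POST', path: '/api/alert/{id}/_unmute_all' },
  { method: 'POST', path: '/api/alert/{id}/alert_instance/{instanceId}/_mute' },
  { method: 'POST', path: '/api/alert/{id}/alert_instance/{instanceId}/_unmute' },
] as const;
```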
Muting is designed to opt out of executing the actions for a given alert or alert instance. There's been discussion about how exactly this should work, so I have created this discuss issue for tracking purposes.
Questions