[Monitoring][Alerting] Investigate a solution to avoid always creating new instances with replaceState #78724
Pinging @elastic/stack-monitoring (Team:Monitoring)
I'm curious to hear from the Alerting team about a scenario like this (which is what prompted the instance ID solution Igor mentions above): A user creates an alert that checks every 5 minutes but only notifies once every 24 hours. Inside the alert executor for that alert type, we query all hosts to see how many are over some threshold ("red"). If more than x are red, we fire the alert. However, the hosts that are red may change. If the threshold was, say, 10 hosts need to be red to fire the alert, at first you may find 12, then an hour later you may find 112. Firing the alert the 2nd time (an hour later) won't do anything because the "notify every" throttle is set to 24 hours. What do we think a user expects in this case? My instinct tells me that a user may just want a choice between:
@jasonrhodes It seems to me that users may want a combination of those two. They may care about host-specific state, but they don't want to be notified for every single host individually. It seems valid to me that they'd want a single alert which aggregates all affected hosts and re-notifies if any hosts are newly affected, or newly recovered. This is how we have the alert built out now.
@chrisronline I agree they might want a combination of both (or strictly one or the other). But, more to @jasonrhodes's point, why even have the throttle option in the UI if we're bypassing it anyway?
Some roadmap items could help in these scenarios:
We're still trying to figure out how the throttle functionality would work with those. The issue above is a good example where we may want to exclude throttled instances from a summary or something like that.
@mikecote What about:
You think it's a valid concern?
I believe calling `replaceState` … If ever creating many objects is a concern, you can always use the alert's state (which is always returned at the next execution) to persist an array of ids or something. That way you don't have state per alert instance but a single object per alert. (See …)
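For illustration, a minimal sketch of that suggestion (the `fetchFiringNodeUuids` helper and the loose typing are placeholders, not real APIs): whatever the executor returns is handed back as `state` on the next run, so a single object per alert can carry the firing ids and the deltas can be computed from it.

```ts
// Minimal sketch: keep the firing ids in the alert-level state instead of
// per-instance state. `fetchFiringNodeUuids` is a placeholder for the real
// monitoring query.
declare function fetchFiringNodeUuids(): Promise<string[]>;

async function executor({ services, state }: { services: any; state: Record<string, any> }) {
  const previouslyFiring: string[] = state.firingNodeUuids ?? [];
  const currentlyFiring = await fetchFiringNodeUuids();

  const newlyFiring = currentlyFiring.filter((id) => !previouslyFiring.includes(id));
  const recovered = previouslyFiring.filter((id) => !currentlyFiring.includes(id));

  if (newlyFiring.length > 0 || recovered.length > 0) {
    // Single, fixed instance id; notify only when the delta is non-empty.
    services
      .alertInstanceFactory('cpu_usage')
      .scheduleActions('default', { newlyFiring, recovered });
  }

  // Whatever is returned here comes back as `state` on the next execution,
  // so there is one state object per alert rather than one per instance id.
  return { firingNodeUuids: currentlyFiring };
}
```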
That's actually an interesting idea. Maybe we can just return the list of "firing" ids in the alert state.
I think we have concluded this discussion. Closing, but feel free to re-open. Thanks everyone!
Our alert instance ids (in Stack Monitoring) are currently dynamic (based on the firing nodes' uuids), e.g.:
kibana/x-pack/plugins/monitoring/server/alerts/cpu_usage_alert.ts, line 420 (at da134f3)
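Roughly, the pattern looks like this (a simplified sketch of what is described here, not the code at the permalink; `NodeStat`, `threshold`, and the untyped `services` are placeholders):

```ts
// Simplified sketch of the dynamic-instance-id pattern described above.
interface NodeStat {
  nodeUuid: string;
  cpuUsage: number;
}

function scheduleFiringInstance(services: any, stats: NodeStat[], threshold: number) {
  const firingNodeUuids = stats
    .filter((stat) => stat.cpuUsage > threshold)
    .map((stat) => stat.nodeUuid);

  if (firingNodeUuids.length === 0) {
    return;
  }

  // The instance id is built from the currently firing node uuids, so any
  // change in that set (or its order) yields a brand-new instance id, and
  // therefore a fresh notification despite the 1d throttle.
  services
    .alertInstanceFactory(firingNodeUuids.join(','))
    .replaceState({ firingNodeUuids })
    .scheduleActions('default', { count: firingNodeUuids.length });
}
```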
The reason for this is to always alert/notify on the current number of firing nodes and avoid (our default) throttle period of `1d`. This means that we will almost always get a new/unique instance id, since the node ids will have shifted and their order will be different. Ideally we would still want to get the previous state somehow and detect "resolved" alerts (no longer firing).

My concern is that since we're doing `replaceState` on every new/unique instance id, each one gets saved in the alert's saved object state, and when I do `GET /api/alerts/alert/{id}/state` I see all the created instances (and their states), which we will never use again (since we won't be able to find that state on the next round).

I attempted to solve this problem in a different PR by always using a fixed/static instance id (with `replaceState`), and only re-notifying any changes/deltas via a different alert instance id (without `replaceState`).
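As a rough sketch of that approach (illustrative names only, not the actual PR; where `previouslyFiring` comes from is exactly the open question below about getting the previous state):

```ts
// Rough sketch: one fixed instance id keeps the aggregated state, while
// deltas are re-notified through a changing instance id that carries no
// state of its own. `fetchFiringNodeUuids` is a placeholder query.
declare function fetchFiringNodeUuids(): Promise<string[]>;

async function runCheck(services: any, previouslyFiring: string[]) {
  const firing = await fetchFiringNodeUuids();

  // Fixed/static id: the only instance whose state we keep, so it can always
  // be found again on the next round.
  services.alertInstanceFactory('cpu_usage').replaceState({ firingNodeUuids: firing });

  // Changing id for the delta: it bypasses the 1d throttle so newly firing
  // nodes still notify, but replaceState is deliberately skipped here to
  // avoid piling up per-instance state that is never read again.
  const newlyFiring = firing.filter((id) => !previouslyFiring.includes(id));
  if (newlyFiring.length > 0) {
    services
      .alertInstanceFactory(`cpu_usage:${newlyFiring.join(',')}`)
      .scheduleActions('default', { newlyFiring });
  }
}
```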
Couple of questions given the context above:

1. Is it a problem that we keep creating new instances with `replaceState`? (currently in `7.9`)
2. Is there a way to get the previous state? (maybe the `InstanceState` passed to `execute()` can be useful somehow?)

cc: @chrisronline @jasonrhodes @elastic/kibana-alerting-services