
[Monitoring][Alerting] Investigate a solution to avoid always creating new instances with replaceState #78724

Closed
igoristic opened this issue Sep 29, 2020 · 9 comments


@igoristic
Contributor

Our alert instance ids (in Stack Monitoring) are currently dynamic (based on firing nodes uuids) eg:

const instanceId = `${this.type}:${cluster.clusterUuid}:${firingNodeUuids.join(',')}`;

The reason for this is to always alert/notify with the current set of firing nodes and to avoid (our default) throttle period of 1d. This means we will almost always get a new/unique instance id, since the node ids shift and their order differs between executions. Ideally we would still want to get the previous state somehow and detect "resolved" alerts (nodes no longer firing).

My concern is that since we call replaceState for every new/unique instance id, each one gets saved in the alert's saved-object state. When I do GET /api/alerts/alert/{id}/state I see all the created instances (and their states), which we will never use again, since we won't be able to find that state on the next round.

I attempted to solve this in a different PR by always using a fixed/static instance id (with replaceState), and re-notifying only the changes/deltas through a separate alert instance id (without replaceState), roughly as in the sketch below.
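A rough sketch of that two-id pattern (illustrative only; the function and parameter names are placeholders, not the actual PR code):

```ts
// Illustrative sketch of the fixed-id + delta-id pattern, not the real PR code.
// `services.alertInstanceFactory(id)` is the 7.x alerting API that returns an
// instance exposing replaceState() and scheduleActions().
function notifyFiringNodes(
  services: { alertInstanceFactory: (id: string) => any },
  alertType: string,
  clusterUuid: string,
  firingNodeUuids: string[],
  previousFiringNodeUuids: string[]
) {
  // One fixed/static instance id carries the persisted state across executions.
  services
    .alertInstanceFactory(`${alertType}:${clusterUuid}`)
    .replaceState({ firingNodeUuids });

  // A separate, delta-only instance id notifies about changes, without
  // replaceState, so no throwaway per-instance state accumulates.
  const newlyFiring = firingNodeUuids.filter(
    (uuid) => !previousFiringNodeUuids.includes(uuid)
  );
  if (newlyFiring.length > 0) {
    services
      .alertInstanceFactory(`${alertType}:${clusterUuid}:${newlyFiring.join(',')}`)
      .scheduleActions('default', { firingNodeUuids: newlyFiring });
  }
}
```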

A couple of questions given the context above:

  1. Is it okay to always create new/unique instance ids with replaceState? (currently in 7.9)
  2. If not, what would be the best way to solve this? (Maybe "Ability to fire actions when an alert instance is resolved" #49405, or the InstanceState passed to execute(), could be useful somehow?)

cc: @chrisronline @jasonrhodes @elastic/kibana-alerting-services

@elasticmachine
Contributor

Pinging @elastic/stack-monitoring (Team:Monitoring)

@jasonrhodes
Member

I'm curious to hear from the Alerting team about a scenario like this (which is what prompted the instance ID solution Igor mentions above):

A user creates an alert that checks every 5 minutes but only notifies once every 24 hours. Inside the alert executor for that alert type, we query all hosts to see how many are over some threshold ("red"). If more than x are red, we fire the alert.

However, the hosts that are red may change. If the threshold was, say, 10 red hosts, you may find 12 at first and 112 an hour later. Firing the alert the second time (an hour later) won't do anything, because the "notify every" throttle is set to 24 hours.

What do we think a user expects in this case?

My instinct tells me that a user may just want a choice between:

  1. Cluster state: tell me when the cluster is red, i.e. there are more than x hosts in a red state. I don't care how far over x that number is or which hosts they are; I just want to know the cluster is red. "Notify every" works as expected here, because I only get one of these alerts every 24 hours regardless of which hosts or how many are involved.
  2. Host state: I want to know the specific hosts that have gone down, how many, etc. In this case I want an alert for each host that enters that state, which will result in many more alerts if 100 hosts go down (but I want that), and I can then set a "check every"/"notify every" that corresponds to that granularity. Here, the planned "alerting workflow" changes will probably help a user handle this kind of setup a little better in the future.

@chrisronline
Contributor

@jasonrhodes It seems to me that users may want a combination of those two. They may care about host-specific state, but they don't want to be notified for every single host individually. It seems valid to me that they'd want a single alert which aggregates all affected hosts and re-notifies if any hosts are newly affected, or newly recovered. This is how we have the alert built out now.

@igoristic
Contributor Author

@chrisronline I agree they might want a combination of both (or strictly one or the other). But, more to @jasonrhodes's point: why even have the throttle option in the UI if we're bypassing it anyway?

@mikecote
Contributor

Some roadmap items could help in these scenarios.

We're still trying to figure out how the throttle functionality would work with those. The issue above is a good example of where we may want to exclude throttled instances from a summary or something like that.

@igoristic
Contributor Author

@mikecote What about:

> My concern is that since we call replaceState for every new/unique instance id, each one gets saved in the alert's saved-object state. When I do GET /api/alerts/alert/{id}/state I see all the created instances (and their states), which we will never use again, because we won't be able to find that state on the next round.

Do you think it's a valid concern?

@mikecote
Contributor

mikecote commented Oct 1, 2020

I believe calling replaceState without scheduleActions causes the data not to persist. It would be the same as not calling replaceState at all, so this could be improved.

If creating many objects is ever a concern, you can always use the alert's state, which is returned to you on the next execution, to persist an array of ids or something similar. That way you don't have state per alert instance but a single state object per alert. (See the lastChecked example here: https://github.com/elastic/kibana/blob/master/x-pack/plugins/alerts/README.md#example)
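A minimal sketch of that alert-level state mechanism, loosely echoing the README's lastChecked example (the state shape here is an assumption):

```ts
// The executor's return value is persisted as the alert-level state and is
// passed back in as `state` on the next execution.
async function executor({ state }: { state: { lastChecked?: number } }) {
  const previousCheck = state.lastChecked; // undefined on the first run
  // ...query, schedule actions, etc...
  return { lastChecked: Date.now() };
}
```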

@chrisronline
Contributor

That's actually an interesting idea. Maybe we can just return the list of "firing" ids from the executor, compare the current ones to the ones within state to determine which were added/removed, and fire events accordingly.
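A hedged sketch of that comparison (the query helper, state shape, and action-group names are placeholders; built-in "resolved" notifications are the subject of #49405):

```ts
// Hypothetical query helper standing in for the real Stack Monitoring query.
declare function fetchFiringNodeUuids(): Promise<string[]>;

interface AlertServices {
  alertInstanceFactory: (id: string) => {
    scheduleActions: (group: string, context?: object) => void;
  };
}

async function executor({
  services,
  state,
}: {
  services: AlertServices;
  state: { firingNodeUuids?: string[] };
}) {
  const previous = state.firingNodeUuids ?? [];
  const current = await fetchFiringNodeUuids();

  // Nodes that started firing since the last execution.
  const added = current.filter((id) => !previous.includes(id));
  // Nodes that stopped firing since the last execution.
  const removed = previous.filter((id) => !current.includes(id));

  for (const id of added) {
    services.alertInstanceFactory(id).scheduleActions('default', { nodeUuid: id });
  }
  for (const id of removed) {
    // 'resolved' is a placeholder group name; a real resolved action group
    // is what #49405 would add.
    services.alertInstanceFactory(id).scheduleActions('resolved', { nodeUuid: id });
  }

  // The returned object becomes `state` on the next execution.
  return { firingNodeUuids: current };
}
```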

@igoristic
Contributor Author

I think we have concluded this discussion. Closing, but feel free to re-open.

Thanks everyone!
