
[Monitoring][Alerting] Investigate a solution to avoid always creating new instances with replaceState #78724

Closed
igoristic opened this issue Sep 29, 2020 · 9 comments


@igoristic
Contributor

Our alert instance ids (in Stack Monitoring) are currently dynamic (based on firing nodes uuids) eg:

const instanceId = `${this.type}:${cluster.clusterUuid}:${firingNodeUuids.join(',')}`;

The reason for this is to always alert/notify with the current set of firing nodes and to avoid (our default) throttle period of 1d. This means we will almost always get a new/unique instance id, since the node ids shift and their order differs between executions. Ideally we would still want to get the previous state somehow and detect "resolved" alerts (nodes no longer firing).

My concern is that since we call replaceState for every new/unique instance id, each one gets saved in the alert's saved-object state. When I do GET /api/alerts/alert/{id}/state I see all the created instances (and their states), which we will never use again, since we won't be able to find that state on the next round.

I attempted to solve this in a different PR by always using a fixed/static instance id (with replaceState), and re-notifying only the changes/deltas through a separate alert instance id (without replaceState), roughly as in the sketch below.
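A rough sketch of that two-id pattern (illustrative only; the function and parameter names are placeholders, not the actual PR code):

```ts
// Illustrative sketch of the fixed-id + delta-id pattern, not the real PR code.
// `services.alertInstanceFactory(id)` is the 7.x alerting API that returns an
// instance exposing replaceState() and scheduleActions().
function notifyFiringNodes(
  services: { alertInstanceFactory: (id: string) => any },
  alertType: string,
  clusterUuid: string,
  firingNodeUuids: string[],
  previousFiringNodeUuids: string[]
) {
  // One fixed/static instance id carries the persisted state across executions.
  services
    .alertInstanceFactory(`${alertType}:${clusterUuid}`)
    .replaceState({ firingNodeUuids });

  // A separate, delta-only instance id notifies about changes, without
  // replaceState, so no throwaway per-instance state accumulates.
  const newlyFiring = firingNodeUuids.filter(
    (uuid) => !previousFiringNodeUuids.includes(uuid)
  );
  if (newlyFiring.length > 0) {
    services
      .alertInstanceFactory(`${alertType}:${clusterUuid}:${newlyFiring.join(',')}`)
      .scheduleActions('default', { firingNodeUuids: newlyFiring });
  }
}
```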

A couple of questions given the context above:

  1. Is it okay to always create new/unique instance ids with replaceState? (currently in 7.9)
  2. If not, what would be the best way to solve this? (Maybe "Ability to fire actions when an alert instance is resolved" #49405, or the InstanceState passed to execute(), could be useful somehow?)

cc: @chrisronline @jasonrhodes @elastic/kibana-alerting-services

@elasticmachine
Contributor

Pinging @elastic/stack-monitoring (Team:Monitoring)

@jasonrhodes
Member

I'm curious to hear from the Alerting team about a scenario like this (which is what prompted the instance ID solution Igor mentions above):

A user creates an alert that checks every 5 minutes but only notifies once every 24 hours. Inside the alert executor for that alert type, we query all hosts to see how many are over some threshold ("red"). If more than x are red, we fire the alert.

However, the hosts that are red may change. If the threshold was, say, 10 red hosts, you may find 12 at first and 112 an hour later. Firing the alert the second time (an hour later) won't do anything, because the "notify every" throttle is set to 24 hours.

What do we think a user expects in this case?

My instinct tells me that a user may just want a choice between:

  1. Cluster state: tell me when the cluster is red, i.e. there are more than x hosts in a red state. I don't care how far over x that number is or which hosts they are; I just want to know the cluster is red. "Notify every" works as expected here, because I only get one of these alerts every 24 hours regardless of which hosts or how many are involved.
  2. Host state: I want to know the specific hosts that have gone down, how many, etc. In this case I want an alert for each host that enters that state, which will result in many more alerts if 100 hosts go down (but I want that), and I can then set a "check every"/"notify every" that corresponds to that granularity. Here, the planned "alerting workflow" changes will probably help a user handle this kind of setup a little better in the future.

@chrisronline
Contributor

@jasonrhodes It seems to me that users may want a combination of those two. They may care about host-specific state, but they don't want to be notified for every single host individually. It seems valid to me that they'd want a single alert which aggregates all affected hosts and re-notifies if any hosts are newly affected, or newly recovered. This is how we have the alert built out now.

@igoristic
Contributor Author

@chrisronline I agree they might want a combination of both (or strictly one or the other). But, more to @jasonrhodes's point: why even have the throttle option in the UI if we're bypassing it anyway?

@mikecote
Contributor

Some roadmap items could help in these scenarios.

We're still trying to figure out how the throttle functionality would work with those. The issue above is a good example of where we may want to exclude throttled instances from a summary or something like that.

@igoristic
Contributor Author

@mikecote What about:

> My concern is that since we call replaceState for every new/unique instance id, each one gets saved in the alert's saved-object state. When I do GET /api/alerts/alert/{id}/state I see all the created instances (and their states), which we will never use again, because we won't be able to find that state on the next round.

Do you think it's a valid concern?

@mikecote
Contributor

mikecote commented Oct 1, 2020

I believe calling replaceState without scheduleActions causes the data not to persist. It would be the same as not calling replaceState at all, so this could be improved.

If creating many objects is ever a concern, you can always use the alert's state, which is returned to you on the next execution, to persist an array of ids or something similar. That way you don't have state per alert instance but a single state object per alert. (See the lastChecked example here: https://github.com/elastic/kibana/blob/master/x-pack/plugins/alerts/README.md#example)
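A minimal sketch of that alert-level state mechanism, loosely echoing the README's lastChecked example (the state shape here is an assumption):

```ts
// The executor's return value is persisted as the alert-level state and is
// passed back in as `state` on the next execution.
async function executor({ state }: { state: { lastChecked?: number } }) {
  const previousCheck = state.lastChecked; // undefined on the first run
  // ...query, schedule actions, etc...
  return { lastChecked: Date.now() };
}
```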

@chrisronline
Contributor

That's actually an interesting idea. Maybe we can just return the list of "firing" ids from the executor, compare the current ones to the ones within state to determine which were added/removed, and fire events accordingly.
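A hedged sketch of that comparison (the query helper, state shape, and action-group names are placeholders; built-in "resolved" notifications are the subject of #49405):

```ts
// Hypothetical query helper standing in for the real Stack Monitoring query.
declare function fetchFiringNodeUuids(): Promise<string[]>;

interface AlertServices {
  alertInstanceFactory: (id: string) => {
    scheduleActions: (group: string, context?: object) => void;
  };
}

async function executor({
  services,
  state,
}: {
  services: AlertServices;
  state: { firingNodeUuids?: string[] };
}) {
  const previous = state.firingNodeUuids ?? [];
  const current = await fetchFiringNodeUuids();

  // Nodes that started firing since the last execution.
  const added = current.filter((id) => !previous.includes(id));
  // Nodes that stopped firing since the last execution.
  const removed = previous.filter((id) => !current.includes(id));

  for (const id of added) {
    services.alertInstanceFactory(id).scheduleActions('default', { nodeUuid: id });
  }
  for (const id of removed) {
    // 'resolved' is a placeholder group name; a real resolved action group
    // is what #49405 would add.
    services.alertInstanceFactory(id).scheduleActions('resolved', { nodeUuid: id });
  }

  // The returned object becomes `state` on the next execution.
  return { firingNodeUuids: current };
}
```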

@igoristic
Contributor Author

I think we have concluded this discussion. Closing, but feel free to re-open.

Thanks everyone!
