[Alerting] Research alert summaries / bulk-able actions #68828
Comments
Pinging @elastic/uptime (Team:uptime)
Pinging @elastic/kibana-alerting-services (Team:Alerting Services)
My original thought on "batching" was to run a single action for all the instances that had actions scheduled for an action group, rather than the current behavior where we run one action for each of those instances. E.g., 1 email with a list of 100 hosts, vs. 100 emails with a single host each. I may be misinterpreting, but the "items to batch by" sounds like it could be "sub-batching", where you might want multiple actions invoked, partitioning the instances somehow. Action groups already handle this, although it may not be a good fit from a practical standpoint (and could be confusing to customers - lots more knobs and dials!). So I'm curious whether you're suggesting explicit sub-batching or not.
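To make the distinction concrete, here is a minimal TypeScript sketch of per-instance vs. batched action execution. The types and the `runAction` callback are invented for illustration and are not the actual alerting framework API.

```ts
interface AlertInstance {
  id: string; // e.g. a host name
  context: Record<string, unknown>;
}

// Hypothetical action runner, standing in for "execute the email/PagerDuty/etc. connector once".
type RunAction = (params: Record<string, unknown>) => Promise<void>;

// Current behavior: one action execution per instance (100 down hosts => 100 emails).
async function runPerInstance(instances: AlertInstance[], runAction: RunAction): Promise<void> {
  for (const instance of instances) {
    await runAction({ subject: `Host ${instance.id} is down`, hosts: [instance.id] });
  }
}

// Batched behavior: one action execution for the whole action group (100 down hosts => 1 email).
async function runBatched(instances: AlertInstance[], runAction: RunAction): Promise<void> {
  if (instances.length === 0) return;
  await runAction({
    subject: `${instances.length} hosts are down`,
    hosts: instances.map((i) => i.id),
  });
}
```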
Perhaps the batching choice should be defaulted by the action type itself. For email you probably want batching; for PagerDuty you probably want a single notification per instance.
I would really like to see this feature for all alert output types. I'd also argue that the default should be to not batch them. While 1000 emails when 1000 hosts are down isn't ideal, I think I would still prefer it over getting an email with 1000 hosts in it. Here are some more of my thoughts on why I can see users wanting to not have batched alerts for email and other similar output types:

Email may be used as input for another alerting system

Some alerting systems take email as the input for events. At a previous job our alerting system (think a home-grown PagerDuty) received its input only via emails and had no other API. We would have all monitoring systems send alerts to it, because being able to send an email was something that all of our tooling and environments could do. Funnily enough, PagerDuty also offers an Email Integration for this exact use case. Users running in airgapped environments might use similar notification strategies: since it is unlikely their internal systems are allowed to talk directly to an external service like PagerDuty, they might still be allowed to talk to a local email server.

Clear ownership and discussions

Imagine a team is using email as their sole alerting mechanism. In the event that multiple hosts are down, having them in separate emails is ideal. It will mean:

This is something we still do for our Slack (and email) based alerting that isn't yet hooked up to PagerDuty. Each Slack alert will have a thread created to discuss the alert, normally ending with a link to a GitHub issue where we can assign ownership and add more details.

Throttle periods

With batched emails you will only be sending them every X amount of time. Let's use 60 minutes as an example throttle period. In the situation that 1000 hosts go down in the span of 5 minutes, things get tricky. If the first host goes down and the alert fires, you won't know about the other 999 hosts for another 55 minutes.

Acknowledgements and muting

How do acknowledgements and muting of notifications work with batched alerts? If a single test environment host is down, can I mute that single host? Or will I only be able to mute all future alerts for all hosts? It's very normal for large environments to have muted hosts all the time. If someone were to mute that notification they may accidentally disable alerts for important production hosts.
This is a duplicate of #50258. Because there are more use cases here, I will close the other one.
Thanks for the detailed writeup @Crazybus, good points about email. I agree with you that batching should not be the default.
Linking this issue with alert digest / scheduled reports #50257. There may be some overlap.
This came up again as a potential solution for the large volume of alerts generated by geo containment alerts.

The problem

Unless throttled, geo containment alerts can produce a large number of alert instances for each interval. In local tests of buses moving in Manhattan, upwards of 3000 alert instances were generated for each 20-second interval; however, task manager was only able to process ~10 actions per 4 seconds, creating a growing backlog with each batch.

The conversation

The ability to execute actions in bulk is also interesting to us for this scenario. These actions might include indexing for later display in Maps, standard logging, email, etc. The idea mentioned above, "Email may be used as input for another alerting system", is relevant here as geo alerts evolve to handle more IoT use cases. Think of tracking an item in a building with the item pinging out its location as it moves from room to room (i.e. asset tracking). Other solutions to this issue focused more on ways we might reduce the number of alert instances or more optimal ways to index data. While this might work fine for some use cases, I still think a solution such as batched actions should be part of the picture.
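For the indexing case specifically, a bulk direction could look roughly like the sketch below, written against the Elasticsearch JavaScript client (assuming the 7.x-style `{ body }` bulk signature). The index name, document shape, and function are assumptions for illustration only; the point is that one bulk request replaces thousands of individually queued index actions.

```ts
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:9200' });

// Hypothetical document shape for a containment event; not an existing Kibana type.
interface ContainmentEvent {
  entityId: string;
  boundaryId: string;
  location: { lat: number; lon: number };
  '@timestamp': string;
}

// One bulk request for all instances generated in an interval, instead of
// ~3000 individual index actions flowing through task manager.
// The index name 'geo-containment-events' is made up for this sketch.
async function indexContainmentEvents(events: ContainmentEvent[]): Promise<void> {
  if (events.length === 0) return;
  const body = events.flatMap((event) => [{ index: { _index: 'geo-containment-events' } }, event]);
  await client.bulk({ body });
  // A real action would inspect the bulk response's `errors` flag and retry or
  // report per-item failures rather than fire-and-forget.
}
```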
Depending on how badly this is needed, and when alerting can deliver something, "bulk" processing could be done by the alert itself. It could be an option on the alert itself (a new param), so it could process alerts individually or in bulk. I don't think any alerts have done this yet, so we'd want to design this pretty carefully. And hopefully, once we do have the "bulk" feature, we'd want to evolve the alert to use it.

Another thought is the way the security alerts work: two levels. The first level generates data based on findings and writes that to one index. The second level reads that index to generate actual alerts that have customer-facing actions (e.g., email, Slack).
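A rough TypeScript sketch of the first idea, a per-alert "bulk" param handled inside the rule executor itself. The types and the `scheduleAction` call are simplified stand-ins invented for this sketch, not the real alerting framework interfaces.

```ts
// Simplified stand-ins; a real executor receives framework-provided services instead.
interface Finding {
  hostId: string;
  message: string;
}

interface RuleParams {
  processInBulk: boolean; // the hypothetical new param discussed above
}

interface ActionScheduler {
  scheduleAction(instanceId: string, context: Record<string, unknown>): void;
}

function executor(params: RuleParams, findings: Finding[], scheduler: ActionScheduler): void {
  if (params.processInBulk) {
    // One "summary" instance whose context carries every finding => one action execution.
    scheduler.scheduleAction('summary', {
      count: findings.length,
      hosts: findings.map((f) => f.hostId),
    });
  } else {
    // One instance (and therefore one action execution) per finding, as rules behave today.
    for (const finding of findings) {
      scheduler.scheduleAction(finding.hostId, { message: finding.message });
    }
  }
}
```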
I am adding this issue to the current iteration. We should have a brainstorming session as a team to ensure we aggregate the problems and requirements that this issue intends to solve prior to creating an RFC (it's been around for a while 😁).
We built our own 12h summary report of the
It would be nice to have the ability to do this natively within Kibana and have the output sent to a connector such as Slack or email.
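As an illustration of what a native version of such a summary could feed into a connector, here is a hedged sketch that formats per-rule alert counts into a single Slack webhook message. The counting step, the webhook URL, and the use of Node 18+'s global `fetch` are all assumptions; nothing here reflects how Kibana actually implements this.

```ts
// Hedged sketch: turn a precomputed map of rule name -> alert count for the
// last 12h into one Slack message. How the counts are gathered (e.g. an
// aggregation over an alert-history index) is left out and assumed.
async function postTwelveHourSummary(countsByRule: Record<string, number>): Promise<void> {
  const lines = Object.entries(countsByRule)
    .sort(([, a], [, b]) => b - a)
    .map(([rule, count]) => `• ${rule}: ${count} alerts in the last 12h`);

  // Placeholder webhook URL; requires Node 18+ for the global fetch.
  await fetch('https://hooks.slack.com/services/REPLACE/ME', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ text: lines.join('\n') }),
  });
}
```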
Since research for this is done and issues are created, closing this in favor of #143200.
Describe the feature:
It would be nice to support, in alerting, the ability to collapse multiple events into a single one on a per-action-type basis. For instance, imagine using Uptime and creating an alert to monitor 1000+ hosts. You may want email actions to send one email with a summary of everything being down (which is what we do now), but send each outage as a separate event to PagerDuty.
A complication here is that the alert message would likely need to be on a per-alert-type basis. There would also need to be a checkbox enabling batching / un-batching, and a high-level notion of discrete items to batch by.
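To make the request concrete, here is a hypothetical TypeScript shape for those per-action settings. None of these field names exist in Kibana; they only illustrate the checkbox and the "items to batch by" idea.

```ts
// Hypothetical per-action batching settings on a rule; invented for illustration.
interface ActionBatchingConfig {
  actionTypeId: string; // e.g. '.email' or '.pagerduty'
  batch: boolean;       // the checkbox: collapse all instances into one execution?
  batchBy?: string;     // the "discrete items to batch by", e.g. a field like 'host.name'
}

// One email summarizing everything, while PagerDuty still gets one event per outage.
const exampleRuleActions: ActionBatchingConfig[] = [
  { actionTypeId: '.email', batch: true, batchBy: 'host.name' },
  { actionTypeId: '.pagerduty', batch: false },
];
```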
Describe a specific use case for the feature:
This use case has been described by a few users, including internally by @jarpy and @Crazybus. It would be nice for services like PagerDuty to receive alerts separately. This still doesn't make sense for emails; no one wants to receive 1000+ emails.
CC @drewpost @pmuellr