
[Alerting] Research alert summaries / bulk-able actions #68828

Closed
andrewvc opened this issue Jun 10, 2020 · 14 comments
Labels: enhancement, estimate:needs-research, Feature:Actions, Feature:Alerting/RuleActions, Feature:Alerting, NeededFor:Maps, NeededFor:Uptime, Project:AlertingNotifyEfficiently, R&D, research, Team:ResponseOps, Team:Uptime - DEPRECATED

Comments

@andrewvc
Contributor

andrewvc commented Jun 10, 2020

Describe the feature:

It would be nice if alerting supported collapsing multiple events into a single one on a per-action-type basis. For instance, imagine using Uptime and creating an alert to monitor 1000+ hosts. You may want email actions to send one email summarizing everything that is down (which is what we do now), but send each outage to PagerDuty as a separate event.

A complication here is that the alert message would likely need to be defined on a per-alert-type basis. There would also need to be a checkbox enabling batching/un-batching, and a high-level notion of discrete items to batch by.
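As a rough sketch of the configuration shape this could imply, here is a hypothetical per-action batching flag in TypeScript (none of these field names exist in Kibana today; they only illustrate the idea):

// Hypothetical shape for a rule action that can opt in or out of batching.
// The field names are illustrative only, not an existing Kibana API.
interface BatchableRuleAction {
  actionTypeId: string; // e.g. '.email' or '.pagerduty'
  group: string; // action group, e.g. 'down'
  batch: boolean; // true: one invocation summarizing all alerts; false: one invocation per alert
  params: Record<string, unknown>;
}

const actions: BatchableRuleAction[] = [
  { actionTypeId: '.email', group: 'down', batch: true, params: { subject: 'Down hosts summary' } },
  { actionTypeId: '.pagerduty', group: 'down', batch: false, params: { severity: 'critical' } },
];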

Describe a specific use case for the feature:

This use case has been described by a few users, including internally by @jarpy and @Crazybus. It would be nice for services like PagerDuty to receive alerts separately. This still doesn't make sense for emails, though; no one wants to receive 1000+ emails.

CC @drewpost @pmuellr

@andrewvc andrewvc added the enhancement, Feature:Alerting, Team:Uptime - DEPRECATED, and :Alerting labels Jun 10, 2020
@elasticmachine
Contributor

Pinging @elastic/uptime (Team:uptime)

@pmuellr pmuellr added the Team:ResponseOps label and removed the :Alerting label Jun 10, 2020
@elasticmachine
Contributor

Pinging @elastic/kibana-alerting-services (Team:Alerting Services)

@pmuellr
Member

pmuellr commented Jun 10, 2020

a high level notion of discrete items to batch by

My original thought on "batching" was to run a single action for all the instances that had actions scheduled for an action group, rather than the current behavior where we run an action for each of those instances. E.g., 1 email with a list of 100 hosts, vs. 100 emails with a single host each.

I may be misinterpreting, but the "items to batch by" sounds like it could be "sub-batching", where you might want multiple actions invoked, partitioning the instances somehow.

Action groups already handle this, although they may not be a good fit from a practical standpoint (and could be confusing to customers - lots more knobs and dials!).

So I'm curious whether you're suggesting explicit sub-batching or not.
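To make the contrast concrete, a minimal sketch of the two strategies being discussed (the helper and instance shape below are stand-ins, not the framework's actual scheduling code):

// Illustrative only: executeAction and AlertInstance are assumptions, not the real alerting API.
interface AlertInstance { id: string }

declare function executeAction(actionId: string, params: Record<string, unknown>): Promise<void>;

// Current behavior: one action execution per alert instance (100 hosts => 100 emails).
async function runPerInstance(actionId: string, instances: AlertInstance[]) {
  for (const instance of instances) {
    await executeAction(actionId, { host: instance.id, message: `${instance.id} is down` });
  }
}

// Batched behavior being discussed: one execution for the whole action group (100 hosts => 1 email).
async function runBatched(actionId: string, instances: AlertInstance[]) {
  await executeAction(actionId, {
    hosts: instances.map((i) => i.id),
    message: `${instances.length} hosts are down`,
  });
}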

@pmuellr
Member

pmuellr commented Jun 10, 2020

Perhaps the batching choice should be defaulted by the action type itself. For email you probably want batched; for PagerDuty you probably want single.
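For example, a per-action-type default could be as simple as a lookup table like the sketch below (hypothetical, not part of the actions plugin):

// Hypothetical defaults: each action type declares whether it prefers batched delivery.
const defaultBatching: Record<string, boolean> = {
  '.email': true, // one summary email rather than one email per alert
  '.pagerduty': false, // one PagerDuty event per alert so each can be acknowledged and resolved on its own
  '.slack': false,
};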

@Crazybus

It would be nice for services like PagerDuty to receive alerts separately. This still doesn't make sense for emails, though; no one wants to receive 1000+ emails.

I would really like to see this feature for all alert output types. I'd also argue that the default should be not to batch them. While 1000 emails when 1000 hosts are down isn't ideal, I think I would still prefer it over getting a single email with 1000 hosts in it.

Here are some more of my thoughts on why I can see users wanting not to have batched alerts for email and other similar output types:

Email may be used as input for another alerting system

Some alerting systems take email as the input for events. At a previous job, our alerting system (think a home-grown PagerDuty) received its input only via emails and had no other API. We would have all monitoring systems send alerts to it, because being able to send an email was something that all of our tooling and environments could do.

Funnily enough, PagerDuty also offers an Email Integration for this exact use case.

Users running in air-gapped environments might use similar notification strategies. Since it is unlikely that their internal systems are allowed to talk directly to an external service like PagerDuty, they might instead be allowed to talk to a local email server.

Clear ownership and discussions

Imagine a team is using email as their sole alerting mechanism. In the event that multiple hosts are down, having them in separate emails is ideal. It means:

  • Discussions around a single host go into a dedicated email thread. E.g. "I'll take a look at this one!". If 10 hosts go down at the same time and you have 5 people all replying to the same email it gets messy real fast.
  • The history from previous downtime, along with past discussions, lives in the same thread
  • Recovery alerts can be sent for each host separately

This is something we still do for our Slack (and email) based alerting that isn't yet hooked up to PagerDuty. Each Slack alert will have a thread created to discuss the alert, normally ending with a link to a GitHub issue where we can assign ownership and add more details.

Throttle periods

With batched emails you will only send them at most once every X minutes. Let's use 60 minutes as an example throttle period. In the situation where 1000 hosts go down in the span of 5 minutes, things get tricky: if the first host goes down and the alert fires immediately, you won't know about the other 999 hosts for another 55 minutes.

Acknowledgements and muting

How do acknowledgments and muting of notifications work with batched alerts? If a single test-environment host is down, can I mute just that host? Or will I only be able to mute all future alerts for all hosts? It's very normal for large environments to have muted hosts all the time, and if someone were to mute that notification they might accidentally disable alerts for important production hosts.

@mikecote
Contributor

This is a duplicate of #50258. Because there are more use cases here, I will close the other one.

@andrewvc
Contributor Author

Thanks for the detailed writeup, @Crazybus. Good points about email; I agree with you that batching should not be the default.

@mikecote
Contributor

Linking this issue with alert digest / scheduled reports #50257. There may be some overlap.

@andrewvc andrewvc changed the title [Alerting] Bulk/Un-bulkable alerts [Alerting] Bulk/Un-bulkable alerts + standard fields Jul 15, 2020
@mikecote mikecote added the R&D label Sep 9, 2020
@mikecote mikecote changed the title [Alerting] Bulk/Un-bulkable alerts + standard fields [Alerting] Alert summaries / bulk-able actions Dec 16, 2020
@kindsun
Contributor

kindsun commented Dec 17, 2020

This came up again as a potential solution for the large volume of alerts generated by geo containment alerts.

The problem

Unless throttled, geo containment alerts can produce a large number of alert instances for each interval. In local tests of buses moving in Manhattan, upwards of 3000 alert instances were generated for each 20-second interval; however, the task manager was only able to process ~10 actions per 4 seconds, creating a growing backlog with each batch.
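Back-of-the-envelope math with those numbers (a sketch to show the scale of the backlog, not a measurement):

// ~3000 alert instances per 20s rule interval vs. ~10 actions drained every 4s.
const instancesPerInterval = 3000;
const intervalSeconds = 20;
const actionsDrainedPerSecond = 10 / 4; // 2.5 actions/s

const drainedPerInterval = actionsDrainedPerSecond * intervalSeconds; // 50 actions per interval
const backlogGrowthPerInterval = instancesPerInterval - drainedPerInterval; // ~2950 actions added every 20s
console.log({ drainedPerInterval, backlogGrowthPerInterval });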

The conversation

The ability to execute actions in bulk is also interesting to us for this scenario. These actions might include indexing for later display in Maps, standard logging, email, etc. The idea mentioned above, "Email may be used as input for another alerting system", is relevant here as geo alerts evolve to handle more IoT use cases. Think of tracking an item in a building, with the item pinging out its location as it moves from room to room (i.e., asset tracking).

Other solutions to this issue focused more on ways we might reduce the number of alert instances or on more optimal ways to index data. While this might work fine for some use cases, I still think a solution such as batched actions should be part of the picture.

@pmuellr
Member

pmuellr commented Jan 6, 2021

Depending on how badly this is needed, and on when alerting can deliver something, "bulk" processing could be done by the alert itself. It could be an option on the alert (a new param), so it could process alerts individually or in bulk.
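A sketch of what that rule-level option could look like, using an invented param name (this is not an existing rule parameter):

// Hypothetical: a rule param deciding whether the rule schedules one action per entity
// or a single summary action for everything found in this execution.
interface BulkableParams {
  bulkActions: boolean; // invented name, purely illustrative
}

interface DetectedEntity { id: string; detail: string }

function scheduleForEntities(
  params: BulkableParams,
  entities: DetectedEntity[],
  schedule: (instanceId: string, context: Record<string, unknown>) => void
) {
  if (params.bulkActions) {
    // One "summary" alert instance carrying all entities in its context => one action execution.
    schedule('summary', { count: entities.length, entities });
  } else {
    // One alert instance (and therefore one action) per detected entity.
    for (const entity of entities) {
      schedule(entity.id, { detail: entity.detail });
    }
  }
}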

I don't think any alerts have done this yet, so we'd want to design this pretty carefully. And hopefully, once we do have the "bulk" feature, we'd evolve the alert to use it.

Another thought is the way the security alerts work: two levels. The first level generates data based on findings and writes it to an index. The second level reads that index to generate the actual alerts that have customer-facing actions (e.g., email, Slack).
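Very roughly, that two-level flow looks like the sketch below (the index name and function shapes are assumptions, not the detection engine's actual code):

// Level 1 (sketch): persist findings to an index instead of invoking actions directly.
type IndexFn = (args: { index: string; document: unknown }) => Promise<void>;
async function writeFindings(indexDoc: IndexFn, findings: unknown[]) {
  for (const doc of findings) {
    await indexDoc({ index: 'example-findings-index', document: doc }); // illustrative index name
  }
}

// Level 2 (sketch): a separate rule reads that index and drives the customer-facing actions.
async function summarizeAndNotify(
  search: (index: string) => Promise<unknown[]>,
  notify: (msg: string) => Promise<void>
) {
  const docs = await search('example-findings-index');
  if (docs.length > 0) {
    await notify(`${docs.length} new findings since the last run`);
  }
}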

@gmmorris gmmorris added the Feature:Alerting/RuleActions label Jul 1, 2021
@gmmorris gmmorris added the loe:needs-research label Jul 14, 2021
@gmmorris gmmorris added the estimate:needs-research label Aug 18, 2021
@gmmorris gmmorris removed the loe:needs-research label Sep 2, 2021
@kobelb kobelb added the needs-team label Jan 31, 2022
@botelastic botelastic bot removed the needs-team label Jan 31, 2022
@kobelb kobelb added the needs-team label Jan 31, 2022
@botelastic botelastic bot removed the needs-team label Jan 31, 2022
@mikecote mikecote moved this from Awaiting Triage to Todo in AppEx: ResponseOps - Execution & Connectors Jun 27, 2022
@mikecote
Contributor

mikecote commented Jun 27, 2022

I am adding this issue to the current iteration. We should have a brainstorming session as a team to ensure we aggregate the problems and requirements that this issue intends to solve prior to creating an RFC (it's been around for a while 😁).

@mikecote mikecote changed the title [Alerting] Alert summaries / bulk-able actions [Alerting] Research alert summaries / bulk-able actions Jun 27, 2022
@aarju

aarju commented Jun 28, 2022

We built our own 12h summary report of the low-severity alerts, automated with our SOAR system. We have a scheduled script that runs every 12h with the following agg query. We then format the results and send them to our 'threat hunting' Slack channel so we have a summary of the low-severity alerts and can keep an eye on them for anything strange. We display each alert with the total number of times it triggered, broken down by the list of host.name values with the number of times the alert triggered for each host.

It would be nice to have the ability to do this natively within Kibana and have the output sent to a connector such as slack or email.

{
  "size": "1",
  "query": {
    "bool": {
      "filter": [
        {
          "match_phrase": {
            "signal.rule.severity": "low"
          }
        },
        {
          "range": {
            "@timestamp": {
              "gte": "now-12h"
            }
          }
        }
      ],
      "must_not": [
        {
          "match_phrase": {
            "signal.rule.building_block_type": "default"
          }
        }
      ]
    }
  },
  "aggs": {
    "rulename_agg": {
      "terms": {
        "field": "signal.rule.name",
        "order": {
          "_count": "desc"
        },
        "size": 500
      },
      "aggs": {
        "ruleid_agg": {
          "terms": {
            "field": "signal.rule.id",
            "order": {
              "_count": "desc"
            },
            "size": 500
          },
          "aggs": {
            "hostname_agg": {
              "terms": {
                "field": "host.name",
                "order": {
                  "_count": "desc"
                },
                "size": 500
              }
            }
          }
        }
      }
    }
  }
}
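For context, here is a sketch of how the aggregation output of the query above could be flattened into a Slack-style summary. The bucket shapes mirror the aggregation names in the query, while the message formatting itself is invented:

// Assumes the response shape from the query above (rulename_agg -> ruleid_agg -> hostname_agg).
interface HostBucket { key: string; doc_count: number }
interface RuleIdBucket { key: string; doc_count: number; hostname_agg: { buckets: HostBucket[] } }
interface RuleNameBucket { key: string; doc_count: number; ruleid_agg: { buckets: RuleIdBucket[] } }

function formatSummary(ruleBuckets: RuleNameBucket[]): string {
  const lines: string[] = [];
  for (const rule of ruleBuckets) {
    lines.push(`*${rule.key}* triggered ${rule.doc_count} times`);
    for (const ruleId of rule.ruleid_agg.buckets) {
      for (const host of ruleId.hostname_agg.buckets) {
        lines.push(`  - ${host.key}: ${host.doc_count}`);
      }
    }
  }
  return lines.join('\n');
}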

@ersin-erdal
Contributor

Since the research for this is done and the follow-up issues have been created, closing this in favor of #143200.

Repository owner moved this from In Review to Done in AppEx: ResponseOps - Execution & Connectors Oct 14, 2022
@zube zube bot removed the [zube]: Done label Jan 13, 2023