
[Alerting] write more alert data to event log when scheduling actions #63257

Closed · pmuellr opened this issue Apr 10, 2020 · 7 comments
Labels
discuss, estimate:needs-research, Feature:Alerting/RuleActions, Feature:Alerting, Feature:EventLog, Team:ResponseOps

Comments

pmuellr (Member) commented Apr 10, 2020

Currently, the data we write to the event log when an alert schedules actions is pretty spartan:

const event: IEvent = {
  event: { action: EVENT_LOG_ACTIONS.executeAction },
  kibana: {
    alerting: {
      instance_id: alertInstanceId,
    },
    namespace: spaceId,
    saved_objects: [
      { type: 'alert', id: alertId },
      { type: 'action', id: action.id },
    ],
  },
};

There seems to be some interest in capturing more information here, like instance state.

Two hurdles:

  • how to add it to the documents - we don't want to have to index this data for now, so it should not be indexed, only accessible via _source (see the sketch after this list)
  • security aspects - this data could be sensitive, so at a minimum we'd probably want alert types to opt in to adding alert-specific data to the documents
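As a rough sketch only - none of the new field names below are an agreed design, and `alertInstanceState` / `actionContext` are hypothetical variables - a richer event might look something like this:

// Sketch only: `state` and `context` are hypothetical additions under the
// custom kibana.alerting field; they would be stored in _source but not
// indexed, and only written when the alertType opts in.
const event: IEvent = {
  event: { action: EVENT_LOG_ACTIONS.executeAction },
  kibana: {
    alerting: {
      instance_id: alertInstanceId,
      state: alertInstanceState, // hypothetical: instance state at schedule time
      context: actionContext, // hypothetical: context passed to scheduleActions()
    },
    namespace: spaceId,
    saved_objects: [
      { type: 'alert', id: alertId },
      { type: 'action', id: action.id },
    ],
  },
};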
pmuellr added the Feature:Alerting and Team:ResponseOps labels on Apr 10, 2020
elasticmachine (Contributor) commented:

Pinging @elastic/kibana-alerting-services (Team:Alerting Services)

chrisronline (Contributor) commented:

Taken from #47713 (comment):

It'd be nice to have access to historical alert instance data. In Stack Monitoring, nearly every single page is governed by a time picker, and it'd be great for users to be able to go back in history and view alerts that were firing at that time. It'd also help us build out some kind of historical log of Stack Monitoring alerts.

The other thing that's important is not only showing "firing" alerts in the UI, but also showing "resolved" alerts, as long as the resolution of the alert occurred in the configured time period.

pmuellr (Member, Author) commented Apr 14, 2020

The other thing that's important is not only showing "firing" alerts in the UI, but also showing "resolved" alerts

We should be able to get this information now - we now write event documents for when an alert instance did not schedule actions on the previous turn, but did schedule actions on the current turn (this instance is now "active"). And also for the opposite - the alert instance did schedule actions on the previous turn, but did not schedule actions on the current turn (this instance is now "resolved").

How this will be rendered in the UI is not clear, but it should be possible to see the "starts" and "resolved" for a particular instance, and get the duration of how long it was active.
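For illustration (a minimal sketch, not an existing API - the event action names are placeholders for whatever the framework actually writes), the active duration could be derived by pairing those two events for an instance:

// Minimal sketch: given the event log documents already fetched for one alert
// instance, pair the "became active" and "resolved" events to get a duration.
// The action names 'new-instance' and 'resolved-instance' are placeholders.
interface InstanceEvent {
  '@timestamp': string;
  event: { action: string };
}

function activeDurationMs(events: InstanceEvent[]): number | undefined {
  const started = events.find((e) => e.event.action === 'new-instance');
  const resolved = events.find((e) => e.event.action === 'resolved-instance');
  if (!started || !resolved) {
    return undefined;
  }
  return Date.parse(resolved['@timestamp']) - Date.parse(started['@timestamp']);
}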

chrisronline (Contributor) commented:

For Stack Monitoring (and perhaps other solutions), we need a way to store opaque data into this stream (perhaps not even indexed). We are building a landing page that users will see once they click a deep-link from an alert email. Right now, this landing page only has the capability to view currently firing alerts (which is fine to start with), but we'd like to support the familiar time picker on the page and allow the user to go back in time to view alerts that fired in the past.

See the screenshots (two attached screenshots of the landing page, not reproduced here).

An example of the data structure used to power the above:

{
  "cpuUsage": 0,
  "nodeId": "-5XvWwllTW-_tnBZYVyWGw",
  "nodeName": "MonitoringClusterNode1",
  "cluster": {
    "clusterUuid": "1HeA6vq1RgypnZKc8UvK2g",
    "clusterName": "MonitoringCluster"
  },
  "ui": {
    "isFiring": true,
    "message": {
      "text": "Node MonitoringClusterNode1 is reporting cpu usage of 0% at #absolute. #start_linkPlease investigate.#end_link",
      "tokens": [
        {
          "startToken": "#absolute",
          "type": "time",
          "isAbsolute": true,
          "isRelative": false,
          "timestamp": 1587048504639
        },
        {
          "startToken": "#start_link",
          "endToken": "#end_link",
          "type": "link",
          "url": "/elasticsearch/nodes/-5XvWwllTW-_tnBZYVyWGw"
        }
      ]
    },
    "severity": 2001,
    "resolvedMS": 0,
    "triggeredMS": 1587048504639,
    "lastCheckedMS": 0
  }
}

From an API perspective, we are getting this data via:

const states = await alertsClient.getAlertState({ id });

It'd be great if this same API could accept a time range and return the same exact structure for matching entries in the event log for the time period.

pmuellr (Member, Author) commented Apr 17, 2020

It would be cool to be able to time-travel through alert states via

const states = await alertsClient.getAlertState({ id, dateStart, dateEnd });

I wasn't thinking we'd keep the state around for every alertType executor run - that seems like it would be very expensive in terms of document size. It does seem like we should allow an alertType to add extra data when action groups are scheduled via scheduleActions(), though. We specifically pass the "context" in that call, so we could store that. I'd think we could also arrange to get the instance state (it may not be directly accessible today) from that call.
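As a sketch of how that could look from the executor side (the executor-facing scheduleActions(actionGroup, context) call exists today; the framework capturing the context and state into the event log is the hypothetical part):

// Sketch of an alertType executor scheduling actions with context; the names
// and values here are illustrative. If the framework captured `context` (and
// the instance state) at this point, it could write them to the event log,
// opt-in and non-indexed.
async function executor({ services, state }: any) {
  const context = { cpuUsage: 97, nodeName: 'node-1' }; // illustrative values
  services
    .alertInstanceFactory('node-1-cpu')
    .replaceState({ lastCpuUsage: context.cpuUsage })
    .scheduleActions('default', context);
  return state;
}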

We'll need to make sure we can measure the size of these things, in aggregation, for cases where the index space usage is growing faster than expected. Perhaps customers should have a say in how much data we store?

Per the comments at the top ^^^, it seems like the alertType should opt in to this, via some new actionType flag(s).

In terms of where to store it in the event log - I'll poke around again, but AFAIK no existing property is a good fit here, so adding a new one to our custom kibana field is likely the right way to go. Name? Arghhh. I'm going to call it user_data for now; that's what we've called this gloop for decades. :-) It would be a new field with enabled: false, so it would be available via _source, but not indexed.
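For reference, a sketch of what that mapping addition could look like (`user_data` is just the working name from this comment, and exactly where it lands under the kibana field is not settled):

// Sketch of the mapping addition: an object mapped with `enabled: false` is
// kept in _source (so it can be returned with the document), but its contents
// are not parsed or indexed, so they can't be searched or aggregated on.
const kibanaFieldAdditions = {
  user_data: {
    type: 'object',
    enabled: false,
  },
};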

pmuellr (Member, Author) commented Apr 17, 2020

As a separate issue, there ARE some existing ECS fields for some of the properties in your example, namely this part:

  "cpuUsage": 0,
  "nodeId": "-5XvWwllTW-_tnBZYVyWGw",
  "nodeName": "MonitoringClusterNode1",
  "cluster": {
    "clusterUuid": "1HeA6vq1RgypnZKc8UvK2g",
    "clusterName": "MonitoringCluster"
  },

The dream has also been to be able to make use of these fields, and my thinking along those lines would be to allow some new object to be passed into scheduleActions() that would be an ECS document. Though we only expose a tiny fraction of ECS fields today, we could easily add more, and even allow them to be indexed. But then we'd also need to allow some rich querying over them, otherwise there's not much value to having them in the first place.
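A sketch of that idea (entirely hypothetical - no such extra parameter exists today):

// Hypothetical extension for illustration only: an additional ECS-shaped
// payload passed alongside the existing context. host.name follows ECS
// conventions, but the signature itself is made up.
interface EcsPayload {
  host?: { name?: string };
  // ...plus whichever other ECS fields the event log chooses to expose
}

declare function scheduleActions(
  actionGroup: string,
  context: Record<string, unknown>,
  ecs?: EcsPayload
): void;

// usage
scheduleActions('default', { cpuUsage: 0 }, { host: { name: 'MonitoringClusterNode1' } });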

We can always add that capability later - I'm guessing that using it as the only way to add additional data to the event log would be rather limiting, or at least cumbersome, since I'm sure there would be lots of alertType-specific custom fields we'd have to add for apps/solutions.

gmmorris added the Feature:Alerting/RuleActions and Feature:EventLog labels and removed the Feature:EventLog label on Jul 1, 2021
gmmorris added the loe:needs-research label on Jul 14, 2021
YulNaumenko (Contributor) commented:

We've recently added more fields to the event log documents, and follow-up issues will be created as needed.

gmmorris added the estimate:needs-research label on Aug 18, 2021
gmmorris removed the loe:needs-research label on Sep 2, 2021
kobelb added the needs-team label on Jan 31, 2022
botelastic bot removed the needs-team label on Jan 31, 2022