
[Alerting] write more alert data to event log when scheduling actions #63257

Closed · pmuellr opened this issue Apr 10, 2020 · 7 comments
Labels
discuss, estimate:needs-research, Feature:Alerting/RuleActions, Feature:Alerting, Feature:EventLog, Team:ResponseOps

Comments

pmuellr (Member) commented Apr 10, 2020

Currently, the data we write to the event log when an alert schedules actions is pretty spartan:

const event: IEvent = {
  event: { action: EVENT_LOG_ACTIONS.executeAction },
  kibana: {
    alerting: {
      instance_id: alertInstanceId,
    },
    namespace: spaceId,
    saved_objects: [
      { type: 'alert', id: alertId },
      { type: 'action', id: action.id },
    ],
  },
};

There seems to be some interest in capturing more information here, like instance state.

Two hurdles:

  • how to add it to the documents - we don't want to have to index this data for now, so it should not be indexed, only accessible via _source (see the sketch after this list)
  • security aspects - this data could be sensitive, so at a minimum we'd probably want alert types to opt in to adding alert-specific data to the documents
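As a rough sketch only - none of the new field names below are an agreed design, and `alertInstanceState` / `actionContext` are hypothetical variables - a richer event might look something like this:

// Sketch only: `state` and `context` are hypothetical additions under the
// custom kibana.alerting field; they would be stored in _source but not
// indexed, and only written when the alertType opts in.
const event: IEvent = {
  event: { action: EVENT_LOG_ACTIONS.executeAction },
  kibana: {
    alerting: {
      instance_id: alertInstanceId,
      state: alertInstanceState, // hypothetical: instance state at schedule time
      context: actionContext, // hypothetical: context passed to scheduleActions()
    },
    namespace: spaceId,
    saved_objects: [
      { type: 'alert', id: alertId },
      { type: 'action', id: action.id },
    ],
  },
};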
pmuellr added the Feature:Alerting and Team:ResponseOps labels on Apr 10, 2020
elasticmachine (Contributor) commented:

Pinging @elastic/kibana-alerting-services (Team:Alerting Services)

chrisronline (Contributor) commented:

Taken from #47713 (comment):

It'd be nice to have access to historical alert instance data. In Stack Monitoring, nearly every single page is governed by a time picker, and it'd be great for users to be able to go back in history and view alerts that were firing at that time. It'd also help us build out some kind of historical log of Stack Monitoring alerts.

The other thing that's important is not only showing "firing" alerts in the UI, but also showing "resolved" alerts, as long as the resolution of the alert occurred in the configured time period.

pmuellr (Member, Author) commented Apr 14, 2020

The other thing that's important is not only showing "firing" alerts in the UI, but also showing "resolved" alerts

We should be able to get this information now - we now write event documents for when an alert instance did not schedule actions on the previous turn, but did schedule actions on the current turn (this instance is now "active"). And also for the opposite - the alert instance did schedule actions on the previous turn, but did not schedule actions on the current turn (this instance is now "resolved").

How this will be rendered in the UI is not clear, but it should be possible to see the "starts" and "resolved" for a particular instance, and get the duration of how long it was active.
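For illustration (a minimal sketch, not an existing API - the event action names are placeholders for whatever the framework actually writes), the active duration could be derived by pairing those two events for an instance:

// Minimal sketch: given the event log documents already fetched for one alert
// instance, pair the "became active" and "resolved" events to get a duration.
// The action names 'new-instance' and 'resolved-instance' are placeholders.
interface InstanceEvent {
  '@timestamp': string;
  event: { action: string };
}

function activeDurationMs(events: InstanceEvent[]): number | undefined {
  const started = events.find((e) => e.event.action === 'new-instance');
  const resolved = events.find((e) => e.event.action === 'resolved-instance');
  if (!started || !resolved) {
    return undefined;
  }
  return Date.parse(resolved['@timestamp']) - Date.parse(started['@timestamp']);
}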

chrisronline (Contributor) commented:

For Stack Monitoring (and perhaps other solutions), we need a way to store opaque data into this stream (perhaps not even indexed). We are building a landing page that users will see once they click a deep-link from an alert email. Right now, this landing page only has the capability to view currently firing alerts (which is fine to start with), but we'd like to support the familiar time picker on the page and allow the user to go back in time to view alerts that fired in the past.

See the screenshots (two attached screenshots of the landing page, not reproduced here).

An example of the data structure used to power the above:

{
  "cpuUsage": 0,
  "nodeId": "-5XvWwllTW-_tnBZYVyWGw",
  "nodeName": "MonitoringClusterNode1",
  "cluster": {
    "clusterUuid": "1HeA6vq1RgypnZKc8UvK2g",
    "clusterName": "MonitoringCluster"
  },
  "ui": {
    "isFiring": true,
    "message": {
      "text": "Node MonitoringClusterNode1 is reporting cpu usage of 0% at #absolute. #start_linkPlease investigate.#end_link",
      "tokens": [
        {
          "startToken": "#absolute",
          "type": "time",
          "isAbsolute": true,
          "isRelative": false,
          "timestamp": 1587048504639
        },
        {
          "startToken": "#start_link",
          "endToken": "#end_link",
          "type": "link",
          "url": "/elasticsearch/nodes/-5XvWwllTW-_tnBZYVyWGw"
        }
      ]
    },
    "severity": 2001,
    "resolvedMS": 0,
    "triggeredMS": 1587048504639,
    "lastCheckedMS": 0
  }
}

From an API perspective, we are getting this data via:

const states = await alertsClient.getAlertState({ id });

It'd be great if this same API could accept a time range and return the same exact structure for matching entries in the event log for the time period.

pmuellr (Member, Author) commented Apr 17, 2020

It would be cool to be able to time-travel through alert states via

const states = await alertsClient.getAlertState({ id, dateStart, dateEnd });

I wasn't thinking we'd keep the state around for every alertType executor run - that seems like it would be very expensive in terms of document size. It does seem like we should allow an alertType to add extra data when action groups are scheduled via scheduleActions(), though. We specifically pass the "context" in that call, so we could store that. I'd think we could also arrange to get the instance state (it may not be directly accessible today) from that call.
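As a sketch of how that could look from the executor side (the executor-facing scheduleActions(actionGroup, context) call exists today; the framework capturing the context and state into the event log is the hypothetical part):

// Sketch of an alertType executor scheduling actions with context; the names
// and values here are illustrative. If the framework captured `context` (and
// the instance state) at this point, it could write them to the event log,
// opt-in and non-indexed.
async function executor({ services, state }: any) {
  const context = { cpuUsage: 97, nodeName: 'node-1' }; // illustrative values
  services
    .alertInstanceFactory('node-1-cpu')
    .replaceState({ lastCpuUsage: context.cpuUsage })
    .scheduleActions('default', context);
  return state;
}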

We'll need to make sure we can measure the size of these things, in aggregation, for cases where the index space usage is growing faster than expected. Perhaps customers should have a say in how much data we store?

Per the comments at the top ^^^, it seems like the alertType should opt in to this, via some new actionType flag(s).

In terms of where to store it in the event log - I'll poke around again, but AFAIK no existing property is a good fit here, so adding a new one to our custom kibana field is likely the right way to go. Name? Arghhh. I'm going to call it user_data for now; that's what we've called this gloop for decades. :-) It would be a new field with enabled: false, so it would be available via _source, but not indexed.
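For reference, a sketch of what that mapping addition could look like (`user_data` is just the working name from this comment, and exactly where it lands under the kibana field is not settled):

// Sketch of the mapping addition: an object mapped with `enabled: false` is
// kept in _source (so it can be returned with the document), but its contents
// are not parsed or indexed, so they can't be searched or aggregated on.
const kibanaFieldAdditions = {
  user_data: {
    type: 'object',
    enabled: false,
  },
};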

pmuellr (Member, Author) commented Apr 17, 2020

As a separate issue, there ARE some existing ECS fields for some of the properties in your example, namely this part:

  "cpuUsage": 0,
  "nodeId": "-5XvWwllTW-_tnBZYVyWGw",
  "nodeName": "MonitoringClusterNode1",
  "cluster": {
    "clusterUuid": "1HeA6vq1RgypnZKc8UvK2g",
    "clusterName": "MonitoringCluster"
  },

The dream has also been to be able to make use of these fields, and my thinking along those lines would be to allow some new object to be passed into scheduleActions() that would be an ECS document. Though we only expose a tiny fraction of ECS fields today, we could easily add more, and even allow them to be indexed. But then we'd also need to allow some rich querying over them, otherwise there's not much value to having them in the first place.
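A sketch of that idea (entirely hypothetical - no such extra parameter exists today):

// Hypothetical extension for illustration only: an additional ECS-shaped
// payload passed alongside the existing context. host.name follows ECS
// conventions, but the signature itself is made up.
interface EcsPayload {
  host?: { name?: string };
  // ...plus whichever other ECS fields the event log chooses to expose
}

declare function scheduleActions(
  actionGroup: string,
  context: Record<string, unknown>,
  ecs?: EcsPayload
): void;

// usage
scheduleActions('default', { cpuUsage: 0 }, { host: { name: 'MonitoringClusterNode1' } });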

We can always add that capability later - I'm guessing that using it as the only way to add additional data to the event log would be rather limiting, or at least cumbersome, since I'm sure there would be lots of alertType-specific custom fields we'd have to add for apps/solutions.

gmmorris added the Feature:Alerting/RuleActions and Feature:EventLog labels and removed the Feature:EventLog label on Jul 1, 2021
gmmorris added the loe:needs-research label on Jul 14, 2021
YulNaumenko (Contributor) commented:

We've recently added more fields to the event log documents, and follow-up issues will be created as needed.

gmmorris added the estimate:needs-research label on Aug 18, 2021
gmmorris removed the loe:needs-research label on Sep 2, 2021
kobelb added the needs-team label on Jan 31, 2022
botelastic bot removed the needs-team label on Jan 31, 2022