[RAC][Epic] Observability of the alerting framework phase 1 #98902
Comments
Pinging @elastic/kibana-alerting-services (Team:Alerting Services) |
We should take another look at the APM-ization we did a while back and see if it still provides useful information. I believe we wrapped tasks as APM transactions, so we get pretty good data from that - counts/durations per task and, indirectly, per alert/action (since each alert and action has its own task type). Even though APM is not officially "on" in Kibana today, some day it will be, and presumably before it is "official" there may be some sneaky way to enable it. And on-prem customers could use it today. So we should make sure what we have in there today is going to be useful. |
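For reference, here is a minimal sketch of the general pattern of wrapping a task run as an APM transaction with the elastic-apm-node agent. This is not the actual Task Manager integration; the function name and shape are hypothetical, and it assumes the agent has already been started elsewhere (e.g. via `elastic-apm-node/start`).

```ts
// Illustrative only: shows the elastic-apm-node pattern of wrapping a unit of
// work in a transaction, not the real Task Manager implementation.
import apm from 'elastic-apm-node'; // assumes the agent is started elsewhere

// Hypothetical task-runner wrapper for the sake of the example.
async function runTaskInstrumented(taskType: string, run: () => Promise<void>) {
  // One transaction per task run; the type groups alerting/action tasks in APM.
  const transaction = apm.startTransaction(`taskManager run ${taskType}`, 'taskManager');
  try {
    await run();
    transaction?.setOutcome('success');
  } catch (err) {
    transaction?.setOutcome('failure');
    apm.captureError(err as Error);
    throw err;
  } finally {
    transaction?.end();
  }
}
```

With something like this in place, APM would surface per-task-type counts and durations out of the box, which is roughly the data described above.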
I'd like to advocate for integrating into Stack Monitoring in two ways:
I know the Stack Monitoring UI is in flux and the future is unclear, but it feels like the most straightforward path, as the UI and shipping/collection mechanisms are proven to work for users and reinventing that wheel will take time. |
The event log can provide some pretty good data in terms of counts/durations of actions and alerts, but doesn't currently track any other task-manager-related tasks. The index is read-only and considered a Kibana system index. I just played with it again, and one of the problems with the current event log structure is that the saved object references are not usable within Lens, presumably because they are a nested field. I'm wondering if we can "runtime" them into fields that Lens could see? Otherwise, we can't get any rule-specific breakdowns. The last set of changes for the event log introduced the top-level fields defined in kibana/x-pack/plugins/event_log/generated/mappings.json, lines 173 to 219 in 8810e84.
We'll need to figure out how to map all the goodies though - we'll want the rule type id and the rule id available, at a minimum. There are some related issues already open for this: #94137 and #95411. This won't help for actions though. I'd say if we can solve accessing the nested fields in Lens with runtime fields, let's go with that; otherwise, we can provide a property |
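To make the runtime-field idea concrete, here is a hedged sketch that derives a flat, Lens-friendly field from the nested saved object references at search time. The index pattern, the `rule_id` field name, and the `alert` saved object type are assumptions for illustration, and it uses the 8.x `@elastic/elasticsearch` client request shape; for Lens you would define the same runtime field on the data view rather than per query.

```ts
// Sketch only: exposes a flat "rule_id" runtime field derived from the nested
// kibana.saved_objects entries so it can drive per-rule breakdowns.
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:9200' });

async function eventCountsPerRule() {
  return client.search({
    index: '.kibana-event-log-*', // assumed event log index pattern
    size: 0,
    runtime_mappings: {
      rule_id: {
        type: 'keyword',
        script: {
          // Runtime fields can't read nested doc values directly, so this pulls
          // the references out of _source. Assumes rules use the 'alert' type.
          source: `
            def refs = params._source.kibana?.saved_objects;
            if (refs != null) {
              for (ref in refs) {
                if (ref.type == 'alert') { emit(ref.id); }
              }
            }
          `,
        },
      },
    },
    aggs: {
      per_rule: { terms: { field: 'rule_id', size: 20 } },
    },
  });
}
```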
Existing known issues regarding improving event log for diagnosis (just added to the list at the top of this issue):
|
Proposal

To ensure users have the right tools and visibility to understand and react to the health and performance of the alerting framework (including task manager), we need to integrate into existing infrastructure and tooling provided to users of the stack (specifically thinking of Stack Monitoring and/or APM). This will ensure we do not need to reinvent the wheel when it comes to how to collect/store/visualize metrics and logs, but rather help tell a holistic story for users on where they go and what they do to diagnose/detect/fix performance issues with their stack. We want users to go where they currently go to solve performance issues with Kibana.

Implementation

Identifying the appropriate metrics
Establishing appropriate default alerts
Collect the data
Data storage
Visualize the data
PoC

I'm working on proving how this might integrate into Stack Monitoring in #99877 |
Hey @arisonl, from your perspective, what would be some helpful end-user outputs from this effort, most likely in terms of specific charts? I solicited some high-level input from @gmmorris and @pmuellr about what general metrics would be helpful, and they said:
I'm hoping to translate these into visualizations which will help in shaping and mapping the data properly. Do you have any thoughts on how to best represent these? Or perhaps you have additional thoughts on which metrics to collect too? I imagine most of these will leverage line charts over time, but we could and should think about what each line chart could contain (like a drift line chart that has four lines representing p50, p90, p95, p99 drift, etc) |
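As a strawman for the drift chart specifically, the backing query could be a date histogram with a percentiles sub-aggregation, which yields the p50/p90/p95/p99 series directly. The index name and the `drift_ms` field below are purely hypothetical placeholders for wherever the metric ends up being stored.

```ts
// Hypothetical query shape for a "drift over time" line chart with
// p50/p90/p95/p99 series. Index and field names are placeholders.
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:9200' });

async function driftPercentilesOverTime() {
  return client.search({
    index: 'task-manager-metrics-*', // placeholder index
    size: 0,
    aggs: {
      over_time: {
        date_histogram: { field: '@timestamp', fixed_interval: '5m' },
        aggs: {
          drift: {
            percentiles: {
              field: 'drift_ms', // placeholder field name
              percents: [50, 90, 95, 99],
            },
          },
        },
      },
    },
  });
}
```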
After spending more time working on this, I think we need to change the proposal somewhat drastically.

Proposal

In the short term, we need to focus on delivering value by helping support users with various issues in alerting and/or task manager. I think this starts with reusing existing tooling. The support team has an existing tool, support diagnostics, that takes a snapshot of the state of the cluster (in the form of calling a bunch of APIs). The support team uses this tool with nearly every case that comes in, and its usage can be slightly adapted to include Kibana metrics as well (it might already do this by default). We can deliver value by enhancing the data the tool already collects and by adding more data points for collection, specifically by enhancing the event log and then adding specific queries for the tool to run against the event log to capture important data points.

In the long term, we will use the short-term gains to help infer what our long-term solution looks like. I still feel confident an integration with Stack Monitoring is the best route, but we need more time to flesh out what exactly we should collect before attempting that.

Implementation

Enhancing event log
Identifying "cheat sheet" queries from event log
Ensure as much data is available
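Back on the "cheat sheet" queries item above, here is a hedged sketch of one such query: failed rule executions over the last 24 hours, broken down by rule saved-object id. The index pattern and field names follow my understanding of the event log's ECS-style mapping and should be treated as assumptions.

```ts
// Sketch of a "cheat sheet" support query against the event log.
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:9200' });

async function failedRuleExecutionsLast24h() {
  return client.search({
    index: '.kibana-event-log-*', // assumed event log index pattern
    size: 0,
    query: {
      bool: {
        filter: [
          { term: { 'event.provider': 'alerting' } },
          { term: { 'event.action': 'execute' } },
          { term: { 'event.outcome': 'failure' } },
          { range: { '@timestamp': { gte: 'now-24h' } } },
        ],
      },
    },
    aggs: {
      refs: {
        // saved object references are a nested field, so aggregate through them
        nested: { path: 'kibana.saved_objects' },
        aggs: {
          per_rule: { terms: { field: 'kibana.saved_objects.id', size: 20 } },
        },
      },
    },
  });
}
```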
|
This has been approved by the team as of 6/7/2021 as the targeted items in the first stage.

After even MORE time thinking this through, the goal is to identify data points that are impossible to know when supporting users. We had a meeting where we identified and discussed these, and they should be the focus of the first release of this effort. I'm going to list problem statements with a brief description and suggested remedial steps at the end. However, the current thinking is that the initial delivery on this effort will only involve action items that solve the problems below. The assumption is that once these problems are resolved, we will be in a much better state to help users. Once we feel confident that enough (as much as possible) "impossible"s are solved, it makes sense to pivot this thinking into how to deliver a better experience for us and for our users, giving them the necessary insight into the health and performance of task manager and the alerting system. For now, I will not spend the time thinking through the next steps here, to ensure we focus on the value we need to ship in the initial delivery.

Problems

We can only see health stats when requested, not when there is/was a problem

We have the ability to see health metrics, but not necessarily when we need to see them (when the problem occurs). This is most noticeable when the issue is intermittent and isn't noticeable at first. To combat this, we have a couple options:
The event log does not show the whole picture

We do not currently write to the event log when a rule starts execution (only when the rule finishes execution), so it's not possible for us to stitch together the timeline of rule execution to understand whether one is starting and not finishing, or something else. To combat this, we should write to the event log when a rule begins execution.

We have no insight into the stack infrastructure to debug problems

We've run into numerous issues with misconfiguration of a Kibana, and sometimes this Kibana is missed when looking at the infrastructure. This is primarily due to not having a reliable way to know how many Kibanas are talking to a single Elasticsearch cluster. To combat this, we need to learn more about our available tools. I think the best way to handle this is to rely on Stack Monitoring, which should tell us how many Kibanas are reporting to a single Elasticsearch cluster. Kibana monitoring is on by default, as long as monitoring is enabled on the cluster, which should give us valuable insight into the infrastructure. Once we have the full list, we should be able to quickly identify misconfigurations, such as different encryption keys used on Kibanas that talk to the same Elasticsearch cluster. cc @elastic/kibana-alerting-services to verify this list feels complete based on the conversations over the past two days |
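On the first problem (only seeing health stats on demand), one low-tech option is an external watcher that periodically snapshots the Task Manager health API so the stats exist around the time a problem occurs. The sketch below assumes Node 18+ (global `fetch`) and placeholder URL/credentials; the exact response shape is not relied upon, it is just persisted.

```ts
// External watcher sketch: snapshot GET /api/task_manager/_health on an
// interval so health stats are captured historically, not only on demand.
// The Kibana URL, credentials, and logging destination are placeholders.

const KIBANA_URL = 'http://localhost:5601'; // placeholder
const AUTH = Buffer.from('elastic:changeme').toString('base64'); // placeholder credentials

async function snapshotTaskManagerHealth(): Promise<void> {
  const res = await fetch(`${KIBANA_URL}/api/task_manager/_health`, {
    headers: { Authorization: `Basic ${AUTH}` },
  });
  const health = await res.json();
  // In a real setup this would be shipped somewhere durable; logging is the bare minimum.
  console.log(`[${new Date().toISOString()}] task manager health:`, JSON.stringify(health));
}

// Capture a snapshot every minute.
setInterval(() => {
  snapshotTaskManagerHealth().catch((err) => console.error('health snapshot failed', err));
}, 60_000);
```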
For 7.14, we are aiming to deliver:
We are confident these two items (in addition to internal training around existing tools/solutions) will help us answer previously impossible questions about customer issues, such as "why was my action delayed by 5 minutes at this time yesterday?" and "why didn't my rule run at this time?" |
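To illustrate how the 7.14 deliverables combine for the "why didn't my rule run?" question, here is a hedged sketch of the kind of timeline query that becomes possible once start-of-execution events are written: pull both start and finish events for one rule and sort them by time, so a start with no matching finish stands out. The index pattern, the `execute-start`/`execute` action names, and the rule id are illustrative assumptions.

```ts
// Sketch: reconstruct a single rule's execution timeline from the event log.
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:9200' });

async function ruleExecutionTimeline(ruleId: string) {
  return client.search({
    index: '.kibana-event-log-*', // assumed event log index pattern
    size: 1000,
    sort: [{ '@timestamp': 'asc' }],
    query: {
      bool: {
        filter: [
          { term: { 'event.provider': 'alerting' } },
          { terms: { 'event.action': ['execute-start', 'execute'] } }, // assumed action names
          {
            nested: {
              path: 'kibana.saved_objects',
              query: { term: { 'kibana.saved_objects.id': ruleId } },
            },
          },
        ],
      },
    },
  });
}
```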
Thanks @chrisronline. As this Epic is being worked on across multiple streams, I feel it's worth taking stock of what we have decided to prioritise and why. As Chris noted above, in order to deliver on the success criteria stated for this Epic, we decided to focus on problems that are currently either "impossible" to resolve in customer deployments, or at least extremely difficult. With that in mind, we took stock of these sorts of problems, as identified by our root cause analysis of past support cases, and prioritised the following issues:

Already merged

These issues have been merged and, barring any unexpected defects, are aimed at inclusion in the nearest possible minor release. Last Updated: 23rd June
In Progress

These issues are aimed for inclusion in the nearest possible minor release, but this depends on progress made by feature freeze.
|
Per #98902 (comment), we shipped everything we aimed to ship for the first phase of this effort, so I'm closing this ticket. |
Epic Name
Observability of the alerting framework phase 1
Background
The RAC initiative will drastically increase the adoption of alerting. With that increase in adoption, there will also be an increase in the number of rules the alerting framework has to handle. This increase can cause the overall alerting framework to behave in unexpected ways, and it currently takes a lot of steps to identify the root cause.
User Story / Problem Statement(s)
As a Kibana administrator, I can quickly identify root causes when the alerting framework isn't behaving properly.
As a Kibana developer, I have insight into the performance impact my rule type has.
Success Criteria
An initial step toward reducing the time it takes Kibana administrators to find root causes of framework misbehaviour.
An initial step toward providing developers with insights into their rule types.
Proposal
See #98902 (comment)
The agreed-upon proposal from the above comment yielded these two tickets:
These issues should be considered part of this effort, as they will help tell a better performance story from an event log perspective:
Related issues