
[RAC][Epic] Observability of the alerting framework phase 1 #98902

Closed
mikecote opened this issue Apr 30, 2021 · 12 comments
Labels
epic Feature:Alerting Feature:EventLog Feature:Task Manager Meta Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) Theme: rac label obsolete

Comments

@mikecote
Contributor

mikecote commented Apr 30, 2021

Epic Name

Observability of the alerting framework phase 1

Background

The RAC initiative will drastically increase the adoption of alerting. With an increase in adoption, there will also be an increase in rules the alerting framework will handle. This increase can cause the overall alerting framework to behave in unexpected ways, and it currently takes a lot of steps to identify the root cause.

User Story / Problem Statement(s)

As a Kibana administrator, I can quickly identify root causes when the alerting framework isn't behaving properly.
As a Kibana developer, I can gain insight into the performance impact my rule type has.

Success Criteria

An initial step toward reducing the time it takes Kibana administrators to find root causes of framework misbehaviour.
An initial step toward providing developers with insights about their rule types.

Proposal

See #98902 (comment)

The agreed-upon proposal from the above comment yielded these two tickets:

These issues should be considered part of this effort, as they will help tell a better performance story from an event log perspective:

Related issues

@mikecote mikecote added Feature:Alerting Feature:Task Manager Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) Feature:EventLog epic Theme: rac label obsolete 7.14 candidate labels Apr 30, 2021
@elasticmachine
Contributor

Pinging @elastic/kibana-alerting-services (Team:Alerting Services)

@pmuellr
Member

pmuellr commented Apr 30, 2021

We should take another look at the APM-ization we did a while back and see if it still provides useful information. I believe we wrapped tasks as APM transactions, so we get pretty good data from that - counts/durations per task and, indirectly, per alert/action (since each alert and action has its own task type).

Even though APM is not officially "on" today in Kibana, some day it will be, and presumably before it is "official" there may be some sneaky way to enable it. And on-prem customers could use it today. So we should make sure what we have in there today is going to be useful.

@chrisronline
Contributor

I'd like to advocate for integrating into Stack Monitoring in two ways:

  1. Leverage the collection and shipping mechanisms that currently exist in Stack Monitoring to allow users to ship monitoring metrics and logs surrounding task manager/rules/connectors to different cluster(s)

  2. Leverage the existing Stack Monitoring UI, specifically the Kibana monitoring UI, to visualize performance metrics for task manager/rules/connectors. The current hierarchy is very broad, but we could add grouping by rule/connector type and show specific UIs for that data - similar to how the Stack Monitoring UI handles things like ML and CCR monitoring.

I know the Stack Monitoring UI is in flux and the future is unclear, but it feels like the most straightforward path, as the UI and shipping/collection mechanisms are proven to work for users and reinventing that wheel will take time.

@pmuellr
Member

pmuellr commented Apr 30, 2021

The event log can provide some pretty good data in terms of counts/durations of actions and alerts, but doesn't currently track any other task manager-related tasks. The index is read-only and considered a Kibana system index because of its .kibana- prefix, but assuming it's straightforward to manually grant users read privileges on this index, it's straightforward to create an index pattern for it and then use it in Discover and Lens.

I just played with it again, and one of the problems with the current event log structure is that the saved object references are not usable within Lens, presumably because they are a nested field. I'm wondering if we can "runtime" them into fields that Lens could see? Otherwise, we can't get any rule-specific breakdowns.
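
As a rough illustration of that "runtime them into fields" idea, here is a minimal search-time sketch (Dev Tools console). It assumes the usual .kibana-event-log-* index pattern and that the references live in a kibana.saved_objects array with type and id; the rule_saved_object_id field name and the 'alert' type value are assumptions for illustration, not something that exists today.

GET .kibana-event-log-*/_search
{
  "runtime_mappings": {
    "rule_saved_object_id": {
      "type": "keyword",
      "script": {
        "source": """
          // Copy the rule saved-object id out of the references array so it
          // shows up as a flat keyword field that Discover/Lens could use.
          def kb = params._source.kibana;
          if (kb != null && kb.saved_objects != null) {
            for (def ref : kb.saved_objects) {
              if (ref.type == 'alert') {
                emit(ref.id);
              }
            }
          }
        """
      }
    }
  },
  "size": 0,
  "aggs": {
    "executions_per_rule": {
      "terms": { "field": "rule_saved_object_id", "size": 20 }
    }
  }
}

Whether Lens can pick this up depends on defining the runtime field on the index pattern rather than per-request, so this is only a sketch of the shape, not a confirmed solution.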

The last set of changes for the event_log introduced the top-level rule property, which is a great place we could put the rule-specific information:

"rule": {
"properties": {
"author": {
"ignore_above": 1024,
"type": "keyword",
"meta": {
"isArray": "true"
}
},
"category": {
"ignore_above": 1024,
"type": "keyword"
},
"description": {
"ignore_above": 1024,
"type": "keyword"
},
"id": {
"ignore_above": 1024,
"type": "keyword"
},
"license": {
"ignore_above": 1024,
"type": "keyword"
},
"name": {
"ignore_above": 1024,
"type": "keyword"
},
"reference": {
"ignore_above": 1024,
"type": "keyword"
},
"ruleset": {
"ignore_above": 1024,
"type": "keyword"
},
"uuid": {
"ignore_above": 1024,
"type": "keyword"
},
"version": {
"ignore_above": 1024,
"type": "keyword"
}
}
},

We'll need to figure out how to map all the goodies though - we'll want the rule type id and the rule id available, at a minimum. There are some related issues already open for this: #94137 and #95411

This won't help for actions though. I'd say if we can solve accessing the nested fields in Lens with runtime fields, let's go with that; otherwise we could add a connector property (or similar) under the existing custom kibana field to hold the connector id, space, etc. (for example, something like the fragment below).
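
For illustration only, that hypothetical connector property might look something like the following mapping fragment; none of these field names are decided, they are placeholders in the same style as the rule mapping above.

"kibana": {
  "properties": {
    "connector": {
      "properties": {
        "id":       { "type": "keyword", "ignore_above": 1024 },
        "type_id":  { "type": "keyword", "ignore_above": 1024 },
        "space_id": { "type": "keyword", "ignore_above": 1024 }
      }
    }
  }
}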

@pmuellr
Member

pmuellr commented May 4, 2021

Existing known issues regarding improving the event log for diagnosis (just added to the list at the top of this issue):

@chrisronline
Contributor

chrisronline commented May 13, 2021

Proposal

To ensure users have the right tools and visibility to understand and react to the health and performance of the alerting framework (including task manager), we need to integrate with the existing infrastructure and tooling already provided to stack users (specifically Stack Monitoring and/or APM). This ensures we do not reinvent the wheel for collecting, storing, and visualizing metrics and logs, and instead helps tell a holistic story about where users go and what they do to diagnose, detect, and fix performance issues with their stack. We want users to go where they already go to solve performance issues with Kibana.

Implementation

Identifying the appropriate metrics

Establishing appropriate default alerts

Collect the data

  • Add collection logic for metrics to support existing stack solutions (Stack Monitoring & APM, maybe more?) [Alerting] [o11y] Add data collection #100678
  • Explore indexing data ourselves if there are limitations around indexing strategy (SM is limited to a single collection interval, for example). This would give us better control over how our data is indexed, and therefore over which visualizations we can support, but it comes at the cost of managing the collection/shipping of the data as well as the lifecycle of the indices. [Alerting] [o11y] Add data collection #100678

Data storage

Visualize the data

PoC

I'm working on proving out how this might integrate into Stack Monitoring in #99877.

@chrisronline
Contributor

Hey @arisonl ,

From your perspective, what would be some helpful end-user outputs from this effort, most likely in terms of specific charts?

I solicited some high level input about what general metrics would be helpful from @gmmorris and @pmuellr where they said:

Drift for sure, ideally with granularity at rule type level.
Execution Duration, as we find the long running ones are often the ones causing trouble.
Failure rate by rule type.

drift, as an overall "are there a lot of tasks queued" signal
execution duration, to see how fast things are actually executing

I'm hoping to translate these into visualizations which will help in shaping and mapping the data properly. Do you have any thoughts on how to best represent these? Or perhaps you have additional thoughts on which metrics to collect too?

I imagine most of these will leverage line charts over time, but we could and should think about what each line chart could contain (like a drift line chart that has four lines representing p50, p90, p95, p99 drift, etc)
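
To make that concrete, here is a sketch of the kind of event-log aggregation that could back a per-rule-type duration percentile chart (Dev Tools console). It assumes execute events from the alerting provider, event.duration in nanoseconds (ECS), and a flat rule type id field along the lines of the kibana.alerting field discussed in #95411; the exact field name is an assumption.

GET .kibana-event-log-*/_search
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        { "term": { "event.provider": "alerting" } },
        { "term": { "event.action": "execute" } },
        { "range": { "@timestamp": { "gte": "now-24h" } } }
      ]
    }
  },
  "aggs": {
    "by_rule_type": {
      "terms": { "field": "kibana.alerting.rule_type_id", "size": 50 },
      "aggs": {
        "duration_percentiles": {
          "percentiles": { "field": "event.duration", "percents": [50, 90, 95, 99] }
        }
      }
    }
  }
}

Drift percentiles would presumably come from task manager health stats rather than the event log, but could feed the same kind of chart.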

@chrisronline
Contributor

After spending more time working on this, I think we need to change the proposal somewhat drastically.

Proposal

In the short term, we need to focus on delivering value by helping support users with various alerting and/or task manager issues. I think this starts with reusing existing tooling. The support team has an existing tool, support diagnostics, that takes a snapshot of the state of the cluster (in the form of calling a bunch of APIs). The support team uses this tool on nearly every case that comes in, and its usage can be slightly adapted to include Kibana metrics as well (it might already do this by default).

We can deliver value by enhancing the data the tool already collects and by adding more data points for collection: specifically, enhancing the event log and then adding specific queries for the tool to run against it to capture the important data points.

In the long term, we will use the short-term gains to help infer what our long-term solution looks like. I still feel confident an integration with Stack Monitoring is the best route, but we need more time to flesh out exactly what we should collect before attempting it.

Implementation

Enhancing event log

Identifying "cheat sheet" queries from event log

  • Identify queries that we've run against the event log for SDHs that would save time if known up front (see the example below)
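
As one example of the kind of cheat-sheet query this could produce (not taken from an actual SDH), the following lists the ten slowest rule executions over the last day, assuming the standard .kibana-event-log-* index pattern and alerting execute events:

GET .kibana-event-log-*/_search
{
  "size": 10,
  "_source": ["@timestamp", "event.duration", "rule", "message", "error.message"],
  "query": {
    "bool": {
      "filter": [
        { "term": { "event.provider": "alerting" } },
        { "term": { "event.action": "execute" } },
        { "range": { "@timestamp": { "gte": "now-24h" } } }
      ]
    }
  },
  "sort": [ { "event.duration": "desc" } ]
}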

Ensure as much data as possible is available

  • The more data we can make available through existing or new APIs, the more we can do in the support tool without needing to wait for an official release. We have existing APIs for TM health and alerting health, and we need to ensure that those metrics, combined with raw task manager and event log output, will be enough (to the best of our knowledge) to service all previous SDHs.

@chrisronline
Contributor

chrisronline commented Jun 4, 2021

This has been approved by the team as of 6/7/2021 as the targeted items for the first stage.

After even MORE time thinking this through, the goal is to identify data points that are impossible to know when supporting users. We had a meeting where we identified and discussed these, and they should be the focus of the first release of this effort. I'm going to list the problem statements below, each with a brief description and suggested remedial steps.

However, the current thinking is that the initial delivery of this effort will only involve action items that solve the problems below. The assumption is that once these problems are resolved, we will be in a much better state to help users.

Once we feel confident that enough of these "impossibles" are solved (as many as possible), it makes sense to pivot to how we deliver a better experience for ourselves and our users, giving them the necessary insight into the health and performance of task manager and the alerting system. For now, I will not spend time thinking through those next steps, to ensure we focus on the value we need to ship in the initial delivery.

Problems

We can only see health stats when requested, not when there is/was a problem

We have the ability to see health metrics, but not necessarily when we need to see them (when the problem occurs). This is most apparent when the issue is intermittent and isn't noticed at first.

To combat this, we have a couple options:

  1. Persist health metrics over time so we are able to query for metrics at certain time periods
  2. Log health metrics when we detect a problem (the current task manager health api contains buckets of data and each bucket features a "status" field) so users can go back and see what was logged when they experienced the issue

The event log does not show the whole picture

We do not currently write to the event log when a rule starts execution (only when it finishes), so it's not possible to stitch together the timeline of a rule execution and understand whether it started and never finished, or something else happened.

To combat this, we should write to the event log when a rule begins execution
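
A rough sketch of how that start event could then be used, assuming it lands as event.action: "execute-start" (the action name is an assumption about #101507) and that a flat rule id field is available:

GET .kibana-event-log-*/_search
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        { "term": { "event.provider": "alerting" } },
        { "terms": { "event.action": ["execute-start", "execute"] } }
      ]
    }
  },
  "aggs": {
    "by_rule": {
      "terms": { "field": "rule.id", "size": 100 },
      "aggs": {
        "by_action": { "terms": { "field": "event.action" } }
      }
    }
  }
}

A rule whose execute-start count exceeds its execute count has executions that started but never finished.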

We have no insight into the stack infrastructure to debug problems

We've run into numerous issues caused by a misconfigured Kibana, and sometimes that Kibana is missed when looking at the infrastructure. This is primarily because we don't have a reliable way to know how many Kibanas are talking to a single Elasticsearch cluster.

To combat this, we need to learn more about our available tools. I think the best way to handle this is to rely on Stack Monitoring, which should tell us how many Kibanas are reporting to a single Elasticsearch cluster. Kibana monitoring is on by default (as long as monitoring is enabled on the cluster), which should give us valuable insight into the infrastructure. Once we have the full list, we should be able to quickly identify misconfigurations, such as different encryption keys used on Kibanas that talk to the same Elasticsearch cluster.
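
As a sketch of what that lookup could be, assuming the cluster writes to the legacy .monitoring-kibana-* indices and that the documents carry a kibana_stats.kibana.uuid field (both are assumptions about the monitoring document layout):

GET .monitoring-kibana-*/_search
{
  "size": 0,
  "query": { "range": { "timestamp": { "gte": "now-1h" } } },
  "aggs": {
    "distinct_kibana_instances": {
      "cardinality": { "field": "kibana_stats.kibana.uuid" }
    }
  }
}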

cc @elastic/kibana-alerting-services to verify this list feels complete based on the conversations the past two days

@chrisronline
Copy link
Contributor

For 7.14, we are aiming to deliver:

We are confident these two items (in addition to internal training around existing tools/solutions) will help us answer impossible questions with customer issues, such as "why was my action delayed by 5 minutes at this time yesterday?" and "why didn't my rule run at this time?"

@gmmorris
Contributor

gmmorris commented Jun 14, 2021

Thanks @chrisronline

As this Epic is being worked on across multiple streams, I feel it's worth taking stock of what we have decided to prioritise and why.

As Chris noted above, in order to deliver on the success criteria stated for this Epic we decided to focus on problems that are currently either "impossible" to resolve in customer deployments, or at least extremely difficult.

With that in mind we took stock of these sorts of problems, as identified by our root cause analysis of past support cases, and prioritised the following issues:

Already merged

These issues have been merged and, barring any unexpected defects, are aimed at inclusion in the nearest possible minor release.

Last Updated: 23rd June

Issue | Title | Why have we prioritised this
#98625 | [Task Manager] Health Metrics capacity estimation enhancements | Should dramatically reduce the time spent diagnosing scalability issues in the Alerting Framework
#95411 | [event log] add rule type id in custom kibana.alerting field | Should dramatically reduce the time spent correlating rule events with specific rule types
#94137 | [event log] populate rule.* ECS fields for alert events | Aligns our events with those produced by the Security and Observability solutions, improving our ability to correlate issues across products in the stack
#93704 | [discuss] extending event log for faster/easier access to active instance date information | Enables us to correlate active alerts to actions, by including the activity duration in the Event Log
#96683 | [alerts] http 500 response when creating an alert using an API key has the http authorization | Adds more context around API failures that are related to the use of API keys (this is more about automation than Observability, but it should help our support efforts, so it feels related and worth noting)
#98729 | Alerting docs are missing an example to list the top rules that are executing slowly | Docs changes enable customers to help themselves, freeing us up to respond to other customers with more complex issues
#99160 | Improve Task Manager instrumentation (Experimental ⚠️) | Enables us to use Elastic APM to trace issues across Rules and Actions 🎉 [for the record, this was an initiative by the awesome @dgieselaar]
#98624 | [Task Manager] Workload aggregation caps out at 10k tasks | Adds more detailed information to our health monitoring, giving us a full picture of the workload system
#98796 | Status page returns 503 when alerting cannot find its health task in Task Manager | This was a bug which caused our health monitoring to be unreliable at certain points; it felt like we had to address this asap
#99930 | Alerting health API only considers rules in the default space | Same as above: a bug that impacted our health monitoring, and priority wasn't even up for debate :)
#101227 | [alerting] log warning when alert tasks are disabled due to saved object not found | Adds more detailed information about why a rule task might have failed. At the moment we don't actually know when a missing SO has caused a task to fail.
#87055 | Issues with alert statuses | UI/UX improvements that we hope will enable customers to help themselves, freeing us up to respond to other customers with more complex issues
#101505 | Gain insight into task manager health apis when a problem occurs | Enables us to correlate between Task Manager health stats and issues identified by customers, as long as they have debug logging enabled at the time
#101507 | Improve event log data to include when the rule execution starts | Enables us to correlate between rule execution and issues identified by customers
#99225 | [event log] add rule to event log shared object for provider: actions action: execute log event | Should dramatically reduce the time spent identifying the root cause of action failures

In Progress

These issues are aimed for inclusion in the nearest possible minor release, but this depends on progress made by feature freeze.

Issue | Title | Why have we prioritised this

@chrisronline
Contributor

Per #98902 (comment), we shipped everything we aimed to ship for the first phase of this effort, so closing this ticket.
