[RAC][Epic] Observability of the alerting framework phase 1 #98902
Comments
Pinging @elastic/kibana-alerting-services (Team:Alerting Services) |
We should take another look at the APM-ization we did a while back and see if it still provides useful information. I believe we wrapped tasks as APM transactions, so we get pretty good data from that - counts/durations per task and, indirectly, per alert/action (since each alert and action has its own task type). Even though APM is not officially "on" in Kibana today, some day it will be, and presumably before it is "official" there may be some sneaky way to enable it. And on-prem customers could use it today. So we should make sure what we have in there today is going to be useful. |
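For reference, here is a minimal sketch of the general pattern of wrapping a task run as an APM transaction with the elastic-apm-node agent. This is not the actual Task Manager integration; the function name and shape are hypothetical, and it assumes the agent has already been started elsewhere (e.g. via `elastic-apm-node/start`).

```ts
// Illustrative only: shows the elastic-apm-node pattern of wrapping a unit of
// work in a transaction, not the real Task Manager implementation.
import apm from 'elastic-apm-node'; // assumes the agent is started elsewhere

// Hypothetical task-runner wrapper for the sake of the example.
async function runTaskInstrumented(taskType: string, run: () => Promise<void>) {
  // One transaction per task run; the type groups alerting/action tasks in APM.
  const transaction = apm.startTransaction(`taskManager run ${taskType}`, 'taskManager');
  try {
    await run();
    transaction?.setOutcome('success');
  } catch (err) {
    transaction?.setOutcome('failure');
    apm.captureError(err as Error);
    throw err;
  } finally {
    transaction?.end();
  }
}
```

With something like this in place, APM would surface per-task-type counts and durations out of the box, which is roughly the data described above.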
I'd like to advocate for integrating into Stack Monitoring in two ways:
I know the Stack Monitoring UI is in flux and the future is unclear, but it feels like the most straightforward path, as the UI and shipping/collection mechanisms are proven to work for users and reinventing that wheel will take time. |
The event log can provide some pretty good data in terms of counts/durations of actions and alerts, but doesn't currently track any other task-manager-related tasks. The index is read-only and considered a Kibana system index. I just played with it again, and one of the problems with the current event log structure is that the saved object references are not usable within Lens, presumably because they are a nested field. I'm wondering if we can "runtime" them into fields that Lens could see? Otherwise, we can't get any rule-specific breakdowns. The last set of changes for the event log introduced the top-level fields defined in kibana/x-pack/plugins/event_log/generated/mappings.json, lines 173 to 219 in 8810e84.
We'll need to figure out how to map all the goodies though - we'll want the rule type id and the rule id available, at a minimum. There are some related issues already open for this: #94137 and #95411. This won't help for actions though. I'd say if we can solve accessing the nested fields in Lens with runtime fields, let's go with that; otherwise, we can provide a property |
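To make the runtime-field idea concrete, here is a hedged sketch that derives a flat, Lens-friendly field from the nested saved object references at search time. The index pattern, the `rule_id` field name, and the `alert` saved object type are assumptions for illustration, and it uses the 8.x `@elastic/elasticsearch` client request shape; for Lens you would define the same runtime field on the data view rather than per query.

```ts
// Sketch only: exposes a flat "rule_id" runtime field derived from the nested
// kibana.saved_objects entries so it can drive per-rule breakdowns.
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:9200' });

async function eventCountsPerRule() {
  return client.search({
    index: '.kibana-event-log-*', // assumed event log index pattern
    size: 0,
    runtime_mappings: {
      rule_id: {
        type: 'keyword',
        script: {
          // Runtime fields can't read nested doc values directly, so this pulls
          // the references out of _source. Assumes rules use the 'alert' type.
          source: `
            def refs = params._source.kibana?.saved_objects;
            if (refs != null) {
              for (ref in refs) {
                if (ref.type == 'alert') { emit(ref.id); }
              }
            }
          `,
        },
      },
    },
    aggs: {
      per_rule: { terms: { field: 'rule_id', size: 20 } },
    },
  });
}
```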
Existing known issues regarding improving event log for diagnosis (just added to the list at the top of this issue):
|
Proposal

To ensure users have the right tools and visibility to understand and react to the health and performance of the alerting framework (including task manager), we need to integrate into existing infrastructure and tooling provided to users of the stack (specifically thinking of Stack Monitoring and/or APM). This will ensure we do not need to reinvent the wheel when it comes to how to collect/store/visualize metrics and logs, but rather help tell a holistic story for users on where they go and what they do to diagnose/detect/fix performance issues with their stack. We want users to go where they currently go to solve performance issues with Kibana.

Implementation

Identifying the appropriate metrics
Establishing appropriate default alerts
Collect the data
Data storage
Visualize the data
PoC

I'm working on proving how this might integrate into Stack Monitoring in #99877 |
Hey @arisonl, from your perspective, what would be some helpful end-user outputs from this effort, most likely in terms of specific charts? I solicited some high-level input from @gmmorris and @pmuellr about what general metrics would be helpful, and they said:
I'm hoping to translate these into visualizations which will help in shaping and mapping the data properly. Do you have any thoughts on how to best represent these? Or perhaps you have additional thoughts on which metrics to collect too? I imagine most of these will leverage line charts over time, but we could and should think about what each line chart could contain (like a drift line chart that has four lines representing p50, p90, p95, p99 drift, etc) |
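As a strawman for the drift chart specifically, the backing query could be a date histogram with a percentiles sub-aggregation, which yields the p50/p90/p95/p99 series directly. The index name and the `drift_ms` field below are purely hypothetical placeholders for wherever the metric ends up being stored.

```ts
// Hypothetical query shape for a "drift over time" line chart with
// p50/p90/p95/p99 series. Index and field names are placeholders.
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:9200' });

async function driftPercentilesOverTime() {
  return client.search({
    index: 'task-manager-metrics-*', // placeholder index
    size: 0,
    aggs: {
      over_time: {
        date_histogram: { field: '@timestamp', fixed_interval: '5m' },
        aggs: {
          drift: {
            percentiles: {
              field: 'drift_ms', // placeholder field name
              percents: [50, 90, 95, 99],
            },
          },
        },
      },
    },
  });
}
```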
After spending more time working on this, I think we need to change the proposal somewhat drastically.

Proposal

In the short term, we need to focus on delivering value by helping support users with various issues in alerting and/or task manager. I think this starts with reusing existing tooling. The support team has an existing tool, support diagnostics, that takes a snapshot of the state of the cluster (in the form of calling a bunch of APIs). The support team uses this tool with nearly every case that comes in, and its usage can be slightly adapted to include Kibana metrics as well (it might already do this by default). We can deliver value by enhancing the data the tool already collects and by adding more data points for collection, specifically by enhancing the event log and then adding specific queries for the tool to run against the event log to capture important data points.

In the long term, we will use the short-term gains to help infer what our long-term solution looks like. I still feel confident an integration with Stack Monitoring is the best route, but we need more time to flesh out what exactly we should collect before attempting that.

Implementation

Enhancing event log
Identifying "cheat sheet" queries from event log
Ensure as much data is available
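Back on the "cheat sheet" queries item above, here is a hedged sketch of one such query: failed rule executions over the last 24 hours, broken down by rule saved-object id. The index pattern and field names follow my understanding of the event log's ECS-style mapping and should be treated as assumptions.

```ts
// Sketch of a "cheat sheet" support query against the event log.
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:9200' });

async function failedRuleExecutionsLast24h() {
  return client.search({
    index: '.kibana-event-log-*', // assumed event log index pattern
    size: 0,
    query: {
      bool: {
        filter: [
          { term: { 'event.provider': 'alerting' } },
          { term: { 'event.action': 'execute' } },
          { term: { 'event.outcome': 'failure' } },
          { range: { '@timestamp': { gte: 'now-24h' } } },
        ],
      },
    },
    aggs: {
      refs: {
        // saved object references are a nested field, so aggregate through them
        nested: { path: 'kibana.saved_objects' },
        aggs: {
          per_rule: { terms: { field: 'kibana.saved_objects.id', size: 20 } },
        },
      },
    },
  });
}
```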
|
This has been approved by the team as of 6/7/2021 as the targeted items in the first stage.

After even MORE time thinking this through, the goal is to identify data points that are impossible to know when supporting users. We had a meeting where we identified and discussed these, and they should be the focus of the first release of this effort. I'm going to list problem statements with a brief description and suggested remedial steps at the end. However, the current thinking is that the initial delivery on this effort will only involve action items that solve the problems below. The assumption is that once these problems are resolved, we will be in a much better state to help users. Once we feel confident that enough (as much as possible) "impossible"s are solved, it makes sense to pivot this thinking into how to deliver a better experience for us and for our users, giving them the necessary insight into the health and performance of task manager and the alerting system. For now, I will not spend the time thinking through the next steps here, to ensure we focus on the value we need to ship in the initial delivery.

Problems

We can only see health stats when requested, not when there is/was a problem

We have the ability to see health metrics, but not necessarily when we need to see them (when the problem occurs). This is most noticeable when the issue is intermittent and isn't noticeable at first. To combat this, we have a couple options:
The event log does not show the whole picture

We do not currently write to the event log when a rule starts execution (only when the rule finishes execution), so it's not possible for us to stitch together the timeline of rule execution to understand whether one is starting and not finishing, or something else. To combat this, we should write to the event log when a rule begins execution.

We have no insight into the stack infrastructure to debug problems

We've run into numerous issues with misconfiguration of a Kibana, and sometimes this Kibana is missed when looking at the infrastructure. This is primarily due to not having a reliable way to know how many Kibanas are talking to a single Elasticsearch cluster. To combat this, we need to learn more about our available tools. I think the best way to handle this is to rely on Stack Monitoring, which should tell us how many Kibanas are reporting to a single Elasticsearch cluster. Kibana monitoring is on by default, as long as monitoring is enabled on the cluster, which should give us valuable insight into the infrastructure. Once we have the full list, we should be able to quickly identify misconfigurations, such as different encryption keys used on Kibanas that talk to the same Elasticsearch cluster. cc @elastic/kibana-alerting-services to verify this list feels complete based on the conversations over the past two days |
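On the first problem (only seeing health stats on demand), one low-tech option is an external watcher that periodically snapshots the Task Manager health API so the stats exist around the time a problem occurs. The sketch below assumes Node 18+ (global `fetch`) and placeholder URL/credentials; the exact response shape is not relied upon, it is just persisted.

```ts
// External watcher sketch: snapshot GET /api/task_manager/_health on an
// interval so health stats are captured historically, not only on demand.
// The Kibana URL, credentials, and logging destination are placeholders.

const KIBANA_URL = 'http://localhost:5601'; // placeholder
const AUTH = Buffer.from('elastic:changeme').toString('base64'); // placeholder credentials

async function snapshotTaskManagerHealth(): Promise<void> {
  const res = await fetch(`${KIBANA_URL}/api/task_manager/_health`, {
    headers: { Authorization: `Basic ${AUTH}` },
  });
  const health = await res.json();
  // In a real setup this would be shipped somewhere durable; logging is the bare minimum.
  console.log(`[${new Date().toISOString()}] task manager health:`, JSON.stringify(health));
}

// Capture a snapshot every minute.
setInterval(() => {
  snapshotTaskManagerHealth().catch((err) => console.error('health snapshot failed', err));
}, 60_000);
```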
For 7.14, we are aiming to deliver:
We are confident these two items (in addition to internal training around existing tools/solutions) will help us answer previously impossible questions about customer issues, such as "why was my action delayed by 5 minutes at this time yesterday?" and "why didn't my rule run at this time?" |
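To illustrate how the 7.14 deliverables combine for the "why didn't my rule run?" question, here is a hedged sketch of the kind of timeline query that becomes possible once start-of-execution events are written: pull both start and finish events for one rule and sort them by time, so a start with no matching finish stands out. The index pattern, the `execute-start`/`execute` action names, and the rule id are illustrative assumptions.

```ts
// Sketch: reconstruct a single rule's execution timeline from the event log.
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:9200' });

async function ruleExecutionTimeline(ruleId: string) {
  return client.search({
    index: '.kibana-event-log-*', // assumed event log index pattern
    size: 1000,
    sort: [{ '@timestamp': 'asc' }],
    query: {
      bool: {
        filter: [
          { term: { 'event.provider': 'alerting' } },
          { terms: { 'event.action': ['execute-start', 'execute'] } }, // assumed action names
          {
            nested: {
              path: 'kibana.saved_objects',
              query: { term: { 'kibana.saved_objects.id': ruleId } },
            },
          },
        ],
      },
    },
  });
}
```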
Thanks @chrisronline. As this Epic is being worked on across multiple streams, I feel it's worth taking stock of what we have decided to prioritise and why. As Chris noted above, in order to deliver on the success criteria stated for this Epic, we decided to focus on problems that are currently either "impossible" to resolve in customer deployments, or at least extremely difficult. With that in mind, we took stock of these sorts of problems, as identified by our root cause analysis of past support cases, and prioritised the following issues:

Already merged

These issues have been merged and, barring any unexpected defects, are aimed at inclusion in the nearest possible minor release. Last Updated: 23rd June
In Progress

These issues are aimed for inclusion in the nearest possible minor release, but this depends on progress made by feature freeze.
|
Per #98902 (comment), we shipped everything we aimed to ship for the first phase of this effort, so I'm closing this ticket. |
Epic Name
Observability of the alerting framework phase 1
Background
The RAC initiative will drastically increase the adoption of alerting. With that increase in adoption, there will also be an increase in the number of rules the alerting framework has to handle. This increase can cause the overall alerting framework to behave in unexpected ways, and it currently takes a lot of steps to identify the root cause.
User Story / Problem Statement(s)
As a Kibana administrator, I can quickly identify root causes when the alerting framework isn't behaving properly.
As a Kibana developer, I have insight into the performance impact my rule type has.
Success Criteria
An initial step toward reducing the time it takes Kibana administrators to find root causes of framework misbehaviour.
An initial step toward providing developers with insights into their rule types.
Proposal
See #98902 (comment)
The agreed-upon proposal from the above comment yielded these two tickets:
These issues should be considered part of this effort, as they will help tell a better performance story from an event log perspective:
Related issues