Investigate if Stack Monitoring rules can be recreated with current Observability rules #137277
Comments
Pinging @elastic/infra-monitoring-ui (Team:Infra Monitoring UI)
One obvious blocker is the selection of the right data stream to pull the data from. Today our rules select the data stream via the Metrics UI source configuration, but we likely want to change that, similar to what's outlined in #120928 (for logs). This has been raised via SDH as well.
## Investigation results

### CCR read exceptions
This rule queries the CCR Stats metricset. It might be hard to recreate with existing rules. The closest I can find is an "Elasticsearch query" rule that looks for documents with the read exceptions field in a given time range, where we would collapse on the remote cluster id. But this misses out on the per-index grouping. I'm not sure how different instrumentation could help here either, since we wouldn't want to create counters per index, to avoid mapping explosions.
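For reference, a rough sketch of what that "Elasticsearch query" rule could look like against the legacy `.monitoring-es-*` documents. The field names (`ccr_stats.read_exceptions`, `ccr_stats.remote_cluster`) are assumptions that would need checking against the real mappings, and the rule itself would supply the time window filter:

```json
{
  "query": {
    "bool": {
      "filter": [
        { "exists": { "field": "ccr_stats.read_exceptions" } }
      ]
    }
  },
  "collapse": { "field": "ccr_stats.remote_cluster" },
  "_source": ["cluster_uuid", "ccr_stats.remote_cluster", "ccr_stats.read_exceptions"]
}
```

Configured with a "number of matched documents is above 0" condition. As noted, collapsing on the remote cluster only surfaces one document per remote cluster, so the per-index grouping is lost.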
### Cluster health
This rule pulls out the cluster status for each cluster. I think we can replicate this with another "Elasticsearch query" rule that collapses on the cluster id field, filters out documents that are green/healthy, and alerts if there are any hits (a rough query sketch is at the end of this comment).

### CPU/Memory Usage
This rule groups nodes by cluster, then looks at fields like the node's CPU usage percentage. This rule could be recreated using a Metrics Threshold rule. The same could be done for JVM Memory usage by looking at the JVM heap usage fields.

### Disk Usage
This rule groups nodes by cluster, then looks at the ratio between used and total disk space. I could not find a rule that allows us to express a ratio between two values, however we could change the instrumentation to also include a "current percentage usage" where the ratio calculation is done in the instrumentation.

### Elasticsearch/Kibana/Logstash version mismatch
This rule looks at the list of versions present in each cluster. It's a bit hard to recreate this rule with the same data, but one option is to instead use the version each node reports. The same could be applied for Kibana by looking at the version each Kibana instance reports.

### License expiration
This rule looks at the license expiry date. The same effect could be recreated by using an ES query rule, one for licenses that have already expired and one for licenses that are close to expiring.

### Missing monitoring data
This rule fetches the latest document of Elasticsearch cluster_stats and checks if the timestamp is older than a threshold, in which case it alerts. There isn't a clear way to recreate this kind of behavior. Maybe it's as easy as using an ES query rule to try to grab the latest document in a 5 minute window, and if less than 1 document comes back, we alert.

### Nodes changed
This rule pulls out the last two cluster state documents for each cluster, then compares which nodes are in each state. If any node was added, removed or restarted, it fires an alert. I don't see how we could recreate this rule with the rules we have today. I'm also not sure if this is something different instrumentation could help with, or if this is something the Health APIs would be more suited for?

### Shard size
This rule works similarly to the disk usage rule in the sense that it checks a relationship between two values for a threshold. This rule grabs the latest index_stats for each cluster to get access to the count of primary shards and the size in bytes for those shards, then checks if the average size of each primary shard (shard size in bytes / primary shard count) is above the threshold. Same here, can't reproduce with current rules unless we move the calculation to instrumentation time.

### Thread pool search/write rejections
This rule fetches the rejection counts, split by either search or write, and fires an alert if the count goes over a threshold. Another rule that looks at the result of comparing two documents and fires if the result is above a threshold; I don't think there is a way to replicate this with our current rules.

### Notes
In a similar fashion, any rule that we replace with the general ES query rule really highlights that we need a way to put partially configured rules into integrations, because it doesn't seem reasonable to ask the users to know which ES DSL query to put into the rule for the effect we want. That said, there might be some general case common across those rules that we could put into a new generic rule.

Another small benefit of the current SM rules is that it leaves the SM system to track the cluster UUIDs to use, which would otherwise have to be filtered by the user, so having a smooth way to say "create a rule for this node" would be good to have if we don't have that already.
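A minimal sketch of the cluster health query mentioned above, again assuming legacy internal-collection field names (`type`, `cluster_state.status`, `cluster_uuid`), which are assumptions to verify:

```json
{
  "query": {
    "bool": {
      "filter": [
        { "term": { "type": "cluster_stats" } }
      ],
      "must_not": [
        { "term": { "cluster_state.status": "green" } }
      ]
    }
  },
  "collapse": { "field": "cluster_uuid" },
  "_source": ["cluster_uuid", "cluster_state.status"]
}
```

This would be configured with a "number of matched documents is above 0" condition; a second copy that only matches `red` would cover the danger level.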
This might just be a shower thought, but would it be possible/useful to be able to define rules out of building blocks, similar to how we define ingest pipelines? Are there enough similar steps in rule creation that we could put pieces together to fetch documents, unpack values, do the needed calculation, and then check if we should alert or not?
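To make that a bit more concrete, here is a purely hypothetical sketch of what such a "building blocks" rule definition could look like, analogous to an ingest pipeline. None of these step names or fields exist today; this is only meant to illustrate the shape:

```json
{
  "rule": "stack_monitoring_disk_usage",
  "schedule": { "interval": "1m" },
  "steps": [
    {
      "fetch": {
        "index": ".monitoring-es-*",
        "query": { "term": { "type": "node_stats" } },
        "group_by": ["cluster_uuid", "node_id"]
      }
    },
    {
      "calculate": {
        "disk_usage_pct": "(fs.total_in_bytes - fs.available_in_bytes) / fs.total_in_bytes * 100"
      }
    },
    {
      "alert_when": { "condition": "disk_usage_pct > 80", "severity": "warning" }
    }
  ]
}
```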
I think I'm done with my investigation, so I'm putting this "In review". Please see the comment above. I would also love some thoughts or feedback from @elastic/actionable-observability and @elastic/response-ops-execution if they have the time!
### CCR read exceptions
It sounds like if …

### Cluster health
Agreed. Since the nuance seems to just be "Red == Danger, Not green == Warning", two rules make sense. OTel metrics might have to map colors to gauge values as well, since there's no direct support for strings other than metric labels.

### CPU/Memory Usage
Agreed. No comments to add.

### Disk Usage
Agreed. TSVB has had "filter ratio" for a long time. In Lens, formulas can express the same thing (see `kibana/docs/user/dashboard/lens.asciidoc`, line 145 as of ced8978), e.g. `count(kql='response.status_code > 400') / count()`. A rule would be the next logical step.
Looks like the TSVB implementation is two filter aggs plus a bucket script. Is there maybe an o11y rule that supports arbitrary ES aggs? If so, we could maybe create the ratio that way. If not, I'd say let's get an issue open for generic ratio rule support.
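For reference, the filter-ratio pattern expressed as plain aggregation DSL would look roughly like this (field names and the error filter are only illustrative, mirroring the Lens formula example above):

```json
{
  "size": 0,
  "aggs": {
    "over_time": {
      "date_histogram": { "field": "@timestamp", "fixed_interval": "1m" },
      "aggs": {
        "errors": { "filter": { "range": { "response.status_code": { "gt": 400 } } } },
        "all": { "filter": { "match_all": {} } },
        "error_ratio": {
          "bucket_script": {
            "buckets_path": { "errors": "errors>_count", "all": "all>_count" },
            "script": "params.all > 0 ? params.errors / params.all : 0"
          }
        }
      }
    }
  }
}
```

So a generic ratio rule would essentially boil down to letting the user pick the two filters and a threshold on the resulting `bucket_script` value.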
Definitely an option to produce disk usage as a 0-100% gauge at the ES layer.

### Elasticsearch/Kibana/Logstash version mismatch
Makes sense to me. No comments to add.

### License expiration
I wonder if a gauge like …

### Missing monitoring data
Agreed. We have a number of internal watches that do a similar thing (https://github.com/elastic/cloud/blob/master/infrastructure/cluster-management/observability/production/logging/watches/cluster-delay.json#L71-L102) - we should definitely get an issue open to have this covered by a generic rule.

### Thread pool search/write rejections
Sounds to me like thread pool rejections should be a counter in ES, then we should alert when the rate of that counter crosses a threshold. Is counter rate something not supported by the current o11y rules yet?

### Nodes changed
Tough one. We should open an issue to track it for platform observability. The missing/changed problem is also one I've seen a lot. I think what we really want here is a comparison between observed and expected cluster topologies. How to get that even in an OTel metrics world is kind of a head-scratcher for me.

### Shard size
+1 for moving the calculation to instrumentation time.

### Thread pool search/write rejections
I'd say threadpool rejections should be a counter labeled with the pool name. If we don't have a way to alert on a counter breaching a threshold as a basic o11y rule, we should definitely open an issue to get that created.

### Notes
I dig this idea. The bigger topic on my mind is: imagine we're not talking about Elasticsearch but rather (random example) Postgres. How can we use our integrations and rule system to highlight when (for example) a Postgres primary and replica are running different versions? Or when the primary's disk is nearly exhausted? I don't think we'd want to have a TypeScript Kibana "postgres" plugin, so how do we make the rule system flexible enough that integrations can define the rules they need to inform the operator about problems? Layers that are composable in an integration definition might be a way to answer this.
💭 AFAIK the metrics rules were written such that fetching, transforming and evaluating the conditions were sequential steps, and this caused quite a lot of performance problems in real-world deployments. While I too appreciate the clean structure, we had to move to an approach in which the transformations and comparisons were intertwined with the paginated fetching, so we don't block the event loop for too long and we don't run into memory limitations trying to keep all results in memory at once. So whatever the "modular" structure is, I would recommend taking extra care that it is able to "stream" results, make local transformations, and be preempted.
No, there is an aggregation rule but it only does the common math (sum/avg/max/min); the query rule allows you to define DSL for the aggregations, but the condition is only based on documents found (as far as I can tell).
Nope.
Feels very much like a Health API thing, not a metrics domain thing, where some other system has to look at the observed/expected state and translate that into events that we can later alert on.
I'll open issues to see if we should create:
Totally where I was going with those thoughts too, and avoiding having to create plugin code for all kinds of apps to match the Integration packages. The Integration should be the only thing a user installs into Elastic.
That makes a lot of sense! I wonder if there are steps in our rules today that would not allow for that. I imagine some of the work @simianhacker has done to move computation into ES is also making this less of a problem?
Yes, partly. But there are still several scenarios that could cause very large result sets (such as high cardinality grouping).
Thanks! I added them to an "Alerting issues to be addressed" section of the platform observability meta.
@miltonhultgren if no one is opposed, we could probably call this issue done. #137277 (comment) satisfies the AC pretty well. It doesn't go as far as "example configs" in all cases, but it's at least a description of what might be feasible with current features, and issues are open to address shortcomings.
Yeah, I'd also close this out (though input from other Alerting experts is still welcome)!
Nice work @miltonhultgren!
Nice analysis! Probably worth getting us involved a little more - we could still have a meeting to go over your current thinking. Certainly pull us in as you're starting down more concrete paths. Some thoughts:
If we want to put Rules into Platform Observability packages and display data about them in the future, we will need to use Rules that integrate with Alerts-As-Data (since that is the only rules data which will be public).
The current Stack Monitoring rules do not do this, but before we refactor them (#127284), we should investigate whether we can do the same thing with the existing Observability rules (which already integrate with Alerts-As-Data).
Stack Monitoring rules
AC