Investigate if Stack Monitoring rules can be recreated with current Observability rules #137277

Closed
miltonhultgren opened this issue Jul 27, 2022 · 13 comments


miltonhultgren commented Jul 27, 2022

If we want to put Rules into Platform Observability packages and display data about them in the future, we will need to use Rules that integrate with Alerts-As-Data (since that is the only rule data which will be public).

The current Stack Monitoring rules do not do this, but before we refactor them (#127284), we should investigate if we can do the same thing with the already existing Observability rules (which do integrate with Alerts-As-Data already).

Stack Monitoring rules

  • CCR read exceptions (Alert if any CCR read exceptions have been detected.)
  • Cluster health (Alert when the health of the cluster changes.)
  • CPU Usage (Alert when the CPU load for a node is consistently high.)
  • Disk Usage (Alert when the disk usage for a node is consistently high.)
  • Elasticsearch version mismatch (Alert when the cluster has multiple versions of Elasticsearch.)
  • Kibana version mismatch (Alert when the cluster has multiple versions of Kibana.)
  • License expiration (Alert when the cluster license is about to expire.)
  • Logstash version mismatch (Alert when the cluster has multiple versions of Logstash.)
  • Memory Usage (JVM) (Alert when a node reports high memory usage.)
  • Missing monitoring data (Alert when monitoring data is missing.)
  • Nodes changed (Alert when adding, removing, or restarting a node.)
  • Shard size (Alert if the average shard size is larger than the configured threshold.)
  • Thread pool search rejections (Alert when the number of rejections in the search thread pool exceeds the threshold.)
  • Thread pool write rejections (Alert when the number of rejections in the write thread pool exceeds the threshold.)

AC

  • There exist sample configs of different Observability rules that match the existing Stack Monitoring rules, or an explanation of why a rule cannot be recreated with existing Observability rules
@elasticmachine

Pinging @elastic/infra-monitoring-ui (Team:Infra Monitoring UI)

@miltonhultgren

One obvious blocker is the selection of the right data stream to pull the data from. Today our rules select the data stream by way of the Metrics UI source configuration, but we likely want to change that, similar to what's outlined in #120928 (for logs). This has been raised via SDH as well.


miltonhultgren commented Sep 21, 2022

Investigation results

CCR read exceptions

This rule queries the CCR Stats metricset, which is made up of results from the /_ccr/stats endpoint, and looks for the existence of the read_exceptions field after grouping the documents by remote cluster and follower_index and grabbing the top hit.

This rule might be hard to recreate with existing rules. The closest I can find is an "Elasticsearch query" rule that looks for documents with that field in a given time range, where we would collapse on the remote cluster id. But this misses out on the per-index grouping.

I'm not sure how different instrumentation could help here either, since we wouldn't want to create counters per index to avoid mapping explosions.
If it's possible, we could add the index name as metadata to the counter, something shaped like:

{
  "read_exception_count": 5,
  "metadata": {
    "indices": ["index1", "index2", ...]
  }
}

Cluster health

This rule pulls out the cluster_stats type documents and simply looks at the status property and fires if it's anything but green/healthy.

I think we can replicate this with another "Elasticsearch query" rule that collapses on the cluster id field, filters out documents that are green/healthy, and alerts if there are any hits.
Some nuance is lost, since the SM rule fires with different severity based on the status, but that could be reflected with two rules.
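As a minimal sketch of the search such a rule would need to run (field names like type, cluster_state.status and cluster_uuid are assumptions based on the legacy monitoring document schema, and whether the "Elasticsearch query" rule passes the collapse clause through would need to be verified):

{
  "query": {
    "bool": {
      "filter": [
        { "term": { "type": "cluster_stats" } },
        { "range": { "timestamp": { "gte": "now-5m" } } }
      ],
      "must_not": [
        { "term": { "cluster_state.status": "green" } }
      ]
    }
  },
  "collapse": { "field": "cluster_uuid" }
}

The rule would then alert when more than 0 documents are found.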

CPU/Memory Usage

This rule groups nodes by cluster, then looks at fields like node_stats.process.cpu.percent to aggregate the average CPU usage.

This rule could be recreated using a Metrics Threshold rule.
The current rule accounts for both container and bare metal environments; I don't know if customers are likely to have these mixed for the same app, but if they do, they'll need two rules.

The same could be done for JVM Memory usage by looking at node_stats.jvm.mem.heap_used_percent.
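As a very rough sketch of what such a Metrics Threshold rule could look like when created through the alerting API (the rule_type_id, parameter names, and group-by fields below are from memory and should be treated as assumptions that may differ between versions):

POST /api/alerting/rule
{
  "name": "ES node CPU usage",
  "rule_type_id": "metrics.alert.threshold",
  "consumer": "infrastructure",
  "schedule": { "interval": "1m" },
  "params": {
    "criteria": [
      {
        "aggType": "avg",
        "metric": "node_stats.process.cpu.percent",
        "comparator": ">",
        "threshold": [85],
        "timeSize": 5,
        "timeUnit": "m"
      }
    ],
    "groupBy": ["cluster_uuid", "source_node.name"]
  },
  "actions": []
}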

Disk Usage

This rule groups nodes by cluster, then looks at the ratio between node_stats.fs.total.total_in_bytes and node_stats.fs.total.available_in_bytes and alerts if it passes a threshold.

I could not find a rule that allows us to express a ratio between two values, however we could change the instrumentation to also include a "current percentage usage" where the ratio calculation is done in the instrumentation.
Though it might be good to have a "ratio" rule regardless.
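For reference, the ratio itself is straightforward to express as an Elasticsearch aggregation. This is only a sketch of what a hypothetical "ratio" rule (or a manual query) would need to evaluate, not something the current rules can alert on, and the source_node.uuid grouping field is an assumption:

{
  "size": 0,
  "aggs": {
    "nodes": {
      "terms": { "field": "source_node.uuid" },
      "aggs": {
        "total_bytes": { "max": { "field": "node_stats.fs.total.total_in_bytes" } },
        "available_bytes": { "max": { "field": "node_stats.fs.total.available_in_bytes" } },
        "disk_usage_ratio": {
          "bucket_script": {
            "buckets_path": {
              "total": "total_bytes",
              "available": "available_bytes"
            },
            "script": "1 - (params.available / params.total)"
          }
        }
      }
    }
  }
}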

Elasticsearch/Kibana/Logstash version mismatch

This rule looks at the versions array in the cluster_state and alerts if there is more than 1 entry.

It's a bit hard to recreate this rule with the same data, but one option is to instead use elasticsearch.node.version and write an ES query rule that collapses on the node version field and alerts if more than 1 document is found (grouped by cluster uuid).

The same could be applied for Kibana by looking at kibana_stats.kibana.version and for Logstash by looking at logstash_stats.logstash.version.
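A minimal sketch of the query body for such a rule, with the alert condition set to "more than 1 document found" (the cluster uuid value is a placeholder, the timestamp field name is an assumption, and whether the ES query rule passes the collapse clause through would need to be verified):

{
  "query": {
    "bool": {
      "filter": [
        { "term": { "cluster_uuid": "<target cluster uuid>" } },
        { "range": { "timestamp": { "gte": "now-5m" } } }
      ]
    }
  },
  "collapse": { "field": "elasticsearch.node.version" }
}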

License expiration

This rule looks at elasticsearch.cluster.stats.license.status and elasticsearch.cluster.stats.license.expiry_date_in_millis to alert about expired or soon-to-expire licenses.

The same effect could be recreated using ES query rules: one for licenses with a status other than active, and one for those where expiry_date_in_millis falls within a certain time range, using a range query. The current rule supports warning at intervals of [60, 30, 14, 7] days.
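As a sketch, the two query bodies could look something like the following (field names are taken from the rule description above; note that if expiry_date_in_millis is mapped as a long rather than a date, the cutoff would have to be supplied as an epoch-millis value instead of the date-math expression used here).

Status other than active:

{
  "query": {
    "bool": {
      "must_not": [
        { "term": { "elasticsearch.cluster.stats.license.status": "active" } }
      ]
    }
  }
}

Expiring within 60 days:

{
  "query": {
    "range": {
      "elasticsearch.cluster.stats.license.expiry_date_in_millis": { "lte": "now+60d" }
    }
  }
}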

Missing monitoring data

This rule fetches the latest document of Elasticsearch cluster_stats and looks if the timestamp is older than a threshold, in which case it alerts.

There isn't a clear way to recreate this kind of behavior. Maybe it's as easy as using an ES query rule to try to grab the latest document in a 5 minute window, and if less than 1 document comes back we alert.
Overall, this feels like it should be a common enough problem to alert on that I'm surprised we don't already have a rule for this "data stream stopped".
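A minimal sketch of that idea, assuming the ES query rule is configured to alert when fewer than 1 document matches, and assuming the legacy cluster_stats document shape (the cluster uuid is a placeholder):

{
  "query": {
    "bool": {
      "filter": [
        { "term": { "type": "cluster_stats" } },
        { "term": { "cluster_uuid": "<target cluster uuid>" } },
        { "range": { "timestamp": { "gte": "now-5m" } } }
      ]
    }
  }
}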

Nodes changed

This rule pulls out the last two cluster state documents for each cluster, then compares which nodes are in each state. If any node was added, removed or restarted, it fires an alert.
Nodes are considered removed if they exist in the last document but not in the current document.
Added if they are in the current document but not in the last document.
Restarted if they exist in both documents with the same nodeUuid but have different nodeEphemeralId.

I don't see how we could recreate this rule with the rules we have today. I'm also not sure if this is something different instrumentation could help with or if this is something the Health APIs would be more suited for?
Perhaps we could pull documents per node, but we would still need a way to compare previous and current results within the same rule to say "there are more/fewer" than before.

Shard size

This rule works similarly to the disk usage rule in the sense that it checks a relationship between two values against a threshold. This rule grabs the latest index_stats for each cluster to get access to the count of primary shards and the size in bytes for those shards, then checks if the average size of each primary shard (shard size in bytes / primary shard count) is above the threshold.

Same here: this can't be reproduced with current rules unless we move the calculation to instrumentation time.

Thread pool search/write rejections

This rule fetches the rejection counts, split by either search or write, and fires an alert if the count goes over a threshold.
It grabs the most recent and the oldest node_stats documents for each node (by cluster), then reports the change in rejections between these two documents (I assume because the counter never resets). It will fire if the number of rejections is above a threshold within the rule execution interval.

This is another rule that compares two documents and fires if the result is above a threshold; I don't think there is a way to replicate this with our current rules.

Notes:
While many rules can be recreated, they end up losing a bit of the context that the custom rules offer. In my opinion, this means we should reshape our generic rules to let authors of other integrations offer more easily digested rules on top of generic executors, so that we can create rules that say "ES node CPU usage over threshold" in the alert while the executor is the normal metric threshold executor (which remains a single place for performance optimizations).

In a similar fashion, any rule that we replace with the general ES query rule really highlights that we need a way to put partially configured rules into integrations, because it doesn't seem reasonable to ask users to know which ES DSL query to put into the rule for the effect we want. That said, there might be some general case common across those rules that we could put into a new generic rule.

Another small benefit of the current SM rules is that they leave it to the SM system to track which cluster UUIDs to use, which would otherwise have to be filtered by the user, so having a smooth way to say "create a rule for this node" would be good to have if we don't have that already.

This might just be a shower thought, but would it be possible/useful to be able to define rules out of building blocks, similar to how we define ingest pipelines? Are there enough similar steps in rule creation that we could put pieces together to fetch documents, unpack values, do the needed calculation and then check if we should alert or not?


miltonhultgren commented Sep 23, 2022

I think I'm done with my investigation, so I'm putting this "In review". Please see the comment above.

I would also love for some thoughts or feedback from @elastic/actionable-observability and @elastic/response-ops-execution if they have the time!


matschaffer commented Sep 27, 2022

CCR read exceptions

It sounds like if read_exception_count is implemented as a counter with per-index labels, this alert could use the "Elasticsearch query" rule.

Cluster health

Agreed. Since the nuance seems to just be "Red == Danger, Not green == Warning", two rules make sense. Otel metrics might have to map colors to gauge values as well, since there's no direct support for strings other than metric labels.

CPU/Memory Usage

Agreed. No comments to add.

Disk Usage

I could not find a rule that allows us to express a ratio between two values

Agreed. TSVB has had "filter ratio" for a long time. In Lens, the filter ratio example ("To filter a document set, use `kql=''`, then compare to other documents within the same grouping") recommends using a formula like count(kql='response.status_code > 400') / count(). A rule would be the next logical step.

Looks like the TSVB implementation is two filter aggs plus a bucket script.

Is there maybe an o11y rule that supports arbitrary ES aggs? If so, we could maybe create the ratio that way. If not, I'd say let's get an issue open for generic ratio rule support.

however we could change the instrumentation to also include a "current percentage usage" where the ratio calculation is done in the instrumentation.

Definitely an option to produce disk usage as a 0-100% gauge at the ES layer.

Elasticsearch/Kibana/Logstash version mismatch

Makes sense to me. No comments to add.

License expiration

I wonder if a gauge like license_lifespan might make sense to help make tracking this more straightforward. Then we could have multiple metric rules that fire when it is less than 60 days (5,184,000,000 ms), etc.

Missing monitoring data

Overall, this feels like it should be a common enough problem to alert on that I'm surprised we don't already have a rule for this "data stream stopped".

Agreed. We have a number of internal watches that do a similar thing (https://github.com/elastic/cloud/blob/master/infrastructure/cluster-management/observability/production/logging/watches/cluster-delay.json#L71-L102) - we should definitely get an issue open to have this covered by a generic rule.

Thread pool search/write rejections

Sounds to me like thread pool rejections should be a counter in ES; then we should alert when the rate of that counter crosses a threshold.

Is counter rate something not supported by the current o11y rules yet?

Nodes changed

Tough one. We should open an issue to track it for platform observability. The missing/changed problem is also one I've seen a lot. I think what we really want here is a comparison between observed and expected cluster topologies. How to get that even in an otel metrics world is kind of a head-scratcher for me.

Shard size

+1 for moving the calculation to instrumentation time.

Thread pool search/write rejections

I'd say threadpool rejections should be a counter labeled with the pool name. If we don't have a way to alert on a counter breaching a threshold as a basic o11y rule, we should definitely open an issue to get that created.

Notes

This might just be a shower thought, but would it be possible/useful to be able to define rules out of building blocks, similar to how we define ingest pipelines? Are there enough similar steps in rule creation that we could put pieces together to fetch documents, unpack values, do the needed calculation and then check if we should alert or not?

I dig this idea.

The bigger topic on my mind is: Imagine we're not talking about Elasticsearch but rather (random example) Postgres.

How can we use our integrations and rule system to highlight when (for example) a Postgres primary and replica are running different versions? Or when the primary's disk is nearly exhausted?

I don't think we'd want to have a TypeScript Kibana "postgres" plugin, so how do we make the rule system flexible enough that integrations can define the rules they need to inform the operator about problems?

Layers that are composable in an integration definition might be a way to answer this.

@weltenwort

This might just be a shower thought, but would it be possible/useful to be able to define rules out of building blocks, similar to how we define ingest pipelines? Are there enough similar steps in rule creation that we could put pieces together to fetch documents, unpack values, do the needed calculation and then check if we should alert or not?

💭 AFAIK the metrics rules were written such that fetching, transforming, and evaluating the conditions were sequential steps, and this caused quite a lot of performance problems in real-world deployments. While I too appreciate the clean structure, we had to move to an approach in which the transformations and comparisons were intertwined with the paginated fetching, so we don't block the event loop for too long and don't run into memory limitations from trying to keep all results in memory at once.

So whatever the "modular" structure is, I would recommend taking extra care that it is able to "stream" results, make local transformations, and be preempted.


miltonhultgren commented Sep 27, 2022

Is there maybe an o11y rule that supports arbitrary ES aggs? If so, we could maybe create the ratio that way. If not, I'd say let's get an issue open for generic ratio rule support.

No, there is an aggregation rule, but it only does the common math (sum/avg/max/min); the query rule allows you to define DSL for the aggregations, but the condition is only based on documents found (as far as I can tell).

Is counter rate something not supported by the current o11y rules yet?

Nope.

 I think what we really want here is a comparison between observed and expected cluster topologies. 

Feels very much like a Health API thing, not a metrics domain thing, where some other system has to look at the observed/expected state and translate that into events that we can later alert on.

Re: Missing rules

I'll open issues to see if we should create:

Imagine we're not talking about Elasticsearch but rather (random example) Postgres.

Totally where I was going with those thoughts too, and avoiding having to create plugin code for all kinds of apps to match the Integration packages. The Integration should be the only thing a user installs into Elastic.

So whatever the "modular" structure is, I would recommend taking extra care that it is able to "stream" results, make local transformations, and be preempted.

That makes a lot of sense! I wonder if there are steps in our rules today that would not allow for that. I imagine some of the work @simianhacker has done to move computation into ES is also making this less of a problem?
Since we now pull less data into the JavaScript world?

@weltenwort

That makes a lot of sense! I wonder if there are steps in our rules today that would not allow for that. I imagine some of the work @simianhacker has done to move computation into ES is also making this less of a problem? Since we now pull less data into the JavaScript world?

Yes, partly. But there are still several scenarios that could cause very large result sets (such as high cardinality grouping).

@matschaffer

I'll open issues to see if we should create

Thanks! I added them to an "Alerting issues to be addressed" section of the platform observability meta.


matschaffer commented Sep 27, 2022

@miltonhultgren if no-one is opposed, we could probably call this issue done. #137277 (comment) satisfies the AC pretty well. It doesn't go as far as "example configs" in all cases, but it's at least a description of what might be feasible with current features, and issues are open to address the shortcomings.

@miltonhultgren

Yeah, I'd also close this out (though input from other Alerting experts is still welcome!).

@matschaffer

nice work @miltonhultgren !


pmuellr commented Oct 12, 2022

Nice analysis! Probably worth getting us involved a little more - we could still have a meeting to go over your current thinking. Certainly pull us in as you're starting down more concrete paths.

Some thoughts:

  • re: ratios, and thoughts of "we should change the source data" - probably :-), but there are also runtime fields that could potentially be used, if the number of things being searched over isn't huge (see the runtime-field sketch after this list). Bucket script probably makes more sense though :-). In general we've not de-duped stuff in the event log, so for instance we log start/end/duration timing info, even though we only really need 2 of those, specifically so that we wouldn't have to do calculations DURING the rule execution. If a ratio is an interesting number, it probably should be in the source.

  • aggs; we've been talking about adding aggs support to the elasticsearch query rule, but it's not quite clear how it would work; it does seem doable. That rule also just grew runtime fields support, and support for specifying output fields.

  • "modular approach" to handle 'version mismatch for X' - probably worth thinking through; if something like this could be built and be workable, seems better than creating 1-1 copies of the stack monitoring alerts, which will a long life, even if we support "modular" later. I've added a comment here as well: https://github.com/elastic/observability-dev/issues/2395

  • "So whatever the "modular" structure is, I would recommend to take extra care that it is able to "stream" results and make local transformations and can be preempted." Ya, and the rule lifecycle has gotten more complex as well, with the ability to determine if you've hit runtime caps for alerts, generate context data for recovered alerts, etc.

  • single or multiple alerts from the rules; I believe stack monitoring rules currently generate 0 or 1 alert per run. Like, "yes there are some problems, and here's what they are: 1, 2, 3". Alternatively rules can generate multiple alerts per rule, so in the case above, it could generate separate alerts for 1, 2, and 3 (and run actions for all three). And we will eventually have some "summary" support, so rules that generate multiple alerts can be configured to still create separate alerts but only invoke the actions once, with all the alert info, so you'd be back to being able to get one email listing 3 alerts. Probably want to keep this in mind with the new designs.

  • "This rule could be recreated using a (existing rule type) rule." I guess the thinking is there would be a new UX in stack monitoring to build these, but they would end up just being the (existing rule type) in the end. Super-cool thought, but ... I don't think we'd want it to be exactly the same type. Because on the Rules page, they would get listed as (existing rule type) rules. We'd want to think about the best way to "reuse" a connector like this. Subclassing (I hope not! And I'm even an old Smalltalk dude and think that's a bad idea!)? Maybe just some fancy aliasing, so it would appear as a separate rule type, but at the server-side just reuses an existing rule implementation. And presumably a new UX. Or maybe we need to modularize our "stack" rules (like es query, index threshold).

  • "Overall, this feels like it should be a common enough problem (missing monitor data) to alert on that I'm surprised we don't already have a rule for this "data stream stopped". Ya, this has definitely been an issue - whenever someone asks how to right a rule to determine when something ISN'T happening, we have this problem. I don't know what the answer is, but we could definitely use something here, in a general sense, because folks have asked for this for index threshold and es query as well ...
