
[RFC] Add metrics and tracing framework in Opensearch #1061

Open
itiyamas opened this issue Aug 9, 2021 · 9 comments
Labels: discuss, distributed framework, enhancement, Roadmap:Stability/Availability/Resiliency

Comments

@itiyamas
Contributor

itiyamas commented Aug 9, 2021

We have been looking into instrumenting the Opensearch code recently. Even though stats provide a good mechanism, they lose a lot of detail, like percentiles, which makes it much harder to debug issues in production. Wouldn't it be great to add a metrics framework to Opensearch that allows a developer to add metrics easily at any point in the code, without having to know the exact stats object where the metric belongs?

The framework can, for example, be integrated with RestActions and emit timing and error metrics per operation by default. Similarly, we could pass this metrics object around via ThreadContext down the executor chain and correlate timing metrics together in a single block per request. The metrics can have different logging levels, allowing us to skip or add metric calculation on the fly, similar to what we have in the logger.

Imagine that you added a new action within the bulk API chain and suddenly a few more requests start taking more time. One way to investigate this is by adding a stat for the new operation within bulk stats. But because stats are always averaged or use precomputed percentiles, it is really tricky to confirm whether the new operation is the culprit. If there were a single metrics block that allowed us to correlate these metrics, it would be really simple to determine causation.

Now that I have talked about the metric generation framework: publishing can be implemented in a pluggable fashion to different sinks. We can provide a default implementation for the metric log file format, and other sinks can be plugged in via different metrics plugins.
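To make the idea concrete, here is a minimal sketch of what such a per-request metrics block and pluggable sink could look like. All names (MetricsContext, MetricsSink, recordTiming) are hypothetical and purely illustrative, not existing OpenSearch APIs:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch only: a per-request metrics block that travels with the request
// (e.g. stashed in the ThreadContext) and is flushed to a pluggable sink on completion.
public final class MetricsContext {

    /** Pluggable destination for finished metric blocks (metric log file, exporter, ...). */
    public interface MetricsSink {
        void publish(String requestId, Map<String, Long> timingsMillis);
    }

    private final String requestId;
    private final MetricsSink sink;
    private final Map<String, Long> timingsMillis = new ConcurrentHashMap<>();

    public MetricsContext(String requestId, MetricsSink sink) {
        this.requestId = requestId;
        this.sink = sink;
    }

    /** Record (or accumulate) how long a named step of this request took. */
    public void recordTiming(String step, long elapsedMillis) {
        timingsMillis.merge(step, elapsedMillis, Long::sum);
    }

    /** Emit the whole correlated block once, e.g. when the RestAction responds. */
    public void close() {
        sink.publish(requestId, Map.copyOf(timingsMillis));
    }
}
```

A RestAction wrapper could create one MetricsContext per request, propagate it down the executor chain via the ThreadContext, and call close() in the response listener, so every timing for a request ends up in a single correlated block.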

itiyamas added the enhancement and untriaged labels on Aug 9, 2021
@jkowall
Contributor

jkowall commented Aug 9, 2021

Sounds a bit overengineered to me. I would implement OpenTelemetry, which can handle the metrics but also help with tracing (logging support is coming soon). The problem with "metrics" in OpenSearch today is that they are exposed via JMX, which is highly inefficient compared to modern approaches of exposing data as an endpoint (as in the case of Prometheus) and allowing metrics to be scraped and stored easily. Similarly, I am sure there are cases in OpenSearch where you might want to do some tracing too.
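For reference, recording a per-operation latency histogram with the OpenTelemetry Java API looks roughly like the sketch below. The instrument and attribute names are made up for illustration, and a separately configured SDK (with, say, a Prometheus or OTLP exporter) would be needed to actually export anything:

```java
import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.common.AttributeKey;
import io.opentelemetry.api.common.Attributes;
import io.opentelemetry.api.metrics.DoubleHistogram;
import io.opentelemetry.api.metrics.Meter;

public final class RestActionMetrics {
    // Meter and instrument names are illustrative, not an agreed-upon convention.
    private static final Meter METER = GlobalOpenTelemetry.getMeter("org.opensearch");
    private static final DoubleHistogram LATENCY = METER
        .histogramBuilder("rest.action.latency")
        .setUnit("ms")
        .setDescription("Per-operation REST action latency")
        .build();

    public static void record(String operation, double elapsedMillis) {
        // A histogram keeps the distribution, so percentiles and outliers survive aggregation.
        LATENCY.record(elapsedMillis,
            Attributes.of(AttributeKey.stringKey("operation"), operation));
    }
}
```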

@itiyamas
Contributor Author

itiyamas commented Aug 9, 2021

+1 on OpenTelemetry and tracing!

Opensearch does not provide hooks to collect detailed request-level metrics/tracing easily. Once we have those hooks, we can integrate OpenTelemetry for metric collection, tracing, etc. I don't think it is possible to add fine-grained metrics to the existing code with JMX. I will check it out though.

@Bukhtawar
Collaborator

Breaking down the problem

  1. Per-action slow execution can also be traced via Initial draft on slow task execution #1021
  2. We need more stats to supplement existing metrics across the various interaction layers, like time spent in the queue, network round trip, IO time, blocked time, etc. These would be aggregated over requests spanning a time frame and be easily queryable over an endpoint
  3. Per-request-ID latency breakdown and distributed tracing (a rough sketch follows after this list)
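As a rough illustration of point 3, manual instrumentation with a tracing API (OpenTelemetry is used here purely as an example; the span name and wiring are hypothetical) could produce a per-request breakdown with one child span per interaction layer:

```java
import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Scope;

public final class ShardBulkTracing {
    private static final Tracer TRACER = GlobalOpenTelemetry.getTracer("org.opensearch");

    // Wrapping each stage (queueing, transport, IO, ...) in its own span ties the
    // latency breakdown of a single request together under one trace ID.
    public void traceShardBulk(Runnable shardOperation) {
        Span span = TRACER.spanBuilder("indices:data/write/bulk[s]").startSpan();
        try (Scope ignored = span.makeCurrent()) {
            shardOperation.run();
        } catch (RuntimeException e) {
            span.recordException(e);
            throw e;
        } finally {
            span.end();
        }
    }
}
```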

@jkowall
Contributor

jkowall commented Aug 10, 2021

> Per request id based latency breakdown and distributed tracing

If you have ever tried to instrument ElasticSearch, you will learn this is a really bad idea, especially per request. It might be useful for debugging, but generally the data will make no sense. I have done this in the past.

@itiyamas
Contributor Author

@jkowall Why do you think request-based latencies are bad? Is it bad in general, just for Opensearch, or just for cases where throughput is really high?
I can think of cases where long-running requests would hold the metrics object in memory, leading to increased JVM heap usage, but that data is really small compared to the request data/partial response data itself. Memory circuit-breaker calculations can take metrics data into account and stop collecting metrics based on a user setting.
Next comes performance degradation due to the extra work needed to collect and publish metrics. That is a trade-off any user can make by changing the instrumentation log level per logger instance. Additionally, if the throughput is really high, the system can provide an option to switch to in-memory aggregates or request sampling, but I won't prematurely optimize this, as the cost of skipping observability is pretty high too.
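To illustrate the "instrumentation log level" idea from the previous paragraph, a guard along these lines (all names hypothetical, not an existing OpenSearch API) would let users dial the per-request collection cost down to zero, much like a logger level check:

```java
// Hypothetical sketch: an instrumentation level analogous to a logger level, so the
// cost of per-request metric collection can be tuned at runtime per component.
public final class RequestInstrumentation {

    public enum Level { OFF, AGGREGATE_ONLY, PER_REQUEST }

    // Could be backed by a dynamic cluster setting in a real implementation.
    private volatile Level level = Level.AGGREGATE_ONLY;

    public void setLevel(Level newLevel) {
        this.level = newLevel;
    }

    /** Cheap guard, same idea as Logger.isTraceEnabled(). */
    public boolean isPerRequestEnabled() {
        return level == Level.PER_REQUEST;
    }

    /** Time a step only when the user has opted in; otherwise run it untouched. */
    public long timeIfEnabled(Runnable step) {
        if (!isPerRequestEnabled()) {
            step.run();
            return -1L;
        }
        long start = System.nanoTime();
        step.run();
        return (System.nanoTime() - start) / 1_000_000L; // elapsed millis
    }
}
```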

The way we collect stats in Opensearch gives us averages, which ends up hiding a lot of issues. Request-based latencies help us track outliers easily. Aggregating metrics early on leaves me with very little information to troubleshoot issues after they have already occurred. For the kind of clusters I deal with on a daily basis, this information is really important; often I end up enabling trace logging after the fact and then waiting for the issue to re-occur, which sometimes never happens, or the trace logs don't have enough information. Reading the trace logs is pretty tedious at this point, and doing it across requests is a nightmare.

minalsha changed the title from "Add metrics framework in Opensearch" to "[RFC] Add metrics framework in Opensearch" on Aug 24, 2021
CEHENKLE added the discuss label on Aug 31, 2021
itiyamas changed the title from "[RFC] Add metrics framework in Opensearch" to "[RFC] Add metrics and tracing framework in Opensearch" on Sep 16, 2021
@CEHENKLE
Member

Heya @itiyamas, what are you thinking the next steps will be for this?

@jkowall
Contributor

jkowall commented Oct 12, 2021

My challenge to your suggestion is that tracing ElasticSearch is very difficult. When you install autoinstrumentation and collect traces, they will make no sense at all. I have done this with several tools and the data was useless. Additionally, the overhead of instrumentation impacted performance.

If you want to collect metrics or response data, that would be more reasonable. We actually already have something similar that @AmiStrn worked on, around the existing profiler.

@reta
Collaborator

reta commented Dec 1, 2021

@jkowall @itiyamas @Bukhtawar trying to summarize the discussion points and possible future developments on the subject of metrics / tracing / stats:

  1. Metrics. The existing OpenSearch stats APIs provide a lot of insight into the internals of each node and of the cluster in general, but the stats lack the semantics of rich metric types (e.g. histograms, ...), so it is difficult to catch outliers. Integration with OpenTelemetry is an option (alternatively, Micrometer / Dropwizard Metrics / ...).
  2. Distributed Tracing. OpenSearch has limited support for correlation ID propagation (X-Opaque-Id), which works only with some clients. It is propagated up to the task level, but the breakdown is not captured or persisted anywhere for analysis or visualization. Integration with OpenTracing is an option (autoinstrumentation is out of the picture; manual instrumentation / context propagation is required).
  3. JFR. The OpenSearch server could be instrumented to emit context-specific JFR events (a minimal event sketch follows after this list). The overhead of JFR instrumentation is very low (<3%), but the kinds and types of events are subject to discussion. JFR events are local to the node but could be streamed to a centralized location as well.
  4. Stats improvements (a large chunk of improvements is included in [Meta] BackPressure in the OpenSearch Query (Search) path #1042).
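As a sketch of what point 3 could look like, a context-specific JFR event can be declared with the jdk.jfr API roughly as follows. The event name and fields are invented for the example; choosing the actual kinds and types of events is exactly what the RFC would need to settle:

```java
import jdk.jfr.Category;
import jdk.jfr.Event;
import jdk.jfr.Label;
import jdk.jfr.Name;

// Illustrative event only; not an agreed-upon event set.
@Name("org.opensearch.QueryPhase")
@Label("Query Phase")
@Category("OpenSearch")
class QueryPhaseEvent extends Event {
    @Label("Index") String index;
    @Label("Shard") int shardId;
}

class QueryPhaseInstrumentation {
    void runQueryPhase(String index, int shardId, Runnable queryPhase) {
        QueryPhaseEvent event = new QueryPhaseEvent();
        event.index = index;
        event.shardId = shardId;
        event.begin();            // timestamps the start of the phase
        try {
            queryPhase.run();
        } finally {
            event.commit();       // records duration + fields when a recording is active
        }
    }
}
```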

Does it make sense to create RFCs for Metrics / Tracing / JFRs and at least run some experiments to understand a) how useful that would be? b) how difficult that would be?

Thoughts?

@jkowall
Contributor

jkowall commented Dec 7, 2021

@reta OpenTracing is deprecated; it should use OpenTelemetry if anything. But yes, I agree that autoinstrumentation is not a good idea, and manually adding the code could add overhead depending on where in the code you instrument.

I agree that focusing on Metrics and Stats is a better approach. @AmiStrn was working on this earlier in the project, but we switched to other work when we realized that the governance for OpenSearch was not going to include other companies outside of AWS. When this changes, we might contribute core features to make the project better in general.
