
[RFC] Add metrics and tracing framework in Opensearch #1061

Open
itiyamas opened this issue Aug 9, 2021 · 9 comments
Labels: discuss, distributed framework, enhancement, Roadmap:Stability/Availability/Resiliency

Comments

@itiyamas
Contributor

itiyamas commented Aug 9, 2021

We have been looking into instrumenting the Opensearch code recently. Even though stats provide a good mechanism, they lose a lot of detail, like percentiles, which makes it much harder to debug issues in production. Wouldn't it be great to add a metrics framework to Opensearch that allows a developer to add metrics easily at any point in the code, without having to know the exact stats object where the metric belongs?

The framework can, for example, be integrated with RestActions and emit timing and error metrics per operation by default. Similarly, we could pass this metrics object around via ThreadContext down the executor chain and correlate timing metrics together in a single block per request. The metrics can have different logging levels, allowing us to skip or add metric calculation on the fly, similar to what we have in the logger.

Imagine that you added a new action within the bulk API chain and suddenly a few more requests start taking more time. One way to investigate this is by adding a stat for the new operation within bulk stats. But because stats are always averaged or use precomputed percentiles, it is really tricky to confirm whether the new operation is the culprit. If there were a single metrics block that allowed us to correlate these metrics, it would be really simple to determine causation.

Now that I have talked about the metric generation framework: publishing can be implemented in a pluggable fashion to different sinks. We can provide a default implementation for the metric log file format, and other sinks can be plugged in via different metrics plugins.
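To make the idea concrete, here is a minimal sketch of what such a per-request metrics block and pluggable sink could look like. All names (MetricsContext, MetricsSink, recordTiming) are hypothetical and purely illustrative, not existing OpenSearch APIs:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch only: a per-request metrics block that travels with the request
// (e.g. stashed in the ThreadContext) and is flushed to a pluggable sink on completion.
public final class MetricsContext {

    /** Pluggable destination for finished metric blocks (metric log file, exporter, ...). */
    public interface MetricsSink {
        void publish(String requestId, Map<String, Long> timingsMillis);
    }

    private final String requestId;
    private final MetricsSink sink;
    private final Map<String, Long> timingsMillis = new ConcurrentHashMap<>();

    public MetricsContext(String requestId, MetricsSink sink) {
        this.requestId = requestId;
        this.sink = sink;
    }

    /** Record (or accumulate) how long a named step of this request took. */
    public void recordTiming(String step, long elapsedMillis) {
        timingsMillis.merge(step, elapsedMillis, Long::sum);
    }

    /** Emit the whole correlated block once, e.g. when the RestAction responds. */
    public void close() {
        sink.publish(requestId, Map.copyOf(timingsMillis));
    }
}
```

A RestAction wrapper could create one MetricsContext per request, propagate it down the executor chain via the ThreadContext, and call close() in the response listener, so every timing for a request ends up in a single correlated block.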

itiyamas added the enhancement and untriaged labels on Aug 9, 2021
@jkowall
Contributor

jkowall commented Aug 9, 2021

Sounds a bit overengineered to me. I would implement OpenTelemetry, which can handle the metrics but also help with tracing (logging support is coming soon). The problem with "metrics" in OpenSearch today is that they are exposed via JMX, which is highly inefficient compared to modern approaches of exposing data as an endpoint (as in the case of Prometheus) and allowing metrics to be scraped and stored easily. Similarly, I am sure there are cases in OpenSearch where you might want to do some tracing too.
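For reference, recording a per-operation latency histogram with the OpenTelemetry Java API looks roughly like the sketch below. The instrument and attribute names are made up for illustration, and a separately configured SDK (with, say, a Prometheus or OTLP exporter) would be needed to actually export anything:

```java
import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.common.AttributeKey;
import io.opentelemetry.api.common.Attributes;
import io.opentelemetry.api.metrics.DoubleHistogram;
import io.opentelemetry.api.metrics.Meter;

public final class RestActionMetrics {
    // Meter and instrument names are illustrative, not an agreed-upon convention.
    private static final Meter METER = GlobalOpenTelemetry.getMeter("org.opensearch");
    private static final DoubleHistogram LATENCY = METER
        .histogramBuilder("rest.action.latency")
        .setUnit("ms")
        .setDescription("Per-operation REST action latency")
        .build();

    public static void record(String operation, double elapsedMillis) {
        // A histogram keeps the distribution, so percentiles and outliers survive aggregation.
        LATENCY.record(elapsedMillis,
            Attributes.of(AttributeKey.stringKey("operation"), operation));
    }
}
```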

@itiyamas
Contributor Author

itiyamas commented Aug 9, 2021

+1 on OpenTelemetry and tracing!

Opensearch does not provide hooks to collect detailed request-level metrics/tracing easily. Once we have those hooks, we can integrate OpenTelemetry for metric collection, tracing, etc. I don't think it is possible to add fine-grained metrics to the existing code with JMX. I will check it out though.

@Bukhtawar
Collaborator

Breaking down the problem

  1. Per-action slow execution can also be traced via Initial draft on slow task execution #1021
  2. We need more stats to supplement existing metrics across the various interaction layers, like time spent in the queue, network round trip, IO time, blocked time, etc. These would be aggregated over requests spanning a time frame and be easily queryable over an endpoint
  3. Per-request-ID latency breakdown and distributed tracing (a rough sketch follows after this list)
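As a rough illustration of point 3, manual instrumentation with a tracing API (OpenTelemetry is used here purely as an example; the span name and wiring are hypothetical) could produce a per-request breakdown with one child span per interaction layer:

```java
import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Scope;

public final class ShardBulkTracing {
    private static final Tracer TRACER = GlobalOpenTelemetry.getTracer("org.opensearch");

    // Wrapping each stage (queueing, transport, IO, ...) in its own span ties the
    // latency breakdown of a single request together under one trace ID.
    public void traceShardBulk(Runnable shardOperation) {
        Span span = TRACER.spanBuilder("indices:data/write/bulk[s]").startSpan();
        try (Scope ignored = span.makeCurrent()) {
            shardOperation.run();
        } catch (RuntimeException e) {
            span.recordException(e);
            throw e;
        } finally {
            span.end();
        }
    }
}
```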

@jkowall
Contributor

jkowall commented Aug 10, 2021

> Per request id based latency breakdown and distributed tracing

If you have ever tried to instrument ElasticSearch, you will learn this is a really bad idea, especially per request. It might be useful for debugging, but generally the data will make no sense. I have done this in the past.

@itiyamas
Contributor Author

@jkowall Why do you think request-based latencies are bad? Is it bad in general, just for Opensearch, or just for cases where throughput is really high?
I can think of cases where long-running requests would hold the metrics object in memory, leading to increased JVM heap usage, but that data is really small compared to the request data/partial response data itself. Memory circuit-breaker calculations can take metrics data into account and stop collecting metrics based on a user setting.
Next comes performance degradation due to the extra work needed to collect and publish metrics. That is a trade-off any user can make by changing the instrumentation log level per logger instance. Additionally, if the throughput is really high, the system can provide an option to switch to in-memory aggregates or request sampling, but I won't prematurely optimize this, as the cost of skipping observability is pretty high too.
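To illustrate the "instrumentation log level" idea from the previous paragraph, a guard along these lines (all names hypothetical, not an existing OpenSearch API) would let users dial the per-request collection cost down to zero, much like a logger level check:

```java
// Hypothetical sketch: an instrumentation level analogous to a logger level, so the
// cost of per-request metric collection can be tuned at runtime per component.
public final class RequestInstrumentation {

    public enum Level { OFF, AGGREGATE_ONLY, PER_REQUEST }

    // Could be backed by a dynamic cluster setting in a real implementation.
    private volatile Level level = Level.AGGREGATE_ONLY;

    public void setLevel(Level newLevel) {
        this.level = newLevel;
    }

    /** Cheap guard, same idea as Logger.isTraceEnabled(). */
    public boolean isPerRequestEnabled() {
        return level == Level.PER_REQUEST;
    }

    /** Time a step only when the user has opted in; otherwise run it untouched. */
    public long timeIfEnabled(Runnable step) {
        if (!isPerRequestEnabled()) {
            step.run();
            return -1L;
        }
        long start = System.nanoTime();
        step.run();
        return (System.nanoTime() - start) / 1_000_000L; // elapsed millis
    }
}
```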

The way we collect stats in Opensearch gives us averages, which ends up hiding a lot of issues. Request-based latencies help us track outliers easily. Aggregating metrics early on leaves me with very little information to troubleshoot issues after they have already occurred. For the kind of clusters I deal with on a daily basis, this information is really important; often I end up enabling trace logging after the fact and then waiting for the issue to re-occur, which sometimes never happens, or the trace logs don't have enough information. Reading the trace logs is pretty tedious at this point, and doing it across requests is a nightmare.

minalsha changed the title from "Add metrics framework in Opensearch" to "[RFC] Add metrics framework in Opensearch" on Aug 24, 2021
CEHENKLE added the discuss label on Aug 31, 2021
itiyamas changed the title from "[RFC] Add metrics framework in Opensearch" to "[RFC] Add metrics and tracing framework in Opensearch" on Sep 16, 2021
@CEHENKLE
Member

Heya @itiyamas, what are you thinking the next steps will be for this?

@jkowall
Contributor

jkowall commented Oct 12, 2021

My challenge to your suggestion is that tracing ElasticSearch is very difficult. When you install autoinstrumentation and collect traces, they will make no sense at all. I have done this with several tools and the data was useless. Additionally, the overhead of instrumentation impacted performance.

If you want to collect metrics or response data, that would be more reasonable. We actually already have something similar that @AmiStrn worked on, around the existing profiler.

@reta
Collaborator

reta commented Dec 1, 2021

@jkowall @itiyamas @Bukhtawar trying to summarize the discussion points and possible future developments on the subject of metrics / tracing / stats:

  1. Metrics. The existing OpenSearch stats APIs provide a lot of insight into the internals of each node and of the cluster in general, but the stats lack the semantics of rich metric types (e.g. histograms, ...), so it is difficult to catch outliers. Integration with OpenTelemetry is an option (alternatively, Micrometer / Dropwizard Metrics / ...).
  2. Distributed Tracing. OpenSearch has limited support for correlation ID propagation (X-Opaque-Id), which works only with some clients. It is propagated up to the task level, but the breakdown is not captured or persisted anywhere for analysis or visualization. Integration with OpenTracing is an option (autoinstrumentation is out of the picture; manual instrumentation / context propagation is required).
  3. JFR. The OpenSearch server could be instrumented to emit context-specific JFR events (a minimal event sketch follows after this list). The overhead of JFR instrumentation is very low (<3%), but the kinds and types of events are subject to discussion. JFR events are local to the node but could be streamed to a centralized location as well.
  4. Stats improvements (a large chunk of improvements is included in [Meta] BackPressure in the OpenSearch Query (Search) path #1042).
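As a sketch of what point 3 could look like, a context-specific JFR event can be declared with the jdk.jfr API roughly as follows. The event name and fields are invented for the example; choosing the actual kinds and types of events is exactly what the RFC would need to settle:

```java
import jdk.jfr.Category;
import jdk.jfr.Event;
import jdk.jfr.Label;
import jdk.jfr.Name;

// Illustrative event only; not an agreed-upon event set.
@Name("org.opensearch.QueryPhase")
@Label("Query Phase")
@Category("OpenSearch")
class QueryPhaseEvent extends Event {
    @Label("Index") String index;
    @Label("Shard") int shardId;
}

class QueryPhaseInstrumentation {
    void runQueryPhase(String index, int shardId, Runnable queryPhase) {
        QueryPhaseEvent event = new QueryPhaseEvent();
        event.index = index;
        event.shardId = shardId;
        event.begin();            // timestamps the start of the phase
        try {
            queryPhase.run();
        } finally {
            event.commit();       // records duration + fields when a recording is active
        }
    }
}
```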

Does it make sense to create RFCs for Metrics / Tracing / JFRs and at least run some experiments to understand a) how useful that would be? b) how difficult that would be?

Thoughts?

@jkowall
Contributor

jkowall commented Dec 7, 2021

@reta OpenTracing is deprecated; it should use OpenTelemetry if anything. But yes, I agree that autoinstrumentation is not a good idea, and manually adding the code could add overhead depending on where in the code you instrument.

I agree that focusing on Metrics and Stats is a better approach. @AmiStrn was working on this earlier in the project, but we switched to other work when we realized that the governance for OpenSearch was not going to include other companies outside of AWS. When this changes, we might contribute core features to make the project better in general.
