
SDK observability #2547

Open
dashpole opened this issue Jan 24, 2022 · 10 comments
Labels
area:metrics Part of OpenTelemetry Metrics area:trace Part of OpenTelemetry tracing enhancement New feature or request pkg:API Related to an API package pkg:SDK Related to an SDK package

Comments

@dashpole
Contributor

Problem Statement

Context: kubernetes/enhancements#3161 (comment)

For an application instrumented with OpenTelemetry for tracing and using the OTLP trace exporter, it isn't currently possible to monitor (with metrics) whether spans are being successfully collected and exported. For example, if my SDK cannot connect to an OpenTelemetry Collector and isn't able to send traces, I would like to be able to measure how many spans are collected versus how many fail to be sent. I would like to be able to set up SLOs to measure successful trace delivery from my applications.

Proposed Solution

After the metrics API is stable, collect metrics in the trace SDK using the metrics API. Specifics about the metrics deserve their own design, but I should be able to tell the volume of spans my application is generating, and the success rate of exporting them. This would be done via a new TracerProviderOption: WithMeterProvider(MeterProvider).

Alternatives

We could add metrics to exporters individually, but most exporter-related metrics should be similar.

@dashpole dashpole added the enhancement New feature or request label Jan 24, 2022
@MrAlias MrAlias added area:metrics Part of OpenTelemetry Metrics area:trace Part of OpenTelemetry tracing labels Feb 28, 2022
@MrAlias
Contributor

MrAlias commented Mar 1, 2022

In the meantime, while we wait for metrics to be stable enough for this, I created this: https://github.com/MrAlias/flow

@dashpole let me know if that helps.

@dashpole
Contributor Author

dashpole commented Mar 2, 2022

Very cool. I'll take a look.

@dashpole
Contributor Author

dashpole commented Mar 2, 2022

It probably isn't quite enough to meet the needs I have, but it may be useful for others.

@thehackercat

We also need this.

@MrAlias
Contributor

MrAlias commented Jan 18, 2024

@MadVikingGod is going to look into what metrics should be added and the feasibility of this feature.

@MrAlias MrAlias moved this from Needs Triage to TODO in Go: Metric SDK (Post-GA) Jan 18, 2024
@MadVikingGod
Contributor

Java does implement some metrics around the BatchSpanProcessor (BSP) and a generic wrapper for some exporters (at least gRPC, maybe more). The metrics below indicate whether I found them in Java.

How could this be implemented?

Experimental

To experiment without adding any API surface, we can start with an experimental environment variable that indicates whether to use the global metrics API. Doing this should allow us to explore the performance impact of the metrics while still maintaining compatibility.

Option API

This would add a WithMeterProvider() option to every component that produces these metrics. It could act either as an enable signal (only capture metrics if it is configured) or as an override signal (use the global API unless overridden).

This can realistically only be done for components that already use an option pattern, like the TracerProvider or the BatchSpanProcessor; components without one, like the SimpleSpanProcessor, could not have an override. We won't need an option for Samplers, because we can measure the output of the sampling decision without instrumenting the internals of that code.

If we were to add an option for both the TracerProvider and the BatchSpanProcessor, we would need a new type that is the union of both option types, similar to SpanStartEventOption.

What should be instrumented

This is a non-exhaustive list of things that could be captured:

From the Tracer

  • Number of spans started
    • Whether the span was sampled
  • Number of spans ended

From the BSP

  • Number of spans exported (this is in Java)
  • Number of spans dropped (this is in Java)
  • Number of spans currently in the queue (this is in Java)

From an exporter

  • Number of Exports
  • Number of Retries
  • Duration of Export
  • Number of Spans Exported
  • Number of Spans rejected

@logan-stytch

We're very interested in this feature so we can tune our Batcher to ensure it doesn't inadvertently drop spans. I started a WIP PR (#5201), but it definitely needs some guidance. If there are already plans to release metrics in the near-to-mid-term, we can wait, but otherwise, this seemed like a well-scoped area where we could help contribute (especially using the Java implementation as a reference).

@dashpole
Contributor Author

We discussed this at the in-person SIG meeting at KubeCon. We should:

  • Add metric instrumentation to the trace SDK
  • Add metric and trace instrumentation to the metrics SDK
  • Add metric and trace instrumentation to the logs SDK

Instrumentation should default to the global MeterProvider/TracerProvider, but also accept a WithMeterProvider/WithTracerProvider option that overrides the global (similar to a typical instrumentation library).

Another thought:

  • Should the global error handler default to using the internal logger with an error-level log? It would be nice to be able to control the behavior of the error handler by configuring the logger.

The approach for adding metric and trace instrumentation should be:

  • Define the semantic convention for SDK self-observability telemetry.
  • Implement it in our SDKs

@pellared pellared changed the title Trace SDK observability SDK observability Nov 14, 2024
@MrAlias
Contributor

MrAlias commented Nov 15, 2024

Can we start by adding this as an experimental feature? That could be used to help progress the semantic convention work without blocking on it.

This assumes the experimental approach would just use the global providers.
