SDK observability #2547

dashpole · 2022-01-24T20:06:32Z

Problem Statement

Context: kubernetes/enhancements#3161 (comment)

For an application instrumented with OpenTelemetry for tracing and using the OTLP trace exporter, it isn't currently possible to monitor (with metrics) whether spans are being successfully collected and exported. For example, if my SDK cannot connect to an OpenTelemetry Collector and isn't able to send traces, I would like to be able to measure how many spans are collected versus how many are not sent. I would like to be able to set up SLOs to measure successful trace delivery from my applications.

Proposed Solution

After the metrics API is stable, collect metrics in the trace SDK using the metrics API. Specifics about the metrics deserve their own design, but I should be able to tell the volume of spans my application is generating, and the success rate of exporting them. This would be done via a new TracerProviderOption: WithMeterProvider(MeterProvider).
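A minimal sketch of how such an option could be wired in, assuming the usual functional option pattern. Every type here (`MeterProvider`, `TracerProvider`, `tracerProviderConfig`, `namedProvider`) is a simplified stand-in defined only for this example, not the SDK's real type, and the nil-means-disabled behavior is just one possible choice:

```go
package main

import "fmt"

// MeterProvider is a stand-in for the metrics API's MeterProvider, reduced to
// a name so the example stays self-contained.
type MeterProvider interface {
	Name() string
}

// namedProvider is a hypothetical MeterProvider used only for illustration.
type namedProvider string

func (p namedProvider) Name() string { return string(p) }

// tracerProviderConfig is a hypothetical config struct for the sketch.
type tracerProviderConfig struct {
	meterProvider MeterProvider
}

// TracerProviderOption configures the stand-in TracerProvider.
type TracerProviderOption func(*tracerProviderConfig)

// WithMeterProvider is the option proposed in this issue: it hands the trace
// SDK a MeterProvider to record its own metrics with.
func WithMeterProvider(mp MeterProvider) TracerProviderOption {
	return func(c *tracerProviderConfig) { c.meterProvider = mp }
}

// TracerProvider is a stand-in that only remembers its configuration.
type TracerProvider struct {
	cfg tracerProviderConfig
}

// NewTracerProvider applies the options in order.
func NewTracerProvider(opts ...TracerProviderOption) *TracerProvider {
	tp := &TracerProvider{}
	for _, opt := range opts {
		opt(&tp.cfg)
	}
	return tp
}

// MetricsEnabled reports whether a MeterProvider was supplied.
func (tp *TracerProvider) MetricsEnabled() bool {
	return tp.cfg.meterProvider != nil
}

func main() {
	tp := NewTracerProvider(WithMeterProvider(namedProvider("app-metrics")))
	fmt.Println("metrics enabled:", tp.MetricsEnabled())
}
```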

Alternatives

We could add metrics to exporters individually, but most exporter-related metrics should be similar.


MrAlias · 2022-03-01T23:43:08Z

In the meantime, while we wait for metrics to be stable enough for this, I created this: https://github.com/MrAlias/flow

@dashpole let me know if that helps.

dashpole · 2022-03-02T14:38:56Z

Very cool. I'll take a look

dashpole · 2022-03-02T20:13:27Z

It probably isn't quite enough to meet the needs I have, but may be useful for others

thehackercat · 2022-08-29T11:31:35Z

we also need this.

MrAlias · 2024-01-18T18:17:12Z

@MadVikingGod is going to look into what metrics should be added and the feasibility of this feature.

MadVikingGod · 2024-01-25T21:27:19Z

Java does implement some metrics around the BatchSpanProcessor (BSP) and a generic wrapper for some exporters (at least gRPC, maybe more). In the list below, I note which metrics I found in Java.

How could this be implemented?

Experimental

To experiment with this without adding any API surface, we can start with an experimental environment variable that indicates whether the SDK should use the global metrics API. Doing this should allow us to explore the performance impact of any of the metrics while still maintaining compatibility.
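A rough sketch of such a gate, assuming a boolean-style variable. The name `OTEL_GO_X_EXAMPLE_SDK_METRICS` is made up for this example and is not a variable the SDK defines; the lookup function is passed in so the check can be exercised without touching the real environment:

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// envSDKMetrics is a hypothetical variable name chosen for this sketch.
const envSDKMetrics = "OTEL_GO_X_EXAMPLE_SDK_METRICS"

// sdkMetricsEnabled reports whether the experimental gate is switched on.
// Anything other than a case-insensitive "true" leaves the feature off, so
// the default behavior is unchanged for existing users.
func sdkMetricsEnabled(lookup func(string) (string, bool)) bool {
	v, ok := lookup(envSDKMetrics)
	if !ok {
		return false
	}
	return strings.EqualFold(strings.TrimSpace(v), "true")
}

func main() {
	// os.LookupEnv matches the lookup signature, so it can be passed directly.
	fmt.Println("SDK metrics enabled:", sdkMetricsEnabled(os.LookupEnv))
}
```

When the gate is on, the SDK would create its instruments from the global MeterProvider; when it is off, no instruments are created at all.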

Option API

This would add a WithMeterProvider() option everywhere these metrics are produced. The option could act either as an enable signal (metrics are captured only if it is configured) or as an override signal (the given provider is used instead of the global API).

This can realistically only be done for types that already use an option pattern, like the TracerProvider or the BatchSpanProcessor. Components without options, like the SimpleSpanProcessor, could not have an override. We won't need an option for Samplers, because we can measure the output of the sampling decision without instrumenting the sampler's internals.

If we were to add an option for both the TP and the BSP, we would need a new type that is the union of both option types, similar to SpanStartEventOption.
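For illustration, here is one way the union could look. SpanStartEventOption in the trace API is an interface that embeds both SpanStartOption and EventOption; the sketch below applies the same idea to two option interfaces. All names (`MeterProviderOption`, the config structs, the `apply*` methods) are hypothetical stand-ins, not the SDK's actual types:

```go
package main

import "fmt"

// MeterProvider is a stand-in for the metrics API's MeterProvider.
type MeterProvider interface {
	Name() string
}

type namedProvider string

func (p namedProvider) Name() string { return string(p) }

// Hypothetical config structs for the two components.
type tracerProviderConfig struct{ mp MeterProvider }
type batchSpanProcessorConfig struct{ mp MeterProvider }

// Each component keeps its own option interface.
type TracerProviderOption interface {
	applyTracerProvider(*tracerProviderConfig)
}

type BatchSpanProcessorOption interface {
	applyBatchSpanProcessor(*batchSpanProcessorConfig)
}

// MeterProviderOption is the union type: a value of this type is accepted
// wherever either of the embedded option types is expected.
type MeterProviderOption interface {
	TracerProviderOption
	BatchSpanProcessorOption
}

type meterProviderOption struct{ mp MeterProvider }

func (o meterProviderOption) applyTracerProvider(c *tracerProviderConfig) { c.mp = o.mp }

func (o meterProviderOption) applyBatchSpanProcessor(c *batchSpanProcessorConfig) { c.mp = o.mp }

// WithMeterProvider returns an option usable by both constructors.
func WithMeterProvider(mp MeterProvider) MeterProviderOption {
	return meterProviderOption{mp: mp}
}

func newTracerProviderConfig(opts ...TracerProviderOption) tracerProviderConfig {
	var c tracerProviderConfig
	for _, o := range opts {
		o.applyTracerProvider(&c)
	}
	return c
}

func newBatchSpanProcessorConfig(opts ...BatchSpanProcessorOption) batchSpanProcessorConfig {
	var c batchSpanProcessorConfig
	for _, o := range opts {
		o.applyBatchSpanProcessor(&c)
	}
	return c
}

func main() {
	opt := WithMeterProvider(namedProvider("shared"))
	tp := newTracerProviderConfig(opt)
	bsp := newBatchSpanProcessorConfig(opt)
	fmt.Println(tp.mp.Name(), bsp.mp.Name())
}
```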

What should be instrumented

This is a non-exhaustive list of things that could be captured

From the Tracer

- Number of Spans Started
  - Was it Sampled
- Number of Spans Ended

From the BSP

- Number of Spans Exported (This is in Java)
- Number of Spans Dropped (This is in Java)
- Number of Spans currently in the queue (This is in Java)

From an exporter

- Number of Exports
- Number of Retries
- Duration of Export
- Number of Spans Exported
- Number of Spans Rejected
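As a sketch of the exporter part of this list, a wrapper can record these counts around any exporter without touching its internals. `Span`, `Exporter`, `exportStats`, and `flakyExporter` are simplified stand-ins invented for the example, plain atomic counters take the place of metrics API instruments, and a failed call counts the whole batch as rejected (partial success and retries are not modeled):

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync/atomic"
	"time"
)

// Span and Exporter are simplified stand-ins for the SDK's span and exporter types.
type Span struct{ Name string }

type Exporter interface {
	ExportSpans(ctx context.Context, spans []Span) error
}

// exportStats holds the measurements from the list above as plain counters.
type exportStats struct {
	exports       atomic.Int64 // number of export calls
	spansExported atomic.Int64 // spans in successful calls
	spansRejected atomic.Int64 // spans in failed calls
	totalDuration atomic.Int64 // cumulative export time in nanoseconds
}

// instrumentedExporter wraps another Exporter and records stats around it.
type instrumentedExporter struct {
	next  Exporter
	stats *exportStats
}

func (e *instrumentedExporter) ExportSpans(ctx context.Context, spans []Span) error {
	start := time.Now()
	err := e.next.ExportSpans(ctx, spans)
	e.stats.totalDuration.Add(int64(time.Since(start)))
	e.stats.exports.Add(1)
	if err != nil {
		e.stats.spansRejected.Add(int64(len(spans)))
		return err
	}
	e.stats.spansExported.Add(int64(len(spans)))
	return nil
}

// flakyExporter fails every second call to simulate an unreachable collector.
type flakyExporter struct{ calls int }

func (f *flakyExporter) ExportSpans(_ context.Context, _ []Span) error {
	f.calls++
	if f.calls%2 == 0 {
		return errors.New("collector unavailable")
	}
	return nil
}

// exportTwice sends two batches of two spans through the wrapper.
func exportTwice() *exportStats {
	stats := &exportStats{}
	exp := &instrumentedExporter{next: &flakyExporter{}, stats: stats}
	batch := []Span{{Name: "a"}, {Name: "b"}}
	_ = exp.ExportSpans(context.Background(), batch) // succeeds
	_ = exp.ExportSpans(context.Background(), batch) // fails
	return stats
}

func main() {
	s := exportTwice()
	fmt.Printf("exports=%d exported=%d rejected=%d\n",
		s.exports.Load(), s.spansExported.Load(), s.spansRejected.Load())
}
```

The ratio of exported spans to exported plus rejected spans is the kind of signal a delivery SLO could be built on.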

logan-stytch · 2024-04-22T16:26:55Z

We're very interested in this feature so we can tune our Batcher to ensure it doesn't inadvertently drop spans. I started a WIP PR (#5201), but it definitely needs some guidance. If there are already plans to release metrics in the near-to-mid-term, we can wait, but otherwise, this seemed like a well-scoped area where we could help contribute (especially using the Java implementation as a reference).

dashpole · 2024-11-14T23:49:04Z

Discussed this at the in-person SIG meeting at KubeCon. We should:

- Add metric instrumentation to the trace SDK
- Add metric and trace instrumentation to the metrics SDK
- Add metric and trace instrumentation to the logs SDK

Instrumentation should default to the global meterprovider/tracerprovider, but also accept a WithMeterProvider/WithTracerProvider option that overrides the global (similar to a typical instrumentation library).
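A small sketch of that resolution order, assuming a package-level global like the one instrumentation libraries typically fall back to. The global registry, `config`, and `Option` below are stand-ins written for this example rather than the real otel package:

```go
package main

import (
	"fmt"
	"sync"
)

// MeterProvider is a stand-in for the metrics API's MeterProvider.
type MeterProvider interface {
	Name() string
}

type namedProvider string

func (p namedProvider) Name() string { return string(p) }

// A stand-in for the global provider registry.
var (
	globalMu sync.RWMutex
	globalMP MeterProvider = namedProvider("noop")
)

func SetGlobalMeterProvider(mp MeterProvider) {
	globalMu.Lock()
	defer globalMu.Unlock()
	globalMP = mp
}

func GlobalMeterProvider() MeterProvider {
	globalMu.RLock()
	defer globalMu.RUnlock()
	return globalMP
}

// config and Option are hypothetical; they represent any SDK component
// that emits its own telemetry.
type config struct{ meterProvider MeterProvider }

type Option func(*config)

// WithMeterProvider overrides the global provider for one component.
func WithMeterProvider(mp MeterProvider) Option {
	return func(c *config) { c.meterProvider = mp }
}

// newConfig applies the options, then falls back to the global provider
// when no override was given.
func newConfig(opts ...Option) config {
	var c config
	for _, opt := range opts {
		opt(&c)
	}
	if c.meterProvider == nil {
		c.meterProvider = GlobalMeterProvider()
	}
	return c
}

func main() {
	SetGlobalMeterProvider(namedProvider("global"))
	fmt.Println(newConfig().meterProvider.Name())
	fmt.Println(newConfig(WithMeterProvider(namedProvider("custom"))).meterProvider.Name())
}
```

The same shape would apply to a WithTracerProvider option for the metrics and logs SDKs.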

Another thought:

Should the global error handler default to using the internal logger with an error-level log?

opentelemetry-go/internal/global/handler.go, line 30 in 128a6b8:

log.Print(err)

It would be nice to be able to control the behavior of the error handler by configuring the logger.
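To make the idea concrete, here is a sketch of an error handler that delegates to a configurable logger at error level. It uses the standard library's `log/slog` as a stand-in for the SDK's internal logger, and `loggingErrorHandler` and `newBufferedHandler` are names invented for the example; only the single-method `Handle(error)` shape follows the existing ErrorHandler interface:

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
	"log/slog"
)

// ErrorHandler has the same single-method shape as the SDK's error handler.
type ErrorHandler interface {
	Handle(err error)
}

// loggingErrorHandler forwards errors to a logger at error level, so the
// logger's configuration decides whether and where they are written.
type loggingErrorHandler struct {
	logger *slog.Logger
}

func (h loggingErrorHandler) Handle(err error) {
	if err == nil {
		return
	}
	h.logger.Error("opentelemetry sdk error", "error", err)
}

// errExportFailed is a sample error used by the demo.
var errExportFailed = errors.New("export failed")

// newBufferedHandler builds a handler whose logger writes to an in-memory
// buffer and drops records below minLevel.
func newBufferedHandler(minLevel slog.Level) (ErrorHandler, *bytes.Buffer) {
	buf := &bytes.Buffer{}
	logger := slog.New(slog.NewTextHandler(buf, &slog.HandlerOptions{Level: minLevel}))
	return loggingErrorHandler{logger: logger}, buf
}

func main() {
	// With the threshold at error level, the error is written.
	h, buf := newBufferedHandler(slog.LevelError)
	h.Handle(errExportFailed)
	fmt.Print(buf.String())

	// Raising the threshold above error level silences the handler.
	quiet, quietBuf := newBufferedHandler(slog.LevelError + 4)
	quiet.Handle(errExportFailed)
	fmt.Println("silenced output bytes:", quietBuf.Len())
}
```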

The approach for adding metric and trace instrumentation should be:

- Define the semantic convention for SDK self-observability telemetry.
- Implement it in our SDKs.

MrAlias · 2024-11-15T16:46:12Z

Can we start by adding this as an experimental feature? That can be used to help progress the semantic convention work without blocking this.

This assumes the experimental approach would just use the global providers.

dashpole · 2024-11-18T15:23:23Z

We could start with metrics similar to the ones in the Java SDK:

dashpole added the enhancement New feature or request label Jan 24, 2022

dashpole mentioned this issue Jan 24, 2022

KEP-647: Update apiserver tracing KEP to beta for 1.24 kubernetes/enhancements#3161

Merged

MrAlias added area:metrics Part of OpenTelemetry Metrics area:trace Part of OpenTelemetry tracing labels Feb 28, 2022

MrAlias mentioned this issue Sep 7, 2022

Move partialsuccess code to internal package #3146

Merged

MrAlias added this to Go: Metric SDK (Post-GA) Oct 20, 2022

MrAlias moved this to Backlog in Go: Metric SDK (Post-GA) Oct 20, 2022

MrAlias added pkg:API Related to an API package pkg:SDK Related to an SDK package labels Oct 20, 2022

jeremyrickard mentioned this issue Feb 2, 2023

KEP-2831: adding beta graduation criteria kubernetes/enhancements#3714

Merged

MrAlias assigned MadVikingGod Jan 18, 2024

MrAlias moved this from Needs Triage to TODO in Go: Metric SDK (Post-GA) Jan 18, 2024

MadVikingGod removed their assignment Jan 25, 2024

logan-stytch mentioned this issue Apr 12, 2024

[WIP] Add WithMetricProvider for BatchSpanProcessor #5201

Closed

pellared changed the title ~~Trace SDK observability~~ SDK observability Nov 14, 2024
