Client libraries: Manual/programmatic and auto instrumentation duplication problem #1767

lmolkova · 2021-06-16T21:46:36Z

What are you trying to achieve?

I'm working on Azure SDK instrumentation and we have a problem with double instrumentation on the HTTP layer.
Here's the context:

we want to support users who auto-instrument (with bytecode agent, monkey-patching, diagnostic sources) and those who don't - it's not in our control.
we instrument public API calls like 'upload blob' (so users know what happens on the public surface of the library and can map spans to their code). Public API calls translate into a number of REST calls (retries, auth, complex operations) and underlying HTTP calls give users insights into what happens in the SDK too.
we want to instrument HTTP layer so we can
- add extra properties that are specific to Azure and provide a better experience to users (e.g. users need them to get official support). We occasionally want to strip some sensitive parameters. All those are per-http request and have no place on the parent logical span.
- support legacy context propagation protocols with Azure services
- we can't rely on the auto-instrumentation (even if it's on) because we can't augment the span that will be created with auto-instrumentation on a lower API level with all the details that are important.

So users who don't use auto-instrumentation have spans representing HTTP calls from our client libraries, users who use auto-instrumentation, would get two spans for the same HTTP call.

Generalizing this problem beyond Azure SDKs:

TL;DR:
- Client libraries cannot trace common protocol-level calls if those could also be traced by auto-instrumentation.
- If they do (when auto-instrumentation is on), duplicate spans will be created and context propagation will be broken.
- Libraries could make their programmatic tracing configurable, but then they cannot inject service-specific data into auto-instrumented RPC-call span

Potential solutions

Backoff if the context is injected. If the context is already injected on the request (HTTP/gRPC/anything else), there is no other good option than to back off. Options are:
- re-instrument, i.e. create a new span and inject header (replace or add another value)? Then the logic that injected the header (and created a previous span) will be broken. There is no way to suppress that span.
- re-instrument, but not inject header? Then this instrumentation layer is broken - there is no reason to export this span
- don't instrument: ok, someone above already instrumented this request and perhaps created a span, nothing else to do. It seems nothing is broken and we didn't even create a span

This approach also means that the user's manual instrumentation always wins, which seems like a good default to have in terms of supportability. This is really short-term mitigation for a subset of double-instrumentation problems.

A similar contract might exist for server-side auto-instrumentation: if there is a current span already - back-off, but I can imagine it can go sideways if requests start from a thread that has span (e.g. created in start-up) by mistake.

Long-term solution: some level of coordination between layers of instrumentation. Some discussion on it Define span modelling conventions for HTTP client spans #1747.
- Down-to-earth approach we discussed a while back in the OpenCensus community was Terminal Context (perhaps terminology is outdated, happy to update if there is an interest). This only works for client-side instrumentation.
- anything else?

Existing discussions

Suppressing instrumentation implementations in OTel

Please note that context suppression is not really possible in client libraries as it requires dependency on instrumentation packages (to export suppress key). So this is not a viable workaround.

Happy to hear suggestions and drive it.

jkwatson · 2021-06-29T22:37:02Z

re-instrument, i.e. create a new span and inject header (replace or add another value)? Then the logic that injected the header (and created a previous span) will be broken. There is no way to suppress that span.

Why does this break the other span? I would assume that the newly created span will be the child of the one that first tried the injection, and so replacing the outgoing headers seems safe to me. Am I missing something?

lmolkova · 2021-06-29T23:29:40Z

Good point. It will be child only if the previous span was current, but there is no requirement to make it current and there is no reliable mechanism for context propagation that would work everywhere.

Also, it does not seem beneficial: it seems that if something has added headers, re-doing it is likely redundant.

lmolkova · 2021-06-30T00:15:17Z

Another issue is increased costs, users pay for telemetry either to vendors or through operational costs. If we agree this case is legit, we're talking about a significant increase in cost.

jkwatson · 2021-06-30T01:24:22Z

FYI, we (Splunk) have customers who are demanding the multiple layers like this all be represented, so there probably needs to be configurability around how this is handled.

lmolkova · 2021-06-30T02:27:04Z

@jkwatson Do multiple layers inject headers? If not - there could be any number of layers, just the one that injects headers wins.
Also, behavior is currently language-specific (e.g. .NET instrumentation backs-off, Java agent replaces header, not sure about others), so does it work now?

jkwatson · 2021-06-30T02:44:41Z

@jkwatson Do multiple layers inject headers? If not - there could be any number of layers, just the one that injects headers wins.
Also, behavior is currently language-specific (e.g. .NET instrumentation backs-off, Java agent replaces header, not sure about others), so does it work now?

There's less of an issue right now with double-injecting headers (the "Setter" spec does imply that you overwrite the headers, so that's really not an issue, I don't think). The issue is really that the javaagent has a policy of enforcing only a single "CLIENT" span, which means that multiple layers are being suppressed by the javaagent instrumentation currently. However, there is now an open discussion to deal with this and optionally allow nested CLIENT spans to proceed: open-telemetry/opentelemetry-java-instrumentation#3318

It would be good to have this be a part of the spec, though, and not just something pushed because one vendor has customers that want it. :)

kenfinnigan · 2021-07-08T19:55:41Z

I came across this very issue today in working on tracing in Quarkus.

I had the Vert.x Tracing SPI creating a HTTP CLIENT span after a JAX-RS ClientRequestFilter had already created an HTTP CLIENT span. It meant I had one CLIENT span with a child of the SERVER span it called, and another CLIENT span completely separate.

I ended up checking for traceparent in the outgoing request headers inside the Vert.x Tracing SPI and didn't create a new span if it was set.

Would be good to have a defined approach to how to know a duplicate span would be created if a piece of code created a new span.

lmolkova · 2021-07-18T18:05:51Z

More on this in #1811: backing off is not a great strategy because of potential reuse of http request instances #1811 (comment).

Additional context.
It not very common to reuse HTTP requests between retries, but e.g. golang http client allows that and there are likey other clients/cases where it happens.

[just as a tangent: I was surprised to find recently in https://github.com/open-telemetry/opentelemetry-java-instrumentation/pull/2727 that reusing HTTP requests is supported by most of the 20+ Java http client libraries that we instrument]

I agree that we need a way to identify if we are in a nested CLIENT span, for a couple of reasons:

to avoid overwriting traceparent from the outer CLIENT span

to conditionally suppress nested CLIENT spans (based on some kind of user verbosity preference)

It sounds like there are at least a couple of options to accomplish this:

use the presence of traceparent to know that we are in a nested CLIENT span (this requires clearing traceparent at the end of each request if http request reuse is supported by the instrumented library)

use some kind of marker in the Context to know that we are in a nested CLIENT span (this requires setting the marker in the context and propagating it to any downstream libraries that may also be instrumented)

Maybe we can focus initially on

Getting agreement that we really do need a way to identify if we are in a nested CLIENT span

Listing options to accomplish that

lmolkova · 2021-07-18T18:12:16Z

We need simple way to prevent multiple levels of HTTP instrumentation (duplicates from client libraries/users AND auto-instrumentation). Initial discussion on Java instrumentation SIG 2021/7/15 (UTC) https://docs.google.com/document/d/1WK9h4p55p8ZjPkxO75-ApI9m0wfea6ENZmMoFRvXSCw/edit

lmolkova added the spec:trace Related to the specification/trace directory label Jun 16, 2021

github-actions bot assigned bogdandrutu Jun 16, 2021

Oberon00 added area:semantic-conventions Related to semantic conventions area:span-relationships Related to span relationships labels Jun 22, 2021

srikanthccv mentioned this issue Jun 24, 2021

Server instrumentations should look for parent spans in current context before extracting context from carriers open-telemetry/opentelemetry-python-contrib#445

Closed

This was referenced Jun 29, 2021

HTTP spans are duplicated if auto-instumentation in OpenTelemetry is enabled Azure/azure-sdk-for-python#19573

Closed

Add context propagation requirements to HTTP conventions #1783

Closed

lmolkova mentioned this issue Jun 30, 2021

Draft: suppress instrumentation flag open-telemetry/opentelemetry-java-instrumentation#3443

Closed

lmolkova mentioned this issue Jul 9, 2021

Cleaning up context on reusable carriers before or after injection #1811

Closed

lmolkova mentioned this issue Sep 24, 2021

Instrumentation layers and suppressing duplicates open-telemetry/oteps#172

Closed

lmolkova mentioned this issue Jan 24, 2023

add SuppressTracing flag #3103

Closed

lmolkova mentioned this issue Jun 7, 2023

Is a "generic solution" for not duplicating spans still under construction? Is there any reference open-telemetry/opentelemetry.io#2827

Open

This was referenced Jul 6, 2023

Allow suppressing nested activities in native instrumentations without dependency on OTel SDK open-telemetry/opentelemetry-dotnet#4641

Closed

[BUG] Duplication between Azure.Core.Http spans and HTTP client ones Azure/azure-sdk-for-net#37446

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Client libraries: Manual/programmatic and auto instrumentation duplication problem #1767

Client libraries: Manual/programmatic and auto instrumentation duplication problem #1767

lmolkova commented Jun 16, 2021 •

edited

Loading

jkwatson commented Jun 29, 2021

lmolkova commented Jun 29, 2021 •

edited

Loading

lmolkova commented Jun 30, 2021 •

edited

Loading

jkwatson commented Jun 30, 2021

lmolkova commented Jun 30, 2021

jkwatson commented Jun 30, 2021

kenfinnigan commented Jul 8, 2021

lmolkova commented Jul 18, 2021 •

edited

Loading

lmolkova commented Jul 18, 2021 •

edited

Loading

Client libraries: Manual/programmatic and auto instrumentation duplication problem #1767

Client libraries: Manual/programmatic and auto instrumentation duplication problem #1767

Comments

lmolkova commented Jun 16, 2021 • edited Loading

Potential solutions

Existing discussions

Suppressing instrumentation implementations in OTel

jkwatson commented Jun 29, 2021

lmolkova commented Jun 29, 2021 • edited Loading

lmolkova commented Jun 30, 2021 • edited Loading

jkwatson commented Jun 30, 2021

lmolkova commented Jun 30, 2021

jkwatson commented Jun 30, 2021

kenfinnigan commented Jul 8, 2021

lmolkova commented Jul 18, 2021 • edited Loading

lmolkova commented Jul 18, 2021 • edited Loading

lmolkova commented Jun 16, 2021 •

edited

Loading

lmolkova commented Jun 29, 2021 •

edited

Loading

lmolkova commented Jun 30, 2021 •

edited

Loading

lmolkova commented Jul 18, 2021 •

edited

Loading

lmolkova commented Jul 18, 2021 •

edited

Loading