Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

AWS Lambda spec inappropriately prioritizes x-ray propagation #3060

Closed
tylerbenson opened this issue Dec 22, 2022 · 9 comments · Fixed by #3166
Closed

AWS Lambda spec inappropriately prioritizes x-ray propagation #3060

tylerbenson opened this issue Dec 22, 2022 · 9 comments · Fixed by #3166
Assignees
Labels
spec:trace Related to the specification/trace directory

Comments

@tylerbenson
Copy link
Member

tylerbenson commented Dec 22, 2022

What are you trying to achieve?
Propagation should follow the configuration provided to OpenTelemetry, not be dynamically influenced by external state.

Currently, the spec states that instrumentation should first evaluate x-ray propagation from the _X_AMZN_TRACE_ID environment variable before propagating the context from the lambda event.

This means that propagation could be working fine as configured, but if someone enables x-ray tracing, the resulting spans will be broken into separate traces (inconsistently depending on the x-ray sampling rate).

Options to resolve this:

  • Shift responsibility for evaluating the environment variable to the x-ray propagator.
  • Create a new propagator specific for evaluating the environment variable that should be configured ahead of the x-ray propagator.
  • Create a new propagator that can handle both environment variable and carrier propagation appropriately that would be used for lambda instead of the current x-ray propagator. (X-ray propagator could even dynamically choose this at startup if running in a lambda environment.)
  • Expose the propagation settings via API so the instrumentation can determine if x-ray propagation is being used and prioritize the environment variable propagation appropriately.
  • Create a hybrid carrier in the instrumentation that prioritizes the environment variable for the x-ray specific key. This would allow existing propagators to mostly work as-is with the logic pushed into the carrier.

I'm sure there are other ideas and look forward to discussing in the SIG.

Additional context.
https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/trace/semantic_conventions/instrumentation/aws-lambda.md#determining-the-parent-of-a-span

@tylerbenson tylerbenson added the spec:trace Related to the specification/trace directory label Dec 22, 2022
@Aneurysm9
Copy link
Member

Options to resolve this:

  • Shift responsibility for evaluating the environment variable to the x-ray propagator.

This is a non-starter from my perspective as it inappropriately conflates the propagator interface (extract context from/inject context to a carrier) with the instrumentation scope of responsibility (determine what to use as the carrier for a propagator and what to do with the result). The choice to prefer an environment variable over the provided carrier is only relevant to this particular instrumentation and the general-purpose X-Ray propagator should not be burdened with knowledge of its existence.

  • Create a new propagator specific for evaluating the environment variable that should be configured ahead of the x-ray propagator.

  • Create a new propagator that can handle both environment variable and carrier propagation appropriately that would be used for lambda instead of the current x-ray propagator. (X-ray propagator could even dynamically choose this at startup if running in a lambda environment.)

  • Expose the propagation settings via API so the instrumentation can determine if x-ray propagation is being used and prioritize the environment variable propagation appropriately.

I think these are all variations on the same theme and overindex on the propagator as the locus of the problem. In addition to suffering from the above-mentioned interface conflation, each brings additional complexity and adds potential for confusion in selecting which propagator to use in which circumstances. This is not something, I don't think, that needs to be made the user's responsibility.

  • Create a hybrid carrier in the instrumentation that prioritizes the environment variable for the x-ray specific key. This would allow existing propagators to mostly work as-is with the logic pushed into the carrier.

I worry that this may lead to confusion for users if it is not explicitly opt-in. It could result in the context being extracted from the environment when the user expected it to be extracted from the event.

I think a minor tweak to this would result in the EventToCarrier interface used by the Go Lambda instrumentation that allows giving the user control over determining how to construct the carrier used for context propagation. Providing a generalized composite implementation and an implementation that can produce a carrier from the environment should allow for implementing the behavior currently specified and also allow a user to simply provide an alternate implementation tailored to their event structure if they are not interested in interacting with the trace context provided by the Lambda execution environment.

All of the options discussed to this point have looked at extracting a single trace context to use as the parent context for new spans created by the instrumentation. An alternative we should consider is giving an option to use the trace context extracted from the Lambda execution environment 1) as the parent context for new spans, 2) as a link added to new spans with a parent context extracted from the incoming event, or 3) not at all.

@tylerbenson
Copy link
Member Author

tylerbenson commented Dec 28, 2022

I don't understand how a propagator that returns a composite parent context would work with the existing API design. Perhaps you could elaborate on how you expect this to function?

EventToCarrier isn't coupled with the propagator and thus wouldn't know if x-ray propagation is prioritized without a config option to specify such.

I still think doing this inside the propagator system is the best option. The only other option not listed above is to formalize what some languages have already done and add a setting to enable/disable prioritizing the environment variable, but I don't love that and if we did add it I would say the default should be disabled.

@Aneurysm9
Copy link
Member

I don't understand how a propagator that returns a composite parent context would work with the existing API design. Perhaps you could elaborate on how you expect this to function?

There is no such thing as a "composite parent context". Trace contexts are scalar values. Which part of my response did you feel was suggesting a composite context? I may have been unclear with referents.

EventToCarrier isn't coupled with the propagator and thus wouldn't know if x-ray propagation is prioritized without a config option to specify such.

That is correct, and that's because this isn't a propagator issue but an instrumentation issue where the instrumentation needs to know how to provide an appropriate carrier to the configured propagator such that the desired behavior is achieved. The EventToCarrier implementation doesn't need to know what propagator is configured. It provides a carrier that the configured propagator will use. In an implementation that combines data from the Lambda execution environment with data from the incoming event it may result in adding data to the carrier that is completely ignored by the configured propagator. That's perfectly fine.

Imagine, for instance, an EventToCarrier implementation that treats the incoming event as JSON-encoded map[string]any:

func eventToCarrier(eventJSON []byte) propagation.TextMapCarrier {
  res := map[string]any{}
  err := json.Unmarshal(eventJSON, &res)
  if err != nil {
    return map[string]any{}
  }
  
  if xrayctx, ok := os.LookupEnv("_X_AMZN_TRACE_ID"); ok {
    res["_X_AMZN_TRACE_ID"] = xrayctx
  }

  return res
}

If I use this with the X-Ray propagator then it will discover the _X_AMZN_TRACE_ID value that was taken from the environment (if present). If I use the w3c propagator then it will be looking for tracecontext and tracestate entries in the carrier and will ignore the _X_AMZN_TRACE_ID value.

More interesting would be a composite that could augment any EventToCarrier implementation with the _X_AMZN_TRACE_ID value from the environment:

func CompositeEventToCarrier(base EventToCarrier) EventToCarrier {
  return func(eventJSON []byte) propagation.TextMapCarrier {
    res := base(eventJSON)

    if xrayctx, ok := os.LookupEnv("_X_AMZN_TRACE_ID"); ok {
      res.Set("_X_AMZN_TRACE_ID", xrayctx)
    }

    return res
  }
}

Then, as a user with a custom event structure that I have my own EventToCarrier implementation for I can set up my Lambda instrumentation with both like this:

handler := otellambda.Wrap(origHandler, otellambda.WithEventToCarrier(otellambda.CompositeEventToCarrier(myE2C)))

I still think doing this inside the propagator system is the best option. The only other option not listed above is to formalize what some languages have already done and add a setting to enable/disable prioritizing the environment variable, but I don't love that and if we did add it I would say the default should be disabled.

I think you think that the "propagator system" is more than it actually is and we're talking past each other. The propagators are, at heart, nothing more than a pair of pure functions: extract(carrier) context and inject(context, carrier). An implementation does nothing more than look at a carrier to try to find a context it understands or take a context and put it into a carrier in a way it understands. The carrier interface is similarly simple: get(key) value and set(key, value).

Propagators are able to function effectively across context encoding types and carrier types because they are so simple and depend on simple interfaces. Anything more complicated, such as "which carrier should be given to a propagator" is a concern for the instrumentation. HTTP instrumentation can know to make a carrier from the HTTP headers. Kafka instrumentation can know to make a carrier from Kafka message headers. gRPC instrumentation can know to create a carrier from message metadata. And the Lambda instrumentation can know to create a carrier from the Lambda execution environment and incoming event. Doing so will allow the instrumentation to function effectively without regard to which propagators are in use.

Where the current spec goes wrong is in requiring that the X-Ray propagator be used if the _X_AMZN_TRACE_ID environment variable is set regardless of the user intent expressed through the configured (or global) propagator. If the configured propagator is also the X-Ray propagator that's likely going to result in correct/expected behavior. If the configured propagator is any other propagator that is likely to result in unexpected behavior that some users will find to be incorrect. That is the thing we need to fix, not how propagators function.

@tsloughter
Copy link
Member

Propagation should follow the configuration provided to OpenTelemetry, not be dynamically influenced by external state.

I don't necessarily agree with that, but could see value in an environment variable to disable X-Ray propagation even if _X_AMZN_TRACE_ID is set.

An additional option, something I've been mulling a bit with respect to non-lambda issues similar to this, is a configuration option to say that the Trace from X-Ray should be used as a Link instead of a parent.

@tylerbenson
Copy link
Member Author

tylerbenson commented Dec 29, 2022

@Aneurysm9 sorry, instead of saying composite parent context I meant to say composite carrier.
The java implementation doesn't have the EventToCarrier abstraction, so maybe I'm not fully understanding your suggestion but perhaps it is like this...

In the Java API, the extract method looks like this:

  <C> Context extract(Context context, @Nullable C carrier, TextMapGetter<C> getter);

The instrumentation chooses which carrier to provide (for the case of lambda instrumentation, it's the AwsLambdaRequest.getHeaders()) and the getter (MapGetter.INSTANCE).

Instead of MapGetter.INSTANCE the instrumentation could provide a custom getter that uses the environment variable if _X_AMZN_TRACE_ID is passed into the getter. The getter would need to check if the environment variable has a valid span context and if not, delegate to the regular MapGetter.INSTANCE. Is that in line with what you were suggesting?

(One issue I noticed in your proposal for EventToCarrier is it doesn't validate if the environment variable has a valid sampling parent context, so it would have a different behavior than the current spec.)

@tsloughter I like that idea but I would suggest it should always be the case.
Instead of having this behavior where the x-ray propagation ignores the context from the parent if the environment variable is set, it should always use the context from the event as the parent and link to the environment variable context if present. I'm not sure what implications this would have on x-ray though.

@Aneurysm9
Copy link
Member

The java implementation doesn't have the EventToCarrier abstraction, so maybe I'm not fully understanding your suggestion

This will be one of the first items on the agenda for the SIG, then. I think it is important to provide customization opportunities at that interface.

The instrumentation chooses which carrier to provide (for the case of lambda instrumentation, it's the AwsLambdaRequest.getHeaders()) and the getter (MapGetter.INSTANCE).

And this is why it is important to have an EventToCarrier abstraction. Not every Lambda invocation carries an event that is a "request" or has "headers". Some may be carrying log records or custom events that have an embedded trace context and require the application to provide parsing of the event to extract a generic map structure that can be used with the stock MapGetter (or something that implements a carrier interface in languages like Go that do not use getters/setters).

Instead of MapGetter.INSTANCE the instrumentation could provide a custom getter that uses the environment variable if _X_AMZN_TRACE_ID is passed into the getter. The getter would need to check if the environment variable has a valid span context and if not, delegate to the regular MapGetter.INSTANCE. Is that in line with what you were suggesting?

I'm not sure a custom getter implementation is the answer. As the getter/setter are optional and not implemented in all languages, it wouldn't provide a consistent solution. I also don't think it necessarily helps solve the custom event interaction use case. There would still need to be something that can underly that getter and get fields from the incoming event, which starts to look a whole lot like EventToCarrier.

(One issue I noticed in your proposal for EventToCarrier is it doesn't validate if the environment variable has a valid sampling parent context, so it would have a different behavior than the current spec.)

The example code I provided above is a straw man for discussion, not a complete implementation. If we were to take this route then the EventToCarrier implementation responsible for getting _X_AMZN_TRACE_ID from the Lambda execution environment could certainly be required to check for the sampled flag. I'm not sure that's the right requirement, though, and I'd want the SIG to review the requirements around this interaction de novo.

@jsuereth
Copy link
Contributor

This issue raised two concerns in my mind.

FIRST I want to say that the concerns on propagators @Aneurysm9 raise are similar to discussions we've been having around OpenCensus binary propagation format and gRPC instrumentation. Specifically, the current specification around propagators (in OpenCensus) for binary propagation is a bit awkward. There seems to be never an opportunity where you'd not use the OC binary format and a different binary format. IN this case it seems the instrumetnation (gRPC) is tied to the propagation format.

However, the key here is we want users to be able to provide prioritization of which propagation format they want via configuration. The fact that the semconv tries to undo this (v.s provide a default), I feel, may be an overextension of the semconv specification. Specifically, should the existence of x-ray tracing always overrule other propagation? I'm ok if that's the default, but I'd be concerned if there were no ability to configure differently.

My second concern here is around the propagation specification itself. As @Aneurysm9 says, we conflate "how to fill out key-value pairs" with the prioritization of which instrumentation to attach to. However, unlike the conclusion in this thread, I actually think the ability for users to prioritize (even instrumentation specific propagation formats) is important.

I think we should do something here, possibly with the Propagation specification going forward. I see this necessary for gRPC + OpenCensus binary propagation in addition to what I'm seeing in AWS x-ray, both of which don't fix the current TextMapPropagator model.

@tylerbenson
Copy link
Member Author

We discussed this in the FAAS SIG meeting today and came to the following conclusion:

  • Instrumentation should be responsible for capturing the environment variable and putting it into the carrier before passing along to the propagator. (For implementations that have EventToCarrier it can be done there.)
  • If both the environment variable and the event have x-ray propagation, then the priority between those is currently undefined and needs to be discussed further.

@tylerbenson
Copy link
Member Author

Because there has been confusion on my final comment above, I want to point out that what was implemented in the spec in #3166 is different but not reflected in this issue because the SIG decided to move forward with span links on the Jan 31 meeting (but failed to notate here) which is why #3166 was focused on span links instead.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
spec:trace Related to the specification/trace directory
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants