feat(instrumentation-aws-lambda): Changed capturing of X-Ray context as span link #1411

Conversation

martinkuba
Contributor

Which problem is this PR solving?

The AWS Lambda spec has been updated with respect to handling the _X_AMZN_TRACE_ID env variable. Instead of using its context as a parent span, it now says that a link should be created instead. See this spec issue for more details.

Short description of the changes

A span link will now be created for a sampled context in the X-Ray env variable. I have also removed the disableAwsContextPropagation configuration since it is not needed anymore.
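For readers unfamiliar with the value being discussed, the following is a minimal sketch of how an X-Ray trace header such as the one in `_X_AMZN_TRACE_ID` can be turned into span-context fields. It is not the code from this PR: `SpanContextLike` and `parseXRayHeader` are hypothetical names, and the real instrumentation relies on the AWS X-Ray propagator instead of hand-written parsing.

```typescript
// Local stand-in for the OpenTelemetry SpanContext shape (hypothetical, for illustration).
interface SpanContextLike {
  traceId: string; // 32 lowercase hex characters
  spanId: string; // 16 lowercase hex characters
  sampled: boolean;
}

// Parses a header of the form
//   Root=1-5759e988-bd862e3fe1be46a994272793;Parent=53995c3f42cd8ad8;Sampled=1
// and returns undefined when a required field is missing or malformed.
function parseXRayHeader(header: string | undefined): SpanContextLike | undefined {
  if (!header) return undefined;
  const fields = new Map<string, string>();
  for (const part of header.split(';')) {
    const idx = part.indexOf('=');
    if (idx > 0) fields.set(part.slice(0, idx).trim(), part.slice(idx + 1).trim());
  }
  // Root is "1-<8 hex epoch seconds>-<24 hex>"; the two hex groups together form the trace ID.
  const root = /^1-([0-9a-f]{8})-([0-9a-f]{24})$/.exec(fields.get('Root') ?? '');
  const parent = fields.get('Parent') ?? '';
  if (!root || !/^[0-9a-f]{16}$/.test(parent)) return undefined;
  return {
    traceId: root[1] + root[2],
    spanId: parent,
    sampled: fields.get('Sampled') === '1',
  };
}
```

A header without a `Parent` field yields `undefined` here, which is one way an incomplete context can be rejected before it is used as a parent or a link.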

@codecov

codecov bot commented Feb 24, 2023

Codecov Report

Merging #1411 (d77a73d) into main (b5fc0c4) will decrease coverage by 0.93%.
The diff coverage is 100.00%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1411      +/-   ##
==========================================
- Coverage   96.06%   95.14%   -0.93%     
==========================================
  Files          14       17       +3     
  Lines         914     1173     +259     
  Branches      199      244      +45     
==========================================
+ Hits          878     1116     +238     
- Misses         36       57      +21     
| Impacted Files | Coverage Δ |
| --- | --- |
| ...tapackages/auto-instrumentations-node/src/utils.ts | 98.78% <100.00%> (ø) |
| ...-instrumentation-aws-lambda/src/instrumentation.ts | 94.01% <100.00%> (ø) |

... and 1 file with indirect coverage changes

@martinkuba
Contributor Author

@willarmiros Would you be able to review this?

Contributor

@willarmiros willarmiros left a comment


I have a few concerns about the nature of this change, since it's the first I'm aware of it. I will also discuss w/ @Aneurysm9 offline.

@@ -50,8 +50,7 @@ In your Lambda function configuration, add or update the `NODE_OPTIONS` environm
| --- | --- | --- |
| `requestHook` | `RequestHook` (function) | Hook for adding custom attributes before lambda starts handling the request. Receives params: `span, { event, context }` |
| `responseHook` | `ResponseHook` (function) | Hook for adding custom attributes before lambda returns the response. Receives params: `span, { err?, res? }` |
| `disableAwsContextPropagation` | `boolean` | By default, this instrumentation will try to read the context from the `_X_AMZN_TRACE_ID` environment variable set by Lambda, set this to `true` to disable this behavior |
Contributor


Isn't removing this a breaking change?

return ROOT_CONTEXT;
}

private static _determineLinks(): Link[] {
Contributor


I'm concerned that changing this behavior is also a breaking change. Will X-Ray customers who once saw their Lambda span as a direct child of the incoming context now only see it linked to the incoming context? Or will it still have a parent-child relationship (as before) but now ALSO have a link? I suppose this should have been discussed in the spec change process, but I wasn't aware of it :(

Member


@willarmiros to be clear since you are the component owner, are you asking that we block this change? Since this comes from the spec, I would lean on the side of accepting it. Also, since this component is 0.x version it should be expected that breaking changes are possible.

@martinkuba
Contributor Author

> I have a few concerns about the nature of this change, since it's the first I'm aware of it. I will also discuss w/ @Aneurysm9 offline.

@willarmiros This change was introduced in this spec change and discussed in this issue. @Aneurysm9 was part of these discussions, and I think in favor of making the change.

@willarmiros
Contributor

Thanks for the context - I don't want to block this change as it is indeed implementing the spec change. Can you explain though how customers can configure their Lambda functions/instrumentation to maintain the old experience (of reading from _X_AMZN_TRACE_ID env var and not doing any linking)?

@martinkuba
Contributor Author

Maintaining the old behavior is no longer in the spec, and keeping it as an option has not been considered at this time (as far as I know). If you feel strongly that it should, I think we should update the spec first so that it is consistent across different implementations (there is a similar update for Python in progress).

@willarmiros
Contributor

willarmiros commented Mar 24, 2023

Makes sense @martinkuba, we may pursue updating the spec in the future. For now I am just trying to fully understand the change in behavior as it stands today. So the old default behavior was:

  1. Lambda invoked.
  2. Lambda service creates a trace context and injects it into _X_AMZN_TRACE_ID env var.
  3. Lambda instrumentation parses the env var for a valid trace context
  4. If trace context is valid, creates a span to represent the invocation, and assigns the _X_AMZN_TRACE_ID trace context as the parent context. If it is not valid, start a new trace and ignore _X_AMZN_TRACE_ID.
  5. Other spans (e.g. client spans) are created as children of the invocation span

Now, after this change, the behavior will be:

  1. Lambda invoked.
  2. Lambda service creates a trace context and injects it into _X_AMZN_TRACE_ID env var.
  3. Lambda instrumentation parses the env var for a valid trace context
  4. The instrumentation creates a span for the Lambda invocation in a new trace regardless of the _X_AMZN_TRACE_ID validity. If the env var is valid, instrumentation links the invocation span to the _X_AMZN_TRACE_ID trace context.
  5. Other spans are created as children of the invocation span in the new trace

Is that correct?
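The two step lists above can be restated as a small sketch that models only the parent-versus-link decision. The types and the function names `planBefore` and `planAfter` are hypothetical stand-ins, not the instrumentation's code, and the env-var context is assumed to be parsed and validated already.

```typescript
// Hypothetical local types; a real implementation would use the OpenTelemetry API types.
interface RemoteContext {
  traceId: string;
  spanId: string;
  sampled: boolean;
}

interface SpanPlan {
  parent?: RemoteContext; // undefined means the invocation span starts a new trace
  links: RemoteContext[];
}

// Old default as summarized above: a valid env-var context becomes the parent.
function planBefore(env?: RemoteContext): SpanPlan {
  return env ? { parent: env, links: [] } : { links: [] };
}

// Behavior proposed in this PR: always start a new trace and link to the env-var context.
// The PR description says the link is created for a sampled context, so that is checked here.
function planAfter(env?: RemoteContext): SpanPlan {
  return { links: env && env.sampled ? [env] : [] };
}
```

In both variants, spans created during the invocation remain children of the invocation span; only the invocation span's relationship to the Lambda-provided context changes.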

@Aneurysm9
Member

There is still the eventContextExtractor that can be provided to extract a parent context and that context can be extracted from this environment variable or the event as desired. See this comment outlining the concept, though using EventToCarrier from the Go instrumentation that separates carrier creation from context extraction. This was supposed to be part of the spec change stopping the forced use of the Lambda-provided context, but needs to be added as a follow up.
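As an illustration of the extractor idea, here is a hedged sketch of a function that pulls a parent context out of the event, in this case from a W3C `traceparent` header on an HTTP-style event. The `RemoteContext` and `Extractor` types and both function names are hypothetical; the real `eventContextExtractor` hook works with OpenTelemetry `Context` objects, and its exact signature should be checked against the package's type definitions.

```typescript
// Hypothetical local types standing in for the OpenTelemetry API.
interface RemoteContext {
  traceId: string;
  spanId: string;
  sampled: boolean;
}
type Extractor = (event: unknown) => RemoteContext | undefined;

// Parses a W3C traceparent value: "00-<32 hex trace id>-<16 hex span id>-<2 hex flags>".
function parseTraceparent(value: string): RemoteContext | undefined {
  const m = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/.exec(value.trim());
  if (!m) return undefined;
  const [, traceId, spanId, flags] = m;
  // All-zero IDs are invalid in W3C Trace Context.
  if (/^0+$/.test(traceId) || /^0+$/.test(spanId)) return undefined;
  return { traceId, spanId, sampled: (parseInt(flags, 16) & 1) === 1 };
}

// Assumes the event carries lowercase header names, which may not hold for every trigger.
const extractFromHttpEvent: Extractor = (event) => {
  const headers = (event as { headers?: Record<string, string | undefined> } | null)?.headers;
  const value = headers?.['traceparent'];
  return typeof value === 'string' ? parseTraceparent(value) : undefined;
};
```

A user who wants the environment variable as the parent could write an extractor of the same shape that reads `_X_AMZN_TRACE_ID` instead of the event.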

@tylerbenson
Member

@willarmiros your summary of the expected before and after behavior looks correct to me.

@martinkuba
Contributor Author

Hi @carolabadeer, you are listed as the maintainer for this component. Can you please take a look at this change? It's been open for a long time, and we need help moving it forward.

@cartersocha

@Aneurysm9 any remaining comments? We'd like to move forward on this

@Aneurysm9
Member

There needs to be a dead-simple mechanism for the user to get the same behavior they currently have after this change is made. This change in the Java instrumentation has caused customer pain and churn that we can and should be able to avoid. @bryan-aguilar and @rapphil can provide more details there.

@carlosalberto

Hey @open-telemetry/javascript-approvers

I suggest we go ahead and merge the current PR as this implements an experimental section of the Specification (there's already an issue tracking the changes requested to make AWS users' lives easier): open-telemetry/opentelemetry-lambda#714

@carlosalberto

Any reason this is not merged yet? @martinkuba

@tylerbenson
Member

@carlosalberto I believe code owner review is required. He may be able to bypass it somehow though.

@carolabadeer
Contributor

Reiterating @Aneurysm9's concerns above:

> There needs to be a dead-simple mechanism for the user to get the same behavior they currently have after this change is made. This change in the Java instrumentation has caused customer pain and churn that we can and should be able to avoid.

Since this is a breaking change, it is important to have a configuration mechanism for choosing between keeping the current behaviour and using the new behaviour with span links.

@cartersocha

Breaking changes aren’t applicable when a spec or component is marked as experimental. Additionally, the technical committee has made the decision to move forward with this change.

The community and sig welcome any proposals and example implementations to improve this behavior for users going forward.

@Aneurysm9
Member

Aneurysm9 commented Aug 7, 2023

> Breaking changes aren’t applicable when a spec or component is marked as experimental.

That is not in line with the guidance we have received regarding recent and planned changes to "experimental" semantic conventions that have been de facto stable. We have to acknowledge that we have shipped functionality based on these conventions for a long time and users have established an understanding that they are experimental, but relatively stable. A change like this without any means for the user to recover the prior behavior is significantly destabilizing and worthy of more critical thought than "it's experimental, so break whatever whenever".

> Additionally, the technical committee has made the decision to move forward with this change.

No, they did not as far as I can tell. From their meeting notes on 2023-07-12:

> The maintainer is responsible for navigating conflicts and making decisions. Escalation to TC should be reserved for special situations.
> This isn’t one of those special situations because FaaS semantic conventions are still experimental.

The TC have explicitly deferred to the repository maintainers. How about we let them speak for themselves here.

@dyladan
Member

dyladan commented Aug 7, 2023

The reason it's not yet merged is because it does not seem to me that the code owners are satisfied with the change.

quoting from: https://github.com/open-telemetry/opentelemetry-js-contrib/blob/main/CONTRIBUTING.md#component-ownership

> Component owners are generally given authority to make decisions relating to implementation and feature requests for their components, provided they follow the best practices set out by the maintainers. Component owners MUST do their best to maintain a high level of quality, security, performance, and specification compliance within their components. Maintainers may override the decisions of component owners, but should only do so when they feel one or more of these traits is compromised.

In this case, the component owners have requested that the original behavior be accessible behind a configuration flag because the change introduces breaking behavior. I believe that to be a reasonable request. There is an explanation of the breaking change here: open-telemetry/opentelemetry-lambda#714 (issue refers to java specifically, but it is the same change). There is an ongoing effort to introduce a long term fix for this issue here: open-telemetry/semantic-conventions#164.

I had been holding off on merging this in the hopes that the PR author and the code owners could come to a resolution on their own, or that the question would be resolved in the specification which is still under active development. If we push through a change the code owners aren't happy with it calls into question the reasoning for even having code owners and may erode the trust between the maintainers and the code owners. Since the specification itself is still experimental, I don't believe the maintainers should necessarily force it through. If the spec was stable I might feel differently.

> I suggest we go ahead and merge the current PR as this implements an experimental section of the Specification (there's already an issue regarding tracking the changes requested to make AWS users lives easier): open-telemetry/opentelemetry-lambda#714

I'm not sure I agree with the conclusion here. I'm hesitant to merge a change which follows an experimental spec that is known to have issues that may require more changes in the future, particularly when the code owners have specifically raised concerns.


My recommendation is to follow @Aneurysm9's original request and make the old behavior accessible by configuration. It could be argued whether the old or new behavior should be the default and I see merit in both choices there. It is my hope that the code owners and PR authors can resolve that question between themselves without maintainers needing to step in.

@dyladan
Member

dyladan commented Aug 7, 2023

@carlosalberto since you commented on this and you're a TC member I want to make sure you're OK with the above response. I assume a TC member only comments on a PR for a contrib repo when there is a specific reason to do so.

@carlosalberto

Hey @carolabadeer

> Since this is a breaking change, it is important to have a configuration mechanism for choosing between keeping the current behaviour and using the new behaviour with span links.

Please join the FaaS group to potentially discuss this. For the time being, as the related section doesn't include such flag, we should merge this change (as Java already did).

@Aneurysm9
Member

> For the time being, as the related section doesn't include such flag, we should merge this change (as Java already did).

Are you speaking on behalf of the TC? That is not the understanding I have from the notes of the TC discussion. The TC has said this should be handled by the maintainers. The maintainers have deferred to the component owners. The component owners are asking for this change. Why should we merge this PR without the approval of the code owners or maintainers?

@carlosalberto

> How about we ask the impacted user to stay in older version while we sorting things out ? non x-ray user can get newer version and benefit from the change. We can then discuss and decide whether to put in a knob to turn it on/off (short term solution). Later we can figure out the long term solution (open-telemetry/semantic-conventions#164) etc.

Absolutely. There's also the chance that AWS could keep a fork of this instrumentation, and do the proper massaging till they support Links (hopefully soon? they will be extensively used in Messaging).

@dyladan
Member

dyladan commented Aug 9, 2023

> Hi all, I have a suggestion. How about we ask the impacted user to stay in older version while we sorting things out ? non x-ray user can get newer version and benefit from the change. We can then discuss and decide whether to put in a knob to turn it on/off (short term solution). Later we can figure out the long term solution (open-telemetry/semantic-conventions#164) etc.

This is definitely possible but I think it's not ideal. I suspect it is what broken users are doing in Java and Python. I would expect it to work in most cases, but dependency management is tricky enough in JS as it is.

@Aneurysm9
Member

"Fixing" the situation for some users at the expense of others when both can be accommodated is not a good solution by any measure.

Agreed. And going back to the previous situation will break non AWS X-Ray users, which is not a great win for OTel as an OSS project IMHO.

We are not asking to return to the status quo ante. We are asking for an interim compromise that retains the option for users to choose which behavior they need.

> There's also the chance that AWS could keep a fork of this instrumentation, and do the proper massaging till they support Links (hopefully soon? they will be extensively used in Messaging).

This is not about supporting or not supporting links. It is about breaking trace context and not being able to represent a complete trace at all. Links are needed for users who are not using X-Ray so that they can connect to the spans that are vended by the Lambda service. Users who are using X-Ray need to have the correct parent segments so that they have a complete trace, including the spans that are vended by the lambda service. The breakage this introduces was clearly documented on open-telemetry/opentelemetry-lambda#714.

@dyladan
Member

dyladan commented Aug 9, 2023

Sorry for multiple comments in a row. New comments are coming in while I'm authoring replies to previous ones.

> It's very simple: this is not a JS-specific feature, it's something that will have to be added in the future to other languages. I'm very surprised that nobody came to the FaaS working calls and/or present a very small PR that covers this in the semconv docs. Or if this is something that only has to exist in JS, I'm happy to be proven wrong.

It's not a JS specific issue. I'm also surprised it hasn't been resolved in the spec and I've not been involved there so I don't have any context to speak to it. I'm just trying not to break existing JS users. The reason I'm focused on JS and not other languages is because I'm responsible for JS and not those other languages.

> There's also the chance that AWS could keep a fork of this instrumentation

I really don't want to get to the point where we have to tell someone to fork the instrumentation. To me that is the worst case scenario that I hope we can avoid.

@Aneurysm9
Member

To make sure we're all on the same page, here's what the specification currently says:

> If the _X_AMZN_TRACE_ID environment variable is set, instrumentation SHOULD try to parse an OpenTelemetry Context out of it using the AWS X-Ray Propagator. If the resulting Context is valid then a Span Link SHOULD be added to the new Span's start options with an associated attribute of source=x-ray-env to indicate the source of the linked span. Instrumentation MUST check if the context is valid before using it because the _X_AMZN_TRACE_ID environment variable can contain an incomplete trace context which indicates X-Ray isn’t enabled. The environment variable will be set and the Context will be valid and sampled only if AWS X-Ray has been enabled for the Lambda function. A user can disable AWS X-Ray for the function if the X-Ray Span Link is not desired.

Let's break that down:

> If the _X_AMZN_TRACE_ID environment variable is set, instrumentation SHOULD try to parse an OpenTelemetry Context out of it using the AWS X-Ray Propagator.

This says that instrumentation SHOULD try to extract a context from that environment variable. It does not require that it be done. The current state of this instrumentation is compliant with this specification. The change requested on this PR would also be compliant with this specification.

> If the resulting Context is valid then a Span Link SHOULD be added to the new Span's start options with an associated attribute of source=x-ray-env to indicate the source of the linked span.

Again, the language is SHOULD. Instrumentation can add a span link or not and still be compliant. The current implementation of this instrumentation is compliant, as would be the instrumentation after the change that has been requested on this PR.

> Instrumentation MUST check if the context is valid before using it because the _X_AMZN_TRACE_ID environment variable can contain an incomplete trace context which indicates X-Ray isn’t enabled. The environment variable will be set and the Context will be valid and sampled only if AWS X-Ray has been enabled for the Lambda function. A user can disable AWS X-Ray for the function if the X-Ray Span Link is not desired.

Here we encounter the only MUST requirement in this section of the specification, though it is conditioned on instrumentation choosing to implement the prior SHOULD regarding creating a span link. The current implementation is compliant, as would be an implementation with the change requested on this PR.

As it stands, the implementation is compliant with the specification, but presents issues for users who do not use AWS X-Ray. As-is this PR would be compliant with the specification, but would present issues for users who do use AWS X-Ray. If the author implements the change requested by both the code owner and maintainers, the instrumentation would still be compliant with the specification and would present options for all users to achieve their desired outcomes.
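To make the "options for all users" outcome concrete, the following is a rough sketch of a mode switch that satisfies the MUST (validate before use) and lets the user choose between parenting and linking. Everything here is an assumption for illustration: `XRayEnvMode`, `planStart`, and the local types are hypothetical and are not part of the instrumentation's published configuration. Only the `source=x-ray-env` attribute comes from the quoted specification text.

```typescript
// Local stand-ins for OpenTelemetry API shapes; not imported from a real package.
interface SpanContextLike {
  traceId: string;
  spanId: string;
  sampled: boolean;
}
interface LinkLike {
  context: SpanContextLike;
  attributes: Record<string, string>;
}
interface StartPlan {
  parent?: SpanContextLike;
  links: LinkLike[];
}

// Hypothetical option: 'parent' keeps the pre-PR behavior, 'link' follows the quoted spec text.
type XRayEnvMode = 'parent' | 'link';

// A context is treated as usable only if both IDs are well-formed and not all zeros.
function isValid(ctx: SpanContextLike | undefined): ctx is SpanContextLike {
  return (
    ctx !== undefined &&
    /^[0-9a-f]{32}$/.test(ctx.traceId) && !/^0+$/.test(ctx.traceId) &&
    /^[0-9a-f]{16}$/.test(ctx.spanId) && !/^0+$/.test(ctx.spanId)
  );
}

function planStart(envCtx: SpanContextLike | undefined, mode: XRayEnvMode): StartPlan {
  // The spec's only MUST in this section: validate the context before using it.
  if (!isValid(envCtx)) return { links: [] };
  if (mode === 'parent') return { parent: envCtx, links: [] };
  // The spec text asks for a link carrying the attribute source=x-ray-env.
  return { links: [{ context: envCtx, attributes: { source: 'x-ray-env' } }] };
}
```

Which mode should be the default is exactly the open question in this thread; the sketch deliberately does not pick one.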

@martinkuba as the author of this PR can you please implement the change that has been requested by the code owners and maintainers so that we can resolve this?

@dyladan
Member

dyladan commented Aug 10, 2023

> @martinkuba as the author of this PR can you please implement the change that has been requested by the code owners and maintainers so that we can resolve this?

I'd actually advise to hold off until the conversation is resolved and we are sure everyone is on the same page.

@shuwpan

shuwpan commented Aug 11, 2023

Shall we all come to the FaaS SIG meeting next Tuesday (Aug 15) and try to reach some consensus? These span link PRs have been open for a long time (half a year). The repos for different languages are in an inconsistent state because of this.

I am new to the community, just wondering: is there any sort of voting mechanism? When we are in a stalemate, can we still make a decision and move forward?

@Aneurysm9
Member

This can be added to the agenda for the next SIG meeting, but attendance and participation at a SIG meeting cannot be the only way to make progress on this, or any, issue.

Given that there is no spec compliance issue WRT the current state of this instrumentation or any of the proposed states discussed on this PR, I don't think it is necessary to discuss anything in the FaaS SIG to progress this PR. We can certainly discuss proposed changes to the spec, but the FaaS SIG's spec discussions are not what is holding this PR up at this time.

@dyladan
Member

dyladan commented Aug 11, 2023

> Shall we all come to FaaS SIG meeting next Tuesday (Aug 15) and try to reach some consensus ? These spanlink prs are open for a long time (half year). Repo of different languages are in inconsistent state because of this.

I was already planning to come to this meeting. In my opinion they should be solving this there and we shouldn't be having these arguments in implementation SIG contrib repos.

> I am new to the community, just wondering, is there any sort of voting mechanism? When we are in stalemate, we can still make decision and move forward ?

No, there is no voting mechanism. Instead, there is a hierarchy with TC/GC at the top, maintainers beneath them, approvers beneath them, and contributors (and component owners in there somewhere around approvers). In this case the maintainer (me) has simply not decided to exercise their power yet. If I do decide to do so, I can settle the argument as long as my decision complies with the policies of the project (maintained by GC) and the specification (maintained by TC). If the parties involved are unhappy with the decision they could then escalate to the TC. Technically, the GC sits above the TC but in practice they are treated as equal.

We are not in a stalemate in my opinion, because arguments and new ideas for solutions are still being formulated. If we reach a point where no new information is forthcoming and no decision has been made, then we will be in a stalemate. This is a natural part of the open source development process and it occasionally has hiccups that need to be addressed. If there is a true stalemate I will settle it as maintainer and we'll go from there.

@dyladan
Member

dyladan commented Aug 11, 2023

I'm also planning to bring this up in the maintainers meeting. I'd like to hear what the other maintainers think of how this situation should be handled or could have been handled better. I'm also not entirely sure the @open-telemetry/python-maintainers are aware that they're about to release a breaking change to their users, or that the @open-telemetry/java-instrumentation-maintainers are aware they have already done so (they might be aware but I just want to make sure).

@dyladan
Member

dyladan commented Aug 11, 2023

> This can be added to the agenda for the next SIG meeting, but attendance and participation at a SIG meeting cannot be the only way to make progress on this, or any, issue.

It may not be the only way but I believe it will be the fastest way.

> Given that there is no spec compliance issue WRT the current state of this instrumentation or any of the proposed states discussed on this PR, I don't think it is necessary to discuss anything in the FaaS SIG to progress this PR. We can certainly discuss proposed changes to the spec, but the FaaS SIG's spec discussions are not what is holding this PR up at this time.

Technically this is correct but it does hinge on deciding not to fulfill a SHOULD requirement. I would much rather see a version of the spec where all users can use the instrumentation and where it does not have such caveats.


In any case, I do not believe this is the correct place to have discussion about if the spec is or is not correct. The only thing clear to me is that both instrumentation versions "in the wild" are broken in one way or another regardless of spec compliance. I would much rather see this settled by the FaaS SIG.


edit:

> It may not be the only way but I believe it will be the fastest way.

To be clear, I don't know if it will be resolved in the meeting, but I hope to impress on them that they need to address this and that they can't expect language sig maintainers to arbitrate these things without the context of the last 5 months of conversations.

@trask
Member

trask commented Aug 11, 2023

> I'm also planning to bring this up in the maintainers meeting. I'd like to hear what the other maintainers think of how this situation should be handled or could have been handled better. I'm also not entirely sure the @open-telemetry/python-maintainers are aware that they're about to release a breaking change to their users, or that the @open-telemetry/java-instrumentation-maintainers are aware they have already done so (they might be aware but I just want to make sure).

fwiw, we (Java) released this breaking change 4 months ago and haven't heard from anyone about it (cc @rapphil to keep me honest). we would be happy to accept a PR to opt-in to the prior behavior if the new behavior is causing problems for some users.

@trask
Member

trask commented Aug 11, 2023

btw I agree with @dyladan that this discussion should ideally be happening within the scope of the FAAS SIG, and they should ideally update the FAAS semconv to reflect the outcome of their discussion (and if the FAAS SIG is deadlocked then they can escalate to the TC)

@rapphil

rapphil commented Aug 15, 2023

FYI we intend to create PRs to provide flags that will allow users to opt in to the previous behaviour. The first PR is out for Python: open-telemetry/opentelemetry-python-contrib#1909

Next we intend to do this for JS and finally for Java.

@dyladan
Member

dyladan commented Aug 15, 2023

> FYI we intend to create PRs to provide flags that will allow users to opt-in into the previous behaviour. The first pr is out for Python: open-telemetry/opentelemetry-python-contrib#1909
>
> Next we intend to do this for Js and finally for Java.

Is this in the spec or is it just a reaction to this conversation?

@Aneurysm9
Member

> FYI we intend to create PRs to provide flags that will allow users to opt-in into the previous behaviour. The first pr is out for Python: open-telemetry/opentelemetry-python-contrib#1909
> Next we intend to do this for Js and finally for Java.
>
> Is this in the spec or is it just a reaction to this conversation?

The proposed behavior is compliant with the current language in the spec and achieves the intended goal of the spec change (as I understood it) of allowing users of all trace backends and context propagation methods to effectively utilize this instrumentation.

@dyladan
Member

dyladan commented Aug 15, 2023

> The proposed behavior is compliant with the current language in the spec and achieves the intended goal of the spec change (as I understood it) of allowing users of all trace backends and context propagation methods to effectively utilize this instrumentation.

I was asking if there have been spec changes specifically to address this. "Technically allowed" is different than "recommended".

@github-actions
Contributor

This PR is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days.

@ithompson-gp

Has there been any move to merge this?
As an end-user, we would like conventions to be respected, and move forward with developing Span Link as a means to represent causal relationships in an async EDA.
This is a true blocker to any FaaS adoption of OpenTelemetry

@tylerbenson
Member

@ithompson-gp you can see a new proposal for how we intend to move forward here:
open-telemetry/semantic-conventions#354

Feel free to ping me on slack or join a SIG meeting if you have questions or concerns.

@github-actions github-actions bot removed the stale label Oct 30, 2023
@ithompson-gp

Since open-telemetry/semantic-conventions#354 has been merged, what is the status of changing this instrumentation (instrumentation-aws-lambda) to follow the pattern?

I am wondering what mechanism will be put in place to respect the fact that we, the users, would like a config option for setting a SpanLink.

Opening up the current implementation so that SpanOptions are configurable, and so that the extracted (parent) context can be set as a SpanLink, is a requirement in an EDA environment.

Any update on this @tylerbenson @martinkuba ?

@martinkuba
Contributor Author

@ithompson-gp The JS work to implement the latest spec is tracked here.

I think there is a potential follow-up to this for handling links when there are multiple contexts available. This would need to be added to the spec first, and I am not planning on working on that at the moment. If you would like to discuss further, I suggest attending the FaaS SIG weekly meeting.

@martinkuba
Contributor Author

Closing this since adding a link is no longer in the spec.

@martinkuba martinkuba closed this Mar 19, 2024