Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Time to fully support TraceContext #71

Closed
watson opened this issue Apr 8, 2019 · 16 comments
Closed

Time to fully support TraceContext #71

watson opened this issue Apr 8, 2019 · 16 comments

Comments

@watson
Copy link
Contributor

watson commented Apr 8, 2019

I'd like to propose that we start the work of dropping the elastic-apm- prefix on our traceparent header + add support for tracestate. We have to add a transition period though. This issue is for discussion that 🙂

While TraceContext is not a finished standard yet, it's at it's final stage, Candidate Recommondation (CR), and require real-world usage from at least 3 APM vendors in order to become so. While I think there's already enough vendors pushing this so that we don't need to, I don't see any reason for us to hold back any longer. Hence this proposal.

Proposed plan

  1. On all incoming requests, our agents should look for both elastic-apm-traceparent and traceparent. If both are present, but the values differ, the agent should prefer elastic-apm-traceparent over traceparent.
  2. On all outgoing requests, backend agents will always set traceparent, and optionally set elastic-apm-traceparent; RUM agents will either set elastic-apm-traceparent or traceparent, due to CORS. This will be controlled by a new environment variable, described below.
  3. We will introduce a new config environment variable, ELASTIC_APM_USE_ELASTIC_TRACEPARENT_HEADER, which will control the above behaviour. Its value will initially default to true, to ensure backwards compatibility with older agents. Later we will flip the default value to false, to phase out backwards compatibility. The configuration exists as an "escape hatch" for large/complex environments, but otherwise shouldn't be necessary.
  4. On all incoming requests, our agents should look for tracestate and, based on the rules in the TraceContext spec, forward it to all outgoing requests. We currently don't need our agents to use this header for anything.
    • Be aware that the incoming HTTP request can have more than one tracestate header. Our agents should combine multiple incoming tracestate headers into one outgoing tracestate header.

Testing

@gregkalapos brought up that there's a bunch of compatibility tests we can run to ensure we support the spec correctly. It would be a good idea if we all added those to our CI in some way if possible:

https://github.com/w3c/trace-context/tree/master/test

What we are voting on

@elastic/apm-agent-devs Are you ok with moving forward with this plan?

@roncohen Please weigh in here if you have any objections.

Vote

Agent Yes No Indifferent N/A Link to agent issue
.NET
elastic/apm-agent-dotnet#177
Go
elastic/apm-agent-go#503
Java
elastic/apm-agent-java#923
Node.js
elastic/apm-agent-nodejs#994
Python
elastic/apm-agent-python#628
Ruby
elastic/apm-agent-ruby#621
RUM
elastic/apm-agent-rum-js#477
@axw
Copy link
Member

axw commented Apr 8, 2019

On all outgoing requests where an agent today would have set elastic-apm-traceparent, it should also set traceparent to the same value.

This means we either have to upgrade all agents at the same time, or have broken distributed traces. If we're going to make this the default behaviour, should we do it in a major version bump of each agent? Perhaps we could make it an option before then?

Is it not still possible for the traceparent format to be changed between now and the final publication? I think that was the primary motivation for not using the same header name in the first place.

On all incoming requests, our agents should look for tracestate and, based on the rules in the TraceContext spec, forward it to all outgoing requests. We currently don't need our agents to use this header for anything.

I think we also need to define a vendor key ("elastic"?), and probably our own versioning scheme for our vendor-specific tracestate. Maybe we can defer the second bit till later.

@watson
Copy link
Contributor Author

watson commented Apr 8, 2019

@axw wrote:

This means we either have to upgrade all agents at the same time, or have broken distributed traces. If we're going to make this the default behaviour, should we do it in a major version bump of each agent? Perhaps we could make it an option before then?

I think you misunderstood. We would still keep the old header. So this should be backward compatible and wouldn't need to major bump.

Is it not still possible for the traceparent format to be changed between now and the final publication? I think that was the primary motivation for not using the same header name in the first place.

Technically yes, but unlikely. But now that other APM vendors have implemented the header (without a prefix), this would mean a lot of broken implementations. So if this happens I expect that it will be dealt with in a way that would make it as painless as possible. E.g. by bumping the version. But of course, there are no guarantees.

I think we also need to define a vendor key ("elastic"?), and probably our own versioning scheme for our vendor-specific tracestate. Maybe we can defer the second bit till later.

Yes, and I agree we should defer it as we don't have a need for this key at the moment.

@axw
Copy link
Member

axw commented Apr 8, 2019

I think you misunderstood. We would still keep the old header. So this should be backward compatible and wouldn't need to major bump.

Indeed, I missed the "also" and inserted "instead" :)
SGTM.

@hmdhk
Copy link
Contributor

hmdhk commented Apr 8, 2019

For the RUM agent, I think we can only add traceparent header on a major version, the reason is that adding any header to a cross-origin request requires that configuration be changed on the backed so that the backend sends appropriate Access-Control-Allow-Headers header.

We might want to remove the elastic-apm-traceparent header on the same major version as well, otherwise users will have to keep this header in their backend configuration until we remove it.

@watson
Copy link
Contributor Author

watson commented Apr 8, 2019

@jahtalab Good point. If you major bump it should be ok to remove elastic-apm-traceparent. Here's a simple upgrade path table:

RUM sends Backend accepts Result
elastic-apm-traceparent elastic-apm-traceparent Ok
elastic-apm-traceparent elastic-apm-traceparent + traceparent Ok
traceparent elastic-apm-traceparent Error
traceparent elastic-apm-traceparent + traceparent Ok

So the user just has to make sure to upgrade the backend agents first before upgrading the RUM agent in order for this to work. As this will be implemented with a major bump in the RUM agent I think this is an acceptable approach.

@hmdhk
Copy link
Contributor

hmdhk commented Apr 8, 2019

Just to be clear the backend configuration I was referring to is not necessarily in the same backend as the backend instrumented by the our agents e.g. a reverse proxy that handles cross-origin header.

But, I agree we should remove elastic-apm-traceparent on the same major version. I think it make sense to wait until enough of our backend agents have released this change before releasing this on the RUM agent.

@roncohen
Copy link

I think it's fine to start looking into supporting this, but i would have preferred to let other blaze the trail when it comes to implementing it. With that in mind, we should make sure to make it part of our regular priority decision process, as we do with everything.

@hmdhk
Copy link
Contributor

hmdhk commented Jan 6, 2020

Since the RUM agent's implementation of the tracecontext will be a breaking change, we're planning to release this change once the backend agents have support for the traceparent header. I've looked at the issues and the following seems to be the current status:

Agent Status/Milestone Link to agent issue
.NET Done (v1.3)  elastic/apm-agent-dotnet#177
Go Done (v1.6) elastic/apm-agent-go#503
Java Done  elastic/apm-agent-java#923
Node.js Done elastic/apm-agent-nodejs#994
Python Done elastic/apm-agent-python#628
Ruby Done (v3.5) elastic/apm-agent-ruby#621
RUM 7.6  elastic/apm-agent-rum-js#477

@elastic/apm-agent-devs, Please respond with 👍 if the above plan is accurate, otherwise please comment if you expect any changes.

For .NET there is no milestone. Is this still the case @gregkalapos , @SergeyKleyman ? Do you have any plans for this?

I'm not sure about Node.js, cc @axw @lreuven .

@gregkalapos
Copy link
Contributor

For .NET there is no milestone. Is this still the case @ gregkalapos , @ SergeyKleyman ? Do you have any plans for this?

@jahtalab we did not add elastic/apm-agent-dotnet#177 to 7.6, so if we don't change priorities we can add it to 7.7.

However, I don't think it's a very high effort thing, it could be done fairly quickly.

Are we ok with 7.7 for .NET?

@lreuven
Copy link

lreuven commented Jan 7, 2020

the table is up to date, we can align the .Net & node in the near future(7.x timeframe).

@gregkalapos
Copy link
Contributor

Short update on .NET - this was implemented and released with .NET Agent v. 1.3 - which was released in sync with 7.6. Updated the table above - fyi @jahtalab.

@hmdhk
Copy link
Contributor

hmdhk commented Feb 17, 2020

The remaining ones are Java and Python, I can see the changes are merged but not released yet.
@beniwohli and @felixbarny when do you plan to release those changes?

@beniwohli
Copy link
Contributor

beniwohli commented Feb 17, 2020

@jahtalab yes, release is going out in a couple hours

/edit due to an issue with Jenkins, I wasn't able to trigger the release. I'll try again tomorrow.

@felixbarny
Copy link
Member

probably early next week

@beniwohli
Copy link
Contributor

We were able to trigger the release yesterday, so Python should be all good

@hmdhk
Copy link
Contributor

hmdhk commented Mar 20, 2020

The RUM agent released version 5.0.0 this week which includes (among other breaking changes) switching to the W3C traceparent header. With that all of our agents have support for the TraceContext spec. We're going to release a blog post about this soon. Thanks everyone for the effort 🎉

@hmdhk hmdhk closed this as completed Mar 20, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

8 participants