Add support for Haystack trace propagation #146

Open
ldoin opened this issue Jun 29, 2020 · 4 comments

ldoin commented Jun 29, 2020

I'm working on a service that receives requests from an external system which uses Haystack as its tracing system. To maintain complete traces across the two systems while still using Datadog internally, it would be great to have support for Haystack.

I have seen that the Java agent has support for it (see DataDog/dd-trace-java#895) and was wondering if you would be open to a contribution adding this to the cpp tracer?

cgilmour (Contributor) commented Jul 3, 2020

Hi @ldoin, we'd be open to this.

The existing propagation code will probably need changing so that Datadog, B3 and Haystack work correctly and usefully.
A question for you: do you intend to correlate the info between the two different systems?

There are a few different combinations of services and tracing implementations where the right way to behave isn't clear.
Specifically:

  • Services A and B: A reports to Datadog, B reports to Haystack. A starts the trace, and B needs Haystack headers to emit trace data.
  • Services A, B and C: A and C report to Haystack and B reports to Datadog. A starts the trace, and to make it work through B and C, Haystack headers need to be sent.

In both of those cases, it seems there'd be incomplete traces in both systems, and some way of correlating them is needed. There are a few ways to do it, and I'm not sure which is correct.
Your suggestions would be helpful!

aantono commented Jul 6, 2020

Hello @cgilmour, glad to hear you guys are on board.

We are currently trying to solve the challenges you've described, so I actually have an internal version of dd-trace-java being tested to see how it would address those exact points. If things pan out, I will be making another PR to the dd-trace-java repo with the enhancements for your consideration. 😜

If you have any ideas or proposals of your own, we would love to hear them or sync up on the possible approaches to tackle those.

cgilmour (Contributor) commented Jul 9, 2020

Great! Yup, I have a few thoughts on this, and overall there's no single "perfect" answer for dealing with it.
Below is a huge wall of details, thoughts and assumed understanding. Sorry for that!
Ask questions where I wasn't clear enough or the conclusion didn't seem logical.
It might read as if I've made my mind up, but this is still very much in the "figuring out details" stage, trying to find the best "imperfect" answer 😄

The current C++ implementation for Datadog+B3 headers is a bit confusing.
Eg: when extracting a propagated trace, it expects the trace IDs to be the same for both Datadog and B3 headers.
At some point in time that seemed sensible, but now it seems a bit wrong.
Also, when configured to extract B3 headers and inject Datadog and/or B3 headers, the B3 trace ID is used as the datadog-trace-id.
This only works correctly for 64-bit IDs, and also seems a bit wrong.
Lastly, when injecting a propagated trace, it injects every header style it has configured.

The reason I asked how things would be correlated is that it matters quite a lot.
If there's no sensible way to correlate traces between the two different systems, there's still some value in the captured traces and the calculations derived from them. It'd just be cooler if a full distributed trace could be built up.

The way I think it should work is... a long explanation.

When extracting traces, the Datadog headers should always be supported and used for the trace ID. One additional set of "other propagation headers" can also be extracted, controlled by the DD_PROPAGATION_STYLE_EXTRACT environment variable: the first style in that list that works is applied.
Eg: if DD_PROPAGATION_STYLE_EXTRACT is set to B3,Haystack,Datadog, and a request comes in with Haystack headers, then those are used.
Datadog headers would always get extracted if they exist, regardless of whether they're configured in that list or not.
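
To make that concrete, here's a rough C++ sketch of that extraction order. This is not the tracer's actual code; ExtractedContext and the extractor callbacks are hypothetical names.

```cpp
// Sketch only: apply the first configured style that yields a context,
// but always honour Datadog headers when they are present.
#include <cstdint>
#include <functional>
#include <map>
#include <optional>
#include <string>
#include <utility>
#include <vector>

struct ExtractedContext {
  std::optional<uint64_t> datadog_trace_id;   // used to continue the trace
  std::optional<std::string> other_style;     // e.g. "Haystack"
  std::optional<std::string> other_trace_id;  // kept for correlation tags
};

using Headers = std::map<std::string, std::string>;
using Extractor = std::function<std::optional<ExtractedContext>(const Headers&)>;

std::optional<ExtractedContext> extractPropagated(
    const Headers& headers,
    const std::vector<std::pair<std::string, Extractor>>& configured_styles,
    const Extractor& extract_datadog) {
  // The first style in DD_PROPAGATION_STYLE_EXTRACT that matches wins.
  std::optional<ExtractedContext> result;
  for (const auto& style : configured_styles) {
    if (auto ctx = style.second(headers)) {
      result = ctx;
      break;
    }
  }
  // Datadog headers are always extracted if they exist, whether or not
  // "Datadog" appears in the configured list.
  if (auto dd = extract_datadog(headers)) {
    if (!result) {
      result = dd;
    } else {
      result->datadog_trace_id = dd->datadog_trace_id;
    }
  }
  return result;
}
```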

When starting a trace, if a Datadog trace ID was extracted, then that is used to continue the trace. If one wasn't received, then a new trace is started. The spans for that trace get tagged with the ID from the "other propagation headers", eg: haystack.trace_id. That's something that can be searched for when trying to correlate traces from different systems.
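
As a small illustration of that correlation tag, the extracted "other" trace ID could be attached to the local span via opentracing-cpp's SetTag. The helper and variable names here are made up, not the tracer's real code.

```cpp
// Hypothetical helper: record the extracted "other" trace ID as a searchable
// tag (e.g. haystack.trace_id) so traces can be correlated across systems.
#include <opentracing/span.h>
#include <string>

void tagForCorrelation(opentracing::Span& span,
                       const std::string& other_style,
                       const std::string& other_trace_id) {
  if (other_style == "Haystack") {
    span.SetTag("haystack.trace_id", other_trace_id);
  }
}
```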

The tricky decision here is related to propagated sampling decisions.
I think we basically use the "strongest" decision available. So if there's a datadog-sampling-priority, we use that. If not, fall back to one from "other propagation headers", and if still undecided, just make a new decision.
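
In code, that precedence could look something like this minimal sketch; the function and parameter names are hypothetical.

```cpp
// "Strongest decision wins": a propagated datadog-sampling-priority beats one
// from the "other" headers, which beats making a fresh decision.
#include <functional>
#include <optional>

int resolveSamplingPriority(std::optional<int> datadog_priority,
                            std::optional<int> other_priority,
                            const std::function<int()>& make_new_decision) {
  if (datadog_priority) {
    return *datadog_priority;  // strongest: an explicit Datadog decision
  }
  if (other_priority) {
    return *other_priority;    // next: whatever the other system decided
  }
  return make_new_decision();  // still undecided: make a new local decision
}
```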

Finally when injecting traces, there's another variable that controls it: DD_PROPAGATION_STYLE_INJECT.
In this case, we don't automatically inject Datadog headers; "Datadog" needs to be in the list. For "other propagation headers", we only inject the ones we received (extracted). We don't "convert" at all, or add styles we didn't receive.
So for example, if extract was set to B3,Haystack,Datadog and inject was set to B3,Haystack,Datadog, and we extracted Haystack headers, we don't invent B3 header values and emit those. If we did, it would be meaningless: pretending a trace existed when it didn't.
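
Here's a hedged sketch of that injection rule. The types and style names are hypothetical, not the tracer's real API; the point is only that a non-Datadog style is injected solely when that exact style was extracted.

```cpp
// Sketch: inject only the styles listed in DD_PROPAGATION_STYLE_INJECT,
// and never synthesize headers for a foreign style we didn't receive.
#include <functional>
#include <map>
#include <optional>
#include <string>
#include <vector>

using Headers = std::map<std::string, std::string>;

struct ExtractedContext {
  std::optional<std::string> other_style;  // e.g. "Haystack", set at extract time
};

using Injector = std::function<void(const ExtractedContext&, Headers&)>;

void injectPropagated(const ExtractedContext& ctx,
                      const std::vector<std::string>& inject_styles,
                      const std::map<std::string, Injector>& injectors,
                      Headers& carrier) {
  for (const auto& style : inject_styles) {
    if (style == "Datadog") {
      // Datadog headers only go out if "Datadog" is explicitly in the list.
      injectors.at(style)(ctx, carrier);
    } else if (ctx.other_style && *ctx.other_style == style) {
      // Never convert: e.g. don't invent B3 values when only Haystack came in.
      injectors.at(style)(ctx, carrier);
    }
  }
}
```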

It'd be nice to ignore "baggage" headers completely, but some thought should still be applied to that topic. I don't know the right answer, but probably (a rough sketch follows the list):

  • when extracting, if both the Datadog and 'other' headers carry baggage items, merge them together
  • when injecting, inject all baggage items for whichever propagation headers are being injected (ie: either just 'other', or 'Datadog' and 'other')
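
A minimal sketch of the extract-side merge, assuming hypothetical baggage maps; letting the Datadog entry win on a key conflict is an arbitrary choice for illustration, not a decided behaviour. The merged set would then be written out for every style being injected.

```cpp
// Sketch: merge Datadog and "other" baggage items into one set on extract.
#include <map>
#include <string>

using Baggage = std::map<std::string, std::string>;

Baggage mergeBaggage(const Baggage& datadog_baggage, const Baggage& other_baggage) {
  Baggage merged = other_baggage;
  for (const auto& [key, value] : datadog_baggage) {
    merged[key] = value;  // Datadog entry overrides the "other" one on conflict
  }
  return merged;
}
```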

The part below still refers to the earlier example setup: services A, B and C, where A and C report to Haystack and B reports to Datadog. A starts the trace, and to make it work through B and C, Haystack headers need to be sent.

The final thing to consider is whether we modify the span ID for "other propagation headers" when injecting.
For Datadog headers, we definitely will, but doing that for Haystack headers is an open question.
Not doing it is fine for the simple A->B->C case, and partially avoids the "broken traces" problem. It does make B invisible to that system though.
(The backend of it will receive traces from A and C that look perfectly connected. B just proxies the headers through.)
The problems come up when an error occurs in C, or if the flow is more like A->B then B->C and B->D in the same request.
If B is invisible there, the UI for the (non-Datadog) system would throw up red herrings about where to troubleshoot.
So there are reasonable arguments for and against keeping the 'other' trace headers intact.
One solution is adding a baggage item to say "actually B was here" (so it gets reported by C, D etc).
I don't like this idea, but it would technically be less bad than being invisible. (Not a stellar endorsement.)

aantono commented Jul 19, 2020

@cgilmour - Here is the proposed implementation of what we were thinking in Java - DataDog/dd-trace-java#1691

We ran some internal tests and the traces appear to be stitching together correctly.
The conversion between Haystack's 128-bit UUID format and Datadog's 64-bit BigInteger IDs is done by prepending a static MSB value of "44617461-646f-6721" to the LSB tail taken from the DDId.toHexString() value (later formatted in UUID syntax).
The reverse is done similarly to the current B3 handling, but instead of losing precision we preserve the original Haystack value in Baggage and also add it to Tags (so it is always visible in the UI and propagated to the next Span via the Baggage context).
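
For reference, here's a hedged C++ rendering of that scheme (the linked PR implements it in Java); splitting the low 64 bits 4-12 across the remaining UUID groups follows standard UUID syntax, but the exact formatting is my assumption.

```cpp
// Sketch: 64-bit Datadog ID <-> Haystack-style 128-bit UUID, using the static
// MSB half "44617461-646f-6721" described above.
#include <cstdint>
#include <cstdio>
#include <string>

// Datadog 64-bit trace ID -> Haystack UUID string.
std::string toHaystackUuid(uint64_t dd_id) {
  char buf[37];
  std::snprintf(buf, sizeof(buf), "44617461-646f-6721-%04x-%04x%08x",
                static_cast<unsigned>((dd_id >> 48) & 0xFFFF),
                static_cast<unsigned>((dd_id >> 32) & 0xFFFF),
                static_cast<unsigned>(dd_id & 0xFFFFFFFF));
  return std::string(buf);
}

// Haystack UUID string -> Datadog 64-bit trace ID (low 64 bits); the full
// UUID would additionally be preserved in baggage/tags as described above.
uint64_t fromHaystackUuid(const std::string& uuid) {
  std::string hex;
  for (char c : uuid) {
    if (c != '-') hex.push_back(c);
  }
  // The last 16 hex digits are the LSB half.
  return std::stoull(hex.substr(hex.size() - 16), nullptr, 16);
}
```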

Let us know what you think of this approach, and maybe we can narrow it down to an acceptable final result together!
