Representing an asynchronous span in Zipkin #1189
Comments
The Zipkin data model doesn't support async spans, such as tracing Kafka messages. It would need to be redesigned. (I think there's some work going on for Zipkin data model v2.)
One thing you could do to represent it is start a new trace, with a new trace ID, on the consumer side, and then add a binary annotation with an ID that correlates the two traces. But we have no UI support or special tooling support for that yet.
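A minimal sketch of that correlation idea, using plain dictionaries rather than any real Zipkin or Brave API. The helper names and the `correlated.trace_id` key are hypothetical, chosen only for illustration:

```python
import secrets


def new_id():
    """Return a random 64-bit ID as 16 lowercase hex characters."""
    return secrets.token_hex(8)


def producer_span(name):
    """Build a span in a fresh trace for the producer side (illustrative shape only)."""
    return {"traceId": new_id(), "id": new_id(), "name": name, "binaryAnnotations": []}


def consumer_span(name, producer_trace_id):
    """Build a span in a *new* trace and record the producer's trace ID on it.

    The key "correlated.trace_id" is an assumption, not a Zipkin convention.
    """
    return {
        "traceId": new_id(),
        "id": new_id(),
        "name": name,
        "binaryAnnotations": [
            {"key": "correlated.trace_id", "value": producer_trace_id}
        ],
    }


produced = producer_span("produce")
# The producer's trace ID would travel as metadata in the queued message.
consumed = consumer_span("consume", produced["traceId"])
```

The two traces stay separate, and any tooling that wants to join them has to look up the binary annotation itself.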
Thanks, I will look at the v2 discussion. I would like to include the asynchronous span in the same trace. An alternative approach I can see (for v1 code) is to submit non-core (custom) annotations when consuming and producing (using the same spanId). It appears that Zipkin would not attempt to adjust for clock skew where the core annotations are not present. It also seems that it would derive the duration of the span from the difference between the newest and oldest timestamps of the custom annotations. I am not sure whether this will play nicely with the rest of the tooling, though (I'm very new to this code base). Any thoughts on that? Note: implied in the above is that we are sending the in-band tracing information (e.g. spanId) as metadata in the queued message.
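To make the duration idea concrete, here is a small sketch of deriving a span's duration from the oldest and newest annotation timestamps. It only illustrates the arithmetic described above and is not Zipkin's code; the annotation values `kafka.produce` and `kafka.consume` are hypothetical:

```python
def duration_from_annotations(annotations):
    """Return newest minus oldest timestamp (microseconds), or None if undefined."""
    timestamps = [a["timestamp"] for a in annotations]
    if len(timestamps) < 2:
        return None  # a single annotation gives no interval
    return max(timestamps) - min(timestamps)


# Both annotations are reported against the same span ID, one by each side.
span_annotations = [
    {"timestamp": 1_000_000, "value": "kafka.produce"},  # hypothetical custom annotation
    {"timestamp": 1_250_000, "value": "kafka.consume"},  # hypothetical custom annotation
]
queue_latency_micros = duration_from_annotations(span_annotations)
```

Because the two timestamps come from different hosts, the result still includes whatever clock difference exists between producer and consumer.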
To address the clock skew thing, pretend the producer action is a client action and the consumer action is a server one. These represent the boundaries of your two services, with Kafka etc. in the middle.
The problem is that there is Kafka in the middle :) That said, the clock skew will be shifted anyway, since regardless of that the consumer shouldn't receive a message before it is produced.
Another (more dramatic) approach to clock skew could be to keep track of things on the instrumentation side. For example, if you know the time of the Kafka server, you can shift to that before reporting. This is different from NTP, as you'd just shift for the purposes of sending trace data as opposed to the whole VM/instance.
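A rough sketch of the "shift before reporting" idea. How the broker's time is obtained is not specified in this thread, so the offset estimate below (broker time compared with the midpoint of a local request/response pair, assuming symmetric network delay) is purely an assumption, and both function names are hypothetical:

```python
def estimate_offset_micros(local_send, broker_time, local_recv):
    """Estimate (broker clock - local clock), assuming symmetric network delay."""
    midpoint = (local_send + local_recv) // 2
    return broker_time - midpoint


def shift_annotations(annotations, offset_micros):
    """Return copies of the annotations with timestamps moved onto the broker's clock.

    Only the reported trace data is shifted; the host clock is untouched.
    """
    return [dict(a, timestamp=a["timestamp"] + offset_micros) for a in annotations]


offset = estimate_offset_micros(local_send=1_000, broker_time=5_100, local_recv=1_200)
local = [{"timestamp": 1_000, "value": "cs"}]
reported = shift_annotations(local, offset)
```

If both producer and consumer shift to the same broker clock, their timestamps become comparable without adjusting either machine.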
@AndrewWang996 had a question about this in Brave: basically, how do you deal with a one-way system? He was concerned about half-open spans ("cs" and "sr" only), although I don't know what the specific issue with that was. Guessing the dependencies view? Or maybe the lack of duration?
I mentioned similar things to what's here.
One thing we could do here is investigate clock skew when no RPC spans exist in a trace. That could be bumped out as a separate issue, but I think it would only work if "ca" and "sa" are logged.
Measuring the latency between producing to and consuming from Kafka queues is precisely what I'm trying to do, although unlike @tennenbaum, I only considered adding the "cs" annotation upon produce and "sr" upon consume, without worrying about how long it takes the message to be produced and consumed. My problem is that I was attempting to use the ClientRequestInterceptor + ServerRequestInterceptor mentioned in Brave's 3.0.0 API, but it seems that this only submits the span if either ["cs", "cr"] or ["sr", "ss"] are handled in the Interceptor.handle(Adapter) methods. I only needed to handle "cs" and "sr" with the ClientRequestInterceptor and ServerRequestInterceptor, but in order for Brave to submit the spans, I needed to make a dummy adapter and submit "ss" as well, not to close the span, but just so that Brave knew to submit it. This is suboptimal. I'm not even worrying about dependencies or duration.
If your goal is to measure latency of async spans, I think we can find something to work.
If your goal is to make a span with only "cs" and "sr" in it, I'm not interested in helping as it will produce bugs elsewhere we'd have to clean up. The core RPC annotations are made to be used together, so you'd be better off being more flexible about this point.
PS: I spent all my battery on this topic on the flight :) I think there are a few ways to skin this cat, and I tried a couple.
I can share my Kafka code if someone is interested (written in Brave).
Here's the Kafka spike: openzipkin/brave#212
Let's see if we can nail a design down for the Zipkin v1 model here: #1243
Related issue: multiple parents, aka linked traces: #1244
We can safely consider async span representation done (and eventually dusted) since Zipkin v2.
For instance, suppose we want a span to represent the latency between producing a message to a queue and consuming it from the queue (e.g. with Kafka as the queue). In this case the producer call will finish prior to the start of the consumer call. We would like to represent the duration between producing and consuming.
We could use the core RPC annotations in this case to represent the producer call (cs at the start, cr at the end) and consumer call (sr at the start, ss at the end). However, as the client annotations will both occur before the server annotations, looking at the Zipkin code (in particular zipkin.internal.CorrectForClockSkew#apply), this will cause Zipkin to detect clock skew where there is none.
Is there a better way of representing this type of call using the Zipkin data model?
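The ordering problem described in this issue can be illustrated with a small check. This is a simplified sketch of the ordering a synchronous RPC would be expected to follow, not the logic in zipkin.internal.CorrectForClockSkew; the function name and timestamps are made up for illustration:

```python
def follows_sync_rpc_order(cs, sr, ss, cr):
    """True if the timestamps nest the way a synchronous RPC would: cs <= sr <= ss <= cr."""
    return cs <= sr <= ss <= cr


# Synchronous call: the server work happens inside the client's wait.
sync_ok = follows_sync_rpc_order(cs=0, sr=10, ss=40, cr=50)

# Queue hand-off: the producer finishes (cr) long before the consumer starts (sr).
async_ok = follows_sync_rpc_order(cs=0, sr=200_000, ss=230_000, cr=5_000)

# The quantity of interest here is the gap between producing and consuming.
produce_to_consume_micros = 200_000 - 0
```

With perfectly synchronized clocks the queue case still fails the nesting check, which is why a skew heuristic built around synchronous RPC can misread it.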