-
Notifications
You must be signed in to change notification settings - Fork 3.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support async spans #1243
Comments
related issue: multiple parents aka linked traces #1244 |
had a chat with @mbogoevici from spring-cloud-stream. We talked about a special case, where you might want to visualize the trace as being "open" the same way that a root span normally works in RPC. IE the critical path of the pipeline continues until a terminal event occurs. This allows you to see the entire duration clearly. This implies the root span isn't closed immediately when it sent something to a broker, rather it is left open until a terminal event occurs. Something down the line will close it.
|
@adriancole when publish zipkin 2 release,it support MQ and asyn call |
Looks nice. So whenever I use messaging we should go with this approach or is only for fully |
Would be nice if both push (notification/topic) and pull (queue) communication was covered. One tricky thing, which would also be nice to be covered as well, is batching support in both produce (publish), and consume side (push, pull). E.g. one can have multiple publish requests which end up consumed in one pull request. Also, single publish request could publish multiple messages, which get delivered with multiple independent push requests. |
@adriancole <https://github.com/adriancole> when publish zipkin 2 release
we usually discuss when features will land vs certain version numbers. What
features are you interested in?
|
Would be nice if both push (notification/topic) and pull (queue)
communication was covered.
One tricky thing, which would also be nice to be covered as well, is
batching support in both produce (publish), and consume side (push, pull).
E.g. one can have multiple publish requests which end up consumed in one
pull request. Also, single publish request could publish multiple messages,
which get delivered with multiple independent push requests.
good point. we've actually another issue on that topic, as it is trickier
than one-way single producer single consumer: linked traces
#1244
|
@adriancole I interested trace MQ and asyn call |
@adriancole The path of least resistance for async (half-duplex, or messaging), is to follow this model. This prevents us from needing to wait for a big-bang zipkin 2 release of the entire ecosystem. |
@marcingrzejszczak @adriancole the initial discussion was in the context of Stream apps - we like to have a continuous trace that covers the entire pipeline - but I guess most of the reasoning applies in the general case of apps talking to each other over a broker - including linked spans for pub-sub with multiple consumers |
In terms of an update, I have a working POC for Sleuth sans Baggage which I will happy to share once I change it to use the Baggage support that @marcingrzejszczak has added in 1.2 |
@adriancole <https://github.com/adriancole> The path of least resistance
for async (half-duplex, or messaging), is to follow this model. This
prevents us from needing to wait for a big-bang zipkin 2 release of the
entire ecosystem.
gotcha.. no big bangs here (though there are some features that take a
while!) :)
the effort to add messaging span is very little. It is defining constants
and adjusting some things in the UI and clock skew checker. Main reason I
delayed this one was rule-of-three. I think we have enough interested at
this point to move forward with implementing "ms" "mr"
|
@adriancole What‘s the plans,probably when released,Thanks a lot |
I'll update this with a pull request in the next couple days latest by
monday. In the mean time, you can always use the "ms" and "mr" annotations
yourself, they just won't be interpreted by the UI until we upgrade.
|
@lijunyong ps after thinking about it.. I am not going to raise this PR yet. @mbogoevici's work is new information. We should see if that is extending this or an alternative. Plus I'd like to get a tally of at least a couple folks who aren't primarily interested in JVM environments. Finally, we need to vet this against single-host span pattern. If there's anyone on this thread tracing non java environments with opinions on one-way tracing please speak up (cc @jcarres-mdsol @ys @mjbryant @hammett)! Between now and when pull requests to various repos are reviewed and merged, feel free to just use the conventions described for one-way messages. Ex. send "ms" on the way out and "mr" when receiving (just like client+server, except no response is carried back). There's nothing in zipkin that prevents experimenting with this. |
Message based (one way) is 95% of our system. I'd be very interested in native support for this. |
@adriancole I don't understand,Could you provide the example about MQ in github?Thanks a lot |
So this will not mix with the current system? Instead you are saying that ms involves creating a trace2? |
@lijunyong @jcarres-mdsol here's a repaste of a comment on brave which I hope will clarify both of your questions. Please let me know if it doesn't. Here's an example scenario asked by @coolbetm
Service A receives a POST and sends a message to kafka Service B consumes the message and puts a message on a second topic Service C consumes the message and saves to a database |
Why the last CS/CR to itself? |
Why the last CS/CR to itself?
This is a call to a database. I'm assuming the database doesn't speak
zipkin, so we just log "sa" in the client span to indicate the remote side.
|
If serviceC finishes processing the message but does not generate any other zipkin information after |
If serviceC finish processing the message but does not generate any other
zipkin information after mr (for instance we never trace calls to DBs)
how will we know the trace has in fact ended?
Since "mr" closes the messaging span, all spans are closed.. so it is the
same situation as today. A trace is effectively ended when all spans finish.
|
I thought
To do both we need at least one more annotation, no? |
I thought mr was emitted when the worker got the message from the queue.
Performance wise I would want to know two things:
- How long the message was in the messaging bus
After "ms" we can use the normal "ws" to indicate when it went to the bus.
Similarly, we can use "wr" to tell the difference between when we got the
message vs started to process it.
- How long it took the worker to process the message
The messaging span in my mind really should be transit only (don't record
message processing in the span). That's why in my example, I have listed a
child span to process the message. Timewise, this directly follows the
messaging span, so its (start) timestamp will very close or exactly the
same as "mr" of its parent (the messaging span)
To do both we need to tell at least one more annotation, no?
yeah I think probably I didn't highlight "ws" "wr" (which incidentally are
defined also for client and server spans, just not widely used)
|
first step: adding constants to the thrift openzipkin/zipkin-api#24 |
During openzipkin/zipkin-api#24 we noticed this can be complicated when the messaging system is also RPC (such as pubsub or Amazon SQS which are both http services). For example, how would they know if they should add a "ms" or "cs" considering the production of a message is literally an rpc (usually http) call? I noted that we still have the option to not do this change, or to instead repurpose "cs" "sr" to address one-way concerns. For example, thrift and other RPC systems have a one-way specialization. It is important to think about things like this, but also understand tactical vs strategic change. For example, any annotations like this are tactical in my view. I'll try to recap some of this (which came from a gitter chat with @jcarres-mdsol) I think that linked traces are a better fit for messaging than these annotations. The issue with linked traces is we need to change schema (which isn't a problem, just something to get through) The "ms" "mr" was a way to identify a means to handle one-way for those in the brown field, because it requires very little change (similar to if we decided to treat "cs" "sr" missing response, as a special one-way). So, presence of some heuristic for one-way doesn't preclude linked traces, or even handle it better in any way. It is merely a stepping stone. Here are the steps I'm thinking of that lead us forward with regards to messaging:
Both of the above are independent, but once the latter is in, we can add linked traces to it much easier than trying to add to the multi-host span model we have today. Back to the effort here, as far as I can tell, we either proceed as noted or try to repurpose the "cs" and "sr" similarly. Another option is to not do this and head directly towards links. I'll give another couple days to make sure people have seen this and can weigh in if any of this context changes their mind. |
I think we should make a decision as perpetually incomplete work is haunting, particularly as we've discussed this topic for a very long time now, and there are people interested in it. That, or we should make visible what must happen before a decision can be made. We don't want a worst of show, where we've spent lots of time discussing, and gathering interest, but lose them in analysis paralysis. A valid question is what to do when an RPC system is used to carry what is semantically a one-way message. I'd flatten this to the generic topic of one-way RPC, which exists in many systems (regardless if you pay attention to acks or not). This is separate from the concern that led to this issue, which is that a non-RPC system is used to carry single-producer single-consumer messages. The latter concern is addressed by ad-hoc users today, using annotations which may or may not be named "ms" and "mr" :) If we feel those in the latter camp should continue to work ad-hoc, that's a valid thing, but I think people should be more explicit about the desire to halt this, and what, if anything it would be replaced with. We'd then close the issue as won't fix and stop spending time on it. If we feel there's something that must be addressed, let's make test cases for that. They can be simple descriptions. Regardless, we need to find a way to progress this topic. |
Here's a diagram of the latest revision of this, based on a pre new-year chat with @jcarres-mdsol By not defining a new address annotation for the message broker, a lot of code will be similar to today. Should we decide to use new annotations, they would be exactly like existing, except clarifying one-way semantics. that reduces the explanation to...
|
NEXT STEP: Experiment with this design both in server and instrumentation projects One danger area I think there is will be the code involved to handle "ms", "mr" neatly vs just allowing "cs" and "sr". The code impact of this will be present in instrumentation as well as the server (and its UI). For example, instrumentation will amplify this work, as a dozen different tracers may need new if statements etc. We need to compare implementing one-way/unary/messaging as..
In this repo, we can open a branch/pull request would include test cases that include one-way spans, modeling pubsub with pubsub RPC on either side, checking the dependency linker and the clock skew adjuster. This pull request should note experiences that led to either option being preferable. In one or more instrumentation repos, we'd check the two modeling approaches (cs/sr with flush vs ms/mr implicit flush), and in their corresponding pull request note experiences. Once we have the bigger picture, I think we can lock-in and roll out what works best. I'm glad the choices are quite narrow at this point. @openzipkin/instrumentation-owners anyone have time to help experiment on this? |
In brave 4, I think it is easier overall to use "cs" and "sr". I made an example with Kafka under this assumption https://github.com/openzipkin/brave/pull/300/files#diff-0f684bf482d2b5acd65765f8d6f8f24eR35 |
moved the kafka example to a gist as I think it is too heavyweight to be in the normal test tree https://gist.github.com/adriancole/76d94054b77e3be338bd75424ca8ba30 |
I think it is much better to formalize one-way/async as client+server without a response. It is feasible and it is more correct to say one-way/async than say "messaging" knowing we only handle single-producer single-consumer anyway. The linked spans work could pick up from there. I've added a demo in Brave: https://github.com/openzipkin/brave/tree/master/brave/src/test/java/brave/features/async I know the clock skew adjuster will change, but it will be less work than adding new annotations, as it already special cases skew found when "sr" happens before "cs". I thought that it might break single-host spans in the dependency graph until I realized that single-host spans are already broken! In other words, regardless of one or two-way, the dependency graph would need code anyway. So summary is that we should forget about "ms"->"mr" and go with "cs"->"sr", which reduces work on all parties to support this feature. If anyone feels against this, please make it known! |
ps I chatted with @fedj and don't think we can help correct clock skew on async-spans. The reason is that the current approach uses the two views of duration in the same span to interpret skew. Since async would have only a single view of latency, it won't work. This means either people need to keep their clocks in sync, or they need to resolve skew as a part of the reporting pipeline somehow. The problem will be the same regardless of which annotations are used except in the case where messaging is implemented under the covers as normal RPC. In that case, all 4 annotations will be present and then skew can correct automatically. |
Another topic of interest when we get to documenting. If someone is trying to model an "ack", they should be careful not to use "cr" since it isn't semantically the same (as cr implies a round trip of an operation). When modeling ack, it would be better to use "ws" wire send or just make up another annotation (ex. kafka.ack). |
Before, we only supported backfilling timestamp and duration for spans that have at least two annotations. This special cases an incomplete one-way span (ex only cs), as well its complete duration (sr - cs). See #1243
added #1497 to make one-way spans via cs -> sr work |
One-way tracing is supported via openzipkin/zipkin#1243 This documents how that works
One-way tracing is supported via openzipkin/zipkin#1243 This documents how that works
fyi, those who need to trace kafka via one-way approach can likely do so using 0.11 Brave implementation may follow here openzipkin/brave#211 |
In a recent tracing workshop, we were able to elaborate a bit on async tracing. In many cases, new traces are not needed. Ex if propagation does not imply sharing spans, you can do single producer multiple consumer as normal spans in the same trace. Here's a diagram describing most concerns
|
closing as async one-way is done. lets move to #1654 for multiple consumers |
We've had many issues about representing one-way activity. This is an attempt to define the most semantic means to address this concern in zipkin v1.
In zipkin, we have an RPC span, where two hosts participate. They participate by adding bookend annotations to a single span. This allows for many conveniences including hosts able to report incomplete work, and report in any order.
For example, a typical RPC span is ["cs", "sr", "ss", "cr"], for a dual host (client+server) span, or either of its partial ones ["sr", "ss"](uninstrumented client) and ["cs", "cr"](uninstrumented server).
The path of least resistance for async (half-duplex, or messaging), is to follow this model. This prevents us from needing to wait for a big-bang zipkin 2 release of the entire ecosystem.
My proposal is to add a messaging span: "ms" and "mr".
The message sender adds a "ms" when a one-way message is sent. They are free to add "ws" to indicate it going on the wire.
The message receiver adds a "mr" when a one-way message is received. They are free to precede that with "wr" to indicate one being processed from the wire.
This works like the RPC spans, but to pin it down, here are example spans.
["mr"] <- message received from an uninstrumented sender. The host of "mr" is the processor.
["ms"] <- message sent to an uninstrumented recipient, or the message was dropped. The host of "mr" is the sender.
["ms", "mr"] <- message sent and received from instrumented hosts. Similarly, the host of "ms" is the sender and the host of "mr" is the receiver.
The duration of a complete messaging span can be a lot longer than an RPC span, potentially orders of magnitude longer than the processing time. Once we have a means to indicate this (via "ms" "mr"), we can teach the UI to collapse the timeline or do other tricks.
Messaging spans need to be propagated, just like RPC ones. For example, storing the binary version of a trace id as metadata on the message is appropriate.
This design adds pressure for "linked traces" especially when messages are needed to create a new outbound activity. This is out of scope of this design, but not precluded by it. In other words, we need to take on "linked traces" separately.
Fixes #1189
Fixes #925
The text was updated successfully, but these errors were encountered: