Zipkin collector performance issues #940
Comments
I think one problem is that the kafka receiver is literally writing span-at-a-time, while the span stores accept spans in bulk. I'm presuming the span stores are optimized for bulk, but we haven't benchmarked this. Note there's another related discussion going on w/ @yurishkuro and @danchia. For example, Yuri had a suggestion about intermediating here. #961 (comment) |
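For illustration, here is a minimal sketch of what bulk hand-off to the span store could look like. The `Span` and `SpanStore` types, the `storeSpans` method, and the flush threshold are stand-ins for this sketch, not Zipkin's actual API:

```java
import java.util.ArrayList;
import java.util.List;

// Stand-ins for Zipkin's types; the real interfaces differ.
interface Span {}
interface SpanStore { void storeSpans(List<Span> spans); }

// Accumulates spans decoded from individual Kafka messages and hands the
// store one bulk write instead of one write per message.
class BufferingReceiver {
  private final List<Span> buffer = new ArrayList<>();
  private final SpanStore store;
  private final int maxBatch = 100; // illustrative flush threshold

  BufferingReceiver(SpanStore store) {
    this.store = store;
  }

  // Called once per decoded Kafka message.
  void onSpan(Span span) {
    buffer.add(span);
    if (buffer.size() >= maxBatch) flush();
  }

  // One bulk call instead of maxBatch single-span calls.
  void flush() {
    if (buffer.isEmpty()) return;
    store.storeSpans(new ArrayList<>(buffer));
    buffer.clear();
  }
}
```

In practice a time-based flush would also be needed so a quiet topic doesn't hold buffered spans indefinitely.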
I am down for benchmarking any tweaks that we have in mind, like batching cql queries or optimizing span stores for bulk. I think even adding the ability for the collector to receive a trace-at-a-time could improve the performance by decreasing the total number of kafka messages. What do you think about that? |
It's difficult to implement trace-at-a-time, since different parts/spans of the same trace can potentially be processed by different collector instances. There's no "stickiness". |
Even within the same collector, because a trace is composed of many different spans from different machines, I don't think we can avoid multiple messages. That said, if I'm not wrong, the per-message overhead for Kafka is pretty low, and there are tunables to help with higher message volumes (batching on the producer side, and fetch size on the consumer side). |
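For reference, these are the kinds of knobs meant above. The snippet assumes the modern Kafka Java clients, so the exact property names may differ from the Kafka version Zipkin shipped with at the time, and the values are purely illustrative:

```java
import java.util.Properties;

class KafkaTuningSketch {
  public static void main(String[] args) {
    // Producer side: batch many small span messages into fewer requests.
    Properties producer = new Properties();
    producer.put("batch.size", "65536"); // bytes to accumulate per partition before sending
    producer.put("linger.ms", "5");      // wait briefly so batches can fill

    // Consumer side: fetch larger chunks per request instead of many tiny reads.
    Properties consumer = new Properties();
    consumer.put("fetch.min.bytes", "65536");             // broker holds the fetch until this much data is ready
    consumer.put("max.partition.fetch.bytes", "1048576"); // upper bound per partition per fetch

    System.out.println(producer + "\n" + consumer);
  }
}
```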
@eirslett I meant the collector should be able to handle a bundle of different spans of the same trace, even if more spans of that trace arrive again later.
In other words, there should be a facility to bundle up spans 1-3 and 4-6 together, and let the collector separate those out.
I say this because we see that each span, whether its data is large or small, takes around the same ~90ms. If @adriancole is correct and we are limited by the kafka receiver, this should lead to a good perf win for us, since our tracer can bundle the spans together and send them. |
@prat0318 Do you have your kafka consumer settings handy? A long time ago the defaults in zipkin were not very good, causing offset updates to ZooKeeper very often (which was super expensive). |
@danchia I am using all of zipkin's defaults. I think the offset update happens every 10s by default (if I am not wrong). Please let me know if there is a way to see the consumer config from the outside. |
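For context, with the Kafka 0.8 high-level consumer, the ZooKeeper offset-commit cadence is governed by `auto.commit.interval.ms`. The sketch below only illustrates the relevant properties; the values shown are not Zipkin's shipped defaults:

```java
import java.util.Properties;
import kafka.consumer.ConsumerConfig;

class OffsetCommitSketch {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.put("zookeeper.connect", "localhost:2181"); // illustrative address
    props.put("group.id", "zipkin");                  // illustrative group
    props.put("auto.commit.enable", "true");
    props.put("auto.commit.interval.ms", "10000");    // commit offsets to ZooKeeper every 10s
    ConsumerConfig config = new ConsumerConfig(props); // validates the settings
    System.out.println(props);
  }
}
```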
Aah! I see now what you mean. |
@eirslett but the bottleneck is on the collector end when it is reading the spans. It is currently constrained to receive one span per message at a time. So, I was thinking of changing that.
Exactly. SpanStore doesn't require the bundle of spans it receives to be in the same trace. A few tracers do send bundles at a time, subject to either span count or bundle size.
Right now, the Kafka receiver literally reads only one span from a message. It doesn't matter whether the span is 200 bytes or 2 megs.
I think the first thing we can try is just allowing the receiver to accept a bundle. I'm not sure whether trying to read a list when only a single span is present works in thrift.
Once the receiver can accept N spans (aka a bundle), then instrumentation can choose how many to send per message.
Make sense?
|
That means it should not be hard to make changes on the collector end if we want to receive a bundle of spans, right?
I'm confused how they can do that if they use the same collector, though. Does the collector support this yet?
I think the important point is to keep it backwards compatible: the collector should be able to accept both a list of spans and a single span correctly. Overall, I am very excited to try out the idea of span bundles. |
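To make the backwards-compatible idea concrete, here is a rough sketch (not Zipkin's actual decoder) of distinguishing a single thrift-encoded span from a thrift-encoded list of spans by peeking at the first byte: with TBinaryProtocol, a top-level list starts with its element type (TType.STRUCT = 12), while a single struct starts with the field type of its first field. `Span` is assumed to be the thrift-generated span class, and the trick assumes no span begins with a struct-typed field:

```java
import java.util.ArrayList;
import java.util.List;
import org.apache.thrift.TException;
import org.apache.thrift.protocol.TBinaryProtocol;
import org.apache.thrift.protocol.TList;
import org.apache.thrift.protocol.TType;
import org.apache.thrift.transport.TMemoryInputTransport;

class SpanDecoder {
  /** Accepts either one serialized Span or a serialized list of Spans. */
  static List<Span> decode(byte[] bytes) throws TException {
    TBinaryProtocol proto = new TBinaryProtocol(new TMemoryInputTransport(bytes));
    List<Span> result = new ArrayList<>();
    if (bytes.length > 0 && bytes[0] == TType.STRUCT) { // looks like a list of structs
      TList header = proto.readListBegin();
      for (int i = 0; i < header.size; i++) {
        Span span = new Span();
        span.read(proto);
        result.add(span);
      }
      proto.readListEnd();
    } else {                                            // fall back to a single span
      Span span = new Span();
      span.read(proto);
      result.add(span);
    }
    return result;
  }
}
```

With something like this, old tracers that send one span per message keep working while new tracers can send bundles.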
I opened #979 about multiple spans. Still, there's a question about 88 milliseconds going unaccounted for.
Currently, the collector itself isn't instrumented, so it is hard to tell where the problem could be.
Is it possible for you to run the collector against a different span store? If, for example, you can run using collector-dev.scala, you could possibly narrow down any non-kafka-related delays.
|
We should close this issue with a pull request updating zipkin/zipkin-collector/kafka/README.md with notes on how to achieve the best performance. There are some notes here: https://docs.google.com/document/d/1Px44fjZ37gr05lV7UFo8AfrWZCcJHCuv58290XCbDaw/edit#bookmark=id.eeozlmh0fxr |
#3152 addressed this. If there are any other expert-level tweaks sites have had success with, we'd be glad to hear them! |
The Zipkin collector, when consuming from a kafka topic and writing to cassandra, takes around 90 ms per trace, limiting its speed to about 11 traces/s. With more and more services adding traces, this becomes a bottleneck. It can be alleviated by increasing the # of kafka partitions and increasing KAFKA_STREAMS to match the partitions. But I still wanted to check where those 90 ms are being spent. I earlier thought the cassandra writes were the culprit, but surprisingly they are not.
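As an aside on the KAFKA_STREAMS knob just mentioned: with the old high-level consumer, the stream count maps to consumer threads roughly as sketched below (topic name, thread count, and addresses are illustrative, and this is not Zipkin's actual receiver code). Since a partition is consumed by at most one thread in a consumer group, streams beyond the partition count simply sit idle:

```java
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import kafka.consumer.Consumer;
import kafka.consumer.ConsumerConfig;
import kafka.consumer.ConsumerIterator;
import kafka.consumer.KafkaStream;
import kafka.javaapi.consumer.ConsumerConnector;

class KafkaStreamsSketch {
  public static void main(String[] args) {
    int streams = 4; // would be set to match the topic's partition count

    Properties props = new Properties();
    props.put("zookeeper.connect", "localhost:2181");
    props.put("group.id", "zipkin");
    ConsumerConnector connector =
        Consumer.createJavaConsumerConnector(new ConsumerConfig(props));

    // Ask Kafka for `streams` streams on the topic; each gets its own thread.
    Map<String, List<KafkaStream<byte[], byte[]>>> topicStreams =
        connector.createMessageStreams(Collections.singletonMap("zipkin", streams));
    for (KafkaStream<byte[], byte[]> stream : topicStreams.get("zipkin")) {
      new Thread(() -> {
        ConsumerIterator<byte[], byte[]> it = stream.iterator();
        while (it.hasNext()) {
          byte[] message = it.next().message(); // decode and store the span(s) here
        }
      }).start();
    }
  }
}
```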
When run as a single thread, the logs show:
The interesting thing is the time gap between the last two lines. I am not sure what the collector is waiting on during those 88 ms. The gap still remains with 5 workers, though it doesn't get multiplied by 5.
I am not sure if this is a known issue, but I wanted to have a discussion about whether we can debug the reason for that gap.