How to speed up consumer action in Jaeger ingester #2927
Comments
Label: help wanted
Are you able to provide the metrics from both the ingester and the collector components?
Not sure which metrics you would like to check. I've shared all the data with you in a zip file. Thanks
@jpkrohling Any findings or suggestions? Thanks
No, sorry, not yet. @albertteoh, you also have some experience with the ingester, right? Are you able to take a look?
Unfortunately, I have very little practical prod experience with the ingester so I can't promise quality help here. Having said that, taking a cursory look at the metrics provided shows some fairly large offset lags:
I'm not entirely sure why some lags are 0 and some are so large; maybe it's because the ingesters are only consuming from one partition? If that's the case, then perhaps these metrics look okay, since each partition has at least one 0 lag. But if each ingester is trying to consume from each partition, then this could be a problem. In my past experience, one possible cause of ingesters struggling to keep up with consuming from Kafka is that the backing store (Elasticsearch in your case) is struggling to keep up with the write load. Do you have any numbers on your Elasticsearch cluster load? Maybe this might help: https://www.elastic.co/blog/advanced-tuning-finding-and-fixing-slow-elasticsearch-queries.
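One quick way to get a number on Elasticsearch write pressure is to look at the bulk/write thread pool stats. Below is a minimal sketch in Go, assuming a plain HTTP endpoint at `elasticsearch:9200` (a placeholder host); on ES 6+ the relevant thread pool is named `write`, on older clusters it was `bulk`:

```go
package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
)

func main() {
	// Placeholder host; adjust the thread pool name for your ES version.
	url := "http://elasticsearch:9200/_cat/thread_pool/write?v&h=name,active,queue,rejected"

	resp, err := http.Get(url)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		log.Fatal(err)
	}
	// A steadily growing "rejected" count is a strong sign that ES cannot
	// keep up with the bulk write load coming from the ingesters.
	fmt.Println(string(body))
}
```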
After our tuning, we found that the bottleneck is currently the ingester, not Elasticsearch. Even when we add more partitions and more ingesters, we still cannot improve the ingesters' capacity to fetch data from the Kafka partitions. pstack 25895
How did you come to this conclusion?
Can we answer the question of what the ingester is waiting on? Ingester pulls span messages from kafka and synchronously writes this span to the Elasticsearch before pulling the next span message from its partition. Could the ingester be waiting on the "OK, the span write was successful" response from ES?
I'm assuming you mean the offset lag is still present; is it still increasing at the same rate (or getting worse)? If so, this seems to more strongly suggest a bottleneck in Elasticsearch, since there are more ingesters to pull messages from Kafka yet the lag is not improving, indicating the ingesters are blocked on writing out spans (before they can consume more Kafka messages). Do you have numbers around ES performance (e.g. CPU, memory pressure)? This seems to be a useful resource.
@albertteoh I am also thinking about your comments:
@judyQD thanks for correcting me :) my assumption was incorrect and indeed jaeger does use the ES bulk API. What I suggest to do as next steps is to:
I threw together a quick dashboard of a locally running ingester with write success rates and latency histograms, happy to share the dashboard if you like (the +Inf error write was probably because I killed elasticsearch):
@judyQD did you manage to improve the ingester's consumption rate?
@albertteoh Thanks for your nice help. Actually, no. I found that the Jaeger ingester uses the "github.com/Shopify/sarama" package as its Kafka consumer. I changed that package's max fetch size, which improved consumer performance. But the consumption rate is still not fixed, and I have no idea how to increase it further.
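For reference, tweaking sarama's fetch size looks roughly like the sketch below. The broker address and values are placeholders, and the real ingester wires its consumer through a consumer group rather than a standalone consumer, so treat this as an illustration of the knobs only:

```go
package main

import (
	"log"

	"github.com/Shopify/sarama"
)

func main() {
	cfg := sarama.NewConfig()
	cfg.Version = sarama.V2_0_0_0

	// Illustrative values only: larger fetch sizes let each request pull
	// more bytes per partition, at the cost of memory.
	cfg.Consumer.Fetch.Default = 10 * 1024 * 1024 // bytes requested per fetch
	cfg.Consumer.Fetch.Max = 32 * 1024 * 1024     // upper bound per fetch
	cfg.ChannelBufferSize = 1024                  // messages buffered in sarama's internal channels

	consumer, err := sarama.NewConsumer([]string{"kafka:9092"}, cfg)
	if err != nil {
		log.Fatal(err)
	}
	defer consumer.Close()
	// ... consume partitions here; the ingester's actual consumer-group setup differs.
}
```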
@judyQD Did you manage to get some measurements from the ingester and try modifying some of its parameters #2927 (comment)? For example, perhaps adding more workers or influencing the flush rate by count, duration or size may help increase the consumption rate?
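For concreteness, those knobs map onto Elasticsearch bulk-processor settings roughly as in the sketch below, using the olivere/elastic client. The values are placeholders, and (if I recall correctly) jaeger-ingester exposes similar dials as `--es.bulk.workers`, `--es.bulk.actions`, `--es.bulk.size` and `--es.bulk.flush-interval`:

```go
package main

import (
	"context"
	"log"
	"time"

	"github.com/olivere/elastic/v7"
)

func main() {
	client, err := elastic.NewClient(
		elastic.SetURL("http://elasticsearch:9200"), // placeholder host
		elastic.SetSniff(false),
	)
	if err != nil {
		log.Fatal(err)
	}

	// The dials referred to above: more workers and earlier flushes can raise
	// write throughput to ES, which relieves back-pressure on the Kafka consumer.
	bulk, err := client.BulkProcessor().
		Workers(4).                            // parallel bulk requests
		BulkActions(1000).                     // flush after N documents...
		BulkSize(5 << 20).                     // ...or after ~5 MB...
		FlushInterval(200 * time.Millisecond). // ...or after this much time
		Do(context.Background())
	if err != nil {
		log.Fatal(err)
	}
	defer bulk.Close()

	// Per document: bulk.Add(elastic.NewBulkIndexRequest().Index("jaeger-span").Doc(span))
}
```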
@albertteoh Thanks for your comments. The parameters you list focus on the function that stores data into ES.
@judyQD If I'm not mistaken, the "Kafka consumer goroutine" (after consuming the span from the topic) eventually attempts to write/add the span to a buffer that is eventually bulk written to ES. I suspect the bottleneck is at the point where the Kafka consumer goroutine offloads the msg from Kafka via a write to an unbuffered channel. The reason:
To illustrate the problem of writing to an unbuffered channel while there is no "reading" goroutine, I've put together this short Go program: https://play.golang.org/p/K88GTBJJwHL. If you comment out the sleep, you'll see the consumer goroutine is no longer blocked writing to the channel. This is why I think we could explore the parameters for tweaking the ES writes, as this may be causing back-pressure on the Kafka consumer. So, perhaps increase the number of workers or other flush-related parameters to hopefully increase write throughput to ES. I think it would also be good to measure the bulk index insert counts, etc. above, to provide empirical evidence that an increase in workers, etc. correlates to an increase in inserts and, in turn, correlates to an improvement (or not) in the offset lag. I believe the ingester also exposes these Kafka consumer metrics.
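In case the playground link goes away, the idea it demonstrates is roughly the following minimal sketch (not the ingester's actual code): a sender blocks on an unbuffered channel whenever the receiver is busy, which is exactly what back-pressure from a slow writer looks like.

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	ch := make(chan string) // unbuffered: every send blocks until a receive is ready

	// "Kafka consumer" goroutine: hands each message to the writer via the channel.
	go func() {
		for i := 0; i < 3; i++ {
			msg := fmt.Sprintf("span-%d", i)
			start := time.Now()
			ch <- msg // blocks while the writer is busy (back-pressure)
			fmt.Printf("handed off %s after waiting %v\n", msg, time.Since(start))
		}
		close(ch)
	}()

	// "ES writer": a slow receiver simulating a slow bulk flush.
	// Comment out the Sleep and the sends stop blocking noticeably.
	for msg := range ch {
		time.Sleep(500 * time.Millisecond)
		fmt.Println("wrote", msg)
	}
}
```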
I know this is an old issue, but recently I tried to improve ingester performance and landed on this issue, so I wanted to share what I did and how it helped us. Thanks to @albertteoh's suggestions, these are the steps I took.
These 2 steps did not increase performance. Then, taking the suggestions/ideas from above and applying them, I was able to increase the document index rate of my Elasticsearch nodes by 50%, and the consumer lag on the ingesters went down immensely.
Maybe the reason is that the jsonpb package is so slow.
@albertteoh would you mind sharing that Grafana dashboard for monitoring the Jaeger ingester? Is it only for the Jaeger ingester, or does it maybe also cover the Collector / Query services too?
jaeger-v2 will have a different implementation, closing this
Describe the bug
I am using tracegen --> collector --> kafka --> ingester --> ES.
And I found that it's kind of slow for the ingester to pull spans from Kafka.
Ingester configuration is as below:
Screenshots