Spans lost with kafka and elasticsearch (over capacity) #2023
Comments
I agree this is an issue that, with better knowledge about the collector, we could handle differently. For example, we could do blocking writes. |
Hi,
As a result, spans are no longer dropped under heavy load. Manuel and I have already tested it. |
Ya, this would work for Kafka but not HTTP, since with HTTP it would block the application's reporting thread. The trick will be to have some config so that the storage can know the collector is pull-based; in the worst case this is a property.
PS: I am out for a couple of weeks, so I won't be personally responding until then. Take care
|
OK, please let me know if you are interested in a pull request for this issue with Elasticsearch, and feel free to discuss the details. |
PS: I think the easiest way is to add a CollectorComponent.blockOnStorage option. This would let us wait until another request slot is available, and it would suit most storage implementations. Even if Elasticsearch returns unavailable, it will at least wait for that to occur. |
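To illustrate the intent only (the blockOnStorage flag and this gate class are hypothetical, not an existing Zipkin API), the behavioral difference would be roughly:

```java
import java.util.concurrent.Semaphore;

/** Hypothetical sketch of a "block on storage" switch; not an existing Zipkin API. */
class StorageGate {
  private final Semaphore capacity;
  private final boolean blockOnStorage;

  StorageGate(int maxInFlight, boolean blockOnStorage) {
    this.capacity = new Semaphore(maxInFlight);
    this.blockOnStorage = blockOnStorage;
  }

  /** Returns true if the caller may write now; false means the spans would be dropped. */
  boolean admit() throws InterruptedException {
    if (blockOnStorage) {
      capacity.acquire(); // pull-based collectors (Kafka/RabbitMQ) can afford to wait
      return true;
    }
    return capacity.tryAcquire(); // HTTP must not block the reporting application
  }

  void release() {
    capacity.release();
  }
}
```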
We are facing the same problem. @malonso1976 were you still hoping to create a pull request? |
FYI, pull requests are indeed welcome on this.
Essentially this is about the pull model, and in a pull model you should be able to block the thread with zero impact. Blocking the thread is definitely not OK for the HTTP listener, as it can cause applications to use more resources; it is only appropriate when pulling from Kafka or RabbitMQ (or another buffer).
You just need to somehow inventory what is actually wrong, so you know when to block or reduce the rate. Or maybe use a tool like this, which tries to solve that adaptively:
https://github.com/Netflix/concurrency-limits
|
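For reference, a minimal sketch of how the Netflix concurrency-limits library is commonly used; the class and method names follow its README, but treat the exact wiring as an assumption rather than anything Zipkin ships:

```java
import com.netflix.concurrency.limits.Limiter;
import com.netflix.concurrency.limits.limit.VegasLimit;
import com.netflix.concurrency.limits.limiter.SimpleLimiter;

import java.util.Optional;

class StorageWriteLimiter {
  // Adaptive limit that grows or shrinks based on observed latency.
  private final Limiter<Void> limiter =
      SimpleLimiter.newBuilder().limit(VegasLimit.newDefault()).build();

  /** Returns true if the write was attempted, false if it was shed by the current limit. */
  boolean tryWrite(Runnable storageWrite) {
    Optional<Limiter.Listener> listener = limiter.acquire(null);
    if (!listener.isPresent()) return false; // over the adaptive limit: shed or re-queue
    try {
      storageWrite.run();
      listener.get().onSuccess(); // success latency feeds the limit algorithm
      return true;
    } catch (RuntimeException e) {
      listener.get().onDropped(); // treat failure as a signal to lower the limit
      throw e;
    }
  }
}
```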
Sure, I will. I hope I find time in the next few days |
I have a pull request ready to be submitted. I don't have permission to push it to this repo; how can I send it back to you? |
make a fork of zipkin and raise a pull request on a branch from your fork.
thanks!
|
Done |
Hi, I tried another approach: in the HttpCall class, inside enqueue, I gave tryAcquire a timeout of a few seconds. That seems to solve the problem for us. Let me know what you think. |
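A minimal sketch of that idea (the field names and surrounding class are placeholders, not the actual HttpCall internals; in the real code the permit is released from the HTTP callback rather than a finally block):

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

class TimedAcquireExample {
  // Stand-in for the semaphore that caps in-flight Elasticsearch requests.
  private final Semaphore inFlight = new Semaphore(64); // e.g. ES_MAX_REQUESTS

  void enqueue(Runnable call) throws InterruptedException {
    // Instead of failing immediately when over capacity, wait a few seconds
    // for an in-flight request to finish before giving up.
    if (!inFlight.tryAcquire(5, TimeUnit.SECONDS)) {
      throw new IllegalStateException("over capacity");
    }
    try {
      call.run();
    } finally {
      inFlight.release();
    }
  }
}
```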
Hi Dan, Adrian told me to use concurrency-limits because it would make the solution cover another scenario: direct Zipkin span writes through the REST API. It makes sense... I could not test it under load, but it is on my task list. Have you set up the Elasticsearch capacity settings according to the concurrency queue? |
By "concurrency queue" you refer to the "limit"/"concurrency" in the configuration? |
Sure, write your email in this thread and I will mail you back |
If I get a chance I'll try it on my end and see what happens. No promises on the timeline :( |
Has it been fixed? |
There is no change. I don't remember the last issue we updated on this topic, as the discussion split a few times. The main concerns:
* knowing that the problem is an underlying resource issue that will resolve (as opposed to something that will never resolve). We need a stable signal like an exception type.
* then, changing transports that buffer (such as RabbitMQ and Kafka) to either not ack or push back lost spans in scenarios like this.
* then, assembling some back-off or rate-limiting mechanism so that we don't pummel an overloaded backend.
There is engineering work to do in order to resolve this. So far we have implementations left incomplete, which is in some ways worse than none. What we need is a volunteer to champion this issue.
|
I'm currently working on something that should help minimize the impact of this. The general idea is to allow Zipkin to do some buffering rather than just have a hard cutoff. This should help greatly in times of spiky requests and sluggishness on the Elasticsearch side. It won't address any of the items you mentioned @adriancole, but should still help. What I need to know, if anyone can offer some advice, is how to get a config property into HttpCall. Currently it's kind of inferred from OkHttp's configuration, which is not great for general config properties. |
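For context, the usual Spring Boot pattern for surfacing such a setting looks roughly like the sketch below; the property name and the holder class are hypothetical, not an existing Zipkin option:

```java
import org.springframework.beans.factory.annotation.Value;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
class HypotheticalEsQueueConfiguration {

  // Hypothetical property; not an existing Zipkin setting.
  @Value("${zipkin.storage.elasticsearch.max-queued-calls:1000}")
  int maxQueuedCalls;

  @Bean
  QueueThreshold queueThreshold() {
    return new QueueThreshold(maxQueuedCalls);
  }

  /** Simple holder the storage code could consult before enqueuing another call. */
  static class QueueThreshold {
    final int maxQueuedCalls;

    QueueThreshold(int maxQueuedCalls) {
      this.maxQueuedCalls = maxQueuedCalls;
    }
  }
}
```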
Hi, I think it could break the abstraction to share config between collection and a very specific storage implementation (OkHttp is a very specific part of how we implement ES, and ES should not be coupled to the transport).
Can you give an overview of your design? It's best to start with the design rather than a special-case pull request.
Even if your plan is to keep some data in memory (at the risk of OOM, etc.) we can consider it; I'd just like more details of the approach before getting into specifics like OkHttp config.
|
The high-level plan is to replace the existing Semaphore in HttpCall with something that allows for a queue of some kind. My original plan was to update the OkHttpClient.Builder in ZipkinElasticsearchOkHttpAutoConfiguration to have a Dispatcher with an ExecutorService that allowed for a fixed-length queue. Then I realized that the Dispatcher maintains its own unbounded queue (which is a real problem and should be fixed/configurable in OkHttp itself, but I digress), so the ExecutorService doesn't help.

Now my general idea is to just inspect the Dispatcher queue length to see if we exceed a certain threshold (this is where I want an autoconfigure property but can't easily get one) and drop requests if we do. Given that ZipkinElasticsearchOkHttpAutoConfiguration exists, I assumed it'd be OK to have a setting from it accessible in HttpCall. But since I'm not seeing any obvious way to do the plumbing, I suppose it's better not to. However, if I don't, then I'm not quite sure where else to put the threshold check. I could always wrap everything up in an ExecutorService somewhere (like ElasticsearchSpanConsumer) so that things are easy to control, but I'm not sure how I'd properly transition all existing configurations over to using that without some unexpected behavior.

My main concern here is the usage of the HttpBulkIndexer in ElasticsearchSpanConsumer, because of the issue that brought about #1760. The fix for that was, perhaps unfortunately, too broad and limited any interaction with ES. The change I'm proposing is more focused on the actual collection of spans, which can cause backups and OOM errors. To emphasize one point: the purpose of the queue is to help work through spike load and semi-sluggish flushing to ES. OkHttp's unbounded queue seems like the real source of the problem in #1760, and what I think the Semaphore was trying to work around, and therefore what I want to as well, but in a less aggressive way.

Unfortunately, I'm not sure how to relate any of this back to Kafka, because we report to Zipkin over HTTP, and the ElasticsearchSpanConsumer doesn't seem to care whether you do that or use Kafka. |
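A rough sketch of the threshold check being described, assuming OkHttp 3.x's Dispatcher API (queuedCallsCount()); where the threshold value comes from is left open:

```java
import okhttp3.Call;
import okhttp3.Callback;
import okhttp3.OkHttpClient;
import okhttp3.Request;

class BoundedDispatchExample {
  private final OkHttpClient client;
  private final int maxQueuedCalls; // would ideally come from configuration

  BoundedDispatchExample(OkHttpClient client, int maxQueuedCalls) {
    this.client = client;
    this.maxQueuedCalls = maxQueuedCalls;
  }

  /** Enqueues the call only if the Dispatcher's backlog is below the threshold. */
  boolean tryEnqueue(Request request, Callback callback) {
    // The Dispatcher keeps its own unbounded queue of ready calls; cap it ourselves.
    if (client.dispatcher().queuedCallsCount() >= maxQueuedCalls) {
      return false; // over threshold: drop (or otherwise reject) this batch of spans
    }
    Call call = client.newCall(request);
    call.enqueue(callback);
    return true;
  }
}
```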
Until I get a response to my above comment, I think I'll pursue the idea of removing the Semaphore and adding an ExecutorService in ElasticsearchSpanConsumer (or ElasticsearchStorage) to help decouple the ES component from OkHttp. Hopefully that addresses your initial concern @adriancole. Regardless, we can always iterate on it a bit. |
Please work in a branch. This type of change seems like a half implementation of our asynchronous reporter, and is likely going to cause problems that the reporter has already solved.
|
Going to run it on our servers for a bit before even attempting a pull request to make sure it is stable. But when I commit it, it'll definitely be on a branch :) Is there any reference material on the asynchronous reporter I can review to make sure I'm not reinventing a wheel? Is there an ETA on when that'll come out? And you're not talking about the reporter in Brave, correct? |
@Logic-32 The main thing is that there are so many different types of issues :) They should be addressed independently. The backlog not being properly addressed in #1760 is simply a flaw or oversight. My guess is that you can create this as a test and break it. If it isn't a fault in our design (letting the queue grow unbounded), then we can also ask OkHttp about how to change it; there's sometimes a way. Can you raise an independent issue on the unbounded-queue problem with the description you gave? |
On the async reporter: it has a bundling feature with size limits on both the message sent to the host and the backlog. What I mean is that if we're not using a durable queue, then we are back to memory. If in memory, we have to be careful about the implementation of the bounded queue so as to not OOM, and the AsyncReporter has code to do this. I would suggest possibly a reporter adapter to do the bulk message if we get to this point.

However, there is still the concern of knowing when you should retry, which can also be handled separately. ES will definitely fail for unknown amounts of time, so normalizing the exception will help, especially as sometimes the cluster is just dead as opposed to saying "my queue is full".

The main point is that there are several different issues, and opening one per case is a good idea so that they can be tracked. For example, just as you maybe don't care about a durable queue, those with durable queues probably don't want the risk of another memory queue :P So a memory-backed collector queue for storage overloads is its own concern, in other words.

I have to run, but it's easiest to chat on Gitter, too. Thanks for the help; I think there are some flaws we can fix. |
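For reference, the bounded buffering being referred to looks roughly like this on the reporter side, using zipkin-reporter's AsyncReporter; the specific limits below are placeholders:

```java
import zipkin2.Span;
import zipkin2.reporter.AsyncReporter;
import zipkin2.reporter.okhttp3.OkHttpSender;

class ReporterExample {
  static AsyncReporter<Span> buildReporter() {
    OkHttpSender sender = OkHttpSender.create("http://zipkin:9411/api/v2/spans");
    return AsyncReporter.builder(sender)
        .queuedMaxSpans(10_000)           // bound the in-memory backlog by span count
        .queuedMaxBytes(5 * 1024 * 1024)  // ...and by byte size, so it cannot OOM
        .messageMaxBytes(500_000)         // cap each bundled message sent to the host
        .build();
  }
}
```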
Can you look at this and help contribute to it? If you register a wiki user, I can give you write permissions so you can help elaborate.
https://cwiki.apache.org/confluence/display/ZIPKIN/Collector+surge+and+error+handling
|
Sorry for the delay, would you like me to open a PR on that small adjustment (I know it is not the full solution you're looking for, but it does help in the case of HTTP)? |
The tryAcquire() approach with a timeout should not have any negative impact on those using the HTTP API as long as the timeout is kept reasonable. I'm not sure what reasonable is, but <= 1 minute should be good. That'll give Elasticsearch enough time to flush some stuff but not cause a huge backup of HTTP buffer items. |
The more common use of the HTTP API is things like span reporters. If you look at the AsyncReporter, there is one thread in a blocking loop. If you block it for one minute, then unless traffic is very low the buffer will fill and spans will drop on that side. I don't think blocking clients is, in general, a nice thing to do for telemetry systems. For example, not blocking clients is what led to storage being async, and that was for concerns much shorter than a second. Keep this in mind!
|
Shoot, that's right. The Semaphore is technically pre-enqueue so it is on the ingestion thread which would block sending a response. Didn't think about that earlier :( FWIW, I'm testing my changes for #2481 in our production environment soon. Still no ETA on when I'll make a pull request and certainly no promises on it being acceptable. Plus there is some Apache stuff to work through now. But I'll try to stay on top of it :) |
@Logic-32 thanks for the status update and keep up the good work. This change has needed a champion for a while and I'm glad you are considering all angles mentioned! |
#2502 now includes test instructions; please give it a try |
What kind of issue is this?
We have Zipkin 2.7.1 reading from Kafka with zipkin-autoconfigure-collector-kafka10-2.7.1-module.jar, and the storage is Elasticsearch. With a high number of spans, we found that we are losing spans. We are checking the metric:
counter.zipkin_collector.spans_dropped.kafka
In the logs we get the exception:
We increased ES_MAX_REQUESTS but it did not fix the problem.
zipkin/zipkin-storage/elasticsearch/src/main/java/zipkin2/elasticsearch/internal/client/HttpCall.java
Lines 85 to 91 in 972072a
I think that, in the case of reading spans from Kafka, if semaphore.tryAcquire() fails it should block, or the Kafka reader should be able to use a call that writes the spans to storage and blocks until the storage has resources available. With Kafka we have buffering, so we don't have the constraints of an HTTP request writing to the storage; we can easily apply back pressure by not reading from Kafka, without losing any spans.
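A sketch of that back-pressure idea using the Kafka consumer's pause/resume API (the isStorageOverCapacity() signal is a placeholder for whatever the storage layer would expose; a recent consumer API is assumed):

```java
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

import java.time.Duration;

class BackpressureLoop {
  private final KafkaConsumer<byte[], byte[]> consumer;

  BackpressureLoop(KafkaConsumer<byte[], byte[]> consumer) {
    this.consumer = consumer;
  }

  void run() {
    while (true) {
      if (isStorageOverCapacity()) {
        // Stop fetching new records; offsets are not committed yet,
        // so nothing is lost while the storage catches up.
        consumer.pause(consumer.assignment());
      } else {
        consumer.resume(consumer.paused());
      }
      ConsumerRecords<byte[], byte[]> records = consumer.poll(Duration.ofMillis(500));
      records.forEach(record -> writeToStorage(record.value()));
      consumer.commitSync(); // commit only after the spans were accepted by storage
    }
  }

  private boolean isStorageOverCapacity() { return false; } // placeholder signal
  private void writeToStorage(byte[] serializedSpans) { /* storage write */ }
}
```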