Investigate how to limit backlog on Elasticsearch collector #1760
One way we could do it is to drop incoming requests when we notice dispatcher.runningCallsCount() is at a specific threshold, maybe as easy as our in-flight limit (default 64)! This trusts that auto-cancelation based on connect/read/write timeouts works (each defaults to 10s). If there's a possibility of stale in-flight requests otherwise, I guess we'd need something to watchdog in-flight requests and kill them if they are over the various timeouts. cc the usual okhttp folks
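A minimal sketch of that idea, assuming OkHttp 3's Dispatcher accessors; the class, MAX_IN_FLIGHT constant, and droppedSpans counter are hypothetical names, not Zipkin's actual ones:

```java
import okhttp3.Callback;
import okhttp3.OkHttpClient;
import okhttp3.Request;

class DropOnOverload {
  static final int MAX_IN_FLIGHT = 64; // matches OkHttp's default Dispatcher limit

  final OkHttpClient client;
  long droppedSpans; // would feed a metrics counter in practice

  DropOnOverload(OkHttpClient client) {
    this.client = client;
  }

  void index(Request bulkRequest, Callback callback) {
    // Drop rather than queue once the dispatcher is already running at the limit;
    // this trusts the connect/read/write timeouts to reap stuck in-flight calls.
    if (client.dispatcher().runningCallsCount() >= MAX_IN_FLIGHT) {
      droppedSpans++;
      return;
    }
    client.newCall(bulkRequest).enqueue(callback);
  }
}
```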
From the Gist, but also knowing the EventListener interface, is this something that could be prototyped first as an external library? Seems like all the hooks are there to come up with something pretty smart.
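For example, a rough prototype of that external-library idea could start with an EventListener that only observes the call lifecycle to keep an in-flight count (it cannot modify the client itself); the class name below is made up:

```java
import java.io.IOException;
import java.util.concurrent.atomic.AtomicInteger;
import okhttp3.Call;
import okhttp3.EventListener;

class InFlightTracker extends EventListener {
  final AtomicInteger inFlight = new AtomicInteger();

  @Override public void callStart(Call call) {
    inFlight.incrementAndGet();
  }

  @Override public void callEnd(Call call) {
    inFlight.decrementAndGet();
  }

  @Override public void callFailed(Call call, IOException ioe) {
    inFlight.decrementAndGet();
  }
}

// Usage (illustrative):
// OkHttpClient client = new OkHttpClient.Builder()
//     .eventListener(new InFlightTracker())
//     .build();
```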
Another strategy: have your own queue that sits in front of the dispatcher and does its own prioritization and enqueueing. Getting signaled when dispatcher's size changes may be tricky.
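A sketch of that queue-in-front strategy, under the assumption of a simple bounded queue drained by polling (the tricky part, getting signaled on dispatcher size changes, is glossed over); all names are hypothetical:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import okhttp3.Callback;
import okhttp3.OkHttpClient;
import okhttp3.Request;

class FrontQueue {
  final BlockingQueue<Request> pending = new ArrayBlockingQueue<>(1000); // bounded: drops when full
  final OkHttpClient client;
  final Callback callback;

  FrontQueue(OkHttpClient client, Callback callback) {
    this.client = client;
    this.callback = callback;
  }

  /** Returns false (caller drops the spans) when the bounded queue is full. */
  boolean offer(Request request) {
    return pending.offer(request);
  }

  /** Called periodically, or on some completion signal, to drain into OkHttp. */
  void drain() {
    while (client.dispatcher().runningCallsCount() < client.dispatcher().getMaxRequests()) {
      Request next = pending.poll();
      if (next == null) return;
      client.newCall(next).enqueue(callback);
    }
  }
}
```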
This sort of thing is hard to get right without knowing how the client and server(s) are handling things. Will OkHttpClient ever make the optimal decision? e.g. is it worth submitting a queued call shortly before it times out, or should newer calls be prioritised over older ones?
There are two main cases in ES anyway: the server fails not nicely (ex. out of disk) and the server fails nicely (a response that says back off). The official ES client has a backoff strategy for the latter. The former can still happen, and did in the issue leading to this one.

Regardless, one simplification we can make here is that we can choose to drop new requests or drop in-flight ones on overload. In either case, we don't need to buffer in expectation of a transient failure. The first priority is not dying (OOM); later, more advanced things could be done.
@yschimke, looking at the EventListener interface, I'm not quite sure it'd be able to accomplish what we'd need by itself. Specifically, this line from the documentation makes me a bit uneasy:

The inability for a listener to modify the client is the phrase to note there. Maybe an EventListener in conjunction with a queue-handling mechanism in front of OkHttp would work, but that sounds like a lot of hoops to jump through and equally as many places for something to go wrong.

@swankjesse, what would the benefit of putting a queue in front of OkHttp be (aside from what I mentioned above)? Would querying the Dispatcher directly be inappropriate in some way, or are you just suggesting that an extra abstraction might lead to a cleaner implementation?

@adriancole, the StorageComponent code is a bit deep and I haven't managed to get through it all yet, particularly to the ES client, but for reference, there was another situation where we hit an OOM: when one of our ES nodes went down. I wasn't able to identify an exact order of events that caused the issue, but for some reason after the node (1 of 3) went offline we had a backup of requests and spiked to an OOM. That suggests to me the server was not failing nicely, and so wasn't triggering the backoff strategy you mentioned. Again, not sure what other factors were at play, but it's another scenario to consider at the very least.

As for the simplification: that sounds ideal to me, as long as proper metrics are maintained with respect to the number of spans dropped ;)
Honestly, I think we should start with dropping on backlog and go from there. The test is easy: crap out a cluster and throw traffic at zipkin; uncrap the cluster and the drop metrics should stop.
The downside of dropping on backlog (ex. ready queue > 0, or some figure defaulting to zero) is that we will still have, by default, 64 requests in flight on a crap cluster. However, a change to improve that (ex. coordinating on the http response) could layer a filter or other code to throttle back. If 64 in-flight requests OOM the server, you can change that to a lower value today anyway. At any rate, a healthy backoff signal isn't always present, so regardless we need a knob like drop-on-backlog; otherwise whoever's queue it is will lead to OOM.
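A sketch of that drop-on-backlog knob, assuming a threshold that defaults to zero (any backlog means drop); the names here are illustrative rather than Zipkin's actual configuration:

```java
import okhttp3.Callback;
import okhttp3.OkHttpClient;
import okhttp3.Request;

class DropOnBacklog {
  final OkHttpClient client;
  final int maxQueuedCalls; // 0 means: any ready-queue backlog causes new requests to be dropped
  long droppedSpans;

  DropOnBacklog(OkHttpClient client, int maxQueuedCalls) {
    this.client = client;
    this.maxQueuedCalls = maxQueuedCalls;
  }

  void index(Request bulkRequest, Callback callback) {
    // Drop instead of letting the dispatcher's ready queue grow without bound.
    if (client.dispatcher().queuedCallsCount() > maxQueuedCalls) {
      droppedSpans++; // report via metrics rather than buffering toward OOM
      return;
    }
    client.newCall(bulkRequest).enqueue(callback);
  }
}
```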
In the past, delayed or otherwise unhealthy elasticsearch clusters could create a backlog leading to an OOM arising from the http dispatcher ready queue. This chooses to prevent a ready queue instead. This means we drop spans when the backend isn't responding instead of crashing the server. Fixes #1760
#1765 I am soak testing this, but the server already survives a lot longer. basically, set
Thanks! As previously mentioned, the transition to zipkin 2 will be a little involved for us, but I will report back when I can! Hardest part with this is that the death is random. So who knows if we'll find the random event again ;)
> Hardest part with this is that the death is random. So who knows if we'll find the random event again ;)

Well, thanks for dying earlier, as I think the code is a bit safer now, even if not bulletproof. Cheers!
#2502 now includes test instructions; please give it a try.
Our elasticsearch span consumer uses okhttp behind the scenes, which is subject to a default of 64 in-flight connections. We should think about a cancelation policy/algo so that we don't OOM when the elasticsearch cluster gets backed up for whatever reason. I don't see a control to limit the backlog in the okhttp dispatcher, so we might have to do something here.

Here are notes from @Logic-32: #1332 (comment)
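For reference, the Dispatcher does expose in-flight limits (defaults: 64 total, 5 per host), but no cap on its ready/backlog queue, which is why a drop policy would have to live on our side. An illustrative snippet, not Zipkin's actual configuration:

```java
import okhttp3.Dispatcher;
import okhttp3.OkHttpClient;

class DispatcherLimits {
  static OkHttpClient buildClient() {
    Dispatcher dispatcher = new Dispatcher();
    dispatcher.setMaxRequests(32);        // lower the total concurrent calls if 64 is too much
    dispatcher.setMaxRequestsPerHost(32); // per-host limit matters when all traffic hits one ES endpoint

    return new OkHttpClient.Builder()
        .dispatcher(dispatcher)
        .build();
  }
}
```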