Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Investigate how to limit backlog on Elasticsearch collector #1760

Closed
codefromthecrypt opened this issue Oct 7, 2017 · 13 comments
Closed

Investigate how to limit backlog on Elasticsearch collector #1760

codefromthecrypt opened this issue Oct 7, 2017 · 13 comments

Comments

@codefromthecrypt
Copy link
Member

Our elasticsearch span consumer uses okhttp under the scenes, which is subject to a default of 64 in-flight connections. We should think about a cancelation policy/algo so that we don't OOM when the elasticsearch cluster gets backed up for whatever reason. I don't see a control to limit the backlog in the okhttp dispatcher, so we might have to do something here..

here are notes from @Logic-32 #1332 (comment)

We are using Brave with a URLConnectionSender to send the data. I'm surprised that no tool could handle more than 10k spans in your workshop; we handle it with minimal problems on only one zipkin-server, no cluster. The only problem is that the ES_READ_TIMEOUT is set to 60s and it sometimes takes more than 60s to write the data out. So, the data does get written but zipkin erroneously reports them as dropped. But our specific situation that got us into that state was having a 10s messageTimeout and a tight loop that hit the database; both of which have been remedied.

Quick review of the documentation indicates that the max connections only affects the size of the pool they use; which probably means everything else would be queued and backlogged.

Again, not knowing OkHttp that well, I wonder if you can inspect the Dispatcher queue and cancel anything in the queue if you exceed a certain threshold?
Ref: https://gist.github.com/RyanRamchandar/64c5863838940ec67f03

@codefromthecrypt
Copy link
Member Author

codefromthecrypt commented Oct 7, 2017

One way we could do it is to drop the incoming requests when we notice dispatcher.runningCallsCount() is at a specific threshold, maybe easy as our in-flight limit (default 64)!

This trusts that auto-cancelation based on connect/read/write timeouts work (which defaults to 10s). If there's a possibility of stale in-flight requests otherwise, I guess we'd need something to watchdog in-flight requests to kill them if they are over the various timeouts.

cc the usual okhttp folks
@yschimke @swankjesse @JakeWharton

@yschimke
Copy link

yschimke commented Oct 7, 2017

From the Gist but also knowing the EventListener interface, is this something that could be prototyped first as an external library. Seems like all the hooks are there to come up with something pretty smart.

@swankjesse
Copy link

Another strategy: have your own queue that sits in front of the dispatcher and does its own prioritization and enqueueing.

Getting signaled when dispatcher’s size changes may be tricky.

@yschimke
Copy link

yschimke commented Oct 7, 2017

This sort of thing is hard to get right without knowing how the client and server(s) are handling things. Will OkHttpClient ever make the optimal decision? e.g. is it worth submitting a queued call shortly before timeout, and therefore newer calls should be prioritised over older ones.

@codefromthecrypt
Copy link
Member Author

codefromthecrypt commented Oct 7, 2017 via email

@Logic-32
Copy link
Contributor

Logic-32 commented Oct 9, 2017

@yschimke, looking at the EventListener interface, I'm not quite sure it'd be able to accomplish what we'd need by it self. Specifically, this line from the documentation makes me a bit uneasy:

All event methods must execute fast, without external locking, cannot throw exceptions, attempt to mutate the event parameters, or be reentrant back into the client.

The inability for a listener to modify the client being the phrase to note there. Maybe an EventListener in conjunction with a queue handling mechanism in front of OkHttp would work but that sounds like a lot of hoops to jump through and equally as many places for something to go wrong.

@swankjesse, what would the benefit of putting a queue in front of OkHttp be (aside from what I mentioned above)? Would querying the Dispatcher directly be inappropriate in some way or are you just suggesting that an extra abstraction might lead to a cleaner implementation?

@adriancole, the StorageComponent code is a bit deep and I haven't managed to get through it all yet, particularly to the ES client, but for reference, there was another situation where we hit an OOM: when one of our ES nodes went down. I wasn't able to identify and exact order of events that caused the issue but for some reason after the node (1 out of 3 of them) went offline we had a backup of requests and spiked to an OOM. That indicates to me that I'm guessing the server was not failing nicely and triggering the backoff strategy you mentioned. Again, not sure what other factors were at play but it's another scenario to consider at the very least.

As for the simplification: that sounds ideal to me as long as proper metrics are maintained with respect to number of spans dropped ;)

@codefromthecrypt
Copy link
Member Author

codefromthecrypt commented Oct 10, 2017 via email

@codefromthecrypt
Copy link
Member Author

codefromthecrypt commented Oct 10, 2017 via email

codefromthecrypt pushed a commit that referenced this issue Oct 10, 2017
In the past, delayed or otherwise unhealthy elasticsearch clusters could
create a backlog leading to a OOM arising from the http dispatcher ready
queue. This chooses to prevent a ready queue instead. This means we drop
spans when the backend isn't responding instead of crashing the server.

Fixes #1760
codefromthecrypt pushed a commit that referenced this issue Oct 10, 2017
In the past, delayed or otherwise unhealthy elasticsearch clusters could
create a backlog leading to a OOM arising from the http dispatcher ready
queue. This chooses to prevent a ready queue instead. This means we drop
spans when the backend isn't responding instead of crashing the server.

Fixes #1760
@codefromthecrypt
Copy link
Member Author

#1765 I am soak testing this, but the server already survives a lot longer. basically, set ES_MAX_REQUESTS=2 and throw lots traffic at it. Before this change, it would reliably OOM. It is harder to OOM if you have the default max requests of 64 :P

codefromthecrypt pushed a commit that referenced this issue Oct 10, 2017
In the past, delayed or otherwise unhealthy elasticsearch clusters could
create a backlog leading to a OOM arising from the http dispatcher ready
queue. This chooses to prevent a ready queue instead. This means we drop
spans when the backend isn't responding instead of crashing the server.

Fixes #1760
@codefromthecrypt
Copy link
Member Author

let's try simple first. I've put a semaphore in, and zipkin 2.2 has some nice things for grafana etc. give it a try and let's iterate from there..

here's an example of a forced surge and recovery though I'm sure we can kill the server if we try hard enough.

screen shot 2017-10-10 at 10 10 00 pm

@Logic-32
Copy link
Contributor

Thanks! As previously mentioned, transition to zipkin 2 will be a little involved for us but I will report back when I can!

Hardest part with this is that the death is random. So who knows if we'll find the random event again ;)

@codefromthecrypt
Copy link
Member Author

codefromthecrypt commented Oct 14, 2017 via email

@codefromthecrypt
Copy link
Member Author

#2502 now includes test instructions please give a try

abesto pushed a commit to abesto/zipkin that referenced this issue Sep 10, 2019
In the past, delayed or otherwise unhealthy elasticsearch clusters could
create a backlog leading to a OOM arising from the http dispatcher ready
queue. This chooses to prevent a ready queue instead. This means we drop
spans when the backend isn't responding instead of crashing the server.

Fixes openzipkin#1760
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants