HystrixCollapser: Response Not Received #80
I need to first add some more debug information and deploy into production to capture more data, as I have not been able to replicate this outside of production.
I figured out what's causing the issue: under high traffic and high system load average, the ticker thread becomes overwhelmed, so the collapser receives its ticks very late. In my canary testing, when squeezing a box, ticks were arriving 5+ seconds late, sometimes as much as 8 seconds, when they should arrive in roughly 10 milliseconds.
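As an illustration of that failure mode, here is a minimal, hypothetical sketch (not the actual HystrixTimer code) that instruments a single-threaded 10ms timer loop and reports how late each tick fires:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

public class TickLagMonitor {
    private static final long INTERVAL_MS = 10; // the expected tick interval

    public static void main(String[] args) {
        ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
        AtomicLong lastTick = new AtomicLong(System.currentTimeMillis());

        timer.scheduleAtFixedRate(() -> {
            long now = System.currentTimeMillis();
            long sinceLast = now - lastTick.getAndSet(now);
            long lag = sinceLast - INTERVAL_MS;
            if (lag > INTERVAL_MS) {
                // under heavy CPU contention the single timer thread gets
                // descheduled, so this lag grows with system load average
                System.out.println("tick arrived " + lag + "ms late");
            }
        }, INTERVAL_MS, INTERVAL_MS, TimeUnit.MILLISECONDS);
    }
}
```

With only one thread servicing every collapser's ticks, any delay here is felt by all batches at once.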
The 'batchReceivedTime' field in my debug output measures how long it took from the time a batch began until it received the tick.
This shows the time for one loop of the timer. Ideally it would be 10ms, but here it is taking multiple seconds under load. When the machine is healthy (load average of 6-7 on an 8-core box, instead of approaching 20 as in the tests above), the loop times look far more reasonable.
It's still not the 10ms I want, but far healthier. When load is <6 the loop is <100ms. The considerations right now are:
- I have changed code that I'll commit soon to improve the error handling when the latency does occur; right now it ends up waiting twice when it only needs to wait once and then throw an exception.
- I also need to build in a collapser timeout that inherits from the underlying HystrixCommand timeout, so that if the collapser scheduling itself is struggling it just sheds load instead of blocking (see the sketch after this list).
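A rough sketch of what that could look like, with hypothetical class and method names rather than the actual Hystrix API: the caller makes a single bounded wait whose budget is inherited from the underlying command's timeout, and throws to shed load instead of blocking a second time.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical sketch: one collapsed request waiting on its batch.
class CollapsedRequest<T> {
    private final CountDownLatch batchCompleted = new CountDownLatch(1);
    private volatile T response;

    // called by the batch when it finishes executing
    void setResponse(T value) {
        response = value;
        batchCompleted.countDown();
    }

    // Wait once, with the budget inherited from the underlying
    // HystrixCommand's timeout; if the batch never runs in time, throw
    // immediately (shedding load) rather than entering a second blocking wait.
    T get(long commandTimeoutMs) throws InterruptedException, TimeoutException {
        if (!batchCompleted.await(commandTimeoutMs, TimeUnit.MILLISECONDS)) {
            throw new TimeoutException(
                    "collapsed response not received within " + commandTimeoutMs + "ms");
        }
        return response;
    }
}
```

Failing fast after one wait means a struggling timer adds at most one timeout's worth of latency per request rather than two.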
Here are performance comparisons from an improved implementation I'm testing on a production canary. The new code has the following changes:
I'm going to run it in production on a canary for a while before committing or merging, to make sure the initial performance improvements hold and that I haven't introduced any other bugs or memory leaks.
Pull request #114 was tested using a Netflix API production canary and by tests from 2 other applications.
I have merged this code and am closing this issue. It will be released in 1.2.10. Notes from the pull request: I have gotten positive test results from 3 teams, including my own testing on production canaries, so I am merging this code and releasing. What I looked at in prod was:
The other 2 teams ran stress tests in the test environment and everything passed.
I still can't replicate this, so I am adding error handling anywhere I can foresee a problem, as well as adding more logging. Netflix#80
IllegalStateException: Future Not Started Netflix#80

- refactored HystrixTimer to be multi-threaded, using ScheduledThreadPoolExecutor to achieve this
- refactored HystrixCollapser to:
  - time out on the wait and trigger batch execution from the calling thread instead of relying only on the background timer
  - use a single CountDownLatch per batch rather than 2 per collapsed request, dramatically reducing the number of items being waited on
  - change how batches are defined so that creating a batch is a simpler step, without a loop over all requests to register with the batch

Testing on a production canary shows these changes have given significant efficiency and thus performance gains.
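A condensed sketch of that design, using simplified, hypothetical types in place of the real HystrixTimer and HystrixCollapser classes: a ScheduledThreadPoolExecutor provides multiple timer threads, each batch shares one CountDownLatch, creating a batch is a simple reference swap, and a caller whose wait expires executes the batch itself rather than waiting on the timer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

class RequestBatch<T> {
    private final List<T> requests = new ArrayList<>();
    final CountDownLatch completed = new CountDownLatch(1); // one latch per batch
    private boolean executed = false;

    // returns false if this batch already executed; the caller retries on a fresh batch
    synchronized boolean offer(T request) {
        if (executed) {
            return false;
        }
        requests.add(request);
        return true;
    }

    // callable from a timer thread or a waiting caller; only the first call runs the batch
    void executeIfNotAlready() {
        List<T> toExecute;
        synchronized (this) {
            if (executed) {
                return;
            }
            executed = true;
            toExecute = new ArrayList<>(requests);
        }
        // ... run the batched command over 'toExecute' and deliver responses here ...
        completed.countDown();
    }
}

class CollapserSketch<T> {
    // multiple timer threads, so one slow tick no longer stalls every collapser
    private final ScheduledExecutorService timer = Executors.newScheduledThreadPool(4);
    private final AtomicReference<RequestBatch<T>> currentBatch =
            new AtomicReference<>(new RequestBatch<>());

    CollapserSketch() {
        timer.scheduleAtFixedRate(this::tick, 10, 10, TimeUnit.MILLISECONDS);
    }

    // creating a new batch is a simple reference swap; no loop over pending requests
    private void tick() {
        currentBatch.getAndSet(new RequestBatch<>()).executeIfNotAlready();
    }

    void submit(T request, long timeoutMs) throws InterruptedException {
        RequestBatch<T> batch;
        while (true) {
            batch = currentBatch.get();
            if (batch.offer(request)) {
                break;
            }
            // an already-executed batch is still installed; swap in a fresh one and retry
            currentBatch.compareAndSet(batch, new RequestBatch<>());
        }
        if (!batch.completed.await(timeoutMs, TimeUnit.MILLISECONDS)) {
            // the timer tick is late: execute from the calling thread instead of blocking again
            batch.executeIfNotAlready();
        }
    }
}
```

Compared with 2 latches per collapsed request, one latch per batch dramatically reduces the number of items being waited on, and the calling-thread fallback keeps requests moving even when ticks arrive seconds late.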
Symptoms:
Causes:
This issue was originally titled "IllegalStateException: Future Not Started" and was thought to be related to this HystrixCollapser issue. It turned out not to be, but this issue was taken over by the collapser problem, so the original bug was re-created as #113. The first 2 comments, including pull request #81, relate to #113 and not to this issue about HystrixCollapser.