HystrixCollapser: Response Not Received #80
I need to first add some more debug information and deploy into production to capture more data, as I have not been able to replicate this outside of production.
I figured out what's causing the issue: under high traffic and high system load average, the ticker thread becomes overwhelmed, so the collapser receives its ticks very late. In my canary testing, when squeezing a box, ticks were arriving 5+ seconds late, sometimes as much as 8 seconds, when they should arrive in roughly 10 milliseconds.
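As an illustration of that failure mode, here is a minimal, hypothetical sketch (not the actual HystrixTimer code) that instruments a single-threaded 10ms timer loop and reports how late each tick fires:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

public class TickLagMonitor {
    private static final long INTERVAL_MS = 10; // the expected tick interval

    public static void main(String[] args) {
        ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
        AtomicLong lastTick = new AtomicLong(System.currentTimeMillis());

        timer.scheduleAtFixedRate(() -> {
            long now = System.currentTimeMillis();
            long sinceLast = now - lastTick.getAndSet(now);
            long lag = sinceLast - INTERVAL_MS;
            if (lag > INTERVAL_MS) {
                // under heavy CPU contention the single timer thread gets
                // descheduled, so this lag grows with system load average
                System.out.println("tick arrived " + lag + "ms late");
            }
        }, INTERVAL_MS, INTERVAL_MS, TimeUnit.MILLISECONDS);
    }
}
```

With only one thread servicing every collapser's ticks, any delay here is felt by all batches at once.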
The 'batchReceivedTime' field in my debug output measures how long it took from the time a batch began until it received the tick.
This shows the time for one loop of the timer. Ideally it would be 10ms, but here it is taking multiple seconds under load. When the machine is healthy (load average of 6-7 on an 8-core box, instead of approaching 20 as in the tests above), the loop times look far more reasonable.
It's still not the 10ms I want, but far healthier. When load is <6 the loop is <100ms. The considerations right now are:
- I have changed code that I'll commit soon to improve the error handling when the latency does occur; right now it ends up waiting twice when it only needs to wait once and then throw an exception.
- I also need to build in a collapser timeout that inherits from the underlying HystrixCommand timeout, so that if the collapser scheduling itself is struggling it just sheds load instead of blocking (see the sketch after this list).
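A rough sketch of what that could look like, with hypothetical class and method names rather than the actual Hystrix API: the caller makes a single bounded wait whose budget is inherited from the underlying command's timeout, and throws to shed load instead of blocking a second time.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical sketch: one collapsed request waiting on its batch.
class CollapsedRequest<T> {
    private final CountDownLatch batchCompleted = new CountDownLatch(1);
    private volatile T response;

    // called by the batch when it finishes executing
    void setResponse(T value) {
        response = value;
        batchCompleted.countDown();
    }

    // Wait once, with the budget inherited from the underlying
    // HystrixCommand's timeout; if the batch never runs in time, throw
    // immediately (shedding load) rather than entering a second blocking wait.
    T get(long commandTimeoutMs) throws InterruptedException, TimeoutException {
        if (!batchCompleted.await(commandTimeoutMs, TimeUnit.MILLISECONDS)) {
            throw new TimeoutException(
                    "collapsed response not received within " + commandTimeoutMs + "ms");
        }
        return response;
    }
}
```

Failing fast after one wait means a struggling timer adds at most one timeout's worth of latency per request rather than two.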
Here are performance comparisons from an improved implementation I'm testing on a production canary. The new code has the following changes:
I'm going to run it in production on a canary for a while before committing or merging, to make sure the initial performance improvements hold and that I haven't introduced any other bugs or memory leaks.
Pull request #114 was tested using a Netflix API production canary and by tests from 2 other applications.
I have merged this code and am closing this issue. It will be released in 1.2.10. Notes from the pull request: I have gotten positive test results from 3 teams, including my own testing on production canaries, so I am merging this code and releasing. What I looked at in prod was:
The other 2 teams ran stress tests in the test environment and everything passed.
I still can't replicate this, so I am adding error handling anywhere I can foresee a problem, as well as adding more logging. Netflix#80
IllegalStateException: Future Not Started Netflix#80

- refactored HystrixTimer to be multi-threaded, using ScheduledThreadPoolExecutor to achieve this
- refactored HystrixCollapser to:
  - time out on the wait and trigger batch execution from the calling thread instead of relying only on the background timer
  - use a single CountDownLatch per batch rather than 2 per collapsed request, dramatically reducing the number of items being waited on
  - change how batches are defined so that creating a batch is a simpler step, without a loop over all requests to register with the batch

Testing on a production canary shows these changes have given significant efficiency and thus performance gains.
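A condensed sketch of that design, using simplified, hypothetical types in place of the real HystrixTimer and HystrixCollapser classes: a ScheduledThreadPoolExecutor provides multiple timer threads, each batch shares one CountDownLatch, creating a batch is a simple reference swap, and a caller whose wait expires executes the batch itself rather than waiting on the timer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

class RequestBatch<T> {
    private final List<T> requests = new ArrayList<>();
    final CountDownLatch completed = new CountDownLatch(1); // one latch per batch
    private boolean executed = false;

    // returns false if this batch already executed; the caller retries on a fresh batch
    synchronized boolean offer(T request) {
        if (executed) {
            return false;
        }
        requests.add(request);
        return true;
    }

    // callable from a timer thread or a waiting caller; only the first call runs the batch
    void executeIfNotAlready() {
        List<T> toExecute;
        synchronized (this) {
            if (executed) {
                return;
            }
            executed = true;
            toExecute = new ArrayList<>(requests);
        }
        // ... run the batched command over 'toExecute' and deliver responses here ...
        completed.countDown();
    }
}

class CollapserSketch<T> {
    // multiple timer threads, so one slow tick no longer stalls every collapser
    private final ScheduledExecutorService timer = Executors.newScheduledThreadPool(4);
    private final AtomicReference<RequestBatch<T>> currentBatch =
            new AtomicReference<>(new RequestBatch<>());

    CollapserSketch() {
        timer.scheduleAtFixedRate(this::tick, 10, 10, TimeUnit.MILLISECONDS);
    }

    // creating a new batch is a simple reference swap; no loop over pending requests
    private void tick() {
        currentBatch.getAndSet(new RequestBatch<>()).executeIfNotAlready();
    }

    void submit(T request, long timeoutMs) throws InterruptedException {
        RequestBatch<T> batch;
        while (true) {
            batch = currentBatch.get();
            if (batch.offer(request)) {
                break;
            }
            // an already-executed batch is still installed; swap in a fresh one and retry
            currentBatch.compareAndSet(batch, new RequestBatch<>());
        }
        if (!batch.completed.await(timeoutMs, TimeUnit.MILLISECONDS)) {
            // the timer tick is late: execute from the calling thread instead of blocking again
            batch.executeIfNotAlready();
        }
    }
}
```

Compared with 2 latches per collapsed request, one latch per batch dramatically reduces the number of items being waited on, and the calling-thread fallback keeps requests moving even when ticks arrive seconds late.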
Symptoms:
Causes:
This issue was originally titled "IllegalStateException: Future Not Started" and was thought to be related to this HystrixCollapser issue. It turned out not to be, but this issue was taken over by the collapser problem, so the original bug was re-created as #113. The first 2 comments, including pull request #81, relate to #113 and not to this issue about HystrixCollapser.