Update HealthCheckedChannelPool to check KEEP_ALIVE attribute #1476

zoewangg · 2019-10-17T23:06:12Z

Description

Update HealthCheckedChannelPool to check KEEP_ALIVE when acquiring a channel from the pool to avoid soon-to-be inactive channels being picked up by a new request. This should reduce the frequency of IOException: Server failed to complete response errors. See #1380 #1466

Testing

Added unit tests.
Manual tests

Disabled retry and set maxIdleConnectionTimeout = 1 mins on purpose to expose the errors:

Before

CloudWatchAsyncStabilityTest: totalRequestCount: 4000, ioExceptionCount: 34

Based on the logs, 28 of ioExceptions were because the connections that were not reusable got picked up

After


CloudWatchAsyncStabilityTest:, totalRequestCount: 4000, ioExceptionCount: 0

Screenshots (if appropriate)

Types of changes

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)

Checklist

License

I confirm that this pull request can be released under the Apache 2 license

…f a channel before picking it up in the pool. See #1380

spfink

Why does KEEP-ALIVE being false indicate the channel is unhealthy?

zoewangg · 2019-10-17T23:41:06Z

KEEP_ALIVE was set from the previous response on the same channel based on the response headers. KEEP_ALIVE=false indicates the connection might be closed at any moment and should not be reused.

Basically, there could be a race condition when a channel is being closed at the same moment it's being acquired, and a non-reusable connection might gets picked up by a new request.

codecov-io · 2019-10-17T23:42:00Z

Codecov Report

Merging #1476 into master will increase coverage by <.01%.
The diff coverage is 100%.

@@             Coverage Diff              @@
##             master    #1476      +/-   ##
============================================
+ Coverage     73.81%   73.82%   +<.01%     
  Complexity      720      720              
============================================
  Files           850      850              
  Lines         26158    26160       +2     
  Branches       2018     2018              
============================================
+ Hits          19309    19313       +4     
+ Misses         5975     5973       -2     
  Partials        874      874

Flag	Coverage Δ	Complexity Δ
#unittests	`73.82% <100%> (ø)`	`720 <0> (ø)`	⬇️

Impacted Files	Coverage Δ	Complexity Δ
...wssdk/http/nio/netty/internal/ResponseHandler.java	`87.09% <ø> (-0.09%)`	`0 <0> (ø)`
...p/nio/netty/internal/HealthCheckedChannelPool.java	`90.9% <100%> (+0.43%)`	`0 <0> (ø)`	⬇️
...k/http/nio/netty/internal/ChannelAttributeKey.java	`92.3% <100%> (+0.64%)`	`0 <0> (ø)`	⬇️
...nio/netty/internal/OldConnectionReaperHandler.java	`81.81% <0%> (-9.1%)`	`0% <0%> (ø)`
...o/netty/internal/http2/HttpOrHttp2ChannelPool.java	`80.64% <0%> (-1.08%)`	`0% <0%> (ø)`
...ssdk/core/internal/async/FileAsyncRequestBody.java	`87.61% <0%> (+3.8%)`	`0% <0%> (ø)`	⬇️
...ine/stages/ApiCallAttemptTimeoutTrackingStage.java	`100% <0%> (+6.89%)`	`0% <0%> (ø)`	⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 99a44ce...494f26d. Read the comment docs.

dagnir · 2019-10-18T00:05:21Z

...t/src/main/java/software/amazon/awssdk/http/nio/netty/internal/HealthCheckedChannelPool.java

    private boolean isHealthy(Channel channel) {
+        // There might be cases where the channel is not reusable but still active at the moment
+        // See https://github.com/aws/aws-sdk-java-v2/issues/1380
+        if (channel.attr(KEEP_ALIVE).get() != null && !channel.attr(KEEP_ALIVE).get()) {
+            return false;
+        }
+


I think this is fine, but any reason we don't close the channel after the response ins consumed instead?

We do invoke ChannelHandlerContext#close in ResponseHandler for this case but the channel might not be actually closed and is still active at this point. It's not guaranteed the channel is closed before it's released into the pool.

Also, because of the fact that the underlying HttpStreamsClientHandler from netty-reactive-streams does not close the channel if there are still inflight messages, channel.close might not be initiated at all. (This is what we discussed in the morning and we might need to figure out why there is still inflight message when the response is complete in some cases)

Update Netty HealthCheckedChannelPool to check KEEP_ALIVE attribute o…

494f26d

…f a channel before picking it up in the pool. See #1380

spfink reviewed Oct 17, 2019

View reviewed changes

spfink approved these changes Oct 17, 2019

View reviewed changes

dagnir reviewed Oct 18, 2019

View reviewed changes

zoewangg merged commit 2553931 into master Oct 18, 2019

zoewangg deleted the zoewang-nettyCheckKeepAlive branch October 18, 2019 01:08

This was referenced Oct 18, 2019

CloudWatch putMetricData sometimes produce a java.io.IOException: Server failed to send complete response #1380

Closed

SqsAsyncClient is failing unpredictably during streaming #1466

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Update HealthCheckedChannelPool to check KEEP_ALIVE attribute #1476

Update HealthCheckedChannelPool to check KEEP_ALIVE attribute #1476

zoewangg commented Oct 17, 2019

spfink left a comment

zoewangg commented Oct 17, 2019 •

edited

Loading

codecov-io commented Oct 17, 2019

dagnir Oct 18, 2019

zoewangg Oct 18, 2019 •

edited

Loading

Update HealthCheckedChannelPool to check KEEP_ALIVE attribute #1476

Update HealthCheckedChannelPool to check KEEP_ALIVE attribute #1476

Conversation

zoewangg commented Oct 17, 2019

Description

Testing

Screenshots (if appropriate)

Types of changes

Checklist

License

spfink left a comment

Choose a reason for hiding this comment

zoewangg commented Oct 17, 2019 • edited Loading

codecov-io commented Oct 17, 2019

Codecov Report

dagnir Oct 18, 2019

Choose a reason for hiding this comment

zoewangg Oct 18, 2019 • edited Loading

Choose a reason for hiding this comment

zoewangg commented Oct 17, 2019 •

edited

Loading

zoewangg Oct 18, 2019 •

edited

Loading