Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update HealthCheckedChannelPool to check KEEP_ALIVE attribute #1476

Merged
merged 1 commit into from
Oct 18, 2019

Conversation

zoewangg
Copy link
Contributor

Description

Update HealthCheckedChannelPool to check KEEP_ALIVE when acquiring a channel from the pool to avoid soon-to-be inactive channels being picked up by a new request. This should reduce the frequency of IOException: Server failed to complete response errors. See #1380 #1466

Testing

  • Added unit tests.
  • Manual tests

Disabled retry and set maxIdleConnectionTimeout = 1 mins on purpose to expose the errors:

Before

CloudWatchAsyncStabilityTest: totalRequestCount: 4000, ioExceptionCount: 34

Based on the logs, 28 of ioExceptions were because the connections that were not reusable got picked up

After


CloudWatchAsyncStabilityTest:, totalRequestCount: 4000, ioExceptionCount: 0

Screenshots (if appropriate)

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)

Checklist

  • I have read the CONTRIBUTING document
  • Local run of mvn install succeeds
  • My code follows the code style of this project
  • My change requires a change to the Javadoc documentation
  • I have updated the Javadoc documentation accordingly
  • I have read the README document
  • I have added tests to cover my changes
  • All new and existing tests passed
  • A short description of the change has been added to the CHANGELOG
  • My change is to implement 1.11 parity feature and I have updated LaunchChangelog

License

  • I confirm that this pull request can be released under the Apache 2 license

…f a channel before picking it up in the pool. See #1380
Copy link
Contributor

@spfink spfink left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does KEEP-ALIVE being false indicate the channel is unhealthy?

@zoewangg
Copy link
Contributor Author

zoewangg commented Oct 17, 2019

KEEP_ALIVE was set from the previous response on the same channel based on the response headers. KEEP_ALIVE=false indicates the connection might be closed at any moment and should not be reused.

Basically, there could be a race condition when a channel is being closed at the same moment it's being acquired, and a non-reusable connection might gets picked up by a new request.

@codecov-io
Copy link

Codecov Report

Merging #1476 into master will increase coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@             Coverage Diff              @@
##             master    #1476      +/-   ##
============================================
+ Coverage     73.81%   73.82%   +<.01%     
  Complexity      720      720              
============================================
  Files           850      850              
  Lines         26158    26160       +2     
  Branches       2018     2018              
============================================
+ Hits          19309    19313       +4     
+ Misses         5975     5973       -2     
  Partials        874      874
Flag Coverage Δ Complexity Δ
#unittests 73.82% <100%> (ø) 720 <0> (ø) ⬇️
Impacted Files Coverage Δ Complexity Δ
...wssdk/http/nio/netty/internal/ResponseHandler.java 87.09% <ø> (-0.09%) 0 <0> (ø)
...p/nio/netty/internal/HealthCheckedChannelPool.java 90.9% <100%> (+0.43%) 0 <0> (ø) ⬇️
...k/http/nio/netty/internal/ChannelAttributeKey.java 92.3% <100%> (+0.64%) 0 <0> (ø) ⬇️
...nio/netty/internal/OldConnectionReaperHandler.java 81.81% <0%> (-9.1%) 0% <0%> (ø)
...o/netty/internal/http2/HttpOrHttp2ChannelPool.java 80.64% <0%> (-1.08%) 0% <0%> (ø)
...ssdk/core/internal/async/FileAsyncRequestBody.java 87.61% <0%> (+3.8%) 0% <0%> (ø) ⬇️
...ine/stages/ApiCallAttemptTimeoutTrackingStage.java 100% <0%> (+6.89%) 0% <0%> (ø) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 99a44ce...494f26d. Read the comment docs.

Comment on lines 163 to +169
private boolean isHealthy(Channel channel) {
// There might be cases where the channel is not reusable but still active at the moment
// See https://github.com/aws/aws-sdk-java-v2/issues/1380
if (channel.attr(KEEP_ALIVE).get() != null && !channel.attr(KEEP_ALIVE).get()) {
return false;
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is fine, but any reason we don't close the channel after the response ins consumed instead?

Copy link
Contributor Author

@zoewangg zoewangg Oct 18, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We do invoke ChannelHandlerContext#close in ResponseHandler for this case but the channel might not be actually closed and is still active at this point. It's not guaranteed the channel is closed before it's released into the pool.

Also, because of the fact that the underlying HttpStreamsClientHandler from netty-reactive-streams does not close the channel if there are still inflight messages, channel.close might not be initiated at all. (This is what we discussed in the morning and we might need to figure out why there is still inflight message when the response is complete in some cases)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants