Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix a bug where a connection may be not reused when using RetryingClient #5290

Merged
merged 11 commits into from
Nov 8, 2023

Conversation

ikhoon
Copy link
Contributor

@ikhoon ikhoon commented Nov 7, 2023

Motivation:

Armeria's HttpChannelPool is bound to an EventLoop. Different EventLoops have different HttpChannelPools. In other words, in order to reuse a connection for an Endpoint, the same EventLoop must be selected.

When creating a derived client in RetryingClient, a new endpoint is selected for each retry, but since the EventLoop of the parent is used as is. That causes the Endpoint can't use the existing connection pool for multiplexing and makes a new connection.

Modifications:

  • Use EventLoopScheduler to return constant EventLoops for the same endpoint.
  • Allow setting EndpointGroup in ClientRequestContextBuilder for testability.

Result:

  • You no longer see a connection leak when using RetryingClient with EndpointGroup.

…client.

Motivation:

Armeria's `HttpChannelPool` is bound to an `EventLoop`. Different
`EventLoop`s have different `HttpChannelPool`s. In other words, in order
to reuse a connection for an `Endpoint`, the same `EventLoop` must be
selected.

When creating a derived client in `RetryingClient`, a new endpoint is
selected for each retry, but since the `EventLoop` of parent is used as
is. That causes that the `Endpoint` can't use the existing connection
pool for multiplexing and makes a new connection created.

Modifications:

- Use `EventLoopScheduler` to return constant `EventLoop`s for the same
  endpoint.
- Allow setting `EndpointGroup` in `ClientRequestContextBuilder` for
  testability.

Result:

- You no longer not see a connection leak when using `RetryingClient`
  with `EndpointGroup`.
@ikhoon ikhoon added the defect label Nov 7, 2023
@ikhoon ikhoon added this to the 1.26.2 milestone Nov 7, 2023
@ikhoon ikhoon changed the title Fix a bug where a connection may be not reused when using a retrying client. Fix a bug where a connection may be not reused when using RetryingClient Nov 7, 2023
acquireEventLoop(endpoint);
}

log = RequestLog.builder(this);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like we should do this before calling acquireEventLoop

Copy link
Contributor

@jrhee17 jrhee17 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left some minor questions 🙇

@@ -531,6 +528,14 @@ private DefaultClientRequestContext(DefaultClientRequestContext ctx,

this.endpointGroup = endpointGroup;
updateEndpoint(endpoint);
if (ctx.endpoint() == endpoint || endpoint == null) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question) Is this compare by reference as opposed to equality intentionally?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Originally, reference comparison was used to know the derivation was the initial attempt. The logic is changed to check the initial attempt more explicitly.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reference comparison was used to know the derivation was the initial attempt

Are you suggesting that ctx.endpoint() == endpoint implies that it is the initial attempt?

Anyways, I was actually implying that it is possible that an endpoint group updates its endpoints, and comparing by reference won't reflect it. I think it's fine if you are aware of this

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ctx.log.children().isEmpty() would be enough but added reference equality for double check.
ctx.endpoint() == endpoint && ctx.log.children().isEmpty()

If a derivation is not an initial attempt, acquireEventloop() needs to be called even for the same Endpoint to update the internal state of DefaultEventLoopScheduler.

Comment on lines 533 to 537
if (ctx.endpoint() == endpoint || endpoint == null) {
eventLoop = ctx.eventLoop().withoutContext();
} else {
acquireEventLoop(endpoint);
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if we just always call acquireEventLoop()? (maybe except when endpoint == null)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It sounds like a good idea to update the lastActivityTimeNanos.

However if we always call acquireEventLoop(), the number of active requests may be increased for the initial attempt. Let me add an exception for the initial attempt.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, when ctx.endpoint() == endpoint and the maxNumEventLoopPerEndpoint is greater than 1, the different eventLoop might be used which is inefficient.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the different eventLoop might be used which is inefficient.

Although it is inefficient I think it would be good to try a different connection than the connection that has already failed.

@trustin
Copy link
Member

trustin commented Nov 8, 2023

And can we also add an integration test that involves RetryingClient to ensure the fix actually fixes the problem described in the PR description?

Copy link

codecov bot commented Nov 8, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (ff9f2e0) 0.00% compared to head (d73eabc) 73.95%.
Report is 2 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff             @@
##             main    #5290       +/-   ##
===========================================
+ Coverage        0   73.95%   +73.95%     
- Complexity      0    20104    +20104     
===========================================
  Files           0     1730     +1730     
  Lines           0    74139    +74139     
  Branches        0     9460     +9460     
===========================================
+ Hits            0    54831    +54831     
- Misses          0    14832    +14832     
- Partials        0     4476     +4476     
Files Coverage Δ
...a/internal/client/DefaultClientRequestContext.java 90.90% <100.00%> (ø)

... and 1729 files with indirect coverage changes

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Contributor

@jrhee17 jrhee17 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Happy with the changes once the CI passes 👍 Thanks @ikhoon 🙇 👍 🙇

@ikhoon
Copy link
Contributor Author

ikhoon commented Nov 8, 2023

I've reverted all changes in *RequestContextBuilders since it caused additional side effects in tests which would be out of the scope of this PR.

Let me handle it in a separate PR to build a RequestContext with an EndpointGroup

@jrhee17
Copy link
Contributor

jrhee17 commented Nov 8, 2023

By the way, was this ever commented on/addressed?
#5290 (comment)

Copy link
Member

@trustin trustin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just once nit - LGTM 🙇

@ikhoon
Copy link
Contributor Author

ikhoon commented Nov 8, 2023

By the way, was this ever commented on/addressed?

Commented. JFYI, if condition has changed.

Copy link
Member

@trustin trustin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

Copy link
Member

@minwoox minwoox left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for fixing this. 🙇

@minwoox minwoox merged commit 88ec879 into line:main Nov 8, 2023
15 checks passed
@ikhoon ikhoon deleted the retry-client-eventloop branch June 28, 2024 10:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants