pickfirst: Implement Happy Eyeballs #7725
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## master #7725 +/- ##
==========================================
- Coverage 82.00% 81.74% -0.27%
==========================================
Files 373 374 +1
Lines 37735 37930 +195
==========================================
+ Hits 30945 31004 +59
- Misses 5512 5615 +103
- Partials 1278 1311 +33
Should we mention the environment variables in the release note? Or at least in the PR description?
I couldn't complete a full pass, but here are some comments to get started.
Updated the release notes.
LGTM. Some minor nits in the tests.
// Replace the timer channel so that the old timers don't attempt to read
// messages pushed next.
Old timers should get canceled when subsequent subchannels are created, right? Why do we need to do this?
This is required since `pickfirst` will stop the timer, but the fake `TimeAfterFunc` will still keep waiting on the timer channel until the context is cancelled. If there are multiple listeners on the timer channel, they will race to read from the channel.
This could be avoided by introducing an interface for a `time.Timer` so that the test can intercept calls to `Timer.Stop()`.
I see what you are saying. That seems better to me, unless it is too much work.
Refactored to have `internal.TimeAfterFunc` return a `cancelFunc()` instead of a timer. This allows the test to stop the fake timer when `pickfirst` cancels it. I also created a helper function that returns a timer function along with a function to trigger the timer manually, instead of having the tests write to a channel.
testutils.AwaitNotState(shortCtx, t, cc, connectivity.TransientFailure)

// Third SubConn fails.
shortCancel()
Do we need this? Won't `testutils.AwaitNotState` fail the test if the specified state is reached before the context expires?
It's not required because of the way `testutils.AwaitNotState` works. When I tried to ignore the first cancel function as follows:
shortCtx, _ := context.WithTimeout(ctx, defaultTestShortTimeout)
`govet` complains about a possible context leak because it can't ensure at compile time that the context will be cancelled. If we re-assign the `cancel` func later, `govet` doesn't complain, but I still called `cancel` just to be consistent. Removed the call now.
// The happy eyeballs timer expires, skipping server[1] and requesting the creation
// of a third SubConn.
Why do you say we are skipping server[1] here? IIUC:
- we first started a connection to server[0]
- connection to server[0] failed before the HE timer fired
- so, we started a connection to server[1]
- now, the HE timer has fired
- so, we would start a connection to server[2]
I don't see where we are skipping server[1].
The test doesn't skip the server, but it skips waiting for the SubConn to report a success or failure and moves on to the next SubConn. The comment was taken from Java's test case. I've improved the wording now.
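The sequence discussed above can be sketched as a toy model. This is a simplification for illustration only, not the pickfirst implementation: the `attempts` type and its methods are hypothetical, and real timers, backoff, and connectivity states are left out.

```go
package main

import "fmt"

// attempts is a toy model of the connection-attempt order.
type attempts struct {
	addrs   []string // addresses in resolver order
	started []string // addresses for which an attempt has been started
}

// startNext begins an attempt to the next untried address, if any remain.
func (a *attempts) startNext() {
	if len(a.started) < len(a.addrs) {
		a.started = append(a.started, a.addrs[len(a.started)])
	}
}

// onFailure models a failed attempt. In this sketch only a failure of the
// most recently started attempt moves on to the next address.
func (a *attempts) onFailure(addr string) {
	if n := len(a.started); n > 0 && a.started[n-1] == addr {
		a.startNext()
	}
}

// onTimer models the Happy Eyeballs timer firing: the next attempt starts
// without waiting for the pending attempt to succeed or fail.
func (a *attempts) onTimer() { a.startNext() }

func main() {
	a := &attempts{addrs: []string{"server[0]", "server[1]", "server[2]"}}
	a.startNext()            // connection to server[0] starts
	a.onFailure("server[0]") // it fails before the timer fires, so server[1] starts
	a.onTimer()              // timer fires while server[1] is pending, so server[2] starts
	fmt.Println(a.started)
}
```

In this model server[1] is still attempted; the timer only stops the policy from waiting on its outcome before moving to server[2].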
// The SubConn is being re-used and failed during a previous pass
// over the addressList. It has not completed backoff yet.
// Mark it as having failed and try the next address.
scd.connectionFailed = true
`connectionFailed` is a bit like `lastErr != nil`. Do we need both?
`lastErr` is used to update the picker at the end of the first pass. In the case where the last address in the list hasn't completed its backoff from a previous attempt, `scd.lastErr` would store a non-nil error. This is why `scd.lastErr` is not reset when starting the first pass over a new address list.
`scd.connectionFailed` indicates whether the subchannel has failed with the latest address list from the resolver. It is reset before starting the first pass.
Consider a subchannel that is being re-used after a resolver update because its address is present in the new address list. The subchannel has already failed, so it has `scd.lastErr` set and `scd.connectionFailed` set to `true`. When the first pass starts, `scd.connectionFailed` is set to `false`.
- If the subchannel has completed backoff when the iteration over the address list reaches it, the subchannel will be connected since its state is IDLE. When it fails again, `scd.connectionFailed` will be set to `true` and `scd.lastErr` will be updated.
- If the subchannel is in backoff when the iteration over the address list reaches it, the subchannel will not be re-tried. `scd.lastErr` will be retained and `scd.connectionFailed` will be set to `true`.
The above steps ensure that the subchannel always has a non-nil error to update the picker.
OK, I see what's happening here, thanks for the explanation.
Maybe name it `connectionFailed(In/During)FirstPass`?
Renamed to `connectionFailedInFirstPass`.
LGTM modulo the one request to change `connectionFailed` to be a little more specific.
Thanks!!
As part of the Dualstack design, the pickfirst policy should implement the happy eyeballs algorithm while connecting to multiple backends.
The timeout for the happy eyeballs connection timer is NOT configurable as that's an optional requirement in the gRFC.
RELEASE NOTES:
`pickfirst` LB policy (disabled by default) supports Happy Eyeballs to attempt connections to multiple backends concurrently. The experimental `pickfirst` policy can be enabled by setting the environment variable `GRPC_EXPERIMENTAL_ENABLE_NEW_PICK_FIRST` to `true`.