pickfirst: Implement Happy Eyeballs #7725

Merged
merged 26 commits on Nov 12, 2024

Conversation

Contributor

@arjan-bal arjan-bal commented Oct 10, 2024

As part of the Dualstack design, the pickfirst policy should implement the happy eyeballs algorithm while connecting to multiple backends.

The timeout for the happy eyeballs connection timer is NOT configurable as that's an optional requirement in the gRFC.

RELEASE NOTES:

  • The new experimental pickfirst LB policy (disabled by default) supports Happy Eyeballs to attempt connections to multiple backends concurrently. The experimental pickfirst policy can be enabled by setting the environment variable GRPC_EXPERIMENTAL_ENABLE_NEW_PICK_FIRST to true.
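
For readers unfamiliar with Happy Eyeballs, the idea of staggering connection attempts can be sketched roughly as follows. This is a simplified illustration, not the pickfirst implementation: `connectFirst`, `dialFunc`, and `runDemo` are hypothetical names, and the 10ms delay is chosen only to keep the demo fast. The next attempt starts when the previous one fails or when the stagger timer fires, whichever happens first.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// dialFunc is a hypothetical stand-in for one connection attempt.
type dialFunc func(ctx context.Context, addr string) error

type result struct {
	addr string
	err  error
}

// connectFirst tries addrs in order and returns the first address that
// connects. A new attempt starts when the previous one fails or when the
// stagger delay elapses, so slow addresses do not block later ones.
func connectFirst(ctx context.Context, addrs []string, delay time.Duration, dial dialFunc) (string, error) {
	if len(addrs) == 0 {
		return "", errors.New("no addresses")
	}
	ctx, cancel := context.WithCancel(ctx)
	defer cancel() // abandon attempts still in flight when we return

	results := make(chan result, len(addrs)) // buffered: late attempts never block
	next, pending := 0, 0
	start := func() {
		addr := addrs[next]
		next++
		pending++
		go func() { results <- result{addr, dial(ctx, addr)} }()
	}

	start()
	timer := time.NewTimer(delay)
	defer timer.Stop()

	var lastErr error
	for pending > 0 {
		select {
		case r := <-results:
			pending--
			if r.err == nil {
				return r.addr, nil
			}
			lastErr = r.err
			if next < len(addrs) {
				// Restart the stagger timer for the new attempt, draining
				// a stale tick if the timer already fired.
				if !timer.Stop() {
					select {
					case <-timer.C:
					default:
					}
				}
				start()
				timer.Reset(delay)
			}
		case <-timer.C:
			// Previous attempt is still pending; move on without waiting.
			if next < len(addrs) {
				start()
				timer.Reset(delay)
			}
		case <-ctx.Done():
			return "", ctx.Err()
		}
	}
	return "", lastErr
}

// runDemo uses three fake addresses: "a" hangs, "b" fails, "c" connects.
func runDemo() (string, error) {
	dial := func(ctx context.Context, addr string) error {
		switch addr {
		case "a":
			<-ctx.Done()
			return ctx.Err()
		case "b":
			return errors.New("connection refused")
		}
		return nil
	}
	return connectFirst(context.Background(), []string{"a", "b", "c"}, 10*time.Millisecond, dial)
}

func main() {
	addr, err := runDemo()
	fmt.Println(addr, err)
}
```

In the demo, the attempt to "a" never completes, so the timer moves the loop on to "b"; the failure of "b" then starts "c" immediately rather than waiting for another timer tick.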

@arjan-bal arjan-bal added the Type: Feature New features or improvements in behavior label Oct 10, 2024
@arjan-bal arjan-bal added this to the 1.68 Release milestone Oct 10, 2024
@arjan-bal arjan-bal requested a review from easwars October 10, 2024 10:04
@arjan-bal arjan-bal force-pushed the grpc-go-happy-eyeballs branch from 7cb88fe to db0dda7 Compare October 10, 2024 10:08

codecov bot commented Oct 10, 2024

Codecov Report

Attention: Patch coverage is 86.81319% with 12 lines in your changes missing coverage. Please review.

Project coverage is 81.74%. Comparing base (18d218d) to head (5c4ff49).
Report is 9 commits behind head on master.

Files with missing lines                            Patch %   Lines
balancer/pickfirst/pickfirstleaf/pickfirstleaf.go   86.36%    9 Missing and 3 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #7725      +/-   ##
==========================================
- Coverage   82.00%   81.74%   -0.27%     
==========================================
  Files         373      374       +1     
  Lines       37735    37930     +195     
==========================================
+ Hits        30945    31004      +59     
- Misses       5512     5615     +103     
- Partials     1278     1311      +33     
Files with missing lines                            Coverage Δ
balancer/pickfirst/internal/internal.go             100.00% <100.00%> (ø)
balancer/pickfirst/pickfirstleaf/pickfirstleaf.go   88.83% <86.36%> (+0.06%) ⬆️

... and 39 files with indirect coverage changes

Contributor

easwars commented Oct 10, 2024

Should we mention the environment variables in the release note? Or at least in the PR description?

Contributor

@easwars easwars left a comment

I couldn't complete a full pass, but here are some comments to get started.

balancer/pickfirst/pickfirstleaf/pickfirstleaf.go
internal/envconfig/envconfig.go
@easwars easwars assigned arjan-bal and unassigned easwars Oct 10, 2024
@arjan-bal
Contributor Author

Should we mention the environment variables in the release note? Or at least in the PR description?

Updated the release notes.

@arjan-bal arjan-bal assigned easwars and unassigned arjan-bal Oct 11, 2024
@purnesh42H purnesh42H modified the milestones: 1.68 Release, 1.69 Release Oct 16, 2024
@arjan-bal arjan-bal force-pushed the grpc-go-happy-eyeballs branch from 7f3065d to 67f7a1a Compare October 16, 2024 11:09
Contributor

@easwars easwars left a comment

LGTM. Some minor nits in the tests.

balancer/pickfirst/pickfirstleaf/pickfirstleaf_ext_test.go
Comment on lines 1060 to 1061
// Replace the timer channel so that the old timers don't attempt to read
// messages pushed next.
Contributor

Old timers should get canceled when subsequent subchannels are created, right? Why do we need to do this?

Contributor Author

This is required since pickfirst will stop the timer, but the fake TimeAfterFunc will still keep waiting on the timer channel till the context is cancelled. If there are multiple listeners on the timer channel, they will race to read from the channel.

This could be avoided by introducing an interface for a time.Timer so that the test can intercept calls to Timer.Stop().

Contributor

I see what you are saying. That seems better to me, unless it is too much work.

Contributor Author

Refactored internal.TimeAfterFunc to return a cancelFunc() instead of a timer. This allows the test to stop the fake timer when pickfirst cancels it. I also created a helper function that returns a timer function together with a function to trigger the timer manually, so the tests no longer have to write to a channel.
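
A rough sketch of what such a helper could look like is shown below. The names `timeAfterFunc`, `realTimeAfterFunc`, and `newManualTimer` are hypothetical and the signatures are assumptions based on the description above; the actual helper in the PR may differ. The point is that the fake returns a cancel function the test can observe, so a cancelled timer never fires and old timers cannot race with new ones.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// timeAfterFunc schedules f to run after d and returns a function that
// cancels it. The shape is an assumption, not the exact internal API.
type timeAfterFunc func(d time.Duration, f func()) (cancel func())

// realTimeAfterFunc shows how production code could satisfy the same shape.
func realTimeAfterFunc(d time.Duration, f func()) func() {
	t := time.AfterFunc(d, f)
	return func() { t.Stop() }
}

// newManualTimer returns a fake timeAfterFunc and a trigger function.
// The fake ignores the duration; the callback runs only when the test calls
// trigger, and only if the timer has not been cancelled in the meantime.
func newManualTimer() (timeAfterFunc, func()) {
	var mu sync.Mutex
	var pending func() // callback of the latest un-cancelled timer
	gen := 0           // identifies which timer owns pending

	after := func(_ time.Duration, f func()) func() {
		mu.Lock()
		defer mu.Unlock()
		gen++
		id := gen
		pending = f
		return func() {
			mu.Lock()
			defer mu.Unlock()
			// Cancelling an old timer must not clear a newer one.
			if gen == id {
				pending = nil
			}
		}
	}
	trigger := func() {
		mu.Lock()
		f := pending
		pending = nil // a timer fires at most once
		mu.Unlock()
		if f != nil {
			f()
		}
	}
	return after, trigger
}

func main() {
	var _ timeAfterFunc = realTimeAfterFunc // both variants share one shape

	after, trigger := newManualTimer()
	cancel := after(time.Second, func() { fmt.Println("first timer fired") })
	cancel()  // the code under test stops the timer
	trigger() // nothing happens: the timer was cancelled

	after(time.Second, func() { fmt.Println("second timer fired") })
	trigger() // runs the second callback
}
```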

@easwars easwars assigned arjan-bal and unassigned easwars Oct 22, 2024
@arjan-bal arjan-bal removed their assignment Oct 23, 2024
testutils.AwaitNotState(shortCtx, t, cc, connectivity.TransientFailure)

// Third SubConn fails.
shortCancel()
Contributor

Do we need this? Won't testutils.AwaitNotState fail the test if the specified state is reached before the context expires?

Contributor Author

It's not required because of the way testutils.AwaitNotState works. When I tried to ignore the first cancel function as follows:

shortCtx, _ := context.WithTimeout(ctx, defaultTestShortTimeout)

govet complains about a possible context leak because it can't ensure that the context will be cancelled at compile time. If we re-assign the cancel func later, govet doesn't complain but I still called cancel just to be consistent. Removed the call now.
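
For context, the pattern that keeps go vet's lostcancel check quiet is to keep the cancel function and call it, typically with defer. The sketch below is illustrative only; `closedWithin` is a hypothetical helper and is not part of the test file under review.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// closedWithin reports whether done is closed before the timeout d elapses.
func closedWithin(done <-chan struct{}, d time.Duration) bool {
	// Writing `shortCtx, _ := context.WithTimeout(...)` here would be
	// flagged by go vet, because the discarded cancel function is the
	// only way to release the context's resources before the timeout.
	shortCtx, cancel := context.WithTimeout(context.Background(), d)
	defer cancel()

	select {
	case <-done:
		return true
	case <-shortCtx.Done():
		return false
	}
}

func main() {
	open := make(chan struct{})
	fmt.Println(closedWithin(open, time.Millisecond))

	closed := make(chan struct{})
	close(closed)
	fmt.Println(closedWithin(closed, time.Second))
}
```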

Comment on lines 1024 to 1025
// The happy eyeballs timer expires, skipping server[1] and requesting the creation
// of a third SubConn.
Contributor

Why do you say we are skipping server[1] here? IIUC:

  • we first started a connection to server[0]
  • connection to server[0] failed before the HE timer fired
  • so, we started a connection to server[1]
  • now, the HE timer has fired
  • so, we would start a connection to server[2]

I don't see where we are skipping server[1].

Contributor Author

@arjan-bal arjan-bal Oct 24, 2024

The test doesn't skip the server, but it skips waiting for the SubConn to report a success or failure and moves on to the next SubConn. The comment was copied from Java's test case. I've improved the wording now.

@arjan-bal arjan-bal added the Area: Resolvers/Balancers Includes LB policy & NR APIs, resolver/balancer/picker wrappers, LB policy impls and utilities. label Nov 7, 2024
balancer/pickfirst/pickfirstleaf/pickfirstleaf.go
// The SubConn is being re-used and failed during a previous pass
// over the addressList. It has not completed backoff yet.
// Mark it as having failed and try the next address.
scd.connectionFailed = true
Member

connectionFailed is a bit like lastErr != nil. Do we need both?

Contributor Author

lastErr is used to update the picker at the end of the first pass. In the case where the last address in the list hasn't completed its backoff from a previous attempt, scd.lastErr would store a non-nil error. This is why scd.lastErr is not reset when starting the first pass over a new address list.

scd.connectionFailed indicates whether the subchannel has failed with the latest address list from the resolver. It is reset before starting the first pass.

Consider a subchannel that is being re-used after a resolver update because its address is present in the new address list. The subchannel has already failed, so it has scd.lastErr set and scd.connectionFailed set to true. When the first pass starts, scd.connectionFailed is set to false.

  • If the subchannel has completed backoff when the iteration over the address list reaches it, the subchannel will be connected since its state is IDLE. When it fails again, scd.connectionFailed will be set to true and scd.lastErr will be updated.
  • If the subchannel is in backoff when the iteration over the address list reaches it, the subchannel will not be re-tried. scd.lastErr will be retained and scd.connectionFailed will be set to true.

The above steps ensure that the subchannel always has a non-nil error to update the picker.
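
The interplay of the two fields can be sketched as follows. This is a deliberately simplified, sequential model, not the code in pickfirstleaf.go: `scData`, `firstPass`, the two-value `state` type, and the `connect` callback are all hypothetical, and the real policy reacts to asynchronous state updates rather than looping synchronously.

```go
package main

import (
	"errors"
	"fmt"
)

type state int

const (
	idle             state = iota // ready for a new connection attempt
	transientFailure              // failed earlier and still in backoff
)

var (
	errOld = errors.New("failure from a previous pass")
	errNew = errors.New("failure in this pass")
)

// scData is a trimmed-down, hypothetical per-subchannel record.
type scData struct {
	addr    string
	state   state
	lastErr error // survives resolver updates
	// Reset at the start of every first pass over a new address list.
	connectionFailedInFirstPass bool
}

// firstPass walks the address list once and returns the error that would
// be reported to the picker if no subchannel connects.
func firstPass(list []*scData, connect func(*scData) error) error {
	for _, scd := range list {
		scd.connectionFailedInFirstPass = false // lastErr is NOT reset
	}
	var pickerErr error
	for _, scd := range list {
		if scd.state == transientFailure {
			// Still in backoff: do not retry, keep the earlier error.
			scd.connectionFailedInFirstPass = true
			pickerErr = scd.lastErr
			continue
		}
		if err := connect(scd); err != nil {
			scd.state = transientFailure
			scd.lastErr = err
			scd.connectionFailedInFirstPass = true
			pickerErr = err
			continue
		}
		return nil // connected
	}
	return pickerErr
}

func main() {
	a := &scData{addr: "a", state: idle}
	b := &scData{addr: "b", state: transientFailure, lastErr: errOld,
		connectionFailedInFirstPass: true}
	err := firstPass([]*scData{a, b}, func(*scData) error { return errNew })
	fmt.Println(err, a.lastErr, b.lastErr)
}
```

Because the re-used subchannel "b" keeps its earlier error while in backoff, the pass still ends with a non-nil error even though no new attempt was made on it.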

Member

OK, I see what's happening here, thanks for the explanation.

Maybe name it connectionFailed(In/During)FirstPass?

Contributor Author

Renamed to connectionFailedInFirstPass.

internal/envconfig/envconfig.go
@dfawley dfawley assigned arjan-bal and unassigned dfawley Nov 7, 2024
@arjan-bal arjan-bal assigned dfawley and unassigned arjan-bal Nov 8, 2024
Member

@dfawley dfawley left a comment

LGTM modulo the one request to change connectionFailed to be a little more specific.

Thanks!!

@dfawley dfawley assigned arjan-bal and unassigned dfawley Nov 11, 2024
@arjan-bal arjan-bal merged commit e2b98f9 into grpc:master Nov 12, 2024
15 checks passed
@arjan-bal arjan-bal deleted the grpc-go-happy-eyeballs branch November 12, 2024 09:04
Labels
Area: Resolvers/Balancers Includes LB policy & NR APIs, resolver/balancer/picker wrappers, LB policy impls and utilities. Type: Feature New features or improvements in behavior
5 participants