-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ccl/sqlproxyccl: fix possible NPE within the connector #96161
ccl/sqlproxyccl: fix possible NPE within the connector #96161
Conversation
243a756
to
c58c5e5
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @jeffswenson)
pkg/ccl/sqlproxyccl/connector.go
line 266 at r1 (raw file):
reportFailureErrs = 0 } err = errors.Wrap(err, reportErr.Error())
Not using errors.CombineErrors
is deliberate. Using that requires callers to manually extract the second error. In our case here, a simple Wrap is sufficient.
pkg/ccl/sqlproxyccl/connector_test.go
line 373 at r1 (raw file):
}) t.Run("context canceled after dial fails", func(t *testing.T) {
This test fails without the fix, which confirms the root cause analysis.
0c487e4
to
d9782cf
Compare
Previously, an invariant is violated where it was possible to return err=nil when the infinite retry dial loop exits. When that happens, callers would attempt to read from the net.Conn object, which is nil, leading to a panic. The invariant is violated whenever the context that was passed down to `dialTenantCluster` gets cancelled or expires. In particular, this can happen in two cases: 1. when the main stopper stops 2. when a connection migration process hits a timeout (of 15 seconds) The first case is rare since this has to happen in concert with a transient failure to dial the SQL server. Here's one example for the second case: 1. we block while dialing the SQL server 2. while we're waiting for (1), transfer hits a timeout, so context gets cancelled 3. (1) gets unblocked due to a timeout 4. err from (1) gets replaced with the error from ReportFailure 5. retry loop checks for context cancellation, and exits 6. we end up returning `nil, errors.Mark(nil, ctx.Err())` = `nil, nil` The root cause of this issue is that the error from ReportFailure replaced the original error, and usually ReportFailure suceeds. This commit fixes that issue by not reusing the same error variable for ReportFailure. Epic: none Release note: None
d9782cf
to
bdc365c
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
TFTR! bors r=JeffSwenson |
Build failed (retrying...): |
bors r=JeffSwenson |
Build failed (retrying...): |
Build succeeded: |
Related: https://github.com/cockroachlabs/support/issues/2040
Previously, an invariant is violated where it was possible to return err=nil when the infinite retry dial loop exits. When that happens, callers would attempt to read from the net.Conn object, which is nil, leading to a panic.
The invariant is violated whenever the context that was passed down to
dialTenantCluster
gets cancelled or expires. In particular, this can happen in two cases:The first case is rare since this has to happen in concert with a transient failure to dial the SQL server.
Here's one example for the second case:
nil, errors.Mark(nil, ctx.Err())
=nil, nil
The root cause of this issue is that the error from ReportFailure replaced the original error, and usually ReportFailure suceeds. This commit fixes that issue by not reusing the same error variable for ReportFailure.
Epic: none
Release note: None