ccl/sqlproxyccl: fix possible NPE within the connector #96161

@jaylim-crl jaylim-crl commented Jan 30, 2023

Related: https://github.com/cockroachlabs/support/issues/2040

Previously, it was possible to violate an invariant by returning err=nil when the infinite retry dial loop exited. When that happened, callers would attempt to read from the net.Conn object, which was nil, leading to a panic.

The invariant is violated whenever the context that was passed down to dialTenantCluster gets cancelled or expires. In particular, this can happen in two cases:

  1. when the main stopper stops
  2. when a connection migration process hits a timeout (of 15 seconds)

The first case is rare since this has to happen in concert with a transient failure to dial the SQL server.

Here's one example for the second case:

  1. we block while dialing the SQL server
  2. while we're waiting for (1), transfer hits a timeout, so context gets cancelled
  3. (1) gets unblocked due to a timeout
  4. err from (1) gets replaced with the error from ReportFailure
  5. retry loop checks for context cancellation, and exits
  6. we end up returning nil, errors.Mark(nil, ctx.Err()) = nil, nil

The root cause of this issue is that the error from ReportFailure replaced the original error, and ReportFailure usually succeeds. This commit fixes the issue by not reusing the same error variable for ReportFailure.

Epic: none

Release note: None

@jaylim-crl jaylim-crl requested review from a team as code owners January 30, 2023 03:32
@jaylim-crl jaylim-crl force-pushed the jay/220129-fix-sqlproxy-context-cancellation-bug branch from 243a756 to c58c5e5 Compare January 30, 2023 03:33
@jaylim-crl jaylim-crl left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @jeffswenson)


pkg/ccl/sqlproxyccl/connector.go line 266 at r1 (raw file):

						reportFailureErrs = 0
					}
					err = errors.Wrap(err, reportErr.Error())

Not using errors.CombineErrors is deliberate. Using that requires callers to manually extract the second error. In our case here, a simple Wrap is sufficient.


pkg/ccl/sqlproxyccl/connector_test.go line 373 at r1 (raw file):

	})

	t.Run("context canceled after dial fails", func(t *testing.T) {

This test fails without the fix, which confirms the root cause analysis.

@jaylim-crl jaylim-crl force-pushed the jay/220129-fix-sqlproxy-context-cancellation-bug branch 2 times, most recently from 0c487e4 to d9782cf Compare January 30, 2023 14:46
@jaylim-crl jaylim-crl force-pushed the jay/220129-fix-sqlproxy-context-cancellation-bug branch from d9782cf to bdc365c Compare January 30, 2023 15:05
@jeffswenson jeffswenson left a comment

LGTM

@jaylim-crl

TFTR!

bors r=JeffSwenson

craig bot commented Jan 30, 2023

Build failed (retrying...):

@jaylim-crl

bors r=JeffSwenson

craig bot commented Jan 31, 2023

Build failed (retrying...):

craig bot commented Jan 31, 2023

Build succeeded:

@craig craig bot merged commit ac65603 into cockroachdb:master Jan 31, 2023
@jaylim-crl jaylim-crl deleted the jay/220129-fix-sqlproxy-context-cancellation-bug branch January 31, 2023 11:45