Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sql: make deadline errors retryable #99760

Conversation

ajwerner
Copy link
Contributor

If a transaction gets finds itself with a deadline before its current timestamp, something that can happen when there are schema changes or when the sqlliveness subsystem is unavailable, we ought to retry the transaction. Today, an assertion failure is sent. This is the wrong behavior.

In order to achieve the desired goal, we detect the scenario in MaybeUpdateDeadline and we return an error which implements the appropriate interface to be interpreted as a retry error in the pgwire and sql layers.

This change does a few other little things:

  1. It simplies the logic to check whether an error is retriable so that
    all errors which pgwire will treat as client-visible retries are treated
    as retriable internally. In addition, we retain certain additional checks
    for retriable errors the sql layer handles explicitly.
  2. It makes sure to reset the sqlliveness session in the descs.Collection
    when restarting transactions. Not doing this, on the one hand, was an
    oversight that meant that if the session turned over internally, then
    restarts wouldn't see it. However, it's not actually a bug today because
    only secondary tenants set the session, and they never acquire a new
    session.

The test utilizes sessions rather than descriptors because the testing knob infrastructure is easier to manipulate.

Fixes: #96336

Backports will deal with #76727

Release note (bug fix): In rare cases involving overload and schema changes, users could sometimes, transiently, see errors of the form "deadline below read timestamp is nonsensical; txn has would have no chance to commit". These errors carried and internal pgcode and could not be retried. This form of error is now classified as a retriable error and will be retried automatically either by the client or internally.

@ajwerner ajwerner requested a review from a team March 28, 2023 04:56
@ajwerner ajwerner requested a review from a team as a code owner March 28, 2023 04:56
@ajwerner ajwerner requested a review from DrewKimball March 28, 2023 04:56
@cockroach-teamcity
Copy link
Member

This change is Reviewable

Copy link
Collaborator

@fqazi fqazi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm_strong: pretty straightforward and clean

Reviewed 5 of 5 files at r1, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @ajwerner and @DrewKimball)


pkg/sql/catalog/descs/txn_external_test.go line 166 at r1 (raw file):

}

type fakeSession struct{ exp hlc.Timestamp }

Would it make sense to lift this into a common package, since we have another fake session implementation inside sql_test

@ajwerner ajwerner force-pushed the ajwerner/make-deadline-expiration-errors-retryable branch 2 times, most recently from 0f2df26 to 9ddccd6 Compare March 28, 2023 13:57
Copy link
Contributor Author

@ajwerner ajwerner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TFTR!

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @DrewKimball and @fqazi)


pkg/sql/catalog/descs/txn_external_test.go line 166 at r1 (raw file):

Previously, fqazi (Faizan Qazi) wrote…

Would it make sense to lift this into a common package, since we have another fake session implementation inside sql_test

Done.

@ajwerner ajwerner force-pushed the ajwerner/make-deadline-expiration-errors-retryable branch from 9ddccd6 to a7ab071 Compare March 28, 2023 14:22
@ajwerner
Copy link
Contributor Author

TFTR!

bors r+

Copy link
Member

@yuzefovich yuzefovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 9 of 9 files at r2.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @ajwerner, @DrewKimball, and @fqazi)


pkg/sql/conn_executor.go line 3110 at r2 (raw file):

		errors.Is(err, retriableMinTimestampBoundUnsatisfiableError) ||
		errors.Is(err, descidgen.ErrDescIDSequenceMigrationInProgress) ||
		execinfra.IsDynamicQueryHasNoHomeRegionError(err) ||

nit: add pgerror.ClientVisibleRetryError interface assertions to the errors that are no longer explicitly mentioned in this method.

@ajwerner
Copy link
Contributor Author

bors r-

@craig
Copy link
Contributor

craig bot commented Mar 28, 2023

Canceled.

@ajwerner ajwerner force-pushed the ajwerner/make-deadline-expiration-errors-retryable branch from a7ab071 to 3e17136 Compare March 28, 2023 17:32
Copy link
Contributor Author

@ajwerner ajwerner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @DrewKimball, @fqazi, and @yuzefovich)


pkg/sql/conn_executor.go line 3110 at r2 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: add pgerror.ClientVisibleRetryError interface assertions to the errors that are no longer explicitly mentioned in this method.

var _ ClientVisibleRetryError = &TransactionRetryWithProtoRefreshError{}

and then otherwise done.

If a transaction gets finds itself with a deadline before its current
timestamp, something that can happen when there are schema changes or
when the sqlliveness subsystem is unavailable, we ought to retry the
transaction. Today, an assertion failure is sent. This is the wrong
behavior.

In order to achieve the desired goal, we detect the scenario in
MaybeUpdateDeadline and we return an error which implements the
appropriate interface to be interpreted as a retry error in the
pgwire and sql layers.

This change does a few other little things:

1) It simplies the logic to check whether an error is retriable so that
   all errors which pgwire will treat as client-visible retries are treated
   as retriable internally. In addition, we retain certain additional checks
   for retriable errors the sql layer handles explicitly.
2) It makes sure to reset the sqlliveness session in the descs.Collection
   when restarting transactions. Not doing this, on the one hand, was an
   oversight that meant that if the session turned over internally, then
   restarts wouldn't see it. However, it's not actually a bug today because
   only secondary tenants set the session, and they never acquire a new
   session.

The test utilizes sessions rather than descriptors because the testing
knob infrastructure is easier to manipulate.

Fixes: cockroachdb#96336

Backports will deal with cockroachdb#76727

Release note (bug fix): In rare cases involving overload and schema changes,
users could sometimes, transiently, see errors of the form "deadline below
read timestamp is nonsensical; txn has would have no chance to commit". These
errors carried and internal pgcode and could not be retried. This form of error
is now classified as a retriable error and will be retried automatically either
by the client or internally.
@ajwerner ajwerner force-pushed the ajwerner/make-deadline-expiration-errors-retryable branch from 3e17136 to d00dace Compare March 28, 2023 17:33
@ajwerner
Copy link
Contributor Author

bors r+

@craig
Copy link
Contributor

craig bot commented Mar 28, 2023

Build failed (retrying...):

@craig
Copy link
Contributor

craig bot commented Mar 29, 2023

Build failed (retrying...):

@craig
Copy link
Contributor

craig bot commented Mar 29, 2023

Build succeeded:

@craig craig bot merged commit bb5c97b into cockroachdb:master Mar 29, 2023
@blathers-crl
Copy link

blathers-crl bot commented Mar 29, 2023

Encountered an error creating backports. Some common things that can go wrong:

  1. The backport branch might have already existed.
  2. There was a merge conflict.
  3. The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.


error creating merge commit from ffb16ca to blathers/backport-release-22.1-99760: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 22.1.x failed. See errors above.


error creating merge commit from ffb16ca to blathers/backport-release-22.2-99760: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 22.2.x failed. See errors above.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport-23.1.x Flags PRs that need to be backported to 23.1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

bench: BenchmarkOrderBy fails with liveness session expired
5 participants