Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kvserver: unhandled TransactionPushError when push finds STAGING but known-uncommitted Read Committed transaction #105330

Closed
tbg opened this issue Jun 22, 2023 · 2 comments · Fixed by #107882
Assignees
Labels
C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. db-cy-23 T-kv KV Team

Comments

@tbg
Copy link
Member

tbg commented Jun 22, 2023

Describe the problem

In #97779 we're introducing reproposals and reproposal errors in kvnemesis.

For some reason, under stress, kvnemesis will now sometimes find a *TransactionPushError bubble up from db.Get and db.Scan (and likely other methods), which should not be possible.

@arulajmani is looking at this in https://cockroachlabs.slack.com/archives/G01G8LK77DK/p1687356168950489.

Jira issue: CRDB-28983

@tbg tbg added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. T-kv KV Team labels Jun 22, 2023
tbg added a commit to tbg/cockroach that referenced this issue Jun 22, 2023
We're frequently getting a TransactionPushError back from `db.{Get,Scan,...}`.

Accept this as "ok" while we investigate the failure, to avoid delaying merging
the kvnemesis updates.

cockroachdb#105330
tbg added a commit to tbg/cockroach that referenced this issue Jun 22, 2023
We're frequently getting a TransactionPushError back from `db.{Get,Scan,...}`.

Accept this as "ok" while we investigate the failure, to avoid delaying merging
the kvnemesis updates.

cockroachdb#105330
tbg added a commit to tbg/cockroach that referenced this issue Jun 27, 2023
We're frequently getting a TransactionPushError back from `db.{Get,Scan,...}`.

Accept this as "ok" while we investigate the failure, to avoid delaying merging
the kvnemesis updates.

cockroachdb#105330
@tbg
Copy link
Member Author

tbg commented Jun 27, 2023

I am merging d7ae4d9 now, which works around this issue - please revert when this issue is addressed.

@nvanbenschoten
Copy link
Member

In the test, we see a pusher request (always non-txn'al, but it's not clear that this matters) issue a PushTxnRequest(PUSH_TIMESTAMP) against a weak isolation transaction. Since 5a53d25, these pushes should be able to push the pushee's timestamp. However, we hit this logic:

// If the pusher is aware that the pushee's currently recorded attempt at a
// parallel commit failed, either because it found intents at a higher
// timestamp than the parallel commit attempt or because it found intents at
// a higher epoch than the parallel commit attempt, it should not consider
// the pushee to be performing a parallel commit. Its commit status is not
// indeterminate.
if (knownHigherTimestamp || knownHigherEpoch) && reply.PusheeTxn.Status == roachpb.STAGING {
reply.PusheeTxn.Status = roachpb.PENDING
reply.PusheeTxn.InFlightWrites = nil
// If the pusher is aware that the pushee's currently recorded attempt
// at a parallel commit failed, upgrade PUSH_TIMESTAMPs to PUSH_ABORTs.
// We don't want to move the transaction back to PENDING, as this is not
// (currently) allowed by the recovery protocol. We also don't want to
// move the transaction to a new timestamp while retaining the STAGING
// status, as this could allow the transaction to enter an implicit
// commit state without its knowledge, leading to atomicity violations.
//
// This has no effect on pushes that fail with a TransactionPushError.
// Such pushes will still wait on the pushee to retry its commit and
// eventually commit or abort. It also has no effect on expired pushees,
// as they would have been aborted anyway. This only impacts pushes
// which would have succeeded due to priority mismatches. In these
// cases, the push acts the same as a short-circuited transaction
// recovery process, because the transaction recovery procedure always
// finalizes target transactions, even if initiated by a PUSH_TIMESTAMP.
if pushType == kvpb.PUSH_TIMESTAMP {
pushType = kvpb.PUSH_ABORT
}
}

So we promote the PUSH_TIMESTAMP (which would be allowed) into a PUSH_ABORT (which is not), so command evaluation throws a TransactionPushError. We then catch this error and determine whether to wait in the txnWaitQueue:

// On a transaction push error, retry immediately if doing so will enqueue
// into the txnWaitQueue in order to await further updates to the unpushed
// txn's status. We check ShouldPushImmediately to avoid retrying
// non-queueable PushTxnRequests (see #18191).
dontRetry := r.store.cfg.TestingKnobs.DontRetryPushTxnFailures
if !dontRetry && ba.IsSinglePushTxnRequest() {
pushReq := ba.Requests[0].GetInner().(*kvpb.PushTxnRequest)
dontRetry = txnwait.ShouldPushImmediately(pushReq)
}
if dontRetry {
return g, pErr
}

Since we once again determine that we should be able to perform the PUSH_TIMESTAMP (this code doesn't know about the PUSH_TIMESTAMP -> PUSH_ABORT promotion), we throw the error up to the caller.

@nvanbenschoten nvanbenschoten changed the title kvserver: unhandled TransactionPushError from db.Get kvserver: unhandled TransactionPushError when push finds STAGING but known-uncommitted Read Committed transaction Jul 13, 2023
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Jul 31, 2023
…iority

DNM: needs more testing.

Fixes cockroachdb#105330.
Fixes cockroachdb#101721.

This commit updates the transaction push logic to stop pushes from completing
successfully when the pushee is STAGING and the pusher has equal priority. Prior
to this change, the pusher would win in these cases when using a PUSH_TIMESTAMP
if at least one of the two transactions involved used a weak isolation level.

This had two undesirable effects:
- if the pushee was still possibly committable and requiring recovery
  (`!(knownHigherTimestamp || knownHigherEpoch)` in the code) then the pusher
  would immediately begin parallel commit recovery, attempting to disrupt the
  commit and abort the pushee. This is undesirable because the pushee may still
  be in progress and in cases of equal priority, we'd like to wait for the
  parallel commit to complete before kicking off recovery, deferring recovery to
  only the cases where the pushee/committers's heartbeat has expired.
- if the pushee was known too be uncommittable (`knownHigherTimestamp || knownHigherEpoch`
  in the code), then txn recovery was not kicked off but the pusher still could
  not perform the PUSH_TIMESTAMP (see e40c1b4), so it would return a
  `TransactionPushError`. This confused logic in `handleTransactionPushError`,
  allowing the error to escape to the client.

This commit resolves both issues by considering the pushee's transaction status
in `txnwait.ShouldPushImmediately`.

Release note: None
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Aug 2, 2023
Informs cockroachdb#105330.

This commit adds a new unit test named `TestTxnRecoveryFromStagingWithoutHighPriority`,
which reproduces the error described in cockroachdb#105330.

Release note: None
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Aug 2, 2023
Fixes cockroachdb#105330.
Fixes cockroachdb#101721.

This commit updates the transaction push logic to stop pushes from completing
successfully when the pushee is STAGING and the pusher has equal priority. Prior
to this change, the pusher would win in these cases when using a PUSH_TIMESTAMP
if at least one of the two transactions involved used a weak isolation level.

This had two undesirable effects:
- if the pushee was still possibly committable and requiring recovery
  (`!(knownHigherTimestamp || knownHigherEpoch)` in the code) then the pusher
  would immediately begin parallel commit recovery, attempting to disrupt the
  commit and abort the pushee. This is undesirable because the pushee may still
  be in progress and in cases of equal priority, we'd like to wait for the
  parallel commit to complete before kicking off recovery, deferring recovery to
  only the cases where the pushee/committers's heartbeat has expired.
- if the pushee was known too be uncommittable (`knownHigherTimestamp || knownHigherEpoch`
  in the code), then txn recovery was not kicked off but the pusher still could
  not perform the PUSH_TIMESTAMP (see e40c1b4), so it would return a
  `TransactionPushError`. This confused logic in `handleTransactionPushError`,
  allowing the error to escape to the client.

This commit resolves both issues by considering the pushee's transaction status
in `txnwait.ShouldPushImmediately`.

Release note: None
craig bot pushed a commit that referenced this issue Aug 3, 2023
107882: kv: don't let pusher win when pushing STAGING txn with equal priority r=miraradeva a=nvanbenschoten

Fixes #105330.
Fixes #101721.

This commit updates the transaction push logic to stop pushes from completing successfully when the pushee is STAGING and the pusher has equal priority. Prior to this change, the pusher would win in these cases when using a PUSH_TIMESTAMP if at least one of the two transactions involved used a weak isolation level.

This had two undesirable effects:
- if the pushee was still possibly committable and requiring recovery (`!(knownHigherTimestamp || knownHigherEpoch)` in the code) then the pusher would immediately begin parallel commit recovery, attempting to disrupt the commit and abort the pushee. This is undesirable because the pushee may still be in progress and in cases of equal priority, we'd like to wait for the parallel commit to complete before kicking off recovery, deferring recovery to only the cases where the pushee/committers's heartbeat has expired.
- if the pushee was known to be uncommittable (`knownHigherTimestamp || knownHigherEpoch` in the code), then txn recovery was not kicked off but the pusher still could not perform the PUSH_TIMESTAMP (see e40c1b4), so it would return a `TransactionPushError`. This confused logic in `handleTransactionPushError`, allowing the error to escape to the client.

This commit resolves both issues by considering the pushee's transaction status in `txnwait.ShouldPushImmediately`.

Release note: None

107962: docs: do not generate docs for crdb_internal functions r=knz a=rafiss

Epic: None
Release note: None

108062: server: Add rangefeed metrics r=miretskiy a=miretskiy

Add rangefeed related metrics to keep track of the number of actively running rangefeeds on the server.

Epic: CRDB-26372
Release note: None

108071: stmtdiagnostics: use background context when building the bundle r=yuzefovich a=yuzefovich

When the context is canceled, we still want to build the bundle as best as possible. Over in 532274b we introduced the usage of the background context in order to insert the bundle into the system tables, but we still built the bundle with the canceled context. This commit fixes that oversight - in particular, we should now get `env.sql` correctly.

Informs: https://github.com/cockroachlabs/support/issues/2494.
Epic: None

Release note: None

108085: builtins: improve docs and refactor code for unordered_unique_id r=rafiss,rharding6373 a=andyyang890

**builtins: improve documentation for unordered_unique_rowid**

This patch clarifies that `unordered_unique_rowid` generates unique IDs
that are statistically likely to be unordered because it bit-reverses
the insert timestamp for use as the ID's timestamp bit segment.

Release note: None

----

**builtinconstants: define constants for unique int bit segments**

This patch defines constants for the sizes and bitmasks for each
bit segment in a unique int generated for row IDs.

Release note: None

----

**builtins: refactor bit shifting logic in mapToUnorderedUniqueInt**

This patch refactors the bit shifting logic in `mapToUnorderedUniqueInt`
to use constants instead of magic numbers.

Release note: None

----

Epic: None


Co-authored-by: Nathan VanBenschoten <[email protected]>
Co-authored-by: Rafi Shamim <[email protected]>
Co-authored-by: Yevgeniy Miretskiy <[email protected]>
Co-authored-by: Yahor Yuzefovich <[email protected]>
Co-authored-by: Andy Yang <[email protected]>
@craig craig bot closed this as completed in 90516db Aug 3, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. db-cy-23 T-kv KV Team
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants