kvserver: unhandled TransactionPushError when push finds STAGING but known-uncommitted Read Committed transaction #105330

tbg · 2023-06-22T09:12:52Z

Describe the problem

In #97779 we're introducing reproposals and reproposal errors in kvnemesis.

For some reason, under stress, kvnemesis will now sometimes find a *TransactionPushError bubble up from db.Get and db.Scan (and likely other methods), which should not be possible.

@arulajmani is looking at this in https://cockroachlabs.slack.com/archives/G01G8LK77DK/p1687356168950489.

Jira issue: CRDB-28983

The text was updated successfully, but these errors were encountered:

We're frequently getting a TransactionPushError back from `db.{Get,Scan,...}`. Accept this as "ok" while we investigate the failure, to avoid delaying merging the kvnemesis updates. cockroachdb#105330

tbg · 2023-06-27T12:00:40Z

I am merging d7ae4d9 now, which works around this issue - please revert when this issue is addressed.

nvanbenschoten · 2023-07-05T15:57:25Z

In the test, we see a pusher request (always non-txn'al, but it's not clear that this matters) issue a PushTxnRequest(PUSH_TIMESTAMP) against a weak isolation transaction. Since 5a53d25, these pushes should be able to push the pushee's timestamp. However, we hit this logic:

cockroach/pkg/kv/kvserver/batcheval/cmd_push_txn.go

Lines 216 to 244 in 972c379

    
           // If the pusher is aware that the pushee's currently recorded attempt at a 
        
           // parallel commit failed, either because it found intents at a higher 
        
           // timestamp than the parallel commit attempt or because it found intents at 
        
           // a higher epoch than the parallel commit attempt, it should not consider 
        
           // the pushee to be performing a parallel commit. Its commit status is not 
        
           // indeterminate. 
        
           if (knownHigherTimestamp || knownHigherEpoch) && reply.PusheeTxn.Status == roachpb.STAGING { 
        
           	reply.PusheeTxn.Status = roachpb.PENDING 
        
           	reply.PusheeTxn.InFlightWrites = nil 
        
           	// If the pusher is aware that the pushee's currently recorded attempt 
        
           	// at a parallel commit failed, upgrade PUSH_TIMESTAMPs to PUSH_ABORTs. 
        
           	// We don't want to move the transaction back to PENDING, as this is not 
        
           	// (currently) allowed by the recovery protocol. We also don't want to 
        
           	// move the transaction to a new timestamp while retaining the STAGING 
        
           	// status, as this could allow the transaction to enter an implicit 
        
           	// commit state without its knowledge, leading to atomicity violations. 
        
           	// 
        
           	// This has no effect on pushes that fail with a TransactionPushError. 
        
           	// Such pushes will still wait on the pushee to retry its commit and 
        
           	// eventually commit or abort. It also has no effect on expired pushees, 
        
           	// as they would have been aborted anyway. This only impacts pushes 
        
           	// which would have succeeded due to priority mismatches. In these 
        
           	// cases, the push acts the same as a short-circuited transaction 
        
           	// recovery process, because the transaction recovery procedure always 
        
           	// finalizes target transactions, even if initiated by a PUSH_TIMESTAMP. 
        
           	if pushType == kvpb.PUSH_TIMESTAMP { 
        
           		pushType = kvpb.PUSH_ABORT 
        
           	} 
        
           }

So we promote the PUSH_TIMESTAMP (which would be allowed) into a PUSH_ABORT (which is not), so command evaluation throws a TransactionPushError. We then catch this error and determine whether to wait in the txnWaitQueue:

cockroach/pkg/kv/kvserver/replica_send.go

Lines 769 to 780 in 972c379

    
           // On a transaction push error, retry immediately if doing so will enqueue 
        
           // into the txnWaitQueue in order to await further updates to the unpushed 
        
           // txn's status. We check ShouldPushImmediately to avoid retrying 
        
           // non-queueable PushTxnRequests (see #18191). 
        
           dontRetry := r.store.cfg.TestingKnobs.DontRetryPushTxnFailures 
        
           if !dontRetry && ba.IsSinglePushTxnRequest() { 
        
           	pushReq := ba.Requests[0].GetInner().(*kvpb.PushTxnRequest) 
        
           	dontRetry = txnwait.ShouldPushImmediately(pushReq) 
        
           } 
        
           if dontRetry { 
        
           	return g, pErr 
        
           }

Since we once again determine that we should be able to perform the PUSH_TIMESTAMP (this code doesn't know about the PUSH_TIMESTAMP -> PUSH_ABORT promotion), we throw the error up to the caller.

…iority DNM: needs more testing. Fixes cockroachdb#105330. Fixes cockroachdb#101721. This commit updates the transaction push logic to stop pushes from completing successfully when the pushee is STAGING and the pusher has equal priority. Prior to this change, the pusher would win in these cases when using a PUSH_TIMESTAMP if at least one of the two transactions involved used a weak isolation level. This had two undesirable effects: - if the pushee was still possibly committable and requiring recovery (`!(knownHigherTimestamp || knownHigherEpoch)` in the code) then the pusher would immediately begin parallel commit recovery, attempting to disrupt the commit and abort the pushee. This is undesirable because the pushee may still be in progress and in cases of equal priority, we'd like to wait for the parallel commit to complete before kicking off recovery, deferring recovery to only the cases where the pushee/committers's heartbeat has expired. - if the pushee was known too be uncommittable (`knownHigherTimestamp || knownHigherEpoch` in the code), then txn recovery was not kicked off but the pusher still could not perform the PUSH_TIMESTAMP (see e40c1b4), so it would return a `TransactionPushError`. This confused logic in `handleTransactionPushError`, allowing the error to escape to the client. This commit resolves both issues by considering the pushee's transaction status in `txnwait.ShouldPushImmediately`. Release note: None

Informs cockroachdb#105330. This commit adds a new unit test named `TestTxnRecoveryFromStagingWithoutHighPriority`, which reproduces the error described in cockroachdb#105330. Release note: None

Fixes cockroachdb#105330. Fixes cockroachdb#101721. This commit updates the transaction push logic to stop pushes from completing successfully when the pushee is STAGING and the pusher has equal priority. Prior to this change, the pusher would win in these cases when using a PUSH_TIMESTAMP if at least one of the two transactions involved used a weak isolation level. This had two undesirable effects: - if the pushee was still possibly committable and requiring recovery (`!(knownHigherTimestamp || knownHigherEpoch)` in the code) then the pusher would immediately begin parallel commit recovery, attempting to disrupt the commit and abort the pushee. This is undesirable because the pushee may still be in progress and in cases of equal priority, we'd like to wait for the parallel commit to complete before kicking off recovery, deferring recovery to only the cases where the pushee/committers's heartbeat has expired. - if the pushee was known too be uncommittable (`knownHigherTimestamp || knownHigherEpoch` in the code), then txn recovery was not kicked off but the pusher still could not perform the PUSH_TIMESTAMP (see e40c1b4), so it would return a `TransactionPushError`. This confused logic in `handleTransactionPushError`, allowing the error to escape to the client. This commit resolves both issues by considering the pushee's transaction status in `txnwait.ShouldPushImmediately`. Release note: None

107882: kv: don't let pusher win when pushing STAGING txn with equal priority r=miraradeva a=nvanbenschoten Fixes #105330. Fixes #101721. This commit updates the transaction push logic to stop pushes from completing successfully when the pushee is STAGING and the pusher has equal priority. Prior to this change, the pusher would win in these cases when using a PUSH_TIMESTAMP if at least one of the two transactions involved used a weak isolation level. This had two undesirable effects: - if the pushee was still possibly committable and requiring recovery (`!(knownHigherTimestamp || knownHigherEpoch)` in the code) then the pusher would immediately begin parallel commit recovery, attempting to disrupt the commit and abort the pushee. This is undesirable because the pushee may still be in progress and in cases of equal priority, we'd like to wait for the parallel commit to complete before kicking off recovery, deferring recovery to only the cases where the pushee/committers's heartbeat has expired. - if the pushee was known to be uncommittable (`knownHigherTimestamp || knownHigherEpoch` in the code), then txn recovery was not kicked off but the pusher still could not perform the PUSH_TIMESTAMP (see e40c1b4), so it would return a `TransactionPushError`. This confused logic in `handleTransactionPushError`, allowing the error to escape to the client. This commit resolves both issues by considering the pushee's transaction status in `txnwait.ShouldPushImmediately`. Release note: None 107962: docs: do not generate docs for crdb_internal functions r=knz a=rafiss Epic: None Release note: None 108062: server: Add rangefeed metrics r=miretskiy a=miretskiy Add rangefeed related metrics to keep track of the number of actively running rangefeeds on the server. Epic: CRDB-26372 Release note: None 108071: stmtdiagnostics: use background context when building the bundle r=yuzefovich a=yuzefovich When the context is canceled, we still want to build the bundle as best as possible. Over in 532274b we introduced the usage of the background context in order to insert the bundle into the system tables, but we still built the bundle with the canceled context. This commit fixes that oversight - in particular, we should now get `env.sql` correctly. Informs: https://github.com/cockroachlabs/support/issues/2494. Epic: None Release note: None 108085: builtins: improve docs and refactor code for unordered_unique_id r=rafiss,rharding6373 a=andyyang890 **builtins: improve documentation for unordered_unique_rowid** This patch clarifies that `unordered_unique_rowid` generates unique IDs that are statistically likely to be unordered because it bit-reverses the insert timestamp for use as the ID's timestamp bit segment. Release note: None ---- **builtinconstants: define constants for unique int bit segments** This patch defines constants for the sizes and bitmasks for each bit segment in a unique int generated for row IDs. Release note: None ---- **builtins: refactor bit shifting logic in mapToUnorderedUniqueInt** This patch refactors the bit shifting logic in `mapToUnorderedUniqueInt` to use constants instead of magic numbers. Release note: None ---- Epic: None Co-authored-by: Nathan VanBenschoten <[email protected]> Co-authored-by: Rafi Shamim <[email protected]> Co-authored-by: Yevgeniy Miretskiy <[email protected]> Co-authored-by: Yahor Yuzefovich <[email protected]> Co-authored-by: Andy Yang <[email protected]>

tbg added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. T-kv KV Team labels Jun 22, 2023

tbg assigned arulajmani Jun 22, 2023

tbg mentioned this issue Jun 22, 2023

kvnemesis,kvserver: add coverage for illegal lease applied index and reproposal errors #105272

Merged

tbg mentioned this issue Jun 27, 2023

kv/kvnemesis: TestKVNemesisMultiNode failed #104476

Closed

nvanbenschoten assigned nvanbenschoten and unassigned arulajmani Jun 29, 2023

nvanbenschoten changed the title ~~kvserver: unhandled TransactionPushError from db.Get~~ kvserver: unhandled TransactionPushError when push finds STAGING but known-uncommitted Read Committed transaction Jul 13, 2023

tbg mentioned this issue Jul 17, 2023

cli: unskip test_demo_partitioning interactive test #106722

Merged

lunevalex added the db-cy-23 label Jul 27, 2023

nvanbenschoten mentioned this issue Jul 31, 2023

kv: don't let pusher win when pushing STAGING txn with equal priority #107882

Merged

craig bot closed this as completed in 90516db Aug 3, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

kvserver: unhandled TransactionPushError when push finds STAGING but known-uncommitted Read Committed transaction #105330

kvserver: unhandled TransactionPushError when push finds STAGING but known-uncommitted Read Committed transaction #105330

tbg commented Jun 22, 2023 •

edited by cockroach-jira-scripts

Loading

tbg commented Jun 27, 2023

nvanbenschoten commented Jul 5, 2023

kvserver: unhandled TransactionPushError when push finds STAGING but known-uncommitted Read Committed transaction #105330

kvserver: unhandled TransactionPushError when push finds STAGING but known-uncommitted Read Committed transaction #105330

Comments

tbg commented Jun 22, 2023 • edited by cockroach-jira-scripts Loading

tbg commented Jun 27, 2023

nvanbenschoten commented Jul 5, 2023

tbg commented Jun 22, 2023 •

edited by cockroach-jira-scripts

Loading