kv: prevent STAGING -> PENDING transition during high-priority push #62761

nvanbenschoten · 2021-03-29T22:02:27Z

Fixes #61992.
Fixes #62064.

This commit fixes a bug uncovered recently (for less than obvious reasons) in cdc roachtests where a STAGING transaction could have its transaction record moved back to a PENDING state without changing epochs but after its timestamp was bumped. This could result in concurrent transaction recovery attempts returning `programming error: cannot recover PENDING transaction in same epoch` errors, because such a state transition was not expected to be possible by transaction recovery. However, as we found in #61992, this has actually been possible since 01bc20e.

This commit fixes the bug by detecting cases where a pusher knows of a failed parallel commit and selectively upgrading PUSH_TIMESTAMP push attempts to PUSH_ABORTs. This has no effect on pushes that fail with a TransactionPushError. Such pushes will still wait on the pushee to retry its commit and eventually commit or abort. It also has no effect on expired pushees, as they would have been aborted anyway. This only impacts pushes which would have succeeded due to priority mismatches. In these cases, the push acts the same as a short-circuited transaction recovery process, because the transaction recovery procedure always finalizes target transactions, even if initiated by a PUSH_TIMESTAMP.
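The upgrade rule described above can be sketched as a small decision function. This is only an illustration under stated assumptions: the types `PushType` and `TxnStatus`, the function `effectivePushType`, and its boolean parameters are hypothetical stand-ins, not the identifiers used in the CockroachDB code, and the sketch does not model how the pusher learns of the failed parallel commit or how expired pushees are handled.

```go
package main

import "fmt"

// PushType is a simplified, hypothetical stand-in for the kind of push requested.
type PushType int

const (
	PushTimestamp PushType = iota
	PushAbort
)

// String makes the push type readable when printed.
func (p PushType) String() string {
	switch p {
	case PushTimestamp:
		return "PUSH_TIMESTAMP"
	case PushAbort:
		return "PUSH_ABORT"
	default:
		return "UNKNOWN"
	}
}

// TxnStatus is a simplified, hypothetical stand-in for the pushee's record status.
type TxnStatus int

const (
	Pending TxnStatus = iota
	Staging
	Committed
	Aborted
)

// effectivePushType returns the push type to evaluate. A PUSH_TIMESTAMP is
// upgraded to a PUSH_ABORT only when all of the following hold:
//   - the pushee is STAGING,
//   - the pusher knows the pushee's parallel commit failed, and
//   - the push would succeed because the pusher has the higher priority.
//
// In every other case the requested push type is returned unchanged, so a
// push that would fail still fails and waits on the pushee as before.
func effectivePushType(requested PushType, pushee TxnStatus, knowsFailedParallelCommit, pusherWinsOnPriority bool) PushType {
	if requested == PushTimestamp &&
		pushee == Staging &&
		knowsFailedParallelCommit &&
		pusherWinsOnPriority {
		return PushAbort
	}
	return requested
}

func main() {
	// High-priority timestamp push against a STAGING pushee with a known
	// failed parallel commit: upgraded, so the record never returns to PENDING.
	fmt.Println(effectivePushType(PushTimestamp, Staging, true, true))

	// Same pushee, but the pusher would not win on priority: unchanged.
	fmt.Println(effectivePushType(PushTimestamp, Staging, true, false))
}
```

The point of the sketch is that the upgrade is narrow: it applies only to pushes that would otherwise have moved a STAGING record back to PENDING in the same epoch.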

This seems very rare in practice, as it requires a few specific interactions to line up just right, including:

- a STAGING transaction that has one of its in-flight intent writes bumped
- a rangefeed processor listening to that intent write
- a separate request that conflicts with a different intent
- a STAGING transaction which expires to allow transaction recovery
- a rangefeed processor push between the time of the request push and the request recovery

Still, this fix is well contained, so I think we should backport it to all of the release branches. However, since this issue does seem rare and also cannot cause corruption or atomicity violations, I wanted to be conservative with the backport, so I'm going to let this bake on master + release-21.1 for a few weeks before merging the backport.

Release notes (bug fix): an improper interaction between conflicting transactions which could result in spurious `cannot recover PENDING transaction in same epoch` errors was fixed.

erikgrinaker

Good catch

Reviewed 4 of 4 files at r1.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @nvanbenschoten and @tbg)

pkg/kv/kvserver/txn_recovery_integration_test.go, line 244 at r1 (raw file):

		h := roachpb.Header{Txn: txn}
		_, pErr := kv.SendWrappedWith(ctx, store.TestSender(), h, &pArgs)
		require.Nil(t, pErr)

nit: would be helpful to include the actual error in the output, i.e. require.Nil(t, pErr, "error: %s", pErr). That goes for all of these assertions.

tbg

This seems very rare in practice, as it requires a few specific
interactions to line up just right, including:

You don't "need" the rangefeed push though, right? The general mechanism is:

- lay down staging, implicitly-not-committed txn record that hasn't been heartbeat in a while
- timestamp push comes along and moves it back to pending

Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @erikgrinaker)

pkg/kv/kvserver/txn_recovery_integration_test.go, line 244 at r1 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

nit: would be helpful to include the actual error in the output, i.e. require.Nil(t, pErr, "error: %s", pErr). That goes for all of these assertions.

This is already automatic, so there shouldn't be a need to add it manually.

erikgrinaker

Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @tbg)

pkg/kv/kvserver/txn_recovery_integration_test.go, line 244 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

This is already automatic, so there shouldn't be a need to add it manually.

True, but it'll usually be a struct dump which isn't always very readable. Anyway, not important.

nvanbenschoten

You don't "need" the rangefeed push though, right? The general mechanism is:

- lay down staging, implicitly-not-committed txn record that hasn't been heartbeat in a while
- timestamp push comes along and moves it back to pending

You're right that the rangefeed isn't strictly necessary, but what is necessary is a high-priority push. Rangefeed is the most common source of these.

Thanks for the reviews!

bors r+

Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @erikgrinaker)

pkg/kv/kvserver/txn_recovery_integration_test.go, line 244 at r1 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

True, but it'll usually be a struct dump which isn't always very readable. Anyway, not important.

Done.

craig · 2021-03-30T16:31:07Z

Build succeeded:

GitHub CI (Cockroach)

nvanbenschoten requested review from erikgrinaker and tbg March 29, 2021 22:02

erikgrinaker approved these changes Mar 30, 2021

tbg approved these changes Mar 30, 2021

erikgrinaker approved these changes Mar 30, 2021

nvanbenschoten force-pushed the nvanbenschoten/txnRecoveryFix branch from 90fd58a to e40c1b4 Compare March 30, 2021 14:11

nvanbenschoten commented Mar 30, 2021

craig bot merged commit ec503c0 into cockroachdb:master Mar 30, 2021

nvanbenschoten mentioned this pull request Mar 30, 2021

release-21.1: kv: prevent STAGING -> PENDING transition during high-priority push #62810

Merged

nvanbenschoten deleted the nvanbenschoten/txnRecoveryFix branch March 30, 2021 19:00
