-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
kvserver: unhandled TransactionPushError when push finds STAGING but known-uncommitted Read Committed transaction #105330
Comments
We're frequently getting a TransactionPushError back from `db.{Get,Scan,...}`. Accept this as "ok" while we investigate the failure, to avoid delaying merging the kvnemesis updates. cockroachdb#105330
We're frequently getting a TransactionPushError back from `db.{Get,Scan,...}`. Accept this as "ok" while we investigate the failure, to avoid delaying merging the kvnemesis updates. cockroachdb#105330
We're frequently getting a TransactionPushError back from `db.{Get,Scan,...}`. Accept this as "ok" while we investigate the failure, to avoid delaying merging the kvnemesis updates. cockroachdb#105330
I am merging d7ae4d9 now, which works around this issue - please revert when this issue is addressed. |
In the test, we see a pusher request (always non-txn'al, but it's not clear that this matters) issue a cockroach/pkg/kv/kvserver/batcheval/cmd_push_txn.go Lines 216 to 244 in 972c379
So we promote the PUSH_TIMESTAMP (which would be allowed) into a PUSH_ABORT (which is not), so command evaluation throws a cockroach/pkg/kv/kvserver/replica_send.go Lines 769 to 780 in 972c379
Since we once again determine that we should be able to perform the PUSH_TIMESTAMP (this code doesn't know about the PUSH_TIMESTAMP -> PUSH_ABORT promotion), we throw the error up to the caller. |
…iority DNM: needs more testing. Fixes cockroachdb#105330. Fixes cockroachdb#101721. This commit updates the transaction push logic to stop pushes from completing successfully when the pushee is STAGING and the pusher has equal priority. Prior to this change, the pusher would win in these cases when using a PUSH_TIMESTAMP if at least one of the two transactions involved used a weak isolation level. This had two undesirable effects: - if the pushee was still possibly committable and requiring recovery (`!(knownHigherTimestamp || knownHigherEpoch)` in the code) then the pusher would immediately begin parallel commit recovery, attempting to disrupt the commit and abort the pushee. This is undesirable because the pushee may still be in progress and in cases of equal priority, we'd like to wait for the parallel commit to complete before kicking off recovery, deferring recovery to only the cases where the pushee/committers's heartbeat has expired. - if the pushee was known too be uncommittable (`knownHigherTimestamp || knownHigherEpoch` in the code), then txn recovery was not kicked off but the pusher still could not perform the PUSH_TIMESTAMP (see e40c1b4), so it would return a `TransactionPushError`. This confused logic in `handleTransactionPushError`, allowing the error to escape to the client. This commit resolves both issues by considering the pushee's transaction status in `txnwait.ShouldPushImmediately`. Release note: None
Informs cockroachdb#105330. This commit adds a new unit test named `TestTxnRecoveryFromStagingWithoutHighPriority`, which reproduces the error described in cockroachdb#105330. Release note: None
Fixes cockroachdb#105330. Fixes cockroachdb#101721. This commit updates the transaction push logic to stop pushes from completing successfully when the pushee is STAGING and the pusher has equal priority. Prior to this change, the pusher would win in these cases when using a PUSH_TIMESTAMP if at least one of the two transactions involved used a weak isolation level. This had two undesirable effects: - if the pushee was still possibly committable and requiring recovery (`!(knownHigherTimestamp || knownHigherEpoch)` in the code) then the pusher would immediately begin parallel commit recovery, attempting to disrupt the commit and abort the pushee. This is undesirable because the pushee may still be in progress and in cases of equal priority, we'd like to wait for the parallel commit to complete before kicking off recovery, deferring recovery to only the cases where the pushee/committers's heartbeat has expired. - if the pushee was known too be uncommittable (`knownHigherTimestamp || knownHigherEpoch` in the code), then txn recovery was not kicked off but the pusher still could not perform the PUSH_TIMESTAMP (see e40c1b4), so it would return a `TransactionPushError`. This confused logic in `handleTransactionPushError`, allowing the error to escape to the client. This commit resolves both issues by considering the pushee's transaction status in `txnwait.ShouldPushImmediately`. Release note: None
107882: kv: don't let pusher win when pushing STAGING txn with equal priority r=miraradeva a=nvanbenschoten Fixes #105330. Fixes #101721. This commit updates the transaction push logic to stop pushes from completing successfully when the pushee is STAGING and the pusher has equal priority. Prior to this change, the pusher would win in these cases when using a PUSH_TIMESTAMP if at least one of the two transactions involved used a weak isolation level. This had two undesirable effects: - if the pushee was still possibly committable and requiring recovery (`!(knownHigherTimestamp || knownHigherEpoch)` in the code) then the pusher would immediately begin parallel commit recovery, attempting to disrupt the commit and abort the pushee. This is undesirable because the pushee may still be in progress and in cases of equal priority, we'd like to wait for the parallel commit to complete before kicking off recovery, deferring recovery to only the cases where the pushee/committers's heartbeat has expired. - if the pushee was known to be uncommittable (`knownHigherTimestamp || knownHigherEpoch` in the code), then txn recovery was not kicked off but the pusher still could not perform the PUSH_TIMESTAMP (see e40c1b4), so it would return a `TransactionPushError`. This confused logic in `handleTransactionPushError`, allowing the error to escape to the client. This commit resolves both issues by considering the pushee's transaction status in `txnwait.ShouldPushImmediately`. Release note: None 107962: docs: do not generate docs for crdb_internal functions r=knz a=rafiss Epic: None Release note: None 108062: server: Add rangefeed metrics r=miretskiy a=miretskiy Add rangefeed related metrics to keep track of the number of actively running rangefeeds on the server. Epic: CRDB-26372 Release note: None 108071: stmtdiagnostics: use background context when building the bundle r=yuzefovich a=yuzefovich When the context is canceled, we still want to build the bundle as best as possible. Over in 532274b we introduced the usage of the background context in order to insert the bundle into the system tables, but we still built the bundle with the canceled context. This commit fixes that oversight - in particular, we should now get `env.sql` correctly. Informs: https://github.com/cockroachlabs/support/issues/2494. Epic: None Release note: None 108085: builtins: improve docs and refactor code for unordered_unique_id r=rafiss,rharding6373 a=andyyang890 **builtins: improve documentation for unordered_unique_rowid** This patch clarifies that `unordered_unique_rowid` generates unique IDs that are statistically likely to be unordered because it bit-reverses the insert timestamp for use as the ID's timestamp bit segment. Release note: None ---- **builtinconstants: define constants for unique int bit segments** This patch defines constants for the sizes and bitmasks for each bit segment in a unique int generated for row IDs. Release note: None ---- **builtins: refactor bit shifting logic in mapToUnorderedUniqueInt** This patch refactors the bit shifting logic in `mapToUnorderedUniqueInt` to use constants instead of magic numbers. Release note: None ---- Epic: None Co-authored-by: Nathan VanBenschoten <[email protected]> Co-authored-by: Rafi Shamim <[email protected]> Co-authored-by: Yevgeniy Miretskiy <[email protected]> Co-authored-by: Yahor Yuzefovich <[email protected]> Co-authored-by: Andy Yang <[email protected]>
Describe the problem
In #97779 we're introducing reproposals and reproposal errors in kvnemesis.
For some reason, under stress, kvnemesis will now sometimes find a
*TransactionPushError
bubble up fromdb.Get
anddb.Scan
(and likely other methods), which should not be possible.@arulajmani is looking at this in https://cockroachlabs.slack.com/archives/G01G8LK77DK/p1687356168950489.
Jira issue: CRDB-28983
The text was updated successfully, but these errors were encountered: