-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
kv: properly handle ambiguous errors that occur alongside commits #103817
Labels
A-kv-transactions
Relating to MVCC and the transactional model.
C-bug
Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
O-support
Would prevent or help troubleshoot a customer escalation - bugs, missing observability/tooling, docs
T-kv
KV Team
Comments
AlexTalks
added
C-bug
Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
A-kv-transactions
Relating to MVCC and the transactional model.
T-kv
KV Team
labels
May 24, 2023
AlexTalks
changed the title
kv: properly handle ambiguous errors that occur on commits
kv: properly handle ambiguous errors that occur alongside commits
May 24, 2023
AlexTalks
added a commit
to AlexTalks/cockroach
that referenced
this issue
Jul 14, 2023
WIP: This intends to raise an error on a KV batch that has an ambiguous error on one of its writes, and also has an `EndTxn` request in the batch. Part of: cockroachdb#103817 Release note (bug fix): TBD
shralex
added
the
O-support
Would prevent or help troubleshoot a customer escalation - bugs, missing observability/tooling, docs
label
Jul 18, 2023
AlexTalks
added a commit
to AlexTalks/cockroach
that referenced
this issue
Jul 22, 2023
This adds a unit test to reproduce the behavior described in cockroachdb#103817 and seen in cockroachdb#67765, which currently is a bug in our implementation of the parallel commit protocol that results in the assertion failure known as `transaction unexpectedly committed`. The test currently validates the incorrect behavior of the known issue, though it is inded to be used to validate the potential fixes as proposed in cockroachdb#103817. Release note: None Part of: cockroachdb#103817
AlexTalks
added a commit
to AlexTalks/cockroach
that referenced
this issue
Jul 26, 2023
As cockroachdb#103817 is now a known issue for which a root cause has been identified, the sanity check assertion in place here is disabled, with errors occuring due to the known bug properly linking to the issue ticket. While the transaction coordinator still performs sanity checks to ensure that operations cannot continue on an already-committed transaction, it will no longer log the error with severity `FATAL`, resulting in node crash, and will instead simply return the error. Part of: cockroachdb#103817 Release note: None
AlexTalks
added a commit
to AlexTalks/cockroach
that referenced
this issue
Jul 26, 2023
This adds a unit test to reproduce the behavior described in cockroachdb#103817 and seen in cockroachdb#67765, which currently is a bug in our implementation of the parallel commit protocol that results in the assertion failure known as `transaction unexpectedly committed`. The test currently validates the incorrect behavior of the known issue, though it is inded to be used to validate the potential fixes as proposed in cockroachdb#103817. Release note: None Part of: cockroachdb#103817
AlexTalks
added a commit
to AlexTalks/cockroach
that referenced
this issue
Jul 26, 2023
Description TBD. Depends on cockroachdb#107596. Fixes: cockroachdb#103817 Release note (bug fix): TBD
craig bot
pushed a commit
that referenced
this issue
Jul 26, 2023
107596: kvcoord: disable fatal assertion on transaction commit sanity check r=AlexTalks a=AlexTalks As #103817 is now a known issue for which a root cause has been identified, the sanity check assertion in place here is disabled, with errors occuring due to the known bug properly linking to the issue ticket. While the transaction coordinator still performs sanity checks to ensure that operations cannot continue on an already-committed transaction, it will no longer log the error with severity `FATAL`, resulting in node crash, and will instead simply return the error. Part of: #103817 Release note: None Co-authored-by: Alex Sarkesian <[email protected]>
blathers-crl bot
pushed a commit
that referenced
this issue
Jul 26, 2023
As #103817 is now a known issue for which a root cause has been identified, the sanity check assertion in place here is disabled, with errors occuring due to the known bug properly linking to the issue ticket. While the transaction coordinator still performs sanity checks to ensure that operations cannot continue on an already-committed transaction, it will no longer log the error with severity `FATAL`, resulting in node crash, and will instead simply return the error. Part of: #103817 Release note: None
AlexTalks
added a commit
to AlexTalks/cockroach
that referenced
this issue
Jul 26, 2023
!DNM! This is intended to test the behavior of enabling replay protection for all writes, such that they will fail if an idempotent write replay is performed with a different timestamp at the same epoch and sequence for a transaction. Depends on cockroachdb#107596. Fixes: cockroachdb#103817 Release note (bug fix): TBD
AlexTalks
added a commit
to AlexTalks/cockroach
that referenced
this issue
Jul 26, 2023
This adds a unit test to reproduce the behavior described in cockroachdb#103817 and seen in cockroachdb#67765, which currently is a bug in our implementation of the parallel commit protocol that results in the assertion failure known as `transaction unexpectedly committed`. The test currently validates the incorrect behavior of the known issue, though it is inded to be used to validate the potential fixes as proposed in cockroachdb#103817. Release note: None Part of: cockroachdb#103817
AlexTalks
added a commit
to AlexTalks/cockroach
that referenced
this issue
Jul 27, 2023
This change introduces `MVCCWriteOptions`, a structure for bundling parameters for `MVCCPut`, `MVCCDelete`, and their many variants, and refactors usages of these functions across the code base in order to move the existing function arguments into this structure. In addition to allowing the code to eliminate specifying default values in many callers, this enables the ability to pass new flags to write operations such as the replay protection needed to address cockroachdb#103817. Part of: cockroachdb#103817 Release note: None
AlexTalks
added a commit
to AlexTalks/cockroach
that referenced
this issue
Oct 5, 2023
In cockroachdb#107323, testing for the ambiguous write case that leads to the "transaction unexpectedly committed" bug were introduced, however to increase test coverage of the fix, multiple schedules of operations need to be tested. This change simply refactors the framework of the existing test in order to enable the addition of muliple subtests. The subtests are included in a separate patch. Part of: cockroachdb#103817 Release note: None
AlexTalks
added a commit
to AlexTalks/cockroach
that referenced
this issue
Oct 5, 2023
This change builds on the testing introduced in cockroachdb#107323 with additional subtests evaluating the behavior of different race outcomes with contended transactions where the first transaction experiences an RPC failure (i.e. an ambiguous write). Part of: cockroachdb#103817 Release note: None
AlexTalks
added a commit
to AlexTalks/cockroach
that referenced
this issue
Oct 5, 2023
While previously, RPC failures were assumed to be retryable, as write operations (with the notable exception of `EndTxn`) were assumed to be idempotent, it has been seen in cockroachdb#67765 and documented in cockroachdb#103817 that RPC failures on write operations that occur in parallel with a commit (i.e. a partial batch where `withCommit==true`), it is not always possible to assume idempotency and retry the "ambiguous" writes. This is due to the fact that the retried write RPC could result in the transaction's `WriteTimestamp` being bumped, changing the commit timestamp of the transaction that may in fact already be implicitly committed if the initial "ambiguous" write actually succeeded. This change modifies the protocol of the DistSender to flag in subsequent retries that a batch with a commit has previously experienced ambiguity, as well as the handling of the retried write in the MVCC layer to detect this previous ambiguity and reject retries that change the write timestamp as a non-idempotent replay. The flag allows subsequent retries to "remember" the earlier ambiguous write and evaluate accordingly. The flag allows us to properly handle RPC failures (i.e. ambiguous writes) that occur on commit, as a transaction that is implicitly committed is eligible to be marked as explicitly committed by contending transactions via the `RecoverTxn` operation, resulting in a race between retries by the transaction coordinator and recovery by contending transactions that could result in either incorrectly reporting a transaction as having failed with a `RETRY_SERIALIZABLE` error (despite the possibility that it already was or could be recovered and successfully committed), or in attempting to explicitly commit an already-recovered and committed transaction, resulting in seeing an assertion failure due to `transaction unexpectedly committed`. The replay protection introduced here allows us to avoid both of these situations by detecting a replay that should be considered non-idempotent and returning an error, causing the original RPC error remembered by the DistSender to be propagated as an `AmbiguousResultError`. As such, this can be handled by application code by validating the success/failure of a transaction when receiving this error. Depends on cockroachdb#107680, cockroachdb#107323, cockroachdb#108154, cockroachdb#108001 Fixes: cockroachdb#103817 Release note (bug fix): Properly handles RPC failures on writes using the parallel commit protocol that execute in parallel to the commit operation, avoiding incorrect retryable failures and `transaction unexpectedly committed` assertions by detecting when writes cannot be retried idempotently, instead returning an `AmbiguousResultError`.
AlexTalks
added a commit
to AlexTalks/cockroach
that referenced
this issue
Oct 6, 2023
This adds a unit test to reproduce the behavior described in cockroachdb#103817 and seen in cockroachdb#67765, which currently is a bug in our implementation of the parallel commit protocol that results in the assertion failure known as `transaction unexpectedly committed`. The test currently validates the incorrect behavior of the known issue, though it is inded to be used to validate the potential fixes as proposed in cockroachdb#103817. Release note: None Part of: cockroachdb#103817
AlexTalks
added a commit
to AlexTalks/cockroach
that referenced
this issue
Oct 6, 2023
In cockroachdb#107323, testing for the ambiguous write case that leads to the "transaction unexpectedly committed" bug were introduced, however to increase test coverage of the fix, multiple schedules of operations need to be tested. This change simply refactors the framework of the existing test in order to enable the addition of muliple subtests. The subtests are included in a separate patch. Part of: cockroachdb#103817 Release note: None
AlexTalks
added a commit
to AlexTalks/cockroach
that referenced
this issue
Oct 6, 2023
This change builds on the testing introduced in cockroachdb#107323 with additional subtests evaluating the behavior of different race outcomes with contended transactions where the first transaction experiences an RPC failure (i.e. an ambiguous write). Part of: cockroachdb#103817 Release note: None
AlexTalks
added a commit
to AlexTalks/cockroach
that referenced
this issue
Oct 6, 2023
This change builds upon the tests added to validate cockroachdb#107658, in this case exercising the code path that would occur if the client did not send this flag in a `BatchRequest`, or the server did not respect it. It mocks a "mixed version" cluster by simply dropping the flag in transport, and tests that the previous result - i.e. a `TransactionStatusError` with `REASON_TXN_COMMITTED` - is returned, without any other side effects or unexpected behavior. Part of: cockroachdb#103817 Release note: None
AlexTalks
added a commit
to AlexTalks/cockroach
that referenced
this issue
Oct 6, 2023
While previously, RPC failures were assumed to be retryable, as write operations (with the notable exception of `EndTxn`) were assumed to be idempotent, it has been seen in cockroachdb#67765 and documented in cockroachdb#103817 that RPC failures on write operations that occur in parallel with a commit (i.e. a partial batch where `withCommit==true`), it is not always possible to assume idempotency and retry the "ambiguous" writes. This is due to the fact that the retried write RPC could result in the transaction's `WriteTimestamp` being bumped, changing the commit timestamp of the transaction that may in fact already be implicitly committed if the initial "ambiguous" write actually succeeded. This change modifies the protocol of the DistSender to flag in subsequent retries that a batch with a commit has previously experienced ambiguity, as well as the handling of the retried write in the MVCC layer to detect this previous ambiguity and reject retries that change the write timestamp as a non-idempotent replay. The flag allows subsequent retries to "remember" the earlier ambiguous write and evaluate accordingly. The flag allows us to properly handle RPC failures (i.e. ambiguous writes) that occur on commit, as a transaction that is implicitly committed is eligible to be marked as explicitly committed by contending transactions via the `RecoverTxn` operation, resulting in a race between retries by the transaction coordinator and recovery by contending transactions that could result in either incorrectly reporting a transaction as having failed with a `RETRY_SERIALIZABLE` error (despite the possibility that it already was or could be recovered and successfully committed), or in attempting to explicitly commit an already-recovered and committed transaction, resulting in seeing an assertion failure due to `transaction unexpectedly committed`. The replay protection introduced here allows us to avoid both of these situations by detecting a replay that should be considered non-idempotent and returning an error, causing the original RPC error remembered by the DistSender to be propagated as an `AmbiguousResultError`. As such, this can be handled by application code by validating the success/failure of a transaction when receiving this error. Depends on cockroachdb#107680, cockroachdb#107323, cockroachdb#108154, cockroachdb#108001 Fixes: cockroachdb#103817 Release note (bug fix): Properly handles RPC failures on writes using the parallel commit protocol that execute in parallel to the commit operation, avoiding incorrect retryable failures and `transaction unexpectedly committed` assertions by detecting when writes cannot be retried idempotently, instead returning an `AmbiguousResultError`.
AlexTalks
added a commit
to AlexTalks/cockroach
that referenced
this issue
Oct 11, 2023
This change builds upon the tests added to validate cockroachdb#107658, in this case exercising the code path that would occur if the client did not send this flag in a `BatchRequest`, or the server did not respect it. It mocks a "mixed version" cluster by simply dropping the flag in transport, and tests that the previous result - i.e. a `TransactionStatusError` with `REASON_TXN_COMMITTED` - is returned, without any other side effects or unexpected behavior. Part of: cockroachdb#103817 Release note: None
craig bot
pushed a commit
that referenced
this issue
Oct 11, 2023
111945: kvcoord: add "mixed version" test for `AmbiguousReplayProtection` flag r=AlexTalks a=AlexTalks This change builds upon the tests added to validate #107658, in this case exercising the code path that would occur if the client did not send this flag in a `BatchRequest`, or the server did not respect it. It mocks a "mixed version" cluster by simply dropping the flag in transport, and tests that the previous result - i.e. a `TransactionStatusError` with `REASON_TXN_COMMITTED` - is returned, without any other side effects or unexpected behavior. Part of: #103817 Release note: None Co-authored-by: Alex Sarkesian <[email protected]>
blathers-crl bot
pushed a commit
that referenced
this issue
Oct 11, 2023
This change builds upon the tests added to validate #107658, in this case exercising the code path that would occur if the client did not send this flag in a `BatchRequest`, or the server did not respect it. It mocks a "mixed version" cluster by simply dropping the flag in transport, and tests that the previous result - i.e. a `TransactionStatusError` with `REASON_TXN_COMMITTED` - is returned, without any other side effects or unexpected behavior. Part of: #103817 Release note: None
Affects this customer: https://cockroachdb.zendesk.com/agent/tickets/19344 |
1 similar comment
Affects this customer: https://cockroachdb.zendesk.com/agent/tickets/19344 |
nvanbenschoten
pushed a commit
to nvanbenschoten/cockroach
that referenced
this issue
Nov 22, 2023
As cockroachdb#103817 is now a known issue for which a root cause has been identified, the sanity check assertion in place here is disabled, with errors occuring due to the known bug properly linking to the issue ticket. While the transaction coordinator still performs sanity checks to ensure that operations cannot continue on an already-committed transaction, it will no longer log the error with severity `FATAL`, resulting in node crash, and will instead simply return the error. Part of: cockroachdb#103817 Release note: None
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Labels
A-kv-transactions
Relating to MVCC and the transactional model.
C-bug
Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
O-support
Would prevent or help troubleshoot a customer escalation - bugs, missing observability/tooling, docs
T-kv
KV Team
As observed in the wild (#67765), in autonomous simulation testing (internal link), and in our own reproductions (#103123, #103756), we have several bugs relating to the way we handle ambiguous errors in KV batches that commit, potentially manifesting as
transaction unexpectedly committed
errors (which will cause a node crash), or as transactions that are incorrectly reported to the client as failed when they may actually be end up being explicitly committed.These ambiguous errors - i.e., a network error on one of a transaction's
BatchRequest
RPCs that cannot be determined to if it has succeeded or failed - are inherent to any distributed system, but need to be handled properly in order to avoid issues with our implementation of the parallel commit protocol. When one of these errors occurs in theDistSender
on a batch that contains anEndTxn
request, we currently attempt to elide the error like so:cockroach/pkg/kv/kvclient/kvcoord/dist_sender.go
Lines 2235 to 2237 in 775b005
While we attempt to reissue (potentially: idempotently replay) any failed
BatchRequest
s, sending them to subsequent replicas in the replica set in an attempt to sidestep the ambiguous error, this can run into problems:STAGING
) despite the ambiguous error.RecoverTxn
request.EndTxn
request at a higher timestamp (transaction unexpectedly committed
) or even fail on the reissued request, reporting a transaction failure to the client for a transaction that was ultimately committed.This should be fixed for both correctness in the near term, and improvement in the longer term. A discussed approach is as follows:
Near Term:
As a quicker fix, we should avoid attempting to sidestep the ambiguous errors on batches with commits in the
DistSender
and instead propagate them asAmbiguousResultError
s.AmbiguousResultError
s, and should be tested using the data-driven test reproductions linked above.Medium Term:
We should be able to avoid raising all RPC failures as
AmbiguousResultError
s, and minimize the cases to only those necessary by "remembering" the earlier ambiguity in retrying our failed writes and rejecting them as non-idempotent if they change the timestamp of a transaction that may already be implicitly committed.Longer Term:
See #108342.
Jira issue: CRDB-28217
The text was updated successfully, but these errors were encountered: