Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kv: properly handle ambiguous errors that occur alongside commits #103817

Closed
AlexTalks opened this issue May 24, 2023 · 3 comments · Fixed by #107658
Closed

kv: properly handle ambiguous errors that occur alongside commits #103817

AlexTalks opened this issue May 24, 2023 · 3 comments · Fixed by #107658
Assignees
Labels
A-kv-transactions Relating to MVCC and the transactional model. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. O-support Would prevent or help troubleshoot a customer escalation - bugs, missing observability/tooling, docs T-kv KV Team

Comments

@AlexTalks
Copy link
Contributor

AlexTalks commented May 24, 2023

As observed in the wild (#67765), in autonomous simulation testing (internal link), and in our own reproductions (#103123, #103756), we have several bugs relating to the way we handle ambiguous errors in KV batches that commit, potentially manifesting as transaction unexpectedly committed errors (which will cause a node crash), or as transactions that are incorrectly reported to the client as failed when they may actually be end up being explicitly committed.

These ambiguous errors - i.e., a network error on one of a transaction's BatchRequest RPCs that cannot be determined to if it has succeeded or failed - are inherent to any distributed system, but need to be handled properly in order to avoid issues with our implementation of the parallel commit protocol. When one of these errors occurs in the DistSender on a batch that contains an EndTxn request, we currently attempt to elide the error like so:

if withCommit && !grpcutil.RequestDidNotStart(err) {
ambiguousError = err
}

While we attempt to reissue (potentially: idempotently replay) any failed BatchRequests, sending them to subsequent replicas in the replica set in an attempt to sidestep the ambiguous error, this can run into problems:

  • If the transaction actually did succeed in all of its writes, the transaction would be implicitly committed (and in transaction state STAGING) despite the ambiguous error.
  • In this state, some other operation encountering the transaction record (e.g. contending on keys) may explicitly commit it via a RecoverTxn request.
  • Since the transaction coordinator is unaware of this recovery, it may result in a duplicate EndTxn request at a higher timestamp (transaction unexpectedly committed) or even fail on the reissued request, reporting a transaction failure to the client for a transaction that was ultimately committed.

This should be fixed for both correctness in the near term, and improvement in the longer term. A discussed approach is as follows:

Near Term:
As a quicker fix, we should avoid attempting to sidestep the ambiguous errors on batches with commits in the DistSender and instead propagate them as AmbiguousResultErrors.

  • This avoids the potential issues caused by reissuing the requests after the ambiguous error, and indicates to the client that the transaction's commit state needs to be independently validated.
  • This approach should be validated that it does not cause a marked increase in unnecessary/avoidable AmbiguousResultErrors, and should be tested using the data-driven test reproductions linked above.

Medium Term:
We should be able to avoid raising all RPC failures as AmbiguousResultErrors, and minimize the cases to only those necessary by "remembering" the earlier ambiguity in retrying our failed writes and rejecting them as non-idempotent if they change the timestamp of a transaction that may already be implicitly committed.

  • This avoids potentially reporting a transaction as incorrectly "failed" or "double committing", instead returning an ambiguous error only in the necessary cases, indicating to the client that the transaction's commit state needs to be idependently validated.

Longer Term:
See #108342.

Jira issue: CRDB-28217

@AlexTalks AlexTalks added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. A-kv-transactions Relating to MVCC and the transactional model. T-kv KV Team labels May 24, 2023
@AlexTalks AlexTalks changed the title kv: properly handle ambiguous errors that occur on commits kv: properly handle ambiguous errors that occur alongside commits May 24, 2023
AlexTalks added a commit to AlexTalks/cockroach that referenced this issue Jul 14, 2023
WIP: This intends to raise an error on a KV batch that has an ambiguous
error on one of its writes, and also has an `EndTxn` request in the
batch.

Part of: cockroachdb#103817

Release note (bug fix): TBD
@shralex shralex added the O-support Would prevent or help troubleshoot a customer escalation - bugs, missing observability/tooling, docs label Jul 18, 2023
AlexTalks added a commit to AlexTalks/cockroach that referenced this issue Jul 22, 2023
This adds a unit test to reproduce the behavior described in cockroachdb#103817 and
seen in cockroachdb#67765, which currently is a bug in our implementation of the
parallel commit protocol that results in the assertion failure known as
`transaction unexpectedly committed`. The test currently validates the
incorrect behavior of the known issue, though it is inded to be used to
validate the potential fixes as proposed in cockroachdb#103817.

Release note: None

Part of: cockroachdb#103817
@AlexTalks AlexTalks self-assigned this Jul 25, 2023
AlexTalks added a commit to AlexTalks/cockroach that referenced this issue Jul 26, 2023
As cockroachdb#103817 is now a known issue for which a root cause has been
identified, the sanity check assertion in place here is disabled, with
errors occuring due to the known bug properly linking to the issue
ticket. While the transaction coordinator still performs sanity checks
to ensure that operations cannot continue on an already-committed
transaction, it will no longer log the error with severity `FATAL`,
resulting in node crash, and will instead simply return the error.

Part of: cockroachdb#103817

Release note: None
AlexTalks added a commit to AlexTalks/cockroach that referenced this issue Jul 26, 2023
This adds a unit test to reproduce the behavior described in cockroachdb#103817 and
seen in cockroachdb#67765, which currently is a bug in our implementation of the
parallel commit protocol that results in the assertion failure known as
`transaction unexpectedly committed`. The test currently validates the
incorrect behavior of the known issue, though it is inded to be used to
validate the potential fixes as proposed in cockroachdb#103817.

Release note: None

Part of: cockroachdb#103817
AlexTalks added a commit to AlexTalks/cockroach that referenced this issue Jul 26, 2023
Description TBD.

Depends on cockroachdb#107596.

Fixes: cockroachdb#103817

Release note (bug fix): TBD
craig bot pushed a commit that referenced this issue Jul 26, 2023
107596: kvcoord: disable fatal assertion on transaction commit sanity check r=AlexTalks a=AlexTalks

As #103817 is now a known issue for which a root cause has been identified, the sanity check assertion in place here is disabled, with errors occuring due to the known bug properly linking to the issue ticket. While the transaction coordinator still performs sanity checks to ensure that operations cannot continue on an already-committed transaction, it will no longer log the error with severity `FATAL`, resulting in node crash, and will instead simply return the error.

Part of: #103817

Release note: None

Co-authored-by: Alex Sarkesian <[email protected]>
blathers-crl bot pushed a commit that referenced this issue Jul 26, 2023
As #103817 is now a known issue for which a root cause has been
identified, the sanity check assertion in place here is disabled, with
errors occuring due to the known bug properly linking to the issue
ticket. While the transaction coordinator still performs sanity checks
to ensure that operations cannot continue on an already-committed
transaction, it will no longer log the error with severity `FATAL`,
resulting in node crash, and will instead simply return the error.

Part of: #103817

Release note: None
AlexTalks added a commit to AlexTalks/cockroach that referenced this issue Jul 26, 2023
!DNM!
This is intended to test the behavior of enabling replay protection for
all writes, such that they will fail if an idempotent write replay is
performed with a different timestamp at the same epoch and sequence for
a transaction.

Depends on cockroachdb#107596.

Fixes: cockroachdb#103817

Release note (bug fix): TBD
AlexTalks added a commit to AlexTalks/cockroach that referenced this issue Jul 26, 2023
This adds a unit test to reproduce the behavior described in cockroachdb#103817 and
seen in cockroachdb#67765, which currently is a bug in our implementation of the
parallel commit protocol that results in the assertion failure known as
`transaction unexpectedly committed`. The test currently validates the
incorrect behavior of the known issue, though it is inded to be used to
validate the potential fixes as proposed in cockroachdb#103817.

Release note: None

Part of: cockroachdb#103817
AlexTalks added a commit to AlexTalks/cockroach that referenced this issue Jul 27, 2023
This change introduces `MVCCWriteOptions`, a structure for bundling
parameters for `MVCCPut`, `MVCCDelete`, and their many variants, and
refactors usages of these functions across the code base in order to
move the existing function arguments into this structure. In addition to
allowing the code to eliminate specifying default values in many
callers, this enables the ability to pass new flags to write operations
such as the replay protection needed to address cockroachdb#103817.

Part of: cockroachdb#103817

Release note: None
AlexTalks added a commit to AlexTalks/cockroach that referenced this issue Oct 5, 2023
In cockroachdb#107323, testing for the ambiguous write case that leads to the
"transaction unexpectedly committed" bug were introduced, however to
increase test coverage of the fix, multiple schedules of operations need
to be tested. This change simply refactors the framework of the existing
test in order to enable the addition of muliple subtests. The subtests
are included in a separate patch.

Part of: cockroachdb#103817

Release note: None
AlexTalks added a commit to AlexTalks/cockroach that referenced this issue Oct 5, 2023
This change builds on the testing introduced in cockroachdb#107323 with additional
subtests evaluating the behavior of different race outcomes with
contended transactions where the first transaction experiences an RPC
failure (i.e. an ambiguous write).

Part of: cockroachdb#103817

Release note: None
AlexTalks added a commit to AlexTalks/cockroach that referenced this issue Oct 5, 2023
While previously, RPC failures were assumed to be retryable, as write
operations (with the notable exception of `EndTxn`) were assumed to be
idempotent, it has been seen in cockroachdb#67765 and documented in cockroachdb#103817 that
RPC failures on write operations that occur in parallel with a commit
(i.e. a partial batch where `withCommit==true`), it is not always
possible to assume idempotency and retry the "ambiguous" writes.
This is due to the fact that the retried write RPC could result in the
transaction's `WriteTimestamp` being bumped, changing the commit
timestamp of the transaction that may in fact already be implicitly
committed if the initial "ambiguous" write actually succeeded.

This change modifies the protocol of the DistSender to flag in
subsequent retries that a batch with a commit has previously
experienced ambiguity, as well as the handling of the retried write in
the MVCC layer to detect this previous ambiguity and reject retries
that change the write timestamp as a non-idempotent replay. The flag
allows subsequent retries to "remember" the earlier ambiguous write and
evaluate accordingly.

The flag allows us to properly handle RPC failures (i.e. ambiguous
writes) that occur on commit, as a transaction that is implicitly
committed is eligible to be marked as explicitly committed by contending
transactions via the `RecoverTxn` operation, resulting in a race between
retries by the transaction coordinator and recovery by contending
transactions that could result in either incorrectly reporting a
transaction as having failed with a `RETRY_SERIALIZABLE` error (despite
the possibility that it already was or could be recovered and
successfully committed), or in attempting to explicitly commit an
already-recovered and committed transaction, resulting in seeing an
assertion failure due to `transaction unexpectedly committed`.

The replay protection introduced here allows us to avoid both of these
situations by detecting a replay that should be considered
non-idempotent and returning an error, causing the original RPC error
remembered by the DistSender to be propagated as an
`AmbiguousResultError`. As such, this can be handled by application code
by validating the success/failure of a transaction when receiving this
error.

Depends on cockroachdb#107680, cockroachdb#107323, cockroachdb#108154, cockroachdb#108001

Fixes: cockroachdb#103817

Release note (bug fix): Properly handles RPC failures on writes using
the parallel commit protocol that execute in parallel to the commit
operation, avoiding incorrect retryable failures and `transaction
unexpectedly committed` assertions by detecting when writes cannot
be retried idempotently, instead returning an `AmbiguousResultError`.
AlexTalks added a commit to AlexTalks/cockroach that referenced this issue Oct 6, 2023
This adds a unit test to reproduce the behavior described in cockroachdb#103817 and
seen in cockroachdb#67765, which currently is a bug in our implementation of the
parallel commit protocol that results in the assertion failure known as
`transaction unexpectedly committed`. The test currently validates the
incorrect behavior of the known issue, though it is inded to be used to
validate the potential fixes as proposed in cockroachdb#103817.

Release note: None

Part of: cockroachdb#103817
AlexTalks added a commit to AlexTalks/cockroach that referenced this issue Oct 6, 2023
In cockroachdb#107323, testing for the ambiguous write case that leads to the
"transaction unexpectedly committed" bug were introduced, however to
increase test coverage of the fix, multiple schedules of operations need
to be tested. This change simply refactors the framework of the existing
test in order to enable the addition of muliple subtests. The subtests
are included in a separate patch.

Part of: cockroachdb#103817

Release note: None
AlexTalks added a commit to AlexTalks/cockroach that referenced this issue Oct 6, 2023
This change builds on the testing introduced in cockroachdb#107323 with additional
subtests evaluating the behavior of different race outcomes with
contended transactions where the first transaction experiences an RPC
failure (i.e. an ambiguous write).

Part of: cockroachdb#103817

Release note: None
AlexTalks added a commit to AlexTalks/cockroach that referenced this issue Oct 6, 2023
This change builds upon the tests added to validate cockroachdb#107658, in this
case exercising the code path that would occur if the client did not
send this flag in a `BatchRequest`, or the server did not respect it. It
mocks a "mixed version" cluster by simply dropping the flag in
transport, and tests that the previous result - i.e. a
`TransactionStatusError` with `REASON_TXN_COMMITTED` - is returned,
without any other side effects or unexpected behavior.

Part of: cockroachdb#103817

Release note: None
AlexTalks added a commit to AlexTalks/cockroach that referenced this issue Oct 6, 2023
While previously, RPC failures were assumed to be retryable, as write
operations (with the notable exception of `EndTxn`) were assumed to be
idempotent, it has been seen in cockroachdb#67765 and documented in cockroachdb#103817 that
RPC failures on write operations that occur in parallel with a commit
(i.e. a partial batch where `withCommit==true`), it is not always
possible to assume idempotency and retry the "ambiguous" writes.
This is due to the fact that the retried write RPC could result in the
transaction's `WriteTimestamp` being bumped, changing the commit
timestamp of the transaction that may in fact already be implicitly
committed if the initial "ambiguous" write actually succeeded.

This change modifies the protocol of the DistSender to flag in
subsequent retries that a batch with a commit has previously
experienced ambiguity, as well as the handling of the retried write in
the MVCC layer to detect this previous ambiguity and reject retries
that change the write timestamp as a non-idempotent replay. The flag
allows subsequent retries to "remember" the earlier ambiguous write and
evaluate accordingly.

The flag allows us to properly handle RPC failures (i.e. ambiguous
writes) that occur on commit, as a transaction that is implicitly
committed is eligible to be marked as explicitly committed by contending
transactions via the `RecoverTxn` operation, resulting in a race between
retries by the transaction coordinator and recovery by contending
transactions that could result in either incorrectly reporting a
transaction as having failed with a `RETRY_SERIALIZABLE` error (despite
the possibility that it already was or could be recovered and
successfully committed), or in attempting to explicitly commit an
already-recovered and committed transaction, resulting in seeing an
assertion failure due to `transaction unexpectedly committed`.

The replay protection introduced here allows us to avoid both of these
situations by detecting a replay that should be considered
non-idempotent and returning an error, causing the original RPC error
remembered by the DistSender to be propagated as an
`AmbiguousResultError`. As such, this can be handled by application code
by validating the success/failure of a transaction when receiving this
error.

Depends on cockroachdb#107680, cockroachdb#107323, cockroachdb#108154, cockroachdb#108001

Fixes: cockroachdb#103817

Release note (bug fix): Properly handles RPC failures on writes using
the parallel commit protocol that execute in parallel to the commit
operation, avoiding incorrect retryable failures and `transaction
unexpectedly committed` assertions by detecting when writes cannot
be retried idempotently, instead returning an `AmbiguousResultError`.
AlexTalks added a commit to AlexTalks/cockroach that referenced this issue Oct 11, 2023
This change builds upon the tests added to validate cockroachdb#107658, in this
case exercising the code path that would occur if the client did not
send this flag in a `BatchRequest`, or the server did not respect it. It
mocks a "mixed version" cluster by simply dropping the flag in
transport, and tests that the previous result - i.e. a
`TransactionStatusError` with `REASON_TXN_COMMITTED` - is returned,
without any other side effects or unexpected behavior.

Part of: cockroachdb#103817

Release note: None
craig bot pushed a commit that referenced this issue Oct 11, 2023
111945: kvcoord: add "mixed version" test for `AmbiguousReplayProtection` flag r=AlexTalks a=AlexTalks

This change builds upon the tests added to validate #107658, in this case exercising the code path that would occur if the client did not send this flag in a `BatchRequest`, or the server did not respect it. It mocks a "mixed version" cluster by simply dropping the flag in transport, and tests that the previous result - i.e. a `TransactionStatusError` with `REASON_TXN_COMMITTED` - is returned, without any other side effects or unexpected behavior.

Part of: #103817

Release note: None

Co-authored-by: Alex Sarkesian <[email protected]>
blathers-crl bot pushed a commit that referenced this issue Oct 11, 2023
This change builds upon the tests added to validate #107658, in this
case exercising the code path that would occur if the client did not
send this flag in a `BatchRequest`, or the server did not respect it. It
mocks a "mixed version" cluster by simply dropping the flag in
transport, and tests that the previous result - i.e. a
`TransactionStatusError` with `REASON_TXN_COMMITTED` - is returned,
without any other side effects or unexpected behavior.

Part of: #103817

Release note: None
@daniel-crlabs
Copy link
Contributor

Affects this customer: https://cockroachdb.zendesk.com/agent/tickets/19344

1 similar comment
@daniel-crlabs
Copy link
Contributor

Affects this customer: https://cockroachdb.zendesk.com/agent/tickets/19344

nvanbenschoten pushed a commit to nvanbenschoten/cockroach that referenced this issue Nov 22, 2023
As cockroachdb#103817 is now a known issue for which a root cause has been
identified, the sanity check assertion in place here is disabled, with
errors occuring due to the known bug properly linking to the issue
ticket. While the transaction coordinator still performs sanity checks
to ensure that operations cannot continue on an already-committed
transaction, it will no longer log the error with severity `FATAL`,
resulting in node crash, and will instead simply return the error.

Part of: cockroachdb#103817

Release note: None
@github-project-automation github-project-automation bot moved this to Closed in KV Aug 28, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-kv-transactions Relating to MVCC and the transactional model. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. O-support Would prevent or help troubleshoot a customer escalation - bugs, missing observability/tooling, docs T-kv KV Team
Projects
No open projects
Archived in project
3 participants