-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
release-23.1: kv, storage: fix handling of ambiguous failures on commit #110757
Conversation
Thanks for opening a backport. Please check the backport criteria before merging:
If some of the basic criteria cannot be satisfied, ensure that the exceptional criteria are satisfied within.
Add a brief release justification to the body of your PR to justify this backport. Some other things to consider:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 80 of 80 files at r1, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @AlexTalks and @jbowens)
pkg/kv/kvserver/logstore/logstore.go
line 36 at r1 (raw file):
"github.com/cockroachdb/cockroach/pkg/util/timeutil" "github.com/cockroachdb/errors" "go.etcd.io/raft/v3"
Did you mean to remove this import? From the diff, I don't see what it would be removed.
pkg/kv/kvserver/logstore/stateloader.go
line 185 at r1 (raw file):
ctx context.Context, writer storage.Writer, replicaID roachpb.ReplicaID, ) error { <<<<<<< HEAD
Cherry-pick skew?
pkg/storage/engine_test.go
line 1606 at r1 (raw file):
// However, certain clearers cannot clear intents, range keys, or point keys. writeInitialData := func(t *testing.T, rw ReadWriter) { var localTS hlc.ClockTimestamp
Should we just delete localTS
?
5cbb195
to
283a941
Compare
This change introduces `MVCCWriteOptions`, a structure for bundling parameters for `MVCCPut`, `MVCCDelete`, and their many variants, and refactors usages of these functions across the code base in order to move the existing function arguments into this structure. In addition to allowing the code to eliminate specifying default values in many callers, this enables the ability to pass new flags to write operations such as the replay protection needed to address cockroachdb#103817. Part of: cockroachdb#103817 Release note: None
283a941
to
eff2a5d
Compare
This adds a unit test to reproduce the behavior described in cockroachdb#103817 and seen in cockroachdb#67765, which currently is a bug in our implementation of the parallel commit protocol that results in the assertion failure known as `transaction unexpectedly committed`. The test currently validates the incorrect behavior of the known issue, though it is inded to be used to validate the potential fixes as proposed in cockroachdb#103817. Release note: None Part of: cockroachdb#103817
In cockroachdb#107323, testing for the ambiguous write case that leads to the "transaction unexpectedly committed" bug were introduced, however to increase test coverage of the fix, multiple schedules of operations need to be tested. This change simply refactors the framework of the existing test in order to enable the addition of muliple subtests. The subtests are included in a separate patch. Part of: cockroachdb#103817 Release note: None
This change builds on the testing introduced in cockroachdb#107323 with additional subtests evaluating the behavior of different race outcomes with contended transactions where the first transaction experiences an RPC failure (i.e. an ambiguous write). Part of: cockroachdb#103817 Release note: None
828c4ba
to
178f788
Compare
It is possible we may require backporting #102808 and #102809 in order for the tests (at least, those that depend on |
The tests for the ambiguous write on commits bug required modification to work on 23.1 due to changes in the behavior of handling a `WriteTooOldError`. In order to ensure a `WriteTooOldError` is not deferred and expected to be handled by the client, CPuts are used instead, which don't permit this deferral. Part of: cockroachdb#103817 Release note: None
While previously, RPC failures were assumed to be retryable, as write operations (with the notable exception of `EndTxn`) were assumed to be idempotent, it has been seen in cockroachdb#67765 and documented in cockroachdb#103817 that RPC failures on write operations that occur in parallel with a commit (i.e. a partial batch where `withCommit==true`), it is not always possible to assume idempotency and retry the "ambiguous" writes. This is due to the fact that the retried write RPC could result in the transaction's `WriteTimestamp` being bumped, changing the commit timestamp of the transaction that may in fact already be implicitly committed if the initial "ambiguous" write actually succeeded. This change modifies the protocol of the DistSender to flag in subsequent retries that a batch with a commit has previously experienced ambiguity, as well as the handling of the retried write in the MVCC layer to detect this previous ambiguity and reject retries that change the write timestamp as a non-idempotent replay. The flag allows subsequent retries to "remember" the earlier ambiguous write and evaluate accordingly. The flag allows us to properly handle RPC failures (i.e. ambiguous writes) that occur on commit, as a transaction that is implicitly committed is eligible to be marked as explicitly committed by contending transactions via the `RecoverTxn` operation, resulting in a race between retries by the transaction coordinator and recovery by contending transactions that could result in either incorrectly reporting a transaction as having failed with a `RETRY_SERIALIZABLE` error (despite the possibility that it already was or could be recovered and successfully committed), or in attempting to explicitly commit an already-recovered and committed transaction, resulting in seeing an assertion failure due to `transaction unexpectedly committed`. The replay protection introduced here allows us to avoid both of these situations by detecting a replay that should be considered non-idempotent and returning an error, causing the original RPC error remembered by the DistSender to be propagated as an `AmbiguousResultError`. As such, this can be handled by application code by validating the success/failure of a transaction when receiving this error. Depends on cockroachdb#107680, cockroachdb#107323, cockroachdb#108154, cockroachdb#108001 Fixes: cockroachdb#103817 Release note (bug fix): Properly handles RPC failures on writes using the parallel commit protocol that execute in parallel to the commit operation, avoiding incorrect retryable failures and `transaction unexpectedly committed` assertions by detecting when writes cannot be retried idempotently, instead returning an `AmbiguousResultError`.
2f2fe62
to
96c3243
Compare
As discussed on Slack, I've updated some of the tests to use It feels like we have a few options from here:
I'm not sure if it would be ok to leave around the intent written on the retry though, or if additional handling would be necessary? @nvanbenschoten feel free to let me know if you have some thoughts or preferences here! |
Prior to cockroachdb#102808 and cockroachdb#102809, `WriteTooOldError`s were deferred for client-side handling, allowing intent writes to proceed. This unfortunately limits the ability to eliminate `transaction unexpectedly committed` errors when retrying a blind write. This fix modifies the tests to reflect this gap. Part of: cockroachdb#103817 Release note: None
96c3243
to
4530e43
Compare
These changes feel too large and too risky to be backported. They also comes with some user-visible behavior changes that aren't appropriate for a patch release.
This seems like a reasonable change, although making it will require additional testing and introduces some additional risk for the backport. How do you feel about it? Independent of that, we should get to the bottom of the flakiness in #110187 before backporting the flaky test to other release branches. |
@nvanbenschoten agreed. I actually put up a draft of the fix I am proposing in #111017. |
Closing this in favor of #111017 . |
Affects this customer: https://cockroachdb.zendesk.com/agent/tickets/19344 |
Backport:
Please see individual PRs for details.
/cc @cockroachdb/release
Release justification: Backporting fix for outstanding bug described in #103817.