-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
kv: unexpected REASON_TXN_COMMITTED from CommitInBatch #44315
Comments
I was able to reproduce this following the steps provided above. At first glance, it looks like this error is coming from the TxnWaitQueue, not the EndTxn evaluation. Specifically, a transaction is in the process of pushing another transaction and notices that it itself was committed: cockroach/pkg/storage/txnwait/queue.go Line 602 in 1a3c232
I don't see how this is possible unless we're in a case where we've re-issued the final batch of a transaction (after a refresh?) and get stuck pushing due to contention or we're sending the reads concurrently with the |
No, it doesn't look like the EndTxn is being split from the
So we're performing a parallel commit and concurrently sending |
Yes, it looks like that was it. Here's the fix, which appears to be holding up to stress: diff --git a/pkg/kv/txn_interceptor_committer.go b/pkg/kv/txn_interceptor_committer.go
index c4ee180af4..a3be512e6a 100644
--- a/pkg/kv/txn_interceptor_committer.go
+++ b/pkg/kv/txn_interceptor_committer.go
@@ -297,13 +297,35 @@ func (tc *txnCommitter) canCommitInParallelWithWrites(
return false
}
- // Similar to how we can't pipeline ranged writes, we also can't commit in
- // parallel with them. The reason for this is that the status resolution
- // process for STAGING transactions wouldn't know where to look for the
- // intents.
+ // Check whether every request in the batch is compatable with a parallel
+ // commit. If any are incompatible then we cannot perform a parallel commit.
for _, ru := range ba.Requests[:len(ba.Requests)-1] {
req := ru.GetInner()
- if roachpb.IsTransactionWrite(req) && roachpb.IsRange(req) {
+ switch {
+ case roachpb.IsTransactionWrite(req):
+ if roachpb.IsRange(req) {
+ // Similar to how we can't pipeline ranged writes, we also can't
+ // commit in parallel with them. The reason for this is that the
+ // status resolution process for STAGING transactions wouldn't
+ // know where to look for the corresponding intents.
+ return false
+ }
+ // All other point writes are included in the EndTxn request's
+ // InFlightWrites set and are visible to the status resolution
+ // process for STAGING transactions. Populating InFlightWrites
+ // has already been done by the txnPipeliner.
+
+ case req.Method() == roachpb.QueryIntent:
+ // QueryIntent requests are compatable with parallel commits. The
+ // intents being queried are also attached to the EndTxn request's
+ // InFlightWrites set and are visible to the status resolution
+ // process for STAGING transactions. Populating InFlightWrites has
+ // already been done by the txnPipeliner.
+
+ default:
+ // All other request types (notably Get and Scan requests) are
+ // incompatible with parallel commits because their outcome is
+ // not taken into consideration by the parallel commit protocol.
return false
}
} |
Fixes cockroachdb#44315. This commit fixes an issue where KV batches with read-only requests were allowed to be issued as parallel commits. This should not have been allowed because the outcome of these read-only requests, notably Get and Scan requests, is not taken into consideration by the status resolution process for STAGING transactions. This would have caused serious atomicity issues if parallel commit batches (non-1PC committing batches) were issued with read-only requests by users of the KV API. Luckily, in practice that wouldn't actually happen as we never issue batches like that so there's no concern about production clusters hitting this bug. Still, this is a very good catch as this was a serious footgun and fixing this hardens the KV API. We should also still backport this to v19.2 just to be safe. Release note: None
44608: kv: disallow parallel commit of batch with read-only requests r=nvanbenschoten a=nvanbenschoten Fixes #44315. This commit fixes an issue where KV batches with read-only requests were allowed to be issued as parallel commits. This should not have been allowed because the outcome of these read-only requests, notably Get and Scan requests, is not taken into consideration by the status resolution process for STAGING transactions. This would have caused serious atomicity issues if parallel commit batches (non-1PC committing batches) were issued with read-only requests by users of the KV API. Luckily, in practice that wouldn't actually happen as we never issue batches like that so there's no concern about production clusters hitting this bug. Still, this is a very good catch as this was a serious footgun and fixing this hardens the KV API. We should also still backport this to v19.2 just to be safe. Release note: None Co-authored-by: Nathan VanBenschoten <[email protected]>
Fixes cockroachdb#44315. This commit fixes an issue where KV batches with read-only requests were allowed to be issued as parallel commits. This should not have been allowed because the outcome of these read-only requests, notably Get and Scan requests, is not taken into consideration by the status resolution process for STAGING transactions. This would have caused serious atomicity issues if parallel commit batches (non-1PC committing batches) were issued with read-only requests by users of the KV API. Luckily, in practice that wouldn't actually happen as we never issue batches like that so there's no concern about production clusters hitting this bug. Still, this is a very good catch as this was a serious footgun and fixing this hardens the KV API. We should also still backport this to v19.2 just to be safe. Release note: None
The PR to add support to kvnemeses for exercising txn.CommitInBatch (#44229) will occasionally flake on a
TransactionStatusError: already committed (REASON_TXN_COMMITTED)
error. At Nathan's request, I checked the txn.Epoch after one of these failures and it was0
indicating no restarts. Perhaps this is missing a retry somewhere?Repro: Check out #44229 and run the following. It tends to fail within a minute on my laptop.
Found with kvnemeses
The text was updated successfully, but these errors were encountered: