Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kv: unexpected REASON_TXN_COMMITTED from CommitInBatch #44315

Closed
danhhz opened this issue Jan 23, 2020 · 3 comments · Fixed by #44608
Closed

kv: unexpected REASON_TXN_COMMITTED from CommitInBatch #44315

danhhz opened this issue Jan 23, 2020 · 3 comments · Fixed by #44608
Assignees
Labels
A-kv-client Relating to the KV client and the KV interface.

Comments

@danhhz
Copy link
Contributor

danhhz commented Jan 23, 2020

The PR to add support to kvnemeses for exercising txn.CommitInBatch (#44229) will occasionally flake on a TransactionStatusError: already committed (REASON_TXN_COMMITTED) error. At Nathan's request, I checked the txn.Epoch after one of these failures and it was 0 indicating no restarts. Perhaps this is missing a retry somewhere?

  db.Txn(ctx, func(ctx context.Context, txn *client.Txn) error {
    txn.Put(ctx, /Table/50/"e14094f1", v-15) // nil
    b := &Batch{}
    b.Get(ctx, /Table/50/"0b10b00b") // (nil, TransactionStatusError: already committed (REASON_TXN_COMMITTED))
    b.Get(ctx, /Table/50/"0b10b00b") // (nil, TransactionStatusError: already committed (REASON_TXN_COMMITTED))
    txn.CommitInBatch(ctx, b) // TransactionStatusError: already committed (REASON_TXN_COMMITTED)
    return nil
  }) // TransactionStatusError: already committed (REASON_TXN_COMMITTED)

Repro: Check out #44229 and run the following. It tends to fail within a minute on my laptop.

make stress PKG=./pkg/kv/kvnemeses TESTS=TestKVNemesesSingleNode TESTTIMEOUT=45s TESTFLAGS='-v -show-logs' STRESSFLAGS='-stderr -maxfails 1'

Found with kvnemeses

@danhhz danhhz added the A-kv-client Relating to the KV client and the KV interface. label Jan 23, 2020
@nvanbenschoten
Copy link
Member

I was able to reproduce this following the steps provided above.

At first glance, it looks like this error is coming from the TxnWaitQueue, not the EndTxn evaluation. Specifically, a transaction is in the process of pushing another transaction and notices that it itself was committed:

log.VEventf(ctx, 1, "pusher committed: %v", updatedPusher)

I don't see how this is possible unless we're in a case where we've re-issued the final batch of a transaction (after a refresh?) and get stuck pushing due to contention or we're sending the reads concurrently with the EndTxn. But what's strange is that the final batch is made up of two Get and then the EndTxn, so these operations should be split from each other and not run concurrently. We'll see if that's actually happening soon enough.

@nvanbenschoten
Copy link
Member

No, it doesn't look like the EndTxn is being split from the Gets. We see the final batch looks like:

2 Get, 1 EndTxn, 1 QueryIntent

So we're performing a parallel commit and concurrently sending Get requests to other ranges. That shouldn't be allowed because the parallel commit will be able to succeed even if some of the Gets failed.

@nvanbenschoten
Copy link
Member

Yes, it looks like that was it. Here's the fix, which appears to be holding up to stress:

diff --git a/pkg/kv/txn_interceptor_committer.go b/pkg/kv/txn_interceptor_committer.go
index c4ee180af4..a3be512e6a 100644
--- a/pkg/kv/txn_interceptor_committer.go
+++ b/pkg/kv/txn_interceptor_committer.go
@@ -297,13 +297,35 @@ func (tc *txnCommitter) canCommitInParallelWithWrites(
                return false
        }

-       // Similar to how we can't pipeline ranged writes, we also can't commit in
-       // parallel with them. The reason for this is that the status resolution
-       // process for STAGING transactions wouldn't know where to look for the
-       // intents.
+       // Check whether every request in the batch is compatable with a parallel
+       // commit. If any are incompatible then we cannot perform a parallel commit.
        for _, ru := range ba.Requests[:len(ba.Requests)-1] {
                req := ru.GetInner()
-               if roachpb.IsTransactionWrite(req) && roachpb.IsRange(req) {
+               switch {
+               case roachpb.IsTransactionWrite(req):
+                       if roachpb.IsRange(req) {
+                               // Similar to how we can't pipeline ranged writes, we also can't
+                               // commit in parallel with them. The reason for this is that the
+                               // status resolution process for STAGING transactions wouldn't
+                               // know where to look for the corresponding intents.
+                               return false
+                       }
+                       // All other point writes are included in the EndTxn request's
+                       // InFlightWrites set and are visible to the status resolution
+                       // process for STAGING transactions. Populating InFlightWrites
+                       // has already been done by the txnPipeliner.
+
+               case req.Method() == roachpb.QueryIntent:
+                       // QueryIntent requests are compatable with parallel commits. The
+                       // intents being queried are also attached to the EndTxn request's
+                       // InFlightWrites set and are visible to the status resolution
+                       // process for STAGING transactions. Populating InFlightWrites has
+                       // already been done by the txnPipeliner.
+
+               default:
+                       // All other request types (notably Get and Scan requests) are
+                       // incompatible with parallel commits because their outcome is
+                       // not taken into consideration by the parallel commit protocol.
                        return false
                }
        }

nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Jan 31, 2020
Fixes cockroachdb#44315.

This commit fixes an issue where KV batches with read-only requests were allowed
to be issued as parallel commits. This should not have been allowed because the
outcome of these read-only requests, notably Get and Scan requests, is not taken
into consideration by the status resolution process for STAGING transactions.

This would have caused serious atomicity issues if parallel commit batches
(non-1PC committing batches) were issued with read-only requests by users of the
KV API. Luckily, in practice that wouldn't actually happen as we never issue
batches like that so there's no concern about production clusters hitting this
bug. Still, this is a very good catch as this was a serious footgun and fixing
this hardens the KV API. We should also still backport this to v19.2 just to be
safe.

Release note: None
craig bot pushed a commit that referenced this issue Feb 4, 2020
44608: kv: disallow parallel commit of batch with read-only requests r=nvanbenschoten a=nvanbenschoten

Fixes #44315.

This commit fixes an issue where KV batches with read-only requests were allowed
to be issued as parallel commits. This should not have been allowed because the
outcome of these read-only requests, notably Get and Scan requests, is not taken
into consideration by the status resolution process for STAGING transactions.

This would have caused serious atomicity issues if parallel commit batches
(non-1PC committing batches) were issued with read-only requests by users of the
KV API. Luckily, in practice that wouldn't actually happen as we never issue
batches like that so there's no concern about production clusters hitting this
bug. Still, this is a very good catch as this was a serious footgun and fixing
this hardens the KV API. We should also still backport this to v19.2 just to be
safe.

Release note: None

Co-authored-by: Nathan VanBenschoten <[email protected]>
@craig craig bot closed this as completed in 56a9934 Feb 4, 2020
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Feb 4, 2020
Fixes cockroachdb#44315.

This commit fixes an issue where KV batches with read-only requests were allowed
to be issued as parallel commits. This should not have been allowed because the
outcome of these read-only requests, notably Get and Scan requests, is not taken
into consideration by the status resolution process for STAGING transactions.

This would have caused serious atomicity issues if parallel commit batches
(non-1PC committing batches) were issued with read-only requests by users of the
KV API. Luckily, in practice that wouldn't actually happen as we never issue
batches like that so there's no concern about production clusters hitting this
bug. Still, this is a very good catch as this was a serious footgun and fixing
this hardens the KV API. We should also still backport this to v19.2 just to be
safe.

Release note: None
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-kv-client Relating to the KV client and the KV interface.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants