kvcoord: add test for transaction unexpectedly committed #107323

AlexTalks · 2023-07-21T00:18:10Z

This adds a unit test to reproduce the behavior described in #103817 and seen in #67765, which currently is a bug in our implementation of the parallel commit protocol that results in the assertion failure known as transaction unexpectedly committed. The test currently validates the incorrect behavior of the known issue, though it is inded to be used to validate the potential fixes as proposed in #103817.

Release note: None

Part of: #103817

cockroach-teamcity · 2023-07-21T00:18:19Z

This change is

pkg/kv/kvserver/client_unexpected_commit_test.go

arulajmani

Nice job debugging this hairy issue and translating it to a test. Most of my comments on this are minor/nits.

Reviewed 2 of 3 files at r2, 1 of 1 files at r3, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @AlexTalks and @nvanbenschoten)

pkg/kv/kvserver/batcheval/cmd_lease_request.go line 163 at r3 (raw file):

	}

	log.VEventf(ctx, 2, "lease request: prev lease: %+v, new lease: %+v", prevLease, newLease)

nit: consider pulling this out into its own commit as it is unrelated to the test we're adding here.

pkg/kv/kvserver/client_unexpected_commit_test.go line 122 at r3 (raw file):

// a parallel commit transaction with an ambiguous error on a write races with
// a contending transaction's recovery attempt. In the case that the recovery
// succeeds prior to the original transaction's retries, an ambiguous error

nit: Should we reword this comment to describe what the current behaviour is (and what the test is asserting) instead? And then describe the desired behaviour we hope to achieve with a future fix ("an ambiguous error should be returned to the client")

pkg/kv/kvserver/client_unexpected_commit_test.go line 142 at r3 (raw file):

	// Key constants.
	tablePrefix := bootstrap.TestingUserTableDataMin()

Do we need this? Or could we write at raw keys instead?

pkg/kv/kvserver/client_unexpected_commit_test.go line 143 at r3 (raw file):

	// Key constants.
	tablePrefix := bootstrap.TestingUserTableDataMin()
	key1 := roachpb.Key(encoding.EncodeUvarintAscending(tablePrefix.Clone(), 1))

nit: below, where you outline the test, you're using "a" and "b". Should we rename these variables to keyA and keyB instead?

pkg/kv/kvserver/client_unexpected_commit_test.go line 161 at r3 (raw file):

	// Handle all synchronization of KV operations at the transport level.
	// This allows us to schedule the operations such that they look like:
	//  txn1: read a (n1), b (n2)

nit: read key1 (n1), key2 (n2)

pkg/kv/kvserver/client_unexpected_commit_test.go line 227 at r3 (raw file):

	st := cluster.MakeTestingClusterSettings()

	// Disable closed timestamps for greater control over when transaction gets bumped.

nit: wrap comment to 80 chars. There's a few other places below that could use an audit.

pkg/kv/kvserver/client_unexpected_commit_test.go line 228 at r3 (raw file):

	// Disable closed timestamps for greater control over when transaction gets bumped.
	closedts.TargetDuration.Override(ctx, &st.SV, 1*time.Hour)

nit: consider overriding this by setting it to 0.

pkg/kv/kvserver/client_unexpected_commit_test.go line 229 at r3 (raw file):

	// Disable closed timestamps for greater control over when transaction gets bumped.
	closedts.TargetDuration.Override(ctx, &st.SV, 1*time.Hour)
	ts.TimeseriesStorageEnabled.Override(ctx, &st.SV, false)

Why do we need this?

pkg/kv/kvserver/client_unexpected_commit_test.go line 246 at r3 (raw file):

					Store: &kvserver.StoreTestingKnobs{
						EvalKnobs: kvserverbase.BatchEvalTestingKnobs{
							RecoverIndeterminateCommitsOnFailedPushes: true,

Consider adding a comment here.

pkg/kv/kvserver/client_unexpected_commit_test.go line 290 at r3 (raw file):

			batch.GetForUpdate(key)
		}
		assert.NoError(t, txn.Run(ctx, batch))

Should we be using require instead of assert here (and elsewhere)?

I'm not entirely sure what the difference is, but I assume there's something to it given every other place in the codebase prefers require over assert. TODO(arul) find out.

pkg/kv/kvserver/client_unexpected_commit_test.go line 302 at r3 (raw file):

	db := tc.Server(0).DB()

	// Dump KVs at end of test for debugging purposes.

Let's get rid of this. Logging these KVs isn't giving us too much -- if we care about how these keys look once the test exits, we should assert the state at the end instead. It's also fine if we don't.

pkg/kv/kvserver/client_unexpected_commit_test.go line 316 at r3 (raw file):

	// Corresponds to:
	_ = `
		CREATE TABLE bank (id INT PRIMARY KEY, balance INT);

Let's get rid of this.

pkg/kv/kvserver/client_unexpected_commit_test.go line 319 at r3 (raw file):

		INSERT INTO bank VALUES (1, 50), (2, 50);
		ALTER TABLE bank SPLIT AT VALUES (2);`
	t.Logf("Putting initial keys")

nit: consider getting rid of these log lines now that we've stabilized the test. They feel a bit noisy (just the super obvious ones).

pkg/kv/kvserver/client_unexpected_commit_test.go line 379 at r3 (raw file):

	// Execute implicit transaction to move $10 from account 1 to account 2.
	// Corresponds to:

Let's get rid of this as well.

pkg/kv/kvserver/client_unexpected_commit_test.go line 397 at r3 (raw file):

		select {
		case <-txn1Ready:
		case <-tc.Stopper().ShouldQuiesce():

Would you ever expect this to happen on a test that passes?

pkg/kv/kvserver/client_unexpected_commit_test.go line 413 at r3 (raw file):

		// Wait until txn2 is ready to start.
		select {
		case <-txn2Ready:

Here, and elsewhere, consider adding:

select {
 ...
  case <-time.After(10 * time.Second):
     t.Error("test timed out")  
}

pkg/kv/kvserver/client_unexpected_commit_test.go line 418 at r3 (raw file):

		}
		tCtx := context.Background()
		txn := db.NewTxn(tCtx, "txn2")

Does txn2 need to be the same transaction as txn1? Could it instead be something simpler? IIUC, all it needs to do is trigger the recovery, right?

pkg/kv/kvserver/client_unexpected_commit_test.go line 436 at r3 (raw file):

		t.Logf("Transferring r%d lease to n%d", secondRange.RangeID, tc.Target(0).NodeID)
		tc.TransferRangeLeaseOrFatal(t, secondRange, tc.Target(0))

Given this isn't the test's main goroutine, should we instead be using TransferRangeLease and t.Error instead (of the t.Fatal)?

Might also be worth auditing the other goroutines the test is spinning off.

pkg/kv/kvserver/client_unexpected_commit_test.go line 448 at r3 (raw file):

		// These conditions are checked in order of expected operations of the test.

		// 1. txn1->n1: Get(a)

nit: s/a/key1/g (here and below). Ditto for "b".

pkg/kv/kvserver/client_unexpected_commit_test.go line 452 at r3 (raw file):

		// 3. txn1->n1: Put(a), EndTxn(parallel commit) -- Puts txn1 in STAGING.
		// 4. txn1->n2: Put(b) -- Send the request, but pause before returning the
		// response yet so we can inject network failure.

"yet so"

pkg/kv/kvserver/client_unexpected_commit_test.go line 466 at r3 (raw file):

		if req.ba.IsSingleRecoverTxnRequest() && cp == BeforeSending {
			// Once the RecoverTxn request is issued, as part of txn2's PushTxn
			// request, the lease can be moved.

What would happen if we moved the lease before the recovery kicked off?

Separately, could we get rid of the leaseMoveReady channel if we triggered the lease move here?

pkg/kv/kvserver/client_unexpected_commit_test.go line 490 at r3 (raw file):

		// Note that if these refreshes fail, the transaction coordinator
		// would return a retriable error, although the transaction could be actually
		// committed during recovery - this is highly problematic.

This deserves its own issue. Should we file one and link it here?

pkg/kv/kvserver/client_unexpected_commit_test.go line 535 at r3 (raw file):

	// TODO(sarkesian): While we expect an AmbiguousResultError once the immediate
	//  changes outlined in #103817 are implemented, right now this is essentially

nit: extra spaces, here and below.

AlexTalks

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @arulajmani and @nvanbenschoten)

pkg/kv/kvserver/client_unexpected_commit_test.go line 142 at r3 (raw file):

Previously, arulajmani (Arul Ajmani) wrote…

Do we need this? Or could we write at raw keys instead?

Not strictly necessary, but I found it very helpful in conceptualizing the test and reading the output in any case.

pkg/kv/kvserver/client_unexpected_commit_test.go line 228 at r3 (raw file):

Previously, arulajmani (Arul Ajmani) wrote…

nit: consider overriding this by setting it to 0.

I didn't think that actually worked? I still saw logs showing timestamps being closed even when this was set to 0.

pkg/kv/kvserver/client_unexpected_commit_test.go line 229 at r3 (raw file):

Previously, arulajmani (Arul Ajmani) wrote…

Why do we need this?

Done - Good point, we don't, I should have removed this - was in an attempt to make logs less noisy but the real issue was something else.

pkg/kv/kvserver/client_unexpected_commit_test.go line 246 at r3 (raw file):

Previously, arulajmani (Arul Ajmani) wrote…

Consider adding a comment here.

Done.

pkg/kv/kvserver/client_unexpected_commit_test.go line 290 at r3 (raw file):

Previously, arulajmani (Arul Ajmani) wrote…

Should we be using require instead of assert here (and elsewhere)?

I'm not entirely sure what the difference is, but I assume there's something to it given every other place in the codebase prefers require over assert. TODO(arul) find out.

No, this is specific - if you look at the code, you'll see that require.X basically call assert.X and then t.FailNow(), and IIUC t.FailNow() isn't supposed to be called from goroutines other than the main test goroutine.

pkg/kv/kvserver/client_unexpected_commit_test.go line 302 at r3 (raw file):

Previously, arulajmani (Arul Ajmani) wrote…

Let's get rid of this. Logging these KVs isn't giving us too much -- if we care about how these keys look once the test exits, we should assert the state at the end instead. It's also fine if we don't.

Can remove it, but it can be helpful in debugging a failure - that's the primary purpose here. If needed we can assert on the values.

pkg/kv/kvserver/client_unexpected_commit_test.go line 397 at r3 (raw file):

Previously, arulajmani (Arul Ajmani) wrote…

Would you ever expect this to happen on a test that passes?

No, this is so that if a test fails, it will fail quickly.

pkg/kv/kvserver/client_unexpected_commit_test.go line 418 at r3 (raw file):

Previously, arulajmani (Arul Ajmani) wrote…

Does txn2 need to be the same transaction as txn1? Could it instead be something simpler? IIUC, all it needs to do is trigger the recovery, right?

No, this (and the SQL comments) were because this was based on a real-world example seen in the bank workload. What is critical is the GetForUpdates.

pkg/kv/kvserver/client_unexpected_commit_test.go line 436 at r3 (raw file):

Previously, arulajmani (Arul Ajmani) wrote…

Given this isn't the test's main goroutine, should we instead be using TransferRangeLease and t.Error instead (of the t.Fatal)?

Might also be worth auditing the other goroutines the test is spinning off.

Fair point, yeah. I believe this was the only case of a Fatal or FailNow outside the main test goroutine.

pkg/kv/kvserver/client_unexpected_commit_test.go line 452 at r3 (raw file):

Previously, arulajmani (Arul Ajmani) wrote…

"yet so"

Done.

pkg/kv/kvserver/client_unexpected_commit_test.go line 466 at r3 (raw file):

Previously, arulajmani (Arul Ajmani) wrote…

What would happen if we moved the lease before the recovery kicked off?

Separately, could we get rid of the leaseMoveReady channel if we triggered the lease move here?

The lease move doesn't have any effect on the recovery.

That said, I think in terms of readability/understandability, it can help significantly if cluster changes like the lease transfer don't occur from within a DistSender testing knob - that way the maybeWait function can simply perform blocking/signaling/logging, and leave the operations elsewhere. This helps with writing additional tests too.

pkg/kv/kvserver/client_unexpected_commit_test.go line 490 at r3 (raw file):

Previously, arulajmani (Arul Ajmani) wrote…

This deserves its own issue. Should we file one and link it here?

I have some tests I am working on to add in follow-up commits, but would like to get this test merged first.

pkg/kv/kvserver/client_unexpected_commit_test.go line 535 at r3 (raw file):

Previously, arulajmani (Arul Ajmani) wrote…

nit: extra spaces, here and below.

Done.

AlexTalks

Dismissed @arulajmani from 4 discussions.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @arulajmani and @nvanbenschoten)

pkg/kv/kvserver/client_unexpected_commit_test.go line 436 at r3 (raw file):

Previously, AlexTalks (Alex Sarkesian) wrote…

Fair point, yeah. I believe this was the only case of a Fatal or FailNow outside the main test goroutine.

LGTM

AlexTalks

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @arulajmani and @nvanbenschoten)

pkg/kv/kvserver/client_unexpected_commit_test.go line 122 at r3 (raw file):

Previously, arulajmani (Arul Ajmani) wrote…

nit: Should we reword this comment to describe what the current behaviour is (and what the test is asserting) instead? And then describe the desired behaviour we hope to achieve with a future fix ("an ambiguous error should be returned to the client")

It does say that in the comments here (see the lines starting NB and TODO) as well as in the comments next to the actual checking of the error.

pkg/kv/kvserver/client_unexpected_commit_test.go line 413 at r3 (raw file):

Previously, arulajmani (Arul Ajmani) wrote…

Here, and elsewhere, consider adding:

select {
 ...
  case <-time.After(10 * time.Second):
     t.Error("test timed out")  
}

Wanted to avoid putting times in here if possible to avoid flakes. If there is a legitimate timeout these operations should error immediately and we'll see it in the output.

pkg/kv/kvserver/client_unexpected_commit_test.go line 418 at r3 (raw file):

Previously, AlexTalks (Alex Sarkesian) wrote…

No, this (and the SQL comments) were because this was based on a real-world example seen in the bank workload. What is critical is the GetForUpdates.

That said, the Puts in txn2 are relevant because there is the possibility of txn1's retry hitting a WriteTooOld if it sees txn2's intent.

nvanbenschoten

At a high level, this looks great! Nice job exercising such a complex series of events.

I'll defer to @arulajmani for the final approval on this PR.

Reviewed 3 of 4 files at r1.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @AlexTalks and @arulajmani)

pkg/kv/kvserver/client_unexpected_commit_test.go line 11 at r1 (raw file):

// licenses/APL.txt.

package kvserver_test

At its core, this is a test of client-side behavior. Should we move the file to pkg/kv/kvclient/kvcoord under a kvcoord_test package?

pkg/kv/kvserver/client_unexpected_commit_test.go line 74 at r4 (raw file):

	nID roachpb.NodeID

	beforeSend func(context.Context, interceptedReq) (bool, *kvpb.BatchResponse, error)

It's not very clear to a reader here, in the declaration of tMu, or in the definition of each of tMu's functions what the bool return arg controls. Consider naming them.

pkg/kv/kvserver/client_unexpected_commit_test.go line 173 at r4 (raw file):

	tMu := struct {
		syncutil.RWMutex
		filter    func(req interceptedReq) bool

Consider giving comments to each of these functions, describing their role. e.g. "filter intercepts each request and decides ..."

pkg/kv/kvserver/client_unexpected_commit_test.go line 188 at r4 (raw file):

					defer tMu.RUnlock()

					if tMu.filter != nil && tMu.filter(req) {

What is the benefit of filtering if tMu.maybeWait is nil? Just the logging? If so, that's fine, but add a comment to avoid confusion.

pkg/kv/kvserver/client_unexpected_commit_test.go line 189 at r4 (raw file):

					if tMu.filter != nil && tMu.filter(req) {
						opID++

nit: is this going to confuse the race detector?

AlexTalks

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @arulajmani and @nvanbenschoten)

pkg/kv/kvserver/client_unexpected_commit_test.go line 316 at r3 (raw file):

Previously, arulajmani (Arul Ajmani) wrote…

Let's get rid of this.

Done.

pkg/kv/kvserver/client_unexpected_commit_test.go line 189 at r4 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

nit: is this going to confuse the race detector?

Good catch, I wasn't running this under race previously. It's now enabled (though if I should disable it, let me know).

AlexTalks

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @arulajmani and @nvanbenschoten)

pkg/kv/kvserver/client_unexpected_commit_test.go line 379 at r3 (raw file):

Previously, arulajmani (Arul Ajmani) wrote…

Let's get rid of this as well.

Done.

arulajmani

This feels close. Should be good to go once the open comments are resolved.

Reviewed 1 of 4 files at r5, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @AlexTalks and @nvanbenschoten)

pkg/kv/kvclient/kvcoord/dist_sender_ambiguous_test.go line 96 at r6 (raw file):

// the transport for kvpb.BatchRequest in the kvcoord.DistSender.
//
// If the override functions beforeSend/afterSend are set, they are called

It might be clearer to make this commentary specific to each function and move it above the field declaration.

pkg/kv/kvclient/kvcoord/dist_sender_ambiguous_test.go line 505 at r6 (raw file):

		// response so we can inject network failure.
		if req.txnName == "txn1" && hasPut && req.toNodeID == tc.Server(1).NodeID() && cp == AfterSending {
			// Once we have seen the write on txn1 to n2 that we will fail, txn2 can start.

nit: wrap comment to 80 chars.

pkg/kv/kvclient/kvcoord/dist_sender_ambiguous_test.go line 511 at r6 (raw file):

		// 5. txn2->n1: Get(a) -- Discovers txn1's locks, issues push request.
		// 6. txn2->n2: Get(b)
		// 7. _->n1: PushTxn(txn2->txn1) -- Discovers txn1 in STAGING and starts recovery.

nit: comment over 80 chars.

pkg/kv/kvserver/client_unexpected_commit_test.go line 142 at r3 (raw file):

Previously, AlexTalks (Alex Sarkesian) wrote…

Not strictly necessary, but I found it very helpful in conceptualizing the test and reading the output in any case.

I'll defer to you as the author, but mild preference to keep things as simple as possible.

pkg/kv/kvserver/client_unexpected_commit_test.go line 228 at r3 (raw file):

Previously, AlexTalks (Alex Sarkesian) wrote…

I didn't think that actually worked? I still saw logs showing timestamps being closed even when this was set to 0.

Weird. I was just reading the setting description. Never mind me.

pkg/kv/kvserver/client_unexpected_commit_test.go line 290 at r3 (raw file):

Previously, AlexTalks (Alex Sarkesian) wrote…

No, this is specific - if you look at the code, you'll see that require.X basically call assert.X and then t.FailNow(), and IIUC t.FailNow() isn't supposed to be called from goroutines other than the main test goroutine.

Ack. Thanks for the clarification!

pkg/kv/kvserver/client_unexpected_commit_test.go line 302 at r3 (raw file):

Previously, AlexTalks (Alex Sarkesian) wrote…

Can remove it, but it can be helpful in debugging a failure - that's the primary purpose here. If needed we can assert on the values.

Let's use asserts then. That way, the expectations are codified. People unfamiliar with this test might not find such log lines particularly helpful as it's not clear what to expect and what not to.

pkg/kv/kvserver/client_unexpected_commit_test.go line 379 at r3 (raw file):

Previously, AlexTalks (Alex Sarkesian) wrote…

Done.

Did you mean to remove this?

pkg/kv/kvserver/client_unexpected_commit_test.go line 397 at r3 (raw file):

Previously, AlexTalks (Alex Sarkesian) wrote…

No, this is so that if a test fails, it will fail quickly.

I don't follow how that can happen -- won't the stopper only quiesce once the test's goroutine is done? And that can't happen because of the wg.Wait() call below, right? Am I missing something?

Either way, I think it'll help if we switch to the <-time.After(DefaultSucceedsSoonDuration) pattern here.

pkg/kv/kvserver/client_unexpected_commit_test.go line 413 at r3 (raw file):

Previously, AlexTalks (Alex Sarkesian) wrote…

Wanted to avoid putting times in here if possible to avoid flakes. If there is a legitimate timeout these operations should error immediately and we'll see it in the output.

If we're worried about flakes, maybe we can use a more conservative value here (such as DefaultSucceedsSoonDuration)? We use that pattern elsewhere in our code (everything that uses the SucceedsSoon wrapper). It also means that the test will fail quickly in CI as opposed to causing package timeouts, which are annoying.

pkg/kv/kvserver/client_unexpected_commit_test.go line 418 at r3 (raw file):

Previously, AlexTalks (Alex Sarkesian) wrote…

That said, the Puts in txn2 are relevant because there is the possibility of txn1's retry hitting a WriteTooOld if it sees txn2's intent.

That's an interesting point. For that to happen, one of txn1's initial writes should have failed. We're not doing that in this test, so can we simplify txn2 here? Maybe just a simple get request to trigger recovery?

pkg/kv/kvserver/client_unexpected_commit_test.go line 466 at r3 (raw file):
Ack, fair point on the readability front.

The lease move doesn't have any effect on the recovery.

In that case, could we be a bit loser with the synchronization here? Does something like this work?

--- a/pkg/kv/kvclient/kvcoord/dist_sender_ambiguous_test.go
+++ b/pkg/kv/kvclient/kvcoord/dist_sender_ambiguous_test.go
@@ -504,21 +504,15 @@ func TestTransactionUnexpectedlyCommitted(t *testing.T) {
                if req.txnName == "txn1" && hasPut && req.toNodeID == tc.Server(1).NodeID() && cp == AfterSending {
                        // Once we have seen the write on txn1 to n2 that we will fail, txn2 can start.
                        close(txn2Ready)
+                       close(leaseMoveReady)
                }

                // 5. txn2->n1: Get(a) -- Discovers txn1's locks, issues push request.
                // 6. txn2->n2: Get(b)
                // 7. _->n1: PushTxn(txn2->txn1) -- Discovers txn1 in STAGING and starts recovery.
                // 8. _->n1: RecoverTxn(txn1) -- Before sending, pause the request so we
-               // can ensure it gets evaluated after txn1 retries (and refreshes), but before
-               // its final EndTxn.
-               if req.ba.IsSingleRecoverTxnRequest() && cp == BeforeSending {
-                       // Once the RecoverTxn request is issued, as part of txn2's PushTxn
-                       // request, the lease can be moved.
-                       close(leaseMoveReady)
-               }
+               // Simultaneously, transfer b's lease to n1.

-               // <transfer b's lease to n1>
                // <inject a network failure and finally allow (4) txn1->n2: Put(b) to return with error>
                if req.txnName == "txn1" && hasPut && req.toNodeID == tc.Server(1).NodeID() && cp == AfterSending {
                        // Hold the operation open until we are ready to retry on the new replica,

pkg/kv/kvserver/client_unexpected_commit_test.go line 490 at r3 (raw file):

Previously, AlexTalks (Alex Sarkesian) wrote…

I have some tests I am working on to add in follow-up commits, but would like to get this test merged first.

Should we file a github issue describing this scenario though? Your call on whether you want to link that issue here or not.

pkg/kv/kvserver/client_unexpected_commit_test.go line 535 at r3 (raw file):

Previously, AlexTalks (Alex Sarkesian) wrote…

Done.

There's still an extra space at the start of these lines.

This adds a unit test to reproduce the behavior described in cockroachdb#103817 and seen in cockroachdb#67765, which currently is a bug in our implementation of the parallel commit protocol that results in the assertion failure known as `transaction unexpectedly committed`. The test currently validates the incorrect behavior of the known issue, though it is inded to be used to validate the potential fixes as proposed in cockroachdb#103817. Release note: None Part of: cockroachdb#103817

AlexTalks

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @arulajmani and @nvanbenschoten)

pkg/kv/kvclient/kvcoord/dist_sender_ambiguous_test.go line 96 at r6 (raw file):

Previously, arulajmani (Arul Ajmani) wrote…

It might be clearer to make this commentary specific to each function and move it above the field declaration.

Done.

pkg/kv/kvserver/client_unexpected_commit_test.go line 142 at r3 (raw file):

Previously, arulajmani (Arul Ajmani) wrote…

I'll defer to you as the author, but mild preference to keep things as simple as possible.

Done.

pkg/kv/kvserver/client_unexpected_commit_test.go line 302 at r3 (raw file):

Previously, arulajmani (Arul Ajmani) wrote…

Let's use asserts then. That way, the expectations are codified. People unfamiliar with this test might not find such log lines particularly helpful as it's not clear what to expect and what not to.

I modified this to only output on failure and added comments explaining, that way it is only used in help for debugging failing tests.

pkg/kv/kvserver/client_unexpected_commit_test.go line 379 at r3 (raw file):

Previously, arulajmani (Arul Ajmani) wrote…

Did you mean to remove this?

I meant to leave this in but I added comments explaining the reasoning (and removed the other unnecessary SQL DDL stuff).

pkg/kv/kvserver/client_unexpected_commit_test.go line 397 at r3 (raw file):

Previously, arulajmani (Arul Ajmani) wrote…

I don't follow how that can happen -- won't the stopper only quiesce once the test's goroutine is done? And that can't happen because of the wg.Wait() call below, right? Am I missing something?

Either way, I think it'll help if we switch to the <-time.After(DefaultSucceedsSoonDuration) pattern here.

Done.

pkg/kv/kvserver/client_unexpected_commit_test.go line 413 at r3 (raw file):

Previously, arulajmani (Arul Ajmani) wrote…

If we're worried about flakes, maybe we can use a more conservative value here (such as DefaultSucceedsSoonDuration)? We use that pattern elsewhere in our code (everything that uses the SucceedsSoon wrapper). It also means that the test will fail quickly in CI as opposed to causing package timeouts, which are annoying.

Done.

pkg/kv/kvserver/client_unexpected_commit_test.go line 418 at r3 (raw file):

Previously, arulajmani (Arul Ajmani) wrote…

That's an interesting point. For that to happen, one of txn1's initial writes should have failed. We're not doing that in this test, so can we simplify txn2 here? Maybe just a simple get request to trigger recovery?

That's actually not the case - once txn2's recovery of txn1 completes, the intents can be cleaned up and txn2's puts can continue (there is code to ensure this doesn't happen in this initial simple case).

If you look at the followup, #108001, the expectations on txn2 do apply in several cases. I'm not sure it would be good to remove things here that will be immediately put back in later. I think it actually helps the test to validate that we aren't getting the WriteTooOld here.

pkg/kv/kvserver/client_unexpected_commit_test.go line 466 at r3 (raw file):

Previously, arulajmani (Arul Ajmani) wrote…

Ack, fair point on the readability front.

The lease move doesn't have any effect on the recovery.

In that case, could we be a bit loser with the synchronization here? Does something like this work?

--- a/pkg/kv/kvclient/kvcoord/dist_sender_ambiguous_test.go
+++ b/pkg/kv/kvclient/kvcoord/dist_sender_ambiguous_test.go
@@ -504,21 +504,15 @@ func TestTransactionUnexpectedlyCommitted(t *testing.T) {
                if req.txnName == "txn1" && hasPut && req.toNodeID == tc.Server(1).NodeID() && cp == AfterSending {
                        // Once we have seen the write on txn1 to n2 that we will fail, txn2 can start.
                        close(txn2Ready)
+                       close(leaseMoveReady)
                }

                // 5. txn2->n1: Get(a) -- Discovers txn1's locks, issues push request.
                // 6. txn2->n2: Get(b)
                // 7. _->n1: PushTxn(txn2->txn1) -- Discovers txn1 in STAGING and starts recovery.
                // 8. _->n1: RecoverTxn(txn1) -- Before sending, pause the request so we
-               // can ensure it gets evaluated after txn1 retries (and refreshes), but before
-               // its final EndTxn.
-               if req.ba.IsSingleRecoverTxnRequest() && cp == BeforeSending {
-                       // Once the RecoverTxn request is issued, as part of txn2's PushTxn
-                       // request, the lease can be moved.
-                       close(leaseMoveReady)
-               }
+               // Simultaneously, transfer b's lease to n1.

-               // <transfer b's lease to n1>
                // <inject a network failure and finally allow (4) txn1->n2: Put(b) to return with error>
                if req.txnName == "txn1" && hasPut && req.toNodeID == tc.Server(1).NodeID() && cp == AfterSending {
                        // Hold the operation open until we are ready to retry on the new replica,

I put that case into #108001 so would like to leave this basic case as is if we can.

pkg/kv/kvserver/client_unexpected_commit_test.go line 490 at r3 (raw file):

Previously, arulajmani (Arul Ajmani) wrote…

Should we file a github issue describing this scenario though? Your call on whether you want to link that issue here or not.

IMO it should be covered by #103817.

pkg/kv/kvserver/client_unexpected_commit_test.go line 535 at r3 (raw file):

Previously, arulajmani (Arul Ajmani) wrote…

There's still an extra space at the start of these lines.

Fixed.

While previously, RPC failures were assumed to be retryable, as write operations (with the notable exception of `EndTxn`) were assumed to be idempotent, it has been seen in cockroachdb#67765 and documented in cockroachdb#103817 that RPC failures on write operations that occur in parallel with a commit (i.e. a partial batch where `withCommit==true`), it is not always possible to assume idempotency and retry the "ambiguous" writes. This is due to the fact that the retried write RPC could result in the transaction's `WriteTimestamp` being bumped, changing the commit timestamp of the transaction that may in fact already be implicitly committed if the initial "ambiguous" write actually succeeded. This change modifies the protocol of the DistSender to flag in subsequent retries that a batch with a commit has previously experienced ambiguity, as well as the handling of the retried write in the MVCC layer to detect this previous ambiguity and reject retries that change the write timestamp as a non-idempotent replay. The flag allows subsequent retries to "remember" the earlier ambiguous write and evaluate accordingly. The flag allows us to properly handle RPC failures (i.e. ambiguous writes) that occur on commit, as a transaction that is implicitly committed is eligible to be marked as explicitly committed by contending transactions via the `RecoverTxn` operation, resulting in a race between retries by the transaction coordinator and recovery by contending transactions that could result in either incorrectly reporting a transaction as having failed with a `RETRY_SERIALIZABLE` error (despite the possibility that it already was or could be recovered and successfully committed), or in attempting to explicitly commit an already-recovered and committed transaction, resulting in seeing an assertion failure due to `transaction unexpectedly committed`. The replay protection introduced here allows us to avoid both of these situations by detecting a replay that should be considered non-idempotent and returning an error, causing the original RPC error remembered by the DistSender to be propagated as an `AmbiguousResultError`. As such, this can be handled by application code by validating the success/failure of a transaction when receiving this error. Depends on cockroachdb#107680, cockroachdb#107323, cockroachdb#108154, cockroachdb#108001 Fixes: cockroachdb#103817 Release note (bug fix): Properly handles RPC failures on writes using the parallel commit protocol that execute in parallel to the commit operation, avoiding incorrect retryable failures and `transaction unexpectedly committed` assertions by detecting when writes cannot be retried idempotently, instead returning an `AmbiguousResultError`.

107658: kv: enable replay protection for ambiguous writes on commits r=AlexTalks a=AlexTalks While previously, RPC failures were assumed to be retriable, as write operations (with the notable exception of `EndTxn`) were assumed to be idempotent, it has been seen in #67765 and documented in #103817 that RPC failures on write operations that occur in parallel with a commit (i.e. a partial batch where `withCommit==true`), it is not always possible to assume idempotency and retry the "ambiguous" writes. This is due to the fact that the retried write RPC could result in the transaction's `WriteTimestamp` being bumped, changing the commit timestamp of the transaction that may in fact already be implicitly committed if the initial "ambiguous" write actually succeeded. This change modifies the protocol of the DistSender to flag in subsequent retries that a batch with a commit has previously experienced ambiguity, as well as the handling of the retried write in the MVCC layer to detect this previous ambiguity and reject retries that change the write timestamp as a non-idempotent replay. The flag allows subsequent retries to "remember" the earlier ambiguous write and evaluate accordingly. The flag allows us to properly handle RPC failures (i.e. ambiguous writes) that occur on commit, as a transaction that is implicitly committed is eligible to be marked as explicitly committed by contending transactions via the `RecoverTxn` operation, resulting in a race between retries by the transaction coordinator and recovery by contending transactions that could result in either incorrectly reporting a transaction as having failed with a `RETRY_SERIALIZABLE` error (despite the possibility that it already was or could be recovered and successfully committed), or in attempting to explicitly commit an already-recovered and committed transaction, resulting in seeing an assertion failure due to `transaction unexpectedly committed`. The replay protection introduced here allows us to avoid both of these situations by detecting a replay that should be considered non-idempotent and returning an error, causing the original RPC error remembered by the DistSender to be propagated as an `AmbiguousResultError`. As such, this can be handled by application code by validating the success/failure of a transaction when receiving this error. Depends on #107680, #107323, #108154, #108001 Fixes: #103817 Release note (bug fix): Properly handles RPC failures on writes using the parallel commit protocol that execute in parallel to the commit operation, avoiding incorrect retriable failures and `transaction unexpectedly committed` assertions by detecting when writes cannot be retried idempotently, instead returning an `AmbiguousResultError`. Co-authored-by: Alex Sarkesian <[email protected]>

In cockroachdb#107323, testing for the ambiguous write case that leads to the "transaction unexpectedly committed" bug were introduced, however to increase test coverage of the fix, multiple schedules of operations need to be tested. This change simply refactors the framework of the existing test in order to enable the addition of muliple subtests. The subtests are included in a separate patch. Part of: cockroachdb#103817 Release note: None

This change builds on the testing introduced in cockroachdb#107323 with additional subtests evaluating the behavior of different race outcomes with contended transactions where the first transaction experiences an RPC failure (i.e. an ambiguous write). Part of: cockroachdb#103817 Release note: None