Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

storage: unexpected COMMITTED state, likely due to parallel commits #37866

Closed
nvanbenschoten opened this issue May 26, 2019 · 15 comments · Fixed by #37987
Closed

storage: unexpected COMMITTED state, likely due to parallel commits #37866

nvanbenschoten opened this issue May 26, 2019 · 15 comments · Fixed by #37987
Assignees
Labels
A-kv-client Relating to the KV client and the KV interface. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.

Comments

@nvanbenschoten
Copy link
Member

nvanbenschoten commented May 26, 2019

We've seen a few instances of this assertion fire because a TxnCoordSender in the txnPending state has a COMMITTED transaction proto. The assertion failure looks like:

unexpected txn state: "sql txn" id=2fb6a2e8 key=/Table/60/1/1875/0 rw=true pri=0.00955542 stat=COMMITTED epo=1 ts=1558613168.062142550,0 orig=1558613168.062142550,0 max=1558613168.562142550,0 wto=false seq=7 int=6

I've been able to reproduce this once by running TPC-C. When attempting to do so, I also saw a TransactionStatusError("already committed") error from cmd_end_transaction.go:211.

It's likely that these errors have the same underlying cause. Somehow a transaction record is being COMMITTED when a transaction coordinator does not expect it to be, which is probably related to parallel commits somehow. The first error comes when the coordinator's heartbeat finds the COMMITTED record and the second error comes when an EndTransactionRequest finds the COMMITTED record. The cause of the unexpected COMMITTED status is currently unknown.

I've downgraded the assertion to an error so that it is easier to debug in roachtest failures. Past that point, I won't be around for the next week to do too much debugging and track down the root cause. If this causes enough instability that we deem it necessary to temporarily disable parallel commits, here's the patch to do so (cc. @tbg):

Click to expand
diff --git a/pkg/kv/dist_sender_server_test.go b/pkg/kv/dist_sender_server_test.go
index 708ae2c762..0f269569bf 100644
--- a/pkg/kv/dist_sender_server_test.go
+++ b/pkg/kv/dist_sender_server_test.go
@@ -2442,7 +2442,7 @@ func TestTxnCoordSenderRetries(t *testing.T) {
 			// Parallel commits do not support the canForwardSerializableTimestamp
 			// optimization. That's ok because we need to removed that optimization
 			// anyway. See #36431.
-			txnCoordRetry: true,
+			txnCoordRetry: false,
 		},
 		{
 			name: "multi-range batch with forwarded timestamp and cput",
@@ -2478,9 +2478,10 @@ func TestTxnCoordSenderRetries(t *testing.T) {
 				b.Put("c", "put")
 				return txn.CommitInBatch(ctx, b) // both puts will succeed, et will retry from get
 			},
+			txnCoordRetry: true,
 			// Client-side retry required as this will be a mixed success due
 			// to parallel commits.
-			clientRetry: true,
+			clientRetry: false,
 		},
 		{
 			name: "multi-range batch with forwarded timestamp and cput and delete range",
@@ -2513,7 +2514,7 @@ func TestTxnCoordSenderRetries(t *testing.T) {
 			// Parallel commits do not support the canForwardSerializableTimestamp
 			// optimization. That's ok because we need to removed that optimization
 			// anyway. See #36431.
-			txnCoordRetry: true,
+			txnCoordRetry: false,
 		},
 		{
 			name: "multi-range batch with write too old and failed cput",
@@ -2627,9 +2628,10 @@ func TestTxnCoordSenderRetries(t *testing.T) {
 				return txn.CommitInBatch(ctx, b)
 			},
 			filter: newUncertaintyFilter(roachpb.Key([]byte("c"))),
+			txnCoordRetry: true,
 			// Client-side retry required as this will be a mixed success due
 			// to parallel commits.
-			clientRetry: true,
+			clientRetry: false,
 		},
 		{
 			name: "multi-range batch with uncertainty interval error and get conflict",
diff --git a/pkg/kv/txn_interceptor_committer.go b/pkg/kv/txn_interceptor_committer.go
index 8860cca99d..cccd182457 100644
--- a/pkg/kv/txn_interceptor_committer.go
+++ b/pkg/kv/txn_interceptor_committer.go
@@ -28,7 +28,7 @@ import (
 var parallelCommitsEnabled = settings.RegisterBoolSetting(
 	"kv.transaction.parallel_commits_enabled",
 	"if enabled, transactional commits will be parallelized with transactional writes",
-	true,
+	false,
 )
 
 // txnCommitter is a txnInterceptor that concerns itself with committing and
diff --git a/pkg/kv/txn_interceptor_committer_test.go b/pkg/kv/txn_interceptor_committer_test.go
index d02efc9937..1e7df7c496 100644
--- a/pkg/kv/txn_interceptor_committer_test.go
+++ b/pkg/kv/txn_interceptor_committer_test.go
@@ -28,9 +28,11 @@ import (
 )
 
 func makeMockTxnCommitter() (txnCommitter, *mockLockedSender) {
+	st := cluster.MakeTestingClusterSettings()
+	parallelCommitsEnabled.Override(&st.SV, true)
 	mockSender := &mockLockedSender{}
 	return txnCommitter{
-		st:      cluster.MakeTestingClusterSettings(),
+		st:      st,
 		stopper: stop.NewStopper(),
 		wrapped: mockSender,
 		mu:      new(syncutil.Mutex),
@nvanbenschoten nvanbenschoten added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. A-kv-client Relating to the KV client and the KV interface. labels May 26, 2019
@nvanbenschoten nvanbenschoten self-assigned this May 26, 2019
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue May 26, 2019
See cockroachdb#37866.

This has been failing in roachtests like cockroachdb#37488. This will make
it easier to debug until the issue is fixed.

Release note: None
craig bot pushed a commit that referenced this issue May 26, 2019
37867: kv: replace unexpected txn state assertion with error r=nvanbenschoten a=nvanbenschoten

See #37866.

This has been failing in roachtests like #37488. This will make it easier to debug until the issue is fixed.

Release note: None

Co-authored-by: Nathan VanBenschoten <[email protected]>
@tbg
Copy link
Member

tbg commented May 27, 2019

I went on a little survey of all of those crashes that I could find to see if I could spot any commonalities.

Ended up with seven results (https://gist.github.com/tbg/a3a40b6db38c079ff88122c537bc765b):

  • all of them are from a kvfetcher, i.e. we're trying to send some scan when the assertion fires
  • I didn't see any TransactionStatusErrors (only checked two runs, maybe I just don't see them in the logs but I'd expect them to be logged by the workload)
  • all of the txn protos from the assertion are COMMITTED, are from epoch 1 (orig = commit
  • in most runs the error happens early (i.e. first minute or so), there's some selection bias here since this is tpccbench and it doesn't run the workload for long periods of time
  • there are always intent records associated to the txns, and in a few cases also inflight writes. int (intent records) is always seqno-1, which confirms another time that there weren't any restarts
  • the txn records are anchored at various tableIDs (not sure if they're stable enough to draw conclusions from, but I think we're seeing more than one kind of txn run into this)
  • often but not always n1 is the one crashing.

@tbg
Copy link
Member

tbg commented May 27, 2019

I sprinkled some assertions that should fire earlier (i.e. when the weird update is made). Branch is at https://github.com/tbg/cockroach/tree/repro/tpcc-37866

There have been enough test failures here to hopefully catch this.

@tbg
Copy link
Member

tbg commented May 27, 2019

One of my assertions fired (the fatalf block below, the rest is regular TxnCoordSender code). The stack trace has a commit attempt (sql.(*connExecutor).commitSQLTransaction).

	// If we succeeded to commit, or we attempted to rollback, we move to
	// txnFinalized.
	req, ok := ba.GetArg(roachpb.EndTransaction)
	if ok {
		etReq := req.(*roachpb.EndTransactionRequest)
		if etReq.Commit {
			if pErr == nil {
				tc.mu.txnState = txnFinalized
				tc.cleanupTxnLocked(ctx)
				tc.maybeSleepForLinearizable(ctx, br, startNs)
			}
		} else {
			// Rollbacks always move us to txnFinalized.
			tc.mu.txnState = txnFinalized
			tc.cleanupTxnLocked(ctx)
		}
	}

	if tc.mu.txnState == txnPending && tc.mu.txn.Status == roachpb.COMMITTED {
		log.Fatalf(ctx, "TBG txnPending but committed %+v: %s; pErr=%v", req, tc.mu.txn, pErr)
	}
F190527 19:51:26.194094 21448 kv/txn_coord_sender.go:828  [n3,client=10.12.42.130:54118,user=root,txn=86c8ec4f] TBG txnPending but committed header:

endtxn:
<key:"\300\211\367\007a\215\210" sequence:29 > commit:true deadl
ine:<wall_time:1558986834840509439 > intent_spans:<key:"\302\211\367\007a\367\"\t\210" > intent_spans:<key:"\302\211\367\007a\367g\007\210" > intent_spans:<key:"\302\211\367\007a\367\242k\210" > intent_spans:<key
:"\302\211\367\007a\367\301\222\210" > intent_spans:<key:"\302\211\367\007a\370\001\002\003\210" > intent_spans:<key:"\302\211\367\007a\370\001\"\023\210" > intent_spans:<key:"\302\211\367\007a\370\001:\023\210"
> intent_spans:<key:"\302\211\367\007a\370\001\201\377\210" > intent_spans:<key:"\303\211\367\007a\215\206\364\016\210" > in_flight_writes:<key:"\300\211\367\007a\215\210" sequence:1 > in_flight_writes:<key:"\303
\212\367\007a\215\367\001\224\206\364\016\210" sequence:3 > in_flight_writes:<key:"\276\211\367\007a\215\367\013\361\210" sequence:4 > in_flight_writes:<key:"\275\211\367\007a\215\206\364\016\212\210" sequence:13
 > in_flight_writes:<key:"\275\212\367\007a\367\"\t\367\007a\215\367\013\361\212\210" sequence:14 > in_flight_writes:<key:"\275\211\367\007a\215\206\364\016\217\210" sequence:15 > in_flight_writes:<key:"\275\212\
367\007a\367g\007\367\007a\215\367\013\361\217\210" sequence:16 > in_flight_writes:<key:"\275\211\367\007a\215\206\364\016\213\210" sequence:17 > in_flight_writes:<key:"\275\212\367\007a\367\242k\367\007a\215\367
\013\361\213\210" sequence:18 > in_flight_writes:<key:"\275\211\367\007a\215\206\364\016\214\210" sequence:19 > in_flight_writes:<key:"\275\212\367\007a\367\301\222\367\007a\215\367\013\361\214\210" sequence:20 >
 in_flight_writes:<key:"\275\211\367\007a\215\206\364\016\215\210" sequence:21 > in_flight_writes:<key:"\275\212\367\007a\370\001\002\003\367\007a\215\367\013\361\215\210" sequence:22 > in_flight_writes:<key:"\27
5\211\367\007a\215\206\364\016\220\210" sequence:23 > in_flight_writes:<key:"\275\212\367\007a\370\001\"\023\367\007a\215\367\013\361\220\210" sequence:24 > in_flight_writes:<key:"\275\211\367\007a\215\206\364\01
6\211\210" sequence:25 > in_flight_writes:<key:"\275\212\367\007a\370\001:\023\367\007a\215\367\013\361\211\210" sequence:26 > in_flight_writes:<key:"\275\211\367\007a\215\206\364\016\216\210" sequence:27 > in_fl
ight_writes:<key:"\275\212\367\007a\370\001\201\377\367\007a\215\367\013\361\216\210" sequence:28 > :


tc.mu.txn:
"sql txn" id=86c8ec4f key=/Table/56/1/1889/5/0 rw=true pri=0.06032274 stat=COMMITTED epo=1 ts=1558986675.02063
3245,0 orig=1558986675.020633245,0 max=1558986675.520633245,0 wto=false seq=29 int=28 ifw=19


pErr=TransactionRetryWithProtoRefreshError: TransactionRetryError: retry txn (RETRY_ASYNC_WRITE_FAILURE): "sql txn" id
=86c8ec4f key=/Table/56/1/1889/5/0 rw=true pri=0.06032274 stat=PENDING epo=0 ts=1558986675.020633245,0 orig=1558986675.020633245,0 max=1558986675.520633245,0 wto=false seq=28
goroutine 21448 [running]:
github.com/cockroachdb/cockroach/pkg/util/log.getStacks(0xc000054301, 0xc000054300, 0x5586800, 0x16)
        /go/src/github.com/cockroachdb/cockroach/pkg/util/log/clog.go:1020 +0xd4
github.com/cockroachdb/cockroach/pkg/util/log.(*loggingT).outputLogEntry(0x5d15580, 0xc000000004, 0x5586826, 0x16, 0x33c, 0xc064725400, 0xb38)
        /go/src/github.com/cockroachdb/cockroach/pkg/util/log/clog.go:878 +0x93d
github.com/cockroachdb/cockroach/pkg/util/log.addStructured(0x3b729e0, 0xc05279c480, 0x4, 0x2, 0x3451d42, 0x2d, 0xc01785a6e0, 0x3, 0x3)
        /go/src/github.com/cockroachdb/cockroach/pkg/util/log/structured.go:85 +0x2d8
github.com/cockroachdb/cockroach/pkg/util/log.logDepth(0x3b729e0, 0xc05279c480, 0x1, 0x4, 0x3451d42, 0x2d, 0xc01785a6e0, 0x3, 0x3)
        /go/src/github.com/cockroachdb/cockroach/pkg/util/log/log.go:71 +0x8c
github.com/cockroachdb/cockroach/pkg/util/log.Fatalf(0x3b729e0, 0xc05279c480, 0x3451d42, 0x2d, 0xc01785a6e0, 0x3, 0x3)
        /go/src/github.com/cockroachdb/cockroach/pkg/util/log/log.go:182 +0x7e
github.com/cockroachdb/cockroach/pkg/kv.(*TxnCoordSender).Send(0xc0103bcd80, 0x3b729e0, 0xc05279c480, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
        /go/src/github.com/cockroachdb/cockroach/pkg/kv/txn_coord_sender.go:828 +0x64a
github.com/cockroachdb/cockroach/pkg/internal/client.(*DB).sendUsingSender(0xc00079a700, 0x3b729e0, 0xc05279c420, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
        /go/src/github.com/cockroachdb/cockroach/pkg/internal/client/db.go:661 +0x119
github.com/cockroachdb/cockroach/pkg/internal/client.(*Txn).Send(0xc047622120, 0x3b729e0, 0xc05279c420, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
        /go/src/github.com/cockroachdb/cockroach/pkg/internal/client/txn.go:793 +0x13c
github.com/cockroachdb/cockroach/pkg/internal/client.(*Txn).commit(0xc047622120, 0x3b729e0, 0xc05279c420, 0xc00048aa80, 0x2100c1b)
        /go/src/github.com/cockroachdb/cockroach/pkg/internal/client/txn.go:492 +0x169
github.com/cockroachdb/cockroach/pkg/internal/client.(*Txn).Commit(0xc047622120, 0x3b729e0, 0xc05279c420, 0x0, 0x0)
        /go/src/github.com/cockroachdb/cockroach/pkg/internal/client/txn.go:519 +0x3f
github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).commitSQLTransaction(0xc009979600, 0x3b729e0, 0xc05279c420, 0x3b77f60, 0xc051c64810, 0x3b729e0, 0xc05279c420, 0xc0110e8310, 0xc02b4773f0)

This doesn't exactly tell us when the COMMITTED status leaked into the proto (it wasn't from pErr; that has a PENDING status) but I also have this auxiliary logging which confirms that the offending transaction was successfully recovered:

$ grep -R 86c8ec4f artifacts/
artifacts/tpccbench/nodes=3/cpu=16/2.logs/cockroach.log:I190527 19:51:22.127806 2946879 storage/txnrecovery/manager.go:154  [n2,s2,txn-recovery] TBG recovering id=86c8ec4f key=/Table/56/1/1889/5/0 rw=true pri=0.06032274 stat=STAGING epo=0 ts=1558986675.020633245,0 orig=1558986675.020633245,0 max=0.000000000,0 wto=false seq=29 int=9 ifw=19
artifacts/tpccbench/nodes=3/cpu=16/2.logs/cockroach.log:I190527 19:51:23.166971 2946879 storage/txnrecovery/manager.go:319  [n2,s2,txn-recovery] TBG sending Recover(commit=true) for id=86c8ec4f key=/Table/56/1/1889/5/0 rw=true pri=0.06032274 stat=STAGING epo=0 ts=1558986675.020633245,0 orig=1558986675.020633245,0 max=0.000000000,0 wto=false seq=29 int=9 ifw=19

What's interesting here is the epoch zero. The failures all have epoch one, which means there was some restart attempt involved. I'll have to dig some more, but tentatively I assume that the bug is something like this

  1. txn attempts to parallel commit at epo=0
  2. somehow doesn't receive all the acks, though they do happen
  3. concurrent writer finishes and commits the txn
  4. the committed proto somehow leaks into the restarted txn and triggers the assertion

The problem of course is real. If the txn is in fact committed, the client shouldn't be able to accidentally restart and commit. I think the write timestamp cache prevents it from committing (have to double check that, but this sounds like something @nvanbenschoten thought of by using the epoch zero timestamp), so the major problem might be getting the bookkeeping right.

@tbg
Copy link
Member

tbg commented May 27, 2019

There's no code path (that I don't have an assertion on) that would reset txnState back from finalized to pending. So we must be ingesting an update that moves us to Committed.

I looked some more and discovered that heartbeats also write to the txn (although under a different name). This led me to the following interesting piece of code:

// Update our txn. In particular, we need to make sure that the client will
// notice when the txn has been aborted (in which case we'll give them an
// error on their next request).
//
// TODO(nvanbenschoten): It's possible for a HeartbeatTxn request to observe
// the result of an EndTransaction request and beat it back to the client.
// This is an issue when a COMMITTED txn record is GCed and later re-written
// as ABORTED. The coordinator's local status could flip from PENDING to
// ABORTED (after heartbeat response) to COMMITTED (after commit response).
// This appears to be benign, but it's still somewhat disconcerting. If this
// ever causes any issues, we'll need to be smarter about detecting this race
// on the client and conditionally ignoring the result of heartbeat responses.
if respTxn != nil {
if respTxn.Status == roachpb.STAGING {
// Consider STAGING transactions to be PENDING for the purpose of
// the heartbeat loop. Interceptors above the txnCommitter should
// be oblivious to parallel commits.
respTxn.Status = roachpb.PENDING
}
h.mu.txn.Update(respTxn)
if h.mu.txn.Status != roachpb.PENDING {
if h.mu.txn.Status == roachpb.ABORTED {
log.VEventf(ctx, 1, "Heartbeat detected aborted txn. Cleaning up.")
h.abortTxnAsyncLocked(ctx)
}
return false
}
}
return true
}

Sure seems likely that the txn gets committed, and then the heartbeat picks up the committed proto but doesn't do the bookkeeping around it. I'll verify with a print statement before I fix it.

@nvanbenschoten
Copy link
Member Author

but tentatively I assume that the bug is something like this

  1. txn attempts to parallel commit at epo=0
  2. somehow doesn't receive all the acks, though they do happen
  3. concurrent writer finishes and commits the txn
  4. the committed proto somehow leaks into the restarted txn and triggers the assertion

This is roughly what I was thinking was going on as well. I mentioned above that it's likely that the assertion failure happens when a heartbeat request finds the COMMITTED record and the TransactionStatusErrors happens when the EndTransaction finds the COMMITTED record first. This would allow a single issue to explain both of these symptoms.

I think the write timestamp cache prevents it from committing

Yes it does, so the transaction wouldn't be able to commit a second time. The big question is why the transaction is being COMMITTED (probably by a txn recovery attempt) when the client isn't expecting it to be. There should be no ambiguity to the TxnCoordSender about whether a parallel commit attempt resulted in an implicit commit state (mod AmbiguousCommitErrors). Maybe there's a bug in needTxnRetryAfterStaging.

I meant to mention above that one thing that might help us reproduce this is to enable RecoverIndeterminateCommitsOnFailedPushes when running TPC-C. This will cause all transaction pushes on STAGING transaction records to immediately kick off transaction recovery.

@tbg
Copy link
Member

tbg commented May 28, 2019

Confirmed the sequence of events above. No idea yet why the client is attempting to commit, going to dig into that next. As @nvanbenschoten mentioned, we'd expect the client to either know that the parallel commit succeeded, or know that it definitely failed, or not know either way. The "not know either way" case is the one likely to be mishandled here. For example, I could see this code

// If the batch resulted in an error but the EndTransaction request
// succeeded, staging the transaction record in the process, downgrade
// the status back to PENDING. Even though the transaction record may
// have a status of STAGING, we know that the transaction failed to
// implicitly commit, so interceptors above the txnCommitter in the
// stack don't need to be made aware that the record is staging.
if txn := pErr.GetTxn(); txn != nil && txn.Status == roachpb.STAGING {
pErr.SetTxn(cloneWithStatus(txn, roachpb.PENDING))
}

downgrade from STAGING to PENDING, and if the error doesn't terminate the txn we can see a restart. Naively this might happen if the error is just a txn retry error. I'll add logging there to see if we get anything interesting.

$ ./cockroach debug merge-logs --filter cda3b649 --file-pattern '(.*)' artifacts/tpccbench/nodes\=3/cpu\=16/*.logs/cockroach.log | head -n 10 | pbcopy
> I190528 09:34:02.322148 3382140 storage/txnrecovery/manager.go:154  [n2,s2,txn-recovery] TBG recovering id=cda3b649 key=/Table/60/1/1764/0 rw=true pri=0.00716429 stat=STAGING epo=0 ts=1559036034.585618629,0 orig=1559036034.585618629,0 max=0.000000000,0 wto=false seq=7 int=2 ifw=4
> I190528 09:34:02.324450 3382140 storage/txnrecovery/manager.go:319  [n2,s2,txn-recovery] TBG sending Recover(commit=true) for id=cda3b649 key=/Table/60/1/1764/0 rw=true pri=0.00716429 stat=STAGING epo=0 ts=1559036034.585618629,0 orig=1559036034.585618629,0 max=0.000000000,0 wto=false seq=7 int=2 ifw=4
> I190528 09:34:03.656717 2587878 kv/txn_interceptor_heartbeater.go:455  [n1,txn-hb=00000000] TBG feeding committed txn:
 new:id=cda3b649 key=/Table/60/1/1764/0 rw=true pri=0.00716429 stat=COMMITTED epo=0 ts=1559036034.585618629,0 orig=1559036034.585618629,0 max=0.000000000,0 wto=false seq=7 int=6
cur:"sql txn" id=cda3b649 key=/Table/60/1/1764/0 rw=true pri=0.00716429 stat=PENDING epo=0 ts=1559036034.585618629,0 orig=1559036034.585618629,0 max=1559036035.085618629,0 wto=false seq=6
> F190528 09:34:04.598888 24436 kv/txn_coord_sender.go:828  [n1,client=10.12.43.217:50302,user=root,txn=cda3b649] TBG txnPending but committed header:<key:"\304\211\367\006\344\210" sequence:7 > commit:true deadline:<wall_time:1559036296950098557 > intent_spans:<key:"\300\211\367\006\344\215\210" > intent_spans:<key:"\303\211\367\006\344\215\367\005\000\210" > in_flight_writes:<key:"\304\211\367\006\344\210" sequence:1 > in_flight_writes:<key:"\302\211\367\006\344\022s<{\370\347hM#\262\342:\361\271Yz\337\000\001\210" sequence:4 > in_flight_writes:<key:"\302\212\367\006\344\215\367\005\000\367\006\344\022s<{\370\347hM#\262\342:\361\271Yz\337\000\001\210" sequence:5 > in_flight_writes:<key:"\302\213\367\006\344\215\022s<{\370\347hM#\262\342:\361\271Yz\337\000\001\210" sequence:6 > : "sql txn" id=cda3b649 key=/Table/60/1/1764/0 rw=true pri=0.00716429 stat=COMMITTED epo=1 ts=1559036034.585618629,0 orig=1559036034.585618629,0 max=1559036035.085618629,0 wto=false seq=7 int=6; pErr=TransactionRetryWithProtoRefreshError: TransactionRetryError: retry txn (RETRY_ASYNC_WRITE_FAILURE): "sql txn" id=cda3b649 key=/Table/60/1/1764/0 rw=true pri=0.00716429 stat=PENDING epo=0 ts=1559036034.585618629,0 orig=1559036034.585618629,0 max=1559036035.085618629,0 wto=false seq=6
goroutine 24436 [running]:
github.com/cockroachdb/cockroach/pkg/util/log.getStacks(0xc0000cc301, 0xc0000cc3c0, 0x5586900, 0x16)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/log/clog.go:1020 +0xd4
github.com/cockroachdb/cockroach/pkg/util/log.(*loggingT).outputLogEntry(0x5d15580, 0xc000000004, 0x5586916, 0x16, 0x33c, 0xc043c86a00, 0x4be)


@tbg
Copy link
Member

tbg commented May 28, 2019

I made

case roachpb.COMMITTED:
return result.Result{}, roachpb.NewTransactionCommittedStatusError()
an assertion. It failed quickly in the scenario in which a recovery process beats the commit (for the same epoch), on the attempt to move the txn to staging status. This is expected to be frequent with the modifications I'm running with, but it may be a red herring.

I'll make that assertion a warning to see if it translates itself into an assertion failure higher up in the stack.

@nvanbenschoten
Copy link
Member Author

nvanbenschoten commented May 28, 2019

We expect to hit that code path frequently (especially with RecoverIndeterminateCommitsOnFailedPushes) when marking an implicit commit as explicit, but that should be ignored on the client.

However, it would be unexpected to hit that code path when the EndTransactionRequest has in-flight writes attached.

@tbg
Copy link
Member

tbg commented May 28, 2019

However, it would be unexpected to hit that code path when the EndTransactionRequest has in-flight writes attached.

Hmm, there's something going on that I don't understand. I am indeed seeing this code path hit by staging EndTransactions with inflight writes all the time (in fact 100% of the time in the log I'm looking at which has maybe a dozen occurrences). The only explanation I have for this is that EndTransaction(stage) routinely gets executed more than once under certain scenarios, so that we can have the interleaving

  • txn stages
  • txn is committed via recovery
  • txn stages (identical to first step), runs into the already committed EndTransaction code path

I'm running with "nuclear logging" (logging every request that doesn't early exit in the txn committer), so I should know more soon.

@tbg
Copy link
Member

tbg commented May 28, 2019

I got another repro with a bit more logging, which I think corresponds to the following sequence of events:

  1. txn starts, does stuff, tries to parallel commit (final batch is 1 EndTxn, 29 QueryIntent)
  2. contending txn recover-commits the txn. This is possible because the final batch above doesn't actually lay down any intents, so as soon as the txn record is there it's fully committed if the intents are already there)
  3. heartbeat loop picks up the committed record, feeds it into the proto
  4. txn receives from the final batch a *MixedSuccessError{Wrapped: &roachpb.IntentMissingError{}} because some of its intents have already been resolved (and so the txnid is lost).
  5. the txn refresher realizes that it can't refresh across a *MixedSuccessError, but it does unwrap it and returns the *IntentMissingError up the stack:
    if aPSErr, ok := pErr.GetDetail().(*roachpb.MixedSuccessError); ok {
    log.VEventf(ctx, 2, "got partial success; cannot retry %s (pErr=%s)", ba, aPSErr.Wrapped)
    return nil, aPSErr.Wrapped, hlc.Timestamp{}
    }
  6. the txnPipeliner transforms this *IntentMissingError into a txn restart:
    // Turn an IntentMissingError into a transactional retry error.
    if ime, ok := pErr.GetDetail().(*roachpb.IntentMissingError); ok {
    log.VEventf(ctx, 2, "transforming intent missing error into retry: %v", ime)
    err := roachpb.NewTransactionRetryError(
    roachpb.RETRY_ASYNC_WRITE_FAILURE, fmt.Sprintf("missing intent on: %s", ime.Key))
    retryErr := roachpb.NewErrorWithTxn(err, pErr.GetTxn())
    retryErr.Index = pErr.Index
    return retryErr
    }
  7. immediately after the interceptor stack is done, TxnCoordSender restarts the txn and then immediately runs into an assertion I placed just below:
    if tc.mu.txnState == txnPending && tc.mu.txn.Status == roachpb.COMMITTED {
        log.Fatalf(ctx, "TBG txnPending but committed %+v: %s; pErr=%v", req, tc.mu.txn, pErr)
    }

As an additional bit of support for this sequence of events, the pErr logged in that assertion is a RETRY_ASYNC_WRITE_FAILURE containing as its txn the epoch-zero txn proto.

At first glance, the unwrapping of the MixedSuccessError seems fishy and removing that (turning it into an ambiguous commit instead) would likely solve the problem (I'll try that tomorrow just to validate the above sequence of events further). However, it seems that the underlying issue is a bit deeper than that. Once we get rid of MixedSuccessError we'd have the same kind of problem again: we might get *IntentMissingError but the txn is actually committed. Hmm, or actually, no we wouldn't. We would (should) fail to stage the txn record again (since the persisted version is already COMMITTED), or use the committed proto as a signal that we're in the clear (though synthesizing the batch response will be awkward: requires replacing all writes with reads at the right timestamps).

@andreimatei
Copy link
Contributor

Good investigation.
I think we should try to avoid introducing more ambiguous commit errors. It sounds to me like we need to unpack the MixedSuccessError and, if we managed to write the txn record, try to resolve the ambiguity by... retrying the QueryIntents. But how many times should we retry? What does the IntentResolver code do when its QueryIntents fail - does it retry indefinitely?

@tbg
Copy link
Member

tbg commented May 28, 2019

I agree that we should not introduce ambiguous results willy-nilly, but as a short term fix (until Nathan is back next week) to get back on safe ground I think it'd be acceptable. That said, I haven't seen this cause test failures since we toned down the assertion so if that stays like that I'd rather put work into the real fix, whatever that may be.

@andreimatei
Copy link
Contributor

Toned down the assertion? Like, the user gets a non-ambiguous error but the txn is committed? This can't stay so, right?

@tbg
Copy link
Member

tbg commented May 28, 2019

I meant this PR: #37867

nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue May 29, 2019
This commit adds write pipelining and concurrent intent resolution
by conflicting transactions to the parallel commits spec. This causes
an assertion to be triggered by the hazard discussed in cockroachdb#37866.

The assertion fires when the pre-commit QueryIntent for a pipelined
write gets confused about an already resolved intent. This triggers
a transaction retry, at which point the transaction record is
unexpectedly already committed.

Running the model right now hits this issue. The next commit will
propose a solution to fixing it.

Release note: None
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue May 29, 2019
…mmit

This commit proposes a medium-term solution to cockroachdb#37866. In doing so, it
resolved the model failure from the previous commit.

The medium-term solution is to catch `IntentMissingErrors` in `DistSender`'s
`divideAndSendParallelCommit` method coming from the pre-commit QueryIntent
batch. When we see one of these errors, we immediately send a `QueryTxn`
request to the transaction record. This will result in one of the four
statuses:
1. PENDING: Unexpected because the parallel commit `EndTransactionRequest`
    succeeded. Ignore.
2. STAGING: Unambiguously not the issue from cockroachdb#37866. Ignore.
3. COMMITTED: Unambiguously the issue from cockroachdb#37866. Strip the error and
    return the updated proto.
4. ABORTED: Still ambiguous. Transform error into an AmbiguousCommitError
    and return.

This solution isolates the ambiguity caused by the loss of information during
intent resolution to just the case where the result of the QueryTxn is ABORTED.
This is because an ABORTED record can mean either 1) the transaction was ABORTED
and the missing intent was removed or 2) the transaction was COMMITTED, all intents
were resolved, and the transaction record was GCed.

This is a significant reduction in the cases where an AmbiguousCommitError will
be needed and I suspect it will be able to tide us over until we're able to eliminate
the loss of information caused by intent resolution almost entirely (e.g. by storing
transaction IDs in committed values). There will still be some loss of information if
we're not careful about MVCC GC, and it's still not completely clear to me how we'll
need to handle that in every case.

Release note: None
@nvanbenschoten
Copy link
Member Author

Thanks again for tracking this down @tbg.

I proposed a medium-term solution to this over in #37900. I don't know that we can completely eliminate the ambiguity due to a combination of loss of information during intent resolution and transaction record GC, but we can get pretty close and throw AmbiguousCommitErrors only where they're strictly necessary. Let me know what you two think.

craig bot pushed a commit that referenced this issue Jun 3, 2019
37900: tla-plus: QueryTxn on ambiguous QueryIntent failure during ParallelCommit r=nvanbenschoten a=nvanbenschoten

This PR starts by adding write pipelining and concurrent intent resolution by conflicting transactions to the parallel commits TLA+ spec. This causes an assertion to be triggered by the hazard discussed in #37866.

The assertion fires when the pre-commit QueryIntent for a pipelined write gets confused about an already resolved intent. This triggers a transaction retry, at which point the transaction record is unexpectedly already committed.

This PR then proposes a medium-term solution to #37866. In doing so, it resolved the model failure from the previous commit.

The medium-term solution is to catch `IntentMissingErrors` in `DistSender`'s `divideAndSendParallelCommit` method coming from the pre-commit QueryIntent batch (right around [here](https://github.com/cockroachdb/cockroach/blob/2428567cba5e0838615521cbc9d0e1310f0ee6ad/pkg/kv/dist_sender.go#L916)). When we see one of these errors, we immediately send a `QueryTxn` request to the transaction record. This will result in one of the four statuses:
1. PENDING: Unexpected because the parallel commit `EndTransactionRequest` succeeded. Ignore.
2. STAGING: Unambiguously not the issue from #37866. Ignore.
3. COMMITTED: Unambiguously the issue from #37866. Strip the error and return the updated proto.
4. ABORTED: Still ambiguous. Transform error into an AmbiguousCommitError and return.

This solution isolates the ambiguity caused by the loss of information during intent resolution to just the case where the result of the QueryTxn is ABORTED. This is because an ABORTED record can mean either 1) the transaction was ABORTED and the missing intent was removed or 2) the transaction was COMMITTED, all intents were resolved, and the transaction record was GCed.

This is a significant reduction in the cases where an AmbiguousCommitError will be needed and I suspect it will be able to tide us over until we're able to eliminate the loss of information caused by intent resolution almost entirely (e.g. by storing transaction IDs in committed values). There will still be some loss of information if we're not careful about MVCC GC, and it's still not completely clear to me how we'll need to handle that in every case. That's a discussion for a different time.

37918: roachtest: Move jepsen-initialized marker out of defer r=tbg a=bdarnell

This was running whether the test succeeded or failed, which is silly.

Fixes #37831

Release note: None

Co-authored-by: Nathan VanBenschoten <[email protected]>
Co-authored-by: Ben Darnell <[email protected]>
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Jun 4, 2019
…lel commit

Fixes cockroachdb#37866.

This commit implements the medium-term solution to cockroachdb#37866 proposed and modeled
in cockroachdb#37900. The solution is to catch `IntentMissingErrors` in `DistSender`'s
`divideAndSendParallelCommit` method coming from a parallel commit's pre-commit
QueryIntent batch. When we see one of these errors, we immediately send a
`QueryTxn` request to the transaction record. This will result in one of the
four statuses:
1. PENDING: Unexpected because the parallel commit `EndTransactionRequest` succeeded. Ignore.
2. STAGING: Unambiguously not the issue from cockroachdb#37866. Ignore.
3. COMMITTED: Unambiguously the issue from cockroachdb#37866. Strip the error and return the updated proto.
4. ABORTED: Still ambiguous. Transform error into an AmbiguousCommitError and return.

This solution isolates the ambiguity caused by the loss of information during
intent resolution to just the case where the result of the QueryTxn is ABORTED.
This is because an ABORTED record can mean either 1) the transaction was ABORTED
and the missing intent was removed or 2) the transaction was COMMITTED, all
intents were resolved, and the transaction record was GCed.

Release note: None
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Jun 4, 2019
…lel commit

Fixes cockroachdb#37866.

This commit implements the medium-term solution to cockroachdb#37866 proposed and modeled
in cockroachdb#37900. The solution is to catch `IntentMissingErrors` in `DistSender`'s
`divideAndSendParallelCommit` method coming from a parallel commit's pre-commit
QueryIntent batch. When we see one of these errors, we immediately send a
`QueryTxn` request to the transaction record. This will result in one of the
four statuses:
1. PENDING: Unexpected because the parallel commit `EndTransactionRequest` succeeded. Ignore.
2. STAGING: Unambiguously not the issue from cockroachdb#37866. Ignore.
3. COMMITTED: Unambiguously the issue from cockroachdb#37866. Strip the error and return the updated proto.
4. ABORTED: Still ambiguous. Transform error into an AmbiguousCommitError and return.

This solution isolates the ambiguity caused by the loss of information during
intent resolution to just the case where the result of the QueryTxn is ABORTED.
This is because an ABORTED record can mean either 1) the transaction was ABORTED
and the missing intent was removed or 2) the transaction was COMMITTED, all
intents were resolved, and the transaction record was GCed.

Release note: None
craig bot pushed a commit that referenced this issue Jun 4, 2019
37987: kv: detect if missing intent is due to intent resolution during parallel commit r=nvanbenschoten a=nvanbenschoten

Fixes #37866.

This commit implements the medium-term solution to #37866 proposed and modeled in #37900. The solution is to catch `IntentMissingErrors` in `DistSender`'s `divideAndSendParallelCommit` method coming from a parallel commit's pre-commit QueryIntent batch. When we see one of these errors, we immediately send a `QueryTxn` request to the transaction record. This will result in one of the four statuses:
1. PENDING: Unexpected because the parallel commit `EndTransactionRequest` succeeded. Ignore.
2. STAGING: Unambiguously not the issue from #37866. Ignore.
3. COMMITTED: Unambiguously the issue from #37866. Strip the error and return the updated proto.
4. ABORTED: Still ambiguous. Transform error into an AmbiguousCommitError and return.

This solution isolates the ambiguity caused by the loss of information during intent resolution to just the case where the result of the QueryTxn is ABORTED. This is because an ABORTED record can mean either 1) the transaction was ABORTED and the missing intent was removed or 2) the transaction was COMMITTED, all intents were resolved, and the transaction record was GCed.

Release note: None

Co-authored-by: Nathan VanBenschoten <[email protected]>
@craig craig bot closed this as completed in #37987 Jun 4, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-kv-client Relating to the KV client and the KV interface. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
Projects
None yet
3 participants