kv: fix issues around failed 1PC txns #26497

andreimatei · 2018-06-07T00:40:13Z

The recent #25541 changed the way "tracking" (the heartbeat loop and
intent collection) is initiated for transactions. It aimed to simplify
things and put the burden on the client to decide when a txn needs
tracking. This introduced a problem - the client.Txn was not initiating
tracking when sending 1PC batches. However, tracking is needed for these
transactions too: even though usually they'll succeed and so the
TxnCoordSender state can be quickly destroyed, when they fail their
intents and heartbeat loop need to be kept around just like for any
other txn.

This patch backtracks on the previous move to make it the client's
responsibility to initiate tracking (it didn't stand the test of time):
the client.Txn is no longer in charge of calling tcs.StartTracking().
Instead, the TCS does whatever needs to be done when it sees an
EndTransaction.

I also took the opportunity to spruce up comments on the TxnCoordSender.

Release note: None

cockroach-teamcity · 2018-06-07T00:40:20Z

This change is

andreimatei · 2018-06-07T00:57:38Z

r1 is #26479

Review status: 0 of 8 files reviewed at latest revision, all discussions resolved, some commit checks pending.

Comments from Reviewable

bdarnell · 2018-06-07T15:05:03Z

How did you determine that this fix was needed? I assume you were seeing cases where intents were being left pending; should we have a roachtest that exhibits this more clearly?

Reviewed 7 of 7 files at r2.
Review status: 7 of 8 files reviewed at latest revision, all discussions resolved, all commit checks successful.

pkg/kv/txn_coord_sender.go, line 635 at r2 (raw file):

		tc.mu.meta.Txn = br.Txn.Clone()
		_, hasBT := ba.GetArg(roachpb.BeginTransaction)
		onePC := br.Txn.Status == roachpb.COMMITTED && hasBT

It seems like a lot of plumbing to get the onePC flag from here to updateStats. Could we just increment the 1PC stat here? (or maybe replace the metric completely with one tracked at the replica so it can be accurate)

pkg/kv/txn_coord_sender_test.go, line 1975 at r2 (raw file):

		Txn: txn.Proto(),
	}
	// ba := roachpb.BatchRequest{}

Remove these lines?

Comments from Reviewable

nvanbenschoten · 2018-06-07T18:07:01Z

Reviewed 3 of 7 files at r2.
Review status: 7 of 8 files reviewed at latest revision, 2 unresolved discussions, all commit checks successful.

pkg/internal/client/sender.go, line 111 at r2 (raw file):

// TxnSenders with GetMeta or AugmentMeta panicing with unimplemented. This is
// a helper mechanism to facilitate testing.
type TxnSenderAdapter struct {

Change this back to TxnSenderFunc if you're removing StartTrackingWrapped.

pkg/kv/txn_coord_sender.go, line 104 at r2 (raw file):

// the client.Txn is expected to create a new TxnCoordSender instance
// transparently for the higher-level client).
// - Ensures atomic execution for non-transactional (write) batches by transparently

We need to do this even for scans that span ranges.

Comments from Reviewable

andreimatei · 2018-06-07T18:09:33Z

Review status: 7 of 8 files reviewed at latest revision, 4 unresolved discussions, all commit checks successful.

pkg/kv/txn_coord_sender.go, line 104 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

We need to do this even for scans that span ranges.

do we? If the txn is read-only, isn't a consistent timestamp the only thing we need across ranges?

Comments from Reviewable

The recent cockroachdb#25541 changed the way "tracking" (the heartbeat loop and intent collection) is initiated for transactions. It aimed to simplify things and put the burden on the client to decide when a txn needs tracking. This introduced a problem - the client.Txn was not initiating tracking when sending 1PC batches. However, tracking is needed for these transactions too: even though usually they'll succeed and so the TxnCoordSender state can be quickly destroyed, when they fail their intents and heartbeat loop need to be kept around just like for any other txn. This patch backtracks on the previous move to make it the client's responsibility to initiate tracking (it didn't stand the test of time): the client.Txn is no longer in charge of calling tcs.StartTracking(). Instead, the TCS does whatever needs to be done when it sees an EndTransaction. I also took the opportunity to spruce up comments on the TxnCoordSender. Release note: None

andreimatei · 2018-06-07T19:10:11Z

As we discussed, badness had not been seen in the wild. I think the test I've added is sufficient.

Review status: 2 of 8 files reviewed at latest revision, 4 unresolved discussions, some commit checks pending.

pkg/internal/client/sender.go, line 111 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Change this back to TxnSenderFunc if you're removing StartTrackingWrapped.

Done.

pkg/kv/txn_coord_sender.go, line 635 at r2 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

It seems like a lot of plumbing to get the onePC flag from here to updateStats. Could we just increment the 1PC stat here? (or maybe replace the metric completely with one tracked at the replica so it can be accurate)

Done.

pkg/kv/txn_coord_sender_test.go, line 1975 at r2 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Remove these lines?

Done.

Comments from Reviewable

nvanbenschoten · 2018-06-07T19:18:21Z

Review status: 2 of 8 files reviewed at latest revision, 3 unresolved discussions, some commit checks pending.

Comments from Reviewable

andreimatei · 2018-06-07T19:20:52Z

bors r+

Review status: 2 of 8 files reviewed at latest revision, 3 unresolved discussions, some commit checks pending.

Comments from Reviewable

26497: kv: fix issues around failed 1PC txns r=andreimatei a=andreimatei The recent #25541 changed the way "tracking" (the heartbeat loop and intent collection) is initiated for transactions. It aimed to simplify things and put the burden on the client to decide when a txn needs tracking. This introduced a problem - the client.Txn was not initiating tracking when sending 1PC batches. However, tracking is needed for these transactions too: even though usually they'll succeed and so the TxnCoordSender state can be quickly destroyed, when they fail their intents and heartbeat loop need to be kept around just like for any other txn. This patch backtracks on the previous move to make it the client's responsibility to initiate tracking (it didn't stand the test of time): the client.Txn is no longer in charge of calling tcs.StartTracking(). Instead, the TCS does whatever needs to be done when it sees an EndTransaction. I also took the opportunity to spruce up comments on the TxnCoordSender. Release note: None Co-authored-by: Andrei Matei <[email protected]>

craig · 2018-06-07T19:41:22Z

Build succeeded

GitHub CI (Cockroach)

I've introduced a bug in cockroachdb#26497 - if there's concurrent writing requests going though a single client.Txn, it can be that some of the intents are not tracked correctly. Before this patch, intents started being tracked when the TxnCoordSender sees a BeginTransaction. Even through the client.Txn attaches an EndTransaction to the first writing requests that it sees, a 2nd request might pass through and get to the TxnCoordSender before that BeginTxn. In fact, it might even get a response before the TxnCoordSender sees the BeginTxn, in which case we wouldn't have recorded its intents. This patch makes the TxnCoordSender track intents from all writing requests. The BeginTxn doesn't matter for tracking intents any more. It continues to matter for starting the heartbeat loop. We could alternatively move to starting the hb loop when seeing the first write, but I didn't do it. Fixes cockroachdb#26538 Release note: None

26606: kv: fix intent tracking r=andreimatei a=andreimatei I've introduced a bug in #26497 - if there's concurrent writing requests going though a single client.Txn, it can be that some of the intents are not tracked correctly. Before this patch, intents started being tracked when the TxnCoordSender sees a BeginTransaction. Even through the client.Txn attaches an EndTransaction to the first writing requests that it sees, a 2nd request might pass through and get to the TxnCoordSender before that BeginTxn. In fact, it might even get a response before the TxnCoordSender sees the BeginTxn, in which case we wouldn't have recorded its intents. This patch makes the TxnCoordSender track intents from all writing requests. The BeginTxn doesn't matter for tracking intents any more. It continues to matter for starting the heartbeat loop. We could alternatively move to starting the hb loop when seeing the first write, but I didn't do it. Fixes #26538 Release note: None Co-authored-by: Andrei Matei <[email protected]>

andreimatei assigned tbg and nvanbenschoten Jun 7, 2018

andreimatei requested a review from a team June 7, 2018 00:40

andreimatei force-pushed the txn-tracking branch from 4146057 to 4bc7131 Compare June 7, 2018 00:59

andreimatei force-pushed the txn-tracking branch from 4bc7131 to f255df1 Compare June 7, 2018 18:55

andreimatei force-pushed the txn-tracking branch from f255df1 to 22554db Compare June 7, 2018 19:10

andreimatei requested review from a team June 7, 2018 19:10

craig bot merged commit 22554db into cockroachdb:master Jun 7, 2018

andreimatei mentioned this pull request Jun 11, 2018

kv: fix intent tracking #26606

Merged

andreimatei deleted the txn-tracking branch June 11, 2018 21:38

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

kv: fix issues around failed 1PC txns #26497

kv: fix issues around failed 1PC txns #26497

andreimatei commented Jun 7, 2018

cockroach-teamcity commented Jun 7, 2018

andreimatei commented Jun 7, 2018

bdarnell commented Jun 7, 2018

nvanbenschoten commented Jun 7, 2018

andreimatei commented Jun 7, 2018

andreimatei commented Jun 7, 2018

nvanbenschoten commented Jun 7, 2018

andreimatei commented Jun 7, 2018

craig bot commented Jun 7, 2018

kv: fix issues around failed 1PC txns #26497

kv: fix issues around failed 1PC txns #26497

Conversation

andreimatei commented Jun 7, 2018

cockroach-teamcity commented Jun 7, 2018

andreimatei commented Jun 7, 2018

bdarnell commented Jun 7, 2018

nvanbenschoten commented Jun 7, 2018

andreimatei commented Jun 7, 2018

andreimatei commented Jun 7, 2018

nvanbenschoten commented Jun 7, 2018

andreimatei commented Jun 7, 2018

craig bot commented Jun 7, 2018

Build succeeded