Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kv: fix issues around failed 1PC txns #26497

Merged
merged 1 commit into from
Jun 7, 2018

Conversation

andreimatei
Copy link
Contributor

The recent #25541 changed the way "tracking" (the heartbeat loop and
intent collection) is initiated for transactions. It aimed to simplify
things and put the burden on the client to decide when a txn needs
tracking. This introduced a problem - the client.Txn was not initiating
tracking when sending 1PC batches. However, tracking is needed for these
transactions too: even though usually they'll succeed and so the
TxnCoordSender state can be quickly destroyed, when they fail their
intents and heartbeat loop need to be kept around just like for any
other txn.

This patch backtracks on the previous move to make it the client's
responsibility to initiate tracking (it didn't stand the test of time):
the client.Txn is no longer in charge of calling tcs.StartTracking().
Instead, the TCS does whatever needs to be done when it sees an
EndTransaction.

I also took the opportunity to spruce up comments on the TxnCoordSender.

Release note: None

@andreimatei andreimatei requested a review from a team June 7, 2018 00:40
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@andreimatei
Copy link
Contributor Author

r1 is #26479


Review status: 0 of 8 files reviewed at latest revision, all discussions resolved, some commit checks pending.


Comments from Reviewable

@bdarnell
Copy link
Contributor

bdarnell commented Jun 7, 2018

:lgtm:

How did you determine that this fix was needed? I assume you were seeing cases where intents were being left pending; should we have a roachtest that exhibits this more clearly?


Reviewed 7 of 7 files at r2.
Review status: 7 of 8 files reviewed at latest revision, all discussions resolved, all commit checks successful.


pkg/kv/txn_coord_sender.go, line 635 at r2 (raw file):

		tc.mu.meta.Txn = br.Txn.Clone()
		_, hasBT := ba.GetArg(roachpb.BeginTransaction)
		onePC := br.Txn.Status == roachpb.COMMITTED && hasBT

It seems like a lot of plumbing to get the onePC flag from here to updateStats. Could we just increment the 1PC stat here? (or maybe replace the metric completely with one tracked at the replica so it can be accurate)


pkg/kv/txn_coord_sender_test.go, line 1975 at r2 (raw file):

		Txn: txn.Proto(),
	}
	// ba := roachpb.BatchRequest{}

Remove these lines?


Comments from Reviewable

@nvanbenschoten
Copy link
Member

Reviewed 3 of 7 files at r2.
Review status: 7 of 8 files reviewed at latest revision, 2 unresolved discussions, all commit checks successful.


pkg/internal/client/sender.go, line 111 at r2 (raw file):

// TxnSenders with GetMeta or AugmentMeta panicing with unimplemented. This is
// a helper mechanism to facilitate testing.
type TxnSenderAdapter struct {

Change this back to TxnSenderFunc if you're removing StartTrackingWrapped.


pkg/kv/txn_coord_sender.go, line 104 at r2 (raw file):

// the client.Txn is expected to create a new TxnCoordSender instance
// transparently for the higher-level client).
// - Ensures atomic execution for non-transactional (write) batches by transparently

We need to do this even for scans that span ranges.


Comments from Reviewable

@andreimatei
Copy link
Contributor Author

Review status: 7 of 8 files reviewed at latest revision, 4 unresolved discussions, all commit checks successful.


pkg/kv/txn_coord_sender.go, line 104 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

We need to do this even for scans that span ranges.

do we? If the txn is read-only, isn't a consistent timestamp the only thing we need across ranges?


Comments from Reviewable

The recent cockroachdb#25541 changed the way "tracking" (the heartbeat loop and
intent collection) is initiated for transactions. It aimed to simplify
things and put the burden on the client to decide when a txn needs
tracking. This introduced a problem - the client.Txn was not initiating
tracking when sending 1PC batches. However, tracking is needed for these
transactions too: even though usually they'll succeed and so the
TxnCoordSender state can be quickly destroyed, when they fail their
intents and heartbeat loop need to be kept around just like for any
other txn.

This patch backtracks on the previous move to make it the client's
responsibility to initiate tracking (it didn't stand the test of time):
the client.Txn is no longer in charge of calling tcs.StartTracking().
Instead, the TCS does whatever needs to be done when it sees an
EndTransaction.

I also took the opportunity to spruce up comments on the TxnCoordSender.

Release note: None
@andreimatei andreimatei requested review from a team June 7, 2018 19:10
@andreimatei
Copy link
Contributor Author

As we discussed, badness had not been seen in the wild. I think the test I've added is sufficient.


Review status: 2 of 8 files reviewed at latest revision, 4 unresolved discussions, some commit checks pending.


pkg/internal/client/sender.go, line 111 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Change this back to TxnSenderFunc if you're removing StartTrackingWrapped.

Done.


pkg/kv/txn_coord_sender.go, line 635 at r2 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

It seems like a lot of plumbing to get the onePC flag from here to updateStats. Could we just increment the 1PC stat here? (or maybe replace the metric completely with one tracked at the replica so it can be accurate)

Done.


pkg/kv/txn_coord_sender_test.go, line 1975 at r2 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Remove these lines?

Done.


Comments from Reviewable

@nvanbenschoten
Copy link
Member

:lgtm:


Review status: 2 of 8 files reviewed at latest revision, 3 unresolved discussions, some commit checks pending.


Comments from Reviewable

@andreimatei
Copy link
Contributor Author

bors r+


Review status: 2 of 8 files reviewed at latest revision, 3 unresolved discussions, some commit checks pending.


Comments from Reviewable

craig bot pushed a commit that referenced this pull request Jun 7, 2018
26497: kv: fix issues around failed 1PC txns r=andreimatei a=andreimatei

The recent #25541 changed the way "tracking" (the heartbeat loop and
intent collection) is initiated for transactions. It aimed to simplify
things and put the burden on the client to decide when a txn needs
tracking. This introduced a problem - the client.Txn was not initiating
tracking when sending 1PC batches. However, tracking is needed for these
transactions too: even though usually they'll succeed and so the
TxnCoordSender state can be quickly destroyed, when they fail their
intents and heartbeat loop need to be kept around just like for any
other txn.

This patch backtracks on the previous move to make it the client's
responsibility to initiate tracking (it didn't stand the test of time):
the client.Txn is no longer in charge of calling tcs.StartTracking().
Instead, the TCS does whatever needs to be done when it sees an
EndTransaction.

I also took the opportunity to spruce up comments on the TxnCoordSender.

Release note: None

Co-authored-by: Andrei Matei <[email protected]>
@craig
Copy link
Contributor

craig bot commented Jun 7, 2018

Build succeeded

@craig craig bot merged commit 22554db into cockroachdb:master Jun 7, 2018
andreimatei added a commit to andreimatei/cockroach that referenced this pull request Jun 11, 2018
I've introduced a bug in cockroachdb#26497 - if there's concurrent writing requests
going though a single client.Txn, it can be that some of the intents
are not tracked correctly. Before this patch, intents started being
tracked when the TxnCoordSender sees a BeginTransaction. Even through
the client.Txn attaches an EndTransaction to the first writing requests
that it sees, a 2nd request might pass through and get to the
TxnCoordSender before that BeginTxn. In fact, it might even get a
response before the TxnCoordSender sees the BeginTxn, in which case we
wouldn't have recorded its intents.
This patch makes the TxnCoordSender track intents from all writing
requests. The BeginTxn doesn't matter for tracking intents any more. It
continues to matter for starting the heartbeat loop. We could
alternatively move to starting the hb loop when seeing the first write,
but I didn't do it.

Fixes cockroachdb#26538

Release note: None
@andreimatei andreimatei deleted the txn-tracking branch June 11, 2018 21:38
craig bot pushed a commit that referenced this pull request Jun 12, 2018
26606: kv: fix intent tracking r=andreimatei a=andreimatei

I've introduced a bug in #26497 - if there's concurrent writing requests
going though a single client.Txn, it can be that some of the intents
are not tracked correctly. Before this patch, intents started being
tracked when the TxnCoordSender sees a BeginTransaction. Even through
the client.Txn attaches an EndTransaction to the first writing requests
that it sees, a 2nd request might pass through and get to the
TxnCoordSender before that BeginTxn. In fact, it might even get a
response before the TxnCoordSender sees the BeginTxn, in which case we
wouldn't have recorded its intents.
This patch makes the TxnCoordSender track intents from all writing
requests. The BeginTxn doesn't matter for tracking intents any more. It
continues to matter for starting the heartbeat loop. We could
alternatively move to starting the hb loop when seeing the first write,
but I didn't do it.

Fixes #26538

Release note: None

Co-authored-by: Andrei Matei <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants