Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rangefeed: don't delete txn records from the rngfeed pusher #53146

Merged
merged 2 commits into from
Aug 20, 2020

Conversation

andreimatei
Copy link
Contributor

@andreimatei andreimatei commented Aug 20, 2020

Before this patch, the rangefeed txn pushed was GC'ing the records of
the transactions that it found to be aborted or committed when it
PUSH_TOUCHED's them. Deleting the txn record is not a nice thing to do
if the txn's coordinator might still be running, and so this patch
terminates the practice. If the coordinator is still running and it had
STAGED the txn record, a PUSH_TOUCH might cause a transition to ABORTED
or COMMITTED. If such a transition is followed by the GC of the txn
record, this causes an ambigous commit error for the coordinator, which
can't tell whether the transaction committed (implicitly) or was rolled
back.

As the code stands, this patch doesn't actually help eliminate the
ambiguity in the ABORTED case (it does in the COMMIT case though),
because the coordinator doesn't distinguish between an ABORTED record
and a missing record; it treats both the same, as ambiguous. But we're
going to improve that, and the coordinator is gonna consider an existing
record as non-ambiguous. The GC'ed case is going to remain the ambiguous
one, and the idea is to avoid that as much as possible by only doing GC
from the gc-queue (after an hour) and from the coordinator.

Touches #52566

Release note (bug fix): Some rare AmbiguousCommitErrors happening when
CDC was used were eliminated.

@andreimatei andreimatei requested a review from a team as a code owner August 20, 2020 18:55
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@andreimatei
Copy link
Contributor Author

Only the last commit.

@andreimatei andreimatei force-pushed the rangefeed.dont-gc-txn-record branch from beb3118 to 5c62f6a Compare August 20, 2020 18:58
Copy link
Member

@nvanbenschoten nvanbenschoten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm: nice. That was a pretty clean change.

Reviewed 1 of 7 files at r5, 6 of 6 files at r6.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @andreimatei and @nvanbenschoten)


pkg/kv/kvserver/replica_rangefeed.go, line 107 at r6 (raw file):

	ctx context.Context, intents []roachpb.LockUpdate,
) error {
	return tp.ir.ResolveIntents(ctx, intents, intentresolver.ResolveOptions{}).GoError()

Don't we need to poison? That seems pretty important.


pkg/kv/kvserver/intentresolver/intent_resolver.go, line 549 at r6 (raw file):

// actors to GC a txn record, since that can cause ambiguities for the
// coordinator: if it had STAGED the txn, it won't be able to tell the
// difference between a txn that had been implicitly committed and one that

"implicitly committed, recovered, and GC-ed"

"someone else aborted and GC-ed"


pkg/kv/kvserver/rangefeed/processor_test.go, line 717 at r6 (raw file):

	tp.mockResolveIntentsFn(func(ctx context.Context, intents []roachpb.LockUpdate) error {
		// There's nothing to assert here. We expect the intents to correspond to
		// transactions that had theit LockSpans populated when we pushed them. This

their


pkg/kv/kvserver/rangefeed/task.go, line 206 at r6 (raw file):

			})

			// Asynchronously clean up the transaction's intents, which should

Asynchronously


pkg/kv/kvserver/rangefeed/task.go, line 231 at r6 (raw file):

			// populated. If, however, we ran into a transaction that its coordinator
			// tried to rollback but didn't follow up with garbage collection, then
			// LockSpans will be populated.

Lease some indication that all of this is opportunistic and not necessary for correctness.


pkg/kv/kvserver/rangefeed/task_test.go, line 272 at r6 (raw file):

}

func (tp *testTxnPusher) intentsToTxns(intents []roachpb.LockUpdate) []enginepb.TxnMeta {

Does this need to be a method? It doesn't look like it.


pkg/kv/kvserver/rangefeed/task_test.go, line 286 at r6 (raw file):

}

func (tp *testTxnPusher) mockResolveIntentsFn(

nit: move this up so that the two mocking functions are next to each other.

@andreimatei andreimatei force-pushed the rangefeed.dont-gc-txn-record branch from 5c62f6a to bc2f974 Compare August 20, 2020 20:36
Copy link
Contributor Author

@andreimatei andreimatei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @nvanbenschoten)


pkg/kv/kvserver/replica_rangefeed.go, line 107 at r6 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Don't we need to poison? That seems pretty important.

good catch; we need to poison on aborted transactions.
Btw, I've prepended a commit commenting on the Poison flag on the proto specifying that it's ignored for intents from non-ABORTED transactions.


pkg/kv/kvserver/intentresolver/intent_resolver.go, line 549 at r6 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

"implicitly committed, recovered, and GC-ed"

"someone else aborted and GC-ed"

done


pkg/kv/kvserver/rangefeed/processor_test.go, line 717 at r6 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

their

done


pkg/kv/kvserver/rangefeed/task.go, line 206 at r6 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Asynchronously

done


pkg/kv/kvserver/rangefeed/task.go, line 231 at r6 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Lease some indication that all of this is opportunistic and not necessary for correctness.

done


pkg/kv/kvserver/rangefeed/task_test.go, line 272 at r6 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Does this need to be a method? It doesn't look like it.

it doesn't but I'd rather namespace it here


pkg/kv/kvserver/rangefeed/task_test.go, line 286 at r6 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

nit: move this up so that the two mocking functions are next to each other.

done

Copy link
Member

@nvanbenschoten nvanbenschoten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 2 of 8 files at r7, 6 of 6 files at r8.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @andreimatei)


pkg/kv/kvserver/intentresolver/intent_resolver.go, line 785 at r8 (raw file):

arent'

aren't


pkg/kv/kvserver/intentresolver/intent_resolver.go, line 786 at r8 (raw file):

	//
	// This field is ignored for intents that arent' resolved for an ABORTED txn;
	// in other words only intents from ABORTED transactions ever poison the abort

Missing comma after "words"


pkg/roachpb/api.proto, line 1111 at r8 (raw file):

  TransactionStatus status = 3;
  // Optionally poison the abort span for the transaction on all ranges
  // on which the intents reside. The field is ignored if status != ABORTED (i.e. only intents from

Looks like the wrapping got messed up here.

Specify that this field is ignored when resolving COMMITTED (or
otherwise non-ABORTED) intents (which it is). At lease the
rangefeedTxnPusher sets it for non-aborted transactions, leading to
question about whether that's a good idea.

Release note: None
Before this patch, the rangefeed txn pushed was GC'ing the records of
the transactions that it found to be aborted or committed when it
PUSH_TOUCHED's them. Deleting the txn record is not a nice thing to do
if the txn's coordinator might still be running, and so this patch
terminates the practice. If the coordinator is still running and it had
STAGED the txn record, a PUSH_TOUCH might cause a transition to ABORTED
or COMMITTED. If such a transition is followed by the GC of the txn
record, this causes an ambigous commit error for the coordinator, which
can't tell whether the transaction committed (implicitly) or was rolled
back.

As the code stands, this patch doesn't actually help eliminate the
ambiguity in the ABORTED case (it does in the COMMIT case though),
because the coordinator doesn't distinguish between an ABORTED record
and a missing record; it treats both the same, as ambiguous. But we're
going to improve that, and the coordinator is gonna consider an existing
record as non-ambiguous. The GC'ed case is going to remain the ambiguous
one, and the idea is to avoid that as much as possible by only doing GC
from the gc-queue (after an hour) and from the coordinator.

Touches cockroachdb#52566

Release note (bug fix): Some rare AmbiguousCommitErrors happening when
CDC was used were eliminated.
@andreimatei andreimatei force-pushed the rangefeed.dont-gc-txn-record branch from bc2f974 to 8d162e4 Compare August 20, 2020 22:14
Copy link
Contributor Author

@andreimatei andreimatei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bors r+

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @nvanbenschoten)


pkg/kv/kvserver/intentresolver/intent_resolver.go, line 785 at r8 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
arent'

aren't

done


pkg/kv/kvserver/intentresolver/intent_resolver.go, line 786 at r8 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Missing comma after "words"

done


pkg/roachpb/api.proto, line 1111 at r8 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Looks like the wrapping got messed up here.

done

@craig
Copy link
Contributor

craig bot commented Aug 20, 2020

Build succeeded:

@craig craig bot merged commit 904b7cb into cockroachdb:master Aug 20, 2020
@andreimatei andreimatei deleted the rangefeed.dont-gc-txn-record branch August 21, 2020 14:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants