rangefeed: fix premature checkpoint due to intent resolution race #117612
Conversation
Force-pushed from 4fc2d95 to e7ac988.
Nice! I think I prefer this over #114487 for the reasons discussed in the PR. I'm glad Barrier could be extended for this use case.
What do you think about it?
Reviewed 2 of 2 files at r1, 7 of 7 files at r2, 3 of 3 files at r3, 6 of 6 files at r4, 1 of 1 files at r5, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker)
pkg/kv/kvpb/api.proto
line 1323 at r4 (raw file):
but it did exist in the past
I don't even think we know that. The timestamp cache may have been ratcheted above the txn's min timestamp across the board.
pkg/kv/kvserver/replica_rangefeed.go
line 189 at r3 (raw file):
// Execute a Barrier on the leaseholder.
//
// TODO(erikgrinaker): handle Barrier requests split across ranges. This can
If a range splits, its rangefeed processor will be torn down and recreated, at which point the resolved timestamp will be recomputed with a lock table scan. So it doesn't feel like we necessarily need to handle this by using the LAIs returned from the Barrier request. But we also shouldn't crash. What happens now?
pkg/kv/kvserver/replica_rangefeed.go
line 193 at r3 (raw file):
// isUnsplittable on BarrierRequest, but we can maybe do something better
// (e.g. return per-range LAIs).
desc := tp.r.Desc()
Should we be passing the key span in from the rangefeed processor? We will tear the processor down on splits or merges, but I'm not sure that precludes any races.
pkg/kv/kvserver/replica_rangefeed.go
line 209 at r3 (raw file):
for tp.r.GetLeaseAppliedIndex() < lai {
	if ticker == nil {
		ticker = time.NewTicker(100 * time.Millisecond)
Do you think it's worth adding an exponential backoff here? 100ms for the first iteration feels a bit long.
pkg/kv/kvserver/replica_write.go
line 299 at r2 (raw file):
}
// Populate LeaseAppliedIndex for Barrier responses.
if ba.IsSingleBarrierRequest() && propResult.Err == nil {
Reminder to myself and others who get here: Barrier requests do go through Raft, despite having an empty WriteBatch, because of BatchRequest.RequiresConsensus.
pkg/kv/kvserver/intentresolver/intent_resolver.go
line 387 at r4 (raw file):
pushType kvpb.PushTxnType, skipIfInFlight bool,
) (_ map[uuid.UUID]*roachpb.Transaction, ambiguousAbort bool, _ *kvpb.Error) {
nit: consider renaming this to anyAmbiguousAbort or something along those lines.
pkg/kv/kvserver/rangefeed/resolved_timestamp.go
line 163 at r5 (raw file):
//
// TODO(erikgrinaker): check that this won't end up with crash loops.
rts.assertOpAboveRTS(op, t.Timestamp)
I can't recall why we didn't already have this assertion here. I don't think it was an intentional omission when we added the other assertions.
pkg/kv/kvserver/rangefeed/task.go
line 346 at r3 (raw file):
// intents in the rangefeed processor's queue. These updates may not yet have
// been applied to the resolved timestamp intent queue, but that's ok -- our
// MVCCAbortTxnOp will be enqueued and processed after them.
Could you mention that it's safe to publish a committed MVCCAbortTxnOp for a txn once all of that txn's local intents have been committed and all MVCCCommitIntentOp ops have been published? And then add a test for this?
pkg/kv/kvserver/rangefeed/task.go
line 352 at r3 (raw file):
//
// TODO(erikgrinaker): check how often we hit this, and consider coming up
// with a way to avoid it in the common case, since it incurs a write.
s/write/raft consensus round/
pkg/kv/kvserver/rangefeed/task.go
line 253 at r4 (raw file):
// prevent a restart for any transaction that has not been written
// over at a larger timestamp.
pushedTxns, ambiguousAbort, err := a.pusher.PushTxns(ctx, a.txns, a.ts)
Can we trust this ambiguousAbort value in mixed-version clusters?
pkg/kv/kvserver/batcheval/cmd_barrier_test.go
line 38 at r2 (raw file):
// TODO(erikgrinaker): add some actual testing here.
ts, lai, err := kvDB.Barrier(ctx, start, end)
Should this test be in batcheval? It's spinning up a full server, which we usually reserve for pkg/kv/kvclient or pkg/kv/kvserver.
Do we have any other testing of Barrier that we could extend?
What do you think about it?
I think it's better -- it's simpler and more robust. I'm not overly worried about the cost of these writes either, given the conditions under which it triggers and that it's at most one per range per incident, but I'll have a closer look at this.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten)
pkg/kv/kvserver/replica_rangefeed.go
line 189 at r3 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
If a range splits, its rangefeed processor will be torn down and recreated, at which point the resolved timestamp will be recomputed with a lock table scan. So it doesn't feel like we necessarily need to handle this by using the LAIs returned from the Barrier request. But we also shouldn't crash. What happens now?
As is, we'll use the LAI of the first range overlapping the keyspan, because there's no handling of this in BarrierResponse.combine():
For splits, that could actually be fine, because that LAI would be on the LHS after applying the split, so waiting for it would also mean waiting for the split to apply locally and tearing down the rangefeed. But this is mostly accidental, and we shouldn't rely on it. It would also break on a merge where the LHS LAI is lower than the RHS LAI -- that case requires us to check that the range ID matches our local replica, which I don't believe is currently exposed in the RPC response.
For our purposes here, it would be sufficient to mark Barrier as isUnsplittable and include the range ID in the response, and then just error out in txnPushAttempt until we tear down the rangefeed processor. But this doesn't generalize particularly well, because other callers would likely want to apply a barrier across a keyspan regardless of the range structure.
We could also return a slice of RangeBarrier containing the range ID, key span, evaluation timestamp, and LAI. However, that seems impractical when used with a large table across thousands of ranges, where we'd have to start using resume spans to avoid large responses.
I'm going to do the straightforward thing for now: make this conditional on a parameter BarrierRequest.WithLeaseAppliedIndex, which will make the request unsplittable.
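As a rough illustration of that approach (a sketch only: the request flag and response field follow the names in this thread, while the helper name and exact plumbing are assumptions, not the final implementation):

```go
// barrierWithLAI issues a Barrier over the span and also asks for the
// leaseholder's lease applied index. WithLeaseAppliedIndex makes the request
// unsplittable, so it fails (rather than silently combining LAIs) if the
// span no longer fits in a single range.
// Assumes the usual CockroachDB imports: kv, kvpb, roachpb, hlc.
func barrierWithLAI(
	ctx context.Context, db *kv.DB, start, end roachpb.Key,
) (hlc.Timestamp, kvpb.LeaseAppliedIndex, error) {
	b := &kv.Batch{}
	b.AddRawRequest(&kvpb.BarrierRequest{
		RequestHeader:         kvpb.RequestHeader{Key: start, EndKey: end},
		WithLeaseAppliedIndex: true, // the new parameter proposed above
	})
	if err := db.Run(ctx, b); err != nil {
		// e.g. a range mismatch error if the span straddles a split.
		return hlc.Timestamp{}, 0, err
	}
	resp := b.RawResponse().Responses[0].GetBarrier()
	return resp.Timestamp, resp.LeaseAppliedIndex, nil
}
```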
pkg/kv/kvserver/replica_rangefeed.go
line 193 at r3 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Should we be passing the key span in from the rangefeed processor? We will tear the processor down on splits or merges, but I'm not sure that precludes any races.
Yeah, that seems prudent.
pkg/kv/kvserver/replica_rangefeed.go
line 209 at r3 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Do you think it's worth adding an exponential backoff here? 100ms for the first iteration feels a bit long.
Yeah, either that or we add a way to subscribe to LAI updates from the replica -- but that seems a bit overkill perhaps.
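For reference, a minimal backoff loop of the kind being discussed might look like this (stdlib only, illustrative; the real code would presumably use the repo's retry helpers and hook into stopper/context cancellation):

```go
// waitForLAI polls until the local replica has applied the given lease
// applied index, starting with a short sleep and doubling it up to a cap,
// rather than waking up on a fixed 100ms ticker.
func waitForLAI(ctx context.Context, getLAI func() uint64, lai uint64) error {
	backoff := 10 * time.Millisecond // assumed initial backoff
	const maxBackoff = time.Second
	for getLAI() < lai {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(backoff):
		}
		if backoff *= 2; backoff > maxBackoff {
			backoff = maxBackoff
		}
	}
	return nil
}
```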
pkg/kv/kvserver/rangefeed/resolved_timestamp.go
line 163 at r5 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
I can't recall why we didn't already have this assertion here. I don't think it was an intentional omission when we added the other assertions.
Is this assertion relevant for MVCCUpdateIntentOp as well?
pkg/kv/kvserver/rangefeed/task.go
line 346 at r3 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Could you mention that it's safe to publish a committed MVCCAbortTxnOp for a txn once all of that txn's local intents have been committed and all MVCCCommitIntentOp ops have been published? And then add a test for this?
Yep, I'll verify and document this, with a test.
pkg/kv/kvserver/rangefeed/task.go
line 253 at r4 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Can we trust this ambiguousAbort value in mixed-version clusters?
We can't, but in that case it degrades to the current behavior. That's worth documenting on the API.
pkg/kv/kvserver/batcheval/cmd_barrier_test.go
line 38 at r2 (raw file):
Should this test be in batcheval? It's spinning up a full server, which we usually reserve for pkg/kv/kvclient or pkg/kv/kvserver.
Several other batcheval tests do as well. I don't have an opinion on whether they should (beyond unit tests being better when possible), but it seems natural to test the request here, and we'll need to submit the proposal through Raft to fully do so.
Do we have any other testing of Barrier that we could extend?
Nope. I'll see if I can add some basic coverage in kvnemesis too.
Force-pushed from e7ac988 to fbdd612.
Force-pushed from c258859 to 94ce567.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker and @nvanbenschoten)
pkg/kv/kvserver/replica_rangefeed.go
line 189 at r3 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
As is, we'll use the LAI of the first range overlapping the keyspan, because there's no handling of this in BarrierResponse.combine():
For splits, that could actually be fine, because that LAI would be on the LHS after applying the split, so waiting for it would also mean waiting for the split to apply locally and tearing down the rangefeed. But this is mostly accidental, and we shouldn't rely on it. It would also break on a merge where the LHS LAI is lower than the RHS LAI -- that case requires us to check that the range ID matches our local replica, which I don't believe is currently exposed in the RPC response.
For our purposes here, it would be sufficient to mark Barrier as isUnsplittable and include the range ID in the response, and then just error out in txnPushAttempt until we tear down the rangefeed processor. But this doesn't generalize particularly well, because other callers would likely want to apply a barrier across a keyspan regardless of the range structure.
We could also return a slice of RangeBarrier containing the range ID, key span, evaluation timestamp, and LAI. However, that seems impractical when used with a large table across thousands of ranges, where we'd have to start using resume spans to avoid large responses.
I'm going to do the straightforward thing for now: make this conditional on a parameter BarrierRequest.WithLeaseAppliedIndex, which will make the request unsplittable.
I've updated the PR with this approach. It should be mostly good to go once I add some tests; I'd appreciate another look in the meantime.
Reviewed 18 of 18 files at r6, 6 of 12 files at r7, 4 of 7 files at r8, 6 of 6 files at r9, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker)
pkg/kv/kvserver/replica_application_result.go
line 268 at r6 (raw file):
// Populate BarrierResponse if requested.
if pErr == nil && cmd.proposal.Local != nil && cmd.proposal.Local.PopulateBarrierResponse {
	cmd.proposal.Local.PopulateBarrierResponse = false // mark as handled
nit: want to push this into a DetachPopulateBarrierResponse method on LocalResult, which handles the cmd.proposal.Local == nil case?
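Presumably something along these lines (a sketch of the suggested helper; the PopulateBarrierResponse field is as quoted above, the rest is an assumption rather than the final code):

```go
// DetachPopulateBarrierResponse returns whether a BarrierResponse should be
// populated for this command, and marks it as handled. It is safe to call on
// a nil LocalResult, which lets the caller drop the explicit nil check.
func (lResult *LocalResult) DetachPopulateBarrierResponse() bool {
	if lResult == nil || !lResult.PopulateBarrierResponse {
		return false
	}
	lResult.PopulateBarrierResponse = false // mark as handled
	return true
}
```

The call site above would then reduce to something like `if pErr == nil && cmd.proposal.Local.DetachPopulateBarrierResponse() { ... }`, since a method with a nil-safe pointer receiver can be invoked on a nil Local.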
pkg/kv/kvserver/replica_rangefeed.go
line 200 at r9 (raw file):
}
if lai == 0 {
	if tp.r.store.ClusterSettings().Version.IsActive(ctx, clusterversion.V24_1Start) {
Should we check the active cluster version before issuing the Barrier request? As is, the cluster might have completed its upgrade after the Barrier request was sent.
pkg/kv/kvserver/batcheval/result/result.go
line 102 at r6 (raw file):
"GossipFirstRange:%t MaybeGossipSystemConfig:%t "+
	"MaybeGossipSystemConfigIfHaveFailure:%t MaybeAddToSplitQueue:%t "+
	"MaybeGossipNodeLiveness:%s PopulateBarrierResponse:%t",
nit: let's keep this in the same order as the struct definition and IsZero.
pkg/kv/kvserver/rangefeed/resolved_timestamp.go
line 163 at r5 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
Is this assertion relevant for MVCCUpdateIntentOp as well?
I don't think so. In theory, an intent that the rangefeed knows will eventually be aborted could get updated from a timestamp below the resolved ts to another timestamp below the resolved ts.
pkg/kv/kvserver/rangefeed/resolved_timestamp.go
line 165 at r9 (raw file):
// This assertion can be violated in mixed-version clusters, so gate it
// on 24.1. See: https://github.com/cockroachdb/cockroach/issues/104309
if rts.settings.Version.IsActive(context.Background(), clusterversion.V24_1Start) {
To be extra careful, do we want to add an env var here so that we don't end up crash looping a cluster without any mechanism to get unstuck?
pkg/kv/kvserver/rangefeed/task.go
line 253 at r4 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
We can't, but in that case it degrades to the current behavior. That's worth documenting on the API.
I see. So we may operate on incorrect information, but that will cause this txnPushAttempt to degrade to the current behavior. That seems fine, but +1 on documenting -- other uses of the flag may not handle being lied to as gracefully.
pkg/kv/kvserver/rangefeed/task.go
line 363 at r9 (raw file):
// rangefeed processor shuts down. It needs to be, otherwise a stalled
// barrier can prevent node shutdown. Regardless, use a 1 minute backstop
// to prevent getting wedged.
Were you planning to address this TODO in this initial PR?
pkg/kv/kvserver/batcheval/cmd_barrier_test.go
line 38 at r2 (raw file):
I'll see if I can add some basic coverage in kvnemesis too.
👍 That's a good idea.
Force-pushed from 94ce567 to d105a98.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten)
pkg/kv/kvserver/replica_rangefeed.go
line 200 at r9 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Should we check the active cluster version before issuing the Barrier request? As is, the cluster might have completed its upgrade after the Barrier request was sent.
This seems exceedingly unlikely, since the remote server would have to have its binary upgraded and restarted while the response is in flight (a 24.1 binary will return the right response regardless of the cluster version). But I suppose it can't hurt. Done.
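For instance, the same gate could also be checked up front (a sketch; the placement and the fallback helper name are assumptions):

```go
// Don't issue a WithLeaseAppliedIndex barrier at all unless the whole
// cluster is known to be on 24.1; otherwise fall back to the old behavior
// immediately rather than inferring it from a zero LAI in the response.
if !tp.r.store.ClusterSettings().Version.IsActive(ctx, clusterversion.V24_1Start) {
	return tp.pushTxnsLegacy(ctx) // hypothetical pre-24.1 code path
}
```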
pkg/kv/kvserver/rangefeed/resolved_timestamp.go
line 165 at r9 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
To be extra careful, do we want to add an env var here so that we don't end up crash looping a cluster without any mechanism to get unstuck?
Yeah, may as well -- done. We could also consider disabling the assertion in release builds instead.
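For concreteness, an escape hatch of that shape might look roughly like this (the env var name and default are assumptions; envutil.EnvOrDefaultBool is the usual mechanism, and later in the thread a cluster setting is considered instead):

```go
// If the assertion ever starts crash looping nodes in the field, operators
// can disable it without a new binary.
var rtsAssertionsEnabled = envutil.EnvOrDefaultBool(
	"COCKROACH_RANGEFEED_RESOLVED_TIMESTAMP_ASSERTIONS", true) // hypothetical name

// ...
if rtsAssertionsEnabled &&
	rts.settings.Version.IsActive(context.Background(), clusterversion.V24_1Start) {
	rts.assertOpAboveRTS(op, t.Timestamp)
}
```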
pkg/kv/kvserver/rangefeed/task.go
line 253 at r4 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
I see. So we may operate on incorrect information, but that will cause this txnPushAttempt to degrade to the current behavior. That seems fine, but +1 on documenting -- other uses of the flag may not handle being lied to as gracefully.
Added some more NBs on the IntentResolver and TxnPusher APIs.
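The gist of those notes is presumably something like the following (a paraphrase on a sketched interface method, not the exact wording or signature in the PR):

```go
// PushTxns pushes the given transactions to the given timestamp.
//
// The boolean return value indicates whether any of the returned ABORTED
// statuses may be ambiguous, i.e. the txn record may already have been GCed
// after a commit. NB: pre-24.1 leaseholders never set this flag, so in
// mixed-version clusters a false value does not prove the abort is real;
// callers must degrade to the old, conservative behavior rather than rely
// on it for correctness.
PushTxns(ctx context.Context, txns []*enginepb.TxnMeta, ts hlc.Timestamp) ([]*roachpb.Transaction, bool, error)
```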
pkg/kv/kvserver/rangefeed/task.go
line 363 at r9 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Were you planning to address this TODO in this initial PR?
I'll submit a separate PR for this tomorrow, and clean up some of the related rangefeed shutdown code.
Reviewed 2 of 15 files at r10, 6 of 7 files at r11, 5 of 5 files at r12, 3 of 3 files at r13, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @erikgrinaker)
117801: kvserver: acquire `Replica.mu` when returning reproposal error r=erikgrinaker a=erikgrinaker

Discovered while working on #117612. Looks like it's been there since #42939. This is a rare error, and these fields are unlikely to change while we're holding the raft mutex, so seems very unlikely to have caused any problems -- I could be convinced we shouldn't backport this, on the off-chance I've missed a potential deadlock.

Epic: none

Release note: None

117833: *: un-skip many tests that were skipped over the last month r=rail a=rickystewart

This partially reverts the following commits:

```
e615aec
99884f8
99f9760
75c91c2
5c64a22
c1778b7
381c96b
f9b20ae
8608499
eb71a83
337777f
```

These tests were skipped due to failures under RBE, typically under `deadlock` and `race`. With the addition of the `heavy` pool for these tests, we anticipate the decreased load will cause these tests to stop failing, therefore we un-skip the skipped tests.

Epic: CRDB-8308

Release note: None

Co-authored-by: Erik Grinaker <[email protected]>
Co-authored-by: Ricky Stewart <[email protected]>
Force-pushed from d105a98 to 92009d7.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @nvanbenschoten)
pkg/kv/kvserver/rangefeed/task.go
line 363 at r9 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
I'll submit a separate PR for this tomorrow, and clean up some of the related rangefeed shutdown code.
Force-pushed from f37ce33 to 3f4d373.
Your pull request contains more than 1000 changes. It is strongly encouraged to split big PRs into smaller chunks.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.
Force-pushed from 3f4d373 to 381bd68.
Reviewed 1 of 4 files at r20, 1 of 5 files at r22, 3 of 36 files at r24, 27 of 30 files at r25, 3 of 6 files at r26, 3 of 3 files at r28, 3 of 3 files at r29, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @erikgrinaker)
pkg/kv/kvserver/rangefeed/task.go
line 38 at r27 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
I wonder if we should make this a cluster setting instead. While I don't expect this to cause problems, if it does, it's a lot less hassle for operators. Then again, adding cluster settings for every little thing (even if temporary) is going to snowball pretty fast. Wdyt?
A cluster setting seems reasonable here. There may be other reasons why we'd want the configurability of being able to disable this dynamically.
Force-pushed from 58a1834 to 6c8d455.
It was possible for rangefeeds to emit a premature checkpoint, before all writes below its timestamp had been emitted. This in turn would cause changefeeds to not emit these write events at all.

This could happen because `PushTxn` may return a false `ABORTED` status for a transaction that has in fact been committed, if the transaction record has been GCed (after resolving all intents). The timestamp cache does not retain sufficient information to disambiguate a committed transaction from an aborted one in this case, so it pessimistically assumes an abort (see `Replica.CanCreateTxnRecord` and `batcheval.SynthesizeTxnFromMeta`).

However, the rangefeed txn pusher trusted this `ABORTED` status, ignoring the pending txn intents and allowing the resolved timestamp to advance past them before emitting the committed intents. This can lead to the following scenario:

- A rangefeed is running on a lagging follower.
- A txn writes an intent, which is replicated to the follower.
- The closed timestamp advances past the intent.
- The txn commits and resolves the intent at the original write timestamp, then GCs its txn record. This is not yet applied on the follower.
- The rangefeed pushes the txn to advance its resolved timestamp.
- The txn is GCed, so the push returns ABORTED (it can't know whether the txn was committed or aborted after its record is GCed).
- The rangefeed disregards the "aborted" txn and advances the resolved timestamp, emitting a checkpoint.
- The follower applies the resolved intent and emits an event below the checkpoint, violating the checkpoint guarantee.
- The changefeed sees an event below its frontier and drops it, never emitting these events at all.

This patch fixes the bug by submitting a barrier command to the leaseholder which waits for all past and ongoing writes (including intent resolution) to complete and apply, and then waits for the local replica to apply the barrier as well. This ensures any committed intent resolution will be applied and emitted before the transaction is removed from resolved timestamp tracking.

Epic: none

Release note (bug fix): fixed a bug where a changefeed could omit events in rare cases, logging the error "cdc ux violation: detected timestamp ... that is less or equal to the local frontier". This can happen if a rangefeed runs on a follower replica that lags significantly behind the leaseholder, a transaction commits and removes its transaction record before its intent resolution is applied on the follower, the follower's closed timestamp has advanced past the transaction commit timestamp, and the rangefeed attempts to push the transaction to a new timestamp (at least 10 seconds after the transaction began). This may cause the rangefeed to prematurely emit a checkpoint before emitting writes at lower timestamps, which in turn may cause the changefeed to drop these events entirely, never emitting them.
Epic: none Release note: None
Force-pushed from 6c8d455 to 6d92cef.
TFTRs!

bors r+
Build succeeded:
Encountered an error creating backports. Some common things that can go wrong:
You might need to create your backport manually using the backport tool.

error creating merge commit from aa5e353 to blathers/backport-release-23.1-117612: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict [] you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 23.1.x failed. See errors above.

error creating merge commit from aa5e353 to blathers/backport-release-23.2-117612: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict [] you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 23.2.x failed. See errors above.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.
👍
This PR depends on the following changes, which should be backported together:

- `BarrierRequest` as `isAlone` #117787
- `BarrierRequest.WithLeaseAppliedIndex` #117967
- `Replica.WaitForLeaseAppliedIndex()` #117968
- `PushTxnResponse.AmbiguousAbort` #117969

It was possible for rangefeeds to emit a premature checkpoint, before all writes below its timestamp had been emitted. This in turn would cause changefeeds to not emit these write events at all.

This could happen because `PushTxn` may return a false `ABORTED` status for a transaction that has in fact been committed, if the transaction record has been GCed (after resolving all intents). The timestamp cache does not retain sufficient information to disambiguate a committed transaction from an aborted one in this case, so it pessimistically assumes an abort (see `Replica.CanCreateTxnRecord` and `batcheval.SynthesizeTxnFromMeta`).

However, the rangefeed txn pusher trusted this `ABORTED` status, ignoring the pending txn intents and allowing the resolved timestamp to advance past them before emitting the committed intents. This can lead to the following scenario:

- A rangefeed is running on a lagging follower.
- A txn writes an intent, which is replicated to the follower.
- The closed timestamp advances past the intent.
- The txn commits and resolves the intent at the original write timestamp, then GCs its txn record. This is not yet applied on the follower.
- The rangefeed pushes the txn to advance its resolved timestamp.
- The txn is GCed, so the push returns ABORTED (it can't know whether the txn was committed or aborted after its record is GCed).
- The rangefeed disregards the "aborted" txn and advances the resolved timestamp, emitting a checkpoint.
- The follower applies the resolved intent and emits an event below the checkpoint, violating the checkpoint guarantee.
- The changefeed sees an event below its frontier and drops it, never emitting these events at all.

This patch fixes the bug by submitting a barrier command to the leaseholder which waits for all past and ongoing writes (including intent resolution) to complete and apply, and then waits for the local replica to apply the barrier as well. This ensures any committed intent resolution will be applied and emitted before the transaction is removed from resolved timestamp tracking.
Resolves #104309.
Epic: none
Release note (bug fix): fixed a bug where a changefeed could omit events in rare cases, logging the error "cdc ux violation: detected timestamp ... that is less or equal to the local frontier". This can happen if a rangefeed runs on a follower replica that lags significantly behind the leaseholder, a transaction commits and removes its transaction record before its intent resolution is applied on the follower, the follower's closed timestamp has advanced past the transaction commit timestamp, and the rangefeed attempts to push the transaction to a new timestamp (at least 10 seconds after the transaction began). This may cause the rangefeed to prematurely emit a checkpoint before emitting writes at lower timestamps, which in turn may cause the changefeed to drop these events entirely, never emitting them.
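To summarize the mechanism described above in code form (a condensed, illustrative sketch; the helper names are drawn from this thread and are assumptions, not the verbatim implementation):

```go
// When a pushed txn comes back ABORTED but the abort may be ambiguous (its
// record could have been GCed after a commit), don't drop the txn from
// resolved timestamp tracking right away. First make sure every prior write
// on the range, including the txn's intent resolution, has applied locally.
func (a *txnPushAttempt) handleAmbiguousAbort(ctx context.Context, txn *roachpb.Transaction) error {
	// 1. Barrier on the leaseholder: waits for all past and ongoing writes on
	//    the range to complete and apply, and reports the resulting LAI.
	_, lai, err := a.barrierWithLAI(ctx, a.span) // hypothetical helper, see the sketch earlier in the thread
	if err != nil {
		return err
	}
	// 2. Wait for the local (possibly lagging follower) replica to catch up to
	//    that LAI, so any committed intent resolution has been applied and the
	//    corresponding MVCCCommitIntentOps have been published to the rangefeed.
	if err := a.waitForLeaseAppliedIndex(ctx, lai); err != nil { // hypothetical helper
		return err
	}
	// 3. Only now is it safe to publish an MVCCAbortTxnOp and let the resolved
	//    timestamp advance past the txn's earlier intents.
	return a.publishAbort(ctx, txn) // hypothetical helper
}
```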