release-23.2: rangefeed: fix premature checkpoint due to intent resolution race #118413

erikgrinaker · 2024-01-29T13:28:24Z

Only the last 2 commits should be reviewed here; the earlier commits in this PR are from the following prerequisite backports:

This backport includes new RPC protocol additions that are necessary to fix the bug. The new behavior (applying a Barrier command to wait for a Raft pipeline flush) is controlled by the cluster setting kv.rangefeed.push_txns.barrier.enabled, which defaults to true. According to the backport policy, such changes must be opt-in and disabled by default. However, making a bug fix opt-in appears questionable, especially for a bug as serious as this one. I consider the protocol changes to be low-risk and high-reward, and recommend we backport this behavior enabled by default.

@rharding6373 Assigning you as secondary TL reviewer, since you're the upcoming CDC TL. Let me know if you'd like me to route this elsewhere.

Backport 2/2 commits from #117612. Also pull in parts of #118469 and #118265.

Release justification: fixes a bug which could cause changefeeds to omit events in some scenarios.

/cc @cockroachdb/release

It was possible for rangefeeds to emit a premature checkpoint, before all writes below its timestamp had been emitted. This in turn would cause changefeeds to not emit these write events at all.

This could happen because PushTxn may return a false ABORTED status for a transaction that has in fact been committed, if the transaction record has been GCed (after resolving all intents). The timestamp cache does not retain sufficient information to disambiguate a committed transaction from an aborted one in this case, so it pessimistically assumes an abort (see Replica.CanCreateTxnRecord and batcheval.SynthesizeTxnFromMeta).
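
The ambiguity can be illustrated with a toy model. This is only a sketch under stated assumptions, not the logic in Replica.CanCreateTxnRecord or batcheval.SynthesizeTxnFromMeta: the `store` type, its `finalized` set (a stand-in for the timestamp cache), and the `push` method are hypothetical names introduced for illustration.

```go
package main

import "fmt"

type txnStatus string

const (
	pending   txnStatus = "PENDING"
	committed txnStatus = "COMMITTED"
	aborted   txnStatus = "ABORTED"
)

// store is a hypothetical stand-in for the leaseholder's transaction state.
type store struct {
	records   map[string]txnStatus // txn records that still exist
	finalized map[string]bool      // toy timestamp cache: txn may no longer create a record
}

func newStore() *store {
	return &store{records: map[string]txnStatus{}, finalized: map[string]bool{}}
}

// begin creates a PENDING record for the txn.
func (s *store) begin(id string) { s.records[id] = pending }

// finalizeAndGC sets the final status and then removes the record, as happens
// once all intents are resolved. Only the fact that the txn was finalized
// survives; the outcome itself is lost.
func (s *store) finalizeAndGC(id string, final txnStatus) {
	s.records[id] = final
	delete(s.records, id)
	s.finalized[id] = true
}

// push returns the status a pusher would observe for the txn.
func (s *store) push(id string) txnStatus {
	if st, ok := s.records[id]; ok {
		return st
	}
	if s.finalized[id] {
		// The record is gone and cannot be recreated. Committed and aborted
		// are indistinguishable here, so the toy model assumes an abort.
		return aborted
	}
	return pending
}

func main() {
	s := newStore()
	s.begin("txn1")
	s.finalizeAndGC("txn1", committed)
	fmt.Println(s.push("txn1")) // the txn committed, yet the pusher sees ABORTED
}
```

The point of the sketch is that the status returned by `push` is a function of what survives GC, not of what actually happened to the transaction.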

However, the rangefeed txn pusher trusted this ABORTED status, ignoring the pending txn intents and allowing the resolved timestamp to advance past them before emitting the committed intents. This can lead to the following scenario:

- A rangefeed is running on a lagging follower.
- A txn writes an intent, which is replicated to the follower.
- The closed timestamp advances past the intent.
- The txn commits and resolves the intent at the original write timestamp, then GCs its txn record. This is not yet applied on the follower.
- The rangefeed pushes the txn to advance its resolved timestamp.
- The txn is GCed, so the push returns ABORTED (it can't know whether the txn was committed or aborted after its record is GCed).
- The rangefeed disregards the "aborted" txn and advances the resolved timestamp, emitting a checkpoint.
- The follower applies the resolved intent and emits an event below the checkpoint, violating the checkpoint guarantee.
- The changefeed sees an event below its frontier and drops it, never emitting these events at all.
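
The scenario above can be replayed in a small simulation. This is a hedged sketch rather than rangefeed code: the `follower` type, its methods, and the integer timestamps are all hypothetical, and the model tracks a single intent.

```go
package main

import "fmt"

// follower is a toy model of a rangefeed on a lagging follower replica.
// All names are hypothetical; timestamps are plain integers.
type follower struct {
	closedTS   int            // closed timestamp known to the follower
	resolvedTS int            // last checkpoint emitted by the rangefeed
	intents    map[string]int // txn ID -> write timestamp of its tracked intent
	dropped    []int          // event timestamps that arrived at or below the checkpoint
}

// checkpoint advances the resolved timestamp to the closed timestamp, held
// back to just below the oldest intent that is still tracked.
func (f *follower) checkpoint() {
	ts := f.closedTS
	for _, writeTS := range f.intents {
		if writeTS-1 < ts {
			ts = writeTS - 1
		}
	}
	if ts > f.resolvedTS {
		f.resolvedTS = ts
	}
}

// pushAborted models the pre-fix handling of an ABORTED push result: the txn
// is removed from tracking at once, without waiting for the follower to apply
// the intent resolution.
func (f *follower) pushAborted(txnID string) {
	delete(f.intents, txnID)
	f.checkpoint()
}

// applyCommit models the follower finally applying the committed intent
// resolution and emitting the write at its original timestamp.
func (f *follower) applyCommit(txnID string, writeTS int) {
	delete(f.intents, txnID)
	if writeTS <= f.resolvedTS {
		// The event is at or below the checkpoint, so the changefeed drops it.
		f.dropped = append(f.dropped, writeTS)
	}
}

// replay walks through the steps of the scenario in order.
func replay() *follower {
	f := &follower{intents: map[string]int{}}
	f.intents["txn1"] = 10    // intent replicated to the follower
	f.closedTS = 20           // closed timestamp advances past the intent
	f.checkpoint()            // resolved timestamp is held at 9 by the intent
	f.pushAborted("txn1")     // record was GCed, so the push reports ABORTED
	f.applyCommit("txn1", 10) // follower applies the resolution too late
	return f
}

func main() {
	f := replay()
	fmt.Println(f.resolvedTS, f.dropped) // prints "20 [10]"
}
```

In this model the checkpoint at 20 is emitted before the write at 10, which is exactly the ordering the checkpoint guarantee forbids.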

This patch fixes the bug by submitting a barrier command to the leaseholder which waits for all past and ongoing writes (including intent resolution) to complete and apply, and then waits for the local replica to apply the barrier as well. This ensures any committed intent resolution will be applied and emitted before the transaction is removed from resolved timestamp tracking.
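
A minimal sketch of the fixed ordering, under the same toy assumptions: `barrier`, `applyUpTo`, and `onAbortedPush` are hypothetical helpers, and the slice of `resolution` values stands in for the leaseholder's replicated commands. It does not reproduce the actual Barrier RPC or the rangefeed processor.

```go
package main

import "fmt"

// resolution is a toy replicated command that resolves one committed intent.
type resolution struct {
	txnID   string
	writeTS int
}

// follower is a toy lagging replica; every name here is hypothetical.
type follower struct {
	applied    int            // number of leaseholder commands applied locally
	intents    map[string]int // txn ID -> write timestamp of a tracked intent
	resolvedTS int            // last checkpoint emitted
	emitted    []int          // timestamps of emitted write events, in order
	violations int            // events emitted at or below the checkpoint
}

// applyUpTo applies leaseholder commands until index of them are applied,
// emitting an event for each committed intent.
func (f *follower) applyUpTo(log []resolution, index int) {
	for ; f.applied < index; f.applied++ {
		cmd := log[f.applied]
		delete(f.intents, cmd.txnID)
		if cmd.writeTS <= f.resolvedTS {
			f.violations++
		}
		f.emitted = append(f.emitted, cmd.writeTS)
	}
}

// barrier stands in for the barrier command sent to the leaseholder: it
// returns an index that covers all writes issued so far, including intent
// resolution.
func barrier(log []resolution) int { return len(log) }

// onAbortedPush models the post-fix handling of an ABORTED push result: the
// follower first catches up to the barrier, and only then stops tracking the
// txn and advances the resolved timestamp.
func (f *follower) onAbortedPush(txnID string, log []resolution, closedTS int) {
	f.applyUpTo(log, barrier(log))
	delete(f.intents, txnID) // a truly aborted txn has no resolution to wait for
	if len(f.intents) == 0 && closedTS > f.resolvedTS {
		f.resolvedTS = closedTS
	}
}

// replayFixed runs the earlier race with the barrier in place.
func replayFixed() *follower {
	f := &follower{intents: map[string]int{"txn1": 10}, resolvedTS: 9}
	// Committed and resolved on the leaseholder, not yet applied locally.
	log := []resolution{{txnID: "txn1", writeTS: 10}}
	f.onAbortedPush("txn1", log, 20)
	return f
}

func main() {
	f := replayFixed()
	fmt.Println(f.emitted, f.resolvedTS, f.violations) // prints "[10] 20 0"
}
```

Because the write at 10 is emitted before the checkpoint moves to 20, the toy model records no violations even though the push still reports ABORTED.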

Resolves #104309.
Epic: none
Release note (bug fix): fixed a bug where a changefeed could omit events in rare cases, logging the error "cdc ux violation: detected timestamp ... that is less or equal to the local frontier". This can happen if a rangefeed runs on a follower replica that lags significantly behind the leaseholder, a transaction commits and removes its transaction record before its intent resolution is applied on the follower, the follower's closed timestamp has advanced past the transaction commit timestamp, and the rangefeed attempts to push the transaction to a new timestamp (at least 10 seconds after the transaction began). This may cause the rangefeed to prematurely emit a checkpoint before emitting writes at lower timestamps, which in turn may cause the changefeed to drop these events entirely, never emitting them.

blathers-crl · 2024-01-29T13:28:28Z

blathers-crl · 2024-01-29T13:28:30Z

Your pull request contains more than 1000 changes. It is strongly encouraged to split big PRs into smaller chunks.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

erikgrinaker · 2024-01-29T14:03:45Z

Approved by CTO.

nvanbenschoten · 2024-01-29T19:51:27Z

pkg/kv/kvserver/rangefeed/resolved_timestamp.go

-			rts.resolvedTS, op))
+		// NB: MVCCLogicalOp.String() is only implemented for pointer receiver.
+		err := errors.AssertionFailedf(
+			"resolved timestamp %s equal to or above timestamp of operation %v", rts.resolvedTS, &op)


We're now passing op by reference. Does that cause the argument to escape to the heap and result in a new heap allocation even when the assertion is not firing?

Great catch, especially considering I slipped this in here from #118265 -- thanks. You're probably right, I'll check this tomorrow and either shadow it in the branch or revert to passing by value.

Fixed, also in #118265.
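
For readers following the allocation question, the two options mentioned in the reply can be compared in a small standalone program. This is a sketch, not the rangefeed code: `op` is a hypothetical stand-in for MVCCLogicalOp, and `fmt.Errorf` stands in for `errors.AssertionFailedf`. Escape analysis is compiler-dependent, so the program measures allocations rather than assuming them; with current Go toolchains one would expect the first variant to allocate on every call and the shadowed variant to allocate only when the assertion fires.

```go
package main

import (
	"fmt"
	"testing"
)

// op is a hypothetical stand-in for MVCCLogicalOp. As in the diff above,
// String() is implemented on the pointer receiver only.
type op struct{ key, ts int64 }

func (o *op) String() string { return fmt.Sprintf("op{key=%d ts=%d}", o.key, o.ts) }

// sink keeps the error alive so the compiler cannot discard it.
var sink error

// assertEscaping takes the address of its parameter inside the failure
// branch. The parameter as a whole escapes, so it may be moved to the heap
// on every call, including calls where the assertion does not fire.
//
//go:noinline
func assertEscaping(o op, fail bool) {
	if fail {
		sink = fmt.Errorf("resolved timestamp equal to or above operation %v", &o)
	}
}

// assertShadowed copies the parameter inside the branch, so only the copy
// escapes, and only on the failure path.
//
//go:noinline
func assertShadowed(o op, fail bool) {
	if fail {
		o := o
		sink = fmt.Errorf("resolved timestamp equal to or above operation %v", &o)
	}
}

// measure reports average allocations per call on the non-failing path.
func measure() (escaping, shadowed float64) {
	escaping = testing.AllocsPerRun(1000, func() { assertEscaping(op{key: 1, ts: 2}, false) })
	shadowed = testing.AllocsPerRun(1000, func() { assertShadowed(op{key: 1, ts: 2}, false) })
	return escaping, shadowed
}

func main() {
	escaping, shadowed := measure()
	fmt.Printf("allocs per call: escaping=%v shadowed=%v\n", escaping, shadowed)
}
```

Building with `go build -gcflags=-m` prints the compiler's escape decisions, which is another way to confirm where the heap move happens.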

rharding6373

Thanks for all the work finding, fixing, and backporting the fixes for this issue!

Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @erikgrinaker)

nvanbenschoten

LGTM once the discussion from #118265 is resolved.

yuzefovich · 2024-02-08T20:09:09Z

Just a heads-up that I think this is currently not on the 23.2.1-rc branch.

erikgrinaker · 2024-02-08T20:43:32Z

Thanks for the heads-up, will bump it over.

erikgrinaker requested review from nvanbenschoten and rharding6373 January 29, 2024 13:28

erikgrinaker self-assigned this Jan 29, 2024

erikgrinaker requested a review from a team January 29, 2024 13:28

erikgrinaker requested a review from a team as a code owner January 29, 2024 13:28

blathers-crl bot added the backport label (PRs that are backports to older release branches) Jan 29, 2024

erikgrinaker force-pushed the backport23.2-117612 branch from 2b3746c to 84a81e0 Compare January 29, 2024 14:06

erikgrinaker mentioned this pull request Jan 29, 2024

release-23.1: rangefeed: fix premature checkpoint due to intent resolution race #118415

Merged

nvanbenschoten reviewed Jan 29, 2024

rharding6373 approved these changes Jan 29, 2024

erikgrinaker force-pushed the backport23.2-117612 branch from 84a81e0 to 9a33cbc Compare January 30, 2024 12:14

erikgrinaker mentioned this pull request Jan 30, 2024

release-22.2: rangefeed: fix premature checkpoint due to intent resolution race #118477

Merged

nvanbenschoten mentioned this pull request Jan 31, 2024

rangefeed: improve assertions #118265

Merged

nvanbenschoten approved these changes Jan 31, 2024

erikgrinaker added 2 commits February 1, 2024 19:29

rangefeed: assert intent commits above resolved timestamp

c5792ba

Epic: none Release note: None

erikgrinaker force-pushed the backport23.2-117612 branch from 9a33cbc to c5792ba Compare February 1, 2024 19:30

erikgrinaker merged commit 6019881 into cockroachdb:release-23.2 Feb 1, 2024
5 of 6 checks passed

blathers-crl bot mentioned this pull request Feb 2, 2024

staging-v22.2.18: release-22.2: rangefeed: fix premature checkpoint due to intent resolution race #118633

Merged

msbutler mentioned this pull request Feb 21, 2024

roachtest: c2c/tpcc/warehouses=1000/duration=60/cutover=30 failed [stalled rangefeed] #119333

Closed
