release-23.1.15-rc: rangefeed: fix premature checkpoint due to intent resolution race #119270

erikgrinaker · 2024-02-15T22:50:57Z

Clean backport from release-23.1.

Backports:

2/2 commits from rangefeed: fix premature checkpoint due to intent resolution race #117612.
1/1 commits from batcheval: add PushTxnResponse.AmbiguousAbort #117969.
1/1 commits from kvserver: add Replica.WaitForLeaseAppliedIndex() #117968.
2/2 commits from batcheval: add BarrierRequest.WithLeaseAppliedIndex #117967.
1/1 commits from kvpb: mark BarrierRequest as isAlone #117787.
Parts of rangefeed: use MakeTestingClusterSettings() #118469 and rangefeed: improve assertions #118265.

Release justification: fixes a bug which could cause changefeeds to omit events in some scenarios.

/cc @cockroachdb/release

It was possible for rangefeeds to emit a premature checkpoint, before all writes below its timestamp had been emitted. This in turn would cause changefeeds to not emit these write events at all.

This could happen because PushTxn may return a false ABORTED status for a transaction that has in fact been committed, if the transaction record has been GCed (after resolving all intents). The timestamp cache does not retain sufficient information to disambiguate a committed transaction from an aborted one in this case, so it pessimistically
assumes an abort (see Replica.CanCreateTxnRecord and batcheval.SynthesizeTxnFromMeta).

However, the rangefeed txn pusher trusted this ABORTED status, ignoring the pending txn intents and allowing the resolved timestamp to advance past them before emitting the committed intents. This can lead to the following scenario:

A rangefeed is running on a lagging follower.
A txn writes an intent, which is replicated to the follower.
The closed timestamp advances past the intent.
The txn commits and resolves the intent at the original write timestamp, then GCs its txn record. This is not yet applied on the follower.
The rangefeed pushes the txn to advance its resolved timestamp.
The txn is GCed, so the push returns ABORTED (it can't know whether the txn was committed or aborted after its record is GCed).
The rangefeed disregards the "aborted" txn and advances the resolved timestamp, emitting a checkpoint.
The follower applies the resolved intent and emits an event below the checkpoint, violating the checkpoint guarantee.
The changefeed sees an event below its frontier and drops it, never emitting these events at all.

This patch fixes the bug by submitting a barrier command to the leaseholder which waits for all past and ongoing writes (including intent resolution) to complete and apply, and then waits for the local replica to apply the barrier as well. This ensures any committed intent resolution will be applied and emitted before the transaction is removed from resolved timestamp tracking.

Resolves #104309.
Epic: none
Release note (bug fix): fixed a bug where a changefeed could omit events in rare cases, logging the error "cdc ux violation: detected timestamp ... that is less or equal to the local frontier". This can happen if a rangefeed runs on a follower replica that lags significantly behind the leaseholder, a transaction commits and removes its transaction record before its intent resolution is applied on the follower, the follower's closed timestamp has advanced past the transaction commit timestamp, and the rangefeed attempts to push the transaction to a new timestamp (at least 10 seconds after the transaction began). This may cause the rangefeed to prematurely emit a checkpoint before emitting writes at lower timestamps, which in turn may cause the changefeed to drop these events entirely, never emitting them.

Otherwise, `BatchRequest.RequiresConsensus()` may return `false` and not submit the barrier through Raft. Similarly, `shouldWaitOnLatchesWithoutAcquiring` will return `false` so it will contend with later writes. Barriers are not used in recent releases, so this does not have any mixed-version concerns. Epic: none Release note: None

This can be used to detect whether a replica has applied the barrier command yet. Epic: none Release note: None

This only executes random `Barrier` requests, but does not verify that the barrier guarantees are actually satisfied (i.e. that all past and concurrent writes are applied before it returns). At least we get some execution coverage, and verify that it does not have negative interactions with other operations. Epic: none Release note: None

This allows a caller to wait for a replica to reach a certain lease applied index. Similar functionality elsewhere is not migrated yet, out of caution. Epic: none Release note: None

This indicates to the caller that the `ABORTED` status of the pushed transaction is ambiguous, and the transaction may in fact have been committed and GCed already. This information is also plumbed through the `IntentResolver` txn push APIs. Epic: none Release note: None

It was possible for rangefeeds to emit a premature checkpoint, before all writes below its timestamp had been emitted. This in turn would cause changefeeds to not emit these write events at all. This could happen because `PushTxn` may return a false `ABORTED` status for a transaction that has in fact been committed, if the transaction record has been GCed (after resolving all intents). The timestamp cache does not retain sufficient information to disambiguate a committed transaction from an aborted one in this case, so it pessimistically assumes an abort (see `Replica.CanCreateTxnRecord` and `batcheval.SynthesizeTxnFromMeta`). However, the rangefeed txn pusher trusted this `ABORTED` status, ignoring the pending txn intents and allowing the resolved timestamp to advance past them before emitting the committed intents. This can lead to the following scenario: - A rangefeed is running on a lagging follower. - A txn writes an intent, which is replicated to the follower. - The closed timestamp advances past the intent. - The txn commits and resolves the intent at the original write timestamp, then GCs its txn record. This is not yet applied on the follower. - The rangefeed pushes the txn to advance its resolved timestamp. - The txn is GCed, so the push returns ABORTED (it can't know whether the txn was committed or aborted after its record is GCed). - The rangefeed disregards the "aborted" txn and advances the resolved timestamp, emitting a checkpoint. - The follower applies the resolved intent and emits an event below the checkpoint, violating the checkpoint guarantee. - The changefeed sees an event below its frontier and drops it, never emitting these events at all. This patch fixes the bug by submitting a barrier command to the leaseholder which waits for all past and ongoing writes (including intent resolution) to complete and apply, and then waits for the local replica to apply the barrier as well. This ensures any committed intent resolution will be applied and emitted before the transaction is removed from resolved timestamp tracking. Epic: none Release note (bug fix): fixed a bug where a changefeed could omit events in rare cases, logging the error "cdc ux violation: detected timestamp ... that is less or equal to the local frontier". This can happen if a rangefeed runs on a follower replica that lags significantly behind the leaseholder, a transaction commits and removes its transaction record before its intent resolution is applied on the follower, the follower's closed timestamp has advanced past the transaction commit timestamp, and the rangefeed attempts to push the transaction to a new timestamp (at least 10 seconds after the transaction began). This may cause the rangefeed to prematurely emit a checkpoint before emitting writes at lower timestamps, which in turn may cause the changefeed to drop these events entirely, never emitting them.

Epic: none Release note: None

blathers-crl · 2024-02-15T22:51:01Z

Thanks for opening a backport.

Please check the backport criteria before merging:

Backports should only be created for serious
issues or test-only changes.
Backports should not break backwards-compatibility.
Backports should change as little code as possible.
Backports should not change on-disk formats or node communication protocols.
Backports should not add new functionality (except as defined
here).
Backports must not add, edit, or otherwise modify cluster versions; or add version gates.
All backports must be reviewed by the owning areas TL and one additional
TL. For more information as to how that review should be conducted, please consult the backport
policy.

If your backport adds new functionality, please ensure that the following additional criteria are satisfied:

There is a high priority need for the functionality that cannot wait until the next release and is difficult to address in another way.
The new functionality is additive-only and only runs for clusters which have specifically “opted in” to it (e.g. by a cluster setting).
New code is protected by a conditional check that is trivial to verify and ensures that it only runs for opt-in clusters. State changes must be further protected such that nodes running old binaries will not be negatively impacted by the new state (with a mixed version test added).
The PM and TL on the team that owns the changed code have signed off that the change obeys the above rules.
Your backport must be accompanied by a post to the appropriate Slack
channel (#db-backports-point-releases or #db-backports-XX-X-release) for awareness and discussion.

Also, please add a brief release justification to the body of your PR to justify this
backport.

blathers-crl · 2024-02-15T22:51:02Z

Your pull request contains more than 1000 changes. It is strongly encouraged to split big PRs into smaller chunks.

_{🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.}

cockroach-teamcity · 2024-02-15T22:51:16Z

This change is

erikgrinaker · 2024-02-15T22:52:27Z

@nvanbenschoten is out of office I believe, so I'll take @rharding6373's approval as authoritative here, considering we've backported this a ton of times already to various releases.

rharding6373

Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @nvanbenschoten)

erikgrinaker added 7 commits February 15, 2024 22:48

batcheval: add BarrierRequest.WithLeaseAppliedIndex

b201a8b

This can be used to detect whether a replica has applied the barrier command yet. Epic: none Release note: None

kvserver: add Replica.WaitForLeaseAppliedIndex()

49a1261

This allows a caller to wait for a replica to reach a certain lease applied index. Similar functionality elsewhere is not migrated yet, out of caution. Epic: none Release note: None

rangefeed: assert intent commits above resolved timestamp

1045f2d

Epic: none Release note: None

erikgrinaker requested review from nvanbenschoten and rharding6373 February 15, 2024 22:50

erikgrinaker self-assigned this Feb 15, 2024

erikgrinaker requested a review from a team February 15, 2024 22:50

erikgrinaker requested a review from a team as a code owner February 15, 2024 22:50

blathers-crl bot added the backport Label PR's that are backports to older release branches label Feb 15, 2024

rharding6373 approved these changes Feb 15, 2024

View reviewed changes

rharding6373 merged commit 69a32fa into cockroachdb:release-23.1.15-rc Feb 16, 2024
6 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

release-23.1.15-rc: rangefeed: fix premature checkpoint due to intent resolution race #119270

release-23.1.15-rc: rangefeed: fix premature checkpoint due to intent resolution race #119270

erikgrinaker commented Feb 15, 2024

blathers-crl bot commented Feb 15, 2024

blathers-crl bot commented Feb 15, 2024

cockroach-teamcity commented Feb 15, 2024

erikgrinaker commented Feb 15, 2024

rharding6373 left a comment

release-23.1.15-rc: rangefeed: fix premature checkpoint due to intent resolution race #119270

release-23.1.15-rc: rangefeed: fix premature checkpoint due to intent resolution race #119270

Conversation

erikgrinaker commented Feb 15, 2024

blathers-crl bot commented Feb 15, 2024

blathers-crl bot commented Feb 15, 2024

cockroach-teamcity commented Feb 15, 2024

erikgrinaker commented Feb 15, 2024

rharding6373 left a comment

Choose a reason for hiding this comment