kvserver/rangefeed: fix off-by-one in NewCatchUpIterator()
#82451
Conversation
Would be great with some extra eyes on this, to make sure I'm not missing anything. Here's some evidence that the timestamp should be exclusive:

cockroach/pkg/kv/kvserver/replica_rangefeed.go, lines 174 to 175 at 0a54b3d
cockroach/pkg/kv/kvserver/replica_rangefeed.go, lines 238 to 240 at 0a54b3d
👍 My reading of the evidence you provided is the same. Probably good to have another set of eyes though since I'm the one that got this wrong.
My best guess at reconstructing what happened here is that during initial development, I hadn't yet dug into which timestamps were inclusive vs exclusive. So, to err on the right side of things and get things working, I did:
```go
// Start time is exclusive
StartTime: args.Timestamp.Prev()
```
With the comment just being a reminder to me that StartTime was exclusive, not indicating that the overall choice was correct. That then hung around until code review when we recommended the longer comment that is there today, but the comment was based on the assumption that the original choice was purposeful.
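For illustration, here's a minimal sketch of the two lower-bound choices, using a simplified stand-in for `hlc.Timestamp` (the real type lives in `pkg/util/hlc`; the field names and `prev` logic here are illustrative):

```go
package main

import (
	"fmt"
	"math"
)

// ts is a simplified stand-in for hlc.Timestamp: a wall time plus a
// logical tick, ordered lexicographically.
type ts struct{ wall, logical int64 }

// prev returns the highest timestamp strictly below t, mirroring the
// role of hlc.Timestamp.Prev() in the snippet above.
func (t ts) prev() ts {
	if t.logical > 0 {
		return ts{t.wall, t.logical - 1}
	}
	return ts{t.wall - 1, math.MaxInt64}
}

func main() {
	reqTS := ts{wall: 100, logical: 0} // RangeFeedRequest.Header.Timestamp

	// Old (buggy) iterator lower bound: assumed reqTS was inclusive,
	// so it stepped back one tick to simulate exclusive semantics.
	oldStart := reqTS.prev()

	// Fixed lower bound: reqTS is already exclusive, so use it as-is.
	newStart := reqTS

	fmt.Println(oldStart, newStart) // {99 9223372036854775807} {100 0}
}
```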
Force-pushed from abfe38c to d92c0ca.
Something tells me that changefeeds might be broken by this. I'm not saying it's wrong or anything like that, just that we assumed the rangefeed start time was inclusive. Elsewhere in the code, interfaces such as start/endTS assume that start is inclusive and end is exclusive.
Right. Note that this PR doesn't change any behavior. I think the exclusive semantics do make sense: if we've checkpointed at time T, then there's no point emitting events at T when the client resumes at T, because those have already been emitted. This is the same as with e.g. incremental backups. But we need to be careful with off-by-ones. In the KV API, the start timestamp is taken from the general `RangeFeedRequest.Header.Timestamp` field.
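To make the checkpoint argument concrete, a small hedged sketch (integer timestamps and names are illustrative, not changefeed code):

```go
package main

import "fmt"

type event struct {
	ts  int64
	key string
}

// resume returns the events a client should receive after restarting
// from a checkpoint at time t. With an exclusive start time, events at
// exactly t are skipped: they were already emitted before the
// checkpoint, just like in incremental backups.
func resume(t int64, log []event) []event {
	var out []event
	for _, e := range log {
		if e.ts > t { // strictly greater than the checkpoint
			out = append(out, e)
		}
	}
	return out
}

func main() {
	log := []event{{9, "a"}, {10, "b"}, {11, "c"}}
	// Checkpointed at T=10: "b" (at exactly T) is not re-emitted.
	fmt.Println(resume(10, log)) // [{11 c}]
}
```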
Another case where we attempted to fix an "off by 1" error: if you're saying that we have always had a bug that assumed startTS is inclusive, then... this makes me sad. @ajwerner, do you have any recollections on this intricate timestamp management?
Gosh, this is bad; at the higher level we've used it as inclusive. It's led to all sorts of other issues. Indeed, the lines linked by @miretskiy matter. I do agree that the exclusive semantics may make sense. We just need to be consistent.
Let's do that in a separate PR. This one doesn't change anything about the rangefeed API; it just fixes an internal inconsistency that had no externally visible effect. Fixing the client code to use the API correctly (or alternatively changing the API) seems orthogonal. I just stumbled across this while implementing MVCC range tombstone support.
Ack. Perhaps an issue, or just an immediate follow-on PR, so that we don't forget.
`NewCatchUpIterator` created the `MVCCIncrementalIterator` with `RangeFeedRequest.Header.Timestamp.Prev()`, because it assumed the given timestamp was inclusive. However, the rest of the rangefeed code treats this timestamp as exclusive. This patch uses the correct, exclusive timestamp when creating the `MVCCIncrementalIterator`.

This change has no externally visible effect: even though the previous code would cause `MVCCIncrementalIterator` to emit keys at `Timestamp`, the `CatchUpScan()` method would discard those events if they were at or below `Timestamp`.

An integration test is also added to verify that the start timestamp of a rangefeed catch-up scan is exclusive. The test passes with both the new and the old code, as expected.

Release note: None
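As a hedged sketch of why the old bound was invisible to clients (the filter below illustrates the discard behavior the message attributes to `CatchUpScan()`; it is not the literal code):

```go
package main

import "fmt"

type ts struct{ wall, logical int64 }

// lessEq reports whether a is at or below b, mirroring the role of
// hlc.Timestamp.LessEq in this illustration.
func (a ts) lessEq(b ts) bool {
	return a.wall < b.wall || (a.wall == b.wall && a.logical <= b.logical)
}

type kv struct {
	t   ts
	key string
}

func main() {
	startTS := ts{100, 0} // exclusive start of the catch-up scan

	// With the old Prev() bound, the incremental iterator could also
	// surface a key at startTS itself ("a" below).
	iterOutput := []kv{{ts{100, 0}, "a"}, {ts{100, 1}, "b"}}

	for _, e := range iterOutput {
		if e.t.lessEq(startTS) {
			continue // discarded: at or below the exclusive start
		}
		fmt.Println("emit", e.key) // only "b" is emitted either way
	}
}
```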
Force-pushed from d92c0ca to a24130d.
I added an integration test that confirms the exclusive semantics all the way through. I've also added a commit here that liberally sprinkles comments about the exclusive nature of the start time around various rangefeed APIs and components, and also renames a bunch of parameters to make this clear. PTAL.

Opened #82488 to audit and fix clients that use inclusive semantics.
TFTRs! bors r=miretskiy
Missed a linter failure. bors r-
Canceled. |
Force-pushed from 5f0f3df to eaf0e49.
rangefeed: emphasize that start time is exclusive

This patch adds comments for rangefeed-related APIs and components emphasizing that the start timestamp of rangefeeds is exclusive. In other words, the first possible emitted event (including catch-up scans) will be at `RangeFeedRequest.Header.Timestamp.Next()`. Several parameters are also renamed to emphasize this.

There are no functional or semantic changes.

Touches #82488.

Release note: None
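In terms of a simplified timestamp type, the exclusive-start contract reduces to this (a hedged sketch; `next` mirrors the role of `hlc.Timestamp.Next()`):

```go
package main

import "fmt"

type ts struct{ wall, logical int64 }

// next returns the lowest timestamp strictly above t, mirroring the
// role of hlc.Timestamp.Next().
func (t ts) next() ts { return ts{t.wall, t.logical + 1} }

func main() {
	// RangeFeedRequest.Header.Timestamp is exclusive: nothing at or
	// below it is emitted.
	start := ts{wall: 100, logical: 0}

	// The first possible event timestamp, including for catch-up
	// scans, is therefore start.next().
	fmt.Println(start.next()) // {100 1}
}
```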
Force-pushed from eaf0e49 to a24a7bb.
bors r=miretskiy
Build succeeded.
Address multiple sources of flakes in changefeed tests.

cockroachdb#83530 made a change to ensure that changefeeds do not fail when they are in a transient (e.g. pause-requested) state. Unfortunately, that PR made a mistake: even if the checkpoint could not be completed because the changefeed is in the "pause requested" state, we would still proceed to emit a resolved event. This is wrong; the resolved event should never be emitted if we failed to checkpoint.

In addition, alter changefeed can be used to add new tables to an existing changefeed, with initial scan. In such cases, the newly added table will emit events as of the timestamp of the "alter changefeed" statement. When this happens, the semantics around resolved events are murky, as documented in cockroachdb#84102. Address this issue by making the cloud storage sink more permissive in its handling of resolved timestamps.

When completing the initial scan for newly added tables, fix an "off by 1" error where the frontier was advanced to the next timestamp. This was wrong since cockroachdb#82451 clarified that the rangefeed start time is exclusive.

Informs cockroachdb#83882
Fixes cockroachdb#83946

Release Notes: None
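A hedged sketch of the frontier fix described above (illustrative names and values, not the actual changefeed frontier code):

```go
package main

import "fmt"

type ts struct{ wall, logical int64 }

func (t ts) next() ts { return ts{t.wall, t.logical + 1} }

func main() {
	// Time at which the initial scan of a newly added table completed
	// (illustrative value).
	scanTS := ts{wall: 200, logical: 0}

	// Under the exclusive-start semantics clarified in #82451, a
	// rangefeed resumed at scanTS already emits only events strictly
	// after scanTS, so the frontier should advance to scanTS itself.
	frontier := scanTS

	// The old code advanced to scanTS.next() instead: the "off by 1"
	// above, which would silently skip an event at exactly that tick.
	oldFrontier := scanTS.next()

	fmt.Println(frontier, oldFrontier) // {200 0} {200 1}
}
```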
It turns out that two commits occurred about two months apart to address some off-by-one errors due to disagreements regarding the inclusivity or exclusivity of the bounds of time intervals. In cockroachdb#79525 we added a `Next()` call to compensate for the catch-up scan occurring at an inclusive time. In cockroachdb#82451 we made the catch-up scan act exclusively, as the rest of the kvserver code has assumed. The end result is that we now actually do the catch-up scan one tick later than we had intended. This resulted in some flaky tests and, in cases where the closed timestamp pushed a writing transaction, may have resulted in missing rows. This was uncovered while deflaking cockroachdb#90764. With some added logging we see:

```
I221102 01:31:44.444557 1509 kv/kvclient/kvcoord/dist_sender_rangefeed.go:667 [nsql1,rangefeed=lease,dest_n=1,dest_s=1,dest_r=53] 3882 RangeFeedEvent: span:<key:"\376\222\213" end_key:"\376\222\214" > resolved_ts:<wall_time:166735270430458388 >
E221102 01:31:44.445042 1509 kv/kvclient/kvcoord/dist_sender_rangefeed.go:653 [nsql1,rangefeed=lease,dest_n=1,dest_s=1,dest_r=53] 3886 RangeFeedError: retry rangefeed (REASON_RANGE_SPLIT)
I221102 01:31:44.480676 2388 sql/internal.go:1321 [nsql1,job=810294652971450369,scExec,id=106,mutation=1] 3947 txn committed at 1667352704.380458388,1
I221102 01:31:44.485558 1509 kv/kvclient/kvcoord/dist_sender_rangefeed.go:420 [nsql1,rangefeed=lease] 3965 RangeFeed /Tenant/10/Table/{3-4} disconnected with last checkpoint 105.097693ms ago: retry rangefeed (REASON_RANGE_SPLIT)
```

Notice that the commit for the schema change occurred at `1667352704.380458388,1` and the resolved event was at `1667352704.380458388`. As the code was before, we'd perform the catch-up scan at `1667352704.380458388,2` and miss the write we needed to see.

Fixes cockroachdb#90764.

Release note (bug fix): Fixed a bug which, in rare cases, could result in a changefeed missing rows which occur around the time of a split in writing transactions which take longer than the closed timestamp target duration (defaults to 3s).
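A minimal sketch of the double bump described above, using the timestamps from the log (simplified timestamp type; this mirrors the reasoning, not the DistSender code):

```go
package main

import "fmt"

type ts struct{ wall, logical int64 }

func (t ts) next() ts { return ts{t.wall, t.logical + 1} }

func main() {
	// Last resolved (checkpoint) timestamp from the log above.
	resolved := ts{wall: 1667352704380458388, logical: 0}

	// cockroachdb#79525: the client bumped the restart time, believing
	// the server treated it as inclusive.
	restart := resolved.next() // logical 1

	// cockroachdb#82451: the server treats the start time as exclusive,
	// so the first event it can emit is restart.next().
	firstVisible := restart.next() // logical 2

	// The schema change committed at logical 1 falls in the gap and is
	// missed; the fix drops the client-side bump.
	fmt.Println(restart, firstVisible) // {... 1} {... 2}
}
```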