kvserver/rangefeed: fix off-by-one in NewCatchUpIterator()
#82451
Conversation
Would be great with some extra eyes on this, to make sure I'm not missing anything. Here's some evidence that the timestamp should be exclusive:

cockroach/pkg/kv/kvserver/replica_rangefeed.go, lines 174 to 175 at 0a54b3d
cockroach/pkg/kv/kvserver/replica_rangefeed.go, lines 238 to 240 at 0a54b3d
👍 My reading of the evidence you provided is the same. Probably good to have another set of eyes though since I'm the one that got this wrong.
My best guess at reconstructing what happened here is that during initial development, I hadn't yet dug into which timestamps were inclusive vs exclusive. So, to err on the right side of things and get things working, I did:
```go
// Start time is exclusive
StartTime: args.Timestamp.Prev()
```
With the comment just being a reminder to me that StartTime was exclusive, not indicating that the overall choice was correct. That then hung around until code review when we recommended the longer comment that is there today, but the comment was based on the assumption that the original choice was purposeful.
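For illustration, here's a minimal sketch of the two lower-bound choices, using a simplified stand-in for `hlc.Timestamp` (the real type lives in `pkg/util/hlc`; the field names and `prev` logic here are illustrative):

```go
package main

import (
	"fmt"
	"math"
)

// ts is a simplified stand-in for hlc.Timestamp: a wall time plus a
// logical tick, ordered lexicographically.
type ts struct{ wall, logical int64 }

// prev returns the highest timestamp strictly below t, mirroring the
// role of hlc.Timestamp.Prev() in the snippet above.
func (t ts) prev() ts {
	if t.logical > 0 {
		return ts{t.wall, t.logical - 1}
	}
	return ts{t.wall - 1, math.MaxInt64}
}

func main() {
	reqTS := ts{wall: 100, logical: 0} // RangeFeedRequest.Header.Timestamp

	// Old (buggy) iterator lower bound: assumed reqTS was inclusive,
	// so it stepped back one tick to simulate exclusive semantics.
	oldStart := reqTS.prev()

	// Fixed lower bound: reqTS is already exclusive, so use it as-is.
	newStart := reqTS

	fmt.Println(oldStart, newStart) // {99 9223372036854775807} {100 0}
}
```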
Force-pushed from abfe38c to d92c0ca.
Something tells me that changefeeds might be broken by this. I'm not saying it's wrong or anything like that, just that we assumed the rangefeed start time was inclusive. Elsewhere in the code, interfaces such as start/endTS assume that start is inclusive and end is exclusive.
Right. Note that this PR doesn't change any behavior. I think the exclusive semantics do make sense: if we've checkpointed at time T, then there's no point emitting events at T when the client resumes at T, because those have already been emitted. This is the same as with e.g. incremental backups. But we need to be careful with off-by-ones. In the KV API, the start timestamp is taken from the general `RangeFeedRequest.Header.Timestamp` field.
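To make the checkpoint argument concrete, a small hedged sketch (integer timestamps and names are illustrative, not changefeed code):

```go
package main

import "fmt"

type event struct {
	ts  int64
	key string
}

// resume returns the events a client should receive after restarting
// from a checkpoint at time t. With an exclusive start time, events at
// exactly t are skipped: they were already emitted before the
// checkpoint, just like in incremental backups.
func resume(t int64, log []event) []event {
	var out []event
	for _, e := range log {
		if e.ts > t { // strictly greater than the checkpoint
			out = append(out, e)
		}
	}
	return out
}

func main() {
	log := []event{{9, "a"}, {10, "b"}, {11, "c"}}
	// Checkpointed at T=10: "b" (at exactly T) is not re-emitted.
	fmt.Println(resume(10, log)) // [{11 c}]
}
```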
Another case where we attempted to fix an "off by 1" error: if you're saying that we have always had a bug that assumed startTS is inclusive, then... this makes me sad. @ajwerner, do you have any recollections on this intricate timestamp management?
Gosh, this is bad; at the higher level we've used it as inclusive. It's led to all sorts of other issues. Indeed, the lines linked by @miretskiy matter. I do agree that the exclusive semantics may make sense. We just need to be consistent.
Let's do that in a separate PR. This one doesn't change anything about the rangefeed API; it just fixes an internal inconsistency that had no externally visible effect. Fixing the client code to use the API correctly (or alternatively changing the API) seems orthogonal. I just stumbled across this while implementing MVCC range tombstone support.
Ack. Perhaps an issue, or just an immediate follow-on PR, so that we don't forget.
`NewCatchUpIterator` created the `MVCCIncrementalIterator` with `RangeFeedRequest.Header.Timestamp.Prev()`, because it assumed the given timestamp was inclusive. However, the rest of the rangefeed code treats this timestamp as exclusive. This patch uses the correct, exclusive timestamp when creating the `MVCCIncrementalIterator`.

This change has no externally visible effect: even though the previous code would cause `MVCCIncrementalIterator` to emit keys at `Timestamp`, the `CatchUpScan()` method would discard those events if they were at or below `Timestamp`.

An integration test is also added to verify that the start timestamp of a rangefeed catch-up scan is exclusive. The test passes with both the new and the old code, as expected.

Release note: None
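As a hedged sketch of why the old bound was invisible to clients (the filter below illustrates the discard behavior the message attributes to `CatchUpScan()`; it is not the literal code):

```go
package main

import "fmt"

type ts struct{ wall, logical int64 }

// lessEq reports whether a is at or below b, mirroring the role of
// hlc.Timestamp.LessEq in this illustration.
func (a ts) lessEq(b ts) bool {
	return a.wall < b.wall || (a.wall == b.wall && a.logical <= b.logical)
}

type kv struct {
	t   ts
	key string
}

func main() {
	startTS := ts{100, 0} // exclusive start of the catch-up scan

	// With the old Prev() bound, the incremental iterator could also
	// surface a key at startTS itself ("a" below).
	iterOutput := []kv{{ts{100, 0}, "a"}, {ts{100, 1}, "b"}}

	for _, e := range iterOutput {
		if e.t.lessEq(startTS) {
			continue // discarded: at or below the exclusive start
		}
		fmt.Println("emit", e.key) // only "b" is emitted either way
	}
}
```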
Force-pushed from d92c0ca to a24130d.
I added an integration test that confirms the exclusive semantics all the way through. I've also added a commit here that liberally sprinkles comments about the exclusive nature of the start time around various rangefeed APIs and components, and also renames a bunch of parameters to make this clear. PTAL.

Opened #82488 to audit and fix clients that use inclusive semantics.
TFTRs! bors r=miretskiy
Missed a linter failure. bors r-
Canceled. |
Force-pushed from 5f0f3df to eaf0e49.
rangefeed: emphasize that start time is exclusive

This patch adds comments for rangefeed-related APIs and components emphasizing that the start timestamp of rangefeeds is exclusive. In other words, the first possible emitted event (including catch-up scans) will be at `RangeFeedRequest.Header.Timestamp.Next()`. Several parameters are also renamed to emphasize this.

There are no functional or semantic changes.

Touches #82488.

Release note: None
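In terms of a simplified timestamp type, the exclusive-start contract reduces to this (a hedged sketch; `next` mirrors the role of `hlc.Timestamp.Next()`):

```go
package main

import "fmt"

type ts struct{ wall, logical int64 }

// next returns the lowest timestamp strictly above t, mirroring the
// role of hlc.Timestamp.Next().
func (t ts) next() ts { return ts{t.wall, t.logical + 1} }

func main() {
	// RangeFeedRequest.Header.Timestamp is exclusive: nothing at or
	// below it is emitted.
	start := ts{wall: 100, logical: 0}

	// The first possible event timestamp, including for catch-up
	// scans, is therefore start.next().
	fmt.Println(start.next()) // {100 1}
}
```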
Force-pushed from eaf0e49 to a24a7bb.
bors r=miretskiy
Build succeeded.
Address multiple sources of flakes in changefeed tests.

cockroachdb#83530 made a change to ensure that changefeeds do not fail when they are in a transient (e.g. pause-requested) state. Unfortunately, that PR made a mistake: even if the checkpoint could not be completed because the changefeed is in the "pause requested" state, we would still proceed to emit a resolved event. This is wrong; the resolved event should never be emitted if we failed to checkpoint.

In addition, alter changefeed can be used to add new tables to an existing changefeed, with initial scan. In such cases, the newly added table will emit events as of the timestamp of the "alter changefeed" statement. When this happens, the semantics around resolved events are murky, as documented in cockroachdb#84102. Address this issue by making the cloud storage sink more permissive in its handling of resolved timestamps.

When completing the initial scan for newly added tables, fix an "off by 1" error where the frontier was advanced to the next timestamp. This was wrong since cockroachdb#82451 clarified that the rangefeed start time is exclusive.

Informs cockroachdb#83882
Fixes cockroachdb#83946

Release Notes: None
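A hedged sketch of the frontier fix described above (illustrative names and values, not the actual changefeed frontier code):

```go
package main

import "fmt"

type ts struct{ wall, logical int64 }

func (t ts) next() ts { return ts{t.wall, t.logical + 1} }

func main() {
	// Time at which the initial scan of a newly added table completed
	// (illustrative value).
	scanTS := ts{wall: 200, logical: 0}

	// Under the exclusive-start semantics clarified in #82451, a
	// rangefeed resumed at scanTS already emits only events strictly
	// after scanTS, so the frontier should advance to scanTS itself.
	frontier := scanTS

	// The old code advanced to scanTS.next() instead: the "off by 1"
	// above, which would silently skip an event at exactly that tick.
	oldFrontier := scanTS.next()

	fmt.Println(frontier, oldFrontier) // {200 0} {200 1}
}
```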
It turns out that two commits occurred about two months apart to address some off-by-one errors due to disagreements regarding the inclusivity or exclusivity of the bounds of time intervals. In cockroachdb#79525 we added a `Next()` call to compensate for the catch-up scan occurring at an inclusive time. In cockroachdb#82451 we made the catch-up scan act exclusively, as the rest of the kvserver code has assumed. The end result is that we now actually do the catch-up scan one tick later than we had intended. This resulted in some flaky tests and, in cases where the closed timestamp pushed a writing transaction, may have resulted in missing rows. This was uncovered while deflaking cockroachdb#90764. With some added logging we see:

```
I221102 01:31:44.444557 1509 kv/kvclient/kvcoord/dist_sender_rangefeed.go:667 [nsql1,rangefeed=lease,dest_n=1,dest_s=1,dest_r=53] 3882 RangeFeedEvent: span:<key:"\376\222\213" end_key:"\376\222\214" > resolved_ts:<wall_time:166735270430458388 >
E221102 01:31:44.445042 1509 kv/kvclient/kvcoord/dist_sender_rangefeed.go:653 [nsql1,rangefeed=lease,dest_n=1,dest_s=1,dest_r=53] 3886 RangeFeedError: retry rangefeed (REASON_RANGE_SPLIT)
I221102 01:31:44.480676 2388 sql/internal.go:1321 [nsql1,job=810294652971450369,scExec,id=106,mutation=1] 3947 txn committed at 1667352704.380458388,1
I221102 01:31:44.485558 1509 kv/kvclient/kvcoord/dist_sender_rangefeed.go:420 [nsql1,rangefeed=lease] 3965 RangeFeed /Tenant/10/Table/{3-4} disconnected with last checkpoint 105.097693ms ago: retry rangefeed (REASON_RANGE_SPLIT)
```

Notice that the commit for the schema change occurred at `1667352704.380458388,1` and the resolved event was at `1667352704.380458388`. As the code was before, we'd perform the catch-up scan at `1667352704.380458388,2` and miss the write we needed to see.

Fixes cockroachdb#90764.

Release note (bug fix): Fixed a bug which, in rare cases, could result in a changefeed missing rows which occur around the time of a split in writing transactions which take longer than the closed timestamp target duration (defaults to 3s).
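A minimal sketch of the double bump described above, using the timestamps from the log (simplified timestamp type; this mirrors the reasoning, not the DistSender code):

```go
package main

import "fmt"

type ts struct{ wall, logical int64 }

func (t ts) next() ts { return ts{t.wall, t.logical + 1} }

func main() {
	// Last resolved (checkpoint) timestamp from the log above.
	resolved := ts{wall: 1667352704380458388, logical: 0}

	// cockroachdb#79525: the client bumped the restart time, believing
	// the server treated it as inclusive.
	restart := resolved.next() // logical 1

	// cockroachdb#82451: the server treats the start time as exclusive,
	// so the first event it can emit is restart.next().
	firstVisible := restart.next() // logical 2

	// The schema change committed at logical 1 falls in the gap and is
	// missed; the fix drops the client-side bump.
	fmt.Println(restart, firstVisible) // {... 1} {... 2}
}
```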