
spanconfiglimiterccl: TestDataDriven/indexes is timing out #90764

Closed
adityamaru opened this issue Oct 27, 2022 · 4 comments · Fixed by #91116
Labels: C-bug (Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.), T-sql-foundations (SQL Foundations Team, formerly SQL Schema + SQL Sessions)

Comments

adityamaru added the C-bug and T-kv (KV Team) labels on Oct 27, 2022
@adityamaru (Contributor, Author)

cc: @irfansharif / @arulajmani

@williamkulju

@irfansharif - Please take a look at this and assign a priority

@irfansharif (Contributor) commented Nov 1, 2022

Just reporting that this doesn't reproduce easily: 30+ minutes of stress-running it on my GCE worker, first at the test level and then at the package level, didn't cause it to flake.

[screenshot]

The failure log above doesn't say much about what timed out, either. This test should only take seconds, and it mostly does:

[screenshot]

The test uses a lot of the test-{cluster,server} machinery underneath, so it's unlikely that this flake is due to anything in spanconfigs proper (which hasn't changed at all since this test started flaking). I'll ask/look around for other test-level timeouts that have been occurring to see if this is just another victim.

@irfansharif (Contributor)

Andrew already has a fix -- there was a real bug.

craig bot pushed a commit that referenced this issue Nov 2, 2022
91019: sqlliveness: encode region in session id r=JeffSwenson a=JeffSwenson

Encode a region enum in the sqlliveness session id. The region will be used to support converting the sqlliveness and sql_instances tables to regional-by-row tables.

This change creates a custom encoding for the session id. The encoding is convenient, as it allows adding a region to the session id without requiring modifications to the jobs table or the crdb_internal.sql_liveness_is_alive built-in.

The enum.One value is a workaround for the fact that the system database does not include a region enum by default. In the absence of a region enum, enum.One will be used in the session.

Part of #85736

Release note: None
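
As a rough illustration of the region-in-session-id encoding described above, a region-prefixed session ID could be built along the following lines. This is a minimal sketch for explanation only: the field layout, the single-byte enum.One stand-in, and the use of github.com/google/uuid are assumptions, not the actual sqlliveness implementation.

```go
// Illustrative sketch only: field layout, helper names, and the uuid
// dependency are assumptions, not the actual sqlliveness encoding.
package main

import (
	"encoding/hex"
	"fmt"

	"github.com/google/uuid"
)

// SessionID packs a region enum value alongside a random UUID so that the
// region can be recovered from the ID alone, without touching the jobs table
// or the crdb_internal.sql_liveness_is_alive built-in.
type SessionID []byte

// MakeSessionID prepends a one-byte region length and the region enum bytes
// to the UUID bytes.
func MakeSessionID(region []byte, id uuid.UUID) SessionID {
	out := make([]byte, 0, 1+len(region)+len(id))
	out = append(out, byte(len(region)))
	out = append(out, region...)
	out = append(out, id[:]...)
	return SessionID(out)
}

// Region decodes the region bytes back out of the session ID.
func (s SessionID) Region() []byte {
	n := int(s[0])
	return []byte(s[1 : 1+n])
}

func main() {
	// enum.One stand-in: a single physical byte used when the system database
	// has no region enum of its own.
	enumOne := []byte{0x80}
	sid := MakeSessionID(enumOne, uuid.New())
	fmt.Println(hex.EncodeToString(sid), "region:", hex.EncodeToString(sid.Region()))
}
```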

91116: kvcoord: DistSender rangefeed bookkeeping had an off-by-one r=ajwerner a=ajwerner

It turns out that two commits occurred about two months apart to address some off-by-one errors due to disagreements regarding the inclusivity or exclusivity of the bounds of time intervals. In #79525 we added a next call to compensate for the catch-up scan occurring at an inclusive time. In #82451 we made the catch-up scan act exclusively, like the rest of the kvserver code has assumed. The end result is that we now actually do the catch-up scan one tick later than we had intended.

This resulted in some flaky tests, and in cases where the closed timestamp pushed a writing transaction, may have resulted in missing rows. This was uncovered while deflaking #90764. With some added logging we see:

```
I221102 01:31:44.444557 1509 kv/kvclient/kvcoord/dist_sender_rangefeed.go:667  [nsql1,rangefeed=lease,dest_n=1,dest_s=1,dest_r=53] 3882  RangeFeedEvent: span:<key:"\376\222\213" end_key:"\376\222\214" > resolved_ts:<wall_time:166735270430458388 >
E221102 01:31:44.445042 1509 kv/kvclient/kvcoord/dist_sender_rangefeed.go:653  [nsql1,rangefeed=lease,dest_n=1,dest_s=1,dest_r=53] 3886  RangeFeedError: retry rangefeed (REASON_RANGE_SPLIT)
I221102 01:31:44.480676 2388 sql/internal.go:1321  [nsql1,job=810294652971450369,scExec,id=106,mutation=1] 3947  txn committed at 1667352704.380458388,1
I221102 01:31:44.485558 1509 kv/kvclient/kvcoord/dist_sender_rangefeed.go:420  [nsql1,rangefeed=lease] 3965  RangeFeed /Tenant/10/Table/{3-4} disconnected with last checkpoint 105.097693ms ago: retry rangefeed (REASON_RANGE_SPLIT)
```

Notice that the commit for the schema change occurred at `1667352704.380458388,1` and the resolved event was at `1667352704.380458388`. As the code was before, we'd perform the catch-up scan at `1667352704.380458388,2` and miss the write we needed to see.
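
To make the off-by-one concrete, here is a minimal sketch with a toy timestamp type standing in for hlc.Timestamp; the real kvcoord bookkeeping is more involved, and this only shows the tick arithmetic.

```go
// Toy illustration of the off-by-one; Timestamp is a stand-in for
// hlc.Timestamp and this is not the actual kvcoord bookkeeping code.
package main

import "fmt"

// Timestamp mimics an HLC timestamp: wall clock plus logical tick.
type Timestamp struct {
	Wall    int64
	Logical int32
}

// Next returns the timestamp one logical tick later.
func (t Timestamp) Next() Timestamp { return Timestamp{t.Wall, t.Logical + 1} }

// Less reports whether t orders strictly before o.
func (t Timestamp) Less(o Timestamp) bool {
	return t.Wall < o.Wall || (t.Wall == o.Wall && t.Logical < o.Logical)
}

func main() {
	resolved := Timestamp{1667352704380458388, 0} // last resolved/checkpointed ts
	commit := Timestamp{1667352704380458388, 1}   // the schema change's commit ts

	// An exclusive catch-up scan replays events strictly after `start`.
	replays := func(start Timestamp) bool { return start.Less(commit) }

	// Intended: restart the catch-up scan from the resolved timestamp itself.
	fmt.Println(replays(resolved)) // true: the commit at logical tick 1 is seen

	// Bug: the leftover Next() from #79525 pushes the exclusive bound one tick
	// too far (equivalent to an inclusive scan at logical tick 2), so the
	// commit at logical tick 1 is silently skipped.
	fmt.Println(replays(resolved.Next())) // false: the write is missed
}
```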

Fixes #90764.

Release note (bug fix): Fixed a bug which, in rare cases, could result in a changefeed missing rows that occur around the time of a range split in writing transactions that take longer than the closed timestamp target duration (default 3s).

Co-authored-by: Jeff Swenson <[email protected]>
Co-authored-by: Andrew Werner <[email protected]>
craig bot closed this as completed in 46bbd61 on Nov 2, 2022
blathers-crl bot pushed a commit that referenced this issue Nov 11, 2022
blathers-crl bot pushed a commit that referenced this issue Nov 11, 2022
ajwerner added a commit that referenced this issue Nov 11, 2022
ajwerner added a commit that referenced this issue Nov 11, 2022
exalate-issue-sync bot added the T-sql-foundations (SQL Foundations Team, formerly SQL Schema + SQL Sessions) label and removed the T-sql-schema-deprecated (use T-sql-foundations instead) label on May 10, 2023