spanconfiglimiterccl: TestDataDriven/indexes is timing out #90764
cc: @irfansharif / @arulajmani
@irfansharif - Please take a look at this and assign a priority.
Just reporting that it doesn't reproduce so easily. 30+ minutes of stress-running this on my GCE worker, first at the test level and then at the package level, didn't cause it to flake. The failure log above doesn't say much about what timed out, either. This test should only take seconds, and it mostly does: it's using a lot of the test-{cluster,server} machinery underneath, so it's unlikely that this flake is due to anything in spanconfigs proper (which hasn't changed at all since this test started flaking). I'll ask/look around for other test-level timeouts that have been occurring to see if this is just another victim.
Andrew already has a fix -- there was a real bug. |
91019: sqlliveness: encode region in session id r=JeffSwenson a=JeffSwenson

Encode a region enum in the sqlliveness session id. The region will be used to support converting the sqlliveness and sql_instances tables to regional-by-row tables.

This change creates a custom encoding for the session id. The encoding is convenient, as it allows adding a region to the session id without requiring modifications to the jobs table or the crdb_internal.sql_liveness_is_alive built-in. (A hedged sketch of this kind of encoding follows the merge description below.)

The enum.One value is a workaround for the fact that the system database does not include a region enum by default. In the absence of a region enum, enum.One will be used in the session.

Part of #85736

Release note: None

91116: kvcoord: DistSender rangefeed bookkeeping had an off-by-one r=ajwerner a=ajwerner

It turns out that two commits landed about two months apart to address some off-by-one errors stemming from disagreements about whether the bounds of time intervals are inclusive or exclusive. In #79525 we added a Next call to compensate for the catch-up scan occurring at an inclusive time. In #82451 we made the catch-up scan act exclusively, as the rest of the kvserver code has assumed. The end result is that we now actually perform the catch-up scan one tick later than we had intended. This resulted in some flaky tests, and in cases where the closed timestamp pushed a writing transaction, may have resulted in missing rows. This was uncovered while deflaking #90764. With some added logging we see:

```
I221102 01:31:44.444557 1509 kv/kvclient/kvcoord/dist_sender_rangefeed.go:667 [nsql1,rangefeed=lease,dest_n=1,dest_s=1,dest_r=53] 3882 RangeFeedEvent: span:<key:"\376\222\213" end_key:"\376\222\214" > resolved_ts:<wall_time:166735270430458388 >
E221102 01:31:44.445042 1509 kv/kvclient/kvcoord/dist_sender_rangefeed.go:653 [nsql1,rangefeed=lease,dest_n=1,dest_s=1,dest_r=53] 3886 RangeFeedError: retry rangefeed (REASON_RANGE_SPLIT)
I221102 01:31:44.480676 2388 sql/internal.go:1321 [nsql1,job=810294652971450369,scExec,id=106,mutation=1] 3947 txn committed at 1667352704.380458388,1
I221102 01:31:44.485558 1509 kv/kvclient/kvcoord/dist_sender_rangefeed.go:420 [nsql1,rangefeed=lease] 3965 RangeFeed /Tenant/10/Table/{3-4} disconnected with last checkpoint 105.097693ms ago: retry rangefeed (REASON_RANGE_SPLIT)
```

Notice that the commit for the schema change occurred at `1667352704.380458388,1` and the resolved event was at `1667352704.380458388`. As the code was before, we'd perform the catch-up scan at `1667352704.380458388,2` and miss the write we needed to see. (A hedged sketch of this timestamp arithmetic follows below.)

Fixes #90764.

Release note (bug fix): Fixed a bug which, in rare cases, could result in a changefeed missing rows which occur around the time of a split in writing transactions which take longer than the closed timestamp target duration (defaults to 3s).

Co-authored-by: Jeff Swenson <[email protected]>
Co-authored-by: Andrew Werner <[email protected]>
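For illustration of the 91019 change, here is a minimal sketch of the kind of length-prefixed encoding the PR describes. The layout (version byte, region length, region enum bytes, UUID) and the names `MakeSessionID`/`UnsafeDecodeSessionID` are assumptions for the sketch, not CockroachDB's actual implementation:

```go
// Hypothetical sketch: encode a region enum value into a sqlliveness
// session id so the region can be recovered from the id alone, without
// changing the schema of tables that store session ids.
package main

import (
	"bytes"
	"fmt"
)

// SessionID is the opaque byte string stored in system.sqlliveness.
type SessionID []byte

// sessionVersion is an assumed version byte so the format can evolve.
const sessionVersion = byte(1)

// MakeSessionID prepends a version byte and a length-prefixed region
// enum representation to the random session UUID.
func MakeSessionID(region []byte, uuid [16]byte) (SessionID, error) {
	if len(region) == 0 || len(region) > 255 {
		return nil, fmt.Errorf("region must be 1-255 bytes, got %d", len(region))
	}
	var buf bytes.Buffer
	buf.WriteByte(sessionVersion)
	buf.WriteByte(byte(len(region)))
	buf.Write(region)
	buf.Write(uuid[:])
	return SessionID(buf.Bytes()), nil
}

// DecodeSessionID splits a session id back into its region and UUID parts.
func DecodeSessionID(s SessionID) (region, uuid []byte, err error) {
	if len(s) < 2 || s[0] != sessionVersion {
		return nil, nil, fmt.Errorf("malformed session id")
	}
	n := int(s[1])
	if len(s) != 2+n+16 {
		return nil, nil, fmt.Errorf("malformed session id: length %d", len(s))
	}
	return s[2 : 2+n], s[2+n:], nil
}

func main() {
	// enum.One stands in when the system database has no region enum yet;
	// here it is represented as a single placeholder byte.
	region := []byte{1}
	var uuid [16]byte
	id, _ := MakeSessionID(region, uuid)
	r, u, _ := DecodeSessionID(id)
	fmt.Printf("region=%v uuid=%x\n", r, u)
}
```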
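And to make the 91116 off-by-one concrete, here is a minimal sketch (not the CockroachDB implementation) assuming an HLC-style timestamp whose `Next()` bumps the logical component, showing how stacking #79525's extra `Next()` on top of #82451's exclusive catch-up scan skips the commit:

```go
// Sketch of the off-by-one: with exclusive catch-up semantics, starting
// the scan at resolved.Next() means the first delivered event is one tick
// past the commit we needed to see.
package main

import "fmt"

// Timestamp mimics an HLC timestamp: wall-clock nanos plus a logical tick.
type Timestamp struct {
	WallTime int64
	Logical  int32
}

// Next returns the next-highest timestamp, one logical tick later.
func (t Timestamp) Next() Timestamp {
	return Timestamp{WallTime: t.WallTime, Logical: t.Logical + 1}
}

func main() {
	// The rangefeed checkpoint (resolved timestamp) from the log above.
	resolved := Timestamp{WallTime: 1667352704380458388, Logical: 0}
	// The schema change committed one logical tick later.
	commit := resolved.Next()

	// #79525: catch-up scans started at an inclusive time, so restarting
	// at `resolved` would re-deliver it; a Next() was added to skip it.
	start := resolved.Next()

	// #82451: the catch-up scan itself became exclusive (only events
	// strictly greater than the start are delivered), so the Next() above
	// now skips one extra tick.
	firstDelivered := start.Next()

	fmt.Println("commit at:      ", commit)         // {... 1}
	fmt.Println("first delivered:", firstDelivered) // {... 2}: commit missed

	// The fix: with exclusive semantics, start at `resolved` itself; the
	// first delivered event is then the commit.
	fmt.Println("after fix:      ", resolved.Next()) // {... 1}
}
```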
Example build: https://teamcity.cockroachdb.com/viewLog.html?buildId=7166439&tab=buildResultsDiv&buildTypeId=Cockroach_Ci_TestsAwsLinuxArm64_UnitTests
Test history: https://teamcity.cockroachdb.com/project.html?projectId=Cockroach_Ci_TestsAwsLinuxArm64BigVm&buildTypeId=&tab=testDetails&testNameId=-3342716999936030037&order=TEST_STATUS_DESC&branch_Cockroach_Ci_TestsAwsLinuxArm64BigVm=__all_branches__&itemsCount=50
Jira issue: CRDB-20936