rangefeed: mux rangefeeds failing in metamorphic test builds #100783

AlexTalks · 2023-04-06T03:09:46Z

During investigation of #99207, it was discovered that whenever mux range feeds were enabled in metamorphic test runs, the test would fail, the kvserver test TestReplicateQueueDecommissioningNonVoters/remove would fail, observing the following messages in the logs:

I230405 23:35:41.812451 205024 kv/kvclient/kvcoord/dist_sender_mux_rangefeed.go:212 ⋮ [T1,n1,rangefeed=‹spanconfig-subscriber›] 308544  MuxRangeFeed starting for range ‹/Table/47/{1-2}›@1680737707.726498812,0 (rangeID 49, attempt 0)
I230405 23:35:41.812503 205378 kv/kvclient/kvcoord/dist_sender_mux_rangefeed.go:212 ⋮ [T1,n1,rangefeed=‹tenant-settings-watcher›] 308546  MuxRangeFeed starting for range ‹/Table/5{0-1}›@1680737700.374403140,0 (rangeID 51, attempt 0)
I230405 23:35:41.812503 205363 kv/kvclient/kvcoord/dist_sender_mux_rangefeed.go:212 ⋮ [T1,n1,job=‹AUTO SPAN CONFIG RECONCILIATION id=854154456477892609›,rangefeed=‹sql-watcher-protected-ts-records-rangefeed›] 308547  MuxRangeFeed starting for range ‹/Table/3{2-3}›@1680737704.854354224,0 (rangeID 34, attempt 0)
I230405 23:35:41.812527 205359 kv/kvclient/kvcoord/dist_sender_mux_rangefeed.go:471 ⋮ [T1,n3,rangefeed=‹tenant-settings-watcher›] 308548  RangeFeed ‹/Table/5{0-1}›@1680737734.040746992,0 disconnected with last checkpoint 466871h35m41.81252653s ago: r51 was not found on s3
I230405 23:35:41.812551 204734 kv/kvclient/kvcoord/dist_sender_mux_rangefeed.go:471 ⋮ [T1,n1,rangefeed=‹spanconfig-subscriber›] 308549  RangeFeed ‹/Table/47/{1-2}›@1680737707.726498812,0 disconnected with last checkpoint 466871h35m41.81255031s ago: r49 was not found on s1
I230405 23:35:41.812574 205379 kv/kvclient/kvcoord/dist_sender_mux_rangefeed.go:471 ⋮ [T1,n1,rangefeed=‹tenant-settings-watcher›] 308550  RangeFeed ‹/Table/5{0-1}›@1680737700.374403140,0 disconnected with last checkpoint 466871h35m41.812573013s ago: r51 was not found on s1
I230405 23:35:41.812574 205364 kv/kvclient/kvcoord/dist_sender_mux_rangefeed.go:471 ⋮ [T1,n1,job=‹AUTO SPAN CONFIG RECONCILIATION id=854154456477892609›,rangefeed=‹sql-watcher-protected-ts-records-rangefeed›] 308551  RangeFeed ‹/Table/3{2-3}›@1680737704.854354224,0 disconnected with last checkpoint 466871h35m41.812572923s ago: r34 was not found on s1

It seems that when the metamorphic test enabled mux rangefeeds, there were frequent errors likely due to the failure to apply the updated span config received from the rangefeed. This error was seen as follows:

=== RUN   TestReplicateQueueDecommissioningNonVoters/remove
    replicate_queue_test.go:793: 
        	Error Trace:	github.com/cockroachdb/cockroach/pkg/kv/kvserver_test/pkg/kv/kvserver/replicate_queue_test.go:793
        	Error:      	Condition never satisfied
        	Test:       	TestReplicateQueueDecommissioningNonVoters/remove
    --- FAIL: TestReplicateQueueDecommissioningNonVoters/remove (68.62s)

This ticket is to resolve these errors, as well as to reenable TestReplicateQueueDecommissioningNonVoters (and/or any others) under metamorphic builds.

Jira issue: CRDB-26619

Epic CRDB-28879

The text was updated successfully, but these errors were encountered:

blathers-crl · 2023-04-06T03:09:49Z

cc @cockroachdb/cdc

AlexTalks · 2023-04-06T03:10:33Z

While not a GA blocker, I would assume we'd want to backport this to 23.1 as well.

AlexTalks · 2023-04-06T03:36:46Z

To be fair, this may not be entirely due to the mux rangefeed as I am able to run this test successfully while setting useMuxRangeFeed = true in rangefeed.go, however it does seem to fail faster in stress. It does seem clear, however, that disabling metamorphic builds eliminates the flake in TestReplicateQueueDecommissioningNonVoters, after having stressed it for several hours without failure once it was skipped under metamorphic builds.

@miretskiy, are the log messages seen above actually benign? They look like errors at first, but perhaps this issue requires more investigation if the mux rangefeed logs are simply a red herring.

miretskiy · 2023-04-06T13:53:18Z

I'm currently trying to work through mux rangefeed issues. It's been slow.

This modifies `TestReplicateQueueDecommissioningNonVoters` to be skipped specifically under metamorphic builds because this may enable mux range feeds, which are not currently working in these test builds (cockroachdb#100783). Once this issue is resolved and the test can work successfully with mux range feeds, this check should be removed. Fixes: cockroachdb#99207 Release note: None

100785: kvserver: skip replicate queue tests under metamorphic builds r=AlexTalks a=AlexTalks This modifies `TestReplicateQueueDecommissioningNonVoters` to be skipped specifically under metamorphic builds because this may enable mux range feeds, which are not currently working in these test builds (#100783). Once this issue is resolved and the test can work successfully with mux range feeds, this check should be removed. Fixes: #99207 Release note: None Co-authored-by: Alex Sarkesian <[email protected]>

This modifies `TestReplicateQueueDecommissioningNonVoters` to be skipped specifically under metamorphic builds because this may enable mux range feeds, which are not currently working in these test builds (#100783). Once this issue is resolved and the test can work successfully with mux range feeds, this check should be removed. Fixes: #99207 Release note: None

Similar to the issue found in cockroachdb#99207, it was discovered that `TestDecommission` is also failing due to cockroachdb#100783 - under metamorphic builds, mux range feeds (among other settings) are enabled, which cause span config updates (which depend on range feeds) to fail. As can be seen in the test, this causes expectations on things such as the replication factor applied to a range to fail due to being out-of-date. Until cockroachdb#100783 is fixed, this changes `TestDecommission` to be skipped under metamorphic builds; stressing this test has shown that it succeeds repeatedly when in a non-metamorphic build. Fixes: cockroachdb#96630 Release note: None

Fix error handling and retries when restarting rangefeeds. A big difference between regular rangefeed, and mux rangefeed, is regular rangefeed has a dedicated go routine per range. This go routine is responsible for running a rangefeed, handling its errors, management of state pertaining to the RPC call (transport, retry information, routing information, etc), and restarting rangefeed with backoff as needed. Mux rangefeed, on the other hand, is not "loop" based. Prior to this PR, mux rangefeed, when it encountered a transient error, would loose a lot of the restart state mentioned above. For example, it would loose the transport information, so that the restart would run against the same node as before, resulting, potentially, in busy loops. Those busy loops (where the RPC call is restarted against the same node/replica that just experienced an error), would tend to make test flaky since they would take longer time to converge to the state expected by the tests (such as `TestDecommission`) test. This PR fixes this loss of state pertaining to single range restart by associating this state with the long lived `activeMuxRangeFeed` state. Fixes cockroachdb#96630 Fixes cockroachdb#100783 Informs cockroachdb#99631 Informs cockroachdb#101614 Release note: None

102094: kvcoord: Fix error handling and retries in mux rangefeed client r=miretskiy a=miretskiy Fix error handling and retries when restarting rangefeeds. A big difference between regular rangefeed, and mux rangefeed, is regular rangefeed has a dedicated go routine per range. This go routine is responsible for running a rangefeed, handling its errors, management of state pertaining to the RPC call (transport, retry information, routing information, etc), and restarting rangefeed with backoff as needed. Mux rangefeed, on the other hand, is not "loop" based. Prior to this PR, mux rangefeed, when it encountered a transient error, would loose a lot of the restart state mentioned above. For example, it would loose the transport information, so that the restart would run against the same node as before, resulting, potentially, in busy loops. Those busy loops (where the RPC call is restarted against the same node/replica that just experienced an error), would tend to make test flaky since they would take longer time to converge to the state expected by the tests (such as `TestDecommission`) test. This PR fixes this loss of state pertaining to single range restart by associating this state with the long lived `activeMuxRangeFeed` state. Fixes #96630 Fixes #100783 Informs #99631 Informs #101614 Release note: None Co-authored-by: Yevgeniy Miretskiy <[email protected]>

Fix error handling and retries when restarting rangefeeds. A big difference between regular rangefeed, and mux rangefeed, is regular rangefeed has a dedicated go routine per range. This go routine is responsible for running a rangefeed, handling its errors, management of state pertaining to the RPC call (transport, retry information, routing information, etc), and restarting rangefeed with backoff as needed. Mux rangefeed, on the other hand, is not "loop" based. Prior to this PR, mux rangefeed, when it encountered a transient error, would loose a lot of the restart state mentioned above. For example, it would loose the transport information, so that the restart would run against the same node as before, resulting, potentially, in busy loops. Those busy loops (where the RPC call is restarted against the same node/replica that just experienced an error), would tend to make test flaky since they would take longer time to converge to the state expected by the tests (such as `TestDecommission`) test. This PR fixes this loss of state pertaining to single range restart by associating this state with the long lived `activeMuxRangeFeed` state. Fixes cockroachdb#96630 Fixes cockroachdb#100783 Informs cockroachdb#99631 Informs cockroachdb#101614 Release note: None

AlexTalks added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. C-test-failure Broken test (automatically or manually discovered). T-cdc labels Apr 6, 2023

blathers-crl bot added the A-cdc Change Data Capture label Apr 6, 2023

AlexTalks mentioned this issue Apr 6, 2023

kvserver: skip replicate queue tests under metamorphic builds #100785

Merged

miretskiy self-assigned this Apr 6, 2023

blathers-crl bot mentioned this issue Apr 14, 2023

release-23.1: kvserver: skip replicate queue tests under metamorphic builds #101576

Merged

This was referenced Apr 19, 2023

kvserver: skip decommissioning test under metamorphic builds #101882

Closed

kv/kvserver: TestDecommission failed #96630

Closed

miretskiy mentioned this issue Apr 23, 2023

kvcoord: Fix error handling and retries in mux rangefeed client #102094

Merged

craig bot closed this as completed in c0a5c30 Apr 26, 2023

This was referenced May 10, 2023

kv/kvserver: TestPromoteNonVoterInAddVoter failed #101519

Closed

kvserver,allocator: disable mux rangefeed on tests using a test cluster #103057

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

rangefeed: mux rangefeeds failing in metamorphic test builds #100783

rangefeed: mux rangefeeds failing in metamorphic test builds #100783

AlexTalks commented Apr 6, 2023 •

edited by exalate-issue-sync bot

Loading

blathers-crl bot commented Apr 6, 2023

AlexTalks commented Apr 6, 2023

AlexTalks commented Apr 6, 2023

miretskiy commented Apr 6, 2023

rangefeed: mux rangefeeds failing in metamorphic test builds #100783

rangefeed: mux rangefeeds failing in metamorphic test builds #100783

Comments

AlexTalks commented Apr 6, 2023 • edited by exalate-issue-sync bot Loading

blathers-crl bot commented Apr 6, 2023

AlexTalks commented Apr 6, 2023

AlexTalks commented Apr 6, 2023

miretskiy commented Apr 6, 2023

AlexTalks commented Apr 6, 2023 •

edited by exalate-issue-sync bot

Loading