Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rangefeed: mux rangefeeds failing in metamorphic test builds #100783

Closed
AlexTalks opened this issue Apr 6, 2023 · 4 comments · Fixed by #102094
Closed

rangefeed: mux rangefeeds failing in metamorphic test builds #100783

AlexTalks opened this issue Apr 6, 2023 · 4 comments · Fixed by #102094
Assignees
Labels
A-cdc Change Data Capture C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. C-test-failure Broken test (automatically or manually discovered). T-cdc

Comments

@AlexTalks
Copy link
Contributor

AlexTalks commented Apr 6, 2023

During investigation of #99207, it was discovered that whenever mux range feeds were enabled in metamorphic test runs, the test would fail, the kvserver test TestReplicateQueueDecommissioningNonVoters/remove would fail, observing the following messages in the logs:

I230405 23:35:41.812451 205024 kv/kvclient/kvcoord/dist_sender_mux_rangefeed.go:212 ⋮ [T1,n1,rangefeed=‹spanconfig-subscriber›] 308544  MuxRangeFeed starting for range ‹/Table/47/{1-2}›@1680737707.726498812,0 (rangeID 49, attempt 0)
I230405 23:35:41.812503 205378 kv/kvclient/kvcoord/dist_sender_mux_rangefeed.go:212 ⋮ [T1,n1,rangefeed=‹tenant-settings-watcher›] 308546  MuxRangeFeed starting for range ‹/Table/5{0-1}›@1680737700.374403140,0 (rangeID 51, attempt 0)
I230405 23:35:41.812503 205363 kv/kvclient/kvcoord/dist_sender_mux_rangefeed.go:212 ⋮ [T1,n1,job=‹AUTO SPAN CONFIG RECONCILIATION id=854154456477892609›,rangefeed=‹sql-watcher-protected-ts-records-rangefeed›] 308547  MuxRangeFeed starting for range ‹/Table/3{2-3}›@1680737704.854354224,0 (rangeID 34, attempt 0)
I230405 23:35:41.812527 205359 kv/kvclient/kvcoord/dist_sender_mux_rangefeed.go:471 ⋮ [T1,n3,rangefeed=‹tenant-settings-watcher›] 308548  RangeFeed ‹/Table/5{0-1}›@1680737734.040746992,0 disconnected with last checkpoint 466871h35m41.81252653s ago: r51 was not found on s3
I230405 23:35:41.812551 204734 kv/kvclient/kvcoord/dist_sender_mux_rangefeed.go:471 ⋮ [T1,n1,rangefeed=‹spanconfig-subscriber›] 308549  RangeFeed ‹/Table/47/{1-2}›@1680737707.726498812,0 disconnected with last checkpoint 466871h35m41.81255031s ago: r49 was not found on s1
I230405 23:35:41.812574 205379 kv/kvclient/kvcoord/dist_sender_mux_rangefeed.go:471 ⋮ [T1,n1,rangefeed=‹tenant-settings-watcher›] 308550  RangeFeed ‹/Table/5{0-1}›@1680737700.374403140,0 disconnected with last checkpoint 466871h35m41.812573013s ago: r51 was not found on s1
I230405 23:35:41.812574 205364 kv/kvclient/kvcoord/dist_sender_mux_rangefeed.go:471 ⋮ [T1,n1,job=‹AUTO SPAN CONFIG RECONCILIATION id=854154456477892609›,rangefeed=‹sql-watcher-protected-ts-records-rangefeed›] 308551  RangeFeed ‹/Table/3{2-3}›@1680737704.854354224,0 disconnected with last checkpoint 466871h35m41.812572923s ago: r34 was not found on s1

It seems that when the metamorphic test enabled mux rangefeeds, there were frequent errors likely due to the failure to apply the updated span config received from the rangefeed. This error was seen as follows:

=== RUN   TestReplicateQueueDecommissioningNonVoters/remove
    replicate_queue_test.go:793: 
        	Error Trace:	github.com/cockroachdb/cockroach/pkg/kv/kvserver_test/pkg/kv/kvserver/replicate_queue_test.go:793
        	Error:      	Condition never satisfied
        	Test:       	TestReplicateQueueDecommissioningNonVoters/remove
    --- FAIL: TestReplicateQueueDecommissioningNonVoters/remove (68.62s)

This ticket is to resolve these errors, as well as to reenable TestReplicateQueueDecommissioningNonVoters (and/or any others) under metamorphic builds.

Jira issue: CRDB-26619

Epic CRDB-28879

@AlexTalks AlexTalks added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. C-test-failure Broken test (automatically or manually discovered). T-cdc labels Apr 6, 2023
@blathers-crl
Copy link

blathers-crl bot commented Apr 6, 2023

cc @cockroachdb/cdc

@blathers-crl blathers-crl bot added the A-cdc Change Data Capture label Apr 6, 2023
@AlexTalks
Copy link
Contributor Author

While not a GA blocker, I would assume we'd want to backport this to 23.1 as well.

@AlexTalks
Copy link
Contributor Author

To be fair, this may not be entirely due to the mux rangefeed as I am able to run this test successfully while setting useMuxRangeFeed = true in rangefeed.go, however it does seem to fail faster in stress. It does seem clear, however, that disabling metamorphic builds eliminates the flake in TestReplicateQueueDecommissioningNonVoters, after having stressed it for several hours without failure once it was skipped under metamorphic builds.

@miretskiy, are the log messages seen above actually benign? They look like errors at first, but perhaps this issue requires more investigation if the mux rangefeed logs are simply a red herring.

@miretskiy miretskiy self-assigned this Apr 6, 2023
@miretskiy
Copy link
Contributor

I'm currently trying to work through mux rangefeed issues. It's been slow.

AlexTalks added a commit to AlexTalks/cockroach that referenced this issue Apr 14, 2023
This modifies `TestReplicateQueueDecommissioningNonVoters` to be skipped
specifically under metamorphic builds because this may enable mux range
feeds, which are not currently working in these test builds (cockroachdb#100783).
Once this issue is resolved and the test can work successfully with mux
range feeds, this check should be removed.

Fixes: cockroachdb#99207

Release note: None
craig bot pushed a commit that referenced this issue Apr 14, 2023
100785: kvserver: skip replicate queue tests under metamorphic builds r=AlexTalks a=AlexTalks

This modifies `TestReplicateQueueDecommissioningNonVoters` to be skipped specifically under metamorphic builds because this may enable mux range feeds, which are not currently working in these test builds (#100783). Once this issue is resolved and the test can work successfully with mux range feeds, this check should be removed.

Fixes: #99207

Release note: None

Co-authored-by: Alex Sarkesian <[email protected]>
blathers-crl bot pushed a commit that referenced this issue Apr 14, 2023
This modifies `TestReplicateQueueDecommissioningNonVoters` to be skipped
specifically under metamorphic builds because this may enable mux range
feeds, which are not currently working in these test builds (#100783).
Once this issue is resolved and the test can work successfully with mux
range feeds, this check should be removed.

Fixes: #99207

Release note: None
AlexTalks added a commit to AlexTalks/cockroach that referenced this issue Apr 19, 2023
Similar to the issue found in cockroachdb#99207, it was discovered that
`TestDecommission` is also failing due to cockroachdb#100783 - under metamorphic
builds, mux range feeds (among other settings) are enabled, which cause
span config updates (which depend on range feeds) to fail. As can be
seen in the test, this causes expectations on things such as the
replication factor applied to a range to fail due to being out-of-date.
Until cockroachdb#100783 is fixed, this changes `TestDecommission` to be skipped
under metamorphic builds; stressing this test has shown that it succeeds
repeatedly when in a non-metamorphic build.

Fixes: cockroachdb#96630

Release note: None
miretskiy pushed a commit to miretskiy/cockroach that referenced this issue Apr 23, 2023
Fix error handling and retries when restarting rangefeeds.
A big difference between regular rangefeed, and mux rangefeed,
is regular rangefeed has a dedicated go routine per range.
This go routine is responsible for running a rangefeed, handling
its errors, management of state pertaining to the RPC call
(transport, retry information, routing information, etc), and
restarting rangefeed with backoff as needed.

Mux rangefeed, on the other hand, is not "loop" based.  Prior
to this PR, mux rangefeed, when it encountered a transient error,
would loose a lot of the restart state mentioned above.  For example,
it would loose the transport information, so that the restart would
run against the same node as before, resulting, potentially, in
busy loops.  Those busy loops (where the RPC call is restarted
against the same node/replica that just experienced an error), would
tend to make test flaky since they would take longer time to converge
to the state expected by the tests (such as `TestDecommission`) test.

This PR fixes this loss of state pertaining to single range restart
by associating this state with the long lived `activeMuxRangeFeed` state.

Fixes cockroachdb#96630
Fixes cockroachdb#100783
Informs cockroachdb#99631
Informs cockroachdb#101614

Release note: None
miretskiy pushed a commit to miretskiy/cockroach that referenced this issue Apr 24, 2023
Fix error handling and retries when restarting rangefeeds.
A big difference between regular rangefeed, and mux rangefeed,
is regular rangefeed has a dedicated go routine per range.
This go routine is responsible for running a rangefeed, handling
its errors, management of state pertaining to the RPC call
(transport, retry information, routing information, etc), and
restarting rangefeed with backoff as needed.

Mux rangefeed, on the other hand, is not "loop" based.  Prior
to this PR, mux rangefeed, when it encountered a transient error,
would loose a lot of the restart state mentioned above.  For example,
it would loose the transport information, so that the restart would
run against the same node as before, resulting, potentially, in
busy loops.  Those busy loops (where the RPC call is restarted
against the same node/replica that just experienced an error), would
tend to make test flaky since they would take longer time to converge
to the state expected by the tests (such as `TestDecommission`) test.

This PR fixes this loss of state pertaining to single range restart
by associating this state with the long lived `activeMuxRangeFeed` state.

Fixes cockroachdb#96630
Fixes cockroachdb#100783
Informs cockroachdb#99631
Informs cockroachdb#101614

Release note: None
miretskiy pushed a commit to miretskiy/cockroach that referenced this issue Apr 24, 2023
Fix error handling and retries when restarting rangefeeds.
A big difference between regular rangefeed, and mux rangefeed,
is regular rangefeed has a dedicated go routine per range.
This go routine is responsible for running a rangefeed, handling
its errors, management of state pertaining to the RPC call
(transport, retry information, routing information, etc), and
restarting rangefeed with backoff as needed.

Mux rangefeed, on the other hand, is not "loop" based.  Prior
to this PR, mux rangefeed, when it encountered a transient error,
would loose a lot of the restart state mentioned above.  For example,
it would loose the transport information, so that the restart would
run against the same node as before, resulting, potentially, in
busy loops.  Those busy loops (where the RPC call is restarted
against the same node/replica that just experienced an error), would
tend to make test flaky since they would take longer time to converge
to the state expected by the tests (such as `TestDecommission`) test.

This PR fixes this loss of state pertaining to single range restart
by associating this state with the long lived `activeMuxRangeFeed` state.

Fixes cockroachdb#96630
Fixes cockroachdb#100783
Informs cockroachdb#99631
Informs cockroachdb#101614

Release note: None
miretskiy pushed a commit to miretskiy/cockroach that referenced this issue Apr 24, 2023
Fix error handling and retries when restarting rangefeeds.
A big difference between regular rangefeed, and mux rangefeed,
is regular rangefeed has a dedicated go routine per range.
This go routine is responsible for running a rangefeed, handling
its errors, management of state pertaining to the RPC call
(transport, retry information, routing information, etc), and
restarting rangefeed with backoff as needed.

Mux rangefeed, on the other hand, is not "loop" based.  Prior
to this PR, mux rangefeed, when it encountered a transient error,
would loose a lot of the restart state mentioned above.  For example,
it would loose the transport information, so that the restart would
run against the same node as before, resulting, potentially, in
busy loops.  Those busy loops (where the RPC call is restarted
against the same node/replica that just experienced an error), would
tend to make test flaky since they would take longer time to converge
to the state expected by the tests (such as `TestDecommission`) test.

This PR fixes this loss of state pertaining to single range restart
by associating this state with the long lived `activeMuxRangeFeed` state.

Fixes cockroachdb#96630
Fixes cockroachdb#100783
Informs cockroachdb#99631
Informs cockroachdb#101614

Release note: None
miretskiy pushed a commit to miretskiy/cockroach that referenced this issue Apr 24, 2023
Fix error handling and retries when restarting rangefeeds.
A big difference between regular rangefeed, and mux rangefeed,
is regular rangefeed has a dedicated go routine per range.
This go routine is responsible for running a rangefeed, handling
its errors, management of state pertaining to the RPC call
(transport, retry information, routing information, etc), and
restarting rangefeed with backoff as needed.

Mux rangefeed, on the other hand, is not "loop" based.  Prior
to this PR, mux rangefeed, when it encountered a transient error,
would loose a lot of the restart state mentioned above.  For example,
it would loose the transport information, so that the restart would
run against the same node as before, resulting, potentially, in
busy loops.  Those busy loops (where the RPC call is restarted
against the same node/replica that just experienced an error), would
tend to make test flaky since they would take longer time to converge
to the state expected by the tests (such as `TestDecommission`) test.

This PR fixes this loss of state pertaining to single range restart
by associating this state with the long lived `activeMuxRangeFeed` state.

Fixes cockroachdb#96630
Fixes cockroachdb#100783
Informs cockroachdb#99631
Informs cockroachdb#101614

Release note: None
miretskiy pushed a commit to miretskiy/cockroach that referenced this issue Apr 25, 2023
Fix error handling and retries when restarting rangefeeds.
A big difference between regular rangefeed, and mux rangefeed,
is regular rangefeed has a dedicated go routine per range.
This go routine is responsible for running a rangefeed, handling
its errors, management of state pertaining to the RPC call
(transport, retry information, routing information, etc), and
restarting rangefeed with backoff as needed.

Mux rangefeed, on the other hand, is not "loop" based.  Prior
to this PR, mux rangefeed, when it encountered a transient error,
would loose a lot of the restart state mentioned above.  For example,
it would loose the transport information, so that the restart would
run against the same node as before, resulting, potentially, in
busy loops.  Those busy loops (where the RPC call is restarted
against the same node/replica that just experienced an error), would
tend to make test flaky since they would take longer time to converge
to the state expected by the tests (such as `TestDecommission`) test.

This PR fixes this loss of state pertaining to single range restart
by associating this state with the long lived `activeMuxRangeFeed` state.

Fixes cockroachdb#96630
Fixes cockroachdb#100783
Informs cockroachdb#99631
Informs cockroachdb#101614

Release note: None
craig bot pushed a commit that referenced this issue Apr 26, 2023
102094: kvcoord: Fix error handling and retries in mux rangefeed client r=miretskiy a=miretskiy

Fix error handling and retries when restarting rangefeeds. A big difference between regular rangefeed, and mux rangefeed, is regular rangefeed has a dedicated go routine per range. This go routine is responsible for running a rangefeed, handling its errors, management of state pertaining to the RPC call (transport, retry information, routing information, etc), and restarting rangefeed with backoff as needed.

Mux rangefeed, on the other hand, is not "loop" based.  Prior to this PR, mux rangefeed, when it encountered a transient error, would loose a lot of the restart state mentioned above.  For example, it would loose the transport information, so that the restart would run against the same node as before, resulting, potentially, in busy loops.  Those busy loops (where the RPC call is restarted against the same node/replica that just experienced an error), would tend to make test flaky since they would take longer time to converge to the state expected by the tests (such as `TestDecommission`) test.

This PR fixes this loss of state pertaining to single range restart by associating this state with the long lived `activeMuxRangeFeed` state.

Fixes #96630
Fixes #100783
Informs #99631
Informs #101614

Release note: None

Co-authored-by: Yevgeniy Miretskiy <[email protected]>
@craig craig bot closed this as completed in c0a5c30 Apr 26, 2023
miretskiy pushed a commit to miretskiy/cockroach that referenced this issue Apr 26, 2023
Fix error handling and retries when restarting rangefeeds.
A big difference between regular rangefeed, and mux rangefeed,
is regular rangefeed has a dedicated go routine per range.
This go routine is responsible for running a rangefeed, handling
its errors, management of state pertaining to the RPC call
(transport, retry information, routing information, etc), and
restarting rangefeed with backoff as needed.

Mux rangefeed, on the other hand, is not "loop" based.  Prior
to this PR, mux rangefeed, when it encountered a transient error,
would loose a lot of the restart state mentioned above.  For example,
it would loose the transport information, so that the restart would
run against the same node as before, resulting, potentially, in
busy loops.  Those busy loops (where the RPC call is restarted
against the same node/replica that just experienced an error), would
tend to make test flaky since they would take longer time to converge
to the state expected by the tests (such as `TestDecommission`) test.

This PR fixes this loss of state pertaining to single range restart
by associating this state with the long lived `activeMuxRangeFeed` state.

Fixes cockroachdb#96630
Fixes cockroachdb#100783
Informs cockroachdb#99631
Informs cockroachdb#101614

Release note: None
miretskiy pushed a commit to miretskiy/cockroach that referenced this issue Apr 26, 2023
Fix error handling and retries when restarting rangefeeds.
A big difference between regular rangefeed, and mux rangefeed,
is regular rangefeed has a dedicated go routine per range.
This go routine is responsible for running a rangefeed, handling
its errors, management of state pertaining to the RPC call
(transport, retry information, routing information, etc), and
restarting rangefeed with backoff as needed.

Mux rangefeed, on the other hand, is not "loop" based.  Prior
to this PR, mux rangefeed, when it encountered a transient error,
would loose a lot of the restart state mentioned above.  For example,
it would loose the transport information, so that the restart would
run against the same node as before, resulting, potentially, in
busy loops.  Those busy loops (where the RPC call is restarted
against the same node/replica that just experienced an error), would
tend to make test flaky since they would take longer time to converge
to the state expected by the tests (such as `TestDecommission`) test.

This PR fixes this loss of state pertaining to single range restart
by associating this state with the long lived `activeMuxRangeFeed` state.

Fixes cockroachdb#96630
Fixes cockroachdb#100783
Informs cockroachdb#99631
Informs cockroachdb#101614

Release note: None
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-cdc Change Data Capture C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. C-test-failure Broken test (automatically or manually discovered). T-cdc
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants