kv/kvserver: TestDecommission failed #96630

cockroach-teamcity · 2023-02-06T08:21:42Z

kv/kvserver.TestDecommission failed with artifacts on release-22.1 @ 8816f14ae071d448d74c35e51c26690e91519f74:

=== RUN   TestDecommission
    test_log_scope.go:79: test logs captured to: /artifacts/tmp/_tmp/751d67000aac5f3394c2369309253f02/logTestDecommission1782102708
    test_log_scope.go:80: use -show-logs to present logs inline

Parameters: TAGS=bazel,gss

Help

See also: How To Investigate a Go Test Failure (internal)

/cc @cockroachdb/kv _{This test on roachdash | Improve this report!

Jira issue: CRDB-24248}

The text was updated successfully, but these errors were encountered:

knz · 2023-02-06T10:24:23Z

@AlexTalks wanna have a look?

github-actions · 2023-03-09T10:06:05Z

We have marked this test failure issue as stale because it has been
inactive for 1 month. If this failure is still relevant, removing the
stale label or adding a comment will keep it active. Otherwise,
we'll close it in 5 days to keep the test failure queue tidy.

adityamaru · 2023-03-17T15:51:55Z

TestDecommission is still fairly flaky - https://teamcity.cockroachdb.com/project.html?projectId=Cockroach_Ci_TestsGcpLinuxX8664BigVm&buildTypeId=&tab=testDetails&testNameId=-6750397557676147176&order=TEST_STATUS_DESC&branch_Cockroach_Ci_TestsGcpLinuxX8664BigVm=__all_branches__&itemsCount=50. @AlexTalks could you take a look or route it to the most approriate person please, thanks!

Refs: cockroachdb#96630 Reason: flaky test Generated by bin/skip-test. Release justification: non-production code changes Release note: None Epic: None

cockroach-teamcity · 2023-03-18T01:00:45Z

kv/kvserver.TestDecommission failed with artifacts on master @ 8833248e0cb5d93d3ea30936e86702802b91dc7c:

=== RUN   TestDecommission
    test_log_scope.go:161: test logs captured to: /artifacts/tmp/_tmp/dfba1fd6cc9e1d30670286cfc03f3d34/logTestDecommission1711741847
    test_log_scope.go:79: use -show-logs to present logs inline
    client_raft_test.go:3299: condition failed to evaluate within 45s: expected 1 replicas: r61:/{Table/Max-Max} [(n4,s4):5, (n3,s3):2, (n5,s5):3, next=6, gen=13, sticky=9223372036.854775807,2147483647]
    panic.go:522: -- test log scope end --
test logs left over in: /artifacts/tmp/_tmp/dfba1fd6cc9e1d30670286cfc03f3d34/logTestDecommission1711741847
--- FAIL: TestDecommission (63.26s)

Help

See also: How To Investigate a Go Test Failure (internal)

_{This test on roachdash | Improve this report!}

cockroach-teamcity · 2023-03-18T01:00:47Z

kv/kvserver.TestDecommission failed with artifacts on master @ 8833248e0cb5d93d3ea30936e86702802b91dc7c:

=== RUN   TestDecommission
    test_log_scope.go:161: test logs captured to: /artifacts/tmp/_tmp/dfba1fd6cc9e1d30670286cfc03f3d34/logTestDecommission1069287369
    test_log_scope.go:79: use -show-logs to present logs inline
    client_raft_test.go:3299: condition failed to evaluate within 45s: expected 1 replicas: r61:/{Table/Max-Max} [(n3,s3):4, (n4,s4):2, (n5,s5):3, next=5, gen=9, sticky=9223372036.854775807,2147483647]
    panic.go:522: -- test log scope end --
test logs left over in: /artifacts/tmp/_tmp/dfba1fd6cc9e1d30670286cfc03f3d34/logTestDecommission1069287369
--- FAIL: TestDecommission (66.21s)

Help

See also: How To Investigate a Go Test Failure (internal)

_{This test on roachdash | Improve this report!}

cockroach-teamcity · 2023-03-18T01:00:48Z

kv/kvserver.TestDecommission failed with artifacts on master @ 8833248e0cb5d93d3ea30936e86702802b91dc7c:

=== RUN   TestDecommission
    test_log_scope.go:161: test logs captured to: /artifacts/tmp/_tmp/dfba1fd6cc9e1d30670286cfc03f3d34/logTestDecommission1557941027
    test_log_scope.go:79: use -show-logs to present logs inline
    client_raft_test.go:3299: condition failed to evaluate within 45s: expected 1 replicas: r61:/{Table/Max-Max} [(n3,s3):4, (n5,s5):2, (n4,s4):5, next=6, gen=13, sticky=9223372036.854775807,2147483647]
    panic.go:522: -- test log scope end --
test logs left over in: /artifacts/tmp/_tmp/dfba1fd6cc9e1d30670286cfc03f3d34/logTestDecommission1557941027
--- FAIL: TestDecommission (64.24s)

Help

See also: How To Investigate a Go Test Failure (internal)

_{This test on roachdash | Improve this report!}

AlexTalks · 2023-04-19T15:58:12Z

Test is flaking due to a timing issue, I'll remove GA blocker and send a fix later today.

AlexTalks · 2023-04-19T23:09:09Z

On further inspection this is due to #100783 - when in a metamorphic build with mux range feeds enabled, the span configs never get updated.

Similar to the issue found in cockroachdb#99207, it was discovered that `TestDecommission` is also failing due to cockroachdb#100783 - under metamorphic builds, mux range feeds (among other settings) are enabled, which cause span config updates (which depend on range feeds) to fail. As can be seen in the test, this causes expectations on things such as the replication factor applied to a range to fail due to being out-of-date. Until cockroachdb#100783 is fixed, this changes `TestDecommission` to be skipped under metamorphic builds; stressing this test has shown that it succeeds repeatedly when in a non-metamorphic build. Fixes: cockroachdb#96630 Release note: None

erikgrinaker · 2023-04-20T08:12:42Z

cc @miretskiy

Fix error handling and retries when restarting rangefeeds. A big difference between regular rangefeed, and mux rangefeed, is regular rangefeed has a dedicated go routine per range. This go routine is responsible for running a rangefeed, handling its errors, management of state pertaining to the RPC call (transport, retry information, routing information, etc), and restarting rangefeed with backoff as needed. Mux rangefeed, on the other hand, is not "loop" based. Prior to this PR, mux rangefeed, when it encountered a transient error, would loose a lot of the restart state mentioned above. For example, it would loose the transport information, so that the restart would run against the same node as before, resulting, potentially, in busy loops. Those busy loops (where the RPC call is restarted against the same node/replica that just experienced an error), would tend to make test flaky since they would take longer time to converge to the state expected by the tests (such as `TestDecommission`) test. This PR fixes this loss of state pertaining to single range restart by associating this state with the long lived `activeMuxRangeFeed` state. Fixes cockroachdb#96630 Fixes cockroachdb#100783 Informs cockroachdb#99631 Informs cockroachdb#101614 Release note: None

102094: kvcoord: Fix error handling and retries in mux rangefeed client r=miretskiy a=miretskiy Fix error handling and retries when restarting rangefeeds. A big difference between regular rangefeed, and mux rangefeed, is regular rangefeed has a dedicated go routine per range. This go routine is responsible for running a rangefeed, handling its errors, management of state pertaining to the RPC call (transport, retry information, routing information, etc), and restarting rangefeed with backoff as needed. Mux rangefeed, on the other hand, is not "loop" based. Prior to this PR, mux rangefeed, when it encountered a transient error, would loose a lot of the restart state mentioned above. For example, it would loose the transport information, so that the restart would run against the same node as before, resulting, potentially, in busy loops. Those busy loops (where the RPC call is restarted against the same node/replica that just experienced an error), would tend to make test flaky since they would take longer time to converge to the state expected by the tests (such as `TestDecommission`) test. This PR fixes this loss of state pertaining to single range restart by associating this state with the long lived `activeMuxRangeFeed` state. Fixes #96630 Fixes #100783 Informs #99631 Informs #101614 Release note: None Co-authored-by: Yevgeniy Miretskiy <[email protected]>

Fix error handling and retries when restarting rangefeeds. A big difference between regular rangefeed, and mux rangefeed, is regular rangefeed has a dedicated go routine per range. This go routine is responsible for running a rangefeed, handling its errors, management of state pertaining to the RPC call (transport, retry information, routing information, etc), and restarting rangefeed with backoff as needed. Mux rangefeed, on the other hand, is not "loop" based. Prior to this PR, mux rangefeed, when it encountered a transient error, would loose a lot of the restart state mentioned above. For example, it would loose the transport information, so that the restart would run against the same node as before, resulting, potentially, in busy loops. Those busy loops (where the RPC call is restarted against the same node/replica that just experienced an error), would tend to make test flaky since they would take longer time to converge to the state expected by the tests (such as `TestDecommission`) test. This PR fixes this loss of state pertaining to single range restart by associating this state with the long lived `activeMuxRangeFeed` state. Fixes cockroachdb#96630 Fixes cockroachdb#100783 Informs cockroachdb#99631 Informs cockroachdb#101614 Release note: None

cockroach-teamcity added branch-release-22.1 Used to mark GA and release blockers, technical advisories, and bugs for 22.1 C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. labels Feb 6, 2023

cockroach-teamcity added this to the 22.1 milestone Feb 6, 2023

blathers-crl bot added the T-kv KV Team label Feb 6, 2023

github-actions bot added no-test-failure-activity X-stale labels Mar 9, 2023

github-actions bot closed this as completed Mar 14, 2023

adityamaru reopened this Mar 17, 2023

adityamaru added branch-master Failures and bugs on the master branch. and removed X-stale branch-release-22.1 Used to mark GA and release blockers, technical advisories, and bugs for 22.1 labels Mar 17, 2023

adityamaru mentioned this issue Mar 17, 2023

jobs,*: read job payload and progress from the job_info table #98608

Merged

smg260 pushed a commit to smg260/cockroach that referenced this issue Mar 17, 2023

kv/kvserver: skip TestDecommission

417f8f2

Refs: cockroachdb#96630 Reason: flaky test Generated by bin/skip-test. Release justification: non-production code changes Release note: None Epic: None

smg260 mentioned this issue Mar 17, 2023

kv/kvserver: skip TestDecommission #98894

Merged

smg260 added the skipped-test label Mar 17, 2023

smg260 pushed a commit to smg260/cockroach that referenced this issue Mar 17, 2023

kv/kvserver: skip TestDecommission

bb1b722

Refs: cockroachdb#96630 Reason: flaky test Generated by bin/skip-test. Release justification: non-production code changes Release note: None Epic: None

github-actions bot removed the no-test-failure-activity label Mar 20, 2023

nvanbenschoten added the GA-blocker label Mar 27, 2023

nvanbenschoten assigned AlexTalks Apr 5, 2023

AlexTalks removed the GA-blocker label Apr 19, 2023

AlexTalks mentioned this issue Apr 19, 2023

kvserver: skip decommissioning test under metamorphic builds #101882

Closed

miretskiy mentioned this issue Apr 23, 2023

kvcoord: Fix error handling and retries in mux rangefeed client #102094

Merged

craig bot closed this as completed in c0a5c30 Apr 26, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

kv/kvserver: TestDecommission failed #96630

kv/kvserver: TestDecommission failed #96630

cockroach-teamcity commented Feb 6, 2023 •

edited by cockroach-jira-scripts

Loading

knz commented Feb 6, 2023

github-actions bot commented Mar 9, 2023

adityamaru commented Mar 17, 2023

cockroach-teamcity commented Mar 18, 2023

cockroach-teamcity commented Mar 18, 2023

cockroach-teamcity commented Mar 18, 2023

AlexTalks commented Apr 19, 2023

AlexTalks commented Apr 19, 2023

erikgrinaker commented Apr 20, 2023

kv/kvserver: TestDecommission failed #96630

kv/kvserver: TestDecommission failed #96630

Comments

cockroach-teamcity commented Feb 6, 2023 • edited by cockroach-jira-scripts Loading

knz commented Feb 6, 2023

github-actions bot commented Mar 9, 2023

adityamaru commented Mar 17, 2023

cockroach-teamcity commented Mar 18, 2023

cockroach-teamcity commented Mar 18, 2023

cockroach-teamcity commented Mar 18, 2023

AlexTalks commented Apr 19, 2023

AlexTalks commented Apr 19, 2023

erikgrinaker commented Apr 20, 2023

cockroach-teamcity commented Feb 6, 2023 •

edited by cockroach-jira-scripts

Loading