Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kv/kvserver: TestDecommission failed #96630

Closed
cockroach-teamcity opened this issue Feb 6, 2023 · 16 comments · Fixed by #102094
Closed

kv/kvserver: TestDecommission failed #96630

cockroach-teamcity opened this issue Feb 6, 2023 · 16 comments · Fixed by #102094
Assignees
Labels
branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. skipped-test T-kv KV Team
Milestone

Comments

@cockroach-teamcity
Copy link
Member

cockroach-teamcity commented Feb 6, 2023

kv/kvserver.TestDecommission failed with artifacts on release-22.1 @ 8816f14ae071d448d74c35e51c26690e91519f74:

=== RUN   TestDecommission
    test_log_scope.go:79: test logs captured to: /artifacts/tmp/_tmp/751d67000aac5f3394c2369309253f02/logTestDecommission1782102708
    test_log_scope.go:80: use -show-logs to present logs inline

Parameters: TAGS=bazel,gss

Help

See also: How To Investigate a Go Test Failure (internal)

/cc @cockroachdb/kv

This test on roachdash | Improve this report!

Jira issue: CRDB-24248

@cockroach-teamcity cockroach-teamcity added branch-release-22.1 Used to mark GA and release blockers, technical advisories, and bugs for 22.1 C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. labels Feb 6, 2023
@cockroach-teamcity cockroach-teamcity added this to the 22.1 milestone Feb 6, 2023
@blathers-crl blathers-crl bot added the T-kv KV Team label Feb 6, 2023
@knz
Copy link
Contributor

knz commented Feb 6, 2023

@AlexTalks wanna have a look?

@github-actions
Copy link

github-actions bot commented Mar 9, 2023

We have marked this test failure issue as stale because it has been
inactive for 1 month. If this failure is still relevant, removing the
stale label or adding a comment will keep it active. Otherwise,
we'll close it in 5 days to keep the test failure queue tidy.

@adityamaru
Copy link
Contributor

@adityamaru adityamaru added branch-master Failures and bugs on the master branch. and removed X-stale branch-release-22.1 Used to mark GA and release blockers, technical advisories, and bugs for 22.1 labels Mar 17, 2023
smg260 pushed a commit to smg260/cockroach that referenced this issue Mar 17, 2023
Refs: cockroachdb#96630

Reason: flaky test

Generated by bin/skip-test.

Release justification: non-production code changes
Release note: None
Epic: None
smg260 pushed a commit to smg260/cockroach that referenced this issue Mar 17, 2023
Refs: cockroachdb#96630

Reason: flaky test

Generated by bin/skip-test.

Release justification: non-production code changes
Release note: None
Epic: None
@cockroach-teamcity
Copy link
Member Author

kv/kvserver.TestDecommission failed with artifacts on master @ 8833248e0cb5d93d3ea30936e86702802b91dc7c:

=== RUN   TestDecommission
    test_log_scope.go:161: test logs captured to: /artifacts/tmp/_tmp/dfba1fd6cc9e1d30670286cfc03f3d34/logTestDecommission1711741847
    test_log_scope.go:79: use -show-logs to present logs inline
    client_raft_test.go:3299: condition failed to evaluate within 45s: expected 1 replicas: r61:/{Table/Max-Max} [(n4,s4):5, (n3,s3):2, (n5,s5):3, next=6, gen=13, sticky=9223372036.854775807,2147483647]
    panic.go:522: -- test log scope end --
test logs left over in: /artifacts/tmp/_tmp/dfba1fd6cc9e1d30670286cfc03f3d34/logTestDecommission1711741847
--- FAIL: TestDecommission (63.26s)
Help

See also: How To Investigate a Go Test Failure (internal)

This test on roachdash | Improve this report!

@cockroach-teamcity
Copy link
Member Author

kv/kvserver.TestDecommission failed with artifacts on master @ 8833248e0cb5d93d3ea30936e86702802b91dc7c:

=== RUN   TestDecommission
    test_log_scope.go:161: test logs captured to: /artifacts/tmp/_tmp/dfba1fd6cc9e1d30670286cfc03f3d34/logTestDecommission1069287369
    test_log_scope.go:79: use -show-logs to present logs inline
    client_raft_test.go:3299: condition failed to evaluate within 45s: expected 1 replicas: r61:/{Table/Max-Max} [(n3,s3):4, (n4,s4):2, (n5,s5):3, next=5, gen=9, sticky=9223372036.854775807,2147483647]
    panic.go:522: -- test log scope end --
test logs left over in: /artifacts/tmp/_tmp/dfba1fd6cc9e1d30670286cfc03f3d34/logTestDecommission1069287369
--- FAIL: TestDecommission (66.21s)
Help

See also: How To Investigate a Go Test Failure (internal)

This test on roachdash | Improve this report!

@cockroach-teamcity
Copy link
Member Author

kv/kvserver.TestDecommission failed with artifacts on master @ 8833248e0cb5d93d3ea30936e86702802b91dc7c:

=== RUN   TestDecommission
    test_log_scope.go:161: test logs captured to: /artifacts/tmp/_tmp/dfba1fd6cc9e1d30670286cfc03f3d34/logTestDecommission1557941027
    test_log_scope.go:79: use -show-logs to present logs inline
    client_raft_test.go:3299: condition failed to evaluate within 45s: expected 1 replicas: r61:/{Table/Max-Max} [(n3,s3):4, (n5,s5):2, (n4,s4):5, next=6, gen=13, sticky=9223372036.854775807,2147483647]
    panic.go:522: -- test log scope end --
test logs left over in: /artifacts/tmp/_tmp/dfba1fd6cc9e1d30670286cfc03f3d34/logTestDecommission1557941027
--- FAIL: TestDecommission (64.24s)
Help

See also: How To Investigate a Go Test Failure (internal)

This test on roachdash | Improve this report!

@AlexTalks
Copy link
Contributor

Test is flaking due to a timing issue, I'll remove GA blocker and send a fix later today.

@AlexTalks
Copy link
Contributor

On further inspection this is due to #100783 - when in a metamorphic build with mux range feeds enabled, the span configs never get updated.

AlexTalks added a commit to AlexTalks/cockroach that referenced this issue Apr 19, 2023
Similar to the issue found in cockroachdb#99207, it was discovered that
`TestDecommission` is also failing due to cockroachdb#100783 - under metamorphic
builds, mux range feeds (among other settings) are enabled, which cause
span config updates (which depend on range feeds) to fail. As can be
seen in the test, this causes expectations on things such as the
replication factor applied to a range to fail due to being out-of-date.
Until cockroachdb#100783 is fixed, this changes `TestDecommission` to be skipped
under metamorphic builds; stressing this test has shown that it succeeds
repeatedly when in a non-metamorphic build.

Fixes: cockroachdb#96630

Release note: None
@erikgrinaker
Copy link
Contributor

cc @miretskiy

miretskiy pushed a commit to miretskiy/cockroach that referenced this issue Apr 23, 2023
Fix error handling and retries when restarting rangefeeds.
A big difference between regular rangefeed, and mux rangefeed,
is regular rangefeed has a dedicated go routine per range.
This go routine is responsible for running a rangefeed, handling
its errors, management of state pertaining to the RPC call
(transport, retry information, routing information, etc), and
restarting rangefeed with backoff as needed.

Mux rangefeed, on the other hand, is not "loop" based.  Prior
to this PR, mux rangefeed, when it encountered a transient error,
would loose a lot of the restart state mentioned above.  For example,
it would loose the transport information, so that the restart would
run against the same node as before, resulting, potentially, in
busy loops.  Those busy loops (where the RPC call is restarted
against the same node/replica that just experienced an error), would
tend to make test flaky since they would take longer time to converge
to the state expected by the tests (such as `TestDecommission`) test.

This PR fixes this loss of state pertaining to single range restart
by associating this state with the long lived `activeMuxRangeFeed` state.

Fixes cockroachdb#96630
Fixes cockroachdb#100783
Informs cockroachdb#99631
Informs cockroachdb#101614

Release note: None
miretskiy pushed a commit to miretskiy/cockroach that referenced this issue Apr 24, 2023
Fix error handling and retries when restarting rangefeeds.
A big difference between regular rangefeed, and mux rangefeed,
is regular rangefeed has a dedicated go routine per range.
This go routine is responsible for running a rangefeed, handling
its errors, management of state pertaining to the RPC call
(transport, retry information, routing information, etc), and
restarting rangefeed with backoff as needed.

Mux rangefeed, on the other hand, is not "loop" based.  Prior
to this PR, mux rangefeed, when it encountered a transient error,
would loose a lot of the restart state mentioned above.  For example,
it would loose the transport information, so that the restart would
run against the same node as before, resulting, potentially, in
busy loops.  Those busy loops (where the RPC call is restarted
against the same node/replica that just experienced an error), would
tend to make test flaky since they would take longer time to converge
to the state expected by the tests (such as `TestDecommission`) test.

This PR fixes this loss of state pertaining to single range restart
by associating this state with the long lived `activeMuxRangeFeed` state.

Fixes cockroachdb#96630
Fixes cockroachdb#100783
Informs cockroachdb#99631
Informs cockroachdb#101614

Release note: None
miretskiy pushed a commit to miretskiy/cockroach that referenced this issue Apr 24, 2023
Fix error handling and retries when restarting rangefeeds.
A big difference between regular rangefeed, and mux rangefeed,
is regular rangefeed has a dedicated go routine per range.
This go routine is responsible for running a rangefeed, handling
its errors, management of state pertaining to the RPC call
(transport, retry information, routing information, etc), and
restarting rangefeed with backoff as needed.

Mux rangefeed, on the other hand, is not "loop" based.  Prior
to this PR, mux rangefeed, when it encountered a transient error,
would loose a lot of the restart state mentioned above.  For example,
it would loose the transport information, so that the restart would
run against the same node as before, resulting, potentially, in
busy loops.  Those busy loops (where the RPC call is restarted
against the same node/replica that just experienced an error), would
tend to make test flaky since they would take longer time to converge
to the state expected by the tests (such as `TestDecommission`) test.

This PR fixes this loss of state pertaining to single range restart
by associating this state with the long lived `activeMuxRangeFeed` state.

Fixes cockroachdb#96630
Fixes cockroachdb#100783
Informs cockroachdb#99631
Informs cockroachdb#101614

Release note: None
miretskiy pushed a commit to miretskiy/cockroach that referenced this issue Apr 24, 2023
Fix error handling and retries when restarting rangefeeds.
A big difference between regular rangefeed, and mux rangefeed,
is regular rangefeed has a dedicated go routine per range.
This go routine is responsible for running a rangefeed, handling
its errors, management of state pertaining to the RPC call
(transport, retry information, routing information, etc), and
restarting rangefeed with backoff as needed.

Mux rangefeed, on the other hand, is not "loop" based.  Prior
to this PR, mux rangefeed, when it encountered a transient error,
would loose a lot of the restart state mentioned above.  For example,
it would loose the transport information, so that the restart would
run against the same node as before, resulting, potentially, in
busy loops.  Those busy loops (where the RPC call is restarted
against the same node/replica that just experienced an error), would
tend to make test flaky since they would take longer time to converge
to the state expected by the tests (such as `TestDecommission`) test.

This PR fixes this loss of state pertaining to single range restart
by associating this state with the long lived `activeMuxRangeFeed` state.

Fixes cockroachdb#96630
Fixes cockroachdb#100783
Informs cockroachdb#99631
Informs cockroachdb#101614

Release note: None
miretskiy pushed a commit to miretskiy/cockroach that referenced this issue Apr 24, 2023
Fix error handling and retries when restarting rangefeeds.
A big difference between regular rangefeed, and mux rangefeed,
is regular rangefeed has a dedicated go routine per range.
This go routine is responsible for running a rangefeed, handling
its errors, management of state pertaining to the RPC call
(transport, retry information, routing information, etc), and
restarting rangefeed with backoff as needed.

Mux rangefeed, on the other hand, is not "loop" based.  Prior
to this PR, mux rangefeed, when it encountered a transient error,
would loose a lot of the restart state mentioned above.  For example,
it would loose the transport information, so that the restart would
run against the same node as before, resulting, potentially, in
busy loops.  Those busy loops (where the RPC call is restarted
against the same node/replica that just experienced an error), would
tend to make test flaky since they would take longer time to converge
to the state expected by the tests (such as `TestDecommission`) test.

This PR fixes this loss of state pertaining to single range restart
by associating this state with the long lived `activeMuxRangeFeed` state.

Fixes cockroachdb#96630
Fixes cockroachdb#100783
Informs cockroachdb#99631
Informs cockroachdb#101614

Release note: None
miretskiy pushed a commit to miretskiy/cockroach that referenced this issue Apr 25, 2023
Fix error handling and retries when restarting rangefeeds.
A big difference between regular rangefeed, and mux rangefeed,
is regular rangefeed has a dedicated go routine per range.
This go routine is responsible for running a rangefeed, handling
its errors, management of state pertaining to the RPC call
(transport, retry information, routing information, etc), and
restarting rangefeed with backoff as needed.

Mux rangefeed, on the other hand, is not "loop" based.  Prior
to this PR, mux rangefeed, when it encountered a transient error,
would loose a lot of the restart state mentioned above.  For example,
it would loose the transport information, so that the restart would
run against the same node as before, resulting, potentially, in
busy loops.  Those busy loops (where the RPC call is restarted
against the same node/replica that just experienced an error), would
tend to make test flaky since they would take longer time to converge
to the state expected by the tests (such as `TestDecommission`) test.

This PR fixes this loss of state pertaining to single range restart
by associating this state with the long lived `activeMuxRangeFeed` state.

Fixes cockroachdb#96630
Fixes cockroachdb#100783
Informs cockroachdb#99631
Informs cockroachdb#101614

Release note: None
craig bot pushed a commit that referenced this issue Apr 26, 2023
102094: kvcoord: Fix error handling and retries in mux rangefeed client r=miretskiy a=miretskiy

Fix error handling and retries when restarting rangefeeds. A big difference between regular rangefeed, and mux rangefeed, is regular rangefeed has a dedicated go routine per range. This go routine is responsible for running a rangefeed, handling its errors, management of state pertaining to the RPC call (transport, retry information, routing information, etc), and restarting rangefeed with backoff as needed.

Mux rangefeed, on the other hand, is not "loop" based.  Prior to this PR, mux rangefeed, when it encountered a transient error, would loose a lot of the restart state mentioned above.  For example, it would loose the transport information, so that the restart would run against the same node as before, resulting, potentially, in busy loops.  Those busy loops (where the RPC call is restarted against the same node/replica that just experienced an error), would tend to make test flaky since they would take longer time to converge to the state expected by the tests (such as `TestDecommission`) test.

This PR fixes this loss of state pertaining to single range restart by associating this state with the long lived `activeMuxRangeFeed` state.

Fixes #96630
Fixes #100783
Informs #99631
Informs #101614

Release note: None

Co-authored-by: Yevgeniy Miretskiy <[email protected]>
@craig craig bot closed this as completed in c0a5c30 Apr 26, 2023
miretskiy pushed a commit to miretskiy/cockroach that referenced this issue Apr 26, 2023
Fix error handling and retries when restarting rangefeeds.
A big difference between regular rangefeed, and mux rangefeed,
is regular rangefeed has a dedicated go routine per range.
This go routine is responsible for running a rangefeed, handling
its errors, management of state pertaining to the RPC call
(transport, retry information, routing information, etc), and
restarting rangefeed with backoff as needed.

Mux rangefeed, on the other hand, is not "loop" based.  Prior
to this PR, mux rangefeed, when it encountered a transient error,
would loose a lot of the restart state mentioned above.  For example,
it would loose the transport information, so that the restart would
run against the same node as before, resulting, potentially, in
busy loops.  Those busy loops (where the RPC call is restarted
against the same node/replica that just experienced an error), would
tend to make test flaky since they would take longer time to converge
to the state expected by the tests (such as `TestDecommission`) test.

This PR fixes this loss of state pertaining to single range restart
by associating this state with the long lived `activeMuxRangeFeed` state.

Fixes cockroachdb#96630
Fixes cockroachdb#100783
Informs cockroachdb#99631
Informs cockroachdb#101614

Release note: None
miretskiy pushed a commit to miretskiy/cockroach that referenced this issue Apr 26, 2023
Fix error handling and retries when restarting rangefeeds.
A big difference between regular rangefeed, and mux rangefeed,
is regular rangefeed has a dedicated go routine per range.
This go routine is responsible for running a rangefeed, handling
its errors, management of state pertaining to the RPC call
(transport, retry information, routing information, etc), and
restarting rangefeed with backoff as needed.

Mux rangefeed, on the other hand, is not "loop" based.  Prior
to this PR, mux rangefeed, when it encountered a transient error,
would loose a lot of the restart state mentioned above.  For example,
it would loose the transport information, so that the restart would
run against the same node as before, resulting, potentially, in
busy loops.  Those busy loops (where the RPC call is restarted
against the same node/replica that just experienced an error), would
tend to make test flaky since they would take longer time to converge
to the state expected by the tests (such as `TestDecommission`) test.

This PR fixes this loss of state pertaining to single range restart
by associating this state with the long lived `activeMuxRangeFeed` state.

Fixes cockroachdb#96630
Fixes cockroachdb#100783
Informs cockroachdb#99631
Informs cockroachdb#101614

Release note: None
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. skipped-test T-kv KV Team
Projects
None yet
7 participants