roachtest: replicagc-changed-peers/restart=true failed #105506
roachtest.replicagc-changed-peers/restart=true failed with artifacts on master @ a2c2c060a423ee410b57868e657df644f2619cb3:
The test fails because we can't remove the replica from n3, even though n3 has a 'deadnode' attribute that the zone config forbids. The allocator struggles to find a new location for the replica and gives up. We also have other zone configs without constraints, but I think those belong to virtual tables, not real ones, so I'm not sure why we even have them.
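For context on the mechanism being discussed, here is a minimal sketch (not code from the test) of how a zone config can forbid replica placement on nodes carrying an attribute. The 'deadnode' attribute name comes from the comment above; the connection string and the choice of the default range are placeholder assumptions.

```go
// Minimal sketch, not the roachtest code: forbid replica placement on any
// node started with --attrs=deadnode by adding a prohibited ("-") constraint.
// Connection string and target range are placeholder assumptions.
package main

import (
	"database/sql"
	"log"

	_ "github.com/lib/pq" // assumed Postgres-wire driver
)

func main() {
	db, err := sql.Open("postgres",
		"postgresql://root@localhost:26257/defaultdb?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// A "-attr" constraint prohibits replicas on stores whose node carries
	// that attribute, which is what isolates n3 in this scenario.
	if _, err := db.Exec(
		`ALTER RANGE default CONFIGURE ZONE USING constraints = '[-deadnode]'`,
	); err != nil {
		log.Fatal(err)
	}
}
```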
This PR, which refactored liveness, seems to have caused the failure: #104746. Before that PR, the restarted node was not sent replicas back; after it, we see that the node first drains, then another node decides to send replicas and leases back, ignoring the zone constraints, only for the node to be drained again, but only partially.
I checked, and it's just one range, or a few ranges, that stay on the node. The exact ranges differ from run to run, but the failure and the change in behaviour are pretty consistent.
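A minimal sketch (mine, not from the test) of one way to check which ranges still keep a replica on n3. It assumes a locally reachable cluster and that crdb_internal.ranges exposes a replicas INT[] column in this version; treat the schema and connection details as assumptions to verify.

```go
// Minimal sketch: list ranges whose replica set still contains node ID 3.
// The crdb_internal.ranges schema (a "replicas" INT[] column) and the
// connection string are assumptions to verify against your version.
package main

import (
	"database/sql"
	"fmt"
	"log"

	"github.com/lib/pq" // registers the "postgres" driver, provides pq.Int64Array
)

func main() {
	db, err := sql.Open("postgres",
		"postgresql://root@localhost:26257/defaultdb?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	rows, err := db.Query(
		`SELECT range_id, replicas FROM crdb_internal.ranges WHERE 3 = ANY(replicas)`)
	if err != nil {
		log.Fatal(err)
	}
	defer rows.Close()

	for rows.Next() {
		var rangeID int64
		var replicas pq.Int64Array
		if err := rows.Scan(&rangeID, &replicas); err != nil {
			log.Fatal(err)
		}
		fmt.Printf("r%d still has a replica on n3 (replicas: %v)\n", rangeID, replicas)
	}
	if err := rows.Err(); err != nil {
		log.Fatal(err)
	}
}
```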
I was comparing behaviour between commits.
The underlying reason this test fails is a timing issue in the test itself. I'm going to look at updating the test to make it pass. Among other steps, the test waits for replica GC on n3.

What happens is that replica GC does run on n3 immediately after startup, as expected. However, before the other nodes see the new zone config, they begin transferring replicas back to n3. Once they do see the new zone config they transfer those replicas back off, but by then it is no longer possible to maintain RF=5 on some ranges, so their replicas can't be removed from n3 anymore. The key point is that the other nodes shouldn't attempt to send replicas back to n3 after it comes back online. The combination of n3 being suspect, decommissioning, dead, or unavailable, together with the zone constraints, makes this test very hard to control.

A small sleep (one minute) after creating the isolating zone configs and starting the nodes should be enough that this test is no longer flaky. The test does expose the craziness we have around determining allocator targets, and hopefully that will be cleaned up in later liveness PRs.
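To make the proposed ordering concrete, here is a minimal sketch following the fix's phrasing (delay after adding the constraints, before recommissioning the node). It is not the actual roachtest code; the constraint is reused from the earlier sketch, and the node ID, one-minute delay, and connection details are placeholder assumptions.

```go
// Minimal sketch of the fix's ordering, not the roachtest implementation:
// apply the isolating zone constraint, give it time to propagate to every
// node, and only then recommission n3. The attribute name, node ID, delay,
// and connection details are placeholder assumptions.
package main

import (
	"database/sql"
	"log"
	"os/exec"
	"time"

	_ "github.com/lib/pq" // assumed Postgres-wire driver
)

func main() {
	db, err := sql.Open("postgres",
		"postgresql://root@localhost:26257/defaultdb?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// 1. Forbid replicas on nodes carrying the (assumed) "deadnode" attribute.
	if _, err := db.Exec(
		`ALTER RANGE default CONFIGURE ZONE USING constraints = '[-deadnode]'`,
	); err != nil {
		log.Fatal(err)
	}

	// 2. Wait for the zone config change to reach every node, so nobody
	// tries to move replicas back onto n3 once it rejoins.
	time.Sleep(time.Minute)

	// 3. Only now recommission n3 (assumed node ID 3).
	out, err := exec.Command("cockroach", "node", "recommission", "3",
		"--insecure", "--host=localhost:26257").CombinedOutput()
	if err != nil {
		log.Fatalf("recommission failed: %v\n%s", err, out)
	}
}
```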
105555: roachtest: add delay after adding constraints. r=kvoli a=andrewbaptist

Fixes: #105506. Fixes: #105505.

Add a delay after adding the constraints before recommissioning the node. This allows the constraints to propagate to all nodes and avoids replicas being added back to the node.

Epic: none

Release note: None

Co-authored-by: Andrew Baptist <[email protected]>
roachtest.replicagc-changed-peers/restart=true failed with artifacts on master @ 7fd4c21157221eae9e7d5892d89d2b5a671aba3e:

Parameters: ROACHTEST_arch=amd64, ROACHTEST_cloud=gce, ROACHTEST_cpu=4, ROACHTEST_encrypted=false, ROACHTEST_ssd=0
Jira issue: CRDB-29087