
roachtest: replicagc-changed-peers/restart=true failed #105506

Closed
cockroach-teamcity opened this issue Jun 24, 2023 · 6 comments · Fixed by #105555
Assignees
Labels
A-testing Testing tools and infrastructure branch-master Failures and bugs on the master branch. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. T-kv KV Team
Milestone

Comments

@cockroach-teamcity
Member

cockroach-teamcity commented Jun 24, 2023

roachtest.replicagc-changed-peers/restart=true failed with artifacts on master @ 7fd4c21157221eae9e7d5892d89d2b5a671aba3e:

(replicagc.go:209).waitForZeroReplicas: replica count on n3 didn't drop to zero: 3
(test_runner.go:1122).func1: 2 dead node(s) detected
test artifacts and logs in: /artifacts/replicagc-changed-peers/restart=true/run_1

Parameters: ROACHTEST_arch=amd64 , ROACHTEST_cloud=gce , ROACHTEST_cpu=4 , ROACHTEST_encrypted=false , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

/cc @cockroachdb/replication

This test on roachdash | Improve this report!

Jira issue: CRDB-29087

@cockroach-teamcity cockroach-teamcity added branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. T-kv-replication labels Jun 24, 2023
@cockroach-teamcity cockroach-teamcity added this to the 23.2 milestone Jun 24, 2023
@cockroach-teamcity
Member Author

roachtest.replicagc-changed-peers/restart=true failed with artifacts on master @ a2c2c060a423ee410b57868e657df644f2619cb3:

(replicagc.go:209).waitForZeroReplicas: replica count on n3 didn't drop to zero: 3
(test_runner.go:1122).func1: 2 dead node(s) detected
test artifacts and logs in: /artifacts/replicagc-changed-peers/restart=true/run_1

Parameters: ROACHTEST_arch=amd64 , ROACHTEST_cloud=gce , ROACHTEST_cpu=4 , ROACHTEST_encrypted=false , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

This test on roachdash | Improve this report!

@aliher1911
Contributor

The test fails because we can't remove replicas from n3 even though it carries a 'deadnode' attribute that the zone config forbids.
If we look at the zone config of the liveness range, which doesn't budge, it has RF=5 while only 4 nodes remain in the cluster, and one of them is forbidden.

  RANGE liveness                                   | ALTER RANGE liveness CONFIGURE ZONE USING
                                                   |     range_min_bytes = 134217728,
                                                   |     range_max_bytes = 536870912,
                                                   |     gc.ttlseconds = 600,
                                                   |     num_replicas = 5,
                                                   |     constraints = '[-deadnode]',
                                                   |     lease_preferences = '[]'

The allocator struggles to find a new location for the replica and gives up.
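
For reference, that zone config can be inspected directly over SQL; a minimal sketch (the FROM RANGE syntax is assumed to match the cluster's CockroachDB version):

  -- Show the zone config that pins the liveness range at RF=5.
  SHOW ZONE CONFIGURATION FROM RANGE liveness;

  -- With only four nodes remaining and one of them excluded by '[-deadnode]',
  -- the allocator has just three valid targets for a five-replica range.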

We also have other zone configs like

  TABLE system.public.replication_constraint_stats | ALTER TABLE system.public.replication_constraint_stats CONFIGURE ZONE USING
                                                   |     gc.ttlseconds = 600,
                                                   |     constraints = '[]',
                                                   |     lease_preferences = '[]'

which don't have constraints, but I think they belong to virtual tables, not real ones, so I'm not sure why we even have them.
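
For completeness, the tables above look like rows from SHOW ALL ZONE CONFIGURATIONS, which lists every configured zone, system tables included; a sketch:

  -- List all configured zones (target | raw_config_sql).
  SHOW ALL ZONE CONFIGURATIONS;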

@aliher1911
Contributor

This PR, which refactored liveness, seems to have caused the failure: #104746

Prior to it, restarting the node with the deadnode attr caused its replicas to drain:

10:46:43 cluster.go:859: test status: starting nodes :3
10:46:43 cockroach.go:184: oleg-1687776090-01-n6cpu4: starting nodes
10:46:47 cockroach.go:826: oleg-1687776090-01-n6cpu4: creating backup schedule
10:46:51 cockroach.go:857: SCHEDULED BACKUP 0
NOTICE: schedule "test_only_backup" already exists, skipping
10:46:51 replicagc.go:223: found 155 replicas on n3
10:46:56 replicagc.go:223: found 105 replicas on n3
10:47:02 replicagc.go:223: found 54 replicas on n3
10:47:07 replicagc.go:223: found 3 replicas on n3
10:47:12 replicagc.go:223: found 0 replicas on n3

but after the PR we see that the node first drains, then other nodes decide to send replicas and leases back, ignoring the zone constraints, only for the node to be drained again, and only partially.

10:56:20 cluster.go:859: test status: starting nodes :3
10:56:20 cockroach.go:184: oleg-1687776717-01-n6cpu4: starting nodes
10:56:24 cockroach.go:826: oleg-1687776717-01-n6cpu4: creating backup schedule
10:56:28 cockroach.go:857: SCHEDULED BACKUP 0
NOTICE: schedule "test_only_backup" already exists, skipping
10:56:28 replicagc.go:223: found 161 replicas on n3
10:56:33 replicagc.go:223: found 110 replicas on n3
10:56:38 replicagc.go:223: found 59 replicas on n3
10:56:43 replicagc.go:223: found 8 replicas on n3
10:56:49 replicagc.go:223: found 103 replicas on n3
10:56:54 replicagc.go:223: found 103 replicas on n3
10:56:59 replicagc.go:223: found 103 replicas on n3
10:57:04 replicagc.go:223: found 103 replicas on n3
10:57:09 replicagc.go:223: found 103 replicas on n3
10:57:14 replicagc.go:223: found 70 replicas on n3
10:57:19 replicagc.go:223: found 1 replicas on n3
10:57:24 replicagc.go:223: found 1 replicas on n3
10:57:29 replicagc.go:223: found 1 replicas on n3
10:57:35 replicagc.go:223: found 1 replicas on n3
10:57:40 replicagc.go:223: found 1 replicas on n3
...

I checked, and it's just one range, or a few ranges, that stays on the node:

        57 | /Table/56                         | {3,4,5,6}

This differs from run to run, but the failure and the change in behaviour are pretty consistent.
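
For reference, output in that shape can be produced with a query along these lines (a sketch against crdb_internal.ranges, an internal table whose columns can change between versions):

  -- Ranges that still hold a replica on node 3.
  SELECT range_id, start_pretty, replicas
    FROM crdb_internal.ranges
   WHERE 3 = ANY (replicas);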

@aliher1911
Contributor

I was comparing behaviour between these commits:
f2fa926d1c73893e67314c6e951720fb46563bb5 - after merge, fails
c4df44ef5db1d91014b1fa287ca11d04704615de - before merge, works fine

@cockroach-teamcity
Member Author

roachtest.replicagc-changed-peers/restart=true failed with artifacts on master @ a2c2c060a423ee410b57868e657df644f2619cb3:

(replicagc.go:209).waitForZeroReplicas: replica count on n3 didn't drop to zero: 3
(test_runner.go:1122).func1: 2 dead node(s) detected
test artifacts and logs in: /artifacts/replicagc-changed-peers/restart=true/run_1

Parameters: ROACHTEST_arch=amd64 , ROACHTEST_cloud=gce , ROACHTEST_cpu=4 , ROACHTEST_encrypted=false , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

This test on roachdash | Improve this report!

@erikgrinaker erikgrinaker added T-kv KV Team and removed T-kv-replication labels Jun 26, 2023
@andrewbaptist
Collaborator

The underlying reason this test fails is a timing issue in the test itself. I'm going to look at updating the test to make it pass. The test does the following:

  1. Adds 6 nodes, and then decommissions 3 (n1-n3). One node (n3) is stopped before decommissioning. They stay in decommissioning, not decommissioned
  2. Waits for the replicas to move off the 3 decommissioning nodes
  3. Stops n1 and n2 (which are now in decommissioning)
  4. Adds a zone config to prevent replicas from moving back to n3.
  5. Immediately starts n3 (there are now 4 live nodes)
  6. Waits for replica GC on n3

What happens is that replica GC kicks in on n3 immediately after startup, as expected. However, before the other nodes see the new zone config, they begin transferring replicas back to it.

When they do see the new zone config, they transfer the replicas back off; however, it is no longer possible to satisfy RF=5 on some ranges (with n1 and n2 down and n3 forbidden, only three valid targets remain for five replicas), and those replicas can't be removed from n3 anymore.

The key is that the other nodes shouldn't attempt to send replicas back to n3 after it comes back online. The combination of n3 being suspect, decommissioning, constrained away by zone config, dead, and unavailable makes this test very hard to control. A small sleep (1 minute) after creating the isolating zone configs and before starting the nodes should be enough that this test is no longer flaky.
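
At the SQL level, the workaround amounts to something like the following sketch (illustrative only; the actual fix in #105555 adds the delay inside the test itself, and RANGE default stands in for whichever zones the test constrains):

  -- Forbid replicas on nodes tagged with the 'deadnode' attribute.
  ALTER RANGE default CONFIGURE ZONE USING constraints = '[-deadnode]';

  -- Give the new zone config time to propagate to all nodes before
  -- restarting n3 (the fix waits roughly a minute here).
  SELECT pg_sleep(60);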

The test does expose the craziness we have around determining allocator targets; hopefully this will be cleaned up in later liveness PRs.

craig bot pushed a commit that referenced this issue Jun 27, 2023
105555: roachtest: add delay after adding constraints. r=kvoli a=andrewbaptist

Fixes: #105506.
Fixes: #105505.

Add a delay after adding the constraints before recommissioning the
node. This allows the constraints to propagate to all nodes and avoids
replicas being added back to the node.

Epic: none

Release note: None

Co-authored-by: Andrew Baptist <[email protected]>
@craig craig bot closed this as completed in c858088 Jun 27, 2023
@erikgrinaker erikgrinaker added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. A-testing Testing tools and infrastructure and removed release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Jul 4, 2023