
roachtest: replicagc-changed-peers/restart=true failed #105506

Closed
cockroach-teamcity opened this issue Jun 24, 2023 · 6 comments · Fixed by #105555
Assignees
Labels
A-testing Testing tools and infrastructure branch-master Failures and bugs on the master branch. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. T-kv KV Team
Milestone

Comments

@cockroach-teamcity
Member

cockroach-teamcity commented Jun 24, 2023

roachtest.replicagc-changed-peers/restart=true failed with artifacts on master @ 7fd4c21157221eae9e7d5892d89d2b5a671aba3e:

(replicagc.go:209).waitForZeroReplicas: replica count on n3 didn't drop to zero: 3
(test_runner.go:1122).func1: 2 dead node(s) detected
test artifacts and logs in: /artifacts/replicagc-changed-peers/restart=true/run_1

Parameters: ROACHTEST_arch=amd64 , ROACHTEST_cloud=gce , ROACHTEST_cpu=4 , ROACHTEST_encrypted=false , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

/cc @cockroachdb/replication

This test on roachdash | Improve this report!

Jira issue: CRDB-29087

@cockroach-teamcity cockroach-teamcity added branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. T-kv-replication labels Jun 24, 2023
@cockroach-teamcity cockroach-teamcity added this to the 23.2 milestone Jun 24, 2023
@cockroach-teamcity
Member Author

roachtest.replicagc-changed-peers/restart=true failed with artifacts on master @ a2c2c060a423ee410b57868e657df644f2619cb3:

(replicagc.go:209).waitForZeroReplicas: replica count on n3 didn't drop to zero: 3
(test_runner.go:1122).func1: 2 dead node(s) detected
test artifacts and logs in: /artifacts/replicagc-changed-peers/restart=true/run_1

Parameters: ROACHTEST_arch=amd64 , ROACHTEST_cloud=gce , ROACHTEST_cpu=4 , ROACHTEST_encrypted=false , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

This test on roachdash | Improve this report!

@aliher1911
Contributor

The test fails because we can't remove replicas from n3 even though it carries a 'deadnode' attribute that the zone config forbids.
If we look at the zone config of the liveness range, which doesn't budge, it has RF=5 while only 4 nodes remain in the cluster, and one of them is forbidden.

  RANGE liveness                                   | ALTER RANGE liveness CONFIGURE ZONE USING
                                                   |     range_min_bytes = 134217728,
                                                   |     range_max_bytes = 536870912,
                                                   |     gc.ttlseconds = 600,
                                                   |     num_replicas = 5,
                                                   |     constraints = '[-deadnode]',
                                                   |     lease_preferences = '[]'

The allocator struggles to find a new location for the replica and gives up.
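
For reference, that zone config can be inspected directly over SQL; a minimal sketch (the FROM RANGE syntax is assumed to match the cluster's CockroachDB version):

  -- Show the zone config that pins the liveness range at RF=5.
  SHOW ZONE CONFIGURATION FROM RANGE liveness;

  -- With only four nodes remaining and one of them excluded by '[-deadnode]',
  -- the allocator has just three valid targets for a five-replica range.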

We also have other zone configs like

  TABLE system.public.replication_constraint_stats | ALTER TABLE system.public.replication_constraint_stats CONFIGURE ZONE USING
                                                   |     gc.ttlseconds = 600,
                                                   |     constraints = '[]',
                                                   |     lease_preferences = '[]'

which don't have constraints, but I think they belong to virtual tables, not real ones, so I'm not sure why we even have them.
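
For completeness, the tables above look like rows from SHOW ALL ZONE CONFIGURATIONS, which lists every configured zone, system tables included; a sketch:

  -- List all configured zones (target | raw_config_sql).
  SHOW ALL ZONE CONFIGURATIONS;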

@aliher1911
Contributor

This PR, which refactored liveness, seems to have caused the failure: #104746

Prior to it, restarting the node with the deadnode attr caused its replicas to drain:

10:46:43 cluster.go:859: test status: starting nodes :3
10:46:43 cockroach.go:184: oleg-1687776090-01-n6cpu4: starting nodes
10:46:47 cockroach.go:826: oleg-1687776090-01-n6cpu4: creating backup schedule
10:46:51 cockroach.go:857: SCHEDULED BACKUP 0
NOTICE: schedule "test_only_backup" already exists, skipping
10:46:51 replicagc.go:223: found 155 replicas on n3
10:46:56 replicagc.go:223: found 105 replicas on n3
10:47:02 replicagc.go:223: found 54 replicas on n3
10:47:07 replicagc.go:223: found 3 replicas on n3
10:47:12 replicagc.go:223: found 0 replicas on n3

but after the PR we see that the node first drains, then other nodes decide to send replicas and leases back, ignoring the zone constraints, only for the node to be drained again, and only partially.

10:56:20 cluster.go:859: test status: starting nodes :3
10:56:20 cockroach.go:184: oleg-1687776717-01-n6cpu4: starting nodes
10:56:24 cockroach.go:826: oleg-1687776717-01-n6cpu4: creating backup schedule
10:56:28 cockroach.go:857: SCHEDULED BACKUP 0
NOTICE: schedule "test_only_backup" already exists, skipping
10:56:28 replicagc.go:223: found 161 replicas on n3
10:56:33 replicagc.go:223: found 110 replicas on n3
10:56:38 replicagc.go:223: found 59 replicas on n3
10:56:43 replicagc.go:223: found 8 replicas on n3
10:56:49 replicagc.go:223: found 103 replicas on n3
10:56:54 replicagc.go:223: found 103 replicas on n3
10:56:59 replicagc.go:223: found 103 replicas on n3
10:57:04 replicagc.go:223: found 103 replicas on n3
10:57:09 replicagc.go:223: found 103 replicas on n3
10:57:14 replicagc.go:223: found 70 replicas on n3
10:57:19 replicagc.go:223: found 1 replicas on n3
10:57:24 replicagc.go:223: found 1 replicas on n3
10:57:29 replicagc.go:223: found 1 replicas on n3
10:57:35 replicagc.go:223: found 1 replicas on n3
10:57:40 replicagc.go:223: found 1 replicas on n3
...

I checked, and it's just one range, or a few ranges, that stays on the node:

        57 | /Table/56                         | {3,4,5,6}

This differs from run to run, but the failure and the change in behaviour are pretty consistent.
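
For reference, output in that shape can be produced with a query along these lines (a sketch against crdb_internal.ranges, an internal table whose columns can change between versions):

  -- Ranges that still hold a replica on node 3.
  SELECT range_id, start_pretty, replicas
    FROM crdb_internal.ranges
   WHERE 3 = ANY (replicas);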

@aliher1911
Contributor

I was comparing behaviour between these commits:
f2fa926d1c73893e67314c6e951720fb46563bb5 - after merge, fails
c4df44ef5db1d91014b1fa287ca11d04704615de - before merge, works fine

@cockroach-teamcity
Member Author

roachtest.replicagc-changed-peers/restart=true failed with artifacts on master @ a2c2c060a423ee410b57868e657df644f2619cb3:

(replicagc.go:209).waitForZeroReplicas: replica count on n3 didn't drop to zero: 3
(test_runner.go:1122).func1: 2 dead node(s) detected
test artifacts and logs in: /artifacts/replicagc-changed-peers/restart=true/run_1

Parameters: ROACHTEST_arch=amd64 , ROACHTEST_cloud=gce , ROACHTEST_cpu=4 , ROACHTEST_encrypted=false , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

This test on roachdash | Improve this report!

@erikgrinaker erikgrinaker added T-kv KV Team and removed T-kv-replication labels Jun 26, 2023
@andrewbaptist
Collaborator

The underlying reason this test fails is a timing issue in the test itself. I'm going to look at updating the test to make it pass. The test does the following:

  1. Adds 6 nodes, and then decommissions 3 (n1-n3). One node (n3) is stopped before decommissioning. They stay in decommissioning, not decommissioned
  2. Waits for the replicas to move off the 3 decommissioning nodes
  3. Stops n1 and n2 (which are now in decommissioning)
  4. Adds a zone config to prevent replicas from moving back to n3.
  5. Immediately starts n3 (there are now 4 live nodes)
  6. Waits for replica GC on n3

What happens is that replica GC kicks in on n3 immediately after startup, as expected. However, before the other nodes see the new zone config, they begin transferring replicas back to it.

When they do see the new zone config, they transfer the replicas back off; however, it is no longer possible to satisfy RF=5 on some ranges (with n1 and n2 down and n3 forbidden, only three valid targets remain for five replicas), and those replicas can't be removed from n3 anymore.

The key is that the other nodes shouldn't attempt to send replicas back to n3 after it comes back online. The combination of n3 being suspect, decommissioning, constrained away by zone config, dead, and unavailable makes this test very hard to control. A small sleep (1 minute) after creating the isolating zone configs and before starting the nodes should be enough that this test is no longer flaky.
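
At the SQL level, the workaround amounts to something like the following sketch (illustrative only; the actual fix in #105555 adds the delay inside the test itself, and RANGE default stands in for whichever zones the test constrains):

  -- Forbid replicas on nodes tagged with the 'deadnode' attribute.
  ALTER RANGE default CONFIGURE ZONE USING constraints = '[-deadnode]';

  -- Give the new zone config time to propagate to all nodes before
  -- restarting n3 (the fix waits roughly a minute here).
  SELECT pg_sleep(60);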

The test does expose the craziness we have around determining allocator targets; hopefully this will be cleaned up in later liveness PRs.

craig bot pushed a commit that referenced this issue Jun 27, 2023
105555: roachtest: add delay after adding constraints. r=kvoli a=andrewbaptist

Fixes: #105506.
Fixes: #105505.

Add a delay after adding the constraints before recommissioning the
node. This allows the constraints to propagate to all nodes and avoids
replicas being added back to the node.

Epic: none

Release note: None

Co-authored-by: Andrew Baptist <[email protected]>
@craig craig bot closed this as completed in c858088 Jun 27, 2023
@erikgrinaker erikgrinaker added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. A-testing Testing tools and infrastructure and removed release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Jul 4, 2023