Skip to content

Commit

Permalink
roachtest: stop cockroach gracefully when upgrading nodes
Browse files Browse the repository at this point in the history
This commit makes it so that we stop cockroach nodes gracefully when
upgrading them. Previous abrupt behavior of stopping the nodes during
the upgrade could lead to test flakes because the nodes were not
being properly drained.

Here is one scenario for how one of the flakes (`pq: version mismatch in
flow request: 65; this node accepts 69 through 69`, which means that
a gateway running an older version asks another node running a newer
version to do DistSQL computation, but the versions are not DistSQL
compatible):
- we are in a state when node 1 is running a newer version when node
2 is running an older version. Importantly, node 1 was upgraded
"abruptly" meaning that it wasn't properly drained; in particular, it
didn't send DistSQL draining notification through gossip.
- newer node has already been started but its DistSQL server hasn't been
started yet (although it already can accept incoming RPCs - see comments
on `distsql.ServerImpl.Start` for more details). This means that newer
node has **not** sent through gossip an update about its DistSQL version.
- node 2 acts as the gateway for a query that reads some data that node
1 is the leaseholder for. During the physical planning, older node
2 checks whether newer node 1 is "healthy and compatible", and node 1 is
deemed both healthy (because it can accept incoming RPCs) and is
compatible (because node 2 hasn't received updated DistSQL version of
node 1 since it hasn't been sent yet). As a result, node 2 plans a read
on node 1.
- when node 1 receives that request, it errors out with "version
mismatch" error.

This whole problem is solved if we stop nodes gracefully when upgrading
them. In particular, this will mean that node 1 would first dissipate its
draining notification across the cluster, so during the physical
planning it will only be considered IFF node 1 has already communicated
its updated DistSQL version, and then it would be deemed
DistSQL-incompatible.

I verified that this scenario is possible (with manual adjustments of the
version upgrade test and cockroach binary to insert a delay) and that
it's fixed by this commit. I believe it is likely that other flake types
have the same root cause, but I haven't verified it.

Release justification: test-only change.

Release note: None
  • Loading branch information
yuzefovich committed Aug 30, 2022
1 parent 716626e commit 669186a
Show file tree
Hide file tree
Showing 2 changed files with 3 additions and 2 deletions.
1 change: 0 additions & 1 deletion pkg/cmd/roachtest/tests/acceptance.go
Original file line number Diff line number Diff line change
Expand Up @@ -75,7 +75,6 @@ func registerAcceptance(r registry.Registry) {
// to head after 19.2 fails.
minVersion: "v19.2.0",
timeout: 30 * time.Minute,
skip: "https://github.com/cockroachdb/cockroach/issues/87104",
},
},
}
Expand Down
4 changes: 3 additions & 1 deletion pkg/cmd/roachtest/tests/versionupgrade.go
Original file line number Diff line number Diff line change
Expand Up @@ -417,7 +417,9 @@ func upgradeNodes(
newVersionMsg = "<current>"
}
t.L().Printf("restarting node %d into version %s", node, newVersionMsg)
c.Stop(ctx, t.L(), option.DefaultStopOpts(), c.Node(node))
if err := c.StopCockroachGracefullyOnNode(ctx, t.L(), node); err != nil {
t.Fatal(err)
}

binary := uploadVersion(ctx, t, c, c.Node(node), newVersion)
settings := install.MakeClusterSettings(install.BinaryOption(binary))
Expand Down

0 comments on commit 669186a

Please sign in to comment.