
roachtest: stop cockroach gracefully when upgrading nodes #87154

Merged
merged 1 commit into cockroachdb:master on Aug 31, 2022

Conversation

yuzefovich
Member

This commit makes it so that we stop cockroach nodes gracefully when
upgrading them. The previous behavior of stopping the nodes abruptly during
the upgrade could lead to test flakes because the nodes were not
properly drained.

Here is one scenario for how one of the flakes could occur (`pq: version
mismatch in flow request: 65; this node accepts 69 through 69`, which means
that a gateway running an older version asks another node running a newer
version to do DistSQL computation, but the versions are not DistSQL
compatible):

  • we are in a state where node 1 is running a newer version while node
    2 is running an older version. Importantly, node 1 was upgraded
    "abruptly", meaning that it wasn't properly drained; in particular, it
    didn't send the DistSQL draining notification through gossip.
  • the newer node has already been started, but its DistSQL server hasn't
    been started yet (although it can already accept incoming RPCs - see the
    comments on distsql.ServerImpl.Start for more details). This means that
    the newer node has not yet gossiped an update about its DistSQL version.
  • node 2 acts as the gateway for a query that reads some data that node
    1 is the leaseholder for. During the physical planning, older node
    2 checks whether newer node 1 is "healthy and compatible", and node 1 is
    deemed both healthy (because it can accept incoming RPCs) and
    compatible (because node 2 hasn't received node 1's updated DistSQL
    version since it hasn't been sent yet). As a result, node 2 plans a read
    on node 1.
  • when node 1 receives that request, it errors out with the "version
    mismatch" error.

This whole problem is solved if we stop nodes gracefully when upgrading
them. In particular, node 1 would first disseminate its draining
notification across the cluster, so during the physical planning it would
only be considered again once it has communicated its updated DistSQL
version, at which point it would be deemed DistSQL-incompatible.
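
To make the planning-time decision concrete, here is a minimal Go sketch of the kind of "healthy and compatible" check described above. All type, field, and function names are hypothetical and only illustrate the race; this is not the actual CockroachDB planner code.

```go
package main

import "fmt"

// nodeInfo is what the gateway knows about a peer from RPC health checks
// and gossip updates.
type nodeInfo struct {
	healthy        bool // the node responds to RPCs
	distSQLVersion int  // last DistSQL version received via gossip (possibly stale)
	draining       bool // whether a draining notification was received via gossip
}

// gateway models the older node doing the physical planning.
type gateway struct {
	minAcceptedVersion int
	maxAcceptedVersion int
	peers              map[int]nodeInfo
}

// canScheduleOn mirrors the "healthy and compatible" decision: a peer is used
// only if it is healthy, not draining, and its *last known* DistSQL version
// falls within the gateway's accepted range. If the peer restarted on a newer
// binary without draining, the gateway still holds the stale version and
// mis-plans the flow.
func (g *gateway) canScheduleOn(nodeID int) bool {
	info, ok := g.peers[nodeID]
	if !ok || !info.healthy || info.draining {
		return false
	}
	return info.distSQLVersion >= g.minAcceptedVersion &&
		info.distSQLVersion <= g.maxAcceptedVersion
}

func main() {
	g := &gateway{
		minAcceptedVersion: 65,
		maxAcceptedVersion: 65,
		// Node 1 was restarted on a binary that only accepts DistSQL version 69,
		// but it was killed without draining and its gossip update hasn't arrived,
		// so the gateway still sees the stale, compatible-looking version 65.
		peers: map[int]nodeInfo{1: {healthy: true, distSQLVersion: 65}},
	}
	// Prints true: node 1 gets a flow request it will later reject with
	// "version mismatch in flow request".
	fmt.Println(g.canScheduleOn(1))
}
```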

I verified that this scenario is possible (with manual adjustments of the
version upgrade test and cockroach binary to insert a delay) and that
it's fixed by this commit. I believe it is likely that other flake types
have the same root cause, but I haven't verified it.

Fixes: #87104.

Release justification: test-only change.

Release note: None

@yuzefovich yuzefovich requested review from tbg, adityamaru, a team and srosenberg and removed request for a team August 30, 2022 23:03
@cockroach-teamcity
Member

This change is Reviewable

@srosenberg
Member

Importantly, node 1 was upgraded
"abruptly" meaning that it wasn't properly drained; in particular, it
didn't send DistSQL draining notification through gossip.

When a node is stopped gracefully, does the notification still happen through gossip? My concern is that gossip is not intended to provide guaranteed delivery; what if the network flakes or the other nodes are saturated?

Member

@tbg tbg left a comment


Looks good, thanks for digging into this.

I sort of second parts of Stan's comment. I think it's okay that nodes will have outdated information about the version of peers for a short time after they hard-cycle, but is this error behavior "sticky" until the gossip update arrives? In other words, why does this error message reach the client? Shouldn't we internally re-plan the flow, but this time making sure that we don't plan on that node until we have evidence that it is ready for use? I know this is all sort of tricky, and since it "only" happens around node upgrades and unclean restarts it could be considered problematic, but there might be an issue to file still.

```diff
@@ -417,7 +417,9 @@ func upgradeNodes(
 			newVersionMsg = "<current>"
 		}
 		t.L().Printf("restarting node %d into version %s", node, newVersionMsg)
-		c.Stop(ctx, t.L(), option.DefaultStopOpts(), c.Node(node))
+		if err := c.StopCockroachGracefullyOnNode(ctx, t.L(), node); err != nil {
```
Member


Add a comment - we might still see the problem, if draining hangs, and it'll help the next soul who has to deal with it. Also, it will be helpful to document why we're doing graceful here in the first place. Of course it is also closer to production, but then maybe we ought to check here that the drain terminates gracefully - not saying to add that now, but it could be part of the comment.

Member Author

@yuzefovich yuzefovich left a comment


Thanks for the feedback! These concerns are valid, so I filed #87199 to improve things here. However, this bug has existed forever (since the introduction of DistSQL, some 5-6 years ago), and I don't think we've had any users complain about it, so it seems to be of relatively low priority. What I don't understand is why we started seeing these flakes more frequently lately.

Shouldn't we internally re-plan the flow, but this time making sure that we don't plan on that node until we have evidence that it is ready for use?

Re-planning the query at this point is a bit difficult. I think a simple way to improve things would be to stop using the upgraded node after it emits the "version mismatch" error for the first time. Currently we might keep using the incompatible upgraded node until the new DistSQL version is propagated through gossip, and all queries that use that node would fail; if we instead limit the failures to just one, that seems like a big improvement.
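
To illustrate that suggestion, here is a hedged Go sketch of a small per-gateway cache that marks a node as incompatible after its first version-mismatch error; all names are hypothetical, and this is not the actual DistSQL planner code.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
	"sync"
)

// incompatibleCache remembers peers that have already rejected a flow
// request with a version-mismatch error so that subsequent physical plans
// can avoid them instead of failing again.
type incompatibleCache struct {
	mu    sync.Mutex
	nodes map[int]bool // node ID -> known to be DistSQL-incompatible
}

func newIncompatibleCache() *incompatibleCache {
	return &incompatibleCache{nodes: make(map[int]bool)}
}

// noteFlowError records the peer as incompatible the first time it returns
// a version-mismatch error.
func (c *incompatibleCache) noteFlowError(nodeID int, err error) {
	if err == nil || !strings.Contains(err.Error(), "version mismatch in flow request") {
		return
	}
	c.mu.Lock()
	defer c.mu.Unlock()
	c.nodes[nodeID] = true
}

// skip reports whether planning should avoid the node. The entry would be
// cleared once an updated DistSQL version arrives via gossip (not shown).
func (c *incompatibleCache) skip(nodeID int) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.nodes[nodeID]
}

func main() {
	c := newIncompatibleCache()
	err := errors.New("version mismatch in flow request: 65; this node accepts 69 through 69")
	c.noteFlowError(1, err)
	// Prints true: after the first failure, new plans stop using node 1.
	fmt.Println(c.skip(1))
}
```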

bors r+



pkg/cmd/roachtest/tests/versionupgrade.go line 420 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Add a comment - we might still see the problem, if draining hangs, and it'll help the next soul who has to deal with it. Also, it will be helpful to document why we're doing graceful here in the first place. Of course it is also closer to production, but then maybe we ought to check here that the drain terminates gracefully - not saying to add that now, but it could be part of the comment.

Done.

@srosenberg
Member

Thanks for the feedback! These concerns are valid, so I filed #87199 to improve things here.

Thanks for filing #87199!

However, this bug has existed since forever (since introduction of DistSQL like 5-6 years ago), and I don't think we've had any users complain about it, so it seems to be of relatively low priority.

Do you think it's because of its non-deterministic nature and perhaps that some of the workloads have retry (i.e., equivalent of --tolerate-errors)? It's hard to imagine that the nodes would always be shut down gracefully during upgrades.

What I don't understand is why we started seeing these flakes more frequently lately.

My guess is that a CI env. change likely affected the scheduling order. @renatolabs Afaik we're not seeing an increase in this type of flake in the nightly roachtest, right?

One other question... it wasn't immediately obvious which of the upgrade steps failed, and whether or not it should have been guarded by retry logic?

@yuzefovich
Member Author

Do you think it's because of its non-deterministic nature and perhaps that some of the workloads have retry (i.e., equivalent of --tolerate-errors)? It's hard to imagine that the nodes would always be shut down gracefully during upgrades.

Non-determinism definitely plays a role - most likely due to the nature of how gossip works (the details of which I don't know). For this bug (the "version mismatch" error) to appear, there are multiple conditions to be met:

  • an upgraded node must be started (so that it would respond to a ping from an older gateway) but must not have shared its DistSQL version with the gateway yet. Probably, in the vast majority of cases the window between these two events is very short. Also, if the upgraded node encounters an error when adding the DistSQL version message into gossip, it panics (I don't know what the absence of an error means though - is it guaranteed that the gossip message will eventually be delivered? is there an upper bound on when?)
  • an older node is used as the gateway for a query; it previously populated some internal state saying that the now-upgraded node was healthy and of a compatible DistSQL version, and it didn't receive a draining notification from that now-upgraded node. Additionally, the gateway must believe that the now-upgraded node is the leaseholder for some ranges (according to its local range cache) that the query touches. My guess is that in many cases when a node goes down for an upgrade, this range cache is updated to store the new leaseholder for the relevant ranges, but I'm not sure whether that's true.

In short, there are quite a few conditions to be met for this bug to appear.

One other question... it wasn't immediately obvious which of the upgrade steps failed, and whether or not it should have been guarded by retry logic?

I don't think any of the upgrade steps failed - rather, after a partial upgrade (one node in the cluster), some "checks" (interesting queries) would fail. I'd be worried about adding a retry mechanism for those "checks" since retries might hide unexpected errors. In this case, the test failure was mostly the test's own fault, but it did expose a previously unknown (although seemingly rather rare) issue, so it was still worthy of an investigation.

@craig
Contributor

craig bot commented Aug 31, 2022

Build succeeded:

@craig craig bot merged commit da00c3a into cockroachdb:master Aug 31, 2022
@yuzefovich yuzefovich deleted the version-upgrade branch August 31, 2022 18:25
@yuzefovich
Member Author

I did some archeology: the current mechanism of handling upgrades in DistSQL was introduced in #17747. At the time that PR was merged, re-planning on a "version mismatch" error was considered in #17497, but it was deemed too hard, and I agree with that stance today.

I also briefly looked into gossip, and even though I don't think it provides any guarantees (especially an upper bound on when information is disseminated), gossip probably works fast enough that the window in which the errors could occur is rather short.

@renatolabs
Contributor

@renatolabs Afaik we're not seeing an increase in this type of flake in the nightly roachtest, right?

Indeed, this code is used in a variety of upgrade tests but as far as I know, only the acceptance upgrade test has been failing with this issue. It might be that it is especially capable of hitting the specific conditions that trigger the error here.

It's hard to imagine that the nodes would always be shut down gracefully during upgrades.

I was looking at the official documentation on the upgrade process, and I found this:

Warning:

We do not recommend sending SIGKILL to perform a "hard" shutdown, which bypasses CockroachDB's node shutdown logic and forcibly terminates the process. This can corrupt log files and, in certain edge cases, can result in temporary data unavailability, latency spikes, uncertainty errors, ambiguous commit errors, or query timeouts. When decommissioning, a hard shutdown will leave ranges under-replicated and vulnerable to another node failure, causing quorum loss in the window before up-replication completes.

Therefore, I agree we should always be gracefully shutting down nodes, as it would be really difficult, in the tests, to discern whether an error falls under those expected errors or not.

Comment on lines +426 to +427
// TODO(yuzefovich): ideally, we would also check that the drain was
// successful since if it wasn't, then we might see flakes too.
Contributor


Currently, StopCockroachGracefullyOnNode will wait for 1 minute, and then SIGKILL the process. If we instead waited for one minute and bubbled up a timeout error if the process is still running, would that deal with the situation described here?

Member Author

@yuzefovich yuzefovich left a comment



pkg/cmd/roachtest/tests/versionupgrade.go line 427 at r2 (raw file):

Previously, renatolabs (Renato Costa) wrote…

Currently, StopCockroachGracefullyOnNode will wait for 1 minute, and then SIGKILL the process. If we instead waited for one minute and bubbled up a timeout error if the process is still running, would that deal with the situation described here?

Yeah, I believe so. It's quite possible that the timeout would be ok, so we would probably want the timeout error to be logged, but not fail the test.
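
As a rough illustration of that idea, here is a self-contained Go sketch of draining with a deadline that logs, rather than fails, when the deadline is hit and then falls back to a hard stop. The helper and its callbacks are stand-ins, not the real roachtest cluster API.

```go
package main

import (
	"context"
	"errors"
	"log"
	"time"
)

// gracefulThenHardStop runs the graceful-stop function with a deadline; if
// the drain does not finish in time, it logs a warning and falls back to a
// hard stop instead of returning an error that would fail the test.
func gracefulThenHardStop(
	ctx context.Context,
	timeout time.Duration,
	gracefulStop func(context.Context) error,
	hardStop func() error,
) error {
	drainCtx, cancel := context.WithTimeout(ctx, timeout)
	defer cancel()
	if err := gracefulStop(drainCtx); err != nil {
		if errors.Is(err, context.DeadlineExceeded) {
			// Drain hung; record it for debugging but don't fail the test.
			log.Printf("WARNING: drain did not finish within %s; killing the node", timeout)
			return hardStop()
		}
		return err
	}
	return nil
}

func main() {
	// A drain that takes longer than the deadline, standing in for a hung node.
	slowDrain := func(ctx context.Context) error {
		select {
		case <-time.After(5 * time.Second):
			return nil
		case <-ctx.Done():
			return ctx.Err()
		}
	}
	hardStop := func() error { return nil } // stand-in for sending SIGKILL
	_ = gracefulThenHardStop(context.Background(), time.Second, slowDrain, hardStop)
}
```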

renatolabs added a commit to renatolabs/cockroach that referenced this pull request Sep 28, 2022
In cockroachdb#87154, we started gracefully shutting down nodes in mixed-version
tests. As a consequence of that change, the `cdc/mixed-versions`
roachtest started failing. After some investigation, it was discovered
that the issue has nothing to do with upgrades but instead with the
different way nodes are stopped (if we don't use different
binaries and just stop and restart the same binaries, running current
`master`, the issue is still observed).

In order to stop that test from failing, this commit implements a
workaround solution of stopping the changefeed while an upgrade is in
process, which makes the test pass reliably. This commit also removes
the retry logic from the FingerprintValidator code as it adds a
dimension of complexity that is not needed in the tests. The validator
will now block if trying to perform a SQL update until the upgrade
process is finished.

The actual issue that leads to this failure is being investigated in cockroachdb#88948.

Fixes cockroachdb#87251.

Release note: None.
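
As a rough illustration of the blocking behavior described in that commit message, here is a hedged Go sketch of gating writes on an in-progress upgrade; the types are illustrative stand-ins, not the actual FingerprintValidator or roachtest code.

```go
package main

import (
	"fmt"
	"sync"
)

// upgradeGate lets the test mark an upgrade as in progress; validator writes
// wait at the gate instead of racing with node restarts.
type upgradeGate struct {
	mu sync.RWMutex
}

// beginUpgrade blocks new writes until endUpgrade is called.
func (g *upgradeGate) beginUpgrade() { g.mu.Lock() }
func (g *upgradeGate) endUpgrade()   { g.mu.Unlock() }

// runUpdate waits for any in-progress upgrade to finish before performing
// the SQL update (stubbed out here with a print).
func (g *upgradeGate) runUpdate(stmt string) {
	g.mu.RLock()
	defer g.mu.RUnlock()
	fmt.Println("executing:", stmt)
}

func main() {
	var g upgradeGate
	var wg sync.WaitGroup

	g.beginUpgrade() // upgrade starts
	wg.Add(1)
	go func() {
		defer wg.Done()
		// Waits here if the upgrade is still in progress.
		g.runUpdate("UPSERT INTO fingerprints VALUES (1)")
	}()
	g.endUpgrade() // upgrade finished; any pending update can proceed
	wg.Wait()
}
```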
renatolabs added a commit to renatolabs/cockroach that referenced this pull request Sep 29, 2022
Development

Successfully merging this pull request may close these issues.

roachtest: acceptance/version-upgrade is flaky
5 participants