roachtest: stop cockroach gracefully when upgrading nodes #87154
Conversation
When a node is stopped gracefully, does the notification still happen through gossip? My concern is that gossip is not intended to provide guaranteed delivery; what if the network flakes or the other nodes are saturated?
Looks good, thanks for digging into this.
I sort of second parts of Stan's comment. I think it's okay that nodes will have outdated information about the version of peers for a short time after they hard-cycle, but is this error behavior "sticky" until the gossip update arrives? In other words, why does this error message reach the client? Shouldn't we internally re-plan the flow, but this time making sure that we don't plan on that node until we have evidence that it is ready for use? I know this is all sort of tricky, and since it "only" happens around node upgrades and unclean restarts it could be considered relatively benign, but there might still be an issue to file.
@@ -417,7 +417,9 @@ func upgradeNodes(
 		newVersionMsg = "<current>"
 	}
 	t.L().Printf("restarting node %d into version %s", node, newVersionMsg)
-	c.Stop(ctx, t.L(), option.DefaultStopOpts(), c.Node(node))
+	if err := c.StopCockroachGracefullyOnNode(ctx, t.L(), node); err != nil {
Add a comment - we might still see the problem, if draining hangs, and it'll help the next soul who has to deal with it. Also, it will be helpful to document why we're doing graceful here in the first place. Of course it is also closer to production, but then maybe we ought to check here that the drain terminates gracefully - not saying to add that now, but it could be part of the comment.
This commit makes it so that we stop cockroach nodes gracefully when upgrading them. The previous abrupt behavior of stopping the nodes during the upgrade could lead to test flakes because the nodes were not being properly drained.

Here is one scenario for how one of the flakes arises (`pq: version mismatch in flow request: 65; this node accepts 69 through 69`, which means that a gateway running an older version asks another node running a newer version to do DistSQL computation, but the versions are not DistSQL compatible):
- We are in a state where node 1 is running a newer version while node 2 is running an older version. Importantly, node 1 was upgraded "abruptly", meaning that it wasn't properly drained; in particular, it didn't send the DistSQL draining notification through gossip.
- The newer node has already been started but its DistSQL server hasn't been started yet (although it already can accept incoming RPCs - see comments on `distsql.ServerImpl.Start` for more details). This means that the newer node has **not** sent a gossip update about its DistSQL version.
- Node 2 acts as the gateway for a query that reads some data that node 1 is the leaseholder for. During the physical planning, older node 2 checks whether newer node 1 is "healthy and compatible", and node 1 is deemed both healthy (because it can accept incoming RPCs) and compatible (because node 2 hasn't received the updated DistSQL version of node 1, since it hasn't been sent yet). As a result, node 2 plans a read on node 1.
- When node 1 receives that request, it errors out with the "version mismatch" error.

This whole problem is solved if we stop nodes gracefully when upgrading them. In particular, node 1 would first disseminate its draining notification across the cluster, so during physical planning it would only be considered once it has already communicated its updated DistSQL version, at which point it would be deemed DistSQL-incompatible.

I verified that this scenario is possible (with manual adjustments of the version upgrade test and cockroach binary to insert a delay) and that it's fixed by this commit. I believe it is likely that other flake types have the same root cause, but I haven't verified it.

Fixes: #87104.

Release justification: test-only change.

Release note: None
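To make the planning check described above more concrete, here is a minimal, self-contained Go sketch (hypothetical types and names, not the actual CockroachDB code) of how a gateway might decide whether to schedule DistSQL work on a peer based on health and gossiped version information:

```go
package main

import "fmt"

// distSQLVersionInfo models what a gateway might know about a peer from
// gossip: the peer's advertised DistSQL version range and whether the peer
// has announced that it is draining. Hypothetical struct, for illustration.
type distSQLVersionInfo struct {
	MinAcceptedVersion int
	Version            int
	Draining           bool
}

// canPlanOn mirrors the "healthy and compatible" check described in the
// commit message: a peer is used only if it responds to RPCs (health),
// is not draining, and its advertised version range contains ours.
// If gossip has not yet delivered updated info for the peer, the gateway
// decides based on stale data, which is exactly the window in which the
// "version mismatch" error can occur.
func canPlanOn(healthy bool, gossiped *distSQLVersionInfo, ourVersion int) bool {
	if !healthy {
		return false
	}
	if gossiped == nil {
		// No gossip info at all: this sketch conservatively skips the node.
		return false
	}
	if gossiped.Draining {
		return false
	}
	return ourVersion >= gossiped.MinAcceptedVersion && ourVersion <= gossiped.Version
}

func main() {
	// Node 1 was restarted abruptly: gossip still holds its *old* info,
	// so an old-version gateway (version 65) wrongly deems it compatible.
	stale := &distSQLVersionInfo{MinAcceptedVersion: 65, Version: 65, Draining: false}
	fmt.Println(canPlanOn(true, stale, 65)) // true: planned on, then fails with "version mismatch"

	// After the updated gossip arrives (node 1 now accepts only 69),
	// the same gateway correctly skips node 1.
	fresh := &distSQLVersionInfo{MinAcceptedVersion: 69, Version: 69, Draining: false}
	fmt.Println(canPlanOn(true, fresh, 65)) // false
}
```

With stale gossip (the first case) the old gateway still believes node 1 accepts version 65, plans on it, and hits the error; once the updated info arrives, node 1 is simply skipped.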
Force-pushed from 669186a to 0b5a851.
Thanks for the feedback! These concerns are valid, so I filed #87199 to improve things here. However, this bug has existed forever (since the introduction of DistSQL, 5-6 years ago), and I don't think we've had any users complain about it, so it seems to be of relatively low priority. What I don't understand is why we started seeing these flakes more frequently lately.
Shouldn't we internally re-plan the flow, but this time making sure that we don't plan on that node until we have evidence that it is ready for use?
Re-planning the query at this point is a bit difficult. I think a simple way to improve things would be to not use the upgraded node after it emits the "version mismatch" error for the first time. Currently we might keep using the incompatible upgraded node until its new DistSQL version is propagated through gossip, and all queries that use that node would fail; if we limit the failures to just one, that seems like a big improvement.
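As a rough sketch of that "fail at most once" idea (hypothetical helper, not existing CockroachDB code): the gateway could remember peers that recently returned a "version mismatch" error and skip them for a short grace period, long enough for the gossip update to arrive.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// incompatibleNodes remembers peers that recently returned a "version
// mismatch" error so that the gateway stops planning on them until gossip
// catches up. Hypothetical helper illustrating the idea in the comment
// above; it is not part of the CockroachDB codebase.
type incompatibleNodes struct {
	mu    sync.Mutex
	until map[int]time.Time // nodeID -> time until which the node is skipped
}

func newIncompatibleNodes() *incompatibleNodes {
	return &incompatibleNodes{until: make(map[int]time.Time)}
}

// markIncompatible is called after the first "version mismatch" error from
// a node; subsequent plans avoid that node for the given grace period.
func (c *incompatibleNodes) markIncompatible(nodeID int, grace time.Duration) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.until[nodeID] = time.Now().Add(grace)
}

// shouldSkip reports whether the node is still within its grace period.
func (c *incompatibleNodes) shouldSkip(nodeID int) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	deadline, ok := c.until[nodeID]
	if !ok {
		return false
	}
	if time.Now().After(deadline) {
		delete(c.until, nodeID)
		return false
	}
	return true
}

func main() {
	cache := newIncompatibleNodes()
	fmt.Println(cache.shouldSkip(1)) // false: node 1 has not failed yet
	cache.markIncompatible(1, 30*time.Second)
	fmt.Println(cache.shouldSkip(1)) // true: skip node 1 until gossip catches up
}
```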
bors r+
pkg/cmd/roachtest/tests/versionupgrade.go
line 420 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Add a comment - we might still see the problem, if draining hangs, and it'll help the next soul who has to deal with it. Also, it will be helpful to document why we're doing graceful here in the first place. Of course it is also closer to production, but then maybe we ought to check here that the drain terminates gracefully - not saying to add that now, but it could be part of the comment.
Done.
Thanks for filing #87199!
Do you think it's because of its non-deterministic nature and perhaps that some of the workloads have retries (i.e., the equivalent of …)?
My guess is that a CI environment change likely affected the scheduling order. @renatolabs Afaik we're not seeing an increase in this type of flake in the nightly roachtests, right? One other question: it wasn't immediately obvious which of the upgrade steps failed and whether or not it should have been guarded by retry logic.
Non-determinism definitely plays a role - most likely due to the nature of how gossip works (which I don't know in detail). For this bug (the "version mismatch" error) to appear, there are multiple conditions to be met:
- the upgraded node must have been stopped abruptly, so its DistSQL draining notification never went out through gossip;
- the upgraded node must have been restarted but not yet gossiped its new DistSQL version;
- an older-version node must act as the gateway for a query that touches data for which the upgraded node is the leaseholder, so that it plans part of the query on that node.

In short, there are quite a few conditions to be met for this bug to appear.
I don't think any of the upgrade steps failed - rather, after a partial upgrade (one node in the cluster), some "checks" (interesting queries) would fail. I'd be worried about adding a retry mechanism for those "checks" since retries might hide unexpected errors. In this case, the failure was mostly the test's fault, but it did expose a previously unknown (although seemingly rather rare) issue, so it was still worth investigating.
Build succeeded.
I did some archeology: the current mechanism of handling upgrades in DistSQL was introduced in #17747. At the time that PR was merged, re-planning on a "version mismatch" error was considered in #17497, but it was deemed too hard, and I agree with that stance today. I also briefly looked into gossip, and even though I don't think it provides any guarantees (especially on the upper bound for when information is disseminated), gossip probably works fast enough that the window in which the errors could occur is rather short.
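As a rough illustration of why gossip-style propagation tends to converge quickly, here is a toy push-gossip simulation in Go (this is not CockroachDB's gossip implementation); the number of rounds grows roughly logarithmically with cluster size, which is why the error window is usually short:

```go
package main

import (
	"fmt"
	"math/rand"
)

// simulateGossip runs a toy push-style gossip protocol round by round:
// every node that already knows the update tells one random peer per round.
// It returns the number of rounds until all nodes know. This is only an
// illustration of why propagation tends to be fast, not CockroachDB's
// actual gossip protocol.
func simulateGossip(numNodes int, rng *rand.Rand) int {
	informed := make([]bool, numNodes)
	informed[0] = true // node 0 originates the update (e.g., a new DistSQL version)
	count := 1
	rounds := 0
	for count < numNodes {
		rounds++
		// Snapshot who is informed at the start of the round so that nodes
		// informed during this round only start pushing in the next one.
		snapshot := make([]bool, numNodes)
		copy(snapshot, informed)
		for i := 0; i < numNodes; i++ {
			if snapshot[i] {
				peer := rng.Intn(numNodes)
				if !informed[peer] {
					informed[peer] = true
					count++
				}
			}
		}
	}
	return rounds
}

func main() {
	rng := rand.New(rand.NewSource(1))
	for _, n := range []int{4, 16, 64, 256} {
		fmt.Printf("%d nodes: converged in %d rounds\n", n, simulateGossip(n, rng))
	}
}
```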
Indeed, this code is used in a variety of upgrade tests, but as far as I know only the acceptance upgrade test has been failing with this issue. It might be that this test is especially good at hitting the specific conditions that trigger the error.
I'm looking at the official documentation on the upgrade process, and I found this warning: […]. Therefore, I agree we should always be gracefully shutting down nodes, as it would be really difficult, in the tests, to discern whether an error falls under those expected errors or not.
// TODO(yuzefovich): ideally, we would also check that the drain was
// successful since if it wasn't, then we might see flakes too.
Currently, `StopCockroachGracefullyOnNode` will wait for 1 minute and then SIGKILL the process. If we instead waited for one minute and bubbled up a timeout error if the process is still running, would that deal with the situation described here?
pkg/cmd/roachtest/tests/versionupgrade.go
line 427 at r2 (raw file):
Previously, renatolabs (Renato Costa) wrote…
Currently, `StopCockroachGracefullyOnNode` will wait for 1 minute and then SIGKILL the process. If we instead waited for one minute and bubbled up a timeout error if the process is still running, would that deal with the situation described here?
Yeah, I believe so. It's quite possible that hitting the timeout would be ok, so we would probably want the timeout error to be logged but not to fail the test.
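A minimal Go sketch of that behavior, assuming a hypothetical drain helper (the real roachtest helpers have different signatures): run the graceful drain under a timeout, log the timeout instead of failing the test, and let the caller fall back to a hard stop.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"log"
	"time"
)

// drainFn stands in for whatever actually drains and stops a node.
// Hypothetical abstraction, for illustration only.
type drainFn func(ctx context.Context, node int) error

// stopGracefullyWithTimeout attempts a graceful drain, but if it does not
// finish within the timeout it logs the problem (so the flake stays
// diagnosable) without failing the test, and reports that the caller
// should fall back to a hard stop.
func stopGracefullyWithTimeout(
	ctx context.Context, node int, timeout time.Duration, drain drainFn,
) (drained bool) {
	drainCtx, cancel := context.WithTimeout(ctx, timeout)
	defer cancel()
	if err := drain(drainCtx, node); err != nil {
		if errors.Is(err, context.DeadlineExceeded) {
			// Drain timed out: log it, but do not fail the test outright.
			log.Printf("n%d: drain did not finish within %s; falling back to hard stop", node, timeout)
			return false
		}
		log.Printf("n%d: drain failed: %v", node, err)
		return false
	}
	return true
}

func main() {
	// Simulated drain that hangs longer than the allowed timeout.
	slowDrain := func(ctx context.Context, node int) error {
		select {
		case <-time.After(2 * time.Second):
			return nil
		case <-ctx.Done():
			return ctx.Err()
		}
	}
	fmt.Println(stopGracefullyWithTimeout(context.Background(), 1, time.Second, slowDrain))
}
```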
A follow-up commit that referenced this PR:
In cockroachdb#87154, we started gracefully shutting down nodes in mixed-version tests. As a consequence of that change, the `cdc/mixed-versions` roachtest started failing. After some investigation, it was discovered that the issue has nothing to do with upgrades but instead with the different way nodes are stopped (if we don't use different binaries and just stop and restart the same binaries, running current `master`, the issue is still observed). In order to stop that test from failing, this commit implements a workaround solution of stopping the changefeed while an upgrade is in progress, which makes the test pass reliably. This commit also removes the retry logic from the FingerprintValidator code as it adds a dimension of complexity that is not needed in the tests. The validator will now block when trying to perform a SQL update until the upgrade process is finished. The actual issue that leads to this failure is being investigated in cockroachdb#88948. Fixes cockroachdb#87251. Release note: None.
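The commit message above mentions that the validator blocks SQL updates while an upgrade is in progress. A minimal sketch of such a gate in Go (hypothetical helper, not the actual FingerprintValidator code):

```go
package main

import (
	"fmt"
	"sync"
)

// upgradeGate illustrates the workaround described above: updates issued by
// a validator block while an upgrade is in progress and resume once it
// finishes. Hypothetical helper, for illustration only.
type upgradeGate struct {
	mu        sync.Mutex
	cond      *sync.Cond
	upgrading bool
}

func newUpgradeGate() *upgradeGate {
	g := &upgradeGate{}
	g.cond = sync.NewCond(&g.mu)
	return g
}

// beginUpgrade marks an upgrade as in progress; callers of wait will block.
func (g *upgradeGate) beginUpgrade() {
	g.mu.Lock()
	defer g.mu.Unlock()
	g.upgrading = true
}

// endUpgrade marks the upgrade as finished and wakes up blocked callers.
func (g *upgradeGate) endUpgrade() {
	g.mu.Lock()
	defer g.mu.Unlock()
	g.upgrading = false
	g.cond.Broadcast()
}

// wait blocks the caller until no upgrade is in progress.
func (g *upgradeGate) wait() {
	g.mu.Lock()
	defer g.mu.Unlock()
	for g.upgrading {
		g.cond.Wait()
	}
}

func main() {
	gate := newUpgradeGate()
	gate.beginUpgrade()
	done := make(chan struct{})
	go func() {
		gate.wait() // a validator update would block here mid-upgrade
		fmt.Println("update applied after upgrade finished")
		close(done)
	}()
	gate.endUpgrade()
	<-done
}
```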