
roachtest: stop cockroach gracefully when upgrading nodes #87154

Merged
merged 1 commit into cockroachdb:master on Aug 31, 2022

Conversation

yuzefovich
Member

This commit makes it so that we stop cockroach nodes gracefully when
upgrading them. The previous behavior of stopping the nodes abruptly during
the upgrade could lead to test flakes because the nodes were not
properly drained.

Here is one scenario for how one of the flakes could occur (`pq: version
mismatch in flow request: 65; this node accepts 69 through 69`, which means
that a gateway running an older version asks another node running a newer
version to do DistSQL computation, but the versions are not DistSQL
compatible):

  • we are in a state where node 1 is running a newer version while node
    2 is running an older version. Importantly, node 1 was upgraded
    "abruptly", meaning that it wasn't properly drained; in particular, it
    didn't send the DistSQL draining notification through gossip.
  • the newer node has already been started, but its DistSQL server hasn't
    been started yet (although it can already accept incoming RPCs - see the
    comments on distsql.ServerImpl.Start for more details). This means that
    the newer node has not yet gossiped an update about its DistSQL version.
  • node 2 acts as the gateway for a query that reads some data that node
    1 is the leaseholder for. During the physical planning, older node
    2 checks whether newer node 1 is "healthy and compatible", and node 1 is
    deemed both healthy (because it can accept incoming RPCs) and
    compatible (because node 2 hasn't received node 1's updated DistSQL
    version since it hasn't been sent yet). As a result, node 2 plans a read
    on node 1.
  • when node 1 receives that request, it errors out with the "version
    mismatch" error.

This whole problem is solved if we stop nodes gracefully when upgrading
them. In particular, node 1 would first disseminate its draining
notification across the cluster, so during the physical planning it would
only be considered again once it has communicated its updated DistSQL
version, at which point it would be deemed DistSQL-incompatible.
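
To make the planning-time decision concrete, here is a minimal Go sketch of the kind of "healthy and compatible" check described above. All type, field, and function names are hypothetical and only illustrate the race; this is not the actual CockroachDB planner code.

```go
package main

import "fmt"

// nodeInfo is what the gateway knows about a peer from RPC health checks
// and gossip updates.
type nodeInfo struct {
	healthy        bool // the node responds to RPCs
	distSQLVersion int  // last DistSQL version received via gossip (possibly stale)
	draining       bool // whether a draining notification was received via gossip
}

// gateway models the older node doing the physical planning.
type gateway struct {
	minAcceptedVersion int
	maxAcceptedVersion int
	peers              map[int]nodeInfo
}

// canScheduleOn mirrors the "healthy and compatible" decision: a peer is used
// only if it is healthy, not draining, and its *last known* DistSQL version
// falls within the gateway's accepted range. If the peer restarted on a newer
// binary without draining, the gateway still holds the stale version and
// mis-plans the flow.
func (g *gateway) canScheduleOn(nodeID int) bool {
	info, ok := g.peers[nodeID]
	if !ok || !info.healthy || info.draining {
		return false
	}
	return info.distSQLVersion >= g.minAcceptedVersion &&
		info.distSQLVersion <= g.maxAcceptedVersion
}

func main() {
	g := &gateway{
		minAcceptedVersion: 65,
		maxAcceptedVersion: 65,
		// Node 1 was restarted on a binary that only accepts DistSQL version 69,
		// but it was killed without draining and its gossip update hasn't arrived,
		// so the gateway still sees the stale, compatible-looking version 65.
		peers: map[int]nodeInfo{1: {healthy: true, distSQLVersion: 65}},
	}
	// Prints true: node 1 gets a flow request it will later reject with
	// "version mismatch in flow request".
	fmt.Println(g.canScheduleOn(1))
}
```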

I verified that this scenario is possible (with manual adjustments of the
version upgrade test and cockroach binary to insert a delay) and that
it's fixed by this commit. I believe it is likely that other flake types
have the same root cause, but I haven't verified it.

Fixes: #87104.

Release justification: test-only change.

Release note: None

@yuzefovich yuzefovich requested review from tbg, adityamaru, a team and srosenberg and removed request for a team August 30, 2022 23:03
@cockroach-teamcity
Member

This change is Reviewable

@srosenberg
Member

Importantly, node 1 was upgraded
"abruptly" meaning that it wasn't properly drained; in particular, it
didn't send DistSQL draining notification through gossip.

When a node is stopped gracefully, does the notification still happen through gossip? My concern is that gossip is not intended to provide guaranteed delivery; what if the network flakes or the other nodes are saturated?

Member

@tbg tbg left a comment


Looks good, thanks for digging into this.

I sort of second parts of Stan's comment. I think it's okay that nodes will have outdated information about the version of peers for a short time after they hard-cycle, but is this error behavior "sticky" until the gossip update arrives? In other words, why does this error message reach the client? Shouldn't we internally re-plan the flow, but this time making sure that we don't plan on that node until we have evidence that it is ready for use? I know this is all sort of tricky, and since it "only" happens around node upgrades and unclean restarts it could be considered problematic, but there might be an issue to file still.

```diff
@@ -417,7 +417,9 @@ func upgradeNodes(
 			newVersionMsg = "<current>"
 		}
 		t.L().Printf("restarting node %d into version %s", node, newVersionMsg)
-		c.Stop(ctx, t.L(), option.DefaultStopOpts(), c.Node(node))
+		if err := c.StopCockroachGracefullyOnNode(ctx, t.L(), node); err != nil {
```
Member


Add a comment - we might still see the problem, if draining hangs, and it'll help the next soul who has to deal with it. Also, it will be helpful to document why we're doing graceful here in the first place. Of course it is also closer to production, but then maybe we ought to check here that the drain terminates gracefully - not saying to add that now, but it could be part of the comment.

Member Author

@yuzefovich yuzefovich left a comment


Thanks for the feedback! These concerns are valid, so I filed #87199 to improve things here. However, this bug has existed forever (since the introduction of DistSQL, some 5-6 years ago), and I don't think we've had any users complain about it, so it seems to be of relatively low priority. What I don't understand is why we started seeing these flakes more frequently lately.

Shouldn't we internally re-plan the flow, but this time making sure that we don't plan on that node until we have evidence that it is ready for use?

Re-planning the query at this point is a bit difficult. I think a simple way to improve things would be to stop using the upgraded node after it emits the "version mismatch" error for the first time. Currently we might keep using the incompatible upgraded node until the new DistSQL version is propagated through gossip, and all queries that use that node would fail; if we instead limit the failures to just one, that seems like a big improvement.
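
To illustrate that suggestion, here is a hedged Go sketch of a small per-gateway cache that marks a node as incompatible after its first version-mismatch error; all names are hypothetical, and this is not the actual DistSQL planner code.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
	"sync"
)

// incompatibleCache remembers peers that have already rejected a flow
// request with a version-mismatch error so that subsequent physical plans
// can avoid them instead of failing again.
type incompatibleCache struct {
	mu    sync.Mutex
	nodes map[int]bool // node ID -> known to be DistSQL-incompatible
}

func newIncompatibleCache() *incompatibleCache {
	return &incompatibleCache{nodes: make(map[int]bool)}
}

// noteFlowError records the peer as incompatible the first time it returns
// a version-mismatch error.
func (c *incompatibleCache) noteFlowError(nodeID int, err error) {
	if err == nil || !strings.Contains(err.Error(), "version mismatch in flow request") {
		return
	}
	c.mu.Lock()
	defer c.mu.Unlock()
	c.nodes[nodeID] = true
}

// skip reports whether planning should avoid the node. The entry would be
// cleared once an updated DistSQL version arrives via gossip (not shown).
func (c *incompatibleCache) skip(nodeID int) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.nodes[nodeID]
}

func main() {
	c := newIncompatibleCache()
	err := errors.New("version mismatch in flow request: 65; this node accepts 69 through 69")
	c.noteFlowError(1, err)
	// Prints true: after the first failure, new plans stop using node 1.
	fmt.Println(c.skip(1))
}
```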

bors r+



pkg/cmd/roachtest/tests/versionupgrade.go line 420 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Add a comment - we might still see the problem, if draining hangs, and it'll help the next soul who has to deal with it. Also, it will be helpful to document why we're doing graceful here in the first place. Of course it is also closer to production, but then maybe we ought to check here that the drain terminates gracefully - not saying to add that now, but it could be part of the comment.

Done.

@srosenberg
Member

Thanks for the feedback! These concerns are valid, so I filed #87199 to improve things here.

Thanks for filing #87199!

However, this bug has existed since forever (since introduction of DistSQL like 5-6 years ago), and I don't think we've had any users complain about it, so it seems to be of relatively low priority.

Do you think it's because of its non-deterministic nature and perhaps that some of the workloads have retry (i.e., equivalent of --tolerate-errors)? It's hard to imagine that the nodes would always be shut down gracefully during upgrades.

What I don't understand is why we started seeing these flakes more frequently lately.

My guess is that a CI env. change likely affected the scheduling order. @renatolabs Afaik we're not seeing an increase in this type of flake in the nightly roachtest, right?

One other question... it wasn't immediately obvious which of the upgrade steps failed, and whether or not it should have been guarded by retry logic?

@yuzefovich
Member Author

Do you think it's because of its non-deterministic nature and perhaps that some of the workloads have retry (i.e., equivalent of --tolerate-errors)? It's hard to imagine that the nodes would always be shut down gracefully during upgrades.

Non-determinism definitely plays a role - most likely due to the nature of how gossip works (the details of which I don't know). For this bug (the "version mismatch" error) to appear, there are multiple conditions to be met:

  • an upgraded node must be started (so that it would respond to a ping from an older gateway) but must not have shared its DistSQL version with the gateway yet. Probably, in the vast majority of cases the window between these two events is very short. Also, if the upgraded node encounters an error when adding the DistSQL version message into gossip, it panics (I don't know what the absence of an error means though - is it guaranteed that the gossip message will eventually be delivered? is there an upper bound on when?)
  • an older node is used as the gateway for a query; it previously populated some internal state saying that the now-upgraded node was healthy and of a compatible DistSQL version, and it didn't receive a draining notification from that now-upgraded node. Additionally, the gateway must believe that the now-upgraded node is the leaseholder for some ranges (according to its local range cache) that the query touches. My guess is that in many cases when a node goes down for an upgrade, this range cache is updated to store the new leaseholder for the relevant ranges, but I'm not sure whether that's true.

In short, there are quite a few conditions to be met for this bug to appear.

One other question... it wasn't immediately obvious which of the upgrade steps failed, and whether or not it should have been guarded by retry logic?

I don't think any of the upgrade steps failed - rather, after a partial upgrade (one node in the cluster), some "checks" (interesting queries) would fail. I'd be worried about adding a retry mechanism for those "checks" since retries might hide unexpected errors. In this case, the test failure was mostly the test's own fault, but it did expose a previously unknown (although seemingly rather rare) issue, so it was still worthy of an investigation.

@craig
Contributor

craig bot commented Aug 31, 2022

Build succeeded:

@craig craig bot merged commit da00c3a into cockroachdb:master Aug 31, 2022
@yuzefovich yuzefovich deleted the version-upgrade branch August 31, 2022 18:25
@yuzefovich
Member Author

I did some archeology: the current mechanism of handling upgrades in DistSQL was introduced in #17747. At the time that PR was merged, re-planning on a "version mismatch" error was considered in #17497, but it was deemed too hard, and I agree with that stance today.

I also briefly looked into gossip, and even though I don't think it provides any guarantees (especially an upper bound on when information is disseminated), gossip probably works fast enough that the window in which the errors could occur is rather short.

@renatolabs
Contributor

@renatolabs Afaik we're not seeing an increase in this type of flake in the nightly roachtest, right?

Indeed, this code is used in a variety of upgrade tests but as far as I know, only the acceptance upgrade test has been failing with this issue. It might be that it is especially capable of hitting the specific conditions that trigger the error here.

It's hard to imagine that the nodes would always be shut down gracefully during upgrades.

I was looking at the official documentation on the upgrade process, and I found this:

Warning:

We do not recommend sending SIGKILL to perform a "hard" shutdown, which bypasses CockroachDB's node shutdown logic and forcibly terminates the process. This can corrupt log files and, in certain edge cases, can result in temporary data unavailability, latency spikes, uncertainty errors, ambiguous commit errors, or query timeouts. When decommissioning, a hard shutdown will leave ranges under-replicated and vulnerable to another node failure, causing quorum loss in the window before up-replication completes.

Therefore, I agree we should always be gracefully shutting down nodes, as it would be really difficult, in the tests, to discern whether an error falls under those expected errors or not.

Comment on lines +426 to +427
// TODO(yuzefovich): ideally, we would also check that the drain was
// successful since if it wasn't, then we might see flakes too.
Contributor


Currently, StopCockroachGracefullyOnNode will wait for 1 minute, and then SIGKILL the process. If we instead waited for one minute and bubbled up a timeout error if the process is still running, would that deal with the situation described here?

Member Author

@yuzefovich yuzefovich left a comment



pkg/cmd/roachtest/tests/versionupgrade.go line 427 at r2 (raw file):

Previously, renatolabs (Renato Costa) wrote…

Currently, StopCockroachGracefullyOnNode will wait for 1 minute, and then SIGKILL the process. If we instead waited for one minute and bubbled up a timeout error if the process is still running, would that deal with the situation described here?

Yeah, I believe so. It's quite possible that the timeout would be ok, so we would probably want the timeout error to be logged, but not fail the test.
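
As a rough illustration of that idea, here is a self-contained Go sketch of draining with a deadline that logs, rather than fails, when the deadline is hit and then falls back to a hard stop. The helper and its callbacks are stand-ins, not the real roachtest cluster API.

```go
package main

import (
	"context"
	"errors"
	"log"
	"time"
)

// gracefulThenHardStop runs the graceful-stop function with a deadline; if
// the drain does not finish in time, it logs a warning and falls back to a
// hard stop instead of returning an error that would fail the test.
func gracefulThenHardStop(
	ctx context.Context,
	timeout time.Duration,
	gracefulStop func(context.Context) error,
	hardStop func() error,
) error {
	drainCtx, cancel := context.WithTimeout(ctx, timeout)
	defer cancel()
	if err := gracefulStop(drainCtx); err != nil {
		if errors.Is(err, context.DeadlineExceeded) {
			// Drain hung; record it for debugging but don't fail the test.
			log.Printf("WARNING: drain did not finish within %s; killing the node", timeout)
			return hardStop()
		}
		return err
	}
	return nil
}

func main() {
	// A drain that takes longer than the deadline, standing in for a hung node.
	slowDrain := func(ctx context.Context) error {
		select {
		case <-time.After(5 * time.Second):
			return nil
		case <-ctx.Done():
			return ctx.Err()
		}
	}
	hardStop := func() error { return nil } // stand-in for sending SIGKILL
	_ = gracefulThenHardStop(context.Background(), time.Second, slowDrain, hardStop)
}
```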

renatolabs added a commit to renatolabs/cockroach that referenced this pull request Sep 28, 2022
In cockroachdb#87154, we started gracefully shutting down nodes in mixed-version
tests. As a consequence of that change, the `cdc/mixed-versions`
roachtest started failing. After some investigation, it was discovered
that the issue has nothing to do with upgrades but instead with the
different way nodes are stopped (if we don't use different
binaries and just stop and restart the same binaries, running current
`master`, the issue is still observed).

In order to stop that test from failing, this commit implements a
workaround solution of stopping the changefeed while an upgrade is in
process, which makes the test pass reliably. This commit also removes
the retry logic from the FingerprintValidator code as it adds a
dimension of complexity that is not needed in the tests. The validator
will now block if trying to perform a SQL update until the upgrade
process is finished.

The actual issue that leads to this failure is being investigated in cockroachdb#88948.

Fixes cockroachdb#87251.

Release note: None.
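
As a rough illustration of the blocking behavior described in that commit message, here is a hedged Go sketch of gating writes on an in-progress upgrade; the types are illustrative stand-ins, not the actual FingerprintValidator or roachtest code.

```go
package main

import (
	"fmt"
	"sync"
)

// upgradeGate lets the test mark an upgrade as in progress; validator writes
// wait at the gate instead of racing with node restarts.
type upgradeGate struct {
	mu sync.RWMutex
}

// beginUpgrade blocks new writes until endUpgrade is called.
func (g *upgradeGate) beginUpgrade() { g.mu.Lock() }
func (g *upgradeGate) endUpgrade()   { g.mu.Unlock() }

// runUpdate waits for any in-progress upgrade to finish before performing
// the SQL update (stubbed out here with a print).
func (g *upgradeGate) runUpdate(stmt string) {
	g.mu.RLock()
	defer g.mu.RUnlock()
	fmt.Println("executing:", stmt)
}

func main() {
	var g upgradeGate
	var wg sync.WaitGroup

	g.beginUpgrade() // upgrade starts
	wg.Add(1)
	go func() {
		defer wg.Done()
		// Waits here if the upgrade is still in progress.
		g.runUpdate("UPSERT INTO fingerprints VALUES (1)")
	}()
	g.endUpgrade() // upgrade finished; any pending update can proceed
	wg.Wait()
}
```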
renatolabs added a commit to renatolabs/cockroach that referenced this pull request Sep 29, 2022
Development

Successfully merging this pull request may close these issues.

roachtest: acceptance/version-upgrade is flaky
5 participants