teamcity: failed test: version-upgrade #38859
Comments
Two nodes crashed at the same time because of:
This is very likely because of #38568. @ajwerner and I looked through the code and found that we modify the stats delta for entries in `applyRaftCommandToBatch`, but then conditionally replace that delta with the command's `DeprecatedDelta` when that is what the proposer set. It seems possible that a node running an older version still proposes commands with the deprecated delta, in which case the below-Raft modifications are lost.
I looked into this more. The migration away from the legacy applied state didn't at first seem like it could still coincide with commands carrying the deprecated delta. It doesn't look like it's impossible though. If we look at the 2.1 release, we see that the `VersionRangeAppliedStateKey` migration can be active without `VersionMVCCNetworkStats`, so a 2.1 node can still propose commands that use the deprecated delta while the applied-state migration fires below Raft.
Interesting. Is the solution to transcribe the update in `MigrateToRangeAppliedStateKey` to the correct delta (or both)? If I understand you correctly, we are updating a delta that we then replace with the deprecated delta, which wasn't updated.

Another idea that I don't think we should do (unless we decide for some reason that it's worth doing on top of a real fix): we could also patch 2.1.8 to never use `VersionRangeAppliedStateKey` unless `VersionMVCCNetworkStats` is also active (i.e. we gate the former on the latter).

On a meta level, this sort of problem about never really being able to rid ourselves of migrations that depend on "persisted" data is annoying. I had a discussion with @danhhz the other day about making it part of the version upgrade bumps to make sure "old stuff" gets reliably flushed out (my stance then was cautious about whether it was worth it, but now I'm starting to come around to a yes). This could be a hook in the version bump that runs after we've verified that the version can be bumped but before actually doing so. For example, for the applied state key, each node would iterate over its replicas and wake up those that haven't migrated yet (so that they will), and it would keep doing this until it finds that nothing is left to do.

We can only do this sort of cleanup for past migrations which we know must be active based on the pre-bump cluster version, i.e. we trail by one release. That is, when bumping the version to, say, 20.1, we could make sure that all the migrations new in 19.1 (or before) are now always-on (so 20.1 itself would still carry the migrations, but 20.2 would not). We're going to have a similar dance to do with preemptive snapshots, which we want to phase out and whose leftovers we don't want to have to deal with afterwards.
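A minimal sketch of what such a per-node hook could look like. All of the types and method names here (`Store`, `Replica`, `Nudge`, etc.) are hypothetical stand-ins, not the actual CockroachDB APIs:

```go
package migratehooks

import (
	"context"
	"time"
)

// Replica and Store are stand-ins for the real storage types; every method
// name here is hypothetical.
type Replica interface {
	// UsesLegacyAppliedState reports whether this replica still uses the
	// pre-RangeAppliedState keys.
	UsesLegacyAppliedState() bool
	// Nudge wakes the replica so that it performs the migration on its next
	// Raft command.
	Nudge(ctx context.Context) error
}

type Store interface {
	Replicas() []Replica
}

// FlushOutAppliedStateMigration wakes up every replica that hasn't migrated
// yet and keeps polling until nothing is left to do. It is meant to run after
// we've verified that the version can be bumped, but before actually bumping
// it.
func FlushOutAppliedStateMigration(ctx context.Context, s Store) error {
	for {
		pending := 0
		for _, r := range s.Replicas() {
			if !r.UsesLegacyAppliedState() {
				continue
			}
			pending++
			if err := r.Nudge(ctx); err != nil {
				return err
			}
		}
		if pending == 0 {
			return nil // every replica has migrated
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(time.Second): // back off before polling again
		}
	}
}
```

The important property is idempotency: the hook can be re-run safely, and the version bump only proceeds once it reports nothing left to do.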
Effectively this means that the legacy code stays on master until the day after the 20.1 release (spring 2020), i.e. for around half a year after the 19.2 release date, which is acceptable. It seems we could speed that up (in this case and in general), though, by front-loading the migration into the 19.2 binary, at the expense of a bigger investment. The idea is to "break up" the cluster version bump into two stages (see the sketch after this list):

1. bump to a 19.2-pre version, whose only purpose is to run the cleanup hooks that flush out the older migrations;
2. once all of those hooks have completed, bump to 19.2-post, the "real" 19.2 version and the only one from which later versions are reachable.
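A minimal sketch of how the two stages could be sequenced. The `Cluster` type and its function fields are placeholders for whatever mechanism actually sets the cluster version and runs the hooks, not real APIs:

```go
package upgrade

import "context"

// Cluster bundles the two operations the staged bump needs. Both fields are
// placeholders.
type Cluster struct {
	SetClusterVersion func(ctx context.Context, v string) error
	RunCleanupHooks   func(ctx context.Context) error
}

// BumpTo192 performs the hypothetical two-stage 19.2 bump: move to 19.2-pre,
// flush out the old migrations, and only then move to 19.2-post (the "real"
// 19.2, which is what later versions will require).
func (c *Cluster) BumpTo192(ctx context.Context) error {
	if err := c.SetClusterVersion(ctx, "19.2-pre"); err != nil {
		return err
	}
	if err := c.RunCleanupHooks(ctx); err != nil {
		// The cluster stays at 19.2-pre; the hooks are assumed idempotent,
		// so the bump can simply be retried.
		return err
	}
	return c.SetClusterVersion(ctx, "19.2-post")
}
```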
You can't migrate from 19.2-pre back to anything lower (duh), but you also can't go higher except through 19.2-post. For example, restarting a 19.2-pre node into 20.1 would be met with the same error as if it had initially been 19.1 (or 2.1).

What I don't like about all this (on top of it being complex) is that there's potential to get stuck in this in-between state because one of the hooks required to move forward doesn't fire. But that problem exists anyway, even if we spread it out as initially discussed, and if there's a bug in the hooks we'll need to cut a new patch release and users will need to run it.

There's also this other awkward problem that @nvanbenschoten has discovered a few times recently: when an X+1 node operates at X, we want it to be able to assume that all other nodes are operating at X as well (in fact, we assume that all the time, because it seems obvious). But imagine a node running the predecessor binary X-1 joining the cluster; it would run at X-1 until it receives the Gossip update telling it about X. This is maybe orthogonal to the discussion above, but it would also be good to simplify: we could verify before a cluster version bump (via an RPC to all nodes, or by picking it up from Gossip) that all nodes are running the latest version, similar to how the auto-upgrade mechanism does it. Unfortunately, as always, thanks to distributed-systems reasons this check has unavoidable false negatives, and the window for this bug is already quite small, so it's unclear whether it's worth it.
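As a rough illustration of that pre-bump check, assuming the per-node versions were already collected via an RPC fan-out or from gossip (the types here are made up):

```go
package upgrade

import "fmt"

// NodeVersions maps node ID to the cluster version that node reports itself
// as running.
type NodeVersions map[int32]string

// CheckAllRunning refuses a bump past want unless every node already reports
// want. As discussed above this is inherently racy: a node running the
// previous binary can join (or report a stale version) right after the
// check, so it only narrows the window rather than closing it.
func CheckAllRunning(vs NodeVersions, want string) error {
	for nodeID, v := range vs {
		if v != want {
			return fmt.Errorf("n%d reports version %s, want %s; refusing bump", nodeID, v, want)
		}
	}
	return nil
}
```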
The solution is to move the stats update made by the migration onto the delta that actually gets applied, i.e. account for the below-Raft changes in the deprecated delta as well (or in both).
This is a very real issue that we'll want to address sooner rather than later. I like your proposed solution of a pre- and post-version migration step. However, the challenge with these kinds of solutions is that they require an acknowledgment from every node in the cluster, so every node needs to be live. Is that an acceptable requirement?
Can we do something where any node that doesn't acknowledge within a certain timeframe has its RPC connections cut off, and any reconnects fail until it has done what it needs to do?
So we'd have to make a single change: manual version bumps fail when there are non-dead nodes that are down, or nodes that can't be reached for other reasons. Here "non-dead" means not dead in the sense of the web UI, i.e. either not decommissioned and down for less than TimeUntilStoreDead, or decommissioned but live, where we allow ourselves a liveness period or two to poll if necessary. Basically: "get your cluster green before you upgrade".

We can argue about whether decommissioning nodes are immediately excluded when they're down or whether the dead timeout applies to them too. The former gives users a convenient way to tell CRDB "trust me, this node is gone" without having to futz with the store-dead timeout, which I never want people touching because they'll forget to reset it. But it's also a potential footgun in this world of k8s surprisingly restarting your pods.

Automatic upgrades already don't happen when nodes are down, so maybe at the end of the day we're just pushing most of that logic down into the manual version bump as well.
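A rough sketch of the "cluster must be green" check, under the assumption that decommissioned nodes are excluded immediately when they're down (the first of the two options debated above). The `NodeStatus` type is illustrative, not the real liveness API:

```go
package upgrade

import (
	"fmt"
	"time"
)

// NodeStatus is an illustrative per-node summary of liveness and
// decommissioning state.
type NodeStatus struct {
	NodeID         int32
	Decommissioned bool
	Live           bool      // currently heartbeating its liveness record
	DownSince      time.Time // zero value if Live
}

// CheckClusterGreen refuses a manual version bump while any non-dead node is
// unreachable: a node blocks the bump if it is down, not decommissioned, and
// has been down for less than timeUntilStoreDead.
func CheckClusterGreen(nodes []NodeStatus, now time.Time, timeUntilStoreDead time.Duration) error {
	for _, n := range nodes {
		if n.Live {
			continue // reachable; nothing to wait for
		}
		if n.Decommissioned {
			// The operator has told us this node is gone; exclude it
			// immediately rather than waiting out the dead timeout.
			continue
		}
		if now.Sub(n.DownSince) < timeUntilStoreDead {
			return fmt.Errorf("n%d is down but not yet dead; refusing version bump", n.NodeID)
		}
		// Down longer than timeUntilStoreDead: dead in the web UI sense, so
		// it doesn't block the bump.
	}
	return nil
}
```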
Fixes cockroachdb#38859. The explanation of what is going wrong is in cockroachdb#38859 (comment) and the next comment. The problem was that we modify the stats delta for entries in `applyRaftCommandToBatch` [here](https://github.com/cockroachdb/cockroach/blob/5a382477a5c38f1718829f2b443783dd8b34e92b/pkg/storage/replica_application.go#L600) and [here](https://github.com/cockroachdb/cockroach/blob/5a382477a5c38f1718829f2b443783dd8b34e92b/pkg/storage/replica_application.go#L631). We then conditionally replace this stats delta [here](https://github.com/cockroachdb/cockroach/blob/5a382477a5c38f1718829f2b443783dd8b34e92b/pkg/storage/replica_application.go#L325) if the raft command was proposed with a `DeprecatedDelta` instead of a new `Delta` field. I'm not adding a unit test here because this is a glaring bug and we don't test either of these migrations at all anymore (since 78e1866). Our test suite is catching the issue, so I think that's sufficient. Release note: None
38976: storage: account for below Raft stats changes when using DeprecatedDelta r=ajwerner a=nvanbenschoten Fixes #38859. The explanation of what is going wrong is in #38859 (comment) and the next comment. The problem was that we modify the stats delta for entries in `applyRaftCommandToBatch` [here](https://github.com/cockroachdb/cockroach/blob/5a382477a5c38f1718829f2b443783dd8b34e92b/pkg/storage/replica_application.go#L600) and [here](https://github.com/cockroachdb/cockroach/blob/5a382477a5c38f1718829f2b443783dd8b34e92b/pkg/storage/replica_application.go#L631). We then conditionally replace this stats delta [here](https://github.com/cockroachdb/cockroach/blob/5a382477a5c38f1718829f2b443783dd8b34e92b/pkg/storage/replica_application.go#L325) if the raft command was proposed with a `DeprecatedDelta` instead of a new `Delta` field. I'm not adding a unit test here because this is a glaring bug and we don't test either of these migrations at all anymore (since 78e1866). Our acceptance test suite is catching the issue, so I think that's sufficient. Release note: None Co-authored-by: Nathan VanBenschoten <[email protected]>
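To make the failure mode concrete, here is a stripped-down sketch of the accounting problem and the shape of the fix. The structs and field names only mirror the discussion above (`Delta` vs. `DeprecatedDelta`); none of this is the actual replica application code:

```go
package statsdelta

// MVCCStats is a simplified stand-in for the real stats struct; only enough
// fields to illustrate the accounting are included.
type MVCCStats struct {
	SysBytes int64
	SysCount int64
}

func (ms *MVCCStats) Add(o MVCCStats) {
	ms.SysBytes += o.SysBytes
	ms.SysCount += o.SysCount
}

// Command mimics the shape of the problem: commands proposed by old-version
// nodes carry DeprecatedDelta, newer ones carry Delta.
type Command struct {
	Delta           MVCCStats
	DeprecatedDelta *MVCCStats
}

// addBelowRaftAdjustment records a stats change computed below Raft (e.g. by
// the applied-state migration). The bug was equivalent to always adding the
// adjustment to cmd.Delta, which deltaToApply then discards for commands
// carrying DeprecatedDelta; the fix is to account for it in the delta that
// will actually be applied.
func addBelowRaftAdjustment(cmd *Command, adj MVCCStats) {
	if cmd.DeprecatedDelta != nil {
		cmd.DeprecatedDelta.Add(adj)
		return
	}
	cmd.Delta.Add(adj)
}

// deltaToApply returns the stats delta that replica application adds to the
// range's stats: the deprecated field wins when the proposer set it.
func deltaToApply(cmd *Command) MVCCStats {
	if cmd.DeprecatedDelta != nil {
		return *cmd.DeprecatedDelta
	}
	return cmd.Delta
}
```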
The following tests appear to have failed on master (roachtest): acceptance/version-upgrade
You may want to check for open issues.
#1387087:
Please assign, take a look and update the issue accordingly.