teamcity: failed test: version-upgrade #38859

Closed
cockroach-teamcity opened this issue Jul 12, 2019 · 6 comments · Fixed by #38976
@cockroach-teamcity (Member) commented Jul 12, 2019:

The following tests appear to have failed on master (roachtest): acceptance/version-upgrade

You may want to check for open issues.

#1387087:

acceptance/version-upgrade
--- FAIL: roachtest/acceptance/version-upgrade (86.702s)
test artifacts and logs in: artifacts/acceptance/version-upgrade/run_1
	cluster.go:1233,context.go:135,cluster.go:1222,test_runner.go:668,test_runner.go:679: dead node detection: /go/src/github.com/cockroachdb/cockroach/bin/roachprod monitor local --oneshot --ignore-empty-nodes: exit status 1 4: skipped
		1: dead
		2: dead
		3: 30067
		Error:  1: dead, 2: dead

Please assign, take a look and update the issue accordingly.

cockroach-teamcity added this to the 19.2 milestone on Jul 12, 2019
cockroach-teamcity added the C-test-failure (Broken test, automatically or manually discovered) and O-robot (Originated from a bot) labels on Jul 12, 2019
nvanbenschoten self-assigned this on Jul 17, 2019
@nvanbenschoten (Member) commented:

Two nodes crashed at the same time because of:

E190712 22:04:12.102456 205 storage/replica.go:938  [n1,s1,r105/1:/Table/14{0-1}] on-disk and in-memory state diverged:
[Stats.LastUpdateNanos: 1562968977958541016 != 1562968978475248508 Stats.SysBytes: 918 != 556 Stats.SysCount: 10 != 9]
F190712 22:04:12.102538 205 storage/replica.go:940  [n1,s1,r105/1:/Table/14{0-1}] on-disk and in-memory state diverged: [Stats.LastUpdateNanos: 1562968977958541016 != 1562968978475248508 Stats.SysBytes: 918 != 556 Stats.SysCount: 10 != 9]
goroutine 205 [running]:
github.com/cockroachdb/cockroach/pkg/util/log.getStacks(0xc00043a101, 0xc00043a1e0, 0x0, 0x9c)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/log/clog.go:1016 +0xb1
github.com/cockroachdb/cockroach/pkg/util/log.(*loggingT).outputLogEntry(0x6849180, 0xc000000004, 0x6040a99, 0x12, 0x3ac, 0xc007002e40, 0xba)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/log/clog.go:874 +0x92b
github.com/cockroachdb/cockroach/pkg/util/log.addStructured(0x42d4740, 0xc007b799e0, 0xc000000004, 0x2, 0x0, 0x0, 0xc007d90868, 0x1, 0x1)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/log/structured.go:66 +0x2cc
github.com/cockroachdb/cockroach/pkg/util/log.logDepth(0x42d4740, 0xc007b799e0, 0x1, 0xc000000004, 0x0, 0x0, 0xc007d90868, 0x1, 0x1)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/log/log.go:69 +0x8c
github.com/cockroachdb/cockroach/pkg/util/log.Fatal(...)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/log/log.go:189
github.com/cockroachdb/cockroach/pkg/storage.(*Replica).assertStateLocked(0xc0006e6600, 0x42d4740, 0xc007b799e0, 0x42fadc0, 0xc0000b7980)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/replica.go:940 +0x6f7
github.com/cockroachdb/cockroach/pkg/storage.(*Replica).handleRaftCommandResult(0xc0006e6600, 0x42d4740, 0xc007b799e0, 0xc0038da008, 0xc0006e0001, 0x15b0c8126978154b, 0x0, 0x0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/replica_application_result.go:180 +0x5ab
github.com/cockroachdb/cockroach/pkg/storage.(*Replica).applyCmdAppBatch(0xc0006e6600, 0x42d4740, 0xc007b799e0, 0xc0036b2d80, 0xc007d915a0, 0x0, 0x0, 0x0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/replica_application.go:837 +0x8a8
github.com/cockroachdb/cockroach/pkg/storage.(*Replica).handleCommittedEntriesRaftMuLocked(0xc0006e6600, 0x42d4740, 0xc007b799e0, 0xc007247180, 0x3, 0x3, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/replica_application.go:141 +0x2b5
github.com/cockroachdb/cockroach/pkg/storage.(*Replica).handleRaftReadyRaftMuLocked(0xc0006e6600, 0x42d4740, 0xc007b799e0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/replica_raft.go:703 +0xb0d

This is very likely because of #38568. @ajwerner and I looked through the code and found that we modify the stats delta for entries in `applyRaftCommandToBatch` (pkg/storage/replica_application.go, around lines 600 and 631), and we then conditionally replace this stats delta (around line 325) if the raft command was proposed with a `DeprecatedDelta` instead of a new `Delta` field.

It seems possible that a node running master would apply an entry that both performed the applied state key migration and used the DeprecatedDelta field, which would lead to a panic like this. I still need to confirm that this is all possible given the versions that these migrations landed in.
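For illustration, here is a minimal, self-contained Go sketch of the ordering problem described above. The structs and numbers are illustrative stand-ins (not the real storage types); only the `Delta`/`DeprecatedDelta` field names mirror the actual proto fields, and the 362-byte / 1-key adjustment echoes the divergence in the log above.

```go
package main

import "fmt"

// MVCCStats is a stand-in for enginepb.MVCCStats, reduced to the fields
// that diverged in the crash above.
type MVCCStats struct {
	SysBytes, SysCount int64
}

// replicatedEvalResult is an illustrative stand-in for the result attached
// to a Raft command; only the Delta/DeprecatedDelta split mirrors reality.
type replicatedEvalResult struct {
	Delta           MVCCStats  // populated by newer proposers
	DeprecatedDelta *MVCCStats // populated only by old proposers
}

// applyCommand sketches the buggy ordering: below-Raft work (such as the
// applied state key migration) first adjusts Delta, and only afterwards is
// Delta replaced wholesale when the proposal carried DeprecatedDelta,
// silently discarding the adjustment.
func applyCommand(res *replicatedEvalResult) MVCCStats {
	// Step 1: the applied state key migration deletes legacy keys below
	// Raft and records the effect in the delta.
	res.Delta.SysBytes -= 362
	res.Delta.SysCount--

	// Step 2: legacy-proposal handling then overwrites Delta, so the
	// on-disk write from step 1 is no longer reflected in the in-memory
	// stats -- exactly the divergence the assertion caught.
	if res.DeprecatedDelta != nil {
		res.Delta = *res.DeprecatedDelta
	}
	return res.Delta
}

func main() {
	legacy := MVCCStats{SysBytes: 918, SysCount: 10}
	res := &replicatedEvalResult{DeprecatedDelta: &legacy}
	fmt.Printf("applied delta: %+v\n", applyCommand(res)) // migration's effect is gone
}
```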

@nvanbenschoten (Member) commented:

> It seems possible that a node running master would apply an entry that both performed the applied state key migration and used the DeprecatedDelta field, which would lead to a panic like this. I still need to confirm that this is all possible given the versions that these migrations landed in.

I looked into this more. The migration from the DeprecatedDelta field to the Delta field took place in v2.0. The migration to the RangeAppliedStateKey took place in v2.1. So it seems unlikely that a Raft command using a DeprecatedDelta would also have the UsingAppliedStateKey flag set to true.

It doesn't look like it's impossible though. If we look at the 2.1 release, we see that the VersionMVCCNetworkStats version check comes before (by 30 lines of code!) the VersionRangeAppliedStateKey version check. It seems highly unlikely, but possible, that a 2.1 node would not know that VersionMVCCNetworkStats was active when that was checked but would then know that VersionRangeAppliedStateKey was active when that was checked because it got a gossip update that informed it of the new cluster version in between. We didn't have #20587 at the time, so it's hard to make any firm guarantees about what would prevent that back then.
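As a rough illustration of that window (purely hypothetical code, not the real cluster-settings API): two separate `IsActive`-style checks read the node's view of the active version independently, so a gossip-driven bump landing between them can make the later check pass even though the earlier one already failed.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// Illustrative version ordinals; in the real 2.1 code these are cluster
// versions checked roughly 30 lines apart.
const (
	versionMVCCNetworkStats     = 1
	versionRangeAppliedStateKey = 2
)

// activeVersion stands in for the node's view of the cluster version,
// which is bumped asynchronously by a gossip callback.
var activeVersion atomic.Int32

func isActive(v int32) bool { return activeVersion.Load() >= v }

func main() {
	activeVersion.Store(0) // node has not yet heard about either version

	// First check: the proposer does not know VersionMVCCNetworkStats is
	// active, so it falls back to the DeprecatedDelta field.
	useDeprecatedDelta := !isActive(versionMVCCNetworkStats)

	// A gossip update arrives between the two checks and bumps the version.
	activeVersion.Store(versionRangeAppliedStateKey)

	// Second check: the same command now also decides to perform the
	// applied state key migration.
	useAppliedStateKey := isActive(versionRangeAppliedStateKey)

	// The problematic combination that the crash requires.
	fmt.Println(useDeprecatedDelta, useAppliedStateKey) // true true
}
```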

@tbg (Member) commented Jul 18, 2019:

> It seems highly unlikely, but possible, that a 2.1 node would not know that VersionMVCCNetworkStats was active when that was checked but would then know that VersionRangeAppliedStateKey was active when that was checked because it got a gossip update that informed it of the new cluster version in between.

Interesting. Is the solution to transcribe the update in MigrateToRangeAppliedStateKey to the correct delta (or both)? If I understand you correctly, we are updating a delta that we are then replacing with the deprecated delta, which wasn't updated.

Another idea that I don't think we should do (unless we decide for some reason that it's worth doing on top of a real fix): we could also patch 2.1.8 to never use VersionRangeAppliedStateKey unless VersionMVCCNetworkStats is also active (i.e. we && the latter into the use of the former). This wouldn't solve the problem (users can migrate from 2.1.x into 19.1.x directly), but it would effectively "solve" it in our testing.

On a meta level, this sort of problem about never ever really being able to rid ourselves of migrations that depend on "persisted" data is annoying. I had a discussion with @danhhz the other day about making it part of the version upgrade bumps to make sure "old stuff" got reliably flushed out (my stance then was cautious about whether it was worth it, but now I'm starting to come around to a yes).

This could be a hook in the version bump after we've verified that the version can be bumped but before actually doing so. For example, for the applied state key, each node would iterate over its replicas and wake up those that haven't migrated yet (so that they will). It would keep doing this appropriately until it finds that nothing is left to do.

We can only do this sort of cleanup for past migrations which we know must be active based on the pre-bumped cluster version, i.e. we trail by one release. That is, when bumping the version to, say, 20.1, we could make sure that all the migrations new in 19.1 (or before) are now always-on (so 20.1 itself would still have the migrations, but not 20.2).
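A rough sketch of what such a pre-bump hook could look like, under entirely assumed interfaces (none of these names are real CockroachDB APIs): visit every replica, nudge the ones that have not migrated yet, and retry until nothing is left to do.

```go
package migrate

import (
	"context"
	"time"
)

// migratableReplica and migratableStore are hypothetical interfaces used
// only for this sketch.
type migratableReplica interface {
	HasMigrated() bool
	WakeForMigration(context.Context) // e.g. force a no-op write through Raft
}

type migratableStore interface {
	VisitReplicas(func(migratableReplica))
}

// runPreBumpHook blocks until every replica on the store has completed the
// migration, at which point the node can acknowledge the version bump.
func runPreBumpHook(ctx context.Context, s migratableStore) error {
	for {
		remaining := 0
		s.VisitReplicas(func(r migratableReplica) {
			if !r.HasMigrated() {
				remaining++
				r.WakeForMigration(ctx)
			}
		})
		if remaining == 0 {
			return nil
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(time.Second): // back off before re-checking
		}
	}
}
```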

We're going to have a similar dance to do with preemptive snapshots, which we want to phase out and after which we don't want to deal with replicaID==0 any more. Alas, preemptive snapshots are replicas on disk and unless something either catches them up or GCs them, they will stick around forever. So in 19.2, we effectively don't want to use them any more, but we want to get rid of them and can't yet. So we spread out the process:

  • 19.2 uses preemptive snaps when it has to, or learners when it can
  • 20.1 understands preemptive snaps, uses only learners, and in addition there's a pre-bump hook that removes all vestiges of preemptive snaps (it includes a check that all peers are already using 19.2, see the paragraph at the bottom of this post)
  • 20.2 has no idea preemptive snaps ever existed

Effectively this means that the legacy code stays on master until the day after the 20.1 release (spring 2020), i.e. for around half a year after the 19.2 release date, which is acceptable.

It seems that we could speed that up (in this case and in general), though, by front-loading the migration into the 19.2 binary, at the expense of a bigger investment. The idea is to "break up" the cluster version bump into two stages:

  • bump to 19.2-pre: activates all the cluster versions for 19.2, i.e. learners are always-on
  • bump to 19.2-post: requires that the migration has erased preemptive snapshots from the wild (19.2-post can only be activated when all nodes have acked 19.2-pre).

You can't migrate from a 19.2-pre back to anything lower (duh), but you also can't go higher except through 19.2-post. For example, restarting a 19.2-pre node into 20.1 will be met with the same error as if it had initially been 19.1 (or 2.1). What I don't like about all this (on top of it being complex) is that there's potential to get stuck in this in-between state because one of the hooks required to move forward doesn't fire. But that problem exists anyway, even if we spread it out as initially discussed, and if there's a bug in the hooks we'll need to release a new patch release and users need to run that.
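As a toy illustration of the gating rules this implies (a hypothetical helper with versions as plain strings, not real cluster versions):

```go
package migrate

// canBump sketches the allowed transitions in the two-stage scheme: 19.2-post
// is reachable only once every node has acked 19.2-pre and the cleanup
// migration has finished, and 20.1 is reachable only from 19.2-post.
func canBump(from, to string, allNodesAckedPre, cleanupDone bool) bool {
	switch {
	case from == "19.1" && to == "19.2-pre":
		return true
	case from == "19.2-pre" && to == "19.2-post":
		return allNodesAckedPre && cleanupDone
	case from == "19.2-post" && to == "20.1":
		return true
	default:
		// In particular, 19.2-pre -> 20.1 and any downgrade are refused.
		return false
	}
}
```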

There's also this other awkward problem that @nvanbenschoten has discovered a few times recently, which is that when an X+1 node operates at X, we want it to be able to assume that all other nodes are operating at X as well (in fact, we assume that all the time, because it seems obvious). But imagine a node running the predecessor binary X-1 joining the cluster; it would run at X-1 until it receives the Gossip update telling it about X. This is maybe a problem orthogonal to that discussed above, but it would also be good to simplify this: we can verify before a cluster setting bump (via an RPC to all nodes or picking it up from Gossip) that all nodes are using the latest version, similar to how the auto-upgrade mechanism does it. Unfortunately, as always, thanks to distributed systems reasons this always has false negatives, and the window for this bug is already quite small so that it's unclear that it's worth it.

@nvanbenschoten (Member) commented:

> Is the solution to transcribe the update in MigrateToRangeAppliedStateKey to the correct delta (or both)?

The solution is to move the migration of the DeprecatedDelta up so that it happens before we start mutating Delta. I'm also going to add an assertion in this migration that what we're overwriting is empty.
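In terms of the illustrative stand-ins from the sketch earlier in this thread, the reordering looks roughly like this (a sketch only, not the actual patch in #38976):

```go
package main

import "fmt"

// Stand-ins as in the earlier sketch; only the Delta/DeprecatedDelta field
// names mirror the real proto fields.
type MVCCStats struct{ SysBytes, SysCount int64 }

type replicatedEvalResult struct {
	Delta           MVCCStats
	DeprecatedDelta *MVCCStats
}

// applyCommandFixed folds DeprecatedDelta into Delta up front, asserting
// that the value being replaced is still empty, and only then applies the
// below-Raft migration's adjustments to the delta that will actually be used.
func applyCommandFixed(res *replicatedEvalResult) MVCCStats {
	if res.DeprecatedDelta != nil {
		if res.Delta != (MVCCStats{}) {
			panic("Delta unexpectedly non-empty when DeprecatedDelta is set")
		}
		res.Delta = *res.DeprecatedDelta
		res.DeprecatedDelta = nil
	}

	// The applied state key migration's effect now lands on the delta that
	// is applied, so on-disk and in-memory stats stay in sync.
	res.Delta.SysBytes -= 362
	res.Delta.SysCount--
	return res.Delta
}

func main() {
	legacy := MVCCStats{SysBytes: 918, SysCount: 10}
	res := &replicatedEvalResult{DeprecatedDelta: &legacy}
	fmt.Printf("applied delta: %+v\n", applyCommandFixed(res)) // {SysBytes:556 SysCount:9}
}
```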

> This could be a hook in the version bump after we've verified that the version can be bumped but before actually doing so. For example, for the applied state key, each node would iterate over its replicas and wake up those that haven't migrated yet (so that they will). It would keep doing this appropriately until it finds that nothing is left to do.

This is a very real issue that we'll want to address sooner rather than later. I like your proposed solution of a pre and post version migration step. However, the challenge with these kinds of solutions is that they require an acknowledgment from every node in the cluster, so every node needs to be live. Is that an acceptable requirement?

@danhhz (Contributor) commented Jul 18, 2019:

> However, the challenge with these kinds of solutions is that they require an acknowledgment from every node in the cluster, so every node needs to be live.

Can we do something where any node that doesn't acknowledge within a certain timeframe has its rpc connections cut off and any reconnects fail until it's done what it needs to do?

@tbg (Member) commented Jul 18, 2019:

So we'd have to make a single change: manual version bumps fail when any non-dead node is down or can't be reached for other reasons.

Here "non-dead" means: not dead in the sense of the webui, i.e. either not decommissioned and down for < TimeUntilStoreDead; or decommissioned but live, where we allow ourselves a liveness period or two to poll if necessary. Basically "get your cluster green before you upgrade". We can argue about whether decommissioning nodes are immediately excluded when they're down or whether the dead timeout applies to them too. The former gives users a convenient way to tell CRDB "trust me, this node is gone" without having to futz with the store dead timeout, which I never want people touching because they'll forget to reset it. But it's also a potential footgun in this world of k8s surprisingly restarting your pods.

Automatic updates already don't happen when nodes are down, so maybe at the end of the day we're just pushing most of that logic down into SET CLUSTER SETTING and adding the pre-migration hook in.
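A rough sketch of that gate, again with hypothetical types (this is not the real liveness or settings code): refuse a manual version bump unless every node is either live or has been decommissioned by the operator, i.e. "get your cluster green before you upgrade".

```go
package migrate

import "fmt"

// nodeStatus is a hypothetical summary of a node's liveness as discussed
// above.
type nodeStatus struct {
	Live           bool
	Decommissioned bool
}

// checkVersionBumpAllowed refuses the bump if any node is down without
// having been decommissioned. Skipping decommissioned nodes unconditionally
// is the first of the two options mentioned above (trust the operator that
// the node is gone); applying the dead timeout to them as well would be the
// other.
func checkVersionBumpAllowed(nodes map[int]nodeStatus) error {
	for id, s := range nodes {
		if s.Decommissioned {
			continue // the operator has told us this node is gone
		}
		if !s.Live {
			return fmt.Errorf(
				"cannot bump cluster version: n%d is down and not decommissioned", id)
		}
	}
	return nil
}
```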

nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Jul 18, 2019
Fixes cockroachdb#38859.

The explanation of what is going wrong is in cockroachdb#38859 (comment)
and the next comment.

The problem was that we modify the stats delta for entries in `applyRaftCommandToBatch`
[here](https://github.com/cockroachdb/cockroach/blob/5a382477a5c38f1718829f2b443783dd8b34e92b/pkg/storage/replica_application.go#L600)
and [here](https://github.com/cockroachdb/cockroach/blob/5a382477a5c38f1718829f2b443783dd8b34e92b/pkg/storage/replica_application.go#L631).
We then conditionally replace this stats delta [here](https://github.com/cockroachdb/cockroach/blob/5a382477a5c38f1718829f2b443783dd8b34e92b/pkg/storage/replica_application.go#L325)
if the raft command was proposed with a `DeprecatedDelta` instead of a new
`Delta` field.

I'm not adding a unit test here because this is a glaring bug and we don't test
either of these migrations at all anymore (since 78e1866). Our test suite is
catching the issue, so I think that's sufficient.

Release note: None
craig bot pushed a commit that referenced this issue Jul 19, 2019
38976: storage: account for below Raft stats changes when using DeprecatedDelta r=ajwerner a=nvanbenschoten

Fixes #38859.

The explanation of what is going wrong is in #38859 (comment) and the next comment.

The problem was that we modify the stats delta for entries in `applyRaftCommandToBatch` [here](https://github.com/cockroachdb/cockroach/blob/5a382477a5c38f1718829f2b443783dd8b34e92b/pkg/storage/replica_application.go#L600) and [here](https://github.com/cockroachdb/cockroach/blob/5a382477a5c38f1718829f2b443783dd8b34e92b/pkg/storage/replica_application.go#L631). We then conditionally replace this stats delta [here](https://github.com/cockroachdb/cockroach/blob/5a382477a5c38f1718829f2b443783dd8b34e92b/pkg/storage/replica_application.go#L325) if the raft command was proposed with a `DeprecatedDelta` instead of a new `Delta` field.

I'm not adding a unit test here because this is a glaring bug and we don't test either of these migrations at all anymore (since 78e1866). Our acceptance test suite is catching the issue, so I think that's sufficient.

Release note: None

Co-authored-by: Nathan VanBenschoten <[email protected]>
craig bot closed this as completed in #38976 on Jul 19, 2019