roachtest: version/mixed/nodes=5 failed #72029
Node 3 died while node 2 was upgrading from version
It looks like what happened here is that node 3 is trying to add a learner replica of
This seems to be related to @irfansharif recently merging #70464, which removed the `using_applied_state_key` field. @irfansharif, any thoughts on how to handle this discrepancy?
roachtest.version/mixed/nodes=5 failed with artifacts on master @ 40bd241158c8dd4a68d277fa732b91f56e411b38:
Nice sleuthing, Alex! @irfansharif I think something must have gone wrong here, could you look into this?
cockroach/pkg/cmd/roachtest/tests/predecessor_version.go Lines 38 to 39 in f8cfdae
In the crash, we see that the active cluster version was 21.1-1164, which is older than the 21.2 final version (the snippet below is from 21.2-beta.2): cockroach/pkg/clusterversion/cockroach_versions.go Lines 474 to 498 in fef2a3a
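To make that ordering concrete, here is a minimal, self-contained sketch (hypothetical types, not cockroach's actual `roachpb.Version` API) of how an intermediate version like 21.1-1164 compares against the minted 21.2 final version:

```go
package main

import "fmt"

// version mirrors the Major.Minor-Internal shape of cluster versions
// such as "21.1-1164"; a non-zero Internal marks an intermediate
// (unreleased) version between two final releases.
type version struct {
	Major, Minor, Internal int32
}

// less is lexicographic comparison on (Major, Minor, Internal).
func (v version) less(o version) bool {
	if v.Major != o.Major {
		return v.Major < o.Major
	}
	if v.Minor != o.Minor {
		return v.Minor < o.Minor
	}
	return v.Internal < o.Internal
}

func main() {
	active := version{Major: 21, Minor: 1, Internal: 1164} // from the crash
	final := version{Major: 21, Minor: 2}                  // minted 21.2 final
	fmt.Println(active.less(final))                        // true: the fixture predates the final version
}
```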
Ok, so we probably created that fixture with one of the earlier 21.2 candidates, before we had minted the final cluster version. However, none of this should matter, because the migration that phased out the legacy applied state is way older: cockroach/pkg/clusterversion/cockroach_versions.go Lines 333 to 339 in fef2a3a
The fixture should certainly not contain any replica that isn't using the range applied state. Is the problem here that we needed to "migrate out of the assertion"? 21.2 nodes have the assertion, so they require the field to always be set to "true". But 22.1 nodes have forgotten about the field and will never set it. Isn't that a problem?
Concretely, in the instance Alex looked at, if n3 is receiving a snapshot, that snapshot will certainly be using the applied state. The stateloader (on n3, the old binary) will thus hit line 79 here: cockroach/pkg/kv/kvserver/stateloader/stateloader.go Lines 76 to 79 in a6e4b3e
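For illustration, a hedged sketch (hypothetical names, not the real stateloader API) of the branch that matters on the old binary: since the 21.1 migration guarantees the `RangeAppliedState` key exists on disk, the loaded state always comes back with the field set to true.

```go
package main

import "fmt"

// replicaState is a stand-in for the relevant slice of kvserverpb.ReplicaState.
type replicaState struct {
	UsingAppliedStateKey bool
}

// loadState is a stand-in for the stateloader on the old binary: if the
// RangeAppliedState key is present on disk (which the 21.1 migration
// guarantees), the loaded state reports the field as true.
func loadState(hasRangeAppliedStateKey bool) replicaState {
	return replicaState{UsingAppliedStateKey: hasRangeAppliedStateKey}
}

func main() {
	onDisk := loadState(true) // the incoming snapshot certainly uses the applied state
	fmt.Printf("on-disk state: %+v\n", onDisk)
}
```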
But the in-memory state is populated (again, on the old-binary n3) here: cockroach/pkg/kv/kvserver/replica_raftstorage.go Lines 1010 to 1014 in 793b4c8
where the `ReplicaState` from the incoming snapshot overwrites the in-memory state, including the `using_applied_state_key` field.
On the sender side (the new binary), the field is no longer populated, so it arrives as false. The lesson here is that we shouldn't use the protobuf copy of the replica state when applying a snapshot. I'll file an issue for the long-term fix.
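Putting both halves together, a minimal sketch (assumed shape, not the actual assertion code) of the divergence that killed n3: the on-disk state loaded by the old binary says true, while the in-memory state populated from the new binary's snapshot proto decodes to false.

```go
package main

import (
	"fmt"
	"reflect"
)

type replicaState struct {
	UsingAppliedStateKey bool
}

func main() {
	onDisk := replicaState{UsingAppliedStateKey: true} // stateloader, old binary (n3)
	// The new-binary sender no longer sets the proto field, so the old
	// binary decodes the proto3 zero value: false.
	inMem := replicaState{}
	if !reflect.DeepEqual(onDisk, inMem) {
		// This mismatch is the flavor of fatal error seen in the crash.
		fmt.Printf("state divergence: on-disk %+v vs in-memory %+v\n", onDisk, inMem)
	}
}
```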
Just sanity checking, you mean the
Also, thanks for all the legwork! Nice to have friends in Europe; I can wake up to root-caused issues I introduced the evening before.
72132: coldata: fix BenchmarkAppend so that it could be run with multiple count r=yuzefovich a=yuzefovich

Previously, we could get an index out of bounds because we'd allocate a larger `Bytes.data` slice than the `int32` offset can support.

Release note: None

72239: kvserver: resurrect using_applied_state_key in ReplicaState r=irfansharif a=irfansharif

Fixes #72029. `using_applied_state_key` was previously used to check whether the Range had upgraded to begin using the `RangeAppliedState` key. When set to true, replicas receiving the state (through a `ReplicatedEvalResult`) knew to begin using the new key.

In #58088 (in 21.1) we introduced a migration to iterate through all ranges in the system and have them start using the `RangeAppliedState` key. In 21.2, this field was always set to true -- 21.2 nodes were already using the `RangeAppliedState` key, or receiving messages from 21.1 nodes that were also using the `RangeAppliedState` key. In 21.2 (and in 21.1, for that matter) we didn't need to trigger the "if set to true in an incoming message, start using the `RangeAppliedState` key" code path.

When looking to get rid of this field in 22.1 (#70464), we observed that there was an unintentional read of this field in 21.2 nodes (see #72029 and #72222); the saga is as follows:

- Removing this field in 22.1 meant it was set as false when received at 21.2 nodes.
- This should've been fine! We weren't using this field to trigger any upgrade code paths (like it was originally intended for).
- It turns out that in 21.2 we were using the `ReplicaState` from the incoming snapshot to update our in-memory replica state.
- Because the proto field was being phased out, there was now a divergence between the on-disk state (field set to true, from earlier 21.2 operations) and the in-memory state (field set to false, because it was sent from a version that attempted to get rid of this field).

Removing proto fields from the replica state is not possible until we stop using the protobuf copy of the replica state when applying a snapshot (#72222). Once that's done, we should be able to stop sending the replica state as part of the snapshot in the subsequent release.

Release note: None

Co-authored-by: Yahor Yuzefovich <[email protected]>
Co-authored-by: irfan sharif <[email protected]>
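For intuition, a hedged sketch (hypothetical helper, not the actual patch) of the fix direction in #72239: keep populating the field as true on the sender side, so old binaries never decode a spurious false, until the protobuf copy of the replica state is no longer consulted (#72222).

```go
package main

import "fmt"

type replicaState struct {
	UsingAppliedStateKey bool
}

// outgoingSnapshotState is a stand-in for where the sender builds the
// ReplicaState it ships with a snapshot. Resurrecting the field means
// always setting it to true: every supported version has completed the
// RangeAppliedState migration, so true is always correct.
func outgoingSnapshotState() replicaState {
	return replicaState{UsingAppliedStateKey: true}
}

func main() {
	fmt.Printf("sent state: %+v\n", outgoingSnapshotState())
}
```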
roachtest.version/mixed/nodes=5 failed with artifacts on master @ 965b99b1a05eb893ee908d852680824a0bd4ffe3:
Help
See: roachtest README
See: How To Investigate (internal)