-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
roachtest: tpcc/mixed-headroom/n5cpu16 failed #53535
Comments
error in newOrder: missing stock row this sounds concerning and it's blocking the v20.2 beta. Can you dig into this on Monday, @andreimatei? |
I've ran 20 iterations and got multiple problems:
|
@andreimatei I think there's a chance that this is due to the same underlying issue as #53540. For that issue, we're narrowing in on an optimization to Pebble as being the culprit. I just kicked off a series of tests with and without that optimization on The other thing this reminds me of is some of the issues we've had with the propagation of read within uncertainty errors. Given that this test is using a mixed-version cluster, I wonder if we're hitting something like that. |
It's looking like 95a13a8 is the issue here as well. Over the course of 12 runs, I saw no failures without that commit but 2x |
My batch of runs on master (including #54032) all passed except for one stuck AdminSplit (being investigated by @andreimatei elsewhere). So that's more indication that 95a13a8 was the issue. cc. @petermattis. By the way, @andreimatei the stuck split looked like:
Just a shot in the dark, but I wonder if this is related to #50408. Or maybe a recent change to the RangeDescriptor proto? Does that ring a bell? |
Yeah, that was my initial thinking too because we did remove a field from the proto, but I think it didn't add up because it was a new-version node where the request was looping, not an old-version node. I'm getting back to this now. |
Focusing on the stuck splits now:
|
If a descriptor on the replicas actually has it set, but a 20.2 node is the one loading the descriptor, it will lose that bit, meaning that any attempts at CPut using the marshalled repr of that proto will fail. This is a problem we've already solved in other places, essentially by holding on to the actual KV bytes. But we're not doing that here yet, note how we're using cockroach/pkg/kv/kvserver/replica_command.go Line 187 in ed34965
|
Given where we're at in the cycle, I say - bring back that field, make it clear that is unused, and mention that we can't remove it until all code paths, and in particular split/unsplit have moved off the pattern of marshalling a |
We do not re-marshal the proto as the base of the CPut (and we weren't in 20.1). The code reads the bytes from the DB again for that here: cockroach/pkg/kv/kvserver/replica_command.go Line 175 in 79c01d2
|
Oops, right, I stopped reading too early. Ok, where does the ConditionFailedError originate? Is it cockroach/pkg/kv/kvserver/replica_command.go Line 2031 in ed34965
or from the actual |
We dropped this field recently, but unfortunately that wasn't safe for mixed-version clusters. The rub is that 20.1 nodes need to roundtrip the proto through 20.2 nodes in a fairly subtle way. When it comes back to the 20.1 node, the descriptor needs to compare Equal() to the original. We configure our protos to not preserve unrecognized fields, so removing the field breaks this round-tripping. Specifically, the scenario which broke is the following: 1. A 20.1 node processes an AdminSplit, and performs the tranction writing the new descriptors. The descriptors have the GenerationCompable field set. 2. The lease changes while the respective txn is running. The lease moves to a 20.2 node. 3. The 20.2 node evaluates the EndTxn, and computes the split trigger info that's going to be replicated. The EndTxn has split info in it containing the field set, but the field is dropped when converting that into the proposed SplitTrigger (since the 20.2 unmarshalls and re-marshalls the descriptors). 4. All the 20.1 replicas of the ranges involved now apply the respective trigger via Raft, and their in-memory state doesn't have the field set. This doesn't match the bytes written in the database, which have the field. 5. The discrepancy between the in-memory state and the db state is a problem, as it causes the 20.1 node to spin if it tries to perform subsequent merge/split operations. The reason is that the code performing these operations short-circuits itself if it detects that the descriptor has changed while the operation was running. This detection is performed via the generated Equals() method, and it mis-fires because of the phantom field. That detection happens here: https://github.com/cockroachdb/cockroach/blob/79c01d28da9c379f67bb41beef3d85ad3bee1da1/pkg/kv/kvserver/replica_command.go#L1957 This patch takes precautions so that we can remove the field again in 21.1 - I'm merging this in 21.1, I'll backport it to 20.2, and then I'll come back to 20.2 and remove the field. Fixes cockroachdb#53535 Release note: None
We dropped this field recently, but unfortunately that wasn't safe for mixed-version clusters. The rub is that 20.1 nodes need to roundtrip the proto through 20.2 nodes in a fairly subtle way. When it comes back to the 20.1 node, the descriptor needs to compare Equal() to the original. We configure our protos to not preserve unrecognized fields, so removing the field breaks this round-tripping. Specifically, the scenario which broke is the following: 1. A 20.1 node processes an AdminSplit, and performs the tranction writing the new descriptors. The descriptors have the GenerationCompable field set. 2. The lease changes while the respective txn is running. The lease moves to a 20.2 node. 3. The 20.2 node evaluates the EndTxn, and computes the split trigger info that's going to be replicated. The EndTxn has split info in it containing the field set, but the field is dropped when converting that into the proposed SplitTrigger (since the 20.2 unmarshalls and re-marshalls the descriptors). 4. All the 20.1 replicas of the ranges involved now apply the respective trigger via Raft, and their in-memory state doesn't have the field set. This doesn't match the bytes written in the database, which have the field. 5. The discrepancy between the in-memory state and the db state is a problem, as it causes the 20.1 node to spin if it tries to perform subsequent merge/split operations. The reason is that the code performing these operations short-circuits itself if it detects that the descriptor has changed while the operation was running. This detection is performed via the generated Equals() method, and it mis-fires because of the phantom field. That detection happens here: https://github.com/cockroachdb/cockroach/blob/79c01d28da9c379f67bb41beef3d85ad3bee1da1/pkg/kv/kvserver/replica_command.go#L1957 This patch takes precautions so that we can remove the field again in 21.1 - I'm merging this in 21.1, I'll backport it to 20.2, and then I'll come back to 20.2 and remove the field. Namely, the patch changes RangeDesc.Equal() to ignore that field (and the method is no longer generated). Fixes cockroachdb#53535 Release note: None
We dropped this field recently, but unfortunately that wasn't safe for mixed-version clusters. The rub is that 20.1 nodes need to roundtrip the proto through 20.2 nodes in a fairly subtle way. When it comes back to the 20.1 node, the descriptor needs to compare Equal() to the original. We configure our protos to not preserve unrecognized fields, so removing the field breaks this round-tripping. Specifically, the scenario which broke is the following: 1. A 20.1 node processes an AdminSplit, and performs the tranction writing the new descriptors. The descriptors have the GenerationCompable field set. 2. The lease changes while the respective txn is running. The lease moves to a 20.2 node. 3. The 20.2 node evaluates the EndTxn, and computes the split trigger info that's going to be replicated. The EndTxn has split info in it containing the field set, but the field is dropped when converting that into the proposed SplitTrigger (since the 20.2 unmarshalls and re-marshalls the descriptors). 4. All the 20.1 replicas of the ranges involved now apply the respective trigger via Raft, and their in-memory state doesn't have the field set. This doesn't match the bytes written in the database, which have the field. 5. The discrepancy between the in-memory state and the db state is a problem, as it causes the 20.1 node to spin if it tries to perform subsequent merge/split operations. The reason is that the code performing these operations short-circuits itself if it detects that the descriptor has changed while the operation was running. This detection is performed via the generated Equals() method, and it mis-fires because of the phantom field. That detection happens here: https://github.com/cockroachdb/cockroach/blob/79c01d28da9c379f67bb41beef3d85ad3bee1da1/pkg/kv/kvserver/replica_command.go#L1957 This patch takes precautions so that we can remove the field again in 21.1 - I'm merging this in 21.1, I'll backport it to 20.2, and then I'll come back to 20.2 and remove the field. Namely, the patch changes RangeDesc.Equal() to ignore that field (and the method is no longer generated). Fixes cockroachdb#53535 Release note: None
54199: kvserver: reintroduce RangeDesc.GenerationComparable r=andreimatei a=andreimatei We dropped this field recently, but unfortunately that wasn't safe for mixed-version clusters. The rub is that 20.1 nodes need to roundtrip the proto through 20.2 nodes in a fairly subtle way. When it comes back to the 20.1 node, the descriptor needs to compare Equal() to the original. We configure our protos to not preserve unrecognized fields, so removing the field breaks this round-tripping. Specifically, the scenario which broke is the following: 1. A 20.1 node processes an AdminSplit, and performs the tranction writing the new descriptors. The descriptors have the GenerationCompable field set. 2. The lease changes while the respective txn is running. The lease moves to a 20.2 node. 3. The 20.2 node evaluates the EndTxn, and computes the split trigger info that's going to be replicated. The EndTxn has split info in it containing the field set, but the field is dropped when converting that into the proposed SplitTrigger (since the 20.2 unmarshalls and re-marshalls the descriptors). 4. All the 20.1 replicas of the ranges involved now apply the respective trigger via Raft, and their in-memory state doesn't have the field set. This doesn't match the bytes written in the database, which have the field. 5. The discrepancy between the in-memory state and the db state is a problem, as it causes the 20.1 node to spin if it tries to perform subsequent merge/split operations. The reason is that the code performing these operations short-circuits itself if it detects that the descriptor has changed while the operation was running. This detection is performed via the generated Equals() method, and it mis-fires because of the phantom field. That detection happens here: https://github.com/cockroachdb/cockroach/blob/79c01d28da9c379f67bb41beef3d85ad3bee1da1/pkg/kv/kvserver/replica_command.go#L1957 This patch takes precautions so that we can remove the field again in 21.1 - I'm merging this in 21.1, I'll backport it to 20.2, and then I'll come back to 20.2 and remove the field. Fixes #53535 Release note: None Co-authored-by: Andrei Matei <[email protected]>
We dropped this field recently, but unfortunately that wasn't safe for mixed-version clusters. The rub is that 20.1 nodes need to roundtrip the proto through 20.2 nodes in a fairly subtle way. When it comes back to the 20.1 node, the descriptor needs to compare Equal() to the original. We configure our protos to not preserve unrecognized fields, so removing the field breaks this round-tripping. Specifically, the scenario which broke is the following: 1. A 20.1 node processes an AdminSplit, and performs the tranction writing the new descriptors. The descriptors have the GenerationCompable field set. 2. The lease changes while the respective txn is running. The lease moves to a 20.2 node. 3. The 20.2 node evaluates the EndTxn, and computes the split trigger info that's going to be replicated. The EndTxn has split info in it containing the field set, but the field is dropped when converting that into the proposed SplitTrigger (since the 20.2 unmarshalls and re-marshalls the descriptors). 4. All the 20.1 replicas of the ranges involved now apply the respective trigger via Raft, and their in-memory state doesn't have the field set. This doesn't match the bytes written in the database, which have the field. 5. The discrepancy between the in-memory state and the db state is a problem, as it causes the 20.1 node to spin if it tries to perform subsequent merge/split operations. The reason is that the code performing these operations short-circuits itself if it detects that the descriptor has changed while the operation was running. This detection is performed via the generated Equals() method, and it mis-fires because of the phantom field. That detection happens here: https://github.com/cockroachdb/cockroach/blob/79c01d28da9c379f67bb41beef3d85ad3bee1da1/pkg/kv/kvserver/replica_command.go#L1957 This patch takes precautions so that we can remove the field again in 21.1 - I'm merging this in 21.1, I'll backport it to 20.2, and then I'll come back to 20.2 and remove the field. Namely, the patch changes RangeDesc.Equal() to ignore that field (and the method is no longer generated). Fixes #53535 Release note: None
(roachtest).tpcc/mixed-headroom/n5cpu16 failed on provisional_202008261913_v20.2.0-beta.1@eaa939ce6548a54a23970814ff00f30ad87680ac:
More
Artifacts: /tpcc/mixed-headroom/n5cpu16
Related:
See this test on roachdash
powered by pkg/cmd/internal/issues
The text was updated successfully, but these errors were encountered: