storage: dynamically adjusted replica count heuristic is wonky #34122
Comments
Also easy to reproduce on a 9-node cluster with a replication factor of 9. Wait for full replication. Stop the cluster and restart nodes 1-6. A handful of ranges will appear unavailable. I didn't notice unavailable ranges on the 15-node cluster; it's possible I just missed them.
Here is what is going on in the 9-node cluster case. When all 9 nodes are up, a range has 9 replicas and everything is copacetic. When the cluster is taken down and restarted with 6 nodes, the store pool only sees 6 available nodes, so the target replication factor drops to 5 and the allocator starts removing replicas. This down-replication sounds reasonable on the surface, but it is fraught because there is no traffic on the cluster. But I think we shouldn't be down-replicating in the first place. @tbg, @bdarnell, @andreimatei Thoughts on the above? Am I missing anything about how this is supposed to be working?
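To make the hazard concrete, here is a small self-contained sketch (plain arithmetic, not CockroachDB code) of what happens if the allocator removes live replicas while three of the nine sit on down nodes: a couple of removals are enough to drop below quorum.

```go
package main

import "fmt"

func main() {
	replicas := 9 // range membership before the restart
	live := 6     // nodes 1-6 restarted; the other 3 replicas are on down nodes

	// Suppose down-replication toward a smaller target removes live,
	// up-to-date replicas (the allocator bugs discussed below allowed this).
	for replicas > 5 {
		replicas-- // membership shrinks by one...
		live--     // ...and the removed replica was a live one
		quorum := replicas/2 + 1
		fmt.Printf("replicas=%d live=%d quorum=%d available=%v\n",
			replicas, live, quorum, live >= quorum)
	}
}
```

By the time the range is down to 6 members, only 3 of them are live against a quorum of 4, so the range is unavailable.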
I think
storage: fix filterUnremovableReplicas badness

`filterUnremovableReplicas` was allowing replicas that were a necessary part of quorum to be removed. This occurred because `filterBehindReplicas` was treating a "brand new replica" as up-to-date even when we didn't have evidence of it being up-to-date. `filterBehindReplicas` needs to return an accurate picture of the up-to-date replicas. Rather than push this work into `filterBehindReplicas`, `filterUnremovableReplicas` has been changed to filter the "brand new replica" out of the removable candidates.

See cockroachdb#34122

Release note: None
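A hedged sketch of the idea in that commit (the types and helper shown here are simplified assumptions, not the actual CockroachDB code): a replica we just added and have no Raft progress for should neither count toward the up-to-date quorum nor be offered as a removal candidate.

```go
package main

import "fmt"

// ReplicaInfo is a stand-in for the real replica descriptor plus the
// follower state the allocator consults; the fields are illustrative.
type ReplicaInfo struct {
	ReplicaID int
	UpToDate  bool // has this follower's log caught up to the leader's?
	BrandNew  bool // just added; no Raft progress reported yet
}

// filterUnremovableReplicas returns the candidates that can be removed
// without dropping below a quorum of replicas we *know* are up-to-date.
// Brand-new replicas are excluded from both sides of that calculation:
// they don't count as up-to-date, and they aren't offered as removal
// targets. Sketch only; the real function's signature and logic differ.
func filterUnremovableReplicas(replicas []ReplicaInfo) []ReplicaInfo {
	quorum := len(replicas)/2 + 1

	upToDate := 0
	for _, r := range replicas {
		if r.UpToDate && !r.BrandNew {
			upToDate++
		}
	}

	var removable []ReplicaInfo
	for _, r := range replicas {
		if r.BrandNew {
			continue // never treat an unproven replica as removable
		}
		// Removing an up-to-date replica must still leave a quorum of
		// up-to-date replicas behind; removing a behind replica is fine.
		if !r.UpToDate || upToDate-1 >= quorum {
			removable = append(removable, r)
		}
	}
	return removable
}

func main() {
	// Four replicas: three proven up-to-date, one brand new. Quorum is 3.
	replicas := []ReplicaInfo{
		{1, true, false}, {2, true, false}, {3, true, false}, {4, false, true},
	}
	// Prints "[]": nothing is removable. Before the fix, the brand-new
	// replica would have counted as up-to-date and an established replica
	// could have been removed, jeopardizing quorum.
	fmt.Println(filterUnremovableReplicas(replicas))
}
```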
storage: fix lastReplicaAdded computation

Previously, `Replica.mu.lastReplicaAdded` was being set to the maximum replica ID in the descriptor whenever the descriptor changed. This was invalid when a replica was removed from the range. For example, consider a range with 9 replicas, IDs 1 through 9. If replica ID 5 is removed from the range, `lastReplicaAdded` was being set to 9. Coupled with the bug in the previous commit, this was causing replica ID 9 to appear to be up-to-date when it wasn't. The fix here isn't strictly necessary, but is done to bring sanity: `lastReplicaAdded` should accurately reflect the last replica which was added, not the maximum replica ID in the range.

See cockroachdb#34122

Release note: None
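A minimal sketch of the corrected bookkeeping (the struct and method names are simplified assumptions, not the real `Replica` code): only a replica ID larger than anything seen before should advance `lastReplicaAdded`, so a removal that merely leaves ID 9 as the maximum does not make replica 9 look freshly added.

```go
package main

import (
	"fmt"
	"time"
)

// rangeState is an illustrative stand-in for the bits of Replica.mu that
// matter here; the real code tracks more.
type rangeState struct {
	lastReplicaAdded     int
	lastReplicaAddedTime time.Time
}

// applyDescriptor mimics reacting to a descriptor change. The buggy code
// set lastReplicaAdded to the maximum ID on every change (refreshing the
// timestamp), so a removal made the highest-numbered replica look freshly
// added. The fix: only advance when a larger ID actually appears.
func (s *rangeState) applyDescriptor(replicaIDs []int) {
	maxID := 0
	for _, id := range replicaIDs {
		if id > maxID {
			maxID = id
		}
	}
	if maxID > s.lastReplicaAdded {
		s.lastReplicaAdded = maxID
		s.lastReplicaAddedTime = time.Now()
	}
}

func main() {
	var s rangeState
	s.applyDescriptor([]int{1, 2, 3, 4, 5, 6, 7, 8, 9})
	added := s.lastReplicaAddedTime
	s.applyDescriptor([]int{1, 2, 3, 4, 6, 7, 8, 9}) // replica 5 removed
	// Prints "9 true": the removal neither changed lastReplicaAdded nor made
	// replica 9 look like it was just added.
	fmt.Println(s.lastReplicaAdded, s.lastReplicaAddedTime.Equal(added))
}
```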
storage: reimplement StorePool.AvailableNodeCount

Reimplement `StorePool.AvailableNodeCount` in terms of `NodeLiveness.GetNodeCount`. The latter returns a count of the user's intended number of nodes in the cluster: nodes that have been added minus nodes that are decommissioning (or decommissioned). This fixes misbehavior of the dynamic replication factor heuristic. The previous `StorePool.AvailableNodeCount` implementation would fluctuate depending on the number of node descriptors that had been received from gossip and the number of dead nodes. So if you had a 5-node cluster and 2 nodes died for more than 5 minutes (and were thus marked as dead), the cluster would suddenly start down-replicating ranges. Similarly, if you had a 5-node cluster and you took down all 5 nodes and only restarted 3, the cluster would start down-replicating ranges. The new behavior is to consider a node part of the cluster until it is decommissioned. This better matches user expectations.

Fixes cockroachdb#34122

Release note (bug fix): Avoid down-replicating widely replicated ranges when nodes in the cluster are temporarily down.
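A hedged sketch of the new counting rule (the liveness record shape and function below are simplified illustrations; `NodeLiveness.GetNodeCount`'s real signature and fields differ): every node that has ever joined counts unless it is decommissioning or decommissioned, and being temporarily down or even dead does not remove it from the count.

```go
package main

import "fmt"

// liveness is a simplified stand-in for a node liveness record.
type liveness struct {
	nodeID          int
	decommissioning bool // set while the operator is draining the node out
	dead            bool // liveness expired; deliberately ignored below
}

// nodeCount mirrors the intent described above: the operator's intended
// cluster size is "nodes added minus nodes decommissioning/decommissioned".
// Down or dead nodes still count.
func nodeCount(records []liveness) int {
	n := 0
	for _, l := range records {
		if !l.decommissioning {
			n++
		}
	}
	return n
}

func main() {
	// 5-node cluster, two nodes dead for more than 5 minutes: still 5
	// intended nodes, so the replication target does not shrink.
	records := []liveness{
		{1, false, false}, {2, false, false}, {3, false, false},
		{4, false, true}, {5, false, true},
	}
	fmt.Println(nodeCount(records)) // 5
}
```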
roachtest: add replicate/wide roachtest

Add the `replicate/wide` roachtest, which starts up a 9-node cluster, sets the replication factor for all zones to 9, waits for full replication, and then restarts the cluster, bringing up only nodes 1-6. Previously, this would cause down-replication, and that down-replication could cause unavailable ranges. The test then decommissions one of the nodes and verifies that replication of the ranges falls to 7. Lastly, it decreases the replication factor to 5 and verifies that the replicas per range again fall.

See cockroachdb#34122

Release note: None
Ugh. Thanks for digging into this; I'll have more comments on Tuesday. The dynamic replication factor is a lot more problematic than anticipated.
Sounds reasonable to me. For historical reference, #32949 was a previous attempt to fix the same issue.
Yep. With hindsight, that was insufficient. I've been trying to imagine what can go wrong with using a node-liveness record based count. So far nothing, but that could be a failure of imagination. |
33196: opt: implement use of sequences as data sources r=justinj a=justinj

This commit allows sequences to be selected from. It adds them as a catalog item similar to tables.

Release note (sql change): Using a sequence as a SELECT target is now supported by the cost-based optimizer.

34126: storage: fix various problems with dynamic replication factor r=tbg a=petermattis

* roachtest: add replicate/wide roachtest
* storage: reimplement StorePool.AvailableNodeCount
* storage: fix lastReplicaAdded computation
* storage: fix filterUnremovableReplicas badness

Fixes #34122

Release note (bug fix): Avoid down-replicating widely replicated ranges when nodes in the cluster are temporarily down.

Co-authored-by: Justin Jaffray <[email protected]>
Co-authored-by: Peter Mattis <[email protected]>
The target replica count for a range is limited by the number of available nodes in the cluster (see `StorePool.GetAvailableNodes` and `storage.GetNeededReplicas`). This is wonky. Consider what happens if you have a 15-node cluster with 15-way replication (as seen in a customer setup). Take down 5 nodes and wait for those nodes to be declared dead. `GetAvailableNodes` will now return 10, which will limit the target replication for the ranges to 9 (we can't replicate to an even number). So we'll remove 6 replicas from each of the ranges, and we'll do so fairly quickly. If the 5 nodes are restarted, each of the ranges on those nodes will first have to wait for replica GC, and then we'll have to wait for up-replication to occur. This seems less than optimal.

This can be triggered without waiting for the down nodes to be declared dead. If instead of stopping 5 nodes, the entire cluster is stopped and then only 10 nodes are started, the store pool will only have 10 node descriptors and the same problem will occur.
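For reference, a minimal sketch of the clamping being described (an illustrative simplification with an assumed function name, not the actual `storage.GetNeededReplicas` code): the zone's replication factor is capped by the available node count and nudged down to an odd number.

```go
package main

import "fmt"

// neededReplicas clamps the zone's configured replication factor to the
// number of nodes the allocator thinks are available, and avoids even
// counts, mirroring the heuristic described above. Illustrative sketch only.
func neededReplicas(zoneReplicas, availableNodes int) int {
	need := zoneReplicas
	if need > availableNodes {
		need = availableNodes
	}
	if need%2 == 0 && need > 1 {
		need-- // can't replicate to an even number
	}
	return need
}

func main() {
	// 15-way replication, 5 of 15 nodes declared dead: the target collapses
	// from 15 to 9, so 6 replicas get removed from every range.
	fmt.Println(neededReplicas(15, 10)) // 9
}
```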
Here are steps to hork a local 15-node cluster:
I'm not sure what is going on yet, but this looks very reproducible.