
storage: dynamically adjusted replica count heuristic is wonky #34122

Closed
petermattis opened this issue Jan 19, 2019 · 6 comments · Fixed by #34126
Assignees
Labels
C-investigation Further steps needed to qualify. C-label will change. S-1 High impact: many users impacted, serious risk of high unavailability or data loss

Comments

@petermattis
Collaborator

The target replica count for a range is limited by the number of available nodes in the cluster (see StorePool.GetAvailableNodes and storage.GetNeededReplicas). This is wonky. Consider what happens if you have a 15-node cluster with 15-way replication (as seen in a customer setup). Take down 5 nodes and wait for those nodes to be declared dead. GetAvailableNodes will now return 10, which will limit the target replication for the ranges to 9 (we can't replicate to an even number). So we'll remove 6 replicas from each of the ranges, and we'll do so fairly quickly. If the 5 nodes are restarted, each of the ranges on those nodes will first have to wait for replica GC, and then we'll have to wait for up-replication to occur. This seems less than optimal.

This can be triggered without waiting for the down nodes to be declared dead. If instead of stopping 5 nodes, the entire cluster is stopped and then only 10 nodes are started, the store pool will only have 10 node descriptors and the same problem will occur.
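
For illustration, here is a rough sketch of the clamping behavior described above (a hypothetical simplification, not the actual storage.GetNeededReplicas implementation):

```go
// neededReplicas is a hypothetical simplification of the clamping
// heuristic described above, not the actual GetNeededReplicas code:
// the zone-configured replication factor is capped by the number of
// available nodes, and an even result is rounded down to the next odd
// number (an even count tolerates no more failures than the next-lower
// odd count).
func neededReplicas(zoneReplicas, availableNodes int) int {
	needed := zoneReplicas
	if needed > availableNodes {
		needed = availableNodes
	}
	if needed%2 == 0 {
		needed--
	}
	if needed < 1 {
		needed = 1
	}
	return needed
}

// neededReplicas(15, 15) == 15, but neededReplicas(15, 10) == 9, which
// is why the ranges in the example above get down-replicated to 9.
```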

Here are steps to hork a local 15-node cluster:

~ roachprod create local -n 15
~ roachprod start local --sequential
~ roachprod sql local:1 -- -e "ALTER RANGE default CONFIGURE ZONE USING num_replicas = 15"
~ roachprod sql local:1 -- -e "ALTER DATABASE system CONFIGURE ZONE USING num_replicas = 15"
~ roachprod sql local:1 -- -e "ALTER TABLE system.public.jobs CONFIGURE ZONE USING num_replicas = 15"
~ roachprod sql local:1 -- -e "ALTER RANGE meta CONFIGURE ZONE USING num_replicas = 15"
~ roachprod sql local:1 -- -e "ALTER RANGE system CONFIGURE ZONE USING num_replicas = 15"
~ roachprod sql local:1 -- -e "ALTER RANGE liveness CONFIGURE ZONE USING num_replicas = 15"
... wait for every node to have 20 ranges
~ roachprod stop local
~ roachprod start local:1-10

I'm not sure what is going on yet, but this looks very reproducible.

@petermattis petermattis added the C-investigation Further steps needed to qualify. C-label will change. label Jan 19, 2019
@petermattis petermattis self-assigned this Jan 19, 2019
@petermattis
Collaborator Author

Also easy to reproduce on a 9-node cluster with a replication factor of 9. Wait for full replication. Stop the cluster and restart nodes 1-6. A handful of ranges will appear unavailable. I didn't notice unavailable ranges on the 15-node cluster; it's possible I just missed them.

@petermattis
Collaborator Author

Here is what is going on in the 9-node cluster case. When all 9 nodes are up, a range has 9 replicas and everything is copacetic. When the cluster is taken down and restarted with 6 nodes, StorePool.AvailableNodeCount returns 6 (actually, it temporarily returns even fewer than that, which is itself problematic) because there are only 6 node descriptors being gossiped. GetNeededReplicas(9, 6) returns a value of 5 (it refuses to return an even number for replication). So the replicate queue will start to down-replicate all of the 9-replica ranges to 5 replicas.

This down-replication sounds reasonable on the surface, but it is fraught because there is no traffic on the cluster. filterUnremovableReplicas is supposed to exclude from removal any replicas that are necessary for quorum, but it allows removal of replicas on the down nodes whose state we don't know. Unfortunately, it also allows removal of replicas on the good nodes. The allocator logs clearly show what is happening to cause a range to become unavailable. We down-replicate from 9 replicas, to 8, to 7, to 6. If the replicas being removed are on the up nodes, we eventually hit a state where there is not a quorum of replicas on up nodes. This deficiency in filterUnremovableReplicas should be fixable. We should require that a quorum of replicas remains on live nodes.
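
A sketch of the quorum constraint being proposed here, with hypothetical stand-in types (illustrative only, not the actual filterUnremovableReplicas code):

```go
// Hypothetical stand-ins for the real descriptor types.
type NodeID int
type ReplicaID int

type ReplicaDescriptor struct {
	NodeID    NodeID
	ReplicaID ReplicaID
}

// canRemove returns true only if, after removing the candidate, a
// quorum of the remaining replicas still resides on live nodes. This
// is the constraint proposed above, not the current implementation.
func canRemove(replicas []ReplicaDescriptor, candidate ReplicaDescriptor, isLive func(NodeID) bool) bool {
	remaining := len(replicas) - 1
	quorum := remaining/2 + 1
	liveRemaining := 0
	for _, r := range replicas {
		if r.ReplicaID == candidate.ReplicaID {
			continue // the replica we are considering removing
		}
		if isLive(r.NodeID) {
			liveRemaining++
		}
	}
	return liveRemaining >= quorum
}
```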

But I think we shouldn't be down-replicating in the first place. StorePool.AvailableNodeCount seems like a problematic signal for the number of nodes in the cluster. For one, it only counts nodes that have gossiped a node descriptor. Second, it only counts non-dead nodes. So after a node has been down for 5 minutes, we may start adjusting the target replication factor for all ranges in the cluster. That doesn't seem like the right thing to do. I think we should consider the cluster node count to be the number of nodes that have been added to the cluster minus the number of decommissioned nodes. If a node dies, it should still be considered part of the cluster until it is decommissioned.

@tbg, @bdarnell, @andreimatei Thoughts on the above? Am I missing anything about how this is supposed to be working?

@petermattis petermattis added the S-1 High impact: many users impacted, serious risk of high unavailability or data loss label Jan 19, 2019
@petermattis
Collaborator Author

I think StorePool.AvailableNodeCount should be replaced with NodeLiveness.AvailableNodeCount. The latter would look at the node liveness records, skipping any nodes that have been decommissioned.
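
A minimal sketch of what such a liveness-based count could look like, assuming a hypothetical LivenessRecord type (not the actual NodeLiveness API):

```go
// LivenessRecord is a hypothetical stand-in for a node liveness record.
type LivenessRecord struct {
	NodeID          int
	Decommissioning bool // set for decommissioning or decommissioned nodes
}

// availableNodeCount counts every node that has a liveness record,
// live or not, skipping only nodes that are decommissioning or have
// been decommissioned.
func availableNodeCount(records []LivenessRecord) int {
	count := 0
	for _, rec := range records {
		if rec.Decommissioning {
			continue
		}
		count++
	}
	return count
}
```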

filterUnremovableReplicas had a bug in its handling of the "brand new replica". This was only a problem because the calculation of Replica.mu.lastReplicaAdded had its own bug, which caused it to be set to the maximum replica ID in the range whenever a replica was removed. I have fixes for both on the way. So the failure here appears to be the confluence of 3 separate problems.

petermattis added a commit to petermattis/cockroach that referenced this issue Jan 19, 2019
`filterUnremovableReplicas` was allowing replicas that were a necessary
part of quorum to be removed. This occurred because
`filterBehindReplicas` treated a "brand new replica" as up-to-date even
when we had no evidence that it was up-to-date. `filterBehindReplicas`
needs to return an accurate picture of the up-to-date replicas. Rather
than push this work into `filterBehindReplicas`,
`filterUnremovableReplicas` has been changed to filter the "brand new
replica" out of the removable candidates.

See cockroachdb#34122

Release note: None
petermattis added a commit to petermattis/cockroach that referenced this issue Jan 19, 2019
Previously, `Replica.mu.lastReplicaAdded` was being set to the maximum
replica ID in the descriptor whenever the descriptor changed. This was
invalid when a replica was removed from the range. For example, consider
a range with 9 replicas, IDs 1 through 9. If replica ID 5 is removed
from the range, `lastReplicaAdded` was being set to 9. Coupled with the
bug in the previous commit, this was causing replica ID 9 to appear to
be up-to-date when it wasn't. The fix here isn't strictly necessary, but
is done to bring sanity: `lastReplicaAdded` should accurately reflect
the last replica which was added, not the maximum replica ID in the
range.

See cockroachdb#34122

Release note: None
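
For illustration, a minimal sketch of the behavior this commit describes, with replica IDs as plain ints (hypothetical, not the real descriptor-update path):

```go
// updateLastReplicaAdded advances lastReplicaAdded only when the new
// descriptor contains a replica ID that was not present before, so a
// removal never changes it.
func updateLastReplicaAdded(lastReplicaAdded int, oldIDs, newIDs []int) int {
	seen := make(map[int]bool, len(oldIDs))
	for _, id := range oldIDs {
		seen[id] = true
	}
	for _, id := range newIDs {
		if !seen[id] && id > lastReplicaAdded {
			lastReplicaAdded = id
		}
	}
	return lastReplicaAdded
}
```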
petermattis added a commit to petermattis/cockroach that referenced this issue Jan 19, 2019
Reimplement `StorePool.AvailableNodeCount` in terms of
`NodeLiveness.GetNodeCount`. The latter returns a count of the user's
intended number of nodes in the cluster. This is added nodes minus
decommissioning (or decommissioned) nodes.

This fixes misbehavior of the dynamic replication factor heuristic. The
previous `StorePool.AvailableNodeCount` implementation would fluctuate
depending on the number of node descriptors that had been received from
gossip and the number of dead nodes. So if you had a 5-node cluster and
2 nodes died for more than 5 min (and thus were marked as dead), the
cluster would suddenly start down-replicating ranges. Similarly, if you
had a 5-node cluster and you took down all 5 nodes and only restarted 3,
the
cluster would start down-replicating ranges. The new behavior is to
consider a node part of the cluster until it is decommissioned. This
better matches user expectations.

Fixes cockroachdb#34122

Release note (bug fix): Avoid down-replicating widely replicated ranges
when nodes in the cluster are temporarily down.
petermattis added a commit to petermattis/cockroach that referenced this issue Jan 19, 2019
Add the `replicate/wide` roachtest which starts up a 9-node cluster,
sets the replication factor for all zones to 9, waits for full
replication, and then restarts the cluster, bringing up only nodes
1-6. Previously, this would cause down-replication and that
down-replication could cause unavailable ranges.

Further, test decommissioning one of the nodes and verify that the
replication factor of the ranges falls to 7. Lastly, decrease the
replication factor to 5 and verify that the replica count per range
falls again.

See cockroachdb#34122

Release note: None
@tbg
Member

tbg commented Jan 20, 2019

Ugh. Thanks for digging into this; I'll have more comments on Tuesday. The dynamic replication factor is a lot more problematic than anticipated.

@bdarnell
Contributor

I think StorePool.AvailableNodeCount should be replaced with NodeLiveness.AvailableNodeCount. The latter would look at the node liveness records, skipping any nodes that have been decommissioned.

Sounds reasonable to me.

For historical reference, #32949 was a previous attempt to fix the same issue.

@petermattis
Collaborator Author

For historical reference, #32949 was a previous attempt to fix the same issue.

Yep. With hindsight, that was insufficient. I've been trying to imagine what could go wrong with using a node-liveness-record-based count. So far I've come up with nothing, but that could be a failure of imagination.

@andreimatei andreimatei removed their assignment Jan 22, 2019
craig bot pushed a commit that referenced this issue Jan 23, 2019
33196: opt: implement use of sequences as data sources r=justinj a=justinj

This commit allows sequences to be selected from. It adds them as a
catalog item similar to tables.

Release note (sql change): Using a sequence as a SELECT target is now
supported by the cost-based optimizer.

34126: storage: fix various problems with dynamic replication factor r=tbg a=petermattis

* roachtest: add replicate/wide roachtest
* storage: reimplement StorePool.AvailableNodeCount
* storage: fix lastReplicaAdded computation
* storage: fix filterUnremovableReplicas badness

Fixes #34122

Release note (bug fix): Avoid down-replicating widely replicated ranges
when nodes in the cluster are temporarily down.

Co-authored-by: Justin Jaffray <[email protected]>
Co-authored-by: Peter Mattis <[email protected]>
@craig craig bot closed this as completed in #34126 Jan 23, 2019