
storage: dynamically adjusted replica count heuristic is wonky #34122

Closed
petermattis opened this issue Jan 19, 2019 · 6 comments · Fixed by #34126
Assignees
Labels
C-investigation Further steps needed to qualify. C-label will change. S-1 High impact: many users impacted, serious risk of high unavailability or data loss

Comments

@petermattis
Collaborator

The target replica count for a range is limited by the number of available nodes in the cluster (see StorePool.GetAvailableNodes and storage.GetNeededReplicas). This is wonky. Consider what happens if you have a 15-node cluster with 15-way replication (as seen in a customer setup). Take down 5 nodes and wait for those nodes to be declared dead. GetAvailableNodes will now return 10, which will limit the target replication for the ranges to 9 (we can't replicate to an even number). So we'll remove 6 replicas from each of the ranges, and we'll do so fairly quickly. If the 5 nodes are restarted, each of the ranges on those nodes will first have to wait for replica GC, and then we'll have to wait for up-replication to occur. This seems less than optimal.

This can be triggered without waiting for the down nodes to be declared dead. If instead of stopping 5 nodes, the entire cluster is stopped and then only 10 nodes are started, the store pool will only have 10 node descriptors and the same problem will occur.
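
For illustration, here is a rough sketch of the clamping behavior described above (a hypothetical simplification, not the actual storage.GetNeededReplicas implementation):

```go
// neededReplicas is a hypothetical simplification of the clamping
// heuristic described above, not the actual GetNeededReplicas code:
// the zone-configured replication factor is capped by the number of
// available nodes, and an even result is rounded down to the next odd
// number (an even count tolerates no more failures than the next-lower
// odd count).
func neededReplicas(zoneReplicas, availableNodes int) int {
	needed := zoneReplicas
	if needed > availableNodes {
		needed = availableNodes
	}
	if needed%2 == 0 {
		needed--
	}
	if needed < 1 {
		needed = 1
	}
	return needed
}

// neededReplicas(15, 15) == 15, but neededReplicas(15, 10) == 9, which
// is why the ranges in the example above get down-replicated to 9.
```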

Here are steps to hork a local 15-node cluster:

~ roachprod create local -n 15
~ roachprod start local --sequential
~ roachprod sql local:1 -- -e "ALTER RANGE default CONFIGURE ZONE USING num_replicas = 15"
~ roachprod sql local:1 -- -e "ALTER DATABASE system CONFIGURE ZONE USING num_replicas = 15"
~ roachprod sql local:1 -- -e "ALTER TABLE system.public.jobs CONFIGURE ZONE USING num_replicas = 15"
~ roachprod sql local:1 -- -e "ALTER RANGE meta CONFIGURE ZONE USING num_replicas = 15"
~ roachprod sql local:1 -- -e "ALTER RANGE system CONFIGURE ZONE USING num_replicas = 15"
~ roachprod sql local:1 -- -e "ALTER RANGE liveness CONFIGURE ZONE USING num_replicas = 15"
... wait for every node to have 20 ranges
~ roachprod stop local
~ roachprod start local:1-10

I'm not sure what is going on yet, but this looks very reproducible.

@petermattis petermattis added the C-investigation Further steps needed to qualify. C-label will change. label Jan 19, 2019
@petermattis petermattis self-assigned this Jan 19, 2019
@petermattis
Collaborator Author

Also easy to reproduce on a 9-node cluster with a replication factor of 9. Wait for full replication. Stop the cluster and restart nodes 1-6. A handful of ranges will appear unavailable. I didn't notice unavailable ranges on the 15-node cluster; it's possible I just missed them.

@petermattis
Collaborator Author

Here is what is going on in the 9-node cluster case. When all 9 nodes are up, a range has 9 replicas and everything is copacetic. When the cluster is taken down and restarted with 6 nodes, StorePool.AvailableNodeCount returns 6 (actually, it temporarily returns even fewer than that, which is itself problematic) because there are only 6 node descriptors being gossiped. GetNeededReplicas(9, 6) returns a value of 5 (it refuses to return an even number for replication). So the replicate queue will start to down-replicate all of the 9-replica ranges to 5 replicas.

This down-replication sounds reasonable on the surface, but it is fraught because there is no traffic on the cluster. filterUnremovableReplicas is supposed to exclude from removal any replicas that are necessary for quorum, but it allows removal of replicas on the down nodes whose state we don't know. Unfortunately, it also allows removal of replicas on the good nodes. The allocator logs clearly show what is happening to cause a range to become unavailable. We down-replicate from 9 replicas, to 8, to 7, to 6. If the replicas being removed are on the up nodes, we eventually hit a state where there is not a quorum of replicas on up nodes. This deficiency in filterUnremovableReplicas should be fixable. We should require that a quorum of replicas remains on live nodes.
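
A sketch of the quorum constraint being proposed here, with hypothetical stand-in types (illustrative only, not the actual filterUnremovableReplicas code):

```go
// Hypothetical stand-ins for the real descriptor types.
type NodeID int
type ReplicaID int

type ReplicaDescriptor struct {
	NodeID    NodeID
	ReplicaID ReplicaID
}

// canRemove returns true only if, after removing the candidate, a
// quorum of the remaining replicas still resides on live nodes. This
// is the constraint proposed above, not the current implementation.
func canRemove(replicas []ReplicaDescriptor, candidate ReplicaDescriptor, isLive func(NodeID) bool) bool {
	remaining := len(replicas) - 1
	quorum := remaining/2 + 1
	liveRemaining := 0
	for _, r := range replicas {
		if r.ReplicaID == candidate.ReplicaID {
			continue // the replica we are considering removing
		}
		if isLive(r.NodeID) {
			liveRemaining++
		}
	}
	return liveRemaining >= quorum
}
```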

But I think we shouldn't be down-replicating in the first place. StorePool.AvailableNodeCount seems like a problematic signal for the number of nodes in the cluster. For one, it only counts nodes that have gossiped a node descriptor. Second, it only counts non-dead nodes. So after a node has been down for 5 minutes, we may start adjusting the target replication factor for all ranges in the cluster. That doesn't seem like the right thing to do. I think we should consider the cluster node count to be the number of nodes that have been added to the cluster minus the number of decommissioned nodes. If a node dies, it should still be considered part of the cluster until it is decommissioned.

@tbg, @bdarnell, @andreimatei Thoughts on the above? Am I missing anything about how this is supposed to be working?

@petermattis petermattis added the S-1 High impact: many users impacted, serious risk of high unavailability or data loss label Jan 19, 2019
@petermattis
Collaborator Author

I think StorePool.AvailableNodeCount should be replaced with NodeLiveness.AvailableNodeCount. The latter would look at the node liveness records, skipping any nodes that have been decommissioned.
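
A minimal sketch of what such a liveness-based count could look like, assuming a hypothetical LivenessRecord type (not the actual NodeLiveness API):

```go
// LivenessRecord is a hypothetical stand-in for a node liveness record.
type LivenessRecord struct {
	NodeID          int
	Decommissioning bool // set for decommissioning or decommissioned nodes
}

// availableNodeCount counts every node that has a liveness record,
// live or not, skipping only nodes that are decommissioning or have
// been decommissioned.
func availableNodeCount(records []LivenessRecord) int {
	count := 0
	for _, rec := range records {
		if rec.Decommissioning {
			continue
		}
		count++
	}
	return count
}
```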

filterUnremovableReplicas had a bug in its handling of the "brand new replica". This was only a problem because the calculation of Replica.mu.lastReplicaAdded had its own bug, which caused it to be set to the maximum replica ID in the range whenever a replica was removed. I have fixes for both on the way. So the failure here appears to be the confluence of 3 separate problems.

petermattis added a commit to petermattis/cockroach that referenced this issue Jan 19, 2019
`filterUnremovableReplicas` was allowing replicas that were a necessary
part of quorum to be removed. This occurred because
`filterBehindReplicas` treated a "brand new replica" as up-to-date even
when we had no evidence that it was up-to-date. `filterBehindReplicas`
needs to return an accurate picture of the up-to-date replicas. Rather
than push this work into `filterBehindReplicas`,
`filterUnremovableReplicas` has been changed to filter the "brand new
replica" out of the removable candidates.

See cockroachdb#34122

Release note: None
petermattis added a commit to petermattis/cockroach that referenced this issue Jan 19, 2019
Previously, `Replica.mu.lastReplicaAdded` was being set to the maximum
replica ID in the descriptor whenever the descriptor changed. This was
invalid when a replica was removed from the range. For example, consider
a range with 9 replicas, IDs 1 through 9. If replica ID 5 is removed
from the range, `lastReplicaAdded` was being set to 9. Coupled with the
bug in the previous commit, this was causing replica ID 9 to appear to
be up-to-date when it wasn't. The fix here isn't strictly necessary, but
is done to bring sanity: `lastReplicaAdded` should accurately reflect
the last replica which was added, not the maximum replica ID in the
range.

See cockroachdb#34122

Release note: None
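
For illustration, a minimal sketch of the behavior this commit describes, with replica IDs as plain ints (hypothetical, not the real descriptor-update path):

```go
// updateLastReplicaAdded advances lastReplicaAdded only when the new
// descriptor contains a replica ID that was not present before, so a
// removal never changes it.
func updateLastReplicaAdded(lastReplicaAdded int, oldIDs, newIDs []int) int {
	seen := make(map[int]bool, len(oldIDs))
	for _, id := range oldIDs {
		seen[id] = true
	}
	for _, id := range newIDs {
		if !seen[id] && id > lastReplicaAdded {
			lastReplicaAdded = id
		}
	}
	return lastReplicaAdded
}
```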
petermattis added a commit to petermattis/cockroach that referenced this issue Jan 19, 2019
Reimplement `StorePool.AvailableNodeCount` in terms of
`NodeLiveness.GetNodeCount`. The latter returns a count of the user's
intended number of nodes in the cluster. This is added nodes minus
decommissioning (or decommissioned) nodes.

This fixes misbehavior of the dynamic replication factor heuristic. The
previous `StorePool.AvailableNodeCount` implementation would fluctuate
depending on the number of node descriptors that had been received from
gossip and the number of dead nodes. So if you had a 5-node cluster and
2 nodes died for more than 5 min (and thus were marked as dead), the
cluster would suddenly start down-replicating ranges. Similarly, if you
had a 5-node cluster and you took down all 5 nodes and only restarted 3,
the
cluster would start down-replicating ranges. The new behavior is to
consider a node part of the cluster until it is decommissioned. This
better matches user expectations.

Fixes cockroachdb#34122

Release note (bug fix): Avoid down-replicating widely replicated ranges
when nodes in the cluster are temporarily down.
petermattis added a commit to petermattis/cockroach that referenced this issue Jan 19, 2019
Add the `replicate/wide` roachtest which starts up a 9-node cluster,
sets the replication factor for all zones to 9, waits for full
replication, and then restarts the cluster, bringing up only nodes
1-6. Previously, this would cause down-replication and that
down-replication could cause unavailable ranges.

Further, test decommissioning one of the nodes and verify that the
replication factor of the ranges falls to 7. Lastly, decrease the
replication factor to 5 and verify that the replica count per range
falls again.

See cockroachdb#34122

Release note: None
@tbg
Member

tbg commented Jan 20, 2019

Ugh. Thanks for digging into this; I'll have more comments on Tuesday. The dynamic replication factor is a lot more problematic than anticipated.

@bdarnell
Contributor

I think StorePool.AvailableNodeCount should be replaced with NodeLiveness.AvailableNodeCount. The latter would look at the node liveness records, skipping any nodes that have been decommissioned.

Sounds reasonable to me.

For historical reference, #32949 was a previous attempt to fix the same issue.

@petermattis
Collaborator Author

For historical reference, #32949 was a previous attempt to fix the same issue.

Yep. With hindsight, that was insufficient. I've been trying to imagine what could go wrong with using a node-liveness-record-based count. So far I've come up with nothing, but that could be a failure of imagination.

@andreimatei andreimatei removed their assignment Jan 22, 2019
craig bot pushed a commit that referenced this issue Jan 23, 2019
33196: opt: implement use of sequences as data sources r=justinj a=justinj

This commit allows sequences to be selected from. It adds them as a
catalog item similar to tables.

Release note (sql change): Using a sequence as a SELECT target is now
supported by the cost-based optimizer.

34126: storage: fix various problems with dynamic replication factor r=tbg a=petermattis

* roachtest: add replicate/wide roachtest
* storage: reimplement StorePool.AvailableNodeCount
* storage: fix lastReplicaAdded computation
* storage: fix filterUnremovableReplicas badness

Fixes #34122

Release note (bug fix): Avoid down-replicating widely replicated ranges
when nodes in the cluster are temporarily down.

Co-authored-by: Justin Jaffray <[email protected]>
Co-authored-by: Peter Mattis <[email protected]>
@craig craig bot closed this as completed in #34126 Jan 23, 2019