
storage: replicate queue has tendency to remove just-added replicas #17879

Closed
a-robinson opened this issue Aug 23, 2017 · 4 comments

@a-robinson (Contributor)

  1. After the replicate queue adds a new replica as part of a rebalance, it immediately re-queues the replica for processing so that it can choose to remove a replica.
  2. As part of choosing which replica to remove, we filter from consideration any replicas that are necessary for the range's quorum. This filtering compares each replica's raft commit index to the quorum commit index, treating any replica that is even a single commit behind as not part of the quorum.
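The filtering in step 2 can be sketched roughly as follows. This is a hedged illustration with invented names (`removableReplicas`, `progress`); it is not the actual `filterUnremovableReplicas` code:

```go
package main

import "fmt"

// Illustrative sketch only; type and function names are invented, not
// the actual cockroach code.
type progress struct {
	ReplicaID int
	RaftIndex uint64 // commit index as seen by the leaseholder; 0 = unknown
}

// removableReplicas applies the strict filtering described above: a
// replica even one commit behind the range's commit index counts as
// behind, and if no more than a post-removal quorum of replicas are
// up-to-date, only the behind replicas are candidates for removal.
func removableReplicas(replicas []progress, commitIndex uint64) []progress {
	var upToDate, behind []progress
	for _, p := range replicas {
		if p.RaftIndex >= commitIndex {
			upToDate = append(upToDate, p)
		} else {
			behind = append(behind, p)
		}
	}
	// Removing one replica from e.g. 4 leaves 3, whose quorum is 2.
	quorumAfterRemoval := len(replicas) / 2
	if len(upToDate) > quorumAfterRemoval {
		return replicas // enough up-to-date replicas; any one can go
	}
	return behind
}

func main() {
	// Two up-to-date replicas, one a single commit behind, and the
	// just-added replica whose progress is still 0.
	rs := []progress{{8, 2333}, {2, 2334}, {3, 2334}, {20, 0}}
	fmt.Println(removableReplicas(rs, 2334)) // only replicas 8 and 20
}
```

With exactly a quorum's worth of up-to-date replicas, the up-to-date ones are untouchable, so only the lagging replica and the just-added one are removal candidates.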

Both of these actions are reasonable, but they interact poorly in high-latency clusters, or in any cluster where one of the previously existing replicas was lagging behind. The problem is that in many cases the newly added replica won't have finished receiving/applying its snapshot and catching up (or at least the leaseholder isn't yet aware that it has done so). If any of the other replicas is also behind (as can easily happen on indigo, where the nodes are different distances apart), then those two behind replicas are the only ones we'll consider removing; the other two replicas are considered necessary for the quorum since they're the only two that are up-to-date.

This unfortunate behavior can easily lead to thrashing if the replica that the allocator wanted to rebalance away from is one of the two that can't be removed. This affects both stats-based rebalancing and the range-count form of rebalancing, although its effect is more severe for stats-based rebalancing. It happens very reliably whenever running indigo with no data other than the timeseries ranges.

We could fix it in a few different ways, but we might not want to do so this close to 1.1. If we don't fix it, we'll definitely have to disable stats-based rebalancing by default (#17645).

The first approaches to fixing it that come to mind:

  1. Wait before re-queueing after doing a rebalance. We could just wait for a set amount of time on the order of seconds, or for a dynamic time estimated based on how long the snapshot should take, or we could poll the range state until the new replica gets caught up. These approaches would all be susceptible to the Scanner re-adding the replica to the replicate queue, circumventing the wait.
  2. Be less harsh in filterUnremovableReplicas: if a replica is less than N commits behind, don't rule it out. If we did this, then even if an existing replica is behind by a commit or two there would still be 3 valid replicas, meaning any replica could be removed even if the new replica hasn't caught up.
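Option 2 amounts to a small change in the up-to-date test. A sketch, with an assumed slack value and invented names (not the real code):

```go
package main

import "fmt"

// Illustrative sketch of option 2: treat a replica as effectively
// up-to-date if it is within `slack` commits of the range's commit
// index. The value of slack (N) is an assumption, not the real code.
const slack = 2

func upToDate(raftIndex, commitIndex uint64) bool {
	return raftIndex+slack >= commitIndex
}

func main() {
	commit := uint64(2334)
	// With a cushion, the replica at 2333 counts as up-to-date, so three
	// replicas are valid and any one of the four could be removed.
	for _, idx := range []uint64{2333, 2334, 2334, 0} {
		fmt.Println(idx, upToDate(idx, commit))
	}
}
```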

I like 2, but I may be forgetting a reason why we can't loosen that up. I thought we used to allow a cushion here, but we clearly don't right now. @petermattis

@a-robinson a-robinson added this to the 1.1 milestone Aug 23, 2017
@a-robinson a-robinson self-assigned this Aug 23, 2017
@bdarnell (Contributor)

If the new replica is still applying its snapshot, the match index will be zero, so a check for "less than N commits behind" wouldn't always help.

We could also special-case the most recently-added replica (i.e. the one with the highest replica ID). If the new replica has a match index of zero, consider it a part of the current quorum instead of being expendable.
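That special case could be sketched like this (names are assumptions, not the actual code): protect the replica with the highest ID while its match index is still zero, i.e. while its snapshot hasn't visibly applied.

```go
package main

import "fmt"

// Illustrative sketch of the special case described above; names are
// assumptions, not the actual cockroach code.
type progress struct {
	ReplicaID  int
	MatchIndex uint64 // 0 means the leaseholder has seen no progress yet
}

// isProtectedNewReplica reports whether p is the most recently added
// replica (highest replica ID) that is still catching up, and should
// therefore be counted as part of the current quorum.
func isProtectedNewReplica(p progress, all []progress) bool {
	maxID := 0
	for _, q := range all {
		if q.ReplicaID > maxID {
			maxID = q.ReplicaID
		}
	}
	return p.ReplicaID == maxID && p.MatchIndex == 0
}

func main() {
	rs := []progress{{8, 2333}, {2, 2334}, {3, 2334}, {20, 0}}
	fmt.Println(isProtectedNewReplica(rs[3], rs)) // replica 20 is protected
}
```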

@petermattis (Collaborator)

I'm not fond of option 1, as I feel it might have other side effects, like slowing down up-replication. Note that the description isn't quite accurate: after adding a new replica, we will definitely have finished sending and applying the preemptive snapshot before re-queuing the replica for rebalancing. The problem is that while sending/applying the snapshot the raft log will have grown, so the newly added replica will be behind.

> We could also special-case the most recently-added replica (i.e. the one with the highest replica ID). If the new replica has a match index of zero, consider it a part of the current quorum instead of being expendable.

This is interesting, though I think we'd want a time limit on this. If the most recently added replica was added 2 days ago, it shouldn't be considered a necessary part of the current quorum. Adding a time limit here could make this workable. I think option 2 could also work, though I'm not sure what value of N to use.

@a-robinson (Contributor, author)

> If the new replica is still applying its snapshot, the match index will be zero, so a check for "less than N commits behind" wouldn't always help.

Sorry, I didn't explain very clearly. If all 3 of the existing replicas are considered up-to-date, then filterUnremovableReplicas will be fine with removing any of them even if the fourth replica is behind, because no matter which replica is removed, there will still be at least 2 up-to-date after removing one. In cases like what's happening on indigo, considering the third existing replica up-to-date would fix things even given that the new replica is behind.

> After adding a new replica, we will definitely have finished sending and applying the preemptive snapshot before re-queuing the replica for rebalancing. The problem is that while sending/applying the snapshot the raft log will have grown so the newly added replica will be behind.

That's not true in practice; at least, the leaseholder isn't made aware of the new commit index. If it's meant to be true, something is wrong. In all the cases I saw on indigo earlier, the newest replica's commit index was 0. There are dozens of log lines like this, where one of the existing replicas is one commit behind (presumably from the ADD_REPLICA command) and one is at 0:

```
I170823 20:57:21.174696 184 storage/replicate_queue.go:371  [replicate,n3,s3,r34/3:/System/tsd/cr.store.{qu…-re…}] removing replica (n2,s2):20 due to over-replication: [8:2333, 2:2334, 3*:2334, 20:0]
```

> We could also special-case the most recently-added replica (i.e. the one with the highest replica ID). If the new replica has a match index of zero, consider it a part of the current quorum instead of being expendable.

I like this approach pretty well, but do we have a good way of knowing how long a replica has been part of the range? As @petermattis mentions, there might be side effects to allowing this for the most recent replica forever.

@petermattis (Collaborator)

> That's not true in practice, at least not the leaseholder being made aware of the new commit index.

Right. The snapshot will have finished, but the leaseholder won't necessarily have updated Raft state.

> I like this approach pretty well, but do we have a good way of knowing how long a replica has been part of the range?

Not right now; we'd have to add that facility. I suggest a field in Replica.mu that indicates the timestamp of when the highest replica ID was added, with a zero value indicating "long ago". Adding something to roachpb.RangeDescriptor feels more involved and permanent, which I'm uncomfortable with at this point in the cycle.

a-robinson added a commit to a-robinson/cockroach that referenced this issue Aug 25, 2017
a-robinson added a commit to a-robinson/cockroach that referenced this issue Aug 25, 2017
a-robinson added a commit to a-robinson/cockroach that referenced this issue Aug 27, 2017
a-robinson added a commit to a-robinson/cockroach that referenced this issue Aug 28, 2017
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Jul 11, 2022
…ed replicas

In cockroachdb#17879/cockroachdb#17930, a special case was added to `FilterUnremovableReplicas`
to avoid a common interaction where a newly-added replica was immediately
removed when a range followed a replica addition with a replica removal.
This special case was later refined in cockroachdb#34126.

This was important at the time, but it is no longer necessary because replica
rebalancing (replica addition + removal) is performed atomically through a
joint configuration. As a result, we can get rid of the subtle logic
surrounding this special case.