storage: replicate queue has tendency to remove just-added replicas #17879
Comments
If the new replica is still applying its snapshot, the match index will be zero, so a check for "less than N commits behind" wouldn't always help. We could also special-case the most recently added replica (i.e. the one with the highest replica ID). If the new replica has a match index of zero, consider it a part of the current quorum instead of being expendable.
I'm not fond of option 1 as I feel it might have other side effects like slowing down up-replication. Note that the description isn't quite accurate. After adding a new replica, we will definitely have finished sending and applying the preemptive snapshot before re-queuing the replica for rebalancing. The problem is that while sending/applying the snapshot the raft log will have grown, so the newly added replica will be behind.
This is interesting, though I think we'd want a time limit on this. If the most recently added replica was added 2 days ago, it shouldn't be considered a necessary part of the current quorum. Adding a time limit here could make this workable. I think option 2 could also work, though I'm not sure what value of N to use.
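A rough sketch of the special case with a time limit, to make the proposal concrete. This is not CockroachDB code: the `replica` type, its `Age` field, the `grace` parameter, and `countsTowardQuorum` are all hypothetical names, and as noted elsewhere in this thread there is currently no way to know how long a replica has been part of the range.

```go
package main

import "time"

// replica is a hypothetical, simplified view of one range replica.
type replica struct {
	ID    int           // replica ID; newer replicas get higher IDs
	Match uint64        // Raft match index as seen by the leaseholder
	Age   time.Duration // hypothetical: time since the replica was added
}

// countsTowardQuorum reports whether r should be treated as part of the
// current quorum when deciding which replicas are safe to remove.
func countsTowardQuorum(r replica, all []replica, commit uint64, grace time.Duration) bool {
	if r.Match >= commit {
		return true // genuinely caught up
	}
	// Find the most recently added replica (highest replica ID).
	newestID := 0
	for _, o := range all {
		if o.ID > newestID {
			newestID = o.ID
		}
	}
	// Special case: the newest replica with a match index of zero is given
	// the benefit of the doubt, but only within the grace period.
	return r.ID == newestID && r.Match == 0 && r.Age <= grace
}
```

With a grace period of a few minutes, a replica added a minute ago with a match index of zero would count toward the quorum, while one added 2 days ago would not.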
Sorry, I didn't explain very clearly. If all 3 of the existing replicas are considered up-to-date, then
That's not true in practice, at least not the leaseholder being made aware of the new commit index. If it's meant to be true, something is wrong. In all the cases I saw on indigo earlier, the newest replica's commit index was 0. There are dozens of log lines like this, where one of the existing replicas is one commit behind (presumably the
I like this approach pretty well, but do we have a good way of knowing how long a replica has been part of the range? Because as @petermattis mentions, there might be side effects to allowing this for the most recent replica forever.
Right. The snapshot will have finished, but the leaseholder won't necessarily have updated Raft state.
Not right now. We'd have to add that facility. I suggest a field in
…ed replicas In cockroachdb#17879/cockroachdb#17930, a special case was added to `FilterUnremovableReplicas` to avoid a common interaction where a newly-added replica was immediately removed when a range followed a replica addition with a replica removal. This special case was later refined in cockroachdb#34126. This was important at the time, but it is no longer necessary because replica rebalancing (replica addition + removal) is performed atomically through a joint configuration. As a result, we can get rid of the subtle logic surrounding this special case.
Both of these actions are reasonable, but they interact poorly in high-latency clusters or any cluster where one of the previously existing replicas was lagging behind. The problem is that in many cases, the newly added replica won't have finished receiving/applying its snapshot and catching up (or at least the leaseholder isn't yet aware that it has done so), and so if any of the other replicas is also behind (as can easily happen on indigo where the nodes are different distances apart) then those two behind replicas are the only ones we'll consider removing. The other two replicas are considered necessary for the quorum since they're the only two that are up-to-date.
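The interaction described above can be modeled roughly as follows. This is a simplified sketch, not the actual `filterUnremovableReplicas` implementation: the `replicaProgress` type and `removalCandidates` function are hypothetical, and "up-to-date" is assumed to mean that the match index has reached the commit index.

```go
package main

// replicaProgress is a hypothetical, simplified stand-in for the per-replica
// Raft progress tracked by the leaseholder.
type replicaProgress struct {
	ReplicaID int
	Match     uint64 // highest log index known to be on this replica
}

// removalCandidates returns the replicas that may be removed without
// dropping a replica that is needed for quorum after the removal.
func removalCandidates(replicas []replicaProgress, commit uint64) []replicaProgress {
	var upToDate, behind []replicaProgress
	for _, r := range replicas {
		if r.Match >= commit {
			upToDate = append(upToDate, r)
		} else {
			behind = append(behind, r)
		}
	}
	// Quorum size of the range once one replica has been removed.
	quorum := (len(replicas)-1)/2 + 1
	switch {
	case len(upToDate) < quorum:
		return nil // removing anything risks losing quorum
	case len(upToDate) > quorum:
		return replicas // there is slack, so any replica may go
	default:
		return behind // every up-to-date replica is needed
	}
}
```

With four replicas whose match indexes are 100, 100, 99 and 0 and a commit index of 100, only the last two are returned as candidates, which is the situation described: the lagging existing replica and the just-added replica are the only ones considered for removal.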
This unfortunate behavior can easily lead to thrashing if the replica that the allocator wanted to rebalance away from is one of the two that can't be removed. This affects both stats-based rebalancing and the range-count form of rebalancing, although its effect is more severe for stats-based rebalancing. It happens very reliably whenever running indigo with no data other than the timeseries ranges.
We could fix it in a few different ways, but we might not want to so close to 1.1. If we don't fix it, we'll definitely have to disable stats-based rebalancing by default (#17645).
The first approaches to fixing that come to mind:
filterUnremovableReplicas. If a replica is less than N commits behind, don't rule it out. If we did this, then even if an existing replica is behind by a commit or two there will still be 3 valid replicas, meaning any replica can be behind even if the new replica hasn't caught up.
I like 2, but may be forgetting a reason why we can't loosen that up. I thought we used to allow a cushion here, but we clearly don't right now. @petermattis
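As a minimal sketch of option 2, assuming a simplified model of the filter rather than the real code: the `replicaProgress` type, the `removalCandidatesWithCushion` function, and the `maxLag` parameter (the "N" above, whose value is left open here) are hypothetical.

```go
package main

// replicaProgress is a hypothetical, simplified stand-in for the per-replica
// Raft progress tracked by the leaseholder.
type replicaProgress struct {
	ReplicaID int
	Match     uint64 // highest log index known to be on this replica
}

// removalCandidatesWithCushion treats a replica as valid if it is at most
// maxLag entries behind the commit index, then applies the quorum check.
func removalCandidatesWithCushion(replicas []replicaProgress, commit, maxLag uint64) []replicaProgress {
	var valid, behind []replicaProgress
	for _, r := range replicas {
		if r.Match+maxLag >= commit {
			valid = append(valid, r)
		} else {
			behind = append(behind, r)
		}
	}
	// Quorum size of the range once one replica has been removed.
	quorum := (len(replicas)-1)/2 + 1
	switch {
	case len(valid) < quorum:
		return nil // removing anything risks losing quorum
	case len(valid) > quorum:
		return replicas // there is slack, so any replica may go
	default:
		return behind // every valid replica is needed
	}
}
```

With match indexes 100, 100, 99 and 0, a commit index of 100 and `maxLag` of 2, three replicas count as valid, so all four become removal candidates. With `maxLag` of 0 the same input yields only the two behind replicas.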