
storage: replicate queue has tendency to remove just-added replicas #17879

Closed
a-robinson opened this issue Aug 23, 2017 · 4 comments

@a-robinson (Contributor)

  1. After the replicate queue adds a new replica as part of a rebalance, it immediately re-queues the replica for processing so that it can choose to remove a replica.
  2. As part of choosing which replica to remove, we filter from consideration any replicas that are necessary for the range's quorum. This filtering compares each replica's raft commit index to the quorum commit index, treating any replica that is even a single commit behind as not part of the quorum.
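The filtering in step 2 can be sketched roughly as follows. This is a hedged illustration with invented names (`removableReplicas`, `progress`); it is not the actual `filterUnremovableReplicas` code:

```go
package main

import "fmt"

// Illustrative sketch only; type and function names are invented, not
// the actual cockroach code.
type progress struct {
	ReplicaID int
	RaftIndex uint64 // commit index as seen by the leaseholder; 0 = unknown
}

// removableReplicas applies the strict filtering described above: a
// replica even one commit behind the range's commit index counts as
// behind, and if no more than a post-removal quorum of replicas are
// up-to-date, only the behind replicas are candidates for removal.
func removableReplicas(replicas []progress, commitIndex uint64) []progress {
	var upToDate, behind []progress
	for _, p := range replicas {
		if p.RaftIndex >= commitIndex {
			upToDate = append(upToDate, p)
		} else {
			behind = append(behind, p)
		}
	}
	// Removing one replica from e.g. 4 leaves 3, whose quorum is 2.
	quorumAfterRemoval := len(replicas) / 2
	if len(upToDate) > quorumAfterRemoval {
		return replicas // enough up-to-date replicas; any one can go
	}
	return behind
}

func main() {
	// Two up-to-date replicas, one a single commit behind, and the
	// just-added replica whose progress is still 0.
	rs := []progress{{8, 2333}, {2, 2334}, {3, 2334}, {20, 0}}
	fmt.Println(removableReplicas(rs, 2334)) // only replicas 8 and 20
}
```

With exactly a quorum's worth of up-to-date replicas, the up-to-date ones are untouchable, so only the lagging replica and the just-added one are removal candidates.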

Both of these actions are reasonable, but they interact poorly in high-latency clusters, or in any cluster where one of the previously existing replicas was lagging behind. The problem is that in many cases the newly added replica won't have finished receiving/applying its snapshot and catching up (or at least the leaseholder isn't yet aware that it has done so). If any of the other replicas is also behind (as can easily happen on indigo, where the nodes are different distances apart), then those two behind replicas are the only ones we'll consider removing; the other two replicas are considered necessary for the quorum since they're the only two that are up-to-date.

This unfortunate behavior can easily lead to thrashing if the replica that the allocator wanted to rebalance away from is one of the two that can't be removed. This affects both stats-based rebalancing and the range-count form of rebalancing, although its effect is more severe for stats-based rebalancing. It happens very reliably whenever running indigo with no data other than the timeseries ranges.

We could fix it in a few different ways, but we might not want to do so this close to 1.1. If we don't fix it, we'll definitely have to disable stats-based rebalancing by default (#17645).

The first approaches to fixing it that come to mind:

  1. Wait before re-queueing after doing a rebalance. We could just wait for a set amount of time on the order of seconds, or for a dynamic time estimated based on how long the snapshot should take, or we could poll the range state until the new replica gets caught up. These approaches would all be susceptible to the Scanner re-adding the replica to the replicate queue, circumventing the wait.
  2. Be less harsh in filterUnremovableReplicas: if a replica is less than N commits behind, don't rule it out. If we did this, then even if an existing replica is behind by a commit or two there would still be 3 valid replicas, meaning any replica could be removed even if the new replica hasn't caught up.
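Option 2 amounts to a small change in the up-to-date test. A sketch, with an assumed slack value and invented names (not the real code):

```go
package main

import "fmt"

// Illustrative sketch of option 2: treat a replica as effectively
// up-to-date if it is within `slack` commits of the range's commit
// index. The value of slack (N) is an assumption, not the real code.
const slack = 2

func upToDate(raftIndex, commitIndex uint64) bool {
	return raftIndex+slack >= commitIndex
}

func main() {
	commit := uint64(2334)
	// With a cushion, the replica at 2333 counts as up-to-date, so three
	// replicas are valid and any one of the four could be removed.
	for _, idx := range []uint64{2333, 2334, 2334, 0} {
		fmt.Println(idx, upToDate(idx, commit))
	}
}
```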

I like 2, but I may be forgetting a reason why we can't loosen that up. I thought we used to allow a cushion here, but we clearly don't right now. @petermattis

@a-robinson a-robinson added this to the 1.1 milestone Aug 23, 2017
@a-robinson a-robinson self-assigned this Aug 23, 2017
@bdarnell (Contributor)

If the new replica is still applying its snapshot, the match index will be zero, so a check for "less than N commits behind" wouldn't always help.

We could also special-case the most recently-added replica (i.e. the one with the highest replica ID). If the new replica has a match index of zero, consider it a part of the current quorum instead of being expendable.
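That special case could be sketched like this (names are assumptions, not the actual code): protect the replica with the highest ID while its match index is still zero, i.e. while its snapshot hasn't visibly applied.

```go
package main

import "fmt"

// Illustrative sketch of the special case described above; names are
// assumptions, not the actual cockroach code.
type progress struct {
	ReplicaID  int
	MatchIndex uint64 // 0 means the leaseholder has seen no progress yet
}

// isProtectedNewReplica reports whether p is the most recently added
// replica (highest replica ID) that is still catching up, and should
// therefore be counted as part of the current quorum.
func isProtectedNewReplica(p progress, all []progress) bool {
	maxID := 0
	for _, q := range all {
		if q.ReplicaID > maxID {
			maxID = q.ReplicaID
		}
	}
	return p.ReplicaID == maxID && p.MatchIndex == 0
}

func main() {
	rs := []progress{{8, 2333}, {2, 2334}, {3, 2334}, {20, 0}}
	fmt.Println(isProtectedNewReplica(rs[3], rs)) // replica 20 is protected
}
```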

@petermattis (Collaborator)

I'm not fond of option 1, as I feel it might have other side effects, like slowing down up-replication. Note that the description isn't quite accurate: after adding a new replica, we will definitely have finished sending and applying the preemptive snapshot before re-queuing the replica for rebalancing. The problem is that while sending/applying the snapshot the raft log will have grown, so the newly added replica will be behind.

> We could also special-case the most recently-added replica (i.e. the one with the highest replica ID). If the new replica has a match index of zero, consider it a part of the current quorum instead of being expendable.

This is interesting, though I think we'd want a time limit on this. If the most recently added replica was added 2 days ago, it shouldn't be considered a necessary part of the current quorum. Adding a time limit here could make this workable. I think option 2 could also work, though I'm not sure what value of N to use.

@a-robinson (Contributor, author)

> If the new replica is still applying its snapshot, the match index will be zero, so a check for "less than N commits behind" wouldn't always help.

Sorry, I didn't explain very clearly. If all 3 of the existing replicas are considered up-to-date, then filterUnremovableReplicas will be fine with removing any of them even if the fourth replica is behind, because no matter which replica is removed, there will still be at least 2 up-to-date after removing one. In cases like what's happening on indigo, considering the third existing replica up-to-date would fix things even given that the new replica is behind.

> After adding a new replica, we will definitely have finished sending and applying the preemptive snapshot before re-queuing the replica for rebalancing. The problem is that while sending/applying the snapshot the raft log will have grown so the newly added replica will be behind.

That's not true in practice; at least, the leaseholder isn't made aware of the new commit index. If it's meant to be true, something is wrong. In all the cases I saw on indigo earlier, the newest replica's commit index was 0. There are dozens of log lines like this, where one of the existing replicas is one commit behind (presumably from the ADD_REPLICA command) and one is at 0:

```
I170823 20:57:21.174696 184 storage/replicate_queue.go:371  [replicate,n3,s3,r34/3:/System/tsd/cr.store.{qu…-re…}] removing replica (n2,s2):20 due to over-replication: [8:2333, 2:2334, 3*:2334, 20:0]
```

> We could also special-case the most recently-added replica (i.e. the one with the highest replica ID). If the new replica has a match index of zero, consider it a part of the current quorum instead of being expendable.

I like this approach pretty well, but do we have a good way of knowing how long a replica has been part of the range? As @petermattis mentions, there might be side effects to allowing this for the most recent replica forever.

@petermattis (Collaborator)

> That's not true in practice, at least not the leaseholder being made aware of the new commit index.

Right. The snapshot will have finished, but the leaseholder won't necessarily have updated Raft state.

> I like this approach pretty well, but do we have a good way of knowing how long a replica has been part of the range?

Not right now; we'd have to add that facility. I suggest a field in Replica.mu that indicates the timestamp of when the highest replica ID was added, with a zero value indicating "long ago". Adding something to roachpb.RangeDescriptor feels more involved and permanent, which I'm uncomfortable with at this point in the cycle.

a-robinson added a commit to a-robinson/cockroach that referenced this issue Aug 25, 2017
a-robinson added a commit to a-robinson/cockroach that referenced this issue Aug 25, 2017
a-robinson added a commit to a-robinson/cockroach that referenced this issue Aug 27, 2017
a-robinson added a commit to a-robinson/cockroach that referenced this issue Aug 28, 2017
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Jul 11, 2022
…ed replicas

In cockroachdb#17879/cockroachdb#17930, a special case was added to `FilterUnremovableReplicas`
to avoid a common interaction where a newly-added replica was immediately
removed when a range followed a replica addition with a replica removal.
This special case was later refined in cockroachdb#34126.

This was important at the time, but it is no longer necessary because replica
rebalancing (replica addition + removal) is performed atomically through a
joint configuration. As a result, we can get rid of the subtle logic
surrounding this special case.