storage: Consider which replica will be removed when adding a replica to improve balance #17971
@a-robinson, here is my understanding; I don't know if it's right.
@a6802739 it won't always be the case that the replica we add is in the same locality as the one that we'll want to remove. For example, if a cluster has localities […]. This will need to be a little more sophisticated. Actually running through the […]
@a-robinson, so what we should do is […]. And what do you mean […]?
Roughly, although we do already have code for prioritizing diversity of localities. The issue here is more that when we're considering whether to do a rebalance, we'll sometimes rebalance to some store x1 on the basis of it being better than store y, even though when it comes time to remove a replica after up-replicating beyond the desired number of replicas we'll end up comparing x1 and x2 (and potentially x3, x4, etc.) to decide which replica to remove. This can mean that we'll remove x1 immediately after adding it, because it may be worse than x2 even though it's much better than y.
When deciding whether to do a rebalance, we run […].
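The add-then-remove behavior described above can be reproduced with toy numbers. The following is a minimal, self-contained sketch, not the actual allocator code: the scores, the naive rebalance check, and the removal rule are simplified stand-ins, but they show how a candidate that grades better than one existing replica can still be the first replica the removal pass evicts.

```go
// Sketch of the thrashing scenario; simplified stand-ins, not allocator code.
package main

import "fmt"

type store struct {
	name     string
	locality string
	score    float64 // higher is better; stands in for the allocator's grading
}

// naiveRebalanceCheck mirrors the current behavior: rebalance onto the
// candidate if it grades better than *any* existing replica, regardless of
// which replica would later be removed.
func naiveRebalanceCheck(existing []store, candidate store) bool {
	for _, s := range existing {
		if candidate.score > s.score {
			return true
		}
	}
	return false
}

// removalTarget mirrors the removal rule: narrow to the most-represented
// locality (to preserve diversity), then remove the worst-scoring replica
// within it.
func removalTarget(replicas []store) store {
	counts := map[string]int{}
	for _, s := range replicas {
		counts[s.locality]++
	}
	worst := replicas[0]
	for _, s := range replicas[1:] {
		if counts[s.locality] > counts[worst.locality] ||
			(counts[s.locality] == counts[worst.locality] && s.score < worst.score) {
			worst = s
		}
	}
	return worst
}

func main() {
	x2 := store{"x2", "X", 0.9}
	y := store{"y", "Y", 0.2}
	z := store{"z", "Z", 0.5}
	x1 := store{"x1", "X", 0.6} // candidate: much better than y, worse than x2

	existing := []store{x2, y, z}
	if naiveRebalanceCheck(existing, x1) {
		fmt.Println("rebalance: add", x1.name) // fires, because x1 beats y
	}
	// After the add, the range is over-replicated and one replica is removed.
	afterAdd := append(existing, x1)
	fmt.Println("remove:", removalTarget(afterAdd).name)
	// Prints x1: locality X now holds two replicas and x1 grades worse than
	// x2, so the replica we just added is removed again.
}
```

Running this prints that x1 is added (it beats y) and then immediately chosen for removal, which is exactly the add-then-remove cycle described in this thread.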
@a-robinson, thank you very much. As you said, if we want to rebalance a replica from store y to store x1, we could just add a replica to store x1 and then remove the replica on store y, so the replicas for this range will not stay up-replicated beyond the desired number of replicas, right?
Yeah, I just don't understand: if x1 is better than y, why don't we just remove the replica from y directly?
Yeah, we could get a replica from […]. Thank you for your kind explanation.
That's true (although the add and remove can't be done atomically). In this hypothetical scenario, the Allocator is essentially making a bad choice (adding x1 when it's just going to be removed again).
We grade stores in three different ways: […]
In this hypothetical scenario, […]
In clusters with multiple localities, we try to maintain diversity by spreading the replicas for each range as evenly across localities as possible. This means that in a cluster with 3 localities and 3 replicas per range, we'll try to keep 1 replica in each locality for each range.
When we're in a rebalancing state where one locality has 2 replicas and the other localities each have 1, we'll always remove a replica from the locality that has 2. This is good.
What isn't as good is that we don't consider this when deciding whether to rebalance in the first place. Our rebalancing logic will kick in if a possible new destination is a better fit than any of the existing replicas, even if the new destination is in a different locality and thus won't actually be a direct replacement for the existing replica that isn't a great fit.
I haven't seen this cause massive problems by itself, but in combination with another problem (I've seen it flare up with both #17879 and #17970) it can make for rebalance thrashing, where we repeatedly add and remove a replica on the same 1 or 2 nodes.
Not all situations will be as straightforward as the 3-locality example above, so we'll have to make sure the fix for this is somewhat more general than just making sure that a potential replica to add is in the same locality as the worst existing replica.
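One possible shape for such a general fix, sketched below under the assumption (mine, not stated in this issue) that the removal decision can be factored out into a reusable function: before committing to a rebalance, run that removal logic over the hypothetical post-add replica set, and skip the rebalance if the replica it would evict is the one we were about to add. Names like `RemovalTarget` and `WorthRebalancing` are illustrative, not the allocator's real API.

```go
// Hypothetical sketch of a general pre-rebalance check; not real allocator API.
package allocator

// StoreID identifies a store; Replica describes an existing or hypothetical
// replica placement.
type StoreID int

type Replica struct {
	Store    StoreID
	Locality string
}

// RemovalTarget stands in for the existing "which replica should be removed
// from this over-replicated range?" logic (locality diversity first, then
// per-store grading).
type RemovalTarget func(replicas []Replica) StoreID

// WorthRebalancing reports whether adding candidate makes sense: it does not
// if the removal pass that follows up-replication would immediately pick the
// candidate itself.
func WorthRebalancing(existing []Replica, candidate Replica, removalTarget RemovalTarget) bool {
	// Pretend the candidate has already been added...
	hypothetical := append(append([]Replica(nil), existing...), candidate)
	// ...and ask the normal removal logic which replica it would evict.
	return removalTarget(hypothetical) != candidate.Store
}
```

A nice property of structuring the check this way is that the rebalance decision reuses whatever grading the removal pass uses (localities, fullness, etc.), so the two decisions can't disagree in the way this issue describes, regardless of how many localities are involved.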