storage: allocator balance disrupted by splits #9435
Comments
Rather than balancing on range count, perhaps we should balance on "live bytes". When a split occurs, live bytes does not change (or it changes minimally). Seems like a relatively straightforward change.
Good idea.
Seems like a good idea. There are some small details that'll have to be addressed, such as the 5% threshold, which might not work great if 5% of the total live bytes is less than 64 MB. My main concern is: are the live bytes calculations accurate enough for rebalancing decisions? Just a short while back, they were negative. Now, they're consistently positive, but I don't have a sense for how accurate they are.
The bugs in the live bytes calculation were fixed. Pretty sure we're good to go on that front. We also already keep an aggregate live bytes value for the store as a whole. Adding […]. Can you elaborate on what your concern is with the total live bytes being less than 64 MB? I can see a problem with rebalancing causing thrashing if we have some very different size ranges, though I think that can be alleviated by passing in the size of the range being considered for rebalancing to […].
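To make the "pass the range size into the rebalance decision" idea concrete, here is a minimal sketch of a live-bytes-based check. Everything in it (the helper name, the data layout, the 5% band) is hypothetical and not the allocator's actual API; the point is only that knowing the candidate range's size lets the allocator refuse moves that would overshoot the mean and cause thrashing.

```go
// Hypothetical sketch of a live-bytes-based rebalance check that takes the
// candidate range's size into account. None of these names exist in the
// repo; this is only meant to illustrate the idea from the discussion.
package main

import "fmt"

// storeLiveBytes maps a store ID to that store's aggregate live bytes.
type storeLiveBytes map[int]int64

// shouldRebalanceFrom reports whether moving a range of rangeBytes off the
// given store is worthwhile: the store must be more than 5% above the mean,
// and the move must not drop it below the mean (which is what prevents
// thrashing when range sizes differ widely).
func shouldRebalanceFrom(stores storeLiveBytes, storeID int, rangeBytes int64) bool {
	var total int64
	for _, b := range stores {
		total += b
	}
	mean := float64(total) / float64(len(stores))
	overfull := mean * 1.05
	cur := float64(stores[storeID])
	return cur > overfull && cur-float64(rangeBytes) >= mean
}

func main() {
	stores := storeLiveBytes{1: 700 << 20, 2: 500 << 20, 3: 450 << 20}
	// Moving a 64 MB range off store 1 keeps it at or above the mean: do it.
	fmt.Println(shouldRebalanceFrom(stores, 1, 64<<20)) // true
	// Moving a 300 MB range would overshoot well below the mean: don't.
	fmt.Println(shouldRebalanceFrom(stores, 1, 300<<20)) // false
}
```

The key line is the `cur-float64(rangeBytes) >= mean` check: a very large range on a slightly overfull store is left alone, while a small range can still be moved.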
We originally used a combination of available bytes and range count; we switched to rely solely on range count in #6133 because the available bytes metric is noisy and in a small cluster you wouldn't see ranges being evenly distributed (which is more of a perceptual issue than a real one; in a cluster this small it doesn't really matter how the ranges are distributed). Live bytes would be a bit more stable than available bytes. I think it would be good to use metrics other than range count, but this also doesn't seem like much of a priority - it doesn't look like this is causing a large number of moves.

We'll need to be careful when making this change because it has a lot of opportunities to introduce thrashing and other problems (will small ranges be preferentially passed around because they can fit under the 5% threshold, causing them to become less available?)
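On the "5% of the total live bytes is less than 64 MB" worry raised above, one option is to clamp the rebalance band so it is never narrower than a single full-size range. The sketch below is illustrative only; the constant and function names are assumptions, not code from the repo.

```go
// Illustrative only: a rebalance band on live bytes that is at least one
// full range wide, so a small "5% of the mean" in a small cluster can't
// cause thrashing. The constant and names are assumptions, not repo code.
package main

import "fmt"

const maxRangeBytes = 64 << 20 // assumed default maximum range size (64 MB)

// rebalanceBand returns the [underfull, overfull] live-bytes bounds around
// the mean: 5% of the mean or one max-size range, whichever is larger.
func rebalanceBand(meanLiveBytes float64) (underfull, overfull float64) {
	delta := 0.05 * meanLiveBytes
	if delta < maxRangeBytes {
		delta = maxRangeBytes
	}
	return meanLiveBytes - delta, meanLiveBytes + delta
}

func main() {
	// Small cluster: 5% of 200 MB is only 10 MB, so the 64 MB floor applies.
	lo, hi := rebalanceBand(200 << 20)
	fmt.Printf("small cluster: [%.0f MB, %.0f MB]\n", lo/(1<<20), hi/(1<<20))
	// Large cluster: 5% of 10 GB is 512 MB, which already exceeds the floor.
	lo, hi = rebalanceBand(10 << 30)
	fmt.Printf("large cluster: [%.0f MB, %.0f MB]\n", lo/(1<<20), hi/(1<<20))
}
```

With a floor like this, moving a single 64 MB range can never push a store from one edge of the band past the other, which removes the most obvious thrashing mode in small clusters.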
Agreed that this isn't a high priority.
Agreed this isn't high priority. For small clusters, one of the issues that existed prior to #6133 was that […]. So, whenever we decide to do this change, we need to test with at least the […].
This is a nice piece of history. It can be closed now, though, since I can now run thousands of splits on a cluster without rebalancing kicking in (because logical bytes and writes-per-second are still balanced).
The current allocator heuristics reach steady state when no node is >5% above or <5% below the average number of replicas in the cluster. But consider what happens when a range splits. For example, let's say we have a 10 node cluster containing 999 replicas (333 ranges). Our target for the number of replicas per node is [95, 105]. Now, let's say the per-node replica counts are: […]. If a range splits that is present on the fuller nodes, we can transition to a state like: […].
The nodes with 106 replicas are now overfull per the heuristics and we'll have to rebalance off them. Thankfully there are 5 acceptable targets, which means that we'll perform 3 concurrent rebalances on the cluster. I'm pretty sure I'm seeing exactly this scenario on delta right now.
Balancing purely on range count is a bit unfortunate in this regard. If we were balancing on storage, there likely wouldn't be an issue, since a split doesn't actually change the amount of stored data.
Cc @cockroachdb/stability
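For concreteness, here is the arithmetic from the scenario above as a runnable sketch (the helper and its rounding are my own, not the allocator's): 999 replicas across 10 nodes gives a mean of 99.9, so the ±5% band is roughly [95, 105], and a split bumps the nodes holding that range from 105 to 106 replicas while their live bytes stay essentially the same.

```go
// Sketch of the threshold arithmetic described in this issue. The band()
// helper and its rounding are assumptions for illustration, not the
// allocator's real code.
package main

import (
	"fmt"
	"math"
)

// band returns the acceptable [lo, hi] replica-count range: within 5% of
// the cluster-wide mean replica count.
func band(totalReplicas, nodes int) (lo, hi int) {
	mean := float64(totalReplicas) / float64(nodes)
	return int(math.Round(mean * 0.95)), int(math.Round(mean * 1.05))
}

func main() {
	lo, hi := band(999, 10) // mean 99.9 -> [95, 105]
	fmt.Printf("acceptable replica counts: [%d, %d]\n", lo, hi)

	// A node at 105 replicas is acceptable; after a split of a range it
	// holds, it has 106 replicas and is overfull, so rebalancing starts.
	fmt.Println("105 acceptable:", 105 >= lo && 105 <= hi) // true
	fmt.Println("106 acceptable:", 106 >= lo && 106 <= hi) // false

	// The split leaves the node's live bytes (nearly) unchanged, so a
	// bytes-based heuristic would see no imbalance at all.
}
```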