
storage: allocator balance disrupted by splits #9435

Closed
petermattis opened this issue Sep 17, 2016 · 9 comments

@petermattis
Collaborator

The current allocator heuristics reach steady state when no node is >5% above or <5% below the average number of replicas in the cluster. But consider what happens when a range splits. For example, let's say we have a 10 node cluster containing 999 replicas (333 ranges). Our target for the number of replicas per node is [95, 105]. Now, let's say the per-node replica counts are:

95 95 95 95 95 104 105 105 105 105

If a range splits that is present on the fuller nodes we can transition to a state like:

95 95 95 95 95 104 105 106 106 106

The nodes with 106 replicas are now overfull per the heuristics, and we'll have to rebalance off of them. Since there are 5 acceptable targets, we'll perform 3 concurrent rebalances on the cluster. I'm pretty sure I'm seeing exactly this scenario on delta right now.
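To make the arithmetic concrete, here is a minimal sketch of the threshold check (not the actual allocator code; rounding the bound to match the [95, 105] target quoted above is my assumption):

```go
// A minimal sketch of the ±5% threshold arithmetic, not CockroachDB's
// actual allocator code. The rounding of the upper bound is an assumption
// made to match the [95, 105] target described in the issue.
package main

import (
	"fmt"
	"math"
)

const rebalanceThreshold = 0.05 // ±5% around the cluster mean

// overfullStores returns the indices of stores whose replica count
// exceeds the rounded upper bound mean*(1+rebalanceThreshold).
func overfullStores(counts []int) []int {
	total := 0
	for _, c := range counts {
		total += c
	}
	mean := float64(total) / float64(len(counts))
	max := math.Round(mean * (1 + rebalanceThreshold))
	var over []int
	for i, c := range counts {
		if float64(c) > max {
			over = append(over, i)
		}
	}
	return over
}

func main() {
	before := []int{95, 95, 95, 95, 95, 104, 105, 105, 105, 105}
	after := []int{95, 95, 95, 95, 95, 104, 105, 106, 106, 106}
	fmt.Println(overfullStores(before)) // []: steady state
	fmt.Println(overfullStores(after))  // [7 8 9]: one split triggers three rebalances
}
```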

Balancing purely on range count is a bit unfortunate in this regard. If we were balancing on storage there likely wouldn't be an issue since a split doesn't actually create more space.

Cc @cockroachdb/stability

@petermattis
Collaborator Author

Rather than balancing on range count, perhaps we should balance on "live bytes". When a split occurs, live bytes does not change (or it changes minimally). Seems like a relatively straightforward change.
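A hypothetical sketch of what that swap might look like (the StoreLoad type and shouldRebalance helper are invented for illustration, not proposed API):

```go
// A hypothetical sketch of the proposal: score stores by live bytes
// rather than replica count. StoreLoad and shouldRebalance are invented
// names, not CockroachDB's actual code.
package main

import "fmt"

// StoreLoad is an illustrative per-store summary.
type StoreLoad struct {
	RangeCount int
	LiveBytes  int64
}

// shouldRebalance applies the same ±5% band, but to live bytes. A split
// bumps RangeCount yet leaves LiveBytes (nearly) unchanged, so it no
// longer pushes a store over the threshold.
func shouldRebalance(s StoreLoad, stores []StoreLoad) bool {
	var total int64
	for _, t := range stores {
		total += t.LiveBytes
	}
	mean := float64(total) / float64(len(stores))
	return float64(s.LiveBytes) > mean*1.05
}

func main() {
	stores := []StoreLoad{
		{RangeCount: 106, LiveBytes: 10 << 30}, // just split: +1 range, same bytes
		{RangeCount: 95, LiveBytes: 10 << 30},
		{RangeCount: 95, LiveBytes: 10 << 30},
	}
	fmt.Println(shouldRebalance(stores[0], stores)) // false: no rebalance fires
}
```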

@spencerkimball
Member

Good idea.


@cuongdo
Contributor

cuongdo commented Sep 19, 2016

Seems like a good idea. There are some small details that'll have to be addressed, such as the 5% threshold, which might not work well if 5% of the total live bytes is less than 64 MB.

My main concern is: are the live bytes calculations accurate enough for rebalancing decisions? Just a short while back, they were negative. Now, they're consistently positive, but I don't have a sense for how accurate they are.
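To illustrate the threshold concern with made-up numbers: with the 64 MB default maximum range size, a three-store cluster holding 1 GB of live bytes has a per-store mean of ~341 MB, so the 5% band is only ~17 MB on either side of the mean. Moving a single full-size 64 MB range overshoots that band by nearly 4x, which could cause the same range to ping-pong between stores.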

@petermattis petermattis self-assigned this Sep 19, 2016
@petermattis
Collaborator Author

The bugs in the live bytes calculation were fixed. Pretty sure we're good to go on that front. We also already keep an aggregate live bytes value for the store as a whole. Adding StoreCapacity.LiveBytes is straightforward.

Can you elaborate on what your concern is with the total live bytes being less than 64 MB? I can see a problem with rebalancing causing thrashing if we have some very different size ranges, though I think that can be alleviated by passing in the size of the range being considered for rebalancing to Allocator.RebalanceTarget.
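For concreteness, a sketch of the two changes described here (the field and the extra RebalanceTarget parameter are illustrative, not the actual CockroachDB definitions):

```go
// A sketch of the two proposed changes; field and parameter names are
// illustrative, not the actual CockroachDB definitions.
package storage

// StoreCapacity would carry the aggregate live-bytes value the store
// already tracks, alongside the existing range count.
type StoreCapacity struct {
	RangeCount int32
	LiveBytes  int64 // assumed new field: total live bytes on the store
}

// Allocator is a stand-in for the real allocator type.
type Allocator struct{}

// RebalanceTarget would additionally take the size of the range being
// considered, so a move that pushes the target store past the cluster
// mean can be rejected instead of thrashing back and forth.
func (a *Allocator) RebalanceTarget(
	storeID int32, rangeLiveBytes int64, stores []StoreCapacity,
) (target int, ok bool) {
	var total int64
	for _, s := range stores {
		total += s.LiveBytes
	}
	mean := float64(total) / float64(len(stores))
	for i, s := range stores {
		// Only accept a target that stays under the mean after the move.
		if float64(s.LiveBytes+rangeLiveBytes) < mean {
			return i, true
		}
	}
	return 0, false
}
```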

@petermattis
Collaborator Author

The zerosum tool actually presents an unrealistic challenge for balancing based on live bytes because it splits at keys chosen using a Zipf distribution, making ranges fairly different in size. With normal size-based splitting, range sizes will be much more uniform.
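As an illustration of that skew (not zerosum's actual code), Go's math/rand Zipf generator shows how heavily the draws concentrate on a few keys:

```go
// Not zerosum's actual code; just an illustration of how Zipf-distributed
// split keys concentrate splits in a small part of the keyspace, leaving
// ranges of very different sizes.
package main

import (
	"fmt"
	"math/rand"
)

func main() {
	rng := rand.New(rand.NewSource(1))
	// s=1.1, v=1: heavily skewed draws over keys [0, 1<<20).
	zipf := rand.NewZipf(rng, 1.1, 1, 1<<20)
	splits := make(map[uint64]bool)
	for i := 0; i < 1000; i++ {
		splits[zipf.Uint64()] = true
	}
	// Far fewer than 1000 distinct split points: most draws land on a
	// handful of hot keys, so the resulting ranges are wildly uneven.
	fmt.Println(len(splits))
}
```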

@bdarnell
Contributor

We originally used a combination of available bytes and range count; we switched to relying solely on range count in #6133 because the available-bytes metric is noisy, and in a small cluster you wouldn't see ranges being evenly distributed (which is more of a perceptual issue than a real one; in a cluster that small it doesn't really matter how the ranges are distributed). Live bytes would be a bit more stable than available bytes.

I think it would be good to use metrics other than range count, but this also doesn't seem like much of a priority: it doesn't look like this is causing a large number of moves. We'll need to be careful when making this change because it has a lot of opportunities to introduce thrashing and other problems (will small ranges be preferentially passed around because they fit under the 5% threshold, causing them to become less available?).

@petermattis petermattis removed their assignment Sep 20, 2016
@petermattis
Collaborator Author

Agreed that this isn't a high priority.

@cuongdo
Contributor

cuongdo commented Sep 20, 2016

Agreed this isn't high priority.

For small clusters, one of the issues that existed prior to #6133 was that adding a 4th node would not cause any rebalances to occur. This meant that the 4th node had no data at all and wasn't helping increase availability or spread load. Just wanted to document this for posterity.

So, whenever we decide to make this change, we need to test with at least the following scenarios:

  1. block_writer (evenly sized ranges)
  2. something that generates uneven ranges
  3. small clusters -- these are the first impressions someone forms of CockroachDB


@a-robinson
Contributor

This is a nice piece of history. It can be closed now, though, since I can now run thousands of splits on a cluster without rebalancing kicking in (because logical bytes and writes-per-second are still balanced).

@a-robinson a-robinson self-assigned this Aug 23, 2017