storage: allocator balance disrupted by splits #9435
Comments
Rather than balancing on range count, perhaps we should balance on "live bytes". When a split occurs, live bytes does not change (or it changes minimally). Seems like a relatively straightforward change.
Good idea.
Seems like a good idea. There are some small details that'll have to be addressed, such as the 5% threshold, which might not work great if 5% of the total live bytes is less than 64 MB. My main concern is: are the live bytes calculations accurate enough for rebalancing decisions? Just a short while back, they were negative. Now, they're consistently positive, but I don't have a sense for how accurate they are.
The bugs in the live bytes calculation were fixed. Pretty sure we're good to go on that front. We also already keep an aggregate live bytes value for the store as a whole. Adding […]. Can you elaborate on what your concern is with the total live bytes being less than 64 MB? I can see a problem with rebalancing causing thrashing if we have some very different size ranges, though I think that can be alleviated by passing in the size of the range being considered for rebalancing to […].
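To make the "pass the range size into the rebalance decision" idea concrete, here is a minimal sketch of a live-bytes-based check. Everything in it (the helper name, the data layout, the 5% band) is hypothetical and not the allocator's actual API; the point is only that knowing the candidate range's size lets the allocator refuse moves that would overshoot the mean and cause thrashing.

```go
// Hypothetical sketch of a live-bytes-based rebalance check that takes the
// candidate range's size into account. None of these names exist in the
// repo; this is only meant to illustrate the idea from the discussion.
package main

import "fmt"

// storeLiveBytes maps a store ID to that store's aggregate live bytes.
type storeLiveBytes map[int]int64

// shouldRebalanceFrom reports whether moving a range of rangeBytes off the
// given store is worthwhile: the store must be more than 5% above the mean,
// and the move must not drop it below the mean (which is what prevents
// thrashing when range sizes differ widely).
func shouldRebalanceFrom(stores storeLiveBytes, storeID int, rangeBytes int64) bool {
	var total int64
	for _, b := range stores {
		total += b
	}
	mean := float64(total) / float64(len(stores))
	overfull := mean * 1.05
	cur := float64(stores[storeID])
	return cur > overfull && cur-float64(rangeBytes) >= mean
}

func main() {
	stores := storeLiveBytes{1: 700 << 20, 2: 500 << 20, 3: 450 << 20}
	// Moving a 64 MB range off store 1 keeps it at or above the mean: do it.
	fmt.Println(shouldRebalanceFrom(stores, 1, 64<<20)) // true
	// Moving a 300 MB range would overshoot well below the mean: don't.
	fmt.Println(shouldRebalanceFrom(stores, 1, 300<<20)) // false
}
```

The key line is the `cur-float64(rangeBytes) >= mean` check: a very large range on a slightly overfull store is left alone, while a small range can still be moved.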
We originally used a combination of available bytes and range count; we switched to rely solely on range count in #6133 because the available bytes metric is noisy and in a small cluster you wouldn't see ranges being evenly distributed (which is more of a perceptual issue than a real one; in a cluster this small it doesn't really matter how the ranges are distributed). Live bytes would be a bit more stable than available bytes. I think it would be good to use metrics other than range count, but this also doesn't seem like much of a priority - it doesn't look like this is causing a large number of moves.

We'll need to be careful when making this change because it has a lot of opportunities to introduce thrashing and other problems (will small ranges be preferentially passed around because they can fit under the 5% threshold, causing them to become less available?)
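On the "5% of the total live bytes is less than 64 MB" worry raised above, one option is to clamp the rebalance band so it is never narrower than a single full-size range. The sketch below is illustrative only; the constant and function names are assumptions, not code from the repo.

```go
// Illustrative only: a rebalance band on live bytes that is at least one
// full range wide, so a small "5% of the mean" in a small cluster can't
// cause thrashing. The constant and names are assumptions, not repo code.
package main

import "fmt"

const maxRangeBytes = 64 << 20 // assumed default maximum range size (64 MB)

// rebalanceBand returns the [underfull, overfull] live-bytes bounds around
// the mean: 5% of the mean or one max-size range, whichever is larger.
func rebalanceBand(meanLiveBytes float64) (underfull, overfull float64) {
	delta := 0.05 * meanLiveBytes
	if delta < maxRangeBytes {
		delta = maxRangeBytes
	}
	return meanLiveBytes - delta, meanLiveBytes + delta
}

func main() {
	// Small cluster: 5% of 200 MB is only 10 MB, so the 64 MB floor applies.
	lo, hi := rebalanceBand(200 << 20)
	fmt.Printf("small cluster: [%.0f MB, %.0f MB]\n", lo/(1<<20), hi/(1<<20))
	// Large cluster: 5% of 10 GB is 512 MB, which already exceeds the floor.
	lo, hi = rebalanceBand(10 << 30)
	fmt.Printf("large cluster: [%.0f MB, %.0f MB]\n", lo/(1<<20), hi/(1<<20))
}
```

With a floor like this, moving a single 64 MB range can never push a store from one edge of the band past the other, which removes the most obvious thrashing mode in small clusters.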
Agreed that this isn't a high priority.
Agreed this isn't high priority. For small clusters, one of the issues that existed prior to #6133 was that […]. So, whenever we decide to do this change, we need to test with at least the […].
This is a nice piece of history. It can be closed now, though, since I can now run thousands of splits on a cluster without rebalancing kicking in (because logical bytes and writes-per-second are still balanced).
The current allocator heuristics reach steady state when no node is >5% above or <5% below the average number of replicas in the cluster. But consider what happens when a range splits. For example, let's say we have a 10 node cluster containing 999 replicas (333 ranges). Our target for the number of replicas per node is [95, 105]. Now, let's say the per-node replica counts are: […]. If a range splits that is present on the fuller nodes, we can transition to a state like: […].
The nodes with 106 replicas are now overfull per the heuristics and we'll have to rebalance off them. Thankfully there are 5 acceptable targets, which means that we'll perform 3 concurrent rebalances on the cluster. I'm pretty sure I'm seeing exactly this scenario on delta right now.
Balancing purely on range count is a bit unfortunate in this regard. If we were balancing on storage, there likely wouldn't be an issue, since a split doesn't actually change the amount of stored data.
Cc @cockroachdb/stability
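For concreteness, here is the arithmetic from the scenario above as a runnable sketch (the helper and its rounding are my own, not the allocator's): 999 replicas across 10 nodes gives a mean of 99.9, so the ±5% band is roughly [95, 105], and a split bumps the nodes holding that range from 105 to 106 replicas while their live bytes stay essentially the same.

```go
// Sketch of the threshold arithmetic described in this issue. The band()
// helper and its rounding are assumptions for illustration, not the
// allocator's real code.
package main

import (
	"fmt"
	"math"
)

// band returns the acceptable [lo, hi] replica-count range: within 5% of
// the cluster-wide mean replica count.
func band(totalReplicas, nodes int) (lo, hi int) {
	mean := float64(totalReplicas) / float64(nodes)
	return int(math.Round(mean * 0.95)), int(math.Round(mean * 1.05))
}

func main() {
	lo, hi := band(999, 10) // mean 99.9 -> [95, 105]
	fmt.Printf("acceptable replica counts: [%d, %d]\n", lo, hi)

	// A node at 105 replicas is acceptable; after a split of a range it
	// holds, it has 106 replicas and is overfull, so rebalancing starts.
	fmt.Println("105 acceptable:", 105 >= lo && 105 <= hi) // true
	fmt.Println("106 acceptable:", 106 >= lo && 106 <= hi) // false

	// The split leaves the node's live bytes (nearly) unchanged, so a
	// bytes-based heuristic would see no imbalance at all.
}
```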