docs/RFCS: leaseholder rebalancing #10262
Conversation
Cc @cockroachdb/stability
Seems like a good first step.
> cluster. The leaseholder for a range has extra duties when compared to
> a follower: it performs all reads for a range and proposes almost all
> writes. [Proposer evaluated KV](propser_evaluated_kv.md) will remove
> more write work from followers exacerbating the difference between
s/more write/read/
Yeah, I guess technically it removes read work, though my intention was to note that writes become cheaper on followers (because they aren't doing the reads necessary during KV processing, which happen on the proposer). I'll try to clarify the wording.
> the problem of evenly distributing leaseholders across a cluster will
> also address the inability to rebalance a replica away from the
> leaseholder as we'll always have sufficient non-leaseholder replicas
> in order to perform rebalancing.
I didn't quite follow this statement/problem being solved
I just read this again and it made sense to me. Can you elaborate on what is confusing? Happy to discuss this in person next week too.
After reading the first motivation I understand what is meant here. I think it can be clarified by saying:
> Note that addressing the second motivation of evenly distributing leaseholders across a cluster will also address the primary motivation of the inability to rebalance a replica away from the leaseholder as we'll always have sufficient non-leaseholder replicas in order to perform rebalancing.

Or perhaps naming these motivations.
Done.
> the least loaded follower less than the mean.
> 6. If the leaseholder store has a leaseholder count above the mean and
> one of the followers has an underfull leaseholder count transfer the
> lease to the least loaded follower.
Why do you need both 5 and 6?
This is following the same logic that we use for rebalancing replicas. If you don't have both of these rules you can get into situations where we're not rebalancing to an underfull store because other stores are above the mean but not overfull.
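To make the interplay between the two rules concrete, here is a minimal sketch of how steps 5 and 6 might combine, assuming a single `rebalanceThreshold` fraction around the mean and hypothetical names (`storeLeases`, `shouldTransferLease`); this is illustrative only, not the actual allocator code:

```go
// Hypothetical sketch only; names and the threshold are assumptions,
// not the actual CockroachDB allocator code.
package allocator

// rebalanceThreshold is the fractional band around the mean within which
// a store is considered neither overfull nor underfull (assumed value).
const rebalanceThreshold = 0.05

type storeLeases struct {
	storeID    int
	leaseCount float64
}

func meanLeases(stores []storeLeases) float64 {
	var sum float64
	for _, s := range stores {
		sum += s.leaseCount
	}
	return sum / float64(len(stores))
}

func overfull(count, mean float64) bool  { return count > mean*(1+rebalanceThreshold) }
func underfull(count, mean float64) bool { return count < mean*(1-rebalanceThreshold) }

// shouldTransferLease applies rules 5 and 6: it returns the follower the
// lease should move to, or ok=false if the lease should stay put.
func shouldTransferLease(leaseholder storeLeases, followers, all []storeLeases) (target storeLeases, ok bool) {
	if len(followers) == 0 {
		return storeLeases{}, false
	}
	m := meanLeases(all)

	// Find the least loaded follower.
	least := followers[0]
	for _, f := range followers[1:] {
		if f.leaseCount < least.leaseCount {
			least = f
		}
	}

	// Rule 5: the leaseholder store is overfull and the least loaded
	// follower is below the mean.
	if overfull(leaseholder.leaseCount, m) && least.leaseCount < m {
		return least, true
	}
	// Rule 6: the leaseholder store is above the mean (but not overfull)
	// and the least loaded follower is underfull. Without this rule, an
	// underfull store would never receive leases from stores that sit
	// above the mean without crossing the overfull threshold.
	if leaseholder.leaseCount > m && underfull(least.leaseCount, m) {
		return least, true
	}
	return storeLeases{}, false
}
```

With both rules, a store above the mean can still shed a lease to a genuinely underfull follower even when the store itself is not overfull, which is the situation described above.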
It can benefit from a sentence about overfull and underfull or a reference to another RFC about them.
Done.
> At a high-level, expected load on a node is proportional to the number
> of replicas/leaseholders on the node. But a more accurate description
> is that it is proportional to the number of bytes on the node. Rather
I believe we already do this to prevent a range split from triggering rebalancing.
Which part do you think we're already doing? Range splits can definitely trigger rebalancing for exactly the reason mentioned in this paragraph. See #9435 for more details.
@BramGruneir did you fix this and forget to close the issue?
The current code balances on replica counts. There are no inflight PRs which change this AFAIK.
docs/RFCS/leaseholder_rebalancing.md, line 98 at r2 (raw file):
I've got a few questions on this:
As a simple alternative, why not just always transfer the lease after any up-replicate? This way, the lease transfer piggybacks on rebalancing directly. Since we already rebalance to less full stores, this should allow clusters to rebalance away from overstuffed nodes. It would also benefit from any load-based rebalancing we're planning to do in the future.
Yeah, I've been thinking about this too. The short answer is I'm not sure. What is proposed in this RFC is almost certainly going to be better than the current state of the world and doesn't preclude an additional mechanism to avoid a thundering herd. I'm less worried about a thundering herd style problem than leaseholder thrashing. I think the latter can be avoided by sending an RPC to the candidates before the lease transfer to find out their current lease counts, though I'm not sure this is necessary (perhaps the periodic gossip of lease counts will be sufficient).
This isn't entirely accurate. A range can require rebalancing but be unable to rebalance because the replica we want to rebalance is the leaseholder. In practice, replica rebalancing and leaseholder transfers will likely interleave (i.e. transfer a lease, then rebalance the replica).
I don't understand the first sentence, can you expand or rephrase it? Where are you proposing to add randomization?
We can't actually piggyback lease transfers on rebalancing as they are separate operations, but I don't think that's what you meant. Tying lease transfers to rebalancing is problematic because we need to move leases in some scenarios where no rebalancing is happening. Here are a few examples:
docs/RFCS/leaseholder_rebalancing.md, line 75 at r3 (raw file):
It'd be clearer if overfull / underfull were defined with their first use or in a paragraph that precedes their first use.
docs/RFCS/leaseholder_rebalancing.md, line 106 at r3 (raw file):
It also becomes harder to reason about the interactions between the metrics used for rebalancing as we add more of them. While we should probably add load-based rebalancing at some point, care should be taken to understand and model the different scenarios when each metric will be the deciding factor for rebalancing.
The lease transfer request could be denied, similar to the snapshot. But I'm not sure if that's a good idea. I was thinking some more on this: since you can't add a lease to just any store, only to ones that already have a replica, there's an upper limit to the thundering herd. But you're right, thrashing might become rampant. Perhaps we should do the opposite and make leases more sticky, e.g. requiring X plus a random number of extensions before starting a transfer should eliminate that problem.
Ah, good point.
Exactly, let's add some randomization. Something simple, e.g. on the first extension of the lease we add a 50% chance of transfer, then on the next a 75% chance, etc. Always increasing the chance that the lease transfer happens.
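A minimal sketch of that escalating schedule, assuming hypothetical names (`transferProbability`, `maybeTransfer`) and the 50%/75%/... progression from the comment above; note that, as pointed out below, epoch-based leases would need a different trigger since most leases won't be extended at all:

```go
// Illustrative only: the function names and the exact schedule are
// assumptions, not an existing CockroachDB API.
package lease

import "math/rand"

// transferProbability halves the "keep the lease" probability on every
// extension: 0.5 after the first extension, 0.75 after the second,
// 0.875 after the third, and so on.
func transferProbability(extensions int) float64 {
	keep := 1.0
	for i := 0; i < extensions; i++ {
		keep /= 2
	}
	return 1 - keep
}

// maybeTransfer randomly decides whether this extension should trigger a
// lease transfer.
func maybeTransfer(extensions int, rng *rand.Rand) bool {
	return rng.Float64() < transferProbability(extensions)
}
```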
The third case here is the most interesting one, but also not what we are trying to address in this RFC. The others can be addressed by adding a mild amount of randomness to lease extension, which seems like a much simpler fix. As for the third scenario, I do think that performing a lease transfer after every up-replicate makes some sense as well, since we will most likely move data to where it is being accessed the most and pass the lease along. Clearly this won't work for completely stable clusters, and that's where adding some randomness to lease extension makes sense.
With epoch-based leases there won't be lease extensions for the vast majority of leases.
Before, you were suggesting that we piggyback lease transfers on replica transfers. Now you're proposing to randomly transfer leases during lease extensions. This latter idea does not mesh with epoch-based leases, which are coming very soon.
docs/RFCS/leaseholder_rebalancing.md, line 75 at r3 (raw file):
@BramGruneir Btw, rebalancing during lease acquisition/extension is mentioned in the
Yeah. So we will need some form of active lease transfer.
Let me clarify. The first suggestion is that we should always perform a lease transfer immediately after a successful up-replicate. This will ensure that leases always make their way to underfilled stores. The second is that instead of looking at the state of the world, which is bound to be out of date, we consider performing lease transfers randomly. If we were not moving to epoch-based leases, the most appropriate place would have been during lease extension, so with epoch-based leases we should come up with a new mechanism.

I agree that the replicate queue (or its own queue) is the best way to go. But I disagree with doing any smart rebalancing of the leases at this point. Randomization will be the easiest way to avoid thrashing and herding. Let's not look at overfull or underfull stores and just assign leases randomly (with a decaying chance of them not moving, as mentioned in my previous comment).

There are two types of systems we're trying to solve for. The first is a stable system: 3 nodes, no rebalancing happening. In those cases, randomly transferring leases should work wonderfully. The second type of system is one that is large, ever changing, and undergoing a small amount of continuous rebalancing. For those, transferring leases post up-replicate makes a lot of sense, but we might already be optimizing too much in that case. As I type this, I keep thinking that all we need is the random lease transferring.

As for geographically distributed replicas, I think it's too ambitious to try to solve for that now. I think lease transfer isn't going to solve the problem, and there may be other ways to speed up reads/writes without going in that direction.
The overfull/underfull heuristics seem to be working fine for replica rebalancing. I think it would be riskier to introduce a new heuristic for leaseholder rebalancing (the decaying randomization factor) than to adapt existing heuristics for a new purpose.
Perhaps in the short term, but we're almost certainly going to want more intelligence here fairly soon (e.g. for load-based and locality-based rebalancing).
I agree. That's why this is mentioned in the
Sure. But it takes a good amount of time to perform a snapshot. Lease transfers should be much quicker and may account for a lot of thrashing. We might need to add something to slow this down. Just something to think about when we observe this in a real-world scenario.
Makes sense.
Agreed. I think I mentioned this elsewhere.
Another signal that could be used in conjunction with the proposed heuristic is the time since the lease was last transferred. If we disallow frequent transfers we can prevent thrashing and enforce an upper bound on the rate of "hiccups". And this value can help us choose which lease to transfer from an overfull store.
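A rough sketch of that guard, with an assumed cool-down constant and hypothetical names; this is just the shape of the check, not a proposed implementation:

```go
// Illustrative sketch; minLeaseTransferInterval and the struct/field
// names are assumptions.
package lease

import "time"

// minLeaseTransferInterval is an assumed cool-down bounding how often a
// single range's lease may move, and therefore the rate of "hiccups".
const minLeaseTransferInterval = 10 * time.Minute

type leaseState struct {
	lastTransfer time.Time // when the lease for this range last moved
}

// canTransfer reports whether enough time has passed since the last
// transfer. Among eligible leases on an overfull store, the one with the
// oldest lastTransfer could be chosen first.
func (l *leaseState) canTransfer(now time.Time) bool {
	return now.Sub(l.lastTransfer) >= minLeaseTransferInterval
}
```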
docs/RFCS/leaseholder_rebalancing.md, line 18 at r1 (raw file):
s/for which a node is the leaseholder/which holds a range lease/ unless this phrasing is meaningful in the context of node leases in a way which I don't quite understand.
@RaduBerinde Added a note about that idea to the
docs/RFCS/leaseholder_rebalancing.md, line 18 at r1 (raw file):
That wouldn't work well with the current implementation of transfers. The old lease holder effectively destroys its lease at the start of the process, and doesn't get any direct feedback from the transferee. We'd need to redesign the transfer process to make it possible for the target to reject the transfer.
docs/RFCS/leaseholder_rebalancing.md, line 45 at r5 (raw file):
This is written as if there is only one gateway node for the range, which is untrue. The doc should acknowledge that there are likely to be multiple active gateways and we're trying to minimize the aggregate RTT. This goal is at odds with the previous two: in a geographically distributed deployment with load that follows the sun, concentrating all the leaseholders in a single datacenter is going to be best for latency but worst for throughput. This may be a desirable tradeoff for some applications. This isn't the "last motivation" for "ensuring leaseholders are distributed throughout a cluster", it's a motivation for doing something different, so this should be called out as a non-goal or alternative instead of being listed as a motivation for this RFC.
docs/RFCS/leaseholder_rebalancing.md, line 51 at r5 (raw file):
I think this would read better with
docs/RFCS/leaseholder_rebalancing.md, line 58 at r5 (raw file):
s/leaseholder_count/lease_count/g?
docs/RFCS/leaseholder_rebalancing.md, line 60 at r5 (raw file):
A simple way to prevent thrashing would be to use a large value for
docs/RFCS/leaseholder_rebalancing.md, line 75 at r5 (raw file):
The potential targets for a lease transfer are very limited: just two (or four) replicas instead of all the stores in the cluster. Ranges could end up with all of their replicas on stores that are on the same side of the mean lease count. Should we consider lease count in replica placement too (i.e. prefer to place replicas on stores that are below the mean for both range count and lease count)? I think we're unlikely to get stuck in an equilibrium where we can't achieve balance, but considering the two factors independently and sequentially might lead to more thrashing before things stabilize.
docs/RFCS/leaseholder_rebalancing.md, line 105 at r5 (raw file):
I don't think it's true that load is generally proportional to byte size. One of the reasons an admin might reduce the range split size for a table is because that table has more load relative to its size and needs to be spread across more stores. Counting bytes rather than ranges would have exactly the wrong result in this case. I think that until we have metrics for actual load, range count is the best approximation we have.
docs/RFCS/leaseholder_rebalancing.md, line 144 at r5 (raw file):
Random transfers also have the disadvantage of causing minor availability disruptions. The system should be able to reach an equilibrium in which lease transfers are rare.