docs/RFCS: leaseholder rebalancing #10262
Conversation
Cc @cockroachdb/stability
Seems like a good first step.
> cluster. The leaseholder for a range has extra duties when compared to
> a follower: it performs all reads for a range and proposes almost all
> writes. [Proposer evaluated KV](propser_evaluated_kv.md) will remove
> more write work from followers exacerbating the difference between
s/more write/read/
Yeah, I guess technically it removes read work, though my intention was to note that writes become cheaper on followers (because they aren't doing the reads necessary during KV processing, which happen on the proposer). I'll try to clarify the wording.
> the problem of evenly distributing leaseholders across a cluster will
> also address the inability to rebalance a replica away from the
> leaseholder as we'll always have sufficient non-leaseholder replicas
> in order to perform rebalancing.
I didn't quite follow this statement/problem being solved
I just read this again and it made sense to me. Can you elaborate on what is confusing? Happy to discuss this in person next week too.
After reading the first motivation I understand what is meant here. I think it can be clarified by saying:
> Note that addressing the second motivation of evenly distributing leaseholders across a cluster will also address the primary motivation of the inability to rebalance a replica away from the leaseholder as we'll always have sufficient non-leaseholder replicas in order to perform rebalancing.

Or perhaps naming these motivations.
Done.
> the least loaded follower less than the mean.
> 6. If the leaseholder store has a leaseholder count above the mean and
> one of the followers has an underfull leaseholder count transfer the
> lease to the least loaded follower.
Why do you need both 5 and 6?
This is following the same logic that we use for rebalancing replicas. If you don't have both of these rules you can get into situations where we're not rebalancing to an underfull store because other stores are above the mean but not overfull.
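To make the interplay between the two rules concrete, here is a minimal sketch of how steps 5 and 6 might combine, assuming a single `rebalanceThreshold` fraction around the mean and hypothetical names (`storeLeases`, `shouldTransferLease`); this is illustrative only, not the actual allocator code:

```go
// Hypothetical sketch only; names and the threshold are assumptions,
// not the actual CockroachDB allocator code.
package allocator

// rebalanceThreshold is the fractional band around the mean within which
// a store is considered neither overfull nor underfull (assumed value).
const rebalanceThreshold = 0.05

type storeLeases struct {
	storeID    int
	leaseCount float64
}

func meanLeases(stores []storeLeases) float64 {
	var sum float64
	for _, s := range stores {
		sum += s.leaseCount
	}
	return sum / float64(len(stores))
}

func overfull(count, mean float64) bool  { return count > mean*(1+rebalanceThreshold) }
func underfull(count, mean float64) bool { return count < mean*(1-rebalanceThreshold) }

// shouldTransferLease applies rules 5 and 6: it returns the follower the
// lease should move to, or ok=false if the lease should stay put.
func shouldTransferLease(leaseholder storeLeases, followers, all []storeLeases) (target storeLeases, ok bool) {
	if len(followers) == 0 {
		return storeLeases{}, false
	}
	m := meanLeases(all)

	// Find the least loaded follower.
	least := followers[0]
	for _, f := range followers[1:] {
		if f.leaseCount < least.leaseCount {
			least = f
		}
	}

	// Rule 5: the leaseholder store is overfull and the least loaded
	// follower is below the mean.
	if overfull(leaseholder.leaseCount, m) && least.leaseCount < m {
		return least, true
	}
	// Rule 6: the leaseholder store is above the mean (but not overfull)
	// and the least loaded follower is underfull. Without this rule, an
	// underfull store would never receive leases from stores that sit
	// above the mean without crossing the overfull threshold.
	if leaseholder.leaseCount > m && underfull(least.leaseCount, m) {
		return least, true
	}
	return storeLeases{}, false
}
```

With both rules, a store above the mean can still shed a lease to a genuinely underfull follower even when the store itself is not overfull, which is the situation described above.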
It can benefit from a sentence about overfull and underfull or a reference to another RFC about them.
Done.
> At a high-level, expected load on a node is proportional to the number
> of replicas/leaseholders on the node. But a more accurate description
> is that it is proportional to the number of bytes on the node. Rather
I believe we already do this to prevent a range split from triggering rebalancing.
Which part do you think we're already doing? Range splits can definitely trigger rebalancing for exactly the reason mentioned in this paragraph. See #9435 for more details.
@BramGruneir did you fix this and forget to close the issue?
The current code balances on replica counts. There are no inflight PRs which change this AFAIK.
docs/RFCS/leaseholder_rebalancing.md, line 98 at r2 (raw file):
I've got a few questions on this:
As a simple alternative, why not just always transfer the lease after any up-replicate? This way, the lease transfer piggybacks on rebalancing directly. Since we already rebalance to less full stores, this should allow clusters to rebalance away from overstuffed nodes. It would also benefit from any load-based rebalancing we're planning to do in the future.
Yeah, I've been thinking about this too. The short answer is I'm not sure. What is proposed in this RFC is almost certainly going to be better than the current state of the world and doesn't preclude an additional mechanism to avoid a thundering herd. I'm less worried about a thundering herd style problem than leaseholder thrashing. I think the latter can be avoided by sending an RPC to the candidates before the lease transfer to find out their current lease counts, though I'm not sure this is necessary (perhaps the periodic gossip of lease counts will be sufficient).
This isn't entirely accurate. A range can require rebalancing but be unable to rebalance because the replica we want to rebalance is the leaseholder. In practice, replica rebalancing and leaseholder transfers will likely interleave (i.e. transfer a lease, then rebalance the replica).
I don't understand the first sentence, can you expand or rephrase it? Where are you proposing to add randomization?
We can't actually piggyback lease transfers on rebalancing as they are separate operations, but I don't think that's what you meant. Tying lease transfers to rebalancing is problematic because we need to move leases in some scenarios where no rebalancing is happening. Here are a few examples:
docs/RFCS/leaseholder_rebalancing.md, line 75 at r3 (raw file):
It'd be clearer if overfull / underfull were defined with their first use or in a paragraph that precedes their first use.
docs/RFCS/leaseholder_rebalancing.md, line 106 at r3 (raw file):
It also becomes harder to reason about the interactions between the metrics used for rebalancing as we add more of them. While we should probably add load-based rebalancing at some point, care should be taken to understand and model the different scenarios when each metric will be the deciding factor for rebalancing.
The lease transfer request could be denied, similar to the snapshot. But I'm not sure if that's a good idea. I was thinking some more on this: since you can't add a lease to just any store, only to ones that already have a replica, there's an upper limit to the thundering herd. But you're right, thrashing might become rampant. Perhaps we should do the opposite and make leases more sticky, e.g. requiring X plus a random number of extensions before starting a transfer should eliminate that problem.
Ah, good point.
Exactly, let's add some randomization. Something simple, e.g. on the first extension of the lease we add a 50% chance of transfer, then on the next a 75% chance, etc. Always increasing the chance that the lease transfer happens.
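A minimal sketch of that escalating schedule, assuming hypothetical names (`transferProbability`, `maybeTransfer`) and the 50%/75%/... progression from the comment above; note that, as pointed out below, epoch-based leases would need a different trigger since most leases won't be extended at all:

```go
// Illustrative only: the function names and the exact schedule are
// assumptions, not an existing CockroachDB API.
package lease

import "math/rand"

// transferProbability halves the "keep the lease" probability on every
// extension: 0.5 after the first extension, 0.75 after the second,
// 0.875 after the third, and so on.
func transferProbability(extensions int) float64 {
	keep := 1.0
	for i := 0; i < extensions; i++ {
		keep /= 2
	}
	return 1 - keep
}

// maybeTransfer randomly decides whether this extension should trigger a
// lease transfer.
func maybeTransfer(extensions int, rng *rand.Rand) bool {
	return rng.Float64() < transferProbability(extensions)
}
```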
The third case here is the most interesting one, but also not what we are trying to address in this RFC. The others can be addressed by adding a mild amount of randomness to lease extension, which seems like a much simpler fix. As for the third scenario, I do think that performing a lease transfer after every up-replicate makes some sense as well, since we will most likely move data to where it is being accessed the most and pass the lease along. Clearly this won't work for completely stable clusters, and that's where adding some randomness to lease extension makes sense.
With epoch-based leases there won't be lease extensions for the vast majority of leases.
Before, you were suggesting that we piggyback lease transfers on replica transfers. Now you're proposing to randomly transfer leases during lease extensions. This latter idea does not mesh with epoch-based leases, which are coming very soon.
docs/RFCS/leaseholder_rebalancing.md, line 75 at r3 (raw file):
@BramGruneir Btw, rebalancing during lease acquisition/extension is mentioned in the
Yeah. So we will need some form of active lease transfer.
Let me clarify. The first suggestion is that we should always perform a lease transfer immediately after a successful up-replicate. This will ensure that leases always make their way to underfilled stores. The second is that instead of looking at the state of the world, which is bound to be out of date, we consider performing lease transfers randomly. If we were not moving to epoch-based leases, the most appropriate place would have been during lease extension, so with epoch-based leases we should come up with a new mechanism.

I agree that the replicate queue (or its own queue) is the best way to go. But I disagree with doing any smart rebalancing of the leases at this point. Randomization will be the easiest way to avoid thrashing and herding. Let's not look at overfull or underfull stores and just assign leases randomly (with a decaying chance of them not moving, as mentioned in my previous comment).

There are two types of systems we're trying to solve for. The first is a stable system: 3 nodes, no rebalancing happening. In those cases, randomly transferring leases should work wonderfully. The second type of system is one that is large, ever changing, and undergoing a small amount of continuous rebalancing. For those, transferring leases post up-replicate makes a lot of sense, but we might already be optimizing too much in that case. As I type this, I keep thinking that all we need is the random lease transferring.

As for geographically distributed replicas, I think it's too ambitious to try to solve for that now. I think lease transfer isn't going to solve the problem, and there may be other ways to speed up reads/writes without going in that direction.
The overfull/underfull heuristics seem to be working fine for replica rebalancing. I think it would be riskier to introduce a new heuristic for leaseholder rebalancing (the decaying randomization factor) than to adapt existing heuristics for a new purpose.
Perhaps in the short term, but we're almost certainly going to want more intelligence here fairly soon (e.g. for load-based and locality-based rebalancing).
I agree. That's why this is mentioned in the
Sure. But it takes a good amount of time to perform a snapshot. Lease transfers should be much quicker and may account for a lot of thrashing. We might need to add something to slow this down. Just something to think about when we observe this in a real-world scenario.
Makes sense.
Agreed. I think I mentioned this elsewhere.
Another signal that could be used in conjunction with the proposed heuristic is the time since the lease was last transferred. If we disallow frequent transfers we can prevent thrashing and enforce an upper bound on the rate of "hiccups". And this value can help us choose which lease to transfer from an overfull store.
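A rough sketch of that guard, with an assumed cool-down constant and hypothetical names; this is just the shape of the check, not a proposed implementation:

```go
// Illustrative sketch; minLeaseTransferInterval and the struct/field
// names are assumptions.
package lease

import "time"

// minLeaseTransferInterval is an assumed cool-down bounding how often a
// single range's lease may move, and therefore the rate of "hiccups".
const minLeaseTransferInterval = 10 * time.Minute

type leaseState struct {
	lastTransfer time.Time // when the lease for this range last moved
}

// canTransfer reports whether enough time has passed since the last
// transfer. Among eligible leases on an overfull store, the one with the
// oldest lastTransfer could be chosen first.
func (l *leaseState) canTransfer(now time.Time) bool {
	return now.Sub(l.lastTransfer) >= minLeaseTransferInterval
}
```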
docs/RFCS/leaseholder_rebalancing.md, line 18 at r1 (raw file):
s/for which a node is the leaseholder/which holds a range lease/ unless this phrasing is meaningful in the context of node leases in a way which I don't quite understand.
@RaduBerinde Added a note about that idea to the
docs/RFCS/leaseholder_rebalancing.md, line 18 at r1 (raw file):
That wouldn't work well with the current implementation of transfers. The old lease holder effectively destroys its lease at the start of the process, and doesn't get any direct feedback from the transferee. We'd need to redesign the transfer process to make it possible for the target to reject the transfer.
docs/RFCS/leaseholder_rebalancing.md, line 45 at r5 (raw file):
This is written as if there is only one gateway node for the range, which is untrue. The doc should acknowledge that there are likely to be multiple active gateways and we're trying to minimize the aggregate RTT. This goal is at odds with the previous two: in a geographically distributed deployment with load that follows the sun, concentrating all the leaseholders in a single datacenter is going to be best for latency but worst for throughput. This may be a desirable tradeoff for some applications. This isn't the "last motivation" for "ensuring leaseholders are distributed throughout a cluster", it's a motivation for doing something different, so this should be called out as a non-goal or alternative instead of being listed as a motivation for this RFC.
docs/RFCS/leaseholder_rebalancing.md, line 51 at r5 (raw file):
I think this would read better with
docs/RFCS/leaseholder_rebalancing.md, line 58 at r5 (raw file):
s/leaseholder_count/lease_count/g?
docs/RFCS/leaseholder_rebalancing.md, line 60 at r5 (raw file):
A simple way to prevent thrashing would be to use a large value for
docs/RFCS/leaseholder_rebalancing.md, line 75 at r5 (raw file):
The potential targets for a lease transfer are very limited: just two (or four) replicas instead of all the stores in the cluster. Ranges could end up with all of their replicas on stores that are on the same side of the mean lease count. Should we consider lease count in replica placement too (i.e. prefer to place replicas on stores that are below the mean for both range count and lease count)? I think we're unlikely to get stuck in an equilibrium where we can't achieve balance, but considering the two factors independently and sequentially might lead to more thrashing before things stabilize.
docs/RFCS/leaseholder_rebalancing.md, line 105 at r5 (raw file):
I don't think it's true that load is generally proportional to byte size. One of the reasons an admin might reduce the range split size for a table is because that table has more load relative to its size and needs to be spread across more stores. Counting bytes rather than ranges would have exactly the wrong result in this case. I think that until we have metrics for actual load, range count is the best approximation we have.
docs/RFCS/leaseholder_rebalancing.md, line 144 at r5 (raw file):
Random transfers also have the disadvantage of causing minor availability disruptions. The system should be able to reach an equilibrium in which lease transfers are rare.