- Feature Name: Leaseholder Rebalancing
- Status: draft
- Start Date: 2016-10-26
- Authors: Peter Mattis
- RFC PR: [#10262](https://github.com/cockroachdb/cockroach/pull/10262)
- Cockroach Issue: [#9462](https://github.com/cockroachdb/cockroach/issues/9462)

# Summary

Periodically rebalance range leaseholders in order to distribute the
per-leaseholder work.

# Motivation

The primary motivation for distributing leaseholders throughout a
cluster is to avoid scenarios in which a node is unable to rebalance
replicas away because of the restriction that we refuse to rebalance a
replica away from the node that holds its lease. This restriction
exists in order to prevent an availability hiccup on the range when
the leaseholder is removed from it.

It is interesting to note a problematic behavior of the current
system. The current leaseholder will extend its lease as long as it is
receiving operations for a range, and when a range is split, the lease
for the left-hand side of the split is cloned and given to the
right-hand side. The combined effect is that a newly created cluster
with continuous load applied against it will see a single node slurp
up all of the range leases, which causes a severe replica imbalance
(since we can't rebalance away from the leaseholder) as well as a
performance bottleneck. We actually see increased performance by
periodically killing nodes in the cluster.

The second motivation is to more evenly distribute load in a cluster.
The leaseholder for a range has extra duties compared to a follower:
it performs all reads for the range and proposes almost all writes.
[Proposer evaluated KV](proposer_evaluated_kv.md) will reduce the cost
of write KV operations on followers, exacerbating the difference
between leaseholders and followers. These extra duties impose
additional load on the leaseholder, making it desirable to spread that
load throughout the cluster in order to improve performance.

The last motivation is to place the leaseholder for a range near the
gateway node that is accessing the range in order to minimize network
RTT. As an obvious example: it is preferable for the leaseholder to be
in the same datacenter as the gateway node.

# Detailed design

This RFC is intended to address the first two motivations and punt on
the last one (load-based leaseholder placement). Note that addressing
the problem of evenly distributing leaseholders across a cluster will
also address the inability to rebalance a replica away from the
leaseholder, since we'll always have sufficient non-leaseholder
replicas with which to perform rebalancing.

Leaseholder rebalancing will be performed using a mechanism similar to
replica rebalancing. The periodically gossiped `StoreCapacity` proto
will be extended with a `leaseholder_count` field. Currently,
`replicateQueue` has the following logic:

1. If the range has dead replicas, remove them.
2. If the range is under-replicated, add a replica.
3. If the range is over-replicated, remove a replica.
4. If the range needs rebalancing, add a replica.

The proposal is to add the following logic (after the above replica
rebalancing logic); a sketch of the resulting heuristic follows the
list:

5. If the leaseholder is on an overfull store, transfer the lease to
the least loaded follower that is below the mean.
6. If the leaseholder's store has a leaseholder count above the mean
and one of the followers has an underfull leaseholder count, transfer
the lease to the least loaded follower.

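Below is a minimal sketch of how steps 5 and 6 might be evaluated. The
types, field names, and the 5% overfull/underfull thresholds are
illustrative assumptions, not the actual CockroachDB code; the real
input would be the gossiped `StoreCapacity` with the proposed
`leaseholder_count` field.

```go
package allocator

// storeInfo is a hypothetical stand-in for the gossiped StoreCapacity data;
// the real proto would carry the proposed leaseholder_count field alongside
// the existing capacity fields.
type storeInfo struct {
	StoreID          int
	LeaseholderCount int
}

// Illustrative thresholds: treat a store as "overfull" when it holds 5% more
// leases than the mean and "underfull" when it holds 5% fewer. The real
// fractions would need tuning.
const threshold = 0.05

func meanLeases(stores []storeInfo) float64 {
	var total int
	for _, s := range stores {
		total += s.LeaseholderCount
	}
	return float64(total) / float64(len(stores))
}

// maybeTransferLeaseTarget returns the store ID of the follower that should
// receive the lease, or -1 if no transfer is warranted. leaseholder is the
// current leaseholder's store, followers are the stores holding the range's
// other replicas, and allStores is used to compute the cluster-wide mean.
func maybeTransferLeaseTarget(leaseholder storeInfo, followers, allStores []storeInfo) int {
	m := meanLeases(allStores)
	overfull := m * (1 + threshold)
	underfull := m * (1 - threshold)

	// Find the least loaded follower and note whether any follower is underfull.
	best, bestCount := -1, 0
	anyUnderfull := false
	for _, f := range followers {
		if best == -1 || f.LeaseholderCount < bestCount {
			best, bestCount = f.StoreID, f.LeaseholderCount
		}
		if float64(f.LeaseholderCount) < underfull {
			anyUnderfull = true
		}
	}
	if best == -1 {
		return -1
	}

	// Step 5: the leaseholder is on an overfull store; transfer to the least
	// loaded follower, provided that follower is below the mean.
	if float64(leaseholder.LeaseholderCount) > overfull && float64(bestCount) < m {
		return best
	}
	// Step 6: the leaseholder is above the mean and some follower is
	// underfull; transfer to the least loaded follower.
	if float64(leaseholder.LeaseholderCount) > m && anyUnderfull {
		return best
	}
	return -1
}
```

Requiring the step 5 target to be below the mean helps avoid handing
the lease to a store that is itself already heavily loaded.
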
# Testing

Individual rebalancing heuristics can be unit tested, but seeing how
those heuristics interact with a real cluster can often reveal
surprising behavior. We have an existing allocation simulation
framework, but it has seen infrequent use. As an alternative, the
`zerosum` tool has been useful in examining rebalancing heuristics. We
propose to fork `zerosum` and create a new `allocsim` tool which will
create a local N-node cluster, with controls for generating load and
for using smaller range sizes, to test various rebalancing scenarios.

# Future Directions

We eventually need to provide load-based leaseholder placement, both
to place leaseholders close to gateway nodes and to more accurately
balance load across a cluster. Balancing load by balancing replica
counts or leaseholder counts does not capture differences in per-range
activity. For example, one table might be significantly more active
than others in the system, making it desirable to distribute the
ranges in that table more evenly.

At a high level, expected load on a node is proportional to the number
of replicas/leaseholders on the node. But a more accurate description
is that it is proportional to the number of bytes on the node. Rather
than balancing replica/leaseholder counts, we could balance based on
the range size (i.e. the "used-bytes" metric).

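As a small illustration of the distinction, a store's load could be
scored by summing the bytes of the replicas it holds rather than
counting them; the types below are hypothetical stand-ins for whatever
per-replica statistics the store already tracks.

```go
package allocator

// replicaStats is a hypothetical stand-in for the per-replica statistics a
// store tracks; SizeBytes corresponds to the "used-bytes" metric.
type replicaStats struct {
	RangeID   int
	SizeBytes int64
}

// countScore balances purely on replica count, while byteScore balances on
// the total bytes held, which better approximates expected load when range
// sizes are uneven.
func countScore(replicas []replicaStats) float64 { return float64(len(replicas)) }

func byteScore(replicas []replicaStats) float64 {
	var total int64
	for _, r := range replicas {
		total += r.SizeBytes
	}
	return float64(total)
}
```
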
The second idea is to account for actual load on ranges. The simple
approach to doing this is to maintain an exponentially decaying stat
of operations per range and to multiply this metric by the range size,
giving us a range momentum metric. We would then balance the range
momentum metric across nodes. There are difficulties in making this
work well, the primary one being that load (and thus momentum) can
change rapidly, and we want to avoid the system being overly sensitive
to such changes. Transferring leaseholders is relatively inexpensive,
but not free. Rebalancing a range is fairly heavyweight and can impose
a systemic drain on system resources if done too frequently.

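A sketch of what such a decaying per-range statistic might look like
follows; the half-life, the update rule, and the way the stat is
combined with range size are assumptions chosen for illustration, not
a description of existing code.

```go
package allocator

import (
	"math"
	"time"
)

// decayingCounter maintains an exponentially decaying count of operations
// against a range. Each recorded operation is discounted over time so that
// old activity contributes less than recent activity.
type decayingCounter struct {
	halfLife time.Duration
	value    float64
	lastTick time.Time
}

func newDecayingCounter(halfLife time.Duration) *decayingCounter {
	return &decayingCounter{halfLife: halfLife, lastTick: time.Now()}
}

// record adds n operations at time now, decaying the existing value first.
func (c *decayingCounter) record(n float64, now time.Time) {
	elapsed := now.Sub(c.lastTick).Seconds()
	decay := math.Pow(0.5, elapsed/c.halfLife.Seconds())
	c.value = c.value*decay + n
	c.lastTick = now
}

// momentum combines the decayed operation count with the range's size in
// bytes, giving the "range momentum" metric described above.
func momentum(ops *decayingCounter, rangeSizeBytes int64, now time.Time) float64 {
	elapsed := now.Sub(ops.lastTick).Seconds()
	decay := math.Pow(0.5, elapsed/ops.halfLife.Seconds())
	return ops.value * decay * float64(rangeSizeBytes)
}
```

The half-life is the knob that controls how sensitive the metric is to
rapid load changes: a longer half-life smooths out spikes at the cost
of reacting more slowly to genuine shifts in load.
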
Range momentum by itself does not aid in load-based leaseholder
placement. For that we'll need to pass additional information in each
`BatchRequest` indicating the locality of the originator of the
request, and to maintain per-range metrics about how much load a range
is seeing from each locality. The rebalancer would then attempt to
place leases such that they are spread across the localities they are
receiving load from, modulo the other placement constraints
(i.e. diversity).

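A sketch of the per-range bookkeeping this would require, assuming the
originating locality arrives as an opaque string on each request; the
type and method names here are hypothetical.

```go
package allocator

// localityLoad tracks, per range, how many requests have arrived from each
// originating locality, as reported by the gateway on each BatchRequest.
// The locality is treated as an opaque string such as "region=us-east1".
type localityLoad struct {
	counts map[string]int64
}

func newLocalityLoad() *localityLoad {
	return &localityLoad{counts: make(map[string]int64)}
}

// record notes one request originating from the given locality.
func (l *localityLoad) record(locality string) {
	l.counts[locality]++
}

// hottestLocality returns the locality contributing the most load; a
// load-based rebalancer could prefer placing the lease within or near that
// locality, subject to the usual placement constraints such as diversity.
func (l *localityLoad) hottestLocality() (string, int64) {
	var best string
	var bestCount int64
	for loc, n := range l.counts {
		if n > bestCount {
			best, bestCount = loc, n
		}
	}
	return best, bestCount
}
```
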
# Drawbacks

* The proposed leaseholder rebalancing mechanisms require a lease
transfer operation. We have such an operation for use in testing, but
it isn't yet ready for production use. This should be rectified soon.

# Alternatives

* Rather than placing the leaseholder rebalancing burden on
`replicateQueue`, we could perform rebalancing when leases are
acquired/extended. This would work with the current expiration-based
leases, but not with [epoch-based](range_leases.md) leases.

# Unresolved questions