diff --git a/docs/RFCS/leaseholder_rebalancing.md b/docs/RFCS/leaseholder_rebalancing.md
new file mode 100644
index 000000000000..c2cdebd6ede0
--- /dev/null
+++ b/docs/RFCS/leaseholder_rebalancing.md
@@ -0,0 +1,136 @@
+- Feature Name: Leaseholder Rebalancing
+- Status: draft
+- Start Date: 2016-10-26
+- Authors: Peter Mattis
+- RFC PR: [#10262](https://github.com/cockroachdb/cockroach/pull/10262)
+- Cockroach Issue: [#9462](https://github.com/cockroachdb/cockroach/issues/9462)
+
+# Summary
+
+Periodically rebalance range leaseholders in order to distribute the
+per-leaseholder work.
+
+# Motivation
+
+The primary motivation for ensuring leaseholders are distributed
+throughout a cluster is to avoid scenarios in which a node is unable
+to rebalance replicas away because of the restriction that we refuse
+to rebalance a replica for which that node is the leaseholder. This
+restriction is present in order to prevent an availability hiccup on
+the range when the leaseholder is removed from it.
+
+It is interesting to note a problematic behavior of the current
+system. The current leaseholder will extend its lease as long as it is
+receiving operations for a range. And when a range is split, the lease
+for the left-hand side of the split is cloned and given to the
+right-hand side of the split. The combined effect is that a newly
+created cluster that has continuous load applied against it will see a
+single node slurp up all of the range leases, which causes a severe
+replica imbalance (since we can't rebalance away from the leaseholder)
+as well as a performance bottleneck. We actually see increased
+performance by periodically killing nodes in the cluster.
+
+The second motivation is to more evenly distribute load in a
+cluster. The leaseholder for a range has extra duties when compared to
+a follower: it performs all reads for a range and proposes almost all
+writes. [Proposer evaluated KV](proposer_evaluated_kv.md) will reduce
+the cost of write KV operations on followers, exacerbating the
+difference between leaseholders and followers. These extra duties
+impose additional load on the leaseholder, making it desirable to
+spread that load throughout a cluster in order to improve performance.
+
+The last motivation is to place the leaseholder for a range near the
+gateway node that is accessing the range in order to minimize network
+RTT. As an obvious example: it is preferable for the leaseholder to be
+in the same datacenter as the gateway node.
+
+# Detailed design
+
+This RFC is intended to address the first two motivations and punt on
+the last one (load-based leaseholder placement). Note that addressing
+the problem of evenly distributing leaseholders across a cluster will
+also address the inability to rebalance a replica away from the
+leaseholder, as we'll always have sufficient non-leaseholder replicas
+in order to perform rebalancing.
+
+Leaseholder rebalancing will be performed using a similar mechanism to
+replica rebalancing. The periodically gossiped `StoreCapacity` proto
+will be extended with a `leaseholder_count` field. Currently,
+`replicateQueue` has the following logic:
+
+1. If the range has dead replicas, remove them.
+2. If the range is under-replicated, add a replica.
+3. If the range is over-replicated, remove a replica.
+4. If the range needs rebalancing, add a replica.
+
+The proposal is to add the following logic (after the above replica
+rebalancing logic), with a rough sketch of the combined heuristic
+following the list:
+
+5. If the leaseholder is on an overfull store, transfer the lease to
+the least loaded follower whose leaseholder count is below the mean.
+6. If the leaseholder's store has a leaseholder count above the mean
+and one of the followers has an underfull leaseholder count, transfer
+the lease to the least loaded follower.
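+
+To make the intent of steps 5 and 6 concrete, below is a minimal Go
+sketch of the transfer decision. It is illustrative only: the names
+(`StoreLeaseInfo`, `shouldTransferLease`) and the overfull/underfull
+thresholds are placeholders rather than existing `replicateQueue`
+code or values prescribed by this RFC; the real implementation would
+source these counts from the gossiped `StoreCapacity` protos and fold
+in the existing replica placement constraints.
+
+```go
+// Package allocator sketch: a hypothetical helper illustrating steps
+// 5 and 6 above. Names and thresholds are placeholders, not existing
+// CockroachDB code.
+package allocator
+
+// StoreLeaseInfo summarizes the gossiped per-store state this RFC
+// relies on (in practice, the extended StoreCapacity proto).
+type StoreLeaseInfo struct {
+	StoreID          int
+	LeaseholderCount int
+}
+
+// Placeholder thresholds around the mean; the RFC does not prescribe
+// exact values.
+const (
+	overfullFactor  = 1.05 // above the mean by 5% => overfull
+	underfullFactor = 0.95 // below the mean by 5% => underfull
+)
+
+// shouldTransferLease returns the follower store to transfer the
+// lease to, or ok=false if the lease should stay where it is.
+func shouldTransferLease(leaseholder StoreLeaseInfo, followers []StoreLeaseInfo) (target StoreLeaseInfo, ok bool) {
+	if len(followers) == 0 {
+		return StoreLeaseInfo{}, false
+	}
+
+	// Mean leaseholder count across the leaseholder and its followers.
+	total := leaseholder.LeaseholderCount
+	for _, f := range followers {
+		total += f.LeaseholderCount
+	}
+	mean := float64(total) / float64(len(followers)+1)
+
+	// Find the least loaded follower (fewest leaseholders).
+	least := followers[0]
+	for _, f := range followers[1:] {
+		if f.LeaseholderCount < least.LeaseholderCount {
+			least = f
+		}
+	}
+
+	lhCount := float64(leaseholder.LeaseholderCount)
+	leastCount := float64(least.LeaseholderCount)
+	switch {
+	case lhCount > mean*overfullFactor && leastCount < mean:
+		// Step 5: the leaseholder's store is overfull; move the lease
+		// to the least loaded follower that is below the mean.
+		return least, true
+	case lhCount > mean && leastCount < mean*underfullFactor:
+		// Step 6: the leaseholder's store is above the mean and a
+		// follower is underfull; move the lease to the least loaded
+		// follower.
+		return least, true
+	}
+	return StoreLeaseInfo{}, false
+}
+```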
+
+# Testing
+
+Individual rebalancing heuristics can be unit tested, but seeing how
+those heuristics interact with a real cluster can often reveal
+surprising behavior. We have an existing allocation simulation
+framework, but it has seen infrequent use. As an alternative, the
+`zerosum` tool has been useful in examining rebalancing heuristics. We
+propose to fork `zerosum` and create a new `allocsim` tool, which will
+create a local N-node cluster with controls for generating load and
+will use smaller range sizes to test various rebalancing scenarios.
+
+# Future Directions
+
+We eventually need to provide load-based leaseholder placement, both
+to place leaseholders close to gateway nodes and to more accurately
+balance load across a cluster. Balancing load by balancing replica
+counts or leaseholder counts does not capture differences in per-range
+activity. For example, one table might be significantly more active
+than others in the system, making it desirable to distribute the
+ranges in that table more evenly.
+
+At a high level, expected load on a node is proportional to the number
+of replicas/leaseholders on the node. But a more accurate description
+is that it is proportional to the number of bytes on the node. Rather
+than balancing replica/leaseholder counts, we could balance based on
+the range size (i.e. the "used-bytes" metric).
+
+The second idea is to account for actual load on ranges. The simple
+approach to doing this is to maintain an exponentially decaying stat
+of operations per range and to multiply this metric by the range size,
+giving us a range momentum metric. We would then balance the range
+momentum metric across nodes. There are difficulties in making this
+work well, the primary one being that load (and thus momentum) can
+change rapidly, and we want to avoid the system being overly sensitive
+to such changes. Transferring leaseholders is relatively inexpensive,
+but not free. Rebalancing a range is fairly heavyweight and can impose
+a systemic drain on system resources if done too frequently.
+
+Range momentum by itself does not aid in load-based leaseholder
+placement. For that we'll need to pass additional information in each
+`BatchRequest` indicating the locality of the originator of the
+request and to maintain per-range metrics about how much load a range
+is seeing from each locality. The rebalancer would then attempt to
+place leases such that they are spread among the localities they are
+receiving load from, modulo the other placement constraints
+(i.e. diversity).
+
+# Drawbacks
+
+* The proposed leaseholder rebalancing mechanisms require a transfer
+  lease operation. We have such an operation for use in testing, but
+  it isn't ready for use in production (yet). This should be rectified
+  soon.
+
+# Alternatives
+
+* Rather than placing the leaseholder rebalancing burden on
+  `replicateQueue`, we could perform rebalancing when leases are
+  acquired/extended. This would work with the current expiration-based
+  leases, but not with [epoch-based](range_leases.md) leases.
+
+# Unresolved questions