- Feature Name: Leaseholder Rebalancing
- Status: draft
- Start Date: 2016-10-26
- Authors: Peter Mattis
- RFC PR: [#10262](https://github.com/cockroachdb/cockroach/pull/10262)
- Cockroach Issue: [#9462](https://github.com/cockroachdb/cockroach/issues/9462)

# Summary

Periodically rebalance range leaseholders in order to distribute the
per-leaseholder work.

# Motivation

The primary motivation for ensuring leaseholders are distributed
throughout a cluster is to avoid scenarios in which a node is unable
to rebalance replicas away because of the restriction that we refuse
to rebalance a replica for which the node is the leaseholder. This
restriction is present in order to prevent an availability hiccup on
the range when the leaseholder is removed from it.

It is worth noting a problematic behavior of the current system. The
current leaseholder will extend its lease as long as it is receiving
operations for a range, and when a range is split, the lease for the
left-hand side of the split is cloned and given to the right-hand
side. The combined effect is that a newly created cluster with
continuous load applied against it will see a single node slurp up
all of the range leases, which causes a severe replica imbalance
(since we can't rebalance away from the leaseholder) as well as a
performance bottleneck. We actually see increased performance by
periodically killing nodes in such a cluster.

The second motivation is to more evenly distribute load in a
cluster. The leaseholder for a range has extra duties when compared to
a follower: it performs all reads for a range and proposes almost all
writes. [Proposer evaluated KV](proposer_evaluated_kv.md) will reduce
the cost of write KV operations on followers, exacerbating the
difference between leaseholders and followers. These extra duties
impose additional load on the leaseholder, making it desirable to
spread that load throughout a cluster in order to improve performance.

The last motivation is to place the leaseholder for a range near the
gateway node that is accessing the range in order to minimize network
RTT. As an obvious example: it is preferable for the leaseholder to be
in the same datacenter as the gateway node.

# Detailed design

This RFC is intended to address the first two motivations and punt on
the last one (load-based leaseholder placement). Note that addressing
the problem of evenly distributing leaseholders across a cluster will
also address the inability to rebalance a replica away from the
leaseholder, as we'll always have sufficient non-leaseholder replicas
with which to perform rebalancing.

Leaseholder rebalancing will be performed using a similar mechanism to
replica rebalancing. The periodically gossiped `StoreCapacity` proto
will be extended with a `leaseholder_count` field. Currently,
`replicateQueue` has the following logic:

1. If range has dead replicas, remove them.
2. If range is under-replicated, add a replica.
3. If range is over-replicated, remove a replica.
4. If the range needs rebalancing, add a replica.

The proposal is to add the following logic (after the above replica
rebalancing logic); a rough sketch of these two steps appears after
the list:

5. If the leaseholder is on an overfull store, transfer the lease to
the least loaded follower whose leaseholder count is below the mean.
6. If the leaseholder's store has a leaseholder count above the mean
and one of the followers has an underfull leaseholder count, transfer
the lease to the least loaded follower.
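
To make steps 5 and 6 concrete, here is a minimal sketch in Go. The
`storeLeases` type, the `chooseLeaseTarget` helper, and the 5%
overfull/underfull buffer are illustrative assumptions, not the actual
`replicateQueue` or `StoreCapacity` API; the mean is assumed to come
from the gossiped `leaseholder_count` values.

```go
// A minimal sketch of proposed steps 5 and 6. The storeLeases type and
// the 5% overfull/underfull buffer are illustrative assumptions.
package allocator

type storeLeases struct {
	storeID    int
	leaseCount int
}

// chooseLeaseTarget returns the store ID the lease should be
// transferred to, or -1 if no transfer is warranted. mean is the
// cluster-wide mean leaseholder count derived from gossiped
// leaseholder_count values.
func chooseLeaseTarget(leaseholder storeLeases, followers []storeLeases, mean float64) int {
	if len(followers) == 0 {
		return -1
	}
	overfull := mean * 1.05
	underfull := mean * 0.95

	// Find the least loaded follower.
	best := followers[0]
	for _, f := range followers[1:] {
		if f.leaseCount < best.leaseCount {
			best = f
		}
	}

	// Step 5: the leaseholder's store is overfull; transfer to the
	// least loaded follower whose leaseholder count is below the mean.
	if float64(leaseholder.leaseCount) > overfull && float64(best.leaseCount) < mean {
		return best.storeID
	}
	// Step 6: the leaseholder's store is above the mean and the least
	// loaded follower is underfull; transfer to it.
	if float64(leaseholder.leaseCount) > mean && float64(best.leaseCount) < underfull {
		return best.storeID
	}
	return -1
}
```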

# Testing

Individual rebalancing heuristics can be unit tested, but seeing how
those heuristics interact with a real cluster can often reveal
surprising behavior. We have an existing allocation simulation
framework, but it has seen infrequent use. As an alternative, the
`zerosum` tool has been useful in examining rebalancing heuristics. We
propose to fork `zerosum` and create a new `allocsim` tool, which will
create a local N-node cluster with controls for generating load,
using smaller range sizes to test various rebalancing scenarios.

# Future Directions

We eventually need to provide load-based leaseholder placement, both
to place leaseholders close to gateway nodes and to more accurately
balance load across a cluster. Balancing load by balancing replica
counts or leaseholder counts does not capture differences in per-range
activity. For example, one table might be significantly more active
than others in the system, making it desirable to distribute the
ranges in that table more evenly.

At a high level, expected load on a node is proportional to the number
of replicas/leaseholders on the node. But a more accurate description
is that it is proportional to the number of bytes on the node. Rather
than balancing replica/leaseholder counts, we could balance based on
the range size (i.e. the "used-bytes" metric).
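
A tiny sketch of what balancing on bytes rather than counts might look
like; the map-of-store-bytes representation and the threshold
parameter are assumptions for illustration only.

```go
// A small sketch of balancing on bytes instead of counts: compute each
// store's total used-bytes and compare it to the cluster mean. The
// types here are illustrative stand-ins, not the actual store pool API.
package allocator

// usedBytesMean returns the mean per-store used-bytes.
func usedBytesMean(storeBytes map[int]int64) float64 {
	if len(storeBytes) == 0 {
		return 0
	}
	var total int64
	for _, b := range storeBytes {
		total += b
	}
	return float64(total) / float64(len(storeBytes))
}

// overfullByBytes reports whether a store holds noticeably more data
// than the mean and is therefore a candidate to rebalance away from.
func overfullByBytes(storeBytes map[int]int64, storeID int, threshold float64) bool {
	return float64(storeBytes[storeID]) > usedBytesMean(storeBytes)*(1+threshold)
}
```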

The second idea is to account for actual load on ranges. The simple
approach to doing this is to maintain an exponentially decaying stat
of operations per range and to multiply this metric by the range size,
giving us a range momentum metric. We then balance the range momentum
metric across nodes. There are difficulties to making this work well,
the primary one being that load (and thus momentum) can change
rapidly and we want to avoid the system being overly sensitive to such
changes. Transferring leaseholders is relatively inexpensive, but not
free. Rebalancing a range is fairly heavyweight and can impose a
systemic drain on system resources if done too frequently.
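
The following is a rough sketch of such an exponentially decaying
per-range counter and the resulting momentum metric. The half-life
parameterization and all names are assumptions for illustration, not
existing CockroachDB code.

```go
// A sketch of the "range momentum" idea: an exponentially decaying
// per-range ops counter multiplied by range size. The decay constant
// and field names are illustrative assumptions.
package momentum

import (
	"math"
	"time"
)

// decayingCounter tracks operations per range with exponential decay
// so that old load fades out rather than accumulating forever.
type decayingCounter struct {
	value      float64   // decayed operation count
	lastUpdate time.Time // time of the last record/read
	halfLife   time.Duration
}

func newDecayingCounter(now time.Time, halfLife time.Duration) *decayingCounter {
	return &decayingCounter{lastUpdate: now, halfLife: halfLife}
}

// decay applies the exponential decay accrued since the last update.
func (c *decayingCounter) decay(now time.Time) {
	elapsed := now.Sub(c.lastUpdate).Seconds()
	if elapsed > 0 {
		c.value *= math.Exp2(-elapsed / c.halfLife.Seconds())
		c.lastUpdate = now
	}
}

// record adds n operations observed at time now.
func (c *decayingCounter) record(now time.Time, n float64) {
	c.decay(now)
	c.value += n
}

// momentum is the decayed ops count scaled by range size in bytes;
// larger, hotter ranges contribute more load and cost more to move.
func (c *decayingCounter) momentum(now time.Time, rangeBytes int64) float64 {
	c.decay(now)
	return c.value * float64(rangeBytes)
}
```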

Range momentum by itself does not aid in load-based leaseholder
placement. For that we'll need to pass additional information in each
`BatchRequest` indicating the locality of the originator of the
request and to maintain per-range metrics about how much load a range
is seeing from each locality. The rebalancer would then attempt to
place leases such that they are spread within the localities they are
receiving load from, modulo the other placement constraints
(i.e. diversity).
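
As an illustration of that per-locality accounting, here is a hedged
sketch. The assumption that the originator's locality arrives as a
string alongside each `BatchRequest`, and all type and method names,
are hypothetical.

```go
// A sketch of per-range, per-locality load tracking; types and field
// names are hypothetical.
package loadtrack

import "sync"

// localityLoad accumulates, for a single range, how many requests
// arrived from each originating locality (e.g. "region=us-east1").
type localityLoad struct {
	mu     sync.Mutex
	counts map[string]int64
}

func newLocalityLoad() *localityLoad {
	return &localityLoad{counts: make(map[string]int64)}
}

// record is called once per incoming request with the originator's
// locality, as carried on the BatchRequest.
func (l *localityLoad) record(locality string) {
	l.mu.Lock()
	defer l.mu.Unlock()
	l.counts[locality]++
}

// hottest returns the locality that has sent the most requests; the
// rebalancer would prefer to place the lease within (or near) it,
// subject to the other placement constraints such as diversity.
func (l *localityLoad) hottest() (string, int64) {
	l.mu.Lock()
	defer l.mu.Unlock()
	var best string
	var bestCount int64
	for loc, n := range l.counts {
		if n > bestCount {
			best, bestCount = loc, n
		}
	}
	return best, bestCount
}
```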

# Drawbacks

* The proposed leaseholder rebalancing mechanisms require a transfer
lease operation. We have such an operation for use in testing but it
isn't ready for use in production (yet). This should be rectified
soon.

# Alternatives

* Rather than placing the leaseholder rebalancing burden on
`replicateQueue`, we could perform rebalancing when leases are
acquired/extended. This would work with the current expiration-based
leases, but not with [epoch-based](range_leases.md) leases.

# Unresolved questions
