kvserver: optimize voting replica placement in databases with region failure #59650
Labels
A-kv-replication-constraints
C-enhancement
The zone config extensions introduced thus far achieve our broad goal of configuring a database to tolerate region or zone failure. However, we haven't yet introduced mechanisms to optimally decide the placement of the non-leaseholder voting replicas. Under region failure tolerance, all writes incur cross-region replication latencies, since we cannot place a quorum of voting replicas in any single region. Thus, it matters where we place the non-leaseholder voting replicas, as that placement directly affects write latencies on the database.
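To make the quorum constraint concrete, here is a minimal Go sketch of the arithmetic (the helper names are ours, for illustration only): to survive the loss of an entire region, the voters *outside* any one region must themselves form a quorum, so no region may hold more than `num_voters - quorum` voters. Since that cap is always strictly smaller than a quorum, every quorum spans at least two regions, and every write waits on at least one cross-region round trip.

```go
package main

import "fmt"

// quorum returns the Raft quorum size for n voting replicas.
func quorum(n int) int { return n/2 + 1 }

// maxVotersPerRegion returns the most voters a single region may hold
// while still tolerating the loss of an entire region: the voters
// outside any one region must themselves form a quorum.
func maxVotersPerRegion(numVoters int) int {
	return numVoters - quorum(numVoters)
}

func main() {
	for _, n := range []int{3, 5, 7} {
		fmt.Printf("num_voters=%d: quorum=%d, max voters per region=%d\n",
			n, quorum(n), maxVotersPerRegion(n))
	}
	// Output:
	// num_voters=3: quorum=2, max voters per region=1
	// num_voters=5: quorum=3, max voters per region=2
	// num_voters=7: quorum=4, max voters per region=3
}
```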
The allocator currently lacks a way to place replicas based on their latencies to the leaseholder. We could consider adding a new zone config attribute, tentatively called `survivability`, and introducing a new latency-based heuristic to the allocator. The `survivability` attribute will initially only be allowed to take one value, `"region"`, but could be extended to work with all types of locality tiers in the future as needed. With the `survivability` attribute set, the new latency-based heuristic is intended to have the effect of packing all the voting replicas into 3 regions: the primary region (which houses the leaseholder) and the 2 next-closest regions, as sketched below.
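As a rough illustration of the shape such a heuristic could take (this is our sketch, not existing allocator code; the latency figures and region names are hypothetical), candidate regions could be ranked by measured round-trip latency from the leaseholder's region, with voters then packed into the closest ones:

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// closestRegions returns all regions other than the primary, sorted by
// round-trip latency from the primary region (closest first).
func closestRegions(primary string, rtt map[string]time.Duration) []string {
	regions := make([]string, 0, len(rtt))
	for r := range rtt {
		if r != primary {
			regions = append(regions, r)
		}
	}
	sort.Slice(regions, func(i, j int) bool {
		return rtt[regions[i]] < rtt[regions[j]]
	})
	return regions
}

func main() {
	// Hypothetical inter-region latencies measured from primary region A.
	rtt := map[string]time.Duration{
		"B": 20 * time.Millisecond,
		"C": 35 * time.Millisecond,
		"D": 80 * time.Millisecond,
		"E": 110 * time.Millisecond,
	}
	fmt.Println(closestRegions("A", rtt)) // [B C D E]
}
```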
The desire to "pack" voters into the regions closest to the leaseholder also, notably, makes the total number of replicas a dynamic value based on the physical state of the cluster. To see this, consider a 7-region cluster (call the regions A, B, C, ..., G) with `num_voters = 5`. Say the primary region of the table is region A, and the next two closest regions are B and C (in that order). Ideally, we'd want a 2-2-1 placement configuration (in regions A, B, and C) for the 5 voting replicas. If the voters were placed in this manner, we'd want each of the 4 other regions to have a non-voting replica, so the total number of replicas would be 9 (5 voting and 4 non-voting). However, this requires that regions A and B have at least 2 nodes each. What if those regions do not have enough nodes to make that possible? For instance, if each of the 7 regions in the cluster had only 1 node, then the only possible configuration for the voting replicas would be 1-1-1-1-1: 1 voter in the primary region A and 1 in each of the 4 regions closest to it. Under this configuration, we would only need a total of 7 replicas (5 voting and 2 non-voting) to meet our goal of providing low-latency follower reads from all regions.
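A minimal Go sketch of this packing logic (again ours, not allocator code) reproduces both scenarios: voters are packed greedily into the closest regions, capped by each region's node count and by the survivability cap derived earlier, and every region left without a voter receives a non-voting replica:

```go
package main

import "fmt"

// placeVoters packs numVoters voters into regions sorted by latency from
// the primary (index 0 is the primary region). nodes[i] is the node count
// of region i. No region may hold more voters than it has nodes, nor more
// than the survivability cap (the voters outside any one region must
// still form a quorum).
func placeVoters(numVoters int, nodes []int) []int {
	maxPerRegion := numVoters - (numVoters/2 + 1)
	placement := make([]int, len(nodes))
	remaining := numVoters
	for i := range nodes {
		k := remaining
		if nodes[i] < k {
			k = nodes[i]
		}
		if maxPerRegion < k {
			k = maxPerRegion
		}
		placement[i] = k
		remaining -= k
		if remaining == 0 {
			break
		}
	}
	return placement
}

// totalReplicas adds one non-voting replica for every region without a
// voter, so that every region can serve low-latency follower reads.
func totalReplicas(numVoters int, placement []int) int {
	total := numVoters
	for _, v := range placement {
		if v == 0 {
			total++
		}
	}
	return total
}

func main() {
	// 7 regions (A..G) sorted by latency from primary region A.
	big := []int{3, 3, 3, 3, 3, 3, 3}   // plenty of nodes per region
	small := []int{1, 1, 1, 1, 1, 1, 1} // a single node per region

	p := placeVoters(5, big)
	fmt.Println(p, totalReplicas(5, p)) // [2 2 1 0 0 0 0] 9

	p = placeVoters(5, small)
	fmt.Println(p, totalReplicas(5, p)) // [1 1 1 1 1 0 0] 7
}
```

Note how the same `num_voters = 5` yields 9 total replicas under one cluster topology and 7 under another, which is what motivates the `auto` value for `num_replicas` discussed next.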
Thus, in addition to the above extensions, we plan on investigating the viability of letting the `num_replicas` field in the zone configs be set to some value like `auto`, and then letting the allocator dynamically figure out the number of replicas needed to fulfill the specified constraints (specifically, the requirement of having at least one replica per region for the sake of low-latency follower reads from everywhere).

Jira issue: CRDB-3261