storage: Load-based rebalancing should be more willing to make certain incremental improvements #31135
Labels
A-kv-distribution
Relating to rebalancing and leasing.
C-performance
Perf of queries or internals. Solution not expected to change functional behavior.
Comments
This was referenced Oct 9, 2018 (Closed).

nvanbenschoten added the C-performance (Perf of queries or internals. Solution not expected to change functional behavior.) and A-kv-distribution (Relating to rebalancing and leasing.) labels and removed the A-coreperf label on Oct 12, 2018.
When investigating https://github.com/cockroachlabs/support/issues/1065, I was able to trace the load imbalance we've been seeing in a lot of these "high read amplification" issues to this problem. I'm re-opening this issue until the linked PR is merged. The linked PR (#65379) will solve this issue as well.
aayushshah15 added a commit to aayushshah15/cockroach that referenced this issue on Aug 24, 2021:
This commit augments `TransferLeaseTarget()` by adding a mode that picks the best lease transfer target that would lead to QPS convergence across the stores that have a replica for a given range.

Resolves cockroachdb#31135

Release note: None
aayushshah15 added a commit to aayushshah15/cockroach that referenced this issue on Aug 31, 2021:
This commit augments `TransferLeaseTarget()` by adding a mode that picks the best lease transfer target that would lead to QPS convergence across the stores that have a replica for a given range. This commit implements a strategy that predicates lease transfer decisions on whether they would serve to reduce the QPS delta between existing replicas' stores.

Resolves cockroachdb#31135

Release note: None

Release justification:
aayushshah15 added a commit to aayushshah15/cockroach that referenced this issue on Sep 7, 2021:
This commit augments `TransferLeaseTarget()` by adding a mode that picks the best lease transfer target that would lead to QPS convergence across the stores that have a replica for a given range. This commit implements a strategy that predicates lease transfer decisions on whether they would serve to reduce the QPS delta between existing replicas' stores.

Resolves cockroachdb#31135

Release note: None

Release justification:
aayushshah15 added a commit to aayushshah15/cockroach that referenced this issue on Sep 8, 2021:
This commit augments `TransferLeaseTarget()` by adding a mode that picks the best lease transfer target that would lead to QPS convergence across the stores that have a replica for a given range. This commit implements a strategy that predicates lease transfer decisions on whether they would serve to reduce the QPS delta between existing replicas' stores.

Resolves cockroachdb#31135

Release justification: Fixes high priority bug

Release note (bug fix): Previously, the store rebalancer was unable to rebalance leases for hot ranges that received a disproportionate amount of traffic relative to the rest of the cluster. This often led to prolonged single-node hotspots in certain workloads that led to hot ranges. This bug is now fixed.
aayushshah15 added a commit to aayushshah15/cockroach that referenced this issue on Sep 8, 2021:
This commit augments `TransferLeaseTarget()` by adding a mode that picks the best lease transfer target that would lead to QPS convergence across the stores that have a replica for a given range. This commit implements a strategy that predicates lease transfer decisions on whether they would serve to reduce the QPS delta between existing replicas' stores.

Resolves cockroachdb#31135

Release justification: Fixes high priority bug

Release note (bug fix): Previously, the store rebalancer was unable to rebalance leases for hot ranges that received a disproportionate amount of traffic relative to the rest of the cluster. This often led to prolonged single-node hotspots in certain workloads that led to hot ranges. This bug is now fixed.
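As a rough illustration of the mode these commits describe (a minimal sketch, not the actual `TransferLeaseTarget()` implementation; the `storeQPS` type and `bestLeaseTarget` helper are hypothetical), the idea is to simulate handing the lease's QPS to each existing voter and keep the candidate that most shrinks the hottest-to-coldest gap across the replica set:

```go
package main

import "fmt"

// storeQPS is a hypothetical stand-in for a store descriptor plus its
// measured queries-per-second; the real allocator works on richer state.
type storeQPS struct {
	storeID int
	qps     float64
}

// qpsSpread returns the delta between the hottest and coldest store.
// It assumes a non-empty slice.
func qpsSpread(stores []storeQPS) float64 {
	min, max := stores[0].qps, stores[0].qps
	for _, s := range stores[1:] {
		if s.qps < min {
			min = s.qps
		}
		if s.qps > max {
			max = s.qps
		}
	}
	return max - min
}

// bestLeaseTarget sketches the "QPS convergence" mode: simulate moving the
// lease's QPS from the current leaseholder to each candidate and keep the
// candidate that yields the smallest hottest-to-coldest delta, but only if
// it actually improves on doing nothing.
func bestLeaseTarget(stores []storeQPS, leaseholder int, leaseQPS float64) (target int, ok bool) {
	best := qpsSpread(stores)
	target = leaseholder
	for i := range stores {
		if stores[i].storeID == leaseholder {
			continue
		}
		sim := make([]storeQPS, len(stores))
		copy(sim, stores)
		for j := range sim {
			if sim[j].storeID == leaseholder {
				sim[j].qps -= leaseQPS
			}
			if sim[j].storeID == stores[i].storeID {
				sim[j].qps += leaseQPS
			}
		}
		if spread := qpsSpread(sim); spread < best {
			best, target, ok = spread, stores[i].storeID, true
		}
	}
	return target, ok
}

func main() {
	replicas := []storeQPS{{1, 2500}, {2, 600}, {3, 700}}
	if t, ok := bestLeaseTarget(replicas, 1, 1200); ok {
		fmt.Printf("transfer lease from s1 to s%d\n", t)
	}
}
```

The real allocator operates on store descriptors and richer load signals, but the convergence criterion is the same in spirit: only transfer if doing so narrows the QPS spread across the replica set.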
craig bot pushed a commit that referenced this issue on Sep 8, 2021:
65379: kvserver: actuate load-based replica rebalancing under heterogeneous localities r=aayushshah15 a=aayushshah15

This commit teaches the `StoreRebalancer` to make load-based rebalancing decisions that are meaningful within the context of the replication constraints placed on the ranges being relocated and the set of stores that can legally receive replicas for such ranges.

Previously, the `StoreRebalancer` would compute the QPS underfull and overfull thresholds based on the overall average QPS being served by all stores in the cluster. Notably, this included stores that were in replication zones that would not satisfy required constraints for the range being considered for rebalancing. This meant that the store rebalancer would effectively never be able to rebalance ranges within the stores inside heavily loaded replication zones (since all the _valid_ stores would be above the overfull thresholds).

This patch is a move away from the bespoke relocation logic in the `StoreRebalancer`. Instead, we have the `StoreRebalancer` rely on the rebalancing logic used by the `replicateQueue` that already has the machinery to compute load-based signals for candidates _relative to other comparable stores_. The main difference here is that the `StoreRebalancer` uses this machinery to promote convergence of QPS across stores, whereas the `replicateQueue` uses it to promote convergence of range counts. A series of preceding commits in this patchset generalize the existing replica rebalancing logic, and this commit teaches the `StoreRebalancer` to use it.

This generalization also addresses another key limitation (see #62992) of the `StoreRebalancer` regarding its inability to make partial improvements to a range. Previously, if the `StoreRebalancer` couldn't move a range _entirely_ off of overfull stores, it would give up and not even move the subset of replicas it could. This is no longer the case.

Resolves #61883
Resolves #62992
Resolves #31135

/cc @cockroachdb/kv

Release justification: fixes a set of major limitations behind numerous support escalations

Release note (performance improvement): QPS-based rebalancing is now aware of different constraints placed on different replication zones. This means that heterogeneously loaded replication zones (for instance, regions) will achieve a more even distribution of QPS within the stores inside each such zone.

Co-authored-by: Aayush Shah <[email protected]>
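To make the threshold change described above concrete, here is a small hypothetical example (the numbers, the 10% headroom factor, and the `mean` helper are made up for illustration, not taken from the CockroachDB code) of why computing the overfull threshold against a cluster-wide QPS average stalls rebalancing inside a heavily loaded replication zone, while computing it against only the comparable stores does not:

```go
package main

import "fmt"

// mean is a small helper over a slice of per-store QPS figures.
func mean(qps []float64) float64 {
	var sum float64
	for _, q := range qps {
		sum += q
	}
	return sum / float64(len(qps))
}

func main() {
	// Hypothetical cluster: stores 1-3 sit in a heavily loaded zone that the
	// range's constraints require, stores 4-6 sit in a lightly loaded zone
	// that cannot legally hold the range.
	hotZone := []float64{3000, 2400, 2100}
	allStores := append(append([]float64{}, hotZone...), 300, 250, 200)

	const overfullFactor = 1.1 // hypothetical 10% headroom above the mean

	clusterThreshold := mean(allStores) * overfullFactor
	zoneThreshold := mean(hotZone) * overfullFactor

	// Against the cluster-wide threshold every valid store looks overfull,
	// so no rebalance target exists; against the comparable-store threshold
	// only the hottest store does, and rebalancing can proceed.
	fmt.Printf("cluster-wide overfull threshold: %.0f qps\n", clusterThreshold)
	fmt.Printf("comparable-store overfull threshold: %.0f qps\n", zoneThreshold)
}
```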
craig bot pushed a commit that referenced this issue on Jan 29, 2022:
72296: kvserver: rebalance ranges to minimize QPS delta among stores r=aayushshah15 a=aayushshah15

kvserver: rebalance ranges to minimize QPS delta among stores

This commit fixes the regression(s) introduced by #65379 where we observed replica thrashing in various workloads (#70396 and #71244).

The following is a description of the differences between the QPS-based rebalancing scheme used in the previous implementation of the store rebalancer (release-21.2 and before) and the "new" implementation (22.1 and beyond).

**lease rebalancing**

***release 21.2 and before***

QPS-based lease rebalancing in CRDB 21.2 considers the overall cluster-level average QPS and computes underfull and overfull thresholds based off of this average. For each range that the local store has a lease for, the store rebalancer goroutine checks whether transferring said range's lease away will bring the local store's QPS below the underfull threshold. If so, it ignores the range and moves on to the next one. Otherwise, it iterates through the stores of all the non-leaseholder voting replicas (in ascending order of their QPS) and checks whether it would be reasonable to transfer the lease away to such a store. It ensures that the receiving store would not become overfull after the lease transfer. It checks that the receiving store doesn't have a replica that's lagging behind the current leaseholder. It checks that the receiving store is not in violation of lease preferences. Finally, it ensures that the lease is not on the local store because of access locality considerations (i.e. because of follow-the-workload).

All of this was bespoke logic that lived in the store rebalancer (using none of the Allocator's machinery).

***master and this commit***

In #65379, we moved this decision making into the Allocator by adding a new mode in `Allocator.TransferLeaseTarget` that tries to determine whether transferring the lease to another voting replica would reduce the QPS delta between the hottest and the coldest stores in the replica set. This commit adds some padding to this logic by ensuring that the QPS difference between the store relinquishing the lease and the store receiving the lease is at least 200 QPS. Furthermore, it ensures that the store receiving the lease won't become significantly hotter than the current leaseholder.

**replica rebalancing**

***release 21.2 and before***

QPS replica rebalancing in CRDB <=21.2 works similarly to the lease rebalancing logic. We first compute a cluster-level QPS average and overfull and underfull thresholds. Based on these thresholds we try to move replicas away from overfull stores and onto stores that are underfull, all while ensuring that the receiving stores would not become overfull after the rebalance. A critical assumption that the store rebalancer made (and still does, in the approach implemented by this commit) is that follower replicas serve the same traffic as the leaseholder.

***master and this commit***

The approach implemented by #65379 and refined by this commit tries to leverage machinery in the Allocator that makes rebalancing decisions that converge load-based statistics per equivalence class. Previously, this machinery was only used for range-count-based replica rebalancing (performed by the `replicateQueue`) but not for QPS-based rebalancing. This commit implements a similar approach to what we do now for lease rebalancing, which is to determine whether a rebalance action would reduce the QPS delta between the hottest and the coldest store in the equivalence class.
This commit adds some safeguards around this logic by ensuring that the store relinquishing the replica and the store receiving it differ by at least 200 QPS. Furthermore, it ensures that the replica rebalance would not significantly switch the relative dispositions of the two stores.

An important thing to note with the 21.2 implementation of the store rebalancer is that it was making all of its decisions based on cluster-level QPS averages. This behaves poorly in heterogeneously sized / loaded clusters where some localities are designed to receive more traffic than others. In such clusters, heavily loaded localities can always be considered "overfull". This usually means that all stores in such localities would be above the "overfull" threshold in the cluster. The logic described above would effectively not do anything since there are no underfull stores to move replicas to.

**Manual testing**

This patch has been stress tested with the follower reads roachtests (~250 iterations of `follower-reads/survival=region/locality=global/reads=strong` and 100 iterations of `follower-reads/survival=zone/locality=regional/reads=exact-staleness`). It has also been stress tested with the `rebalance/by-load` roachtests (100 iterations for both `..leases` and `..replicas` tests). I also manually ran a TPCC 10K run with a small ramp (something we know triggers #31135) a few times and saw average QPS converge among stores fairly quickly.

![tpcc-with-low-ramp](https://user-images.githubusercontent.com/10788754/149742518-981825f4-6812-41c1-8320-519caafda9c1.png)

Release note (performance improvement): A set of bugs that rendered QPS-based lease and replica rebalancing in CRDB 21.2 and prior ineffective under heterogeneously loaded cluster localities has been fixed. Additionally, a limitation which prevented CRDB from effectively alleviating extreme QPS hotspots from nodes has also been fixed.

75624: kv: compare MVCC GC threshold against Refresh{Range}Request.RefreshFrom r=nvanbenschoten a=nvanbenschoten

Noticed by Sumeer in #74628.

A Refresh request needs to observe all MVCC versions between its exclusive RefreshFrom time and its inclusive RefreshTo time. If it were to permit MVCC GC between these times then it could miss conflicts that should cause the refresh to fail. This could in turn lead to violations of serializability. For example:

```
txn1 reads value k1@10
txn2 deletes (tombstones) k1@15
mvcc gc @ 20 clears versions k1@10 and k1@15
txn1 refreshes @ 25, sees no value between (10, 25], refresh successful
```

In the example, the refresh erroneously succeeds because the request is permitted to evaluate after part of the MVCC history it needs to read has been GCed. By considering the RefreshFrom time to be the earliest active timestamp of the request, we avoid this hazard. Instead of being allowed to evaluate, the refresh request in the example would have hit a BatchTimestampBeforeGCError.

Co-authored-by: Aayush Shah <[email protected]>
Co-authored-by: Nathan VanBenschoten <[email protected]>
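A minimal sketch of the transfer safeguards described in the 72296 message above, assuming a simplified model where a store's QPS changes by exactly the range's QPS when load moves (the 200 qps figure comes from the commit message; the `shouldTransfer` function and its exact inversion check are hypothetical, not the actual allocator code):

```go
package main

import "fmt"

// minQPSDelta mirrors the ~200 qps padding described in the commit message:
// don't bother shuffling load unless the two stores differ by at least this much.
const minQPSDelta = 200.0

// shouldTransfer sketches the two checks described above: the source must be
// meaningfully hotter than the destination, and moving the load must not
// simply swap which of the two stores is the hot one.
func shouldTransfer(srcQPS, dstQPS, rangeQPS float64) bool {
	if srcQPS-dstQPS < minQPSDelta {
		return false // delta too small to be worth a move
	}
	// After the move, the destination must not end up hotter than the source,
	// i.e. the move must not invert their relative dispositions.
	if dstQPS+rangeQPS > srcQPS-rangeQPS {
		return false
	}
	return true
}

func main() {
	fmt.Println(shouldTransfer(1500, 400, 300)) // true: real imbalance, move helps
	fmt.Println(shouldTransfer(600, 500, 300))  // false: delta under 200 qps
	fmt.Println(shouldTransfer(1000, 400, 400)) // false: move would just swap the hotspot
}
```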
Original issue description:

This is already called out in a TODO in the code: cockroach/pkg/storage/store_rebalancer.go, lines 567 to 571 at 4746982.
But yesterday I saw a cluster get into a real situation where a greater willingness to move a replica, even if it would make another store overfull, would have been the ideal thing to do / what a human operator would have done. Currently the code is cautious about moving replicas with so many QPS that moving them would over-fill whichever store they get moved to. That's a good decision when there's only one such replica on a store, but when there are two it's less good:
In these logs, s28 has way more QPS than the mean, mostly accounted for by two hot replicas (IIRC they were hot due to a bunch of retried errors, but that's a separate concern). However, moving either of them fails the check in `shouldNotMoveTo` that attempts to avoid ping-ponging a lone hot lease around between stores on an otherwise lightly loaded cluster: cockroach/pkg/storage/store_rebalancer.go, lines 653 to 666 at 4746982.
In such cases it would be reasonable to make such a move if, for example, the source store would still have more QPS on it than the destination store.
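A toy sketch of the relaxation being proposed here (purely illustrative; `okToMove`, its parameters, and the threshold handling are hypothetical, not the actual `shouldNotMoveTo` logic): keep rejecting moves that overfill the destination, except when the source would still be carrying more QPS than the destination afterwards.

```go
package main

import "fmt"

// okToMove is a hypothetical illustration of the proposal: a move is allowed
// either when the destination stays under the overfull threshold (the
// existing, cautious behavior), or when the move is still a strict
// improvement because the source remains hotter than the destination even
// after the transfer.
func okToMove(srcQPS, dstQPS, replicaQPS, overfullThreshold float64) bool {
	if dstQPS+replicaQPS < overfullThreshold {
		return true // existing check: destination does not become overfull
	}
	// Proposed relaxation: tolerate an overfull destination as long as the
	// source is still carrying more load afterwards, so the cluster as a
	// whole gets incrementally better.
	return srcQPS-replicaQPS > dstQPS+replicaQPS
}

func main() {
	// An s28-like scenario: two hot replicas on one store. Moving one of them
	// overfills the destination, but the source is still hotter afterwards,
	// so the move is worth making.
	fmt.Println(okToMove(4000, 800, 1200, 1500)) // true under the relaxed check
	fmt.Println(okToMove(1200, 800, 1200, 1500)) // false: would just swap the hotspot
}
```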
It's worth noting that it's better for 2.1 to be overly cautious than for it to be overly aggressive, so I'm not planning on pushing for any such changes to be backported.
gz#8871