Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
75224: kvserver: replace circuit breaker cancel goroutine with per-Replica registry r=erikgrinaker a=tbg PR cockroachdb#71806 originally landed with a very simple solution for the problem of joining the breaker's cancellation chain with that of the incoming context: spawning a goroutine. This was never meant to be the solution in prod (spawning goroutines is not cheap) and this PR switches to a real solution (which may still have soom room for improvement). Replica circuit breakers are still off (cockroachdb#74705) and so this change should not change any behavior, unless breakers are turned on. The provided benchmarks show both the "end-to-end" `Replica.Send` as well as the micro-bench `breaker.{R,Un}egister` perf and contrast it with that of having the breakers disabled. This sets the stage for evaluating and, if necessary, minimizing the overhead, which (along with some additional end-to-end testing) then allows us to turn breakers on by default (cockroachdb#74705). `BenchmarkReplicaCircuitBreakerSendOverhead` measures `Replica.Send` of an no-result single-point read. We see that we lose a few percent, though this benchmark is pretty micro already. `BenchmarkReplicaCircuitBreaker_Register` is even more micro, measuring the overhead of making a cancel, and calling `Register()` followed by `Unregister()`. The "old" measurement is essentially just creating a cancelable context (so the +XXXX% numbers don't mean a lot; we're doing work for which there was previously no analog). Here we can clearly see that sharding the mutex map can help, though note that this would already only be visible at the `Replica.Send` level at much higher concurrencies than in this test (16). I also ran the kv95 roachtest with what corresponds to the `mutexmap-1` strategy above (i.e. a single mutex); [raw data here]. In this configuration (relative to disabling breakers) `kv95/enc=false/nodes=1` and `kv95/enc=false/nodes=1/cpu=32` sustained a -1.62% hit (in qps, median over five runs). Once we've chosen a suitable shard count, I expect that to drop a bit as well, and think we can turn on breakers in production, at least from a performance point of view. [raw data here]: https://gist.github.com/tbg/44b53a10d18490a6dabd57e1d1d3225e old = disabled, new = enabled: ``` ReplicaCircuitBreakerSendOverhead/X/mutexmap-1-16 1.63µs ± 4% 1.77µs ± 5% +8.38% (p=0.000 n=11+12) ReplicaCircuitBreakerSendOverhead/X/mutexmap-2-16 1.66µs ± 4% 1.75µs ± 5% +5.74% (p=0.000 n=10+12) ReplicaCircuitBreakerSendOverhead/X/mutexmap-4-16 1.63µs ± 6% 1.75µs ± 4% +7.48% (p=0.000 n=11+12) ReplicaCircuitBreakerSendOverhead/X/mutexmap-8-16 1.58µs ± 7% 1.73µs ± 4% +9.47% (p=0.000 n=11+12) ReplicaCircuitBreakerSendOverhead/X/mutexmap-12-16 1.57µs ± 4% 1.62µs ± 6% +3.19% (p=0.046 n=11+12) ReplicaCircuitBreakerSendOverhead/X/mutexmap-16-16 1.56µs ± 5% 1.63µs ± 7% +4.20% (p=0.007 n=11+12) ReplicaCircuitBreakerSendOverhead/X/mutexmap-20-16 1.53µs ± 5% 1.62µs ± 8% +6.01% (p=0.001 n=11+12) ReplicaCircuitBreakerSendOverhead/X/mutexmap-24-16 1.55µs ± 7% 1.61µs ± 5% +3.50% (p=0.049 n=11+11) ReplicaCircuitBreakerSendOverhead/X/mutexmap-32-16 1.57µs ± 7% 1.62µs ± 6% +3.53% (p=0.042 n=11+12) ReplicaCircuitBreakerSendOverhead/X/mutexmap-64-16 1.53µs ± 7% 1.64µs ± 5% +6.84% (p=0.000 n=11+12) name old alloc/op new alloc/op delta ReplicaCircuitBreakerSendOverhead/X/mutexmap-1-16 1.42kB ± 0% 1.44kB ± 0% +1.25% (p=0.000 n=7+11) ReplicaCircuitBreakerSendOverhead/X/mutexmap-2-16 1.42kB ± 0% 1.44kB ± 0% +1.19% (p=0.000 n=9+12) ReplicaCircuitBreakerSendOverhead/X/mutexmap-4-16 1.42kB ± 0% 1.43kB ± 0% +1.14% (p=0.000 n=11+11) ReplicaCircuitBreakerSendOverhead/X/mutexmap-8-16 1.42kB ± 0% 1.43kB ± 0% +1.24% (p=0.000 n=9+11) ReplicaCircuitBreakerSendOverhead/X/mutexmap-12-16 1.42kB ± 0% 1.43kB ± 0% +1.08% (p=0.000 n=11+11) ReplicaCircuitBreakerSendOverhead/X/mutexmap-16-16 1.42kB ± 0% 1.43kB ± 0% +1.14% (p=0.000 n=10+10) ReplicaCircuitBreakerSendOverhead/X/mutexmap-20-16 1.42kB ± 0% 1.43kB ± 0% +1.12% (p=0.000 n=10+12) ReplicaCircuitBreakerSendOverhead/X/mutexmap-24-16 1.42kB ± 0% 1.43kB ± 0% +1.09% (p=0.000 n=11+11) ReplicaCircuitBreakerSendOverhead/X/mutexmap-32-16 1.41kB ± 0% 1.43kB ± 0% +1.13% (p=0.000 n=9+11) ReplicaCircuitBreakerSendOverhead/X/mutexmap-64-16 1.41kB ± 0% 1.43kB ± 0% +1.23% (p=0.000 n=9+12) name old allocs/op new allocs/op delta ReplicaCircuitBreakerSendOverhead/X/mutexmap-1-16 11.0 ± 0% 12.0 ± 0% +9.09% (p=0.000 n=11+12) ReplicaCircuitBreakerSendOverhead/X/mutexmap-2-16 11.0 ± 0% 12.0 ± 0% +9.09% (p=0.000 n=11+12) ReplicaCircuitBreakerSendOverhead/X/mutexmap-4-16 11.0 ± 0% 12.0 ± 0% +9.09% (p=0.000 n=11+12) ReplicaCircuitBreakerSendOverhead/X/mutexmap-8-16 11.0 ± 0% 12.0 ± 0% +9.09% (p=0.000 n=11+12) ReplicaCircuitBreakerSendOverhead/X/mutexmap-12-16 11.0 ± 0% 12.0 ± 0% +9.09% (p=0.000 n=11+12) ReplicaCircuitBreakerSendOverhead/X/mutexmap-16-16 11.0 ± 0% 12.0 ± 0% +9.09% (p=0.000 n=11+12) ReplicaCircuitBreakerSendOverhead/X/mutexmap-20-16 11.0 ± 0% 12.0 ± 0% +9.09% (p=0.000 n=11+12) ReplicaCircuitBreakerSendOverhead/X/mutexmap-24-16 11.0 ± 0% 12.0 ± 0% +9.09% (p=0.000 n=11+12) ReplicaCircuitBreakerSendOverhead/X/mutexmap-32-16 11.0 ± 0% 12.0 ± 0% +9.09% (p=0.000 n=11+12) ReplicaCircuitBreakerSendOverhead/X/mutexmap-64-16 11.0 ± 0% 12.0 ± 0% +9.09% (p=0.000 n=11+12) ``` ``` ReplicaCircuitBreaker_Register/X/mutexmap-1-16 30.8ns ±21% 787.6ns ± 2% +2457.74% (p=0.000 n=11+11) ReplicaCircuitBreaker_Register/X/mutexmap-2-16 31.6ns ± 3% 782.0ns ± 2% +2376.38% (p=0.000 n=9+12) ReplicaCircuitBreaker_Register/X/mutexmap-4-16 31.0ns ± 0% 778.8ns ± 2% +2409.61% (p=0.000 n=9+12) ReplicaCircuitBreaker_Register/X/mutexmap-8-16 31.0ns ± 0% 775.8ns ± 1% +2403.11% (p=0.000 n=10+10) ReplicaCircuitBreaker_Register/X/mutexmap-12-16 31.0ns ± 1% 288.8ns ± 2% +833.06% (p=0.000 n=10+12) ReplicaCircuitBreaker_Register/X/mutexmap-16-16 31.1ns ± 1% 324.6ns ± 4% +945.05% (p=0.000 n=10+12) ReplicaCircuitBreaker_Register/X/mutexmap-20-16 31.1ns ± 0% 193.3ns ± 2% +522.26% (p=0.000 n=10+12) ReplicaCircuitBreaker_Register/X/mutexmap-24-16 31.0ns ± 0% 286.1ns ± 1% +822.80% (p=0.000 n=9+12) ReplicaCircuitBreaker_Register/X/mutexmap-32-16 31.0ns ± 0% 194.4ns ± 2% +527.12% (p=0.000 n=10+12) ReplicaCircuitBreaker_Register/X/mutexmap-64-16 31.0ns ± 1% 121.3ns ± 2% +291.01% (p=0.000 n=10+12) name old alloc/op new alloc/op delta ReplicaCircuitBreaker_Register/X/mutexmap-1-16 80.0B ± 0% 96.0B ± 0% +20.00% (p=0.000 n=11+12) ReplicaCircuitBreaker_Register/X/mutexmap-2-16 80.0B ± 0% 96.0B ± 0% +20.00% (p=0.000 n=11+12) ReplicaCircuitBreaker_Register/X/mutexmap-4-16 80.0B ± 0% 96.0B ± 0% +20.00% (p=0.000 n=11+12) ReplicaCircuitBreaker_Register/X/mutexmap-8-16 80.0B ± 0% 96.0B ± 0% +20.00% (p=0.000 n=11+12) ReplicaCircuitBreaker_Register/X/mutexmap-12-16 80.0B ± 0% 96.0B ± 0% +20.00% (p=0.000 n=11+12) ReplicaCircuitBreaker_Register/X/mutexmap-16-16 80.0B ± 0% 96.0B ± 0% +20.00% (p=0.000 n=11+12) ReplicaCircuitBreaker_Register/X/mutexmap-20-16 80.0B ± 0% 96.0B ± 0% +20.00% (p=0.000 n=11+12) ReplicaCircuitBreaker_Register/X/mutexmap-24-16 80.0B ± 0% 96.0B ± 0% +20.00% (p=0.000 n=11+12) ReplicaCircuitBreaker_Register/X/mutexmap-32-16 80.0B ± 0% 96.0B ± 0% +20.00% (p=0.000 n=11+12) ReplicaCircuitBreaker_Register/X/mutexmap-64-16 80.0B ± 0% 96.0B ± 0% +20.00% (p=0.000 n=11+12) name old allocs/op new allocs/op delta ReplicaCircuitBreaker_Register/X/mutexmap-1-16 2.00 ± 0% 3.00 ± 0% +50.00% (p=0.000 n=11+12) ReplicaCircuitBreaker_Register/X/mutexmap-2-16 2.00 ± 0% 3.00 ± 0% +50.00% (p=0.000 n=11+12) ReplicaCircuitBreaker_Register/X/mutexmap-4-16 2.00 ± 0% 3.00 ± 0% +50.00% (p=0.000 n=11+12) ReplicaCircuitBreaker_Register/X/mutexmap-8-16 2.00 ± 0% 3.00 ± 0% +50.00% (p=0.000 n=11+12) ReplicaCircuitBreaker_Register/X/mutexmap-12-16 2.00 ± 0% 3.00 ± 0% +50.00% (p=0.000 n=11+12) ReplicaCircuitBreaker_Register/X/mutexmap-16-16 2.00 ± 0% 3.00 ± 0% +50.00% (p=0.000 n=11+12) ReplicaCircuitBreaker_Register/X/mutexmap-20-16 2.00 ± 0% 3.00 ± 0% +50.00% (p=0.000 n=11+12) ReplicaCircuitBreaker_Register/X/mutexmap-24-16 2.00 ± 0% 3.00 ± 0% +50.00% (p=0.000 n=11+12) ReplicaCircuitBreaker_Register/X/mutexmap-32-16 2.00 ± 0% 3.00 ± 0% +50.00% (p=0.000 n=11+12) ReplicaCircuitBreaker_Register/X/mutexmap-64-16 2.00 ± 0% 3.00 ± 0% +50.00% (p=0.000 n=11+12) ``` Created via: ``` $ ./dev bench ./pkg/kv/kvserver/ -f BenchmarkReplicaCircuitBreaker --count 10 | tee out.txt $ for f in enabled=true enabled=false; do grep -F "${f}" out.txt | sed "s/${f}/X/" > "${f}"; done && benchstat 'enabled=false' 'enabled=true' ``` For the record, I also tried a `sync.Map`, but it was vastly inferior for this use case as it is allocation-heavy and its writes are fairly slow. Touches cockroachdb#33007. Fixes cockroachdb#74707. Release note: None Co-authored-by: Tobias Grieger <[email protected]>
- Loading branch information