The Raft scheduler mutex can become very contended on machines with many cores
and high range counts. This patch shards the scheduler by allocating ranges and
workers to individual scheduler shards.

By default, we create a new shard for every 16 workers, and distribute workers
evenly across shards. We spin up 8 workers per CPU core, capped at 96, so 16 is
equivalent to 2 CPUs per shard, or a maximum of 6 shards. This significantly
relieves contention at high core counts, while also avoiding starvation from
excessive sharding. The shard size can be adjusted via
`COCKROACH_SCHEDULER_SHARD_SIZE`.

This results in a substantial performance improvement on high-CPU nodes:

```
name                                    old ops/sec  new ops/sec  delta
kv0/enc=false/nodes=3/cpu=4             7.71k ± 5%   7.93k ± 4%      ~     (p=0.310 n=5+5)
kv0/enc=false/nodes=3/cpu=8             15.6k ± 3%   14.8k ± 7%      ~     (p=0.095 n=5+5)
kv0/enc=false/nodes=3/cpu=32            43.4k ± 2%   45.0k ± 3%   +3.73%   (p=0.032 n=5+5)
kv0/enc=false/nodes=3/cpu=96/cloud=aws  40.5k ± 2%   61.7k ± 1%  +52.53%   (p=0.008 n=5+5)
kv0/enc=false/nodes=3/cpu=96/cloud=gce  35.6k ± 4%   44.5k ± 0%  +24.99%   (p=0.008 n=5+5)

name                                    old p50      new p50      delta
kv0/enc=false/nodes=3/cpu=4             21.2 ± 6%    20.6 ± 3%       ~     (p=0.397 n=5+5)
kv0/enc=false/nodes=3/cpu=8             10.5 ± 0%    11.2 ± 8%       ~     (p=0.079 n=4+5)
kv0/enc=false/nodes=3/cpu=32            4.16 ± 1%    4.00 ± 5%       ~     (p=0.143 n=5+5)
kv0/enc=false/nodes=3/cpu=96/cloud=aws  3.00 ± 0%    2.00 ± 0%       ~     (p=0.079 n=4+5)
kv0/enc=false/nodes=3/cpu=96/cloud=gce  4.70 ± 0%    4.10 ± 0%   -12.77%   (p=0.000 n=5+4)

name                                    old p95      new p95      delta
kv0/enc=false/nodes=3/cpu=4             61.6 ± 5%    60.8 ± 3%       ~     (p=0.762 n=5+5)
kv0/enc=false/nodes=3/cpu=8             28.3 ± 4%    30.4 ± 0%    +7.34%   (p=0.016 n=5+4)
kv0/enc=false/nodes=3/cpu=32            7.98 ± 2%    7.60 ± 0%    -4.76%   (p=0.008 n=5+5)
kv0/enc=false/nodes=3/cpu=96/cloud=aws  12.3 ± 2%     6.8 ± 0%   -44.72%   (p=0.000 n=5+4)
kv0/enc=false/nodes=3/cpu=96/cloud=gce  10.2 ± 3%     8.9 ± 0%   -12.75%   (p=0.000 n=5+4)

name                                    old p99      new p99      delta
kv0/enc=false/nodes=3/cpu=4             89.8 ± 7%    88.9 ± 6%       ~     (p=0.921 n=5+5)
kv0/enc=false/nodes=3/cpu=8             46.1 ± 0%    48.6 ± 5%    +5.47%   (p=0.048 n=5+5)
kv0/enc=false/nodes=3/cpu=32            11.5 ± 0%    11.0 ± 0%    -4.35%   (p=0.008 n=5+5)
kv0/enc=false/nodes=3/cpu=96/cloud=aws  14.0 ± 3%    12.1 ± 0%   -13.32%   (p=0.008 n=5+5)
kv0/enc=false/nodes=3/cpu=96/cloud=gce  14.3 ± 3%    13.1 ± 0%    -8.55%   (p=0.000 n=4+5)
```

Furthermore, on an idle 24-core 3-node cluster with 50,000 unquiesced ranges,
this reduced CPU usage from 12% to 10%.

The basic cost of enqueueing ranges in the scheduler (without workers or
contention) only increases slightly in absolute terms, thanks to
`raftSchedulerBatch` pre-sharding the enqueued ranges:

```
name                                                               old time/op  new time/op  delta
SchedulerEnqueueRaftTicks/collect=true/ranges=10/workers=8-24       457ns ± 2%   564ns ± 2%   +23.36%  (p=0.001 n=7+7)
SchedulerEnqueueRaftTicks/collect=true/ranges=10/workers=16-24      461ns ± 3%   563ns ± 2%   +22.14%  (p=0.000 n=8+8)
SchedulerEnqueueRaftTicks/collect=true/ranges=10/workers=32-24      459ns ± 2%   591ns ± 2%   +28.63%  (p=0.000 n=8+8)
SchedulerEnqueueRaftTicks/collect=true/ranges=10/workers=64-24      455ns ± 0%   776ns ± 5%   +70.60%  (p=0.001 n=6+8)
SchedulerEnqueueRaftTicks/collect=true/ranges=10/workers=128-24     456ns ± 2%  1058ns ± 1%  +132.13%  (p=0.000 n=8+8)
SchedulerEnqueueRaftTicks/collect=true/ranges=100000/workers=8-24   7.15ms ± 1%  8.18ms ± 1%   +14.33%  (p=0.000 n=8+8)
SchedulerEnqueueRaftTicks/collect=true/ranges=100000/workers=16-24  7.13ms ± 1%  8.18ms ± 1%   +14.77%  (p=0.000 n=8+8)
SchedulerEnqueueRaftTicks/collect=true/ranges=100000/workers=32-24  7.12ms ± 2%  7.86ms ± 1%   +10.30%  (p=0.000 n=8+8)
SchedulerEnqueueRaftTicks/collect=true/ranges=100000/workers=64-24  7.20ms ± 1%  7.11ms ± 1%    -1.27%  (p=0.001 n=8+8)
SchedulerEnqueueRaftTicks/collect=true/ranges=100000/workers=128-24 7.12ms ± 2%  7.16ms ± 3%       ~    (p=0.721 n=8+8)
```

Epic: none

Release note (performance improvement): The Raft scheduler is now sharded to
relieve contention during range Raft processing, which can significantly
improve performance at high CPU core counts.
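The worker/shard sizing described above (8 workers per CPU capped at 96, one shard per 16 workers, workers distributed evenly, each range pinned to a shard) can be sketched as follows. This is a minimal illustration, not the actual CockroachDB implementation: the constant and function names, and the range-ID-modulo shard mapping, are assumptions for demonstration.

```go
package main

import "fmt"

// Illustrative constants mirroring the defaults described in the commit
// message; the names are hypothetical, not CockroachDB identifiers.
const (
	workersPerCPU = 8  // workers spun up per CPU core
	maxWorkers    = 96 // cap on the total worker count
	shardSize     = 16 // target workers per shard (COCKROACH_SCHEDULER_SHARD_SIZE)
)

// shardLayout computes the total worker count for a given CPU count and
// distributes those workers evenly across ceil(workers/shardSize) shards.
func shardLayout(numCPU int) (workers int, perShard []int) {
	workers = numCPU * workersPerCPU
	if workers > maxWorkers {
		workers = maxWorkers
	}
	numShards := (workers + shardSize - 1) / shardSize // ceiling division
	perShard = make([]int, numShards)
	for i := 0; i < workers; i++ {
		// Round-robin assignment keeps shard sizes within one worker
		// of each other.
		perShard[i%numShards]++
	}
	return workers, perShard
}

// shardForRange pins a range to a shard; modulo on the range ID is an
// assumed mapping used here only to show that each range is always
// processed by the same shard.
func shardForRange(rangeID uint64, numShards int) int {
	return int(rangeID % uint64(numShards))
}

func main() {
	for _, cpus := range []int{4, 8, 32, 96} {
		workers, perShard := shardLayout(cpus)
		fmt.Printf("cpus=%d workers=%d shards=%v\n", cpus, workers, perShard)
	}
}
```

With these defaults, 4 CPUs yield 32 workers in 2 shards of 16, while 32 or 96 CPUs both hit the 96-worker cap and get 6 shards of 16, matching the "maximum of 6 shards" above. Sharding trades a little cross-shard balance for much less contention on any single shard's mutex.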