86638: admission,kvserver: introduce an elastic cpu limiter r=irfansharif a=irfansharif

Today when admission control admits a request, it is able to run indefinitely, consuming arbitrary CPU. For long-running (~1s of CPU work per request) "elastic" (not latency sensitive) work like backups, this can have detrimental effects on foreground latencies: once such work is admitted, it can take up available CPU cores until completion, which prevents foreground work from running. The scheme below aims to change this behavior; there are two components in play:

- A token bucket that hands out slices of CPU time, where the total amount handed out is determined by a 'target utilization': the max % of CPU it aims to use (on an 8 vCPU machine, if targeting 50% CPU, it can hand out 0.50 * 8 = 4 seconds of CPU time per second).
- A feedback controller that periodically adjusts the CPU % used by the token bucket by measuring scheduling latency[1]. If over the limit (1ms at p99, chosen experimentally), the % is reduced; if under the limit and we're seeing substantial utilization, the % is increased.

Elastic work acquires CPU tokens representing some predetermined slice of CPU time, blocking until those tokens become available. We found that 100ms of tokens works well enough experimentally. A larger value, say 250ms, would translate to less preemption and fewer RPCs. What's important is that it isn't "too much" (like 2s of CPU time), since that would let a single request hog a core for up to 2s and allow a large build-up of runnable goroutines (serving foreground traffic) on that core, affecting scheduling/foreground latencies. The work preempts itself once the slice is used up (as a form of cooperative scheduling); once preempted, the request returns to the caller with a resumption key. This scheme is effective in clamping down on scheduling latency that's due to an excessive amount of elastic work.
We have proof from direct trace captures and instrumentation that reducing scheduling latencies directly translates to reduced foreground latencies. They're primarily felt when straddling goroutines, typically around RPC boundaries (request/response handling goroutines); the effects are multiplicative for statements that issue multiple requests.

The controller uses fixed deltas for adjustments, adjusting down a bit more aggressively than adjusting up. This is due to the nature of the work being paced: we care more about quickly introducing a ceiling than about staying near it (though experimentally we're able to stay near it just fine). It adjusts upwards only when seeing a reasonably high % of utilization of the allotted CPU quota (assuming it's under the p99 target). The adjustments are small to reduce {over,under}shoot and controller instability, at the cost of being somewhat dampened. We use a smoothed form of the p99 latency captures to add stability to the controller input, which consequently affects the controller output. We use a relatively low frequency when sampling scheduler latencies: since the p99 is computed off of histogram data, we saw a lot more jaggedness when taking p99s over a smaller set of scheduler events (every 50ms, for example) compared to computing p99s over a larger set (every 2500ms). This, combined with the small deltas used for adjustments, can make for a dampened response, but assuming a stable-ish foreground CPU load against a node, it works fine. The controller output is limited to a well-defined range that can be tuned through cluster settings.

---

Miscellaneous code details: to evaluate the overhead of checking against ElasticCPUHandle.OverLimit in a tight loop within MVCCExportToSST, we used the benchmark below. Under the hood, the handle does a simple estimation of per-iteration running time to avoid calling grunning.Time() frequently; not doing so caused a 5% slowdown in the same benchmark.
$ dev bench pkg/storage \
    --filter BenchmarkMVCCExportToSST/useElasticCPUHandle --count 10 \
    --timeout 20m -v --stream-output --ignore-cache 2>&1 | tee bench.txt
$ for flavor in useElasticCPUHandle=true useElasticCPUHandle=false
  do
    grep -E "${flavor}[^0-9]+" bench.txt | sed -E "s/${flavor}+/X/" > $flavor.txt
  done

# goos: linux
# goarch: amd64
# cpu: Intel(R) Xeon(R) CPU @ 2.20GHz
$ benchstat useElasticCPUHandle\={false,true}.txt
name                old time/op  new time/op  delta
MVCCExportToSST/X   2.54s ± 2%   2.53s ± 2%    ~    (p=0.549 n=10+9)

The tests for SchedulerLatencyListener show graphically how the elastic CPU controller behaves in response to various terms in the control loop (delta, multiplicative factor, smoothing constant, etc.) -- see the snippet below for an example.

# With more lag (first half of the graph), we're more likely to
# observe a large difference between the set-point we need to hit
# and the utilization we currently have, making for larger
# scheduling latency fluctuations (i.e. an ineffective controller).
plot width=70 height=20
----
[two ASCII plots elided, garbled in this capture: "p99 scheduler latencies (μs)" and "elastic cpu utilization and limit (%)"]
----

[1]: Specifically, the time between a goroutine being ready to run and when it's scheduled to do so by the Go scheduler.

Release note: None

Release justification: Non-production code

87561: backupccl: backup/restore with user-defined functions r=chengxiong-ruan a=chengxiong-ruan

Fixes cockroachdb#84087

With user-defined functions introduced, backup/restore needs to work with the new function descriptors and schema descriptors containing function signatures. This commit adds logic to the current backup/restore infrastructure to make sure function descriptors are properly backed up and restored.

Release note (enterprise change): backup/restore can now back up and restore user-defined function descriptors at the database and cluster level.

Release justification: necessary fix to backup/restore to make sure it works with user-defined functions.
88010: bazel: add explanatory top-level comment to .bazelrc r=healthy-pod a=rickystewart

People sometimes find themselves getting confused about the different `bazelrc` files. Hopefully this comment will help avoid that.

Release note: None

Co-authored-by: irfan sharif <[email protected]>
Co-authored-by: Chengxiong Ruan <[email protected]>
Co-authored-by: Ricky Stewart <[email protected]>