…db#67022

66891: admission: improve slot and grant chain heuristics r=sumeerbhola a=sumeerbhola

The existing heuristic was ok with reducing the runnable goroutines while maintaining ~94% CPU utilization under the kv50 overload heuristic. However, examining the behavior at 1ms granularity showed the following:

- Running "perf sched record" and "perf sched map" showed occasional ~10ms intervals of time where ~6 of the 8 CPUs became idle.
- Logging in GrantCoordinator.CPULoad indicated that the immediate termination of a grant chain at every 1ms tick caused the granter to lose control over the runnable goroutines when there was limited KV work (KV work uses slots) but lots of SQL work (SQL work here is shorthand for KV=>SQL response work, which uses tokens).

At every 1ms tick of CPULoad, the previous grant chain would be terminated and a new one started that would admit 64 SQL work units (procs * admission.kv_slot_adjuster.overload_threshold = 8 * 8). The kv50 overload workload has a concurrency of 8192, and within 100ms one can admit 6400 units of this work even though much of it has not completed (since work that uses tokens does not have a termination signal). The runnable count (all numbers here are aggregate across all 8 CPUs) would increase to > 4000, due to which the total slots for KV work would keep getting decreased until they reached the minimum of 1 (since the decreases were not helping to reduce runnable). Eventually runnable would come down by itself as the SQL work items finished, at which point runnable would become 0 since the total slots were still at 1. The total slots would then increase by 1 slot every 1ms until enough runnable had built up to use all CPUs.

There are 2 related changes made here:

- There is a 100ms lag introduced for grant chain termination. A grant chain is terminated only when the oldest attempt to terminate it is > 100ms old. This means the throttling introduced by the grant chain mechanism actually functions, since the same grant chain stays active for longer.
- The default of admission.kv_slot_adjuster.overload_threshold is bumped up to 32 and a grant chain uses this value divided by 4 as a multiplier. This allows a grant chain to still burst with the same burst size as before but ensures that a single burst does not push the runnable count high enough for total slots to start getting decreased.

This did not increase the mean CPU utilization (still ~94%) but there are other improvements based on examining at 1ms intervals. The runnable count rarely becomes < 10. Even when it does, the number of KV slots currently in use is > 200, which suggests that we are running into the limits of what control we can exercise without changing the scheduler (the KV work is probably waiting on IO, which is not observable to the admission control system). Despite a 4x higher admission.kv_slot_adjuster.overload_threshold, which results in total KV slots of ~400, the peak runnable is ~800, when it used to be ~4000.

Some screenshots with `kv50/enc=false/nodes=1/conc=8192` (admission control was turned on between 14:45-14:46)

<img width="783" alt="Screen Shot 2021-06-25 at 10 51 30 AM" src="https://user-images.githubusercontent.com/54990988/123456792-40eaf600-d5b1-11eb-8771-5e08f54dd660.png">
<img width="756" alt="Screen Shot 2021-06-25 at 10 51 51 AM" src="https://user-images.githubusercontent.com/54990988/123459874-d9cf4080-d5b4-11eb-8668-6f5190c4f94c.png">
<img width="787" alt="Screen Shot 2021-06-25 at 10 52 06 AM" src="https://user-images.githubusercontent.com/54990988/123459898-dfc52180-d5b4-11eb-8ce2-379af4912bed.png">
<img width="764" alt="Screen Shot 2021-06-25 at 10 52 24 AM" src="https://user-images.githubusercontent.com/54990988/123459918-e6ec2f80-d5b4-11eb-9ba4-a228df1068d4.png">

Release note: None

66995: roachtest: bump version for 20.2 to 20.2.12 r=tbg a=ajwerner

Part of the release process. Relates to cockroachdb#66627.

Release note: None

67020: AUTHORS: add kpatron r=kpatron-cockroachlabs a=kpatron-cockroachlabs

67022: Update AUTHORS r=ZhouXing19 a=ZhouXing19

Added Jane's name and email

Co-authored-by: sumeerbhola <[email protected]>
Co-authored-by: Andrew Werner <[email protected]>
Co-authored-by: Kyle Patron <[email protected]>
Co-authored-by: Zhou Xing <[email protected]>
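
For readers of 66891, the two changes it describes (the 100ms termination lag and the unchanged burst size under the new default) can be illustrated with a small Go sketch. This is not the code from the admission package: `chainTerminator`, `tryTerminate`, `burstSize`, and `terminationLag` are hypothetical names chosen for illustration, and the real implementation may track termination attempts differently.

```go
package main

import (
	"fmt"
	"time"
)

// terminationLag mirrors the 100ms lag described in the commit message.
// The constant name is an assumption made for this sketch.
const terminationLag = 100 * time.Millisecond

// chainTerminator is a hypothetical helper that remembers the oldest
// outstanding attempt to terminate the current grant chain.
type chainTerminator struct {
	oldestAttempt time.Time // zero value means no attempt is pending
}

// tryTerminate records a termination attempt made at now and reports
// whether the chain should actually be terminated, which happens only
// when the oldest pending attempt is more than terminationLag old.
func (c *chainTerminator) tryTerminate(now time.Time) bool {
	if c.oldestAttempt.IsZero() {
		c.oldestAttempt = now
		return false
	}
	if now.Sub(c.oldestAttempt) > terminationLag {
		c.oldestAttempt = time.Time{} // a new chain starts with no pending attempt
		return true
	}
	return false
}

// burstSize computes procs * (overloadThreshold / divisor), the number of
// work units one grant chain may admit in a burst.
func burstSize(procs, overloadThreshold, divisor int) int {
	return procs * (overloadThreshold / divisor)
}

func main() {
	// Old default: threshold 8 used directly as the multiplier.
	fmt.Println(burstSize(8, 8, 1)) // 64
	// New default: threshold 32, divided by 4 for the grant chain.
	fmt.Println(burstSize(8, 32, 4)) // 64

	// Simulate 1ms ticks and count how many it takes to terminate a chain.
	var c chainTerminator
	start := time.Now()
	for tick := 0; ; tick++ {
		if c.tryTerminate(start.Add(time.Duration(tick) * time.Millisecond)) {
			fmt.Println("terminated at tick", tick)
			break
		}
	}
}
```

With 1ms ticks, the first tick only records the attempt, so in this sketch the chain survives for roughly 100 further ticks instead of being replaced on every tick, while the per-chain burst stays at 64 work units on an 8-CPU machine.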