kvserver: store cpu rebalancing #95380

kvoli · 2023-01-17T19:32:18Z

Is your feature request related to a problem? Please describe.
Balancing the CPU usage of a store's replicas, rather than QPS, has been shown to improve cluster performance.

This issue is to add CPU based balancing in the allocator, as a replacement for QPS.

Describe the solution you'd like

Using the sum of replicas' CPU on a store, which is added in #92858, instrument CPU balancing using the same policy structure as QPS.

Add an additional kv.allocator.load_based_rebalancing_dimension that supports CPU.
Add an additional CPU field to StoreCapacity that is used when comparing a store to the cluster for balance.
Instrument the thresholds, minimums and logic to use CPU rather than QPS when selected.
Update the storepool UpdateLocalStoreAfterXXX to include CPU.
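
The store-level signal described above can be sketched roughly as follows. This is only an illustration: `replicaLoad`, `storeCPU`, and `clusterMean` are hypothetical names, not the allocator's actual types, and the sketch assumes the per-store value is a plain sum of per-replica CPU that is then compared against the cluster mean.

```go
package main

import "fmt"

// replicaLoad is a hypothetical stand-in for per-replica load statistics.
type replicaLoad struct {
	RangeID           int
	CPUNanosPerSecond float64
}

// storeCPU sums the CPU attributed to every replica on one store.
func storeCPU(replicas []replicaLoad) float64 {
	var total float64
	for _, r := range replicas {
		total += r.CPUNanosPerSecond
	}
	return total
}

// clusterMean returns the mean of the per-store CPU values, keyed by a
// hypothetical store ID. It returns 0 for an empty cluster.
func clusterMean(perStore map[int]float64) float64 {
	if len(perStore) == 0 {
		return 0
	}
	var sum float64
	for _, v := range perStore {
		sum += v
	}
	return sum / float64(len(perStore))
}

func main() {
	perStore := map[int]float64{
		1: storeCPU([]replicaLoad{
			{RangeID: 1, CPUNanosPerSecond: 4e8},
			{RangeID: 2, CPUNanosPerSecond: 2e8},
		}),
		2: storeCPU([]replicaLoad{{RangeID: 3, CPUNanosPerSecond: 1e8}}),
		3: storeCPU([]replicaLoad{{RangeID: 4, CPUNanosPerSecond: 2e8}}),
	}
	fmt.Printf("store 1 CPU: %.0f ns/s, cluster mean: %.0f ns/s\n",
		perStore[1], clusterMean(perStore))
}
```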

Describe alternatives you've considered
One alternative explored was to use the runtime CPU rather than balancing the sum of replica CPU. This is closer to the value we actually care about. However, it is not as "closed" of an objective and requires assumptions regarding the impact of actions taken since we are not able to fully attribute the runtime CPU to replicas.

Additional considerations

One additional consideration is mixed version clusters. Some stores on the prior version will not be populating their new CPU field in store capacity. This change should be version gated to only activate on v23.1 or later.
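
A minimal sketch of the version gate, assuming the gate simply falls back to QPS until the new version is active everywhere. The `objective` type, its constants, and `resolveObjective` are hypothetical names used for illustration only.

```go
package main

import "fmt"

// objective is a hypothetical enum for the load dimension being balanced.
type objective string

const (
	objectiveQPS objective = "qps"
	objectiveCPU objective = "cpu"
)

// resolveObjective returns the objective that can actually be used.
// cpuVersionActive is assumed to be true only once every node runs a
// version that populates the CPU field in store capacity.
func resolveObjective(requested objective, cpuVersionActive bool) objective {
	if requested == objectiveCPU && !cpuVersionActive {
		// Older stores do not report CPU, so comparing against the cluster
		// would be meaningless; keep balancing on QPS.
		return objectiveQPS
	}
	return requested
}

func main() {
	fmt.Println(resolveObjective(objectiveCPU, false))
	fmt.Println(resolveObjective(objectiveCPU, true))
}
```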

Jira issue: CRDB-23493


Previously, loadstats tracked replica raft/request cpu nanos per second separately but returned both summed together in `load.ReplicaLoadStats`. This patch separates `RaftCPUNanosPerSecond` and `RequestCPUNanosPerSecond` in the returned `load.ReplicaLoadStats` so that they may be used independently.

Informs cockroachdb#95380

Release note: None
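
To illustrate the shape of that change, here is a rough sketch. The struct below is a hypothetical stand-in that only borrows the two field names from the commit message; it is not the real `load.ReplicaLoadStats` definition, and `totalCPUNanosPerSecond` is an assumed helper.

```go
package main

import "fmt"

// replicaLoadStats is a hypothetical stand-in for load.ReplicaLoadStats,
// keeping the two CPU components separate instead of pre-summed.
type replicaLoadStats struct {
	RaftCPUNanosPerSecond    float64
	RequestCPUNanosPerSecond float64
}

// totalCPUNanosPerSecond recovers the combined value for callers that
// still want a single number.
func (s replicaLoadStats) totalCPUNanosPerSecond() float64 {
	return s.RaftCPUNanosPerSecond + s.RequestCPUNanosPerSecond
}

func main() {
	s := replicaLoadStats{RaftCPUNanosPerSecond: 2e6, RequestCPUNanosPerSecond: 5e6}
	fmt.Printf("raft=%.0f request=%.0f total=%.0f\n",
		s.RaftCPUNanosPerSecond, s.RequestCPUNanosPerSecond, s.totalCPUNanosPerSecond())
}
```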

This patch instruments the store rebalancer to use store CPU time, as opposed to QPS, when balancing the cluster. It adds `store_cpu` as an option for the existing, now public cluster setting `kv.allocator.load_based_rebalancing_dimension`. When set to `store_cpu` rather than `qps`, the store rebalancer performs a mostly identical function, but targets balancing the sum of all replicas' CPU time on each store rather than QPS. As with QPS, a rebalance threshold controls the aggressiveness of balancing: `kv.allocator.store_cpu_rebalance_threshold`: 0.1

resolves: cockroachdb#95380

Release note (ops change): Add option to balance store cpu time instead of queries per second (qps) by setting `kv.allocator.load_based_rebalancing_dimension='store_cpu'`. `kv.allocator.store_cpu_rebalance_threshold` is also added, similar to `kv.allocator.qps_rebalance_threshold`, to control the target range for store cpu above and below the cluster mean.


…95564 #95583 #95606

90222: server: add api for decommission pre-flight checks r=AlexTalks a=AlexTalks

While we have an API for checking the status of an in-progress decommission, we did not previously have an API to execute sanity checks prior to requesting a node to move into the `DECOMMISSIONING` state. This adds an API to do just that, intended to be called by the CLI prior to issuing a subsequent `Decommission` RPC request.

Fixes #91568.

Release note: None

91458: roachtest/mixed-versions-compat: use corpus for master r=fqazi a=fqazi

Informs: #91350

The CI build scripts take advantage of the branch name to upload the corpus to GCS. Unfortunately, we have no way of knowing if the current build is master inside the roachtest. To address this, this patch supports fetching the master corpus as a fallback.

Release note: None

92826: multiregionccl: add a cold start latency test r=ajwerner a=ajwerner

This commit adds a test which creates an MR serverless cluster and then boots the sql pods in each region while disallowing connectivity to other regions. It also simulates latency to make sure the routing logic works and to provide a somewhat realistic picture of what to expect.

Epic: CRDB-18596

Release note: None

93758: server: evaluate decommission pre-checks r=kvoli a=AlexTalks

This adds support for the evaluation of the decommission readiness of a node (or set of nodes), by simulating their liveness to have the DECOMMISSIONING status and utilizing the allocator to ensure that we are able to perform any actions needed to repair the range. This supports a "strict" mode, in which case we expect all ranges to only need replacement or removal due to the decommissioning status, or a more permissive "non-strict" mode, which allows for other actions needed, as long as they do not encounter errors in finding a suitable allocation target. The non-strict mode allows us to permit situations where a range may have more than one action needed to repair it, such as a range that needs to reach its replication factor before the decommissioning replica can be replaced, or a range that needs to finalize an atomic replication change.

Depends on #94024.

Part of #91568

95007: admission: CPU slot adjustment and utilization metrics r=irfansharif a=sumeerbhola

Our existing metrics are gauges (total and used slots) which don't give us insight into what is happening at smaller time scales. This creates uncertainty when we observe admission queueing but the gauge samples show total slots consistently greater than used slots. Additionally, if total slots is steady during queuing, it doesn't tell us whether that was because of roughly matching increments or decrements, or no increments/decrements.

The following metrics are added:

- admission.granter.slots_exhausted_duration.kv: cumulative duration when the slots were exhausted. This can give insight into how much exhaustion was occurring. It is insufficient to tell us whether 0.5sec/sec of exhaustion is due to a long 500ms of exhaustion and then non-exhaustion or alternating 1ms of exhaustion and non-exhaustion. But this is an improvement over what we have.
- admission.granter.slot_adjuster_{increments,decrements}.kv: Counts the increments and decrements of the total slots.
- admission.granter.cpu_load_{short,long}_period_duration.kv: cumulative duration of short and long ticks, as indicated by the period in the CPULoad callback. We don't expect long period ticks when admission control is active (and we explicitly disable enforcement during long period ticks), but it helps us eliminate some hypotheses during incidents (e.g. long period ticks alternating with short period ticks causing a slow down in how fast we increment slots). Additionally, the sum of the rate of these two, if significantly < 1, would indicate that CPULoad frequency is lower than expected, say due to CPU overload.

Fixes #92673

Epic: none

Release note: None

95145: sql/stats: include partial stats in results of statsCache.GetTableStats r=rytaft a=michae2

We were not including partial stats in the list of table statistics returned by `statsCache.GetTableStats`. This was fine for the optimizer, which currently cannot use partial stats directly, but it was a problem for backup. We'd like to use partial stats directly in the optimizer eventually, so this commit goes ahead and adds them to the results of `GetTableStats`. The optimizer then must filter them out. To streamline this we add some helper functions.

Finally, in an act of overzealous refactoring, this commit also changes `MergedStatistics` and `ForecastTableStatistics` to accept partial statistics and full statistics mixed together in the same input list. This simplifies the code that calls these functions.

Fixes: #95056

Part of: #93983

Epic: CRDB-19449

Release note: None

95387: kvserver: separate loadstats cpu nanos to raft/req r=andrewbaptist a=kvoli

Previously, loadstats tracked replica raft/request cpu nanos per second separately but returned both summed together in `load.ReplicaLoadStats`. This patch separates `RaftCPUNanosPerSecond` and `RequestCPUNanosPerSecond` in the returned `load.ReplicaLoadStats` so that they may be used independently.

Informs #95380

Release note: None

95557: sql: Fix testing_optimizer_disable_rule_probability usage with vtables r=cucaroach a=cucaroach

If a vtable scan query tries to use the dummy "0" column, the exec builder errors out. This typically won't happen thanks to prune columns normalization rules, and those rules are marked as "essential", but the logic allowing those rules to be applied was flawed.

Epic: CRDB-20535

Informs: #94890

Release note: None

95559: rpc: fix comment r=andreimatei a=andreimatei

This copy-pasta comment was mentioning a KV node, which was not right.

Release note: None

Epic: None

95564: cpustopwatch: s/grunning.Difference/grunning.Elapsed r=irfansharif a=irfansharif

`grunning.Elapsed()` is the API to use when measuring the running time spent doing some piece of work, with measurements from the start and end. This only exists due to `grunning.Time()`'s non-monotonicity, a bug in our runtime patch: #95529. The bug results in slight {over,under}-estimation of the running time (the latter breaking monotonicity), but is livable with our current uses of this library, including the one here in cpustopwatch. `grunning.Elapsed()` papers over this bug by 0-ing out when `grunning.Time()`stamps regress. This is unlike `grunning.Difference()` which would return the absolute value of the regression -- not what we want here.

Release note: None

95583: sql: fix cluster setting propagation flake take 2 r=cucaroach a=cucaroach

Previously we tried to fix this with one retry but that was insufficient. Extend it to all queries in this section of the test.

Release note: None

Epic: CRDB-20535

95606: backupccl: deflake TestScheduleChainingLifecycle r=adityamaru a=msbutler

This patch will skip the test if the machine clock is close to midnight, and increases the frequency of incremental backups to run every 2 minutes.

Previously, the backup schedule in this test used the following crontab recurrence: '*/5 * * * *'. In English, this means "run a full backup now, and then run a full backup every day at midnight, and an incremental every 5 minutes." This test also relies on an incremental backup running after the first full backup. But, what happens if the first full backup gets scheduled to run within 5 minutes of midnight? A second full backup may get scheduled before the expected incremental backup, breaking the invariant the test expects.

Fixes #88575 #95186

Release note: None

Co-authored-by: Alex Sarkesian <[email protected]>
Co-authored-by: Faizan Qazi <[email protected]>
Co-authored-by: Andrew Werner <[email protected]>
Co-authored-by: sumeerbhola <[email protected]>
Co-authored-by: Michael Erickson <[email protected]>
Co-authored-by: Austen McClernon <[email protected]>
Co-authored-by: Tommy Reilly <[email protected]>
Co-authored-by: Andrei Matei <[email protected]>
Co-authored-by: irfan sharif <[email protected]>
Co-authored-by: Michael Butler <[email protected]>

This patch allows the store rebalancer to use store CPU time, as opposed to QPS, when balancing the cluster. It adds `store_cpu` as an option for the existing, now public cluster setting `kv.allocator.load_based_rebalancing_dimension`. When set to `store_cpu` rather than `qps`, the store rebalancer performs a mostly identical function, but targets balancing the sum of all replicas' CPU time on each store rather than QPS. As with QPS, a rebalance threshold controls the aggressiveness of balancing: `kv.allocator.store_cpu_rebalance_threshold`: 0.1

resolves: cockroachdb#95380

Release note (ops change): Add option to balance store cpu time instead of queries per second (qps) by setting `kv.allocator.load_based_rebalancing_dimension='store_cpu'`. `kv.allocator.store_cpu_rebalance_threshold` is also added, similar to `kv.allocator.qps_rebalance_threshold`, to control the target range for store cpu above and below the cluster mean.

This patch allows the store rebalancer to use CPU in place of QPS when balancing load on a cluster. It adds `cpu` as an option for the cluster setting `kv.allocator.load_based_rebalancing.objective`. When set to `cpu` rather than `qps`, the store rebalancer performs a mostly identical function, but targets balancing the sum of all replicas' CPU time on each store rather than QPS. The default remains `qps`.

As with QPS, a rebalance threshold controls the range above and below the mean store CPU outside of which a store is considered imbalanced, either overfull or underfull respectively: `kv.allocator.cpu_rebalance_threshold`: 0.1

To handle mixed versions during upgrade, and architectures that do not support the CPU sampling method, a rebalance objective manager is introduced in `rebalance_objective.go`. The manager mediates access to the rebalance objective and overwrites it in cases where the objective set in the cluster setting cannot be supported.

resolves: cockroachdb#95380

Release note (ops change): Add option to balance cpu time (cpu) instead of queries per second (qps) among stores in a cluster. This is done by setting `kv.allocator.load_based_rebalancing.objective='cpu'`. `kv.allocator.cpu_rebalance_threshold` is also added, similar to `kv.allocator.qps_rebalance_threshold`, to control the target range for store cpu above and below the cluster mean.
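
The overfull/underfull classification around the cluster mean might look roughly like the sketch below. It assumes the threshold is applied as a fraction of the mean (so 0.1 means 10% above or below), which the commit message does not spell out; `classify` and the returned labels are hypothetical.

```go
package main

import "fmt"

// classify compares one store's CPU against the cluster mean.
// threshold is assumed to be a fraction of the mean, e.g. 0.1 for +/-10%.
func classify(storeCPU, mean, threshold float64) string {
	switch {
	case storeCPU > mean*(1+threshold):
		return "overfull"
	case storeCPU < mean*(1-threshold):
		return "underfull"
	default:
		return "balanced"
	}
}

func main() {
	const mean, threshold = 100.0, 0.1
	for _, cpu := range []float64{85, 105, 120} {
		fmt.Printf("store cpu=%.0f mean=%.0f -> %s\n", cpu, mean, classify(cpu, mean, threshold))
	}
}
```

Under this assumption, a smaller threshold narrows the band treated as balanced, so more stores are flagged overfull or underfull.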

96031: sql: add mixed version test for system.role_members user ids upgrade r=rafiss a=andyyang890

This patch adds a mixed version logictest that ensures that GRANT ROLE continues to work properly in a cluster with both 22.2 and 23.1 nodes (i.e. nodes that have run the system.role_members user ids upgrade).

Part of #92342

Release note: None

96127: kvserver: introduce cpu rebalancing r=nvanbenschoten a=kvoli

This patch allows the store rebalancer to use CPU in place of QPS when balancing load on a cluster. It adds `cpu` as an option for the cluster setting `kv.allocator.load_based_rebalancing.objective`. When set to `cpu` rather than `qps`, the store rebalancer performs a mostly identical function, but targets balancing the sum of all replicas' CPU time on each store rather than QPS. The default remains `qps`.

As with QPS, a rebalance threshold controls the range above and below the mean store CPU outside of which a store is considered imbalanced, either overfull or underfull respectively: `kv.allocator.cpu_rebalance_threshold`: 0.1

To handle mixed versions during upgrade, and architectures that do not support the CPU sampling method, a rebalance objective manager is introduced in `rebalance_objective.go`. The manager mediates access to the rebalance objective and overwrites it in cases where the objective set in the cluster setting cannot be supported.

The results when using CPU in comparison to QPS can be found [here](https://docs.google.com/document/d/1QLhD20BTamjj3-dSG9F1gW7XMBy9miGPpJpmu2Dn3yo/edit#) (internal).

<details>
<summary>Results Summary</summary>

![image](https://user-images.githubusercontent.com/39606633/215580650-b12ff509-5cf5-4ffa-880d-8387e2ef0afa.png)
![image](https://user-images.githubusercontent.com/39606633/215580626-3d748ba1-e9a4-4abb-8acd-2c319203932e.png)
![image](https://user-images.githubusercontent.com/39606633/215580585-58e6000d-b6cf-430a-b4b7-d14a77eab3bd.png)

</details>

<details>
<summary>Detailed Allocbench Results</summary>

```
kv/r=0/access=skew
master
median cost(gb):05.81 cpu(%):14.97 write(%):37.83
stddev cost(gb):01.87 cpu(%):03.98 write(%):07.01
cpu rebalancing
median cost(gb):08.76 cpu(%):14.42 write(%):36.61
stddev cost(gb):02.66 cpu(%):01.85 write(%):04.80

kv/r=0/ops=skew
master
median cost(gb):06.23 cpu(%):26.05 write(%):57.33
stddev cost(gb):02.92 cpu(%):05.83 write(%):08.20
cpu rebalancing
median cost(gb):04.28 cpu(%):11.45 write(%):31.28
stddev cost(gb):02.25 cpu(%):02.51 write(%):06.68

kv/r=50/ops=skew
master
median cost(gb):04.36 cpu(%):22.84 write(%):48.09
stddev cost(gb):01.12 cpu(%):02.71 write(%):05.51
cpu rebalancing
median cost(gb):04.64 cpu(%):13.49 write(%):43.05
stddev cost(gb):01.07 cpu(%):01.26 write(%):08.58

kv/r=95/access=skew
master
median cost(gb):00.00 cpu(%):09.51 write(%):01.24
stddev cost(gb):00.00 cpu(%):01.74 write(%):00.27
cpu rebalancing
median cost(gb):00.00 cpu(%):05.66 write(%):01.31
stddev cost(gb):00.00 cpu(%):01.56 write(%):00.26

kv/r=95/ops=skew
master
median cost(gb):0.00 cpu(%):47.29 write(%):00.93
stddev cost(gb):0.09 cpu(%):04.30 write(%):00.17
cpu rebalancing
median cost(gb):0.00 cpu(%):08.16 write(%):01.30
stddev cost(gb):0.01 cpu(%):04.59 write(%):00.20
```

</details>

resolves: #95380

Release note (ops change): Add option to balance cpu time (cpu) instead of queries per second (qps) among stores in a cluster. This is done by setting `kv.allocator.load_based_rebalancing.objective='cpu'`. `kv.allocator.cpu_rebalance_threshold` is also added, similar to `kv.allocator.qps_rebalance_threshold`, to control the target range for store cpu above and below the cluster mean.

96440: ui: add execution insights to statement and transaction fingerprint details r=ericharmeling a=ericharmeling

This commit adds execution insights to the Statement Fingerprint and Transaction Fingerprint Details pages.

Part of #83780.

Loom: https://www.loom.com/share/98d2023b672e43fa8016829aa641a829

Note that the SQL queries against the `*_execution_insights` tables are updated to `SELECT DISTINCT ON (*_fingerprint_id, problems, causes)` (equivalent to `GROUP BY (*_fingerprint_id, problems, causes)`) from the latest results in the tables, rather than `row_number() OVER ( PARTITION BY stmt_fingerprint_id, problem, causes ORDER BY end_time DESC ) AS rank... WHERE rank = 1`. Both patterns return the same result, but one uses aggregation and the other uses a window function. I find the `DISTINCT ON/GROUP BY` pattern easier to understand, I'm not seeing much difference in the planning/execution time between the two over the same set of data, and I'm seeing `DISTINCT ON/GROUP BY` coming up as more performant in almost all the secondary sources I've encountered.

Release note (ui change): Added execution insights to the Statement Fingerprint Details and Transaction Fingerprint Details Pages.

96828: collatedstring: support default, C, and POSIX in expressions r=otan a=rafiss

fixes #50734
fixes #95667
informs #57255

---

### collatedstring: create new package

Move the small amount of code from tree/collatedstring.go

---

### collatedstring: support C and POSIX in expressions

Release note (sql change): Expressions of the form `COLLATE "default"`, `COLLATE "C"`, and `COLLATE "POSIX"` are now supported. Since the default collation cannot be changed currently, these expressions are all equivalent. The expressions are evaluated by treating the input as a normal string, and ignoring the collation. This means that comparisons between strings and collated strings that use "default", "C", or "POSIX" are now supported. Creating a column with the "C" or "POSIX" collations is still not supported.

96870: kvserver: use replicasByKey addition func in snapshot path r=tbg a=pavelkalinnikov

This commit makes one step towards better code sharing between `Replica` initialization paths: split trigger and snapshot application. It makes both use the same method to check and insert the initialized `Replica` into the `replicasByKey` map.

Touches #94912

96874: roachtest: run scheduled backup only on clusters with enterprise license r=stevendanna a=msbutler

Epic: none

Release note: None

96883: go.mod: bump Pebble to 829675f94811 r=RaduBerinde a=RaduBerinde

829675f9 db: fix ObsoleteSize stat
2f086b74 db: refactor compaction splitting to reduce key comparisons

Release note: None

Epic: none

Co-authored-by: Andy Yang <[email protected]>
Co-authored-by: Austen McClernon <[email protected]>
Co-authored-by: Eric Harmeling <[email protected]>
Co-authored-by: Rafi Shamim <[email protected]>
Co-authored-by: Pavel Kalinnikov <[email protected]>
Co-authored-by: Michael Butler <[email protected]>
Co-authored-by: Radu Berinde <[email protected]>

kvoli added the C-enhancement (Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)) and A-kv-distribution (Relating to rebalancing and leasing.) labels Jan 17, 2023

kvoli added this to the 23.1 milestone Jan 17, 2023

kvoli self-assigned this Jan 17, 2023

blathers-crl bot added the T-kv KV Team label Jan 17, 2023

kvoli mentioned this issue Jan 17, 2023

kvserver: allocator cpu balancing for overload protection #90582

Closed

13 tasks

kvoli mentioned this issue Jan 17, 2023

kvserver: separate loadstats cpu nanos to raft/req #95387

Merged

kvoli changed the title ~~kvserver: instrument cpu rebalancing~~ kvserver: allocator cpu rebalancing Jan 20, 2023

kvoli changed the title ~~kvserver: allocator cpu rebalancing~~ kvserver: store cpu rebalancing Jan 20, 2023

kvoli mentioned this issue Jan 24, 2023

kvserver: introduce cpu rebalancing #95152

Closed

kvoli mentioned this issue Jan 27, 2023

kvserver: introduce cpu rebalancing #96127

Merged

craig bot closed this as completed in c28ed6b Feb 9, 2023
