
kvserver: additional signals for load-based splitting and rebalancing #69364

Open

erikgrinaker opened this issue Aug 25, 2021 · 0 comments

Labels: A-kv-distribution (Relating to rebalancing and leasing.) · C-enhancement (Solution expected to add code/behavior + preserve backward-compat; pg compat issues are exception) · T-kv (KV Team)


erikgrinaker commented Aug 25, 2021

Currently, I believe load-based splitting and rebalancing decisions are made primarily on the basis of QPS, which essentially means the number of BatchRequests processed per second. This is too naïve -- as a simple example, a scan over 1000 values is generally more expensive than a single-value get, but the two are currently considered equal for the purposes of load-based splitting and rebalancing.
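To make the discrepancy concrete, here's a minimal sketch -- hypothetical types, not the actual kvserver code -- showing how a single-value get and a 1000-key scan count identically under a BatchRequest-based QPS counter:

```go
// Hypothetical illustration only; not CockroachDB's actual code.
package main

import "fmt"

// batchRequest stands in for a KV BatchRequest; keysTouched is the number
// of keys the batch ultimately read or wrote.
type batchRequest struct {
	name        string
	keysTouched int
}

func main() {
	batches := []batchRequest{
		{name: "Get", keysTouched: 1},
		{name: "Scan", keysTouched: 1000},
	}

	var qps, keys int
	for _, br := range batches {
		qps++ // current scheme: one BatchRequest == one "query"
		keys += br.keysTouched
	}
	// Prints "QPS: 2, keys touched: 1001" -- a ~1000x difference in work
	// that the QPS counter cannot see.
	fmt.Printf("QPS: %d, keys touched: %d\n", qps, keys)
}
```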

Some signals that we should consider taking into account, on a per-replica/range basis, are:

  • The number of individual requests processed (instead of the number of batch requests); see also kv: BatchRequest count used as QPS metric for rebalancing, not Request count #50620.

  • The number of keys read/written (I believe the current WPS metrics count keys written, while the QPS metrics count batch requests; these should be harmonized). We have seen cases where a hot range was consuming most of the CPU on a single node but only had a QPS of around 25 -- far below the load-based splitting threshold of 2500 QPS -- which prevented the range from being split and rebalanced. The CPU load turned out to be mostly scans and joins, which are weighted as a single query even though they can read hundreds or thousands of values.

  • The number of bytes read from/written to disk, and/or sent/received in network requests/responses. We have seen cases where a single node was pushing several hundred MB/s over the network, an order of magnitude more than other nodes.

  • Locally scheduled DistSQL processors. We have seen cases where a large cluster scheduled a large number of join processors on a single node containing a hot range, overwhelming the node; because the DistSQL CPU usage was not taken into account, the range was never split or moved.

There are probably many others as well, but this is a starting point; a rough sketch of what combining these signals might look like follows below.
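As an illustration only -- the field names and weights below are invented, not a proposal for the actual implementation -- a per-replica sample combining these signals, with a placeholder weighted score to threshold on instead of raw QPS, might look something like this:

```go
// Hypothetical sketch; not CockroachDB's actual kvserver API.
package main

import "fmt"

type replicaLoadSample struct {
	batchRequests int64 // today's QPS basis
	requests      int64 // individual requests within batches (see #50620)
	keysRead      int64 // scans count every key returned, not one per query
	keysWritten   int64 // today's WPS basis
	diskBytes     int64 // bytes read from and written to disk
	netBytes      int64 // request/response bytes over the network
	distSQLProcs  int64 // DistSQL processors scheduled against the range
}

// score collapses a sample into a single number. The weights are made up;
// calibrating them (or avoiding a single scalar entirely) is the open
// design question in this issue.
func score(s replicaLoadSample) float64 {
	return 1.0*float64(s.requests) +
		0.1*float64(s.keysRead+s.keysWritten) +
		0.001*float64(s.diskBytes+s.netBytes) +
		5.0*float64(s.distSQLProcs)
}

func main() {
	// The hot range from the example above: ~25 QPS, but heavy scan and
	// network load.
	hot := replicaLoadSample{
		batchRequests: 25,
		requests:      25,
		keysRead:      500_000,
		netBytes:      300 << 20, // ~300 MB/s over the network
	}
	fmt.Printf("score: %.0f\n", score(hot)) // far hotter than 25 QPS suggests
}
```

Whether to collapse the signals into a single scalar like this or treat them as independent dimensions (splitting/rebalancing when any one exceeds its threshold) is itself part of the design question here.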

/cc @cockroachdb/kv

Jira issue: CRDB-9561
