kvserver: additional signals for load-based splitting and rebalancing #69364
Labels
A-kv-distribution
Relating to rebalancing and leasing.
C-enhancement
Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)
T-kv
KV Team
Currently, I believe load-based splitting and rebalancing decisions are made primarily on the basis of QPS, which essentially means number of
BatchRequest
s processed per second. This is too naïve -- as a simple example, a scan over 1000 values is generally more expensive than a single-value get, but they're currently considered equal for the purposes of load-based splitting and rebalancing.Some signals that we should consider taking into account, on a per-replica/range basis, are:
The number of individual requests processed (instead of the number of batch requests), see also kv: BatchRequest count used as QPS metric for rebalancing, not Request count #50620.
The number of keys read/written (I believe current WPS metrics count keys written, while QPS metrics count number of batch requests, these should be harmonized). We have seen cases where a hot range was using most of the CPU on a single node, but only had a QPS of around 25 -- far lower than the load-based splitting threshold of 2500 QPS -- which prevented the range from being split and rebalanced. The CPU load turned out to be mostly scans and joins, which are weighed as a single query even though they can read hundreds or thousands of values.
The number of bytes read/written to disk, and/or sent/received in network requests/responses. We have seen cases where a single node was pushing several hundred MB/s over the network, an order of magnitude more than other nodes.
Locally scheduled DistSQL processors. We have seen cases where a large cluster scheduled a large amount of join processors on a single node containing a hot range, overwhelming the node, but because the DistSQL CPU usage was not taken into account the range was never split or moved.
There are probably many others as well, but this is a starting point.
/cc @cockroachdb/kv
Jira issue: CRDB-9561
The text was updated successfully, but these errors were encountered: