kv: BatchRequest count used as QPS metric for rebalancing, not Request count #50620
I'd argue we should go even further, and measure the number of values returned (or maybe even the number of bytes). A large scan is more expensive than a get, which should be reflected in the load metric.

We've built a linear cost model elsewhere. We could consider leveraging that same model for load balancing considerations.

I'm guessing you're referring to the query optimizer? I don't think we necessarily need to model the cost here, we can just measure it, since these heuristics will necessarily be reactive anyway, so it's not immediately clear to me how it'd apply. But I wrote up a broader issue for this in #69364, maybe you could add a bit more detail there?
No, I'm talking about the kvserver model used to calculate tenant costs. See `cockroach/pkg/multitenant/tenantcostmodel/model.go`, lines 15 to 20 at commit 8f82389.
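The tenant cost model mentioned above is, in essence, a linear function over request counts and payload sizes. A minimal sketch of that idea is below; the struct, field names, and coefficient values are illustrative assumptions, not the actual `tenantcostmodel` definitions.

```go
package main

import "fmt"

// RequestCostModel assigns a cost to a batch as a linear function of its
// request count and payload bytes. Names and coefficients are illustrative
// stand-ins, not the real tenantcostmodel API.
type RequestCostModel struct {
	PerBatch   float64 // fixed cost per BatchRequest
	PerRequest float64 // cost per individual request in the batch
	PerByte    float64 // cost per byte read or written
}

// BatchCost computes the modeled cost of one batch.
func (m RequestCostModel) BatchCost(numRequests int, numBytes int64) float64 {
	return m.PerBatch + m.PerRequest*float64(numRequests) + m.PerByte*float64(numBytes)
}

func main() {
	m := RequestCostModel{PerBatch: 1, PerRequest: 0.5, PerByte: 0.25}
	// A single get vs. a 100-scan batch reading 4 KiB: a pure per-batch
	// QPS metric would treat these as equal load.
	fmt.Println(m.BatchCost(1, 0))      // 1.5
	fmt.Println(m.BatchCost(100, 4096)) // 1075
}
```

Under such a model, the two batches above differ by almost three orders of magnitude in cost, even though both count as one "query" under a per-batch QPS metric.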
In a recent customer report, we found that load-based rebalancing was failing to properly balance leaseholders, resulting in imbalanced load. After some deep investigation, @yuzefovich and @tbg found that the problem could be traced back to lookup joins. Specifically, one of the nodes was being tasked with running all of the lookup joins in the workload. The question is: why wasn't this picked up by either the `hotranges` report or by load-based rebalancing?

After an in-person discussion, we believe this is because both of these sources rely on the `leaseholderStats` on each replica. These statistics only track the number of BatchRequests evaluated on a leaseholder, not the number of individual requests (see `cockroach/pkg/kv/kvserver/replica_send.go`, lines 54 to 56 at commit 96db1b3).
The hypothesis is that if this line were changed to `r.leaseholderStats.recordCount(len(ba.Requests), ba.Header.GatewayNodeID)`, load-based lease rebalancing would have avoided this issue. This is because (in v19.2) lookup joins issue batches of 100 scans. These batches would only be counted once towards a range's QPS for load-balancing purposes, but would place 100x the load on the leaseholder evaluating them. We should test that hypothesis.
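The effect of the proposed change can be illustrated with a toy QPS counter. The `batch` type and both functions below are simplified stand-ins for the replica stats machinery, not the actual kvserver code.

```go
package main

import "fmt"

// batch is a simplified stand-in for a kv BatchRequest.
type batch struct {
	requests int // number of individual requests in the batch
}

// qpsPerBatch counts each BatchRequest once, mirroring the current
// behavior described in the issue.
func qpsPerBatch(batches []batch) int {
	return len(batches)
}

// qpsPerRequest counts every individual request, mirroring the proposed
// recordCount(len(ba.Requests), ...) change.
func qpsPerRequest(batches []batch) int {
	total := 0
	for _, b := range batches {
		total += b.requests
	}
	return total
}

func main() {
	// One leaseholder serves 10 lookup-join batches of 100 scans each;
	// another serves 10 single-request gets.
	lookupJoins := make([]batch, 10)
	for i := range lookupJoins {
		lookupJoins[i] = batch{requests: 100}
	}
	gets := make([]batch, 10)
	for i := range gets {
		gets[i] = batch{requests: 1}
	}
	// Per-batch QPS sees both ranges as equally loaded; per-request
	// QPS reveals the 100x difference.
	fmt.Println(qpsPerBatch(lookupJoins), qpsPerBatch(gets))     // 10 10
	fmt.Println(qpsPerRequest(lookupJoins), qpsPerRequest(gets)) // 1000 10
}
```

In this sketch, the range receiving the lookup joins looks identical to the one receiving gets under per-batch counting, which is exactly how the imbalance would escape the rebalancer's notice.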
We should also determine whether we actually want to make a change here. It's problematic that we don't consider the size of batch requests in these heuristics. However, changing that now could lead to surprising effects in other areas. For instance, it would also weigh follow-the-workload rebalancing in favor of gateways that issue multi-request batches. Do we want that?
Jira issue: CRDB-4109