Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kv: BatchRequest count used as QPS metric for rebalancing, not Request count #50620

Closed
nvanbenschoten opened this issue Jun 24, 2020 · 5 comments
Assignees
Labels
A-kv-distribution Relating to rebalancing and leasing. C-performance Perf of queries or internals. Solution not expected to change functional behavior. no-issue-activity T-kv KV Team X-stale

Comments

@nvanbenschoten
Copy link
Member

nvanbenschoten commented Jun 24, 2020

In a recent customer report, we found that load-based rebalancing was failing to properly balance leaseholders, resulting in imbalanced load. After some deep investigation, @yuzefovich and @tbg found that the problem could be traced back to lookup joins. Specifically, one of the nodes was being tasked with running all of the lookup joins in the workload. The question is, why wasn't this picked up by either the hotranges report or load-based balancing?

After an in-person discussion, we believe this is because both of these sources rely on the leaseholderStats on each replica. These statistics only track the number of BatchRequests evaluated on a leaseholder, and not the number of individual requests:

if r.leaseholderStats != nil && ba.Header.GatewayNodeID != 0 {
r.leaseholderStats.record(ba.Header.GatewayNodeID)
}

The hypothesis is that if this line was changed to r.leaseholderStats.recordCount(len(ba.Requests), ba.Header.GatewayNodeID), load-based lease rebalancing would have avoided this issue. This is because (in v19.2) lookup joins issue batches of 100 scans. So these batches would only be counted once towards a range's qps for load balancing purposes, but would place 100x the load on the leaseholder evaluating them.

We should test that hypothesis.

We should also determine whether we actually want to make a change here. It's problematic that we don't consider the size of batch requests in these heuristics. However, changing that now could lead to surprising effects in other areas. For instance, it would also weigh follow-the-workload rebalancing in favor of gateways that issue multi-request batches. Do we want that?

Jira issue: CRDB-4109

@nvanbenschoten nvanbenschoten added C-performance Perf of queries or internals. Solution not expected to change functional behavior. A-kv-distribution Relating to rebalancing and leasing. labels Jun 24, 2020
@nvanbenschoten nvanbenschoten self-assigned this Jun 24, 2020
@jlinder jlinder added the T-kv KV Team label Jun 16, 2021
@erikgrinaker
Copy link
Contributor

I'd argue we should go even further, and measure the number of values returned (or maybe even the number of bytes). A large scan is more expensive than a get, which should be reflected in the load metric.

@ajwerner
Copy link
Contributor

We've built a linear cost model elsewhere. We could consider leveraging that same model for load balancing considerations.

@erikgrinaker
Copy link
Contributor

We've built a linear cost model elsewhere. We could consider leveraging that same model for load balancing considerations.

I'm guessing you're referring to the query optimizer? I don't think we necessarily need to model the cost here, we can just measure it since these heuristics will necessarily be reactive anyway, so it's not immediately clear to me how it'd apply. But I wrote up a broader issue for this in #69364, maybe you could add a bit more detail there?

@ajwerner
Copy link
Contributor

I'm guessing you're referring to the query optimizer?

No, I'm talking about the kvserver model used to calculate tenant costs. See

// RU stands for "Request Unit(s)"; the tenant cost model maps tenant activity
// into this abstract unit.
//
// To get an idea of the magnitude of an RU, the cost model was designed so that
// using one CPU for a second costs 1000 RUs.
type RU float64
.

@github-actions
Copy link

We have marked this issue as stale because it has been inactive for
18 months. If this issue is still relevant, removing the stale label
or adding a comment will keep it active. Otherwise, we'll close it in
10 days to keep the issue queue tidy. Thank you for your contribution
to CockroachDB!

@github-actions github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Sep 26, 2023
@github-project-automation github-project-automation bot moved this to Closed in KV Aug 28, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-kv-distribution Relating to rebalancing and leasing. C-performance Perf of queries or internals. Solution not expected to change functional behavior. no-issue-activity T-kv KV Team X-stale
Projects
No open projects
Archived in project
Development

No branches or pull requests

4 participants