Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kv: DistSender batch slicing is quadratic with batch size #68536

Closed
nvanbenschoten opened this issue Aug 6, 2021 · 2 comments · Fixed by #82865
Closed

kv: DistSender batch slicing is quadratic with batch size #68536

nvanbenschoten opened this issue Aug 6, 2021 · 2 comments · Fixed by #82865
Assignees
Labels
A-kv-client Relating to the KV client and the KV interface. C-performance Perf of queries or internals. Solution not expected to change functional behavior. T-kv KV Team

Comments

@nvanbenschoten
Copy link
Member

nvanbenschoten commented Aug 6, 2021

CPU profiles like the following demonstrate high CPU utilization in DistSender. The profile is exemplary of the cost of a BatchRequest with a high fanout factor.

Screen Shot 2021-08-06 at 10 30 01 AM

If we look closely, we can see that two of the most pronounced functions are kvcoord.next and kvcoord.truncate. The first of these is called here and the second here. We can also assume that if this was a reverse scan, kvcoord.prev, called here would replace kvcoord.next as a pronounced use of CPU.

The problem is that all three functions defined in pkg/kv/kvclient/kvcoord/batch.go are linear with respect to batch size, and they are called in a loop that iterates over ranges touched by a batch. This means that with a large batch and a high fanout factor (i.e. each request in the batch touches a different range), the cost of these functions is quadratic with respect to the number of requests in the batch. This is fine for small batches, but not great for large batches, like the ones issued by a lookup join.

We should explore what we can do here to make this efficient. Solutions may include micro-optimizing the internals of these functions (with microbenchmarks) or restructuring their use to avoid the quadratic cost.

Jira issue: CRDB-9072

@nvanbenschoten nvanbenschoten added C-performance Perf of queries or internals. Solution not expected to change functional behavior. A-kv-client Relating to the KV client and the KV interface. T-kv KV Team labels Aug 6, 2021
@yuzefovich
Copy link
Member

Possibly related issue #70706.

craig bot pushed a commit that referenced this issue Jun 6, 2022
81845: roachtest: include some cluster configs in test failure reports r=srosenberg a=renatolabs

This commit changes the 'Parameters' section of a test
failure. Previously, it would only include the contents of the `TAGS`
and `GOFLAGS` environment variables. It now adds cluster configuration
values obtained from the `ClusterSpec`. These parameters are prefixed
with `ROACHTEST` (e.g., `ROACHTEST_cloud`).

For now, just the cloud name, number of CPUs and SSDs are included. In
the future, we can expand this work to support other relevant cluster
configuration values as needed to debug test failures.

Resolves: #80799.

Release note: None.

81934: kvcoord: optimize Truncate r=yuzefovich a=yuzefovich

**kvcoord: introduce benchmark for Truncate function**

Release note: None

**kvcoord: optimize Truncate**

This commit optimizes the `Truncate` function by accessing the header of
the request directly (instead of having to construct `roachpb.Span`) as
well as keeping track of whether the request has been changed (instead
of performing key comparison later).
```
name                              old time/op    new time/op    delta
Truncate/reqs=32/type=get-24        3.81µs ± 4%    3.42µs ± 1%  -10.32%  (p=0.000 n=9+9)
Truncate/reqs=32/type=scan-24       6.35µs ±11%    5.81µs ±11%   -8.56%  (p=0.019 n=10+10)
Truncate/reqs=1024/type=get-24      98.9µs ± 2%    85.8µs ± 1%  -13.24%  (p=0.000 n=10+10)
Truncate/reqs=1024/type=scan-24      176µs ± 2%     155µs ± 2%  -12.17%  (p=0.000 n=10+10)
Truncate/reqs=32768/type=get-24     3.23ms ± 1%    2.78ms ± 0%  -13.86%  (p=0.000 n=10+9)
Truncate/reqs=32768/type=scan-24    5.64ms ± 1%    5.02ms ± 0%  -11.01%  (p=0.000 n=10+9)

name                              old alloc/op   new alloc/op   delta
Truncate/reqs=32/type=get-24          744B ± 0%      744B ± 0%     ~     (all equal)
Truncate/reqs=32/type=scan-24       2.08kB ±37%    2.05kB ±36%     ~     (p=0.897 n=10+10)
Truncate/reqs=1024/type=get-24      24.6kB ± 0%    24.6kB ± 0%     ~     (all equal)
Truncate/reqs=1024/type=scan-24     73.2kB ± 2%    73.7kB ± 2%     ~     (p=0.167 n=9+9)
Truncate/reqs=32768/type=get-24     1.21MB ± 0%    1.21MB ± 0%     ~     (p=0.370 n=10+10)
Truncate/reqs=32768/type=scan-24    2.82MB ± 0%    2.84MB ± 0%   +0.48%  (p=0.000 n=10+10)

name                              old allocs/op  new allocs/op  delta
Truncate/reqs=32/type=get-24          10.0 ± 0%      10.0 ± 0%     ~     (all equal)
Truncate/reqs=32/type=scan-24         33.6 ±25%      34.8 ±26%     ~     (p=0.646 n=10+10)
Truncate/reqs=1024/type=get-24        20.0 ± 0%      20.0 ± 0%     ~     (all equal)
Truncate/reqs=1024/type=scan-24        692 ± 5%       705 ± 6%     ~     (p=0.167 n=9+9)
Truncate/reqs=32768/type=get-24       40.0 ± 0%      40.0 ± 0%     ~     (all equal)
Truncate/reqs=32768/type=scan-24     21.6k ± 2%     22.0k ± 1%   +1.73%  (p=0.000 n=10+10)
```

Addresses: #68536.

Release note: None

81938: kvcoord: don't construct roachpb.Spans needlessly in several spots r=yuzefovich a=yuzefovich

This should be a minor performance improvement, but I don't know of good
benchmarks to verify that, yet the change doesn't seem controversial
either.

Release note: None

Co-authored-by: Renato Costa <[email protected]>
Co-authored-by: Yahor Yuzefovich <[email protected]>
@yuzefovich yuzefovich self-assigned this Jun 11, 2022
@blathers-crl blathers-crl bot added the T-sql-queries SQL Queries Team label Jun 11, 2022
@yuzefovich
Copy link
Member

I have thought about this problem for some time, and I think the best solution is to sort the requests upfront (because it allows us to only compare the range boundaries against a contiguous subset of all requests), and then keep track of the fully-processed requests (in order to skip key comparisons altogether) as well as the yet-to-be-processed suffixes of other requests such that the ordering of requests is maintained across the truncation. This solution is implemented in #82865.

In terms of time complexity, the existing qudratic approach is O(N * R) where N is the number of original requests and R is the number of ranges those requests touch. The proposed solution is actually worse in the absolute worse case - O(N * (log N + R)) because of the need to sort the requests up front. However, in practice, it should be much faster and behave like O(N * log N + T * R) where T is the average number of requests that intersect with a single range.

@craig craig bot closed this as completed in 90b15cf Jul 6, 2022
@exalate-issue-sync exalate-issue-sync bot removed the T-sql-queries SQL Queries Team label Feb 13, 2023
@mgartner mgartner moved this to Done in SQL Queries Jul 24, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-kv-client Relating to the KV client and the KV interface. C-performance Perf of queries or internals. Solution not expected to change functional behavior. T-kv KV Team
Projects
Archived in project
Development

Successfully merging a pull request may close this issue.

2 participants