-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
kv: DistSender batch slicing is quadratic with batch size #68536
Comments
Possibly related issue #70706. |
81845: roachtest: include some cluster configs in test failure reports r=srosenberg a=renatolabs This commit changes the 'Parameters' section of a test failure. Previously, it would only include the contents of the `TAGS` and `GOFLAGS` environment variables. It now adds cluster configuration values obtained from the `ClusterSpec`. These parameters are prefixed with `ROACHTEST` (e.g., `ROACHTEST_cloud`). For now, just the cloud name, number of CPUs and SSDs are included. In the future, we can expand this work to support other relevant cluster configuration values as needed to debug test failures. Resolves: #80799. Release note: None. 81934: kvcoord: optimize Truncate r=yuzefovich a=yuzefovich **kvcoord: introduce benchmark for Truncate function** Release note: None **kvcoord: optimize Truncate** This commit optimizes the `Truncate` function by accessing the header of the request directly (instead of having to construct `roachpb.Span`) as well as keeping track of whether the request has been changed (instead of performing key comparison later). ``` name old time/op new time/op delta Truncate/reqs=32/type=get-24 3.81µs ± 4% 3.42µs ± 1% -10.32% (p=0.000 n=9+9) Truncate/reqs=32/type=scan-24 6.35µs ±11% 5.81µs ±11% -8.56% (p=0.019 n=10+10) Truncate/reqs=1024/type=get-24 98.9µs ± 2% 85.8µs ± 1% -13.24% (p=0.000 n=10+10) Truncate/reqs=1024/type=scan-24 176µs ± 2% 155µs ± 2% -12.17% (p=0.000 n=10+10) Truncate/reqs=32768/type=get-24 3.23ms ± 1% 2.78ms ± 0% -13.86% (p=0.000 n=10+9) Truncate/reqs=32768/type=scan-24 5.64ms ± 1% 5.02ms ± 0% -11.01% (p=0.000 n=10+9) name old alloc/op new alloc/op delta Truncate/reqs=32/type=get-24 744B ± 0% 744B ± 0% ~ (all equal) Truncate/reqs=32/type=scan-24 2.08kB ±37% 2.05kB ±36% ~ (p=0.897 n=10+10) Truncate/reqs=1024/type=get-24 24.6kB ± 0% 24.6kB ± 0% ~ (all equal) Truncate/reqs=1024/type=scan-24 73.2kB ± 2% 73.7kB ± 2% ~ (p=0.167 n=9+9) Truncate/reqs=32768/type=get-24 1.21MB ± 0% 1.21MB ± 0% ~ (p=0.370 n=10+10) Truncate/reqs=32768/type=scan-24 2.82MB ± 0% 2.84MB ± 0% +0.48% (p=0.000 n=10+10) name old allocs/op new allocs/op delta Truncate/reqs=32/type=get-24 10.0 ± 0% 10.0 ± 0% ~ (all equal) Truncate/reqs=32/type=scan-24 33.6 ±25% 34.8 ±26% ~ (p=0.646 n=10+10) Truncate/reqs=1024/type=get-24 20.0 ± 0% 20.0 ± 0% ~ (all equal) Truncate/reqs=1024/type=scan-24 692 ± 5% 705 ± 6% ~ (p=0.167 n=9+9) Truncate/reqs=32768/type=get-24 40.0 ± 0% 40.0 ± 0% ~ (all equal) Truncate/reqs=32768/type=scan-24 21.6k ± 2% 22.0k ± 1% +1.73% (p=0.000 n=10+10) ``` Addresses: #68536. Release note: None 81938: kvcoord: don't construct roachpb.Spans needlessly in several spots r=yuzefovich a=yuzefovich This should be a minor performance improvement, but I don't know of good benchmarks to verify that, yet the change doesn't seem controversial either. Release note: None Co-authored-by: Renato Costa <[email protected]> Co-authored-by: Yahor Yuzefovich <[email protected]>
I have thought about this problem for some time, and I think the best solution is to sort the requests upfront (because it allows us to only compare the range boundaries against a contiguous subset of all requests), and then keep track of the fully-processed requests (in order to skip key comparisons altogether) as well as the yet-to-be-processed suffixes of other requests such that the ordering of requests is maintained across the truncation. This solution is implemented in #82865. In terms of time complexity, the existing qudratic approach is |
CPU profiles like the following demonstrate high CPU utilization in
DistSender
. The profile is exemplary of the cost of aBatchRequest
with a high fanout factor.If we look closely, we can see that two of the most pronounced functions are
kvcoord.next
andkvcoord.truncate
. The first of these is called here and the second here. We can also assume that if this was a reverse scan,kvcoord.prev
, called here would replacekvcoord.next
as a pronounced use of CPU.The problem is that all three functions defined in
pkg/kv/kvclient/kvcoord/batch.go
are linear with respect to batch size, and they are called in a loop that iterates over ranges touched by a batch. This means that with a large batch and a high fanout factor (i.e. each request in the batch touches a different range), the cost of these functions is quadratic with respect to the number of requests in the batch. This is fine for small batches, but not great for large batches, like the ones issued by a lookup join.We should explore what we can do here to make this efficient. Solutions may include micro-optimizing the internals of these functions (with microbenchmarks) or restructuring their use to avoid the quadratic cost.
Jira issue: CRDB-9072
The text was updated successfully, but these errors were encountered: