
sql: perform point key-value reads where possible #46758

Closed
nvanbenschoten opened this issue Mar 30, 2020 · 9 comments · Fixed by #61583
Labels
A-sql-execution Relating to SQL execution. C-performance Perf of queries or internals. Solution not expected to change functional behavior. O-sre For issues SRE opened or otherwise cares about tracking.

Comments

@nvanbenschoten
Member

The KV API exposes both a ranged ScanRequest read operation and a point GetRequest read operation. Currently, SQL's row.Fetcher only ever uses ScanRequests, even when looking up a discrete set of keys. We should explore updating it to use GetRequests when possible.

For a multitude of reasons, we expect point KV operations to perform marginally better than ranged KV operations, even in cases where the SQL layer knows that the ranged operation operates over a fixed number of keys. Here's a non-exhaustive list of reasons why:

  • MVCCGet, unlike MVCCScan, is able to use a prefix iterator, which allows it to make use of RocksDB/Pebble bloom filters. There may also be other wins at the storage layer, like avoiding prefetching
  • various data structures have optimizations and fast-paths for single-key operations (e.g. timestamp cache, latch manager, refresh span tracking)
  • various code paths are simpler in the point operation case
  • a point operation implies only needing a single key allocation instead of a start and end key allocation
  • a point operation implies only needing to return up to a single result, which means that we can avoid some indirection in various places

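As a rough illustration of what "use GetRequests when possible" could look like on the fetcher side, here is a minimal sketch. The `KV` interface and helpers are hypothetical stand-ins, not CockroachDB's actual client API; the only real idea carried over is that a span whose end key is the start key plus a trailing 0x00 byte covers exactly one key and can be served by a point read.

```go
package kvfetch

import "bytes"

// KV is a hypothetical stand-in for the KV client; the real API (kv.Txn and
// roachpb requests) differs in detail. It only exists to make the dispatch
// idea concrete.
type KV interface {
	Get(key []byte) ([]byte, error)           // point read, i.e. a GetRequest
	Scan(start, end []byte) ([][]byte, error) // ranged read, i.e. a ScanRequest
}

// fetchSpan issues a point read when the span is known to cover exactly one
// key (its end key is the start key followed by a 0x00 byte) and falls back
// to a ranged read otherwise.
func fetchSpan(db KV, start, end []byte) ([][]byte, error) {
	pointEnd := append(append([]byte(nil), start...), 0x00)
	if bytes.Equal(end, pointEnd) {
		val, err := db.Get(start)
		if err != nil || val == nil {
			return nil, err
		}
		return [][]byte{val}, nil
	}
	return db.Scan(start, end)
}
```

In the real fetcher, the equivalent decision would have to happen wherever row.Fetcher turns spans into KV requests.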
We've known that these could add up to a performance win on simple workloads for some time. However, (to my knowledge) we've never actually made any measurements. Inspired by #46752, I hacked up a test using this patch. I then ran kv95 on a single-node cluster with and without the load-gen change that enables the use of GetRequest instead of ScanRequest. This led to the following improvement in throughput:

name  old ops/sec  new ops/sec  delta
kv95   44.9k ± 3%   45.9k ± 2%  +2.16%  (p=0.056 n=6+6)

This seems significant enough to justify further investigation.

I think there's also a similar issue to be filed about DeleteRequest vs DeleteRangeRequest.

cc. @jordanlewis for triage.

@nvanbenschoten nvanbenschoten added C-performance Perf of queries or internals. Solution not expected to change functional behavior. A-sql-execution Relating to SQL execution. labels Mar 30, 2020
@jordanlewis
Member

Thanks for taking a look at this! I did an informal experiment with this a while ago and didn't record my results, but I don't think I saw anything like 2%. We'll put it into the backlog, although there are a few performance improvements (lookup join stuff, removal of extra planning latency) we're planning that will probably take priority.

@nvanbenschoten
Member Author

Thanks Jordan!

@nvanbenschoten
Member Author

I think there's also a similar issue to be filed about DeleteRequest vs DeleteRangeRequest.

I finally got around to writing this up in #53939 after seeing it come up again in TPC-C.

@nvanbenschoten
Member Author

In his work with the separated lock table, @sumeerbhola is introducing another case where point reads will be optimized over ranged reads. With his changes, point reads will be able to use bloom filters when searching the persistent portion of the lock table. This will not be possible for ranged reads.

@sumeerbhola
Collaborator

(Adding to the earlier list of performance benefits.) A scan over a span that contains at most one key, but doesn't know that in advance, needs to iterate over all of that key's versions (or eventually do a second, expensive seek) looking for the non-existent next key in the span. A Get can do a single seek, read the key, and be done.

Regarding the lock table: if locks are rare and the BatchRequest contains a large number of ScanRequests, it may actually be faster than issuing a similar number of GetRequests, since the first scan will seek past all the spans being requested and the later seeks will be optimized, thanks to the Pebble improvements in cockroachdb/pebble#951 and cockroachdb/pebble#947. Still, we can always change a Get into a Scan inside the storage package (or below) if we find a significant performance benefit there, whereas we can't do the reverse.
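To make the first point concrete, here is a toy step-count model. It is purely illustrative: `entry`, `scanSteps`, and `getSteps` are made up for this sketch and are not CockroachDB's MVCC or storage code.

```go
package mvccmodel

// entry is a toy MVCC record: the slice passed to the functions below is
// assumed sorted by key, with all versions of a key stored consecutively.
type entry struct {
	key     string
	version int
}

// scanSteps models a scan over the span [key, key.Next()): after seeking to
// the first entry for key, the scan has to step over every version of key
// until it sees an entry outside the span and learns that it is done.
func scanSteps(sorted []entry, key string) int {
	steps := 0
	for _, e := range sorted {
		if e.key < key {
			continue // the initial seek skips everything before key
		}
		steps++
		if e.key > key {
			break // the first entry past the span ends the scan
		}
	}
	return steps
}

// getSteps models a point get: a single seek lands on the newest version of
// key, and the read is done.
func getSteps(sorted []entry, key string) int {
	for _, e := range sorted {
		if e.key >= key {
			return 1
		}
	}
	return 0
}
```

With many versions of the single key in the span, scanSteps grows with the version count while getSteps stays at one, which is the asymmetry described above.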

@joshimhoff joshimhoff added the O-sre For issues SRE opened or otherwise cares about tracking. label Feb 16, 2021
@jordanlewis
Member

@nvanbenschoten do you have an intuition about whether it would be better, in the case of a fetch of a row with 2 (or 3 or 4 or n) column families, to issue n GetRequests, one per family, or 1 ScanRequest?

@nvanbenschoten
Member Author

@nvanbenschoten do you have an intuition about whether it would be better, in the case of a fetch of a row with 2 (or 3 or 4 or n) column families, to issue n GetRequests, one per family, or 1 ScanRequest?

This is an interesting question! I don't have much intuition about which would be better or where the crossover point lies. Let's run some experiments and let that guide our decision here.
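For concreteness, the two shapes being compared look roughly like this. The `request` struct, key layout, and helper names are illustrative only; real CockroachDB keys and roachpb requests are encoded quite differently.

```go
package cfsketch

import "fmt"

// request is a simplified stand-in for a KV read request.
type request struct {
	kind   string // "get" or "scan"
	key    string
	endKey string // only set for scans
}

// perFamilyGets issues one point read per column family of the row.
func perFamilyGets(rowPrefix string, numFamilies int) []request {
	reqs := make([]request, 0, numFamilies)
	for famID := 0; famID < numFamilies; famID++ {
		reqs = append(reqs, request{kind: "get", key: fmt.Sprintf("%s/%d", rowPrefix, famID)})
	}
	return reqs
}

// wholeRowScan issues a single ranged read covering every family of the row.
func wholeRowScan(rowPrefix string) []request {
	// "<prefix>/" .. "<prefix>0" brackets every "<prefix>/..." key here,
	// standing in for what would be PrefixEnd() in the real key space.
	return []request{{kind: "scan", key: rowPrefix + "/", endKey: rowPrefix + "0"}}
}
```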

@jordanlewis
Member

While I've got your attention, do you have an idea about the workload that would be most likely to showcase a difference between GetRequest and ScanRequest? I've been playing with KV and have seen nearly no noticeable difference in performance.

@nvanbenschoten
Member Author

You've been playing with an empty database, right? That might be contributing, as a major benefit of point reads is that they can use bloom filters to reduce read amplification once an LSM fills out. This effect will also be more pronounced once the entire data set no longer fits in RAM. I know it's slow, but consider throwing a few iterations of tpccbench/nodes=3/cpu=4 at both variants and letting it run overnight.

@craig craig bot closed this as completed in 2e60515 Apr 14, 2021