-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
rowexec: improve LEFT SEMI/ANTI lookup join performance #49790
Comments
Hi @asubiotto, please add a C-ategory label to your issue. Check out the label system docs. While you're here, please consider adding an A- label to help keep our repository tidy. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan. |
CheckExistsRequest CheckExistsRequest is currently implemented with a flag checking for types that don’t require right cols, namely left semi/anti joins. Currently in the kv_batch_fetcher, I set Testing Switching to the inner loop in https://github.com/cockroachdb/cockroach/pull/53669/files#diff-2793c01ca6dd699f8ad1e26f24f700dbR1182 seems to have increased variance, compared to before when I was using -count 100 or so to modify b.N, so despite running faster I don’t know if it’s as desirable for getting statistically significant results? Results with gceworker for some reason tended to have slightly more variance, leading to having very high p-values so unfortunately I didn’t get any worthwhile results on the gceworker after making this change. The results I got prior to making the benchmark changes are shown below:
Along with the corresponding profile call_trees: Master CheckExists Get To force the GetRequest code path in BenchmarkLeftSemiJoin I had to create the foo index as UNIQUE. Overall from the testing, it was a bit difficult to isolate the difference between using Scan vs CheckExists even in a benchmark designed to stress the left semi join. I’m not entirely sure why |
The conclusion here is that it doesn't seem like it'll be a huge performance win, so closing this issue. We can revisit and use this as a starting point in the future. |
Note: throughout this issue I refer to the lookup side (i.e. the rows we're looking up results for) as the left side and the looked up side (i.e. the table we're performing lookups against) as the right side.
LEFT SEMI/ANTI joins are joins where only the values from the left side need to be output. Currently, the
joinReader
creates aScanRequest
for each row in the left side, which returns values unnecessarily from the right side. In an ideal world, we'd only want to check if there is a matching row for a row on the left side. The reason the performance of these cases is important is that they're used for FK checks, performance improvements here could be observed in a variety of workloads.We could do a couple of optimizations:
ScanRequests
. There are two alternatives:GetRequest
s (performance difference is explained here sql: perform point key-value reads where possible #46758). The advantage is that this API already exists in the codebase. The disadvantage is that we'll still be fetching unnecessary values.ExistsRequest
(sql: use the new CheckExistsRequest KV operation for FK existence checks #33472). The advantage is that unneeded values are no longer fetched. The disadvantage is that this API is still a prototype and this will currently be the only use of it. The work will probably need to be done rebased on top of the prototype API and demonstrate performance gains in order to justify adding that API to production. Another advantage usingExistsRequest
opens the door to caching results at the transaction coordinator layer that will cheapen redundant existence checks (this work is out of scope).I think it would be worth it to implement both the
GetRequest
s andExistsRequest
s and benchmark them againstScanRequest
s and each other to get an idea for the performance differences.Some useful benchmarks to see how the performance changes:
pkg/sql/opt/bench/bench_test.go
The text was updated successfully, but these errors were encountered: