kvstreamer: optimize singleRangeBatch.Less #82382

yuzefovich · 2022-06-02T23:34:37Z

bench: add benchmarks for lookup joins

This commit adds benchmarks for lookup joins, both when equality columns
are and are not key, both with and without maintaining ordering.

Release note: None

kvstreamer: optimize singleRangeBatch.Less

This commit optimizes singleRangeBatch.Less method which is used when
sorting the requests inside of these objects in the OutOfOrder mode
(which is needed to get the low-level Pebble speedups) by storing the
start keys explicitly instead of performing a couple of function calls
on each Less invocation.

name                             old time/op    new time/op    delta
IndexJoin/Cockroach-24             6.30ms ± 1%    5.78ms ± 1%  -8.31%  (p=0.000 n=10+10)
IndexJoin/MultinodeCockroach-24    8.01ms ± 1%    7.51ms ± 1%  -6.28%  (p=0.000 n=10+10)

name                             old alloc/op   new alloc/op   delta
IndexJoin/Cockroach-24             1.55MB ± 0%    1.57MB ± 0%  +0.98%  (p=0.000 n=9+10)
IndexJoin/MultinodeCockroach-24    2.28MB ± 2%    2.30MB ± 1%    ~     (p=0.400 n=10+9)

name                             old allocs/op  new allocs/op  delta
IndexJoin/Cockroach-24              8.16k ± 1%     8.13k ± 1%    ~     (p=0.160 n=10+10)
IndexJoin/MultinodeCockroach-24     12.7k ± 1%     12.6k ± 0%    ~     (p=0.128 n=10+9)

name                                                    old time/op    new time/op    delta
LookupJoinEqColsAreKeyNoOrdering/Cockroach-24             6.89ms ± 1%    6.43ms ± 1%  -6.65%  (p=0.000 n=10+10)
LookupJoinEqColsAreKeyNoOrdering/MultinodeCockroach-24    8.03ms ± 1%    7.48ms ± 2%  -6.92%  (p=0.000 n=10+10)
LookupJoinNoOrdering/Cockroach-24                         9.21ms ± 3%    8.82ms ± 5%  -4.23%  (p=0.007 n=10+10)
LookupJoinNoOrdering/MultinodeCockroach-24                11.9ms ± 3%    11.5ms ± 3%  -3.36%  (p=0.002 n=9+10)

name                                                    old alloc/op   new alloc/op   delta
LookupJoinEqColsAreKeyNoOrdering/Cockroach-24             1.81MB ± 1%    1.84MB ± 0%  +1.23%  (p=0.000 n=10+10)
LookupJoinEqColsAreKeyNoOrdering/MultinodeCockroach-24    2.50MB ± 2%    2.54MB ± 1%  +1.76%  (p=0.004 n=10+10)
LookupJoinNoOrdering/Cockroach-24                         1.89MB ± 0%    1.91MB ± 1%  +1.09%  (p=0.000 n=9+9)
LookupJoinNoOrdering/MultinodeCockroach-24                2.37MB ± 2%    2.42MB ± 4%  +1.85%  (p=0.010 n=10+9)

name                                                    old allocs/op  new allocs/op  delta
LookupJoinEqColsAreKeyNoOrdering/Cockroach-24              10.8k ± 0%     10.8k ± 1%    ~     (p=0.615 n=10+10)
LookupJoinEqColsAreKeyNoOrdering/MultinodeCockroach-24     15.1k ± 1%     15.0k ± 0%    ~     (p=0.101 n=10+10)
LookupJoinNoOrdering/Cockroach-24                          13.3k ± 1%     13.3k ± 1%    ~     (p=0.549 n=10+9)
LookupJoinNoOrdering/MultinodeCockroach-24                 17.3k ± 1%     17.3k ± 1%    ~     (p=0.460 n=10+8)

Addresses: #82159

Release note: None

cockroach-teamcity · 2022-06-02T23:34:45Z

This change is

rharding6373

The performance - memory tradeoff seems fine. I have some questions about the benchmarks.

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @michae2 and @yuzefovich)

-- commits line 33 at r2:
What are the results for the new Ordering (as opposed to NoOrdering) benchmarks?

pkg/bench/bench_test.go line 1209 at r1 (raw file):

	defer log.Scope(b).Close(b)
	ForEachDB(b, func(b *testing.B, db *sqlutils.SQLRunner) {
		db.Exec(b, `CREATE TABLE t1 (a INT)`)

Should a be a primary key to meet the "equality columns are key" specification?

pkg/bench/bench_test.go line 1210 at r1 (raw file):

	ForEachDB(b, func(b *testing.B, db *sqlutils.SQLRunner) {
		db.Exec(b, `CREATE TABLE t1 (a INT)`)
		db.Exec(b, `INSERT INTO t1 SELECT generate_series(1, 1000)`)

Even though a is not a primary key in t1, does this insert order make it inherently in order in the storage layer? That seems like it would affect benchmark performance. Maybe that's ok, but it'd be good to be aware of.

Code quote:

SELECT * FROM t1 INNER LOOKUP JOIN t2 ON t1.a = t2.a

pkg/bench/bench_test.go line 1233 at r1 (raw file):

		for i := 0; i < b.N; i++ {
			db.Exec(b, `SELECT count(b) FROM t1 INNER LOOKUP JOIN t2 ON t1.a = t2.a GROUP BY t1.a ORDER BY t1.a`)

Why does this query need the count and GROUP BY?

Code quote:

GROUP BY t1.a

yuzefovich

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @michae2 and @rharding6373)

-- commits line 33 at r2:

Previously, rharding6373 (Rachael Harding) wrote…

What are the results for the new Ordering (as opposed to NoOrdering) benchmarks?

Those didn't change because we don't sort singleRangeBatch.reqs in the InOrder mode, so Less is not used.

pkg/bench/bench_test.go line 1209 at r1 (raw file):

Previously, rharding6373 (Rachael Harding) wrote…

Should a be a primary key to meet the "equality columns are key" specification?

It doesn't really matter as long as the target index is unique. Here is the comment from JoinReaderSpec:

	// If set, the lookup columns form a key in the target table and thus each
	// lookup has at most one result.
	LookupColumnsAreKey bool `protobuf:"varint,8,opt,name=lookup_columns_are_key,json=lookupColumnsAreKey" json:"lookup_columns_are_key"`

pkg/bench/bench_test.go line 1210 at r1 (raw file):

Previously, rharding6373 (Rachael Harding) wrote…

Even though a is not a primary key in t1, does this insert order make it inherently in order in the storage layer? That seems like it would affect benchmark performance. Maybe that's ok, but it'd be good to be aware of.

Hm, I don't think it matters much since the main lookup query will perform the lookup join without any information about the ordering (since no ordering is provided by t1), and the query doesn't have the ORDER BY, so we will internally sort the requests (in the OutOfOrder mode of the streamer) to get pebble-level optimizations.

pkg/bench/bench_test.go line 1233 at r1 (raw file):

Previously, rharding6373 (Rachael Harding) wrote…

Why does this query need the count and GROUP BY?

Good point, this is not needed, removed.

yuzefovich

Made some minor improvements.

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @michae2 and @rharding6373)

pkg/bench/bench_test.go line 1210 at r1 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

Hm, I don't think it matters much since the main lookup query will perform the lookup join without any information about the ordering (since no ordering is provided by t1), and the query doesn't have the ORDER BY, so we will internally sort the requests (in the OutOfOrder mode of the streamer) to get pebble-level optimizations.

Your point matters for the Ordering benchmarks where the streamer (at least currently) doesn't sort requests according to their keys, so with the current version of this PR we'd be benchmarking the best case (when the ordering provided by the streamer's user matches the ordering of keys). I think it's ok though, at least for now.

michae2

I too have some questions about the benchmarks, but no really substantive complaints.

Reviewed 3 of 3 files at r3, 2 of 2 files at r4, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @rharding6373 and @yuzefovich)

pkg/kv/kvclient/kvstreamer/streamer.go line 549 at r4 (raw file):

			} else {
				r.reqsKeys = make([]roachpb.Key, len(r.reqs))
			}

unimportant code golfing side comment: I've been thinking about this common little idiom. Wouldn't append handle this for us? In other words, wouldn't it be a little simpler to write something like:

r.reqsKeys = reqsKeysScratch[:0]
for i := range r.reqs {
	r.reqsKeys = append(r.reqsKeys, r.reqs[i].GetInner().Header().Key)
}
sort.Sort(&r)
reqsKeysScratch = r.reqsKeys
r.reqsKeys = nil

pkg/bench/bench_test.go line 1209 at r3 (raw file):

	defer log.Scope(b).Close(b)
	ForEachDB(b, func(b *testing.B, db *sqlutils.SQLRunner) {
		db.Exec(b, `CREATE TABLE t1 (a INT)`)

Do we need to control the distribution of these tables during the benchmark? What if the KV layer initiates a split of one of them during the benchmark? Should we be explicit about whether this is testing local or distributed execution?

pkg/bench/bench_test.go line 1245 at r3 (raw file):

		db.Exec(b, `CREATE TABLE t1 (a INT)`)
		db.Exec(b, `INSERT INTO t1 SELECT generate_series(1, 1000)`)
		db.Exec(b, `CREATE TABLE t2 (a INT, INDEX (a))`)

Is the only difference between the EqColsAreKey benchmarks and these benchmarks a lookup join into a PK vs a lookup join into a secondary index? Are those different in the execution engine? I would expect them to be the same. Do we need different benchmarks?

pkg/bench/bench_test.go line 1250 at r3 (raw file):

		for i := 0; i < b.N; i++ {
			db.Exec(b, `SELECT t2.a FROM t1 INNER LOOKUP JOIN t2 ON t1.a = t2.a`)

nit: This uses SELECT t2.a instead of SELECT * like the other queries. Seems like it should also use SELECT * to return the same number of columns, for an apples-to-apples comparison.

This commit adds benchmarks for lookup joins, both when equality columns are and are not key, both with and without maintaining ordering. Release note: None

This commit optimizes `singleRangeBatch.Less` method which is used when sorting the requests inside of these objects in the OutOfOrder mode (which is needed to get the low-level Pebble speedups) by storing the start keys explicitly instead of performing a couple of function calls on each `Less` invocation. ``` name old time/op new time/op delta IndexJoin/Cockroach-24 6.30ms ± 1% 5.78ms ± 1% -8.31% (p=0.000 n=10+10) IndexJoin/MultinodeCockroach-24 8.01ms ± 1% 7.51ms ± 1% -6.28% (p=0.000 n=10+10) name old alloc/op new alloc/op delta IndexJoin/Cockroach-24 1.55MB ± 0% 1.57MB ± 0% +0.98% (p=0.000 n=9+10) IndexJoin/MultinodeCockroach-24 2.28MB ± 2% 2.30MB ± 1% ~ (p=0.400 n=10+9) name old allocs/op new allocs/op delta IndexJoin/Cockroach-24 8.16k ± 1% 8.13k ± 1% ~ (p=0.160 n=10+10) IndexJoin/MultinodeCockroach-24 12.7k ± 1% 12.6k ± 0% ~ (p=0.128 n=10+9) ``` ``` name old time/op new time/op delta LookupJoinEqColsAreKeyNoOrdering/Cockroach-24 6.89ms ± 1% 6.43ms ± 1% -6.65% (p=0.000 n=10+10) LookupJoinEqColsAreKeyNoOrdering/MultinodeCockroach-24 8.03ms ± 1% 7.48ms ± 2% -6.92% (p=0.000 n=10+10) LookupJoinNoOrdering/Cockroach-24 9.21ms ± 3% 8.82ms ± 5% -4.23% (p=0.007 n=10+10) LookupJoinNoOrdering/MultinodeCockroach-24 11.9ms ± 3% 11.5ms ± 3% -3.36% (p=0.002 n=9+10) name old alloc/op new alloc/op delta LookupJoinEqColsAreKeyNoOrdering/Cockroach-24 1.81MB ± 1% 1.84MB ± 0% +1.23% (p=0.000 n=10+10) LookupJoinEqColsAreKeyNoOrdering/MultinodeCockroach-24 2.50MB ± 2% 2.54MB ± 1% +1.76% (p=0.004 n=10+10) LookupJoinNoOrdering/Cockroach-24 1.89MB ± 0% 1.91MB ± 1% +1.09% (p=0.000 n=9+9) LookupJoinNoOrdering/MultinodeCockroach-24 2.37MB ± 2% 2.42MB ± 4% +1.85% (p=0.010 n=10+9) name old allocs/op new allocs/op delta LookupJoinEqColsAreKeyNoOrdering/Cockroach-24 10.8k ± 0% 10.8k ± 1% ~ (p=0.615 n=10+10) LookupJoinEqColsAreKeyNoOrdering/MultinodeCockroach-24 15.1k ± 1% 15.0k ± 0% ~ (p=0.101 n=10+10) LookupJoinNoOrdering/Cockroach-24 13.3k ± 1% 13.3k ± 1% ~ (p=0.549 n=10+9) LookupJoinNoOrdering/MultinodeCockroach-24 17.3k ± 1% 17.3k ± 1% ~ (p=0.460 n=10+8) ``` Release note: None

yuzefovich

TFTRs!

I think Rachael is onboard with the change too, so I'll go ahead and merge.

bors r+

Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @michae2 and @rharding6373)

pkg/kv/kvclient/kvstreamer/streamer.go line 549 at r4 (raw file):

Previously, michae2 (Michael Erickson) wrote…

unimportant code golfing side comment: I've been thinking about this common little idiom. Wouldn't append handle this for us? In other words, wouldn't it be a little simpler to write something like:
r.reqsKeys = reqsKeysScratch[:0]
for i := range r.reqs {
	r.reqsKeys = append(r.reqsKeys, r.reqs[i].GetInner().Header().Key)
}
sort.Sort(&r)
reqsKeysScratch = r.reqsKeys
r.reqsKeys = nil

Good point, I like it, done.

pkg/bench/bench_test.go line 1209 at r3 (raw file):

Previously, michae2 (Michael Erickson) wrote…

Do we need to control the distribution of these tables during the benchmark? What if the KV layer initiates a split of one of them during the benchmark? Should we be explicit about whether this is testing local or distributed execution?

These queries follow the pattern of this file where we insert small amount of data, so we know it'll be a single range setup. I think that's ok since the goal of the benchmarks is the "speed of light" check. It's highly unlikely that the KV will split any of these tables into separate ranges since they are just so small, and the benchmarks don't run long enough so that QPS based splitting would kick in.

pkg/bench/bench_test.go line 1245 at r3 (raw file):

Previously, michae2 (Michael Erickson) wrote…

Is the only difference between the EqColsAreKey benchmarks and these benchmarks a lookup join into a PK vs a lookup join into a secondary index? Are those different in the execution engine? I would expect them to be the same. Do we need different benchmarks?

The important difference is a lookup join into a unique index (which PK is) vs a lookup join into a non-unique index. The uniqueness property guarantees that a single lookup will produce at most one row, and in the execution engine we use different strategies because of this property. In particular, in the non-streamer code path we parallelize KV batches IFF we have this "at most single row lookup" property. This is the case with the index joins as well as lookup joins when equality columns form a key. In the streamer code-path, the behavior is also different because we propagate this property as Hints.SingleRowLookup hint, and the streamer can simplify some of the bookkeeping if that is true.

michae2

Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @michae2, @rharding6373, and @yuzefovich)

pkg/bench/bench_test.go line 1209 at r3 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

These queries follow the pattern of this file where we insert small amount of data, so we know it'll be a single range setup. I think that's ok since the goal of the benchmarks is the "speed of light" check. It's highly unlikely that the KV will split any of these tables into separate ranges since they are just so small, and the benchmarks don't run long enough so that QPS based splitting would kick in.

Makes sense!

pkg/bench/bench_test.go line 1245 at r3 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

The important difference is a lookup join into a unique index (which PK is) vs a lookup join into a non-unique index. The uniqueness property guarantees that a single lookup will produce at most one row, and in the execution engine we use different strategies because of this property. In particular, in the non-streamer code path we parallelize KV batches IFF we have this "at most single row lookup" property. This is the case with the index joins as well as lookup joins when equality columns form a key. In the streamer code-path, the behavior is also different because we propagate this property as Hints.SingleRowLookup hint, and the streamer can simplify some of the bookkeeping if that is true.

Ahhh, right, uniqueness! Thank you!

craig · 2022-06-11T00:26:07Z

Build succeeded:

GitHub CI (Cockroach)

yuzefovich requested review from rharding6373, michae2 and a team June 2, 2022 23:34

rharding6373 requested changes Jun 3, 2022

View reviewed changes

yuzefovich force-pushed the lookup-join-benchmarks branch from 2fbfafc to 229ddd1 Compare June 4, 2022 05:06

yuzefovich commented Jun 4, 2022

View reviewed changes

yuzefovich force-pushed the lookup-join-benchmarks branch from 229ddd1 to 9bd3042 Compare June 7, 2022 00:27

yuzefovich commented Jun 7, 2022

View reviewed changes

yuzefovich mentioned this pull request Jun 9, 2022

kvstreamer: remove temporary allocations for []Result #82387

Merged

michae2 approved these changes Jun 10, 2022

View reviewed changes

yuzefovich added 2 commits June 10, 2022 14:23

bench: add benchmarks for lookup joins

277be17

This commit adds benchmarks for lookup joins, both when equality columns are and are not key, both with and without maintaining ordering. Release note: None

yuzefovich force-pushed the lookup-join-benchmarks branch from 9bd3042 to 76175be Compare June 10, 2022 21:23

yuzefovich commented Jun 10, 2022

View reviewed changes

michae2 reviewed Jun 10, 2022

View reviewed changes

craig bot merged commit f2cc0b3 into cockroachdb:master Jun 11, 2022

yuzefovich deleted the lookup-join-benchmarks branch June 11, 2022 00:28

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

kvstreamer: optimize singleRangeBatch.Less #82382

kvstreamer: optimize singleRangeBatch.Less #82382

yuzefovich commented Jun 2, 2022

cockroach-teamcity commented Jun 2, 2022

rharding6373 left a comment

yuzefovich left a comment •

edited

Loading

yuzefovich left a comment

michae2 left a comment

yuzefovich left a comment

michae2 left a comment

craig bot commented Jun 11, 2022

kvstreamer: optimize singleRangeBatch.Less #82382

kvstreamer: optimize singleRangeBatch.Less #82382

Conversation

yuzefovich commented Jun 2, 2022

cockroach-teamcity commented Jun 2, 2022

rharding6373 left a comment

Choose a reason for hiding this comment

yuzefovich left a comment • edited Loading

Choose a reason for hiding this comment

yuzefovich left a comment

Choose a reason for hiding this comment

michae2 left a comment

Choose a reason for hiding this comment

yuzefovich left a comment

Choose a reason for hiding this comment

michae2 left a comment

Choose a reason for hiding this comment

craig bot commented Jun 11, 2022

yuzefovich left a comment •

edited

Loading