
colexec: some optimizations #47942

Merged: 3 commits merged into cockroachdb:master on Apr 28, 2020

Conversation

@yuzefovich (Member) commented Apr 22, 2020

colexec: remove one of the Go maps from hash aggregator

This commit switches the usage of a map to iteration over a []uint64 when
building selection vectors in the hash aggregator. This is a lot more
efficient when group sizes are relatively large, with a moderate hit when
group sizes are small. That hit is reduced in a follow-up commit.

Release note: None
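To illustrate the idea, here is a minimal, hypothetical Go sketch (not the actual colexec code; only the field name hashCodeForSelsSlot mirrors the snippet quoted in the review below). Each tuple's slot is found by a linear scan over the distinct hash codes seen so far, which is cheap when a batch contains few distinct codes (large groups) and costlier when it contains many (small groups):

```go
package main

import "fmt"

// populateSels groups tuple indices by hash code. Instead of a
// map[uint64][]int, it keeps the distinct hash codes seen so far in a
// plain slice and finds the matching slot with a linear scan.
func populateSels(hashBuffer []uint64) (hashCodeForSelsSlot []uint64, sels [][]int) {
	for selIdx, hashCode := range hashBuffer {
		selsSlot := -1
		for slot, hash := range hashCodeForSelsSlot {
			if hash == hashCode {
				selsSlot = slot
				break
			}
		}
		if selsSlot == -1 {
			// First time we see this hash code: open a new slot.
			hashCodeForSelsSlot = append(hashCodeForSelsSlot, hashCode)
			sels = append(sels, nil)
			selsSlot = len(sels) - 1
		}
		sels[selsSlot] = append(sels[selsSlot], selIdx)
	}
	return hashCodeForSelsSlot, sels
}

func main() {
	hashCodes, sels := populateSels([]uint64{7, 3, 7, 7, 3})
	fmt.Println(hashCodes, sels) // [7 3] [[0 2 3] [1 4]]
}
```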

colexec: more improvements to hash aggregator

This commit removes the buffering stage of the hash aggregator as well
as the "append only" scratch batch that we're currently using. Removing
the buffering stage allows us to use smaller buffers without sacrificing
performance. Removing the scratch batch lets us avoid copying the data
out of the input batch and instead use that batch directly. We will be
destructively modifying the selection vector on that batch, but such
behavior is acceptable because the hash aggregator owns its output
batch, and the input batch will not be propagated further.

This commit also bumps hashAggFuncsAllocSize from 16 to 64, which gives
us a minor performance improvement in the case of small group sizes.

Release note: None
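As a rough illustration of processing the input batch in place (a toy sketch with a hypothetical batch type, not the real coldata.Batch API), the operator below rewrites the batch's selection vector destructively instead of copying surviving rows into a scratch batch; this is only safe because the operator owns its own output and the input batch is not handed further downstream:

```go
package main

import "fmt"

// batch is a deliberately tiny stand-in for a columnar batch: col holds
// the physical values and sel is the selection vector of logical rows.
type batch struct {
	col []int64
	sel []int
}

// filterEven keeps only the selected rows whose value is even. It
// rewrites b.sel in place (reusing its backing array) rather than
// copying the surviving rows into a separate scratch batch.
func filterEven(b *batch) {
	kept := b.sel[:0]
	for _, rowIdx := range b.sel {
		if b.col[rowIdx]%2 == 0 {
			kept = append(kept, rowIdx)
		}
	}
	b.sel = kept
}

func main() {
	b := &batch{col: []int64{1, 2, 3, 4}, sel: []int{0, 1, 2, 3}}
	filterEven(b)
	fmt.Println(b.sel) // [1 3]
}
```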

colexec: remove some allocations

In a recent PR (for logical types plumbing) I introduced some
unnecessary allocations in the unhandled type case by taking a pointer
to a value in a []types.T slice. This commit fixes that.

Release note: None
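For context, one common way such an allocation sneaks in (a hypothetical sketch, not the actual diff) is taking the address of a local copy of a slice element, which makes the copy escape to the heap, whereas pointing directly into the slice's backing array does not allocate:

```go
package main

import "fmt"

// T stands in for types.T here; any struct works for the illustration.
type T struct{ family int }

// copyThenPoint copies the element into a local variable and returns the
// address of that copy; the copy escapes to the heap, typically costing
// one allocation per call (unless escape analysis can prove otherwise).
func copyThenPoint(typs []T, i int) *T {
	t := typs[i]
	return &t
}

// pointIntoSlice returns a pointer directly into the slice's backing
// array, which introduces no additional allocation.
func pointIntoSlice(typs []T, i int) *T {
	return &typs[i]
}

func main() {
	typs := []T{{family: 1}, {family: 2}}
	fmt.Println(copyThenPoint(typs, 0).family, pointIntoSlice(typs, 1).family)
}
```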

@cockroach-teamcity (Member)

This change is Reviewable

@yuzefovich changed the title from "colexec: remove one of the Go maps from hash aggregator" to "colexec: some optimizations" on Apr 22, 2020
@yuzefovich (Member, Author) commented Apr 22, 2020

The third commit removes the allocations I mistakenly introduced:

name                                       old alloc/op   new alloc/op   delta
MergeJoiner/rows=1024-16                     1.58kB ± 0%    1.26kB ± 0%   ~     (p=0.100 n=3+3)
MergeJoiner/rows=4096-16                     7.27kB ± 0%    5.05kB ± 0%   ~     (p=0.100 n=3+3)
MergeJoiner/rows=16384-16                    30.1kB ± 0%    20.2kB ± 0%   ~     (p=0.100 n=3+3)
MergeJoiner/rows=1048576-16                  1.95MB ± 0%    1.29MB ± 0%   ~     (p=0.100 n=3+3)
MergeJoiner/oneSideRepeat-rows=1024-16       1.58kB ± 0%    1.26kB ± 0%   ~     (p=0.100 n=3+3)
MergeJoiner/oneSideRepeat-rows=4096-16       7.27kB ± 0%    5.04kB ± 0%   ~     (p=0.100 n=3+3)
MergeJoiner/oneSideRepeat-rows=16384-16      30.0kB ± 0%    20.1kB ± 0%   ~     (p=0.100 n=3+3)
MergeJoiner/oneSideRepeat-rows=1048576-16    1.94MB ± 0%    1.29MB ± 0%   ~     (p=0.100 n=3+3)
MergeJoiner/bothSidesRepeat-rows=1024-16     1.58kB ± 0%    1.26kB ± 0%   ~     (p=0.100 n=3+3)
MergeJoiner/bothSidesRepeat-rows=4096-16     7.50kB ± 0%    5.21kB ± 0%   ~     (p=0.100 n=3+3)
MergeJoiner/bothSidesRepeat-rows=16384-16    31.2kB ± 0%    21.0kB ± 0%   ~     (p=0.100 n=3+3)
MergeJoiner/bothSidesRepeat-rows=32768-16    63.2kB ± 0%    42.8kB ± 0%   ~     (p=0.100 n=3+3)

Compare this to the benchmarks of the logical types plumbing PR from here:

MergeJoiner/oneSideRepeat-rows=1024-24                                                               1.22kB ± 0%     1.61kB ± 0%   +31.48%  (p=0.008 n=5+5)
MergeJoiner/bothSidesRepeat-rows=1024-24                                                             1.22kB ± 0%     1.61kB ± 0%   +31.48%  (p=0.008 n=5+5)
MergeJoiner/bothSidesRepeat-rows=4096-24                                                             5.14kB ± 0%     7.64kB ± 0%   +48.59%  (p=0.008 n=5+5)
MergeJoiner/bothSidesRepeat-rows=32768-24                                                            43.3kB ± 0%     65.5kB ± 0%   +51.45%  (p=0.008 n=5+5)
MergeJoiner/oneSideRepeat-rows=4096-24                                                               4.85kB ± 0%     7.35kB ± 0%   +51.60%  (p=0.008 n=5+5)
MergeJoiner/rows=4096-24                                                                             4.84kB ± 0%     7.35kB ± 0%   +51.65%  (p=0.008 n=5+5)
MergeJoiner/bothSidesRepeat-rows=16384-24                                                            20.9kB ± 0%     31.9kB ± 0%   +52.41%  (p=0.016 n=5+4)
MergeJoiner/oneSideRepeat-rows=16384-24                                                              19.4kB ± 0%     30.3kB ± 0%   +56.60%  (p=0.008 n=5+5)
MergeJoiner/rows=16384-24                                                                            19.4kB ± 0%     30.3kB ± 0%   +56.64%  (p=0.008 n=5+5)
MergeJoiner/oneSideRepeat-rows=1048576-24                                                            1.24MB ± 0%     1.96MB ± 0%   +58.27%  (p=0.008 n=5+5)
MergeJoiner/rows=1048576-24                                                                          1.24MB ± 0%     1.96MB ± 0%   +58.28%  (p=0.008 n=5+5)

Thanks @jordanlewis

@blathers-crl (bot) commented Apr 22, 2020

❌ The GitHub CI (Cockroach) build has failed on c4c176f8.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.

@jordanlewis (Member)

Looks good. I don't think we should forgo the small group size stuff, but I understand that with the current algorithm it would be hard to get small groups right. We need to think more about this though.

@yuzefovich (Member, Author)

I have some more improvements coming up :) (nothing drastic though)

@yuzefovich force-pushed the hash-agg branch 2 times, most recently from 4aa9c80 to 3baf549 on April 23, 2020 01:37
@yuzefovich (Member, Author) commented Apr 23, 2020

I removed the comments that were describing the benchmarks of my WIP.

Some observations from those comments:

  • the first commit on its own introduces about a 20% performance hit with small group sizes but gives at least a 20% performance improvement with large group sizes
  • that gain can be up to 80% (when aggregating integers)
  • the first commit reduces allocations in all cases.

The second commit removes the buffering stage, which reduces the maximum length of hashBuffer to coldata.BatchSize() and makes the approach of the first commit suffer less with small group sizes. As a result, we see gains across the board (full output is here):

name                                                                             old speed      new speed       delta
Aggregator/SUM/hash/int/groupSize=1/hasNulls=false/numInputBatches=64-24         4.48MB/s ± 2%   5.23MB/s ± 1%   +16.70%  (p=0.000 n=10+10)
Aggregator/SUM/hash/int/groupSize=1/hasNulls=true/numInputBatches=64-24          4.18MB/s ± 2%   4.86MB/s ± 1%   +16.41%  (p=0.000 n=10+10)
Aggregator/SUM/hash/int/groupSize=2/hasNulls=false/numInputBatches=64-24         8.18MB/s ± 3%  10.66MB/s ± 5%   +30.32%  (p=0.000 n=10+10)
Aggregator/SUM/hash/int/groupSize=2/hasNulls=true/numInputBatches=64-24          7.77MB/s ± 1%  10.14MB/s ± 3%   +30.45%  (p=0.000 n=10+10)
Aggregator/SUM/hash/int/groupSize=4/hasNulls=false/numInputBatches=64-24         17.0MB/s ± 3%   21.7MB/s ± 1%   +27.66%  (p=0.000 n=10+10)
Aggregator/SUM/hash/int/groupSize=4/hasNulls=true/numInputBatches=64-24          16.3MB/s ± 3%   21.3MB/s ± 2%   +30.48%  (p=0.000 n=10+10)
Aggregator/SUM/hash/int/groupSize=8/hasNulls=false/numInputBatches=64-24         30.8MB/s ± 1%   42.0MB/s ± 1%   +36.29%  (p=0.000 n=9+10)
Aggregator/SUM/hash/int/groupSize=8/hasNulls=true/numInputBatches=64-24          29.8MB/s ± 1%   41.1MB/s ± 1%   +38.14%  (p=0.000 n=9+10)
Aggregator/SUM/hash/int/groupSize=16/hasNulls=false/numInputBatches=64-24        52.6MB/s ± 1%   82.2MB/s ± 1%   +56.40%  (p=0.000 n=10+8)
Aggregator/SUM/hash/int/groupSize=16/hasNulls=true/numInputBatches=64-24         50.1MB/s ± 1%   78.0MB/s ± 4%   +55.72%  (p=0.000 n=10+10)
Aggregator/SUM/hash/int/groupSize=32/hasNulls=false/numInputBatches=64-24        79.8MB/s ± 2%  143.7MB/s ± 2%   +79.98%  (p=0.000 n=10+10)
Aggregator/SUM/hash/int/groupSize=32/hasNulls=true/numInputBatches=64-24         75.7MB/s ± 4%  135.5MB/s ± 1%   +78.99%  (p=0.000 n=10+10)
Aggregator/SUM/hash/int/groupSize=64/hasNulls=false/numInputBatches=64-24         107MB/s ± 2%    226MB/s ± 0%  +111.44%  (p=0.000 n=9+10)
Aggregator/SUM/hash/int/groupSize=64/hasNulls=true/numInputBatches=64-24         99.2MB/s ± 3%  207.8MB/s ± 1%  +109.44%  (p=0.000 n=10+10)
Aggregator/SUM/hash/int/groupSize=128/hasNulls=false/numInputBatches=64-24        134MB/s ± 5%    320MB/s ± 0%  +139.87%  (p=0.000 n=10+10)
Aggregator/SUM/hash/int/groupSize=128/hasNulls=true/numInputBatches=64-24         123MB/s ± 6%    285MB/s ± 2%  +131.99%  (p=0.000 n=10+10)
Aggregator/SUM/hash/int/groupSize=256/hasNulls=false/numInputBatches=64-24        182MB/s ± 2%    403MB/s ± 0%  +121.56%  (p=0.000 n=10+10)
Aggregator/SUM/hash/int/groupSize=256/hasNulls=true/numInputBatches=64-24         159MB/s ± 1%    350MB/s ± 0%  +120.66%  (p=0.000 n=10+10)
Aggregator/SUM/hash/int/groupSize=512/hasNulls=false/numInputBatches=64-24        208MB/s ± 1%    457MB/s ± 0%  +119.58%  (p=0.000 n=10+8)
Aggregator/SUM/hash/int/groupSize=512/hasNulls=true/numInputBatches=64-24         182MB/s ± 0%    390MB/s ± 0%  +114.08%  (p=0.000 n=10+10)
Aggregator/SUM/hash/int/groupSize=1024/hasNulls=false/numInputBatches=64-24       234MB/s ± 0%    486MB/s ± 0%  +108.29%  (p=0.000 n=9+10)
Aggregator/SUM/hash/int/groupSize=1024/hasNulls=true/numInputBatches=64-24        202MB/s ± 1%    413MB/s ± 0%  +104.29%  (p=0.000 n=10+8)
Aggregator/SUM/hash/int/groupSize=2048/hasNulls=false/numInputBatches=64-24       274MB/s ± 1%    501MB/s ± 1%   +82.55%  (p=0.000 n=10+10)
Aggregator/SUM/hash/int/groupSize=2048/hasNulls=true/numInputBatches=64-24        230MB/s ± 1%    421MB/s ± 0%   +82.87%  (p=0.000 n=9+9)
Aggregator/SUM/hash/int/groupSize=4096/hasNulls=false/numInputBatches=64-24       278MB/s ± 1%    505MB/s ± 0%   +81.65%  (p=0.000 n=10+10)
Aggregator/SUM/hash/int/groupSize=4096/hasNulls=true/numInputBatches=64-24        233MB/s ± 0%    426MB/s ± 0%   +82.68%  (p=0.000 n=8+10)
Aggregator/SUM/hash/decimal/groupSize=1/hasNulls=false/numInputBatches=64-24     3.73MB/s ± 3%   4.21MB/s ± 2%   +12.96%  (p=0.000 n=10+10)
Aggregator/SUM/hash/decimal/groupSize=1/hasNulls=true/numInputBatches=64-24      3.60MB/s ± 1%   4.12MB/s ± 1%   +14.45%  (p=0.000 n=10+10)
Aggregator/SUM/hash/decimal/groupSize=2/hasNulls=false/numInputBatches=64-24     6.18MB/s ± 4%   7.22MB/s ± 1%   +16.95%  (p=0.000 n=10+10)
Aggregator/SUM/hash/decimal/groupSize=2/hasNulls=true/numInputBatches=64-24      6.09MB/s ± 3%   7.40MB/s ± 3%   +21.48%  (p=0.000 n=10+10)
Aggregator/SUM/hash/decimal/groupSize=4/hasNulls=false/numInputBatches=64-24     9.83MB/s ± 1%  11.89MB/s ± 2%   +20.97%  (p=0.000 n=10+10)
Aggregator/SUM/hash/decimal/groupSize=4/hasNulls=true/numInputBatches=64-24      10.1MB/s ± 1%   12.2MB/s ± 1%   +21.45%  (p=0.000 n=10+10)
Aggregator/SUM/hash/decimal/groupSize=8/hasNulls=false/numInputBatches=64-24     14.4MB/s ± 1%   16.8MB/s ± 2%   +16.63%  (p=0.000 n=10+9)
Aggregator/SUM/hash/decimal/groupSize=8/hasNulls=true/numInputBatches=64-24      15.0MB/s ± 1%   17.4MB/s ± 1%   +16.13%  (p=0.000 n=9+10)
Aggregator/SUM/hash/decimal/groupSize=16/hasNulls=false/numInputBatches=64-24    18.7MB/s ± 1%   21.7MB/s ± 2%   +15.99%  (p=0.000 n=10+10)
Aggregator/SUM/hash/decimal/groupSize=16/hasNulls=true/numInputBatches=64-24     19.7MB/s ± 1%   23.3MB/s ± 0%   +18.42%  (p=0.000 n=10+9)
Aggregator/SUM/hash/decimal/groupSize=32/hasNulls=false/numInputBatches=64-24    22.0MB/s ± 2%   26.1MB/s ± 0%   +18.61%  (p=0.000 n=10+10)
Aggregator/SUM/hash/decimal/groupSize=32/hasNulls=true/numInputBatches=64-24     23.0MB/s ± 1%   28.2MB/s ± 1%   +22.51%  (p=0.000 n=9+10)
Aggregator/SUM/hash/decimal/groupSize=64/hasNulls=false/numInputBatches=64-24    24.0MB/s ± 1%   28.7MB/s ± 1%   +19.54%  (p=0.000 n=10+10)
Aggregator/SUM/hash/decimal/groupSize=64/hasNulls=true/numInputBatches=64-24     25.3MB/s ± 1%   30.6MB/s ± 1%   +21.05%  (p=0.000 n=10+10)
Aggregator/SUM/hash/decimal/groupSize=128/hasNulls=false/numInputBatches=64-24   25.2MB/s ± 1%   30.1MB/s ± 1%   +19.48%  (p=0.000 n=9+10)
Aggregator/SUM/hash/decimal/groupSize=128/hasNulls=true/numInputBatches=64-24    26.7MB/s ± 1%   32.6MB/s ± 1%   +21.77%  (p=0.000 n=10+10)
Aggregator/SUM/hash/decimal/groupSize=256/hasNulls=false/numInputBatches=64-24   26.9MB/s ± 1%   30.8MB/s ± 1%   +14.68%  (p=0.000 n=10+10)
Aggregator/SUM/hash/decimal/groupSize=256/hasNulls=true/numInputBatches=64-24    28.7MB/s ± 1%   33.6MB/s ± 1%   +17.12%  (p=0.000 n=10+10)
Aggregator/SUM/hash/decimal/groupSize=512/hasNulls=false/numInputBatches=64-24   27.7MB/s ± 1%   31.5MB/s ± 1%   +13.84%  (p=0.000 n=10+10)
Aggregator/SUM/hash/decimal/groupSize=512/hasNulls=true/numInputBatches=64-24    29.3MB/s ± 1%   34.0MB/s ± 1%   +16.01%  (p=0.000 n=10+10)
Aggregator/SUM/hash/decimal/groupSize=1024/hasNulls=false/numInputBatches=64-24  28.1MB/s ± 3%   31.7MB/s ± 1%   +12.63%  (p=0.000 n=10+10)
Aggregator/SUM/hash/decimal/groupSize=1024/hasNulls=true/numInputBatches=64-24   29.9MB/s ± 1%   34.4MB/s ± 1%   +15.11%  (p=0.000 n=10+9)
Aggregator/SUM/hash/decimal/groupSize=2048/hasNulls=false/numInputBatches=64-24  28.8MB/s ± 0%   31.8MB/s ± 1%   +10.64%  (p=0.000 n=9+10)
Aggregator/SUM/hash/decimal/groupSize=2048/hasNulls=true/numInputBatches=64-24   30.6MB/s ± 0%   34.8MB/s ± 1%   +13.69%  (p=0.000 n=9+10)
Aggregator/SUM/hash/decimal/groupSize=4096/hasNulls=false/numInputBatches=64-24  29.1MB/s ± 1%   32.1MB/s ± 2%   +10.30%  (p=0.000 n=10+10)
Aggregator/SUM/hash/decimal/groupSize=4096/hasNulls=true/numInputBatches=64-24   30.9MB/s ± 1%   34.6MB/s ± 1%   +12.01%  (p=0.000 n=10+10)

@asubiotto (Contributor) left a comment

Reviewed 1 of 1 files at r1, 3 of 3 files at r2, 7 of 7 files at r3.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @Azhng and @yuzefovich)


pkg/sql/colexec/hash_aggregator.go, line 195 at r1 (raw file):

	// We picked this value as the result of our benchmark.
	tupleLimit := coldata.BatchSize() * 2

This is interesting. Based on previous benchmarks, it seemed like it was good to have a buffering stage. What changed?


pkg/sql/colexec/hash_aggregator.go, line 374 at r1 (raw file):

	for selIdx, hashCode := range hashBuffer {
		selsSlot := -1
		for slot, hash := range op.scratch.hashCodeForSelsSlot {

maybe add a comment as to why we're not using a map


pkg/sql/colexec/hash_aggregator.go, line 447 at r2 (raw file):

}

const hashAggFuncsAllocSize = 128

What's the perf+mem difference with using 1024?

@yuzefovich force-pushed the hash-agg branch 2 times, most recently from 49101c6 to d24f5d3 on April 27, 2020 18:16
@yuzefovich (Member, Author) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @asubiotto and @Azhng)


pkg/sql/colexec/hash_aggregator.go, line 195 at r1 (raw file):

Previously, asubiotto (Alfonso Subiotto Marqués) wrote…

This is interesting. Based on previous benchmarks, it seemed like it was good to have a buffering stage. What changed?

Not sure if it is the only reason, but the first commit here removes the lookup in a map (which has amortized O(1) cost) in favor of a linear search in a slice (which has O(distinct buffered tuples) cost), so if we buffer up several batches, that cost will increase, especially in the case of small group sizes.
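To make that tradeoff concrete, a throwaway micro-benchmark along these lines (hypothetical, not part of this PR) shows the crossover: the linear scan tends to win when only a handful of distinct hash codes are buffered, and the map tends to win as the number of distinct codes grows:

```go
package lookup

import "testing"

const numLookups = 1024

// benchmarkLookup times numLookups lookups against `distinct` buffered
// hash codes, once with a linear scan over a slice and once with a map.
func benchmarkLookup(b *testing.B, distinct int) {
	codes := make([]uint64, distinct)
	m := make(map[uint64]int, distinct)
	for i := range codes {
		codes[i] = uint64(i) * 2654435761 // arbitrary spread of hash codes
		m[codes[i]] = i
	}
	b.Run("slice", func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			for j := 0; j < numLookups; j++ {
				target := codes[j%distinct]
				for slot, hash := range codes {
					if hash == target {
						_ = slot
						break
					}
				}
			}
		}
	})
	b.Run("map", func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			for j := 0; j < numLookups; j++ {
				_ = m[codes[j%distinct]]
			}
		}
	})
}

func BenchmarkLookup4(b *testing.B)    { benchmarkLookup(b, 4) }
func BenchmarkLookup1024(b *testing.B) { benchmarkLookup(b, 1024) }
```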


pkg/sql/colexec/hash_aggregator.go, line 374 at r1 (raw file):

Previously, asubiotto (Alfonso Subiotto Marqués) wrote…

maybe add a comment as to why we're not using a map

Added.


pkg/sql/colexec/hash_aggregator.go, line 447 at r2 (raw file):

Previously, asubiotto (Alfonso Subiotto Marqués) wrote…

What's the perf+mem difference with using 1024?

The comparison of 128 (old) vs 1024 (new) is here. Seems like there is no noticeable change in memory allocations, but there is a minor hit in performance with small group sizes (which is somewhat surprising to me).

I ran a few other comparisons: 128 vs 32 and 128 vs 64

Maybe 64 would be best? I feel like it is a nicer number than 128, but there is not much difference between the two in the benchmarks.
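(For readers following along: hashAggFuncsAllocSize tunes a chunked-allocation pattern roughly like the simplified, hypothetical sketch below, not the actual colexec allocator. The idea is to pay one heap allocation per chunk of groups rather than one per group, so the constant mostly matters for workloads with many small groups.)

```go
package main

import "fmt"

// hashAggFuncs stands in for the per-group aggregation state.
type hashAggFuncs struct{ groupID int }

// allocator hands out *hashAggFuncs from pre-allocated chunks of
// allocSize elements, so creating many small groups costs one heap
// allocation per allocSize groups instead of one per group.
type allocator struct {
	allocSize int
	buf       []hashAggFuncs
}

func (a *allocator) newFuncs(groupID int) *hashAggFuncs {
	if len(a.buf) == 0 {
		a.buf = make([]hashAggFuncs, a.allocSize)
	}
	f := &a.buf[0]
	a.buf = a.buf[1:]
	f.groupID = groupID
	return f
}

func main() {
	a := &allocator{allocSize: 64} // the constant being discussed above
	for i := 0; i < 3; i++ {
		fmt.Println(a.newFuncs(i).groupID)
	}
}
```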

@asubiotto (Contributor) left a comment

:lgtm:

Reviewed 9 of 9 files at r4.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @asubiotto, @Azhng, and @yuzefovich)


pkg/sql/colexec/hash_aggregator.go, line 447 at r2 (raw file):

Previously, yuzefovich wrote…

The comparison of 128 (old) vs 1024 (new) is here. Seems like there is no noticeable change in memory allocations, but there is a minor hit in performance with small group sizes (which is somewhat surprising to me).

I ran a few other comparisons: 128 vs 32 and 128 vs 64

Maybe 64 would be best? I feel like it is a nicer number than 128, but there is not much difference between the two in the benchmarks.

Let's do 64 if there's not a big boost to using 128. Maybe also add a comment about how you got to this number.

@yuzefovich (Member, Author) left a comment

TFTR!

bors r+

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @asubiotto and @Azhng)


pkg/sql/colexec/hash_aggregator.go, line 447 at r2 (raw file):

Previously, asubiotto (Alfonso Subiotto Marqués) wrote…

Let's do 64 if there's not a big boost to using 128. Maybe also add a comment about how you got to this number.

Done.

@craig (bot) commented Apr 28, 2020

Build succeeded

@craig (bot) merged commit 211abed into cockroachdb:master on Apr 28, 2020
@yuzefovich deleted the hash-agg branch on April 28, 2020 14:56