
colexec: some optimizations #47942

Merged: 3 commits merged into cockroachdb:master on Apr 28, 2020

Conversation

@yuzefovich (Member) commented Apr 22, 2020

colexec: remove one of the Go maps from hash aggregator

This commit switches the usage of a map to iteration over a []uint64 when
building selection vectors in the hash aggregator. This is a lot more
efficient when group sizes are relatively large, with a moderate hit when
group sizes are small. That hit is reduced in a follow-up commit.

Release note: None
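To illustrate the idea, here is a minimal, hypothetical Go sketch (not the actual colexec code; only the field name hashCodeForSelsSlot mirrors the snippet quoted in the review below). Each tuple's slot is found by a linear scan over the distinct hash codes seen so far, which is cheap when a batch contains few distinct codes (large groups) and costlier when it contains many (small groups):

```go
package main

import "fmt"

// populateSels groups tuple indices by hash code. Instead of a
// map[uint64][]int, it keeps the distinct hash codes seen so far in a
// plain slice and finds the matching slot with a linear scan.
func populateSels(hashBuffer []uint64) (hashCodeForSelsSlot []uint64, sels [][]int) {
	for selIdx, hashCode := range hashBuffer {
		selsSlot := -1
		for slot, hash := range hashCodeForSelsSlot {
			if hash == hashCode {
				selsSlot = slot
				break
			}
		}
		if selsSlot == -1 {
			// First time we see this hash code: open a new slot.
			hashCodeForSelsSlot = append(hashCodeForSelsSlot, hashCode)
			sels = append(sels, nil)
			selsSlot = len(sels) - 1
		}
		sels[selsSlot] = append(sels[selsSlot], selIdx)
	}
	return hashCodeForSelsSlot, sels
}

func main() {
	hashCodes, sels := populateSels([]uint64{7, 3, 7, 7, 3})
	fmt.Println(hashCodes, sels) // [7 3] [[0 2 3] [1 4]]
}
```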

colexec: more improvements to hash aggregator

This commit removes the buffering stage of the hash aggregator as well
as the "append only" scratch batch that we're currently using. Removing
the buffering stage allows us to use smaller buffers without sacrificing
performance. Removing the scratch batch lets us avoid copying the data
out of the input batch and instead use that batch directly. We will be
destructively modifying the selection vector on that batch, but such
behavior is acceptable because the hash aggregator owns its output
batch, and the input batch will not be propagated further.

This commit also bumps hashAggFuncsAllocSize from 16 to 64, which gives
us a minor performance improvement in the case of small group sizes.

Release note: None
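As a rough illustration of processing the input batch in place (a toy sketch with a hypothetical batch type, not the real coldata.Batch API), the operator below rewrites the batch's selection vector destructively instead of copying surviving rows into a scratch batch; this is only safe because the operator owns its own output and the input batch is not handed further downstream:

```go
package main

import "fmt"

// batch is a deliberately tiny stand-in for a columnar batch: col holds
// the physical values and sel is the selection vector of logical rows.
type batch struct {
	col []int64
	sel []int
}

// filterEven keeps only the selected rows whose value is even. It
// rewrites b.sel in place (reusing its backing array) rather than
// copying the surviving rows into a separate scratch batch.
func filterEven(b *batch) {
	kept := b.sel[:0]
	for _, rowIdx := range b.sel {
		if b.col[rowIdx]%2 == 0 {
			kept = append(kept, rowIdx)
		}
	}
	b.sel = kept
}

func main() {
	b := &batch{col: []int64{1, 2, 3, 4}, sel: []int{0, 1, 2, 3}}
	filterEven(b)
	fmt.Println(b.sel) // [1 3]
}
```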

colexec: remove some allocations

In a recent PR (for logical types plumbing) I introduced some
unnecessary allocations in the unhandled type case by taking a pointer
to a value in a []types.T slice. This commit fixes that.

Release note: None
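For context, one common way such an allocation sneaks in (a hypothetical sketch, not the actual diff) is taking the address of a local copy of a slice element, which makes the copy escape to the heap, whereas pointing directly into the slice's backing array does not allocate:

```go
package main

import "fmt"

// T stands in for types.T here; any struct works for the illustration.
type T struct{ family int }

// copyThenPoint copies the element into a local variable and returns the
// address of that copy; the copy escapes to the heap, typically costing
// one allocation per call (unless escape analysis can prove otherwise).
func copyThenPoint(typs []T, i int) *T {
	t := typs[i]
	return &t
}

// pointIntoSlice returns a pointer directly into the slice's backing
// array, which introduces no additional allocation.
func pointIntoSlice(typs []T, i int) *T {
	return &typs[i]
}

func main() {
	typs := []T{{family: 1}, {family: 2}}
	fmt.Println(copyThenPoint(typs, 0).family, pointIntoSlice(typs, 1).family)
}
```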

@cockroach-teamcity (Member)

This change is Reviewable

@yuzefovich changed the title from "colexec: remove one of the Go maps from hash aggregator" to "colexec: some optimizations" on Apr 22, 2020
@yuzefovich (Member, Author) commented Apr 22, 2020

The third commit removes the allocations I mistakenly introduced:

name                                       old alloc/op   new alloc/op   delta
MergeJoiner/rows=1024-16                     1.58kB ± 0%    1.26kB ± 0%   ~     (p=0.100 n=3+3)
MergeJoiner/rows=4096-16                     7.27kB ± 0%    5.05kB ± 0%   ~     (p=0.100 n=3+3)
MergeJoiner/rows=16384-16                    30.1kB ± 0%    20.2kB ± 0%   ~     (p=0.100 n=3+3)
MergeJoiner/rows=1048576-16                  1.95MB ± 0%    1.29MB ± 0%   ~     (p=0.100 n=3+3)
MergeJoiner/oneSideRepeat-rows=1024-16       1.58kB ± 0%    1.26kB ± 0%   ~     (p=0.100 n=3+3)
MergeJoiner/oneSideRepeat-rows=4096-16       7.27kB ± 0%    5.04kB ± 0%   ~     (p=0.100 n=3+3)
MergeJoiner/oneSideRepeat-rows=16384-16      30.0kB ± 0%    20.1kB ± 0%   ~     (p=0.100 n=3+3)
MergeJoiner/oneSideRepeat-rows=1048576-16    1.94MB ± 0%    1.29MB ± 0%   ~     (p=0.100 n=3+3)
MergeJoiner/bothSidesRepeat-rows=1024-16     1.58kB ± 0%    1.26kB ± 0%   ~     (p=0.100 n=3+3)
MergeJoiner/bothSidesRepeat-rows=4096-16     7.50kB ± 0%    5.21kB ± 0%   ~     (p=0.100 n=3+3)
MergeJoiner/bothSidesRepeat-rows=16384-16    31.2kB ± 0%    21.0kB ± 0%   ~     (p=0.100 n=3+3)
MergeJoiner/bothSidesRepeat-rows=32768-16    63.2kB ± 0%    42.8kB ± 0%   ~     (p=0.100 n=3+3)

Compare this to the benchmarks of the logical types plumbing PR from here:

MergeJoiner/oneSideRepeat-rows=1024-24                                                               1.22kB ± 0%     1.61kB ± 0%   +31.48%  (p=0.008 n=5+5)
MergeJoiner/bothSidesRepeat-rows=1024-24                                                             1.22kB ± 0%     1.61kB ± 0%   +31.48%  (p=0.008 n=5+5)
MergeJoiner/bothSidesRepeat-rows=4096-24                                                             5.14kB ± 0%     7.64kB ± 0%   +48.59%  (p=0.008 n=5+5)
MergeJoiner/bothSidesRepeat-rows=32768-24                                                            43.3kB ± 0%     65.5kB ± 0%   +51.45%  (p=0.008 n=5+5)
MergeJoiner/oneSideRepeat-rows=4096-24                                                               4.85kB ± 0%     7.35kB ± 0%   +51.60%  (p=0.008 n=5+5)
MergeJoiner/rows=4096-24                                                                             4.84kB ± 0%     7.35kB ± 0%   +51.65%  (p=0.008 n=5+5)
MergeJoiner/bothSidesRepeat-rows=16384-24                                                            20.9kB ± 0%     31.9kB ± 0%   +52.41%  (p=0.016 n=5+4)
MergeJoiner/oneSideRepeat-rows=16384-24                                                              19.4kB ± 0%     30.3kB ± 0%   +56.60%  (p=0.008 n=5+5)
MergeJoiner/rows=16384-24                                                                            19.4kB ± 0%     30.3kB ± 0%   +56.64%  (p=0.008 n=5+5)
MergeJoiner/oneSideRepeat-rows=1048576-24                                                            1.24MB ± 0%     1.96MB ± 0%   +58.27%  (p=0.008 n=5+5)
MergeJoiner/rows=1048576-24                                                                          1.24MB ± 0%     1.96MB ± 0%   +58.28%  (p=0.008 n=5+5)

Thanks @jordanlewis

@blathers-crl (bot) commented Apr 22, 2020

❌ The GitHub CI (Cockroach) build has failed on c4c176f8.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.

@jordanlewis (Member)

Looks good. I don't think we should forgo the small group size stuff, but I understand that with the current algorithm it would be hard to get small groups right. We need to think more about this though.

@yuzefovich (Member, Author)

I have some more improvements coming up :) (nothing drastic though)

@yuzefovich force-pushed the hash-agg branch 2 times, most recently from 4aa9c80 to 3baf549 on April 23, 2020 01:37
@yuzefovich (Member, Author) commented Apr 23, 2020

I removed the comments that were describing the benchmarks of my WIP.

Some observations from those comments:

  • the first commit on its own introduces about a 20% performance hit with small group sizes but gives at least a 20% performance improvement with large group sizes
  • that gain can be up to 80% (when aggregating integers)
  • the first commit reduces allocations in all cases.

The second commit removes the buffering stage, which reduces the maximum length of hashBuffer to coldata.BatchSize() and makes the approach of the first commit suffer less with small group sizes. As a result, we see gains across the board (full output is here):

name                                                                             old speed      new speed       delta
Aggregator/SUM/hash/int/groupSize=1/hasNulls=false/numInputBatches=64-24         4.48MB/s ± 2%   5.23MB/s ± 1%   +16.70%  (p=0.000 n=10+10)
Aggregator/SUM/hash/int/groupSize=1/hasNulls=true/numInputBatches=64-24          4.18MB/s ± 2%   4.86MB/s ± 1%   +16.41%  (p=0.000 n=10+10)
Aggregator/SUM/hash/int/groupSize=2/hasNulls=false/numInputBatches=64-24         8.18MB/s ± 3%  10.66MB/s ± 5%   +30.32%  (p=0.000 n=10+10)
Aggregator/SUM/hash/int/groupSize=2/hasNulls=true/numInputBatches=64-24          7.77MB/s ± 1%  10.14MB/s ± 3%   +30.45%  (p=0.000 n=10+10)
Aggregator/SUM/hash/int/groupSize=4/hasNulls=false/numInputBatches=64-24         17.0MB/s ± 3%   21.7MB/s ± 1%   +27.66%  (p=0.000 n=10+10)
Aggregator/SUM/hash/int/groupSize=4/hasNulls=true/numInputBatches=64-24          16.3MB/s ± 3%   21.3MB/s ± 2%   +30.48%  (p=0.000 n=10+10)
Aggregator/SUM/hash/int/groupSize=8/hasNulls=false/numInputBatches=64-24         30.8MB/s ± 1%   42.0MB/s ± 1%   +36.29%  (p=0.000 n=9+10)
Aggregator/SUM/hash/int/groupSize=8/hasNulls=true/numInputBatches=64-24          29.8MB/s ± 1%   41.1MB/s ± 1%   +38.14%  (p=0.000 n=9+10)
Aggregator/SUM/hash/int/groupSize=16/hasNulls=false/numInputBatches=64-24        52.6MB/s ± 1%   82.2MB/s ± 1%   +56.40%  (p=0.000 n=10+8)
Aggregator/SUM/hash/int/groupSize=16/hasNulls=true/numInputBatches=64-24         50.1MB/s ± 1%   78.0MB/s ± 4%   +55.72%  (p=0.000 n=10+10)
Aggregator/SUM/hash/int/groupSize=32/hasNulls=false/numInputBatches=64-24        79.8MB/s ± 2%  143.7MB/s ± 2%   +79.98%  (p=0.000 n=10+10)
Aggregator/SUM/hash/int/groupSize=32/hasNulls=true/numInputBatches=64-24         75.7MB/s ± 4%  135.5MB/s ± 1%   +78.99%  (p=0.000 n=10+10)
Aggregator/SUM/hash/int/groupSize=64/hasNulls=false/numInputBatches=64-24         107MB/s ± 2%    226MB/s ± 0%  +111.44%  (p=0.000 n=9+10)
Aggregator/SUM/hash/int/groupSize=64/hasNulls=true/numInputBatches=64-24         99.2MB/s ± 3%  207.8MB/s ± 1%  +109.44%  (p=0.000 n=10+10)
Aggregator/SUM/hash/int/groupSize=128/hasNulls=false/numInputBatches=64-24        134MB/s ± 5%    320MB/s ± 0%  +139.87%  (p=0.000 n=10+10)
Aggregator/SUM/hash/int/groupSize=128/hasNulls=true/numInputBatches=64-24         123MB/s ± 6%    285MB/s ± 2%  +131.99%  (p=0.000 n=10+10)
Aggregator/SUM/hash/int/groupSize=256/hasNulls=false/numInputBatches=64-24        182MB/s ± 2%    403MB/s ± 0%  +121.56%  (p=0.000 n=10+10)
Aggregator/SUM/hash/int/groupSize=256/hasNulls=true/numInputBatches=64-24         159MB/s ± 1%    350MB/s ± 0%  +120.66%  (p=0.000 n=10+10)
Aggregator/SUM/hash/int/groupSize=512/hasNulls=false/numInputBatches=64-24        208MB/s ± 1%    457MB/s ± 0%  +119.58%  (p=0.000 n=10+8)
Aggregator/SUM/hash/int/groupSize=512/hasNulls=true/numInputBatches=64-24         182MB/s ± 0%    390MB/s ± 0%  +114.08%  (p=0.000 n=10+10)
Aggregator/SUM/hash/int/groupSize=1024/hasNulls=false/numInputBatches=64-24       234MB/s ± 0%    486MB/s ± 0%  +108.29%  (p=0.000 n=9+10)
Aggregator/SUM/hash/int/groupSize=1024/hasNulls=true/numInputBatches=64-24        202MB/s ± 1%    413MB/s ± 0%  +104.29%  (p=0.000 n=10+8)
Aggregator/SUM/hash/int/groupSize=2048/hasNulls=false/numInputBatches=64-24       274MB/s ± 1%    501MB/s ± 1%   +82.55%  (p=0.000 n=10+10)
Aggregator/SUM/hash/int/groupSize=2048/hasNulls=true/numInputBatches=64-24        230MB/s ± 1%    421MB/s ± 0%   +82.87%  (p=0.000 n=9+9)
Aggregator/SUM/hash/int/groupSize=4096/hasNulls=false/numInputBatches=64-24       278MB/s ± 1%    505MB/s ± 0%   +81.65%  (p=0.000 n=10+10)
Aggregator/SUM/hash/int/groupSize=4096/hasNulls=true/numInputBatches=64-24        233MB/s ± 0%    426MB/s ± 0%   +82.68%  (p=0.000 n=8+10)
Aggregator/SUM/hash/decimal/groupSize=1/hasNulls=false/numInputBatches=64-24     3.73MB/s ± 3%   4.21MB/s ± 2%   +12.96%  (p=0.000 n=10+10)
Aggregator/SUM/hash/decimal/groupSize=1/hasNulls=true/numInputBatches=64-24      3.60MB/s ± 1%   4.12MB/s ± 1%   +14.45%  (p=0.000 n=10+10)
Aggregator/SUM/hash/decimal/groupSize=2/hasNulls=false/numInputBatches=64-24     6.18MB/s ± 4%   7.22MB/s ± 1%   +16.95%  (p=0.000 n=10+10)
Aggregator/SUM/hash/decimal/groupSize=2/hasNulls=true/numInputBatches=64-24      6.09MB/s ± 3%   7.40MB/s ± 3%   +21.48%  (p=0.000 n=10+10)
Aggregator/SUM/hash/decimal/groupSize=4/hasNulls=false/numInputBatches=64-24     9.83MB/s ± 1%  11.89MB/s ± 2%   +20.97%  (p=0.000 n=10+10)
Aggregator/SUM/hash/decimal/groupSize=4/hasNulls=true/numInputBatches=64-24      10.1MB/s ± 1%   12.2MB/s ± 1%   +21.45%  (p=0.000 n=10+10)
Aggregator/SUM/hash/decimal/groupSize=8/hasNulls=false/numInputBatches=64-24     14.4MB/s ± 1%   16.8MB/s ± 2%   +16.63%  (p=0.000 n=10+9)
Aggregator/SUM/hash/decimal/groupSize=8/hasNulls=true/numInputBatches=64-24      15.0MB/s ± 1%   17.4MB/s ± 1%   +16.13%  (p=0.000 n=9+10)
Aggregator/SUM/hash/decimal/groupSize=16/hasNulls=false/numInputBatches=64-24    18.7MB/s ± 1%   21.7MB/s ± 2%   +15.99%  (p=0.000 n=10+10)
Aggregator/SUM/hash/decimal/groupSize=16/hasNulls=true/numInputBatches=64-24     19.7MB/s ± 1%   23.3MB/s ± 0%   +18.42%  (p=0.000 n=10+9)
Aggregator/SUM/hash/decimal/groupSize=32/hasNulls=false/numInputBatches=64-24    22.0MB/s ± 2%   26.1MB/s ± 0%   +18.61%  (p=0.000 n=10+10)
Aggregator/SUM/hash/decimal/groupSize=32/hasNulls=true/numInputBatches=64-24     23.0MB/s ± 1%   28.2MB/s ± 1%   +22.51%  (p=0.000 n=9+10)
Aggregator/SUM/hash/decimal/groupSize=64/hasNulls=false/numInputBatches=64-24    24.0MB/s ± 1%   28.7MB/s ± 1%   +19.54%  (p=0.000 n=10+10)
Aggregator/SUM/hash/decimal/groupSize=64/hasNulls=true/numInputBatches=64-24     25.3MB/s ± 1%   30.6MB/s ± 1%   +21.05%  (p=0.000 n=10+10)
Aggregator/SUM/hash/decimal/groupSize=128/hasNulls=false/numInputBatches=64-24   25.2MB/s ± 1%   30.1MB/s ± 1%   +19.48%  (p=0.000 n=9+10)
Aggregator/SUM/hash/decimal/groupSize=128/hasNulls=true/numInputBatches=64-24    26.7MB/s ± 1%   32.6MB/s ± 1%   +21.77%  (p=0.000 n=10+10)
Aggregator/SUM/hash/decimal/groupSize=256/hasNulls=false/numInputBatches=64-24   26.9MB/s ± 1%   30.8MB/s ± 1%   +14.68%  (p=0.000 n=10+10)
Aggregator/SUM/hash/decimal/groupSize=256/hasNulls=true/numInputBatches=64-24    28.7MB/s ± 1%   33.6MB/s ± 1%   +17.12%  (p=0.000 n=10+10)
Aggregator/SUM/hash/decimal/groupSize=512/hasNulls=false/numInputBatches=64-24   27.7MB/s ± 1%   31.5MB/s ± 1%   +13.84%  (p=0.000 n=10+10)
Aggregator/SUM/hash/decimal/groupSize=512/hasNulls=true/numInputBatches=64-24    29.3MB/s ± 1%   34.0MB/s ± 1%   +16.01%  (p=0.000 n=10+10)
Aggregator/SUM/hash/decimal/groupSize=1024/hasNulls=false/numInputBatches=64-24  28.1MB/s ± 3%   31.7MB/s ± 1%   +12.63%  (p=0.000 n=10+10)
Aggregator/SUM/hash/decimal/groupSize=1024/hasNulls=true/numInputBatches=64-24   29.9MB/s ± 1%   34.4MB/s ± 1%   +15.11%  (p=0.000 n=10+9)
Aggregator/SUM/hash/decimal/groupSize=2048/hasNulls=false/numInputBatches=64-24  28.8MB/s ± 0%   31.8MB/s ± 1%   +10.64%  (p=0.000 n=9+10)
Aggregator/SUM/hash/decimal/groupSize=2048/hasNulls=true/numInputBatches=64-24   30.6MB/s ± 0%   34.8MB/s ± 1%   +13.69%  (p=0.000 n=9+10)
Aggregator/SUM/hash/decimal/groupSize=4096/hasNulls=false/numInputBatches=64-24  29.1MB/s ± 1%   32.1MB/s ± 2%   +10.30%  (p=0.000 n=10+10)
Aggregator/SUM/hash/decimal/groupSize=4096/hasNulls=true/numInputBatches=64-24   30.9MB/s ± 1%   34.6MB/s ± 1%   +12.01%  (p=0.000 n=10+10)

@asubiotto (Contributor) left a comment

Reviewed 1 of 1 files at r1, 3 of 3 files at r2, 7 of 7 files at r3.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @Azhng and @yuzefovich)


pkg/sql/colexec/hash_aggregator.go, line 195 at r1 (raw file):

	// We picked this value as the result of our benchmark.
	tupleLimit := coldata.BatchSize() * 2

This is interesting. Based on previous benchmarks, it seemed like it was good to have a buffering stage. What changed?


pkg/sql/colexec/hash_aggregator.go, line 374 at r1 (raw file):

	for selIdx, hashCode := range hashBuffer {
		selsSlot := -1
		for slot, hash := range op.scratch.hashCodeForSelsSlot {

maybe add a comment as to why we're not using a map


pkg/sql/colexec/hash_aggregator.go, line 447 at r2 (raw file):

}

const hashAggFuncsAllocSize = 128

What's the perf+mem difference with using 1024?

@yuzefovich force-pushed the hash-agg branch 2 times, most recently from 49101c6 to d24f5d3 on April 27, 2020 18:16
@yuzefovich (Member, Author) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @asubiotto and @Azhng)


pkg/sql/colexec/hash_aggregator.go, line 195 at r1 (raw file):

Previously, asubiotto (Alfonso Subiotto Marqués) wrote…

This is interesting. Based on previous benchmarks, it seemed like it was good to have a buffering stage. What changed?

Not sure if it is the only reason, but the first commit here removes the lookup in a map (which has amortized O(1) cost) in favor of a linear search in a slice (which has O(distinct buffered tuples) cost), so if we buffer up several batches, that cost will increase, especially in the case of small group sizes.
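To make that tradeoff concrete, a throwaway micro-benchmark along these lines (hypothetical, not part of this PR) shows the crossover: the linear scan tends to win when only a handful of distinct hash codes are buffered, and the map tends to win as the number of distinct codes grows:

```go
package lookup

import "testing"

const numLookups = 1024

// benchmarkLookup times numLookups lookups against `distinct` buffered
// hash codes, once with a linear scan over a slice and once with a map.
func benchmarkLookup(b *testing.B, distinct int) {
	codes := make([]uint64, distinct)
	m := make(map[uint64]int, distinct)
	for i := range codes {
		codes[i] = uint64(i) * 2654435761 // arbitrary spread of hash codes
		m[codes[i]] = i
	}
	b.Run("slice", func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			for j := 0; j < numLookups; j++ {
				target := codes[j%distinct]
				for slot, hash := range codes {
					if hash == target {
						_ = slot
						break
					}
				}
			}
		}
	})
	b.Run("map", func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			for j := 0; j < numLookups; j++ {
				_ = m[codes[j%distinct]]
			}
		}
	})
}

func BenchmarkLookup4(b *testing.B)    { benchmarkLookup(b, 4) }
func BenchmarkLookup1024(b *testing.B) { benchmarkLookup(b, 1024) }
```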


pkg/sql/colexec/hash_aggregator.go, line 374 at r1 (raw file):

Previously, asubiotto (Alfonso Subiotto Marqués) wrote…

maybe add a comment as to why we're not using a map

Added.


pkg/sql/colexec/hash_aggregator.go, line 447 at r2 (raw file):

Previously, asubiotto (Alfonso Subiotto Marqués) wrote…

What's the perf+mem difference with using 1024?

The comparison of 128 (old) vs 1024 (new) is here. Seems like there is no noticeable change in memory allocations, but there is a minor hit in performance with small group sizes (which is somewhat surprising to me).

I ran a few other comparisons: 128 vs 32 and 128 vs 64

Maybe 64 would be best? I feel like it is a nicer number than 128, but there is not much difference between the two in the benchmarks.
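(For readers following along: hashAggFuncsAllocSize tunes a chunked-allocation pattern roughly like the simplified, hypothetical sketch below, not the actual colexec allocator. The idea is to pay one heap allocation per chunk of groups rather than one per group, so the constant mostly matters for workloads with many small groups.)

```go
package main

import "fmt"

// hashAggFuncs stands in for the per-group aggregation state.
type hashAggFuncs struct{ groupID int }

// allocator hands out *hashAggFuncs from pre-allocated chunks of
// allocSize elements, so creating many small groups costs one heap
// allocation per allocSize groups instead of one per group.
type allocator struct {
	allocSize int
	buf       []hashAggFuncs
}

func (a *allocator) newFuncs(groupID int) *hashAggFuncs {
	if len(a.buf) == 0 {
		a.buf = make([]hashAggFuncs, a.allocSize)
	}
	f := &a.buf[0]
	a.buf = a.buf[1:]
	f.groupID = groupID
	return f
}

func main() {
	a := &allocator{allocSize: 64} // the constant being discussed above
	for i := 0; i < 3; i++ {
		fmt.Println(a.newFuncs(i).groupID)
	}
}
```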

@asubiotto (Contributor) left a comment

:lgtm:

Reviewed 9 of 9 files at r4.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @asubiotto, @Azhng, and @yuzefovich)


pkg/sql/colexec/hash_aggregator.go, line 447 at r2 (raw file):

Previously, yuzefovich wrote…

The comparison of 128 (old) vs 1024 (new) is here. Seems like there is no noticeable change in memory allocations, but there is a minor hit in performance with small group sizes (which is somewhat surprising to me).

I ran a few other comparisons: 128 vs 32 and 128 vs 64

Maybe 64 would be best? I feel like it is a nicer number than 128, but there is not much difference between the two in the benchmarks.

Let's do 64 if there's not a big boost to using 128. Maybe also add a comment about how you got to this number.

@yuzefovich (Member, Author) left a comment

TFTR!

bors r+

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @asubiotto and @Azhng)


pkg/sql/colexec/hash_aggregator.go, line 447 at r2 (raw file):

Previously, asubiotto (Alfonso Subiotto Marqués) wrote…

Let's do 64 if there's not a big boost to using 128. Maybe also add a comment about how you got to this number.

Done.

@craig (bot) commented Apr 28, 2020

Build succeeded

@craig (bot) merged commit 211abed into cockroachdb:master on Apr 28, 2020
@yuzefovich deleted the hash-agg branch on April 28, 2020 14:56