
sql/distsqlrun: add benchmarks for aggregator and mergeJoiner #20759

Merged

Conversation

petermattis
Collaborator

Interesting that grouping is slightly faster than distinct. They are
doing the same work in these benchmarks. The STDDEV and VARIANCE
aggregations are slow due to decimal computations.
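That decimal cost can be illustrated with the standard library. In this sketch, math/big stands in for the apd decimals the aggregates actually use; the point is the per-operation heap traffic of arbitrary-precision arithmetic, which is consistent with STDDEV and VARIANCE running roughly 20-30x slower in the tables below.

```go
package main

import (
	"fmt"
	"math/big"
)

// sumSquaresFloat accumulates a sum of squares with machine floats:
// no allocation, one multiply and one add per element.
func sumSquaresFloat(xs []float64) float64 {
	var s float64
	for _, x := range xs {
		s += x * x
	}
	return s
}

// sumSquaresBig does the same with arbitrary-precision floats. Each
// Mul/Add walks heap-allocated limb slices, which is the kind of work
// a decimal-based STDDEV/VARIANCE pays on every input row.
func sumSquaresBig(xs []float64) *big.Float {
	s := new(big.Float)
	tmp := new(big.Float)
	for _, x := range xs {
		tmp.SetFloat64(x)
		tmp.Mul(tmp, tmp)
		s.Add(s, tmp)
	}
	return s
}

func main() {
	xs := []float64{1, 2, 3}
	f, _ := sumSquaresBig(xs).Float64()
	fmt.Println(sumSquaresFloat(xs), f)
}
```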

name                      time/op
Aggregation/IDENT-8         49.6µs ± 2%
Aggregation/AVG-8           70.4µs ± 1%
Aggregation/COUNT-8         49.6µs ± 2%
Aggregation/MAX-8           56.7µs ± 2%
Aggregation/MIN-8           55.6µs ± 2%
Aggregation/STDDEV-8        1.50ms ± 1%
Aggregation/SUM-8           64.8µs ± 2%
Aggregation/SUM_INT-8       65.4µs ± 4%
Aggregation/VARIANCE-8      1.48ms ± 3%
Aggregation/XOR_AGG-8       50.5µs ± 2%
Grouping-8                   260µs ± 1%
Distinct-8                   291µs ± 1%

name                      speed
Aggregation/IDENT-8        161MB/s ± 2%
Aggregation/AVG-8          114MB/s ± 1%
Aggregation/COUNT-8        161MB/s ± 2%
Aggregation/MAX-8          141MB/s ± 2%
Aggregation/MIN-8          144MB/s ± 2%
Aggregation/STDDEV-8      5.35MB/s ± 1%
Aggregation/SUM-8          124MB/s ± 2%
Aggregation/SUM_INT-8      122MB/s ± 4%
Aggregation/VARIANCE-8    5.42MB/s ± 3%
Aggregation/XOR_AGG-8      158MB/s ± 2%
Grouping-8                30.8MB/s ± 1%
Distinct-8                27.5MB/s ± 1%

name                      time/op
HashJoiner/rows=0-8         2.90µs ± 1%
HashJoiner/rows=4-8         7.50µs ± 1%
HashJoiner/rows=16-8        18.0µs ± 1%
HashJoiner/rows=256-8        217µs ± 1%
HashJoiner/rows=4096-8      3.39ms ± 1%
HashJoiner/rows=65536-8     64.4ms ± 4%
MergeJoiner/rows=0-8        3.15µs ± 0%
MergeJoiner/rows=4-8        6.50µs ± 1%
MergeJoiner/rows=16-8       14.9µs ± 0%
MergeJoiner/rows=256-8       170µs ± 1%
MergeJoiner/rows=4096-8     2.64ms ± 0%
MergeJoiner/rows=65536-8    44.4ms ± 1%

name                      speed
HashJoiner/rows=0-8
HashJoiner/rows=4-8       4.27MB/s ± 1%
HashJoiner/rows=16-8      7.09MB/s ± 1%
HashJoiner/rows=256-8     9.43MB/s ± 1%
HashJoiner/rows=4096-8    9.66MB/s ± 1%
HashJoiner/rows=65536-8   8.14MB/s ± 4%
MergeJoiner/rows=0-8
MergeJoiner/rows=4-8      4.93MB/s ± 0%
MergeJoiner/rows=16-8     8.61MB/s ± 0%
MergeJoiner/rows=256-8    12.0MB/s ± 1%
MergeJoiner/rows=4096-8   12.4MB/s ± 0%
MergeJoiner/rows=65536-8  11.8MB/s ± 1%

Release note: None
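These tables are standard Go benchmark output; the MB/s column appears because the benchmarks call b.SetBytes. A minimal, self-contained sketch of that shape, with processRows as a hypothetical stand-in for the processor under test:

```go
package main

import (
	"fmt"
	"testing"
)

// processRows is a hypothetical stand-in for a processor consuming
// numRows rows of numCols 8-byte columns.
func processRows(numRows, numCols int) int {
	sum := 0
	for i := 0; i < numRows*numCols; i++ {
		sum += i
	}
	return sum
}

func main() {
	const numRows, numCols = 1 << 8, 1
	res := testing.Benchmark(func(b *testing.B) {
		// SetBytes makes the harness report MB/s alongside time/op,
		// producing the paired "time/op" and "speed" tables above.
		b.SetBytes(int64(8 * numRows * numCols))
		b.ResetTimer()
		for i := 0; i < b.N; i++ {
			processRows(numRows, numCols)
		}
	})
	fmt.Println(res.N > 0)
}
```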

@petermattis petermattis requested review from asubiotto and a team December 15, 2017 21:03

@petermattis
Collaborator Author

Added a 3rd commit which significantly speeds up mergeJoiner:

name                      old time/op    new time/op     delta
MergeJoiner/rows=0-8        3.15µs ± 0%     3.35µs ± 2%   +6.21%  (p=0.000 n=10+10)
MergeJoiner/rows=4-8        6.50µs ± 1%     5.67µs ± 1%  -12.75%  (p=0.000 n=9+10)
MergeJoiner/rows=16-8       14.9µs ± 0%     11.5µs ± 3%  -22.61%  (p=0.000 n=8+10)
MergeJoiner/rows=256-8       170µs ± 1%      117µs ± 5%  -31.33%  (p=0.000 n=10+10)
MergeJoiner/rows=4096-8     2.64ms ± 0%     1.78ms ± 2%  -32.63%  (p=0.000 n=8+10)
MergeJoiner/rows=65536-8    44.4ms ± 1%     29.6ms ± 2%  -33.43%  (p=0.000 n=9+10)

name                      old speed      new speed       delta
MergeJoiner/rows=4-8      4.93MB/s ± 0%   5.64MB/s ± 1%  +14.51%  (p=0.000 n=8+10)
MergeJoiner/rows=16-8     8.61MB/s ± 0%  11.13MB/s ± 3%  +29.26%  (p=0.000 n=8+10)
MergeJoiner/rows=256-8    12.0MB/s ± 1%   17.5MB/s ± 5%  +45.70%  (p=0.000 n=10+10)
MergeJoiner/rows=4096-8   12.4MB/s ± 0%   18.4MB/s ± 2%  +48.46%  (p=0.000 n=8+10)
MergeJoiner/rows=65536-8  11.8MB/s ± 1%   17.7MB/s ± 2%  +50.23%  (p=0.000 n=9+10)

@petermattis petermattis force-pushed the pmattis/distsql-benchmarks branch from 1f2ec5c to 8cbd609 Compare December 15, 2017 21:54
@petermattis
Collaborator Author

petermattis commented Dec 15, 2017

Updated the 3rd commit to fix another place where EvalContext wasn't being plumbed for mergeJoiner:

name                      old time/op    new time/op     delta
MergeJoiner/rows=0-8        3.15µs ± 0%     3.29µs ± 1%    +4.54%  (p=0.000 n=10+9)
MergeJoiner/rows=4-8        6.50µs ± 1%     6.01µs ± 1%    -7.62%  (p=0.000 n=9+8)
MergeJoiner/rows=16-8       14.9µs ± 0%      9.2µs ± 1%   -38.01%  (p=0.000 n=8+9)
MergeJoiner/rows=256-8       170µs ± 1%       71µs ± 2%   -58.19%  (p=0.000 n=10+10)
MergeJoiner/rows=4096-8     2.64ms ± 0%     1.05ms ± 0%   -60.42%  (p=0.000 n=8+9)
MergeJoiner/rows=65536-8    44.4ms ± 1%     17.2ms ± 1%   -61.22%  (p=0.000 n=9+10)

name                      old speed      new speed       delta
MergeJoiner/rows=4-8      4.93MB/s ± 0%   5.33MB/s ± 1%    +8.15%  (p=0.000 n=8+8)
MergeJoiner/rows=16-8     8.61MB/s ± 0%  13.89MB/s ± 1%   +61.32%  (p=0.000 n=8+9)
MergeJoiner/rows=256-8    12.0MB/s ± 1%   28.8MB/s ± 2%  +139.20%  (p=0.000 n=10+10)
MergeJoiner/rows=4096-8   12.4MB/s ± 0%   31.3MB/s ± 0%  +152.68%  (p=0.000 n=8+9)
MergeJoiner/rows=65536-8  11.8MB/s ± 1%   30.4MB/s ± 1%  +157.84%  (p=0.000 n=9+10)

@RaduBerinde I'm assuming there isn't a correctness issue with using an EvalContext derived from the flow vs the empty EvalContexts that were previously being used.

@petermattis petermattis force-pushed the pmattis/distsql-benchmarks branch from 8cbd609 to a7d0956 Compare December 15, 2017 21:58
b.SetBytes(int64(8 * numRows * numCols))
b.ResetTimer()
for i := 0; i < b.N; i++ {
h, err := newMergeJoiner(flowCtx, spec, leftInput, rightInput, post, &RowDisposer{})
Contributor


So I think the initialization of the processor here is pessimistically skewing the benchmark numbers.

Perhaps try pulling this out of the loop? I don't think any other state is mutated horribly in mergeJoiner after each successive Run.
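Under hypothetical names (newProcessor standing in for newMergeJoiner, with a deliberately visible allocation), the two benchmark shapes being discussed look like this; the hoisted variant is only valid if Run leaves the processor reusable, which is exactly the question raised here:

```go
package main

import (
	"fmt"
	"testing"
)

// processor and newProcessor are hypothetical stand-ins for the
// mergeJoiner and its constructor.
type processor struct{ buf []int }

func newProcessor(n int) *processor { return &processor{buf: make([]int, n)} }

func (p *processor) Run(rows int) int {
	total := 0
	for i := 0; i < rows; i++ {
		total += i
	}
	return total
}

func main() {
	const rows = 16
	// Variant 1: construct inside the loop, so initialization cost is
	// included in every iteration (what the benchmark currently does).
	inLoop := testing.Benchmark(func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			p := newProcessor(rows)
			p.Run(rows)
		}
	})
	// Variant 2: hoist construction out of the loop, measuring only Run.
	p := newProcessor(rows)
	hoisted := testing.Benchmark(func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			p.Run(rows)
		}
	})
	fmt.Println(inLoop.NsPerOp() >= 0, hoisted.NsPerOp() >= 0)
}
```

The gap between the two variants shrinks as rows grows, which matches the observation below that initialization only matters at small numRows.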

Collaborator Author


The initialization of the processor is only a factor at small numRows. And it is somewhat interesting to capture that initialization overhead. The hash joiner benchmark has a TODO about addressing this overhead, but I'm not convinced we want to change the benchmark.

Contributor


Fair enough 👍 The hash joiner's state with the memory containers also looks non-trivial to address for the benchmark, so it might be best to keep this uniform across the benchmarks.

@asubiotto
Copy link
Contributor

Reviewed 2 of 2 files at r1, 2 of 2 files at r2, 4 of 4 files at r3.
Review status: all files reviewed at latest revision, 6 unresolved discussions, all commit checks successful.


pkg/sql/distsqlrun/aggregator_test.go, line 411 at r1 (raw file):

	for _, aggFunc := range aggFuncs {
		b.Run(aggFunc.String(), func(b *testing.B) {
			ctx := context.Background()

Consider extracting all of this minus spec up to and including input.


pkg/sql/distsqlrun/aggregator_test.go, line 427 at r1 (raw file):

				},
			}
			post := &PostProcessSpec{}

nit: This could probably be put into the call to newAggregator. Same thing in BenchmarkGrouping. If we don't want to have unnecessary allocations, we should probably extract RowDisposer instead.


pkg/sql/distsqlrun/aggregator_test.go, line 440 at r1 (raw file):

				input.Reset()
			}
			b.StopTimer()

Is this call to StopTimer necessary?


pkg/sql/distsqlrun/hashjoiner_test.go, line 878 at r2 (raw file):

	const numCols = 1
	for _, numRows := range []int{0, 1 << 2, 1 << 4, 1 << 8, 1 << 12, 1 << 16} {
		b.Run(fmt.Sprintf("rows=%d", numRows), func(b *testing.B) {

Although we don't have a standard for the benchmark key capitalization (we didn't decide in #16830), all the benchmarks in distsqlrun have their first letter capitalized (actually BenchmarkRowChannelPipeline doesn't, but I think you added that in), so this should probably be Rows. I don't mind whether benchmarks have capitalized keys or not, the only reason I use camel case is to be consistent with the benchmark names. Regardless, I think we should make a decision and change all non-conforming benchmarks accordingly.

Also, since you're switching inputSize to numRows, we should probably do that across the board (which means only in BenchmarkSorter).


pkg/sql/distsqlrun/mergejoiner_test.go, line 590 at r2 (raw file):

			b.ResetTimer()
			for i := 0; i < b.N; i++ {
				h, err := newMergeJoiner(flowCtx, spec, leftInput, rightInput, post, &RowDisposer{})

nit: s/h/m



Adjust BenchmarkHashJoiner to remove the projection and only use a
single column so that the benchmark focuses on "speed of light" of the
processor itself.


Release note: None
Plumb EvalContext into streamMerger and streamGroupAccumulator in order
to avoid allocating it on every call to advanceGroup.


Release note (performance change): Speed up EXCEPT and merge joins by
avoiding an unnecessary allocation.
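The shape of this fix can be sketched as follows. Here evalCtx, compareAllocating, and comparePlumbed are hypothetical simplifications of tree.EvalContext and the comparison done in advanceGroup, not the actual CockroachDB code; the point is moving a per-call allocation up to a once-per-flow one.

```go
package main

import "fmt"

// evalCtx is a hypothetical stand-in for tree.EvalContext.
type evalCtx struct{ scratch []byte }

// Before: a fresh context is allocated on every comparison, so the
// allocator runs once per group advance.
func compareAllocating(a, b int) int {
	ctx := &evalCtx{scratch: make([]byte, 64)} // per-call allocation
	_ = ctx
	return a - b
}

// After: the caller plumbs one context in, and the hot path allocates
// nothing.
func comparePlumbed(ctx *evalCtx, a, b int) int {
	return a - b
}

func main() {
	ctx := &evalCtx{scratch: make([]byte, 64)} // allocated once per flow
	fmt.Println(compareAllocating(2, 1), comparePlumbed(ctx, 2, 1))
}
```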
@petermattis petermattis force-pushed the pmattis/distsql-benchmarks branch from a7d0956 to bda143f Compare December 18, 2017 14:41
@petermattis
Collaborator Author

Review status: 1 of 8 files reviewed at latest revision, 6 unresolved discussions.


pkg/sql/distsqlrun/aggregator_test.go, line 411 at r1 (raw file):

Previously, asubiotto (Alfonso Subiotto Marqués) wrote…

Consider extracting all of this minus spec up to and including input.

Extracting into what? Out of the loop? That doesn't make much, if any, difference.


pkg/sql/distsqlrun/aggregator_test.go, line 427 at r1 (raw file):

Previously, asubiotto (Alfonso Subiotto Marqués) wrote…

nit: This could probably be put into the call to newAggregator. Same thing in BenchmarkGrouping. If we don't want to have unnecessary allocations, we should probably extract RowDisposer instead.

I copy&pasted this from somewhere else. Avoiding the allocation was not a consideration. I'll move the creation of RowDisposer up here.


pkg/sql/distsqlrun/aggregator_test.go, line 440 at r1 (raw file):

Previously, asubiotto (Alfonso Subiotto Marqués) wrote…

Is this call to StopTimer necessary?

I usually do this out of habit. We sometimes put work in a defer that takes a long time and that we don't want to include in benchmark timings. So it's not necessary here, but a good habit.
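That habit can be sketched with testing.Benchmark. The temp-file fixture and the square helper below are hypothetical, but the timing behavior is real: deferred cleanup runs before the benchmark function returns, so without an explicit StopTimer it gets charged to the final timed interval.

```go
package main

import (
	"fmt"
	"os"
	"testing"
)

// square is a trivial stand-in for the work being measured.
func square(i int) int { return i * i }

func main() {
	res := testing.Benchmark(func(b *testing.B) {
		// Hypothetical expensive fixture; a temp file stands in for it.
		f, err := os.CreateTemp("", "bench")
		if err != nil {
			b.Fatal(err)
		}
		defer func() {
			// Without the StopTimer below, this deferred cleanup would
			// still be inside the timed region when it runs.
			f.Close()
			os.Remove(f.Name())
		}()
		b.ResetTimer()
		for i := 0; i < b.N; i++ {
			_ = square(i)
		}
		b.StopTimer() // exclude the deferred cleanup from the measurement
	})
	fmt.Println(res.N > 0)
}
```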


pkg/sql/distsqlrun/hashjoiner_test.go, line 878 at r2 (raw file):

Previously, asubiotto (Alfonso Subiotto Marqués) wrote…

Although we don't have a standard for the benchmark key capitalization (we didn't decide in #16830), all the benchmarks in distsqlrun have their first letter capitalized (actually BenchmarkRowChannelPipeline doesn't, but I think you added that in), so this should probably be Rows. I don't mind whether benchmarks have capitalized keys or not, the only reason I use camel case is to be consistent with the benchmark names. Regardless, I think we should make a decision and change all non-conforming benchmarks accordingly.

Also, since you're switching inputSize to numRows, we should probably do that across the board (which means only in BenchmarkSorter).

I've reverted this to InputSize.


pkg/sql/distsqlrun/mergejoiner_test.go, line 590 at r2 (raw file):

Previously, asubiotto (Alfonso Subiotto Marqués) wrote…

nit: s/h/m

Done.



@RaduBerinde
Member

@RaduBerinde I'm assuming there isn't a correctness issue with using an EvalContext derived from the flow vs the empty EvalContexts that were previously being used.

If anything, the old code had issues. There may be some corner cases that we were getting wrong before (maybe DDate to DTimestamp comparisons).

@petermattis
Collaborator Author

We all desire approval, from our partners, from our parents, from our friends. This PR desires it too.

Member

@RaduBerinde RaduBerinde left a comment


Ha, LGTM

@petermattis petermattis merged commit 533e942 into cockroachdb:master Dec 20, 2017
@petermattis petermattis deleted the pmattis/distsql-benchmarks branch December 20, 2017 15:18