
sql/distsqlrun: add benchmarks for aggregator and mergeJoiner #20759

Merged

Conversation

petermattis
Collaborator

Interesting that grouping is slightly faster than distinct. They are
doing the same work in these benchmarks. The STDDEV and VARIANCE
aggregations are slow due to decimal computations.
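That decimal cost can be illustrated with the standard library. In this sketch, math/big stands in for the apd decimals the aggregates actually use; the point is the per-operation heap traffic of arbitrary-precision arithmetic, which is consistent with STDDEV and VARIANCE running roughly 20-30x slower in the tables below.

```go
package main

import (
	"fmt"
	"math/big"
)

// sumSquaresFloat accumulates a sum of squares with machine floats:
// no allocation, one multiply and one add per element.
func sumSquaresFloat(xs []float64) float64 {
	var s float64
	for _, x := range xs {
		s += x * x
	}
	return s
}

// sumSquaresBig does the same with arbitrary-precision floats. Each
// Mul/Add walks heap-allocated limb slices, which is the kind of work
// a decimal-based STDDEV/VARIANCE pays on every input row.
func sumSquaresBig(xs []float64) *big.Float {
	s := new(big.Float)
	tmp := new(big.Float)
	for _, x := range xs {
		tmp.SetFloat64(x)
		tmp.Mul(tmp, tmp)
		s.Add(s, tmp)
	}
	return s
}

func main() {
	xs := []float64{1, 2, 3}
	f, _ := sumSquaresBig(xs).Float64()
	fmt.Println(sumSquaresFloat(xs), f)
}
```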

name                      time/op
Aggregation/IDENT-8         49.6µs ± 2%
Aggregation/AVG-8           70.4µs ± 1%
Aggregation/COUNT-8         49.6µs ± 2%
Aggregation/MAX-8           56.7µs ± 2%
Aggregation/MIN-8           55.6µs ± 2%
Aggregation/STDDEV-8        1.50ms ± 1%
Aggregation/SUM-8           64.8µs ± 2%
Aggregation/SUM_INT-8       65.4µs ± 4%
Aggregation/VARIANCE-8      1.48ms ± 3%
Aggregation/XOR_AGG-8       50.5µs ± 2%
Grouping-8                   260µs ± 1%
Distinct-8                   291µs ± 1%

name                      speed
Aggregation/IDENT-8        161MB/s ± 2%
Aggregation/AVG-8          114MB/s ± 1%
Aggregation/COUNT-8        161MB/s ± 2%
Aggregation/MAX-8          141MB/s ± 2%
Aggregation/MIN-8          144MB/s ± 2%
Aggregation/STDDEV-8      5.35MB/s ± 1%
Aggregation/SUM-8          124MB/s ± 2%
Aggregation/SUM_INT-8      122MB/s ± 4%
Aggregation/VARIANCE-8    5.42MB/s ± 3%
Aggregation/XOR_AGG-8      158MB/s ± 2%
Grouping-8                30.8MB/s ± 1%
Distinct-8                27.5MB/s ± 1%

name                      time/op
HashJoiner/rows=0-8         2.90µs ± 1%
HashJoiner/rows=4-8         7.50µs ± 1%
HashJoiner/rows=16-8        18.0µs ± 1%
HashJoiner/rows=256-8        217µs ± 1%
HashJoiner/rows=4096-8      3.39ms ± 1%
HashJoiner/rows=65536-8     64.4ms ± 4%
MergeJoiner/rows=0-8        3.15µs ± 0%
MergeJoiner/rows=4-8        6.50µs ± 1%
MergeJoiner/rows=16-8       14.9µs ± 0%
MergeJoiner/rows=256-8       170µs ± 1%
MergeJoiner/rows=4096-8     2.64ms ± 0%
MergeJoiner/rows=65536-8    44.4ms ± 1%

name                      speed
HashJoiner/rows=0-8
HashJoiner/rows=4-8       4.27MB/s ± 1%
HashJoiner/rows=16-8      7.09MB/s ± 1%
HashJoiner/rows=256-8     9.43MB/s ± 1%
HashJoiner/rows=4096-8    9.66MB/s ± 1%
HashJoiner/rows=65536-8   8.14MB/s ± 4%
MergeJoiner/rows=0-8
MergeJoiner/rows=4-8      4.93MB/s ± 0%
MergeJoiner/rows=16-8     8.61MB/s ± 0%
MergeJoiner/rows=256-8    12.0MB/s ± 1%
MergeJoiner/rows=4096-8   12.4MB/s ± 0%
MergeJoiner/rows=65536-8  11.8MB/s ± 1%

Release note: None
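These tables are standard Go benchmark output; the MB/s column appears because the benchmarks call b.SetBytes. A minimal, self-contained sketch of that shape, with processRows as a hypothetical stand-in for the processor under test:

```go
package main

import (
	"fmt"
	"testing"
)

// processRows is a hypothetical stand-in for a processor consuming
// numRows rows of numCols 8-byte columns.
func processRows(numRows, numCols int) int {
	sum := 0
	for i := 0; i < numRows*numCols; i++ {
		sum += i
	}
	return sum
}

func main() {
	const numRows, numCols = 1 << 8, 1
	res := testing.Benchmark(func(b *testing.B) {
		// SetBytes makes the harness report MB/s alongside time/op,
		// producing the paired "time/op" and "speed" tables above.
		b.SetBytes(int64(8 * numRows * numCols))
		b.ResetTimer()
		for i := 0; i < b.N; i++ {
			processRows(numRows, numCols)
		}
	})
	fmt.Println(res.N > 0)
}
```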

@petermattis petermattis requested review from asubiotto and a team December 15, 2017 21:03

@petermattis
Collaborator Author

Added a 3rd commit which significantly speeds up mergeJoiner:

name                      old time/op    new time/op     delta
MergeJoiner/rows=0-8        3.15µs ± 0%     3.35µs ± 2%   +6.21%  (p=0.000 n=10+10)
MergeJoiner/rows=4-8        6.50µs ± 1%     5.67µs ± 1%  -12.75%  (p=0.000 n=9+10)
MergeJoiner/rows=16-8       14.9µs ± 0%     11.5µs ± 3%  -22.61%  (p=0.000 n=8+10)
MergeJoiner/rows=256-8       170µs ± 1%      117µs ± 5%  -31.33%  (p=0.000 n=10+10)
MergeJoiner/rows=4096-8     2.64ms ± 0%     1.78ms ± 2%  -32.63%  (p=0.000 n=8+10)
MergeJoiner/rows=65536-8    44.4ms ± 1%     29.6ms ± 2%  -33.43%  (p=0.000 n=9+10)

name                      old speed      new speed       delta
MergeJoiner/rows=4-8      4.93MB/s ± 0%   5.64MB/s ± 1%  +14.51%  (p=0.000 n=8+10)
MergeJoiner/rows=16-8     8.61MB/s ± 0%  11.13MB/s ± 3%  +29.26%  (p=0.000 n=8+10)
MergeJoiner/rows=256-8    12.0MB/s ± 1%   17.5MB/s ± 5%  +45.70%  (p=0.000 n=10+10)
MergeJoiner/rows=4096-8   12.4MB/s ± 0%   18.4MB/s ± 2%  +48.46%  (p=0.000 n=8+10)
MergeJoiner/rows=65536-8  11.8MB/s ± 1%   17.7MB/s ± 2%  +50.23%  (p=0.000 n=9+10)

@petermattis petermattis force-pushed the pmattis/distsql-benchmarks branch from 1f2ec5c to 8cbd609 Compare December 15, 2017 21:54
@petermattis
Collaborator Author

petermattis commented Dec 15, 2017

Updated the 3rd commit to fix another place where EvalContext wasn't being plumbed for mergeJoiner:

name                      old time/op    new time/op     delta
MergeJoiner/rows=0-8        3.15µs ± 0%     3.29µs ± 1%    +4.54%  (p=0.000 n=10+9)
MergeJoiner/rows=4-8        6.50µs ± 1%     6.01µs ± 1%    -7.62%  (p=0.000 n=9+8)
MergeJoiner/rows=16-8       14.9µs ± 0%      9.2µs ± 1%   -38.01%  (p=0.000 n=8+9)
MergeJoiner/rows=256-8       170µs ± 1%       71µs ± 2%   -58.19%  (p=0.000 n=10+10)
MergeJoiner/rows=4096-8     2.64ms ± 0%     1.05ms ± 0%   -60.42%  (p=0.000 n=8+9)
MergeJoiner/rows=65536-8    44.4ms ± 1%     17.2ms ± 1%   -61.22%  (p=0.000 n=9+10)

name                      old speed      new speed       delta
MergeJoiner/rows=4-8      4.93MB/s ± 0%   5.33MB/s ± 1%    +8.15%  (p=0.000 n=8+8)
MergeJoiner/rows=16-8     8.61MB/s ± 0%  13.89MB/s ± 1%   +61.32%  (p=0.000 n=8+9)
MergeJoiner/rows=256-8    12.0MB/s ± 1%   28.8MB/s ± 2%  +139.20%  (p=0.000 n=10+10)
MergeJoiner/rows=4096-8   12.4MB/s ± 0%   31.3MB/s ± 0%  +152.68%  (p=0.000 n=8+9)
MergeJoiner/rows=65536-8  11.8MB/s ± 1%   30.4MB/s ± 1%  +157.84%  (p=0.000 n=9+10)

@RaduBerinde I'm assuming there isn't a correctness issue with using an EvalContext derived from the flow vs the empty EvalContexts that were previously being used.

@petermattis petermattis force-pushed the pmattis/distsql-benchmarks branch from 8cbd609 to a7d0956 Compare December 15, 2017 21:58
b.SetBytes(int64(8 * numRows * numCols))
b.ResetTimer()
for i := 0; i < b.N; i++ {
h, err := newMergeJoiner(flowCtx, spec, leftInput, rightInput, post, &RowDisposer{})
Contributor


So I think the initialization of the processor here is pessimistically skewing the benchmark numbers.

Perhaps try pulling this out of the loop? I don't think any other state is mutated horribly in mergeJoiner after each successive Run.
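Under hypothetical names (newProcessor standing in for newMergeJoiner, with a deliberately visible allocation), the two benchmark shapes being discussed look like this; the hoisted variant is only valid if Run leaves the processor reusable, which is exactly the question raised here:

```go
package main

import (
	"fmt"
	"testing"
)

// processor and newProcessor are hypothetical stand-ins for the
// mergeJoiner and its constructor.
type processor struct{ buf []int }

func newProcessor(n int) *processor { return &processor{buf: make([]int, n)} }

func (p *processor) Run(rows int) int {
	total := 0
	for i := 0; i < rows; i++ {
		total += i
	}
	return total
}

func main() {
	const rows = 16
	// Variant 1: construct inside the loop, so initialization cost is
	// included in every iteration (what the benchmark currently does).
	inLoop := testing.Benchmark(func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			p := newProcessor(rows)
			p.Run(rows)
		}
	})
	// Variant 2: hoist construction out of the loop, measuring only Run.
	p := newProcessor(rows)
	hoisted := testing.Benchmark(func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			p.Run(rows)
		}
	})
	fmt.Println(inLoop.NsPerOp() >= 0, hoisted.NsPerOp() >= 0)
}
```

The gap between the two variants shrinks as rows grows, which matches the observation below that initialization only matters at small numRows.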

Collaborator Author


The initialization of the processor is only a factor at small numRows. And it is somewhat interesting to capture that initialization overhead. The hash joiner benchmark has a TODO about addressing this overhead, but I'm not convinced we want to change the benchmark.

Contributor


Fair enough 👍 The hash joiner's state with the memory containers also looks non-trivial to address for the benchmark, so it might be best to keep this uniform across the benchmarks.

@asubiotto
Copy link
Contributor

Reviewed 2 of 2 files at r1, 2 of 2 files at r2, 4 of 4 files at r3.
Review status: all files reviewed at latest revision, 6 unresolved discussions, all commit checks successful.


pkg/sql/distsqlrun/aggregator_test.go, line 411 at r1 (raw file):

	for _, aggFunc := range aggFuncs {
		b.Run(aggFunc.String(), func(b *testing.B) {
			ctx := context.Background()

Consider extracting all of this minus spec up to and including input.


pkg/sql/distsqlrun/aggregator_test.go, line 427 at r1 (raw file):

				},
			}
			post := &PostProcessSpec{}

nit: This could probably be put into the call to newAggregator. Same thing in BenchmarkGrouping. If we don't want to have unnecessary allocations, we should probably extract RowDisposer instead.


pkg/sql/distsqlrun/aggregator_test.go, line 440 at r1 (raw file):

				input.Reset()
			}
			b.StopTimer()

Is this call to StopTimer necessary?


pkg/sql/distsqlrun/hashjoiner_test.go, line 878 at r2 (raw file):

	const numCols = 1
	for _, numRows := range []int{0, 1 << 2, 1 << 4, 1 << 8, 1 << 12, 1 << 16} {
		b.Run(fmt.Sprintf("rows=%d", numRows), func(b *testing.B) {

Although we don't have a standard for the benchmark key capitalization (we didn't decide in #16830), all the benchmarks in distsqlrun have their first letter capitalized (actually BenchmarkRowChannelPipeline doesn't, but I think you added that in), so this should probably be Rows. I don't mind whether benchmarks have capitalized keys or not, the only reason I use camel case is to be consistent with the benchmark names. Regardless, I think we should make a decision and change all non-conforming benchmarks accordingly.

Also, since you're switching inputSize to numRows, we should probably do that across the board (which means only in BenchmarkSorter).


pkg/sql/distsqlrun/mergejoiner_test.go, line 590 at r2 (raw file):

			b.ResetTimer()
			for i := 0; i < b.N; i++ {
				h, err := newMergeJoiner(flowCtx, spec, leftInput, rightInput, post, &RowDisposer{})

nit: s/h/m



Adjust BenchmarkHashJoiner to remove the projection and only use a
single column so that the benchmark focuses on "speed of light" of the
processor itself.


Release note: None
Plumb EvalContext into streamMerger and streamGroupAccumulator in order
to avoid allocating it on every call to advanceGroup.


Release note (performance change): Speed up EXCEPT and merge joins by
avoiding an unnecessary allocation.
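The shape of this fix can be sketched as follows. Here evalCtx, compareAllocating, and comparePlumbed are hypothetical simplifications of tree.EvalContext and the comparison done in advanceGroup, not the actual CockroachDB code; the point is moving a per-call allocation up to a once-per-flow one.

```go
package main

import "fmt"

// evalCtx is a hypothetical stand-in for tree.EvalContext.
type evalCtx struct{ scratch []byte }

// Before: a fresh context is allocated on every comparison, so the
// allocator runs once per group advance.
func compareAllocating(a, b int) int {
	ctx := &evalCtx{scratch: make([]byte, 64)} // per-call allocation
	_ = ctx
	return a - b
}

// After: the caller plumbs one context in, and the hot path allocates
// nothing.
func comparePlumbed(ctx *evalCtx, a, b int) int {
	return a - b
}

func main() {
	ctx := &evalCtx{scratch: make([]byte, 64)} // allocated once per flow
	fmt.Println(compareAllocating(2, 1), comparePlumbed(ctx, 2, 1))
}
```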
@petermattis petermattis force-pushed the pmattis/distsql-benchmarks branch from a7d0956 to bda143f Compare December 18, 2017 14:41
@petermattis
Collaborator Author

Review status: 1 of 8 files reviewed at latest revision, 6 unresolved discussions.


pkg/sql/distsqlrun/aggregator_test.go, line 411 at r1 (raw file):

Previously, asubiotto (Alfonso Subiotto Marqués) wrote…

Consider extracting all of this minus spec up to and including input.

Extracting into what? Out of the loop? That doesn't make much, if any, difference.


pkg/sql/distsqlrun/aggregator_test.go, line 427 at r1 (raw file):

Previously, asubiotto (Alfonso Subiotto Marqués) wrote…

nit: This could probably be put into the call to newAggregator. Same thing in BenchmarkGrouping. If we don't want to have unnecessary allocations, we should probably extract RowDisposer instead.

I copy&pasted this from somewhere else. Avoiding the allocation was not a consideration. I'll move the creation of RowDisposer up here.


pkg/sql/distsqlrun/aggregator_test.go, line 440 at r1 (raw file):

Previously, asubiotto (Alfonso Subiotto Marqués) wrote…

Is this call to StopTimer necessary?

I usually do this out of habit. We sometimes put work in a defer that takes a long time and that we don't want to include in benchmark timings. So it's not necessary here, but a good habit.
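That habit can be sketched with testing.Benchmark. The temp-file fixture and the square helper below are hypothetical, but the timing behavior is real: deferred cleanup runs before the benchmark function returns, so without an explicit StopTimer it gets charged to the final timed interval.

```go
package main

import (
	"fmt"
	"os"
	"testing"
)

// square is a trivial stand-in for the work being measured.
func square(i int) int { return i * i }

func main() {
	res := testing.Benchmark(func(b *testing.B) {
		// Hypothetical expensive fixture; a temp file stands in for it.
		f, err := os.CreateTemp("", "bench")
		if err != nil {
			b.Fatal(err)
		}
		defer func() {
			// Without the StopTimer below, this deferred cleanup would
			// still be inside the timed region when it runs.
			f.Close()
			os.Remove(f.Name())
		}()
		b.ResetTimer()
		for i := 0; i < b.N; i++ {
			_ = square(i)
		}
		b.StopTimer() // exclude the deferred cleanup from the measurement
	})
	fmt.Println(res.N > 0)
}
```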


pkg/sql/distsqlrun/hashjoiner_test.go, line 878 at r2 (raw file):

Previously, asubiotto (Alfonso Subiotto Marqués) wrote…

Although we don't have a standard for the benchmark key capitalization (we didn't decide in #16830), all the benchmarks in distsqlrun have their first letter capitalized (actually BenchmarkRowChannelPipeline doesn't, but I think you added that in), so this should probably be Rows. I don't mind whether benchmarks have capitalized keys or not, the only reason I use camel case is to be consistent with the benchmark names. Regardless, I think we should make a decision and change all non-conforming benchmarks accordingly.

Also, since you're switching inputSize to numRows, we should probably do that across the board (which means only in BenchmarkSorter).

I've reverted this to InputSize.


pkg/sql/distsqlrun/mergejoiner_test.go, line 590 at r2 (raw file):

Previously, asubiotto (Alfonso Subiotto Marqués) wrote…

nit: s/h/m

Done.



@RaduBerinde
Member

@RaduBerinde I'm assuming there isn't a correctness issue with using an EvalContext derived from the flow vs the empty EvalContexts that were previously being used.

If anything, the old code had issues. There may be some corner cases that we were getting wrong before (maybe DDate to DTimestamp comparisons).

@petermattis
Collaborator Author

We all desire approval, from our partners, from our parents, from our friends. This PR desires it too.

Member

@RaduBerinde RaduBerinde left a comment


Ha, LGTM

@petermattis petermattis merged commit 533e942 into cockroachdb:master Dec 20, 2017
@petermattis petermattis deleted the pmattis/distsql-benchmarks branch December 20, 2017 15:18