
[DNM] apd: embed small coefficient values in Decimal struct #102

Closed

Conversation

Member

@nvanbenschoten nvanbenschoten commented Jan 3, 2022

Replaces cockroachdb/cockroach#74369.
Replaces #101.

This commit introduces a performance optimization that embeds small coefficient values directly in their Decimal struct, instead of storing these values in a separate heap allocation. It does so by replacing math/big.Int with a new wrapper type called BigInt that provides an "inline" compact representation optimization.

Each BigInt maintains (through big.Int) an internal reference to a variable-length integer value, which is represented by a []big.Word. The _inline field and lazyInit method combine to allow BigInt to inline this variable-length integer array within the BigInt struct when its value is sufficiently small. In lazyInit, we point the _inner field's slice at the _inline array. big.Int will avoid re-allocating this array until it is provided with a value that exceeds the initial capacity.

We set the capacity of the inline array to accommodate any value that would fit in a 128-bit integer (i.e. values up to 2^128 - 1).
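
To make the mechanism concrete, here is a minimal sketch under the assumptions of this description (the _inner/_inline field names, the lazyInit method, and 64-bit words), with the escape-analysis workaround discussed below omitted; the actual implementation differs in its details:

```go
package apdsketch

import "math/big"

// BigInt wraps big.Int and carries a small inline word array that big.Int can
// use as the backing store for small coefficients.
type BigInt struct {
	_inner  big.Int     // all arithmetic is still delegated to big.Int
	_inline [2]big.Word // holds any value up to 2^128 - 1 on 64-bit platforms
}

// lazyInit points _inner's word slice at _inline so that values which fit in
// two words never trigger a separate heap allocation. big.Int only allocates
// a new backing array once a value exceeds this capacity.
func (b *BigInt) lazyInit() {
	if b._inner.Bits() == nil {
		b._inner.SetBits(b._inline[:0])
	}
}

// Add shows the delegation pattern: wire up the inline storage, then let
// big.Int do the arithmetic. A missed lazyInit call only costs an allocation.
func (z *BigInt) Add(x, y *BigInt) *BigInt {
	z.lazyInit()
	z._inner.Add(&x._inner, &y._inner)
	return z
}
```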

This is an alternative to an optimization that many other arbitrary precision decimal libraries have, where small coefficient values are stored as numeric fields in their data type's struct. Only when this coefficient value gets large enough do these libraries fall back to a variable-length coefficient with internal indirection. We can see the optimization in practice in the ericlagergren/decimal library, where each struct contains a compact uint64 field and an unscaled big.Int field. Prior concern from the authors of cockroachdb/apd regarding this form of compact representation optimization was that it traded simplicity for performance: the optimization fractured control flow, leaking out across the library and leading to more complex, error-prone code.

The approach taken in this commit does not have the same issue. All arithmetic on the decimal's coefficient is still deferred to big.Int. In fact, the entire optimization is best-effort, and bugs that lead to missed calls to lazyInit are merely missed opportunities to avoid a heap allocation, and nothing more serious.

However, one major complication with this approach is that Go's escape analysis struggles to understand self-referential pointers. A naive implementation of this optimization would force all BigInt structs to escape to the heap. To work around this, we employ a similar trick to sync.Cond and strings.Builder: we trick escape analysis into allowing the self-referential pointer without causing the struct to escape.

This works but it introduces complexity if BigInt structs are copied by value. So to avoid nasty bugs, we disallow copying of BigInt structs. The self-referencing pointer from _inner to _inline makes this unsafe, as it could allow aliasing between two BigInt structs which would be hidden from escape analysis. If the first BigInt then fell out of scope and was GCed, this could corrupt the state of the second BigInt. sync.Cond and strings.Builder also prevent copying to avoid this kind of issue. In fact, big.Int itself says that "shallow copies are not supported and may lead to errors", but it doesn't enforce this.
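
For illustration, here is roughly how the strings.Builder-style trick works, extending the sketch above with a hypothetical _addr uintptr field (the names and panic message are illustrative, not the code in this PR):

```go
import "unsafe"

// noescape hides a pointer from escape analysis. The XOR with 0 is a no-op at
// runtime, but it stops the compiler from tracking the pointer's origin. This
// is the same idiom used by strings.Builder's copy check.
//
//go:nosplit
//go:nocheckptr
func noescape(p unsafe.Pointer) unsafe.Pointer {
	x := uintptr(p)
	return unsafe.Pointer(x ^ 0)
}

// copyCheck records the struct's own address on first use and panics if the
// struct is later observed at a different address, i.e. it was copied by
// value, which would alias the inline array.
func (b *BigInt) copyCheck() {
	if b._addr == 0 {
		b._addr = uintptr(noescape(unsafe.Pointer(b)))
	} else if b._addr != uintptr(unsafe.Pointer(b)) {
		panic("BigInt copied by value; copying is not supported")
	}
}
```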

Impact on benchmarks:

```
name                 old time/op    new time/op    delta
GDA/comparetotal-10    46.3µs ± 0%    24.4µs ± 1%   -47.33%  (p=0.000 n=10+9)
GDA/remainder-10       68.4µs ± 0%    40.2µs ± 0%   -41.31%  (p=0.000 n=10+9)
GDA/abs-10             11.5µs ± 1%     7.0µs ± 0%   -39.46%  (p=0.000 n=10+10)
GDA/compare-10         55.7µs ± 0%    33.8µs ± 1%   -39.25%  (p=0.000 n=10+10)
GDA/tointegralx-10     36.0µs ± 1%    22.1µs ± 0%   -38.55%  (p=0.000 n=10+9)
GDA/minus-10           14.1µs ± 0%     8.8µs ± 0%   -38.10%  (p=0.000 n=10+10)
GDA/tointegral-10      35.1µs ± 1%    21.8µs ± 0%   -37.83%  (p=0.000 n=10+10)
GDA/quantize-10         134µs ± 1%      84µs ± 0%   -37.57%  (p=0.000 n=9+10)
GDA/subtract-10         171µs ± 0%     109µs ± 0%   -36.37%  (p=0.000 n=10+10)
GDA/reduce-10          21.7µs ± 1%    14.0µs ± 0%   -35.18%  (p=0.000 n=10+8)
GDA/divideint-10       34.2µs ± 0%    22.8µs ± 0%   -33.40%  (p=0.000 n=9+9)
GDA/multiply-10        80.5µs ± 0%    54.9µs ± 0%   -31.83%  (p=0.000 n=9+10)
GDA/randoms-10         3.20ms ± 0%    2.21ms ± 0%   -30.73%  (p=0.000 n=9+9)
GDA/add-10              917µs ± 0%     641µs ± 0%   -30.03%  (p=0.000 n=10+10)
GDA/rounding-10         623µs ± 0%     472µs ± 0%   -24.16%  (p=0.000 n=10+8)
GDA/plus-10            45.0µs ± 0%    37.5µs ± 0%   -16.63%  (p=0.000 n=10+9)
GDA/base-10             131µs ± 0%     114µs ± 0%   -13.40%  (p=0.000 n=10+10)
GDA/squareroot-10      31.6ms ± 0%    27.4ms ± 0%   -13.16%  (p=0.000 n=9+8)
GDA/powersqrt-10        431ms ± 0%     417ms ± 0%    -3.16%  (p=0.000 n=9+9)
GDA/divide-10           366µs ± 0%     360µs ± 0%    -1.72%  (p=0.000 n=9+8)
GDA/cuberoot-apd-10    1.97ms ± 0%    2.07ms ± 0%    +5.21%  (p=0.000 n=9+10)
GDA/exp-10              119ms ± 0%     126ms ± 0%    +6.19%  (p=0.000 n=10+8)
GDA/power-10            208ms ± 0%     225ms ± 0%    +8.55%  (p=0.000 n=10+9)
GDA/log10-10            101ms ± 0%     110ms ± 0%    +9.49%  (p=0.000 n=10+9)
GDA/ln-10              79.6ms ± 0%    87.3ms ± 0%    +9.61%  (p=0.000 n=9+10)

name                 old alloc/op   new alloc/op   delta
GDA/abs-10             6.50kB ± 0%    0.00kB       -100.00%  (p=0.000 n=10+10)
GDA/compare-10         39.1kB ± 0%     0.0kB       -100.00%  (p=0.000 n=10+10)
GDA/comparetotal-10    37.2kB ± 0%     0.0kB       -100.00%  (p=0.000 n=10+10)
GDA/minus-10           7.71kB ± 0%    0.00kB       -100.00%  (p=0.000 n=10+10)
GDA/reduce-10          10.1kB ± 0%     0.0kB       -100.00%  (p=0.000 n=10+10)
GDA/remainder-10       45.5kB ± 0%     0.1kB ± 0%   -99.86%  (p=0.000 n=10+10)
GDA/rounding-10         292kB ± 0%       5kB ± 0%   -98.33%  (p=0.000 n=10+10)
GDA/squareroot-10      8.46MB ± 0%    0.29MB ± 0%   -96.55%  (p=0.000 n=10+10)
GDA/randoms-10         1.25MB ± 0%    0.05MB ± 0%   -95.98%  (p=0.000 n=10+10)
GDA/divideint-10       23.2kB ± 0%     1.2kB ± 0%   -94.85%  (p=0.000 n=10+10)
GDA/divide-10           102kB ± 0%       6kB ± 0%   -93.64%  (p=0.000 n=10+10)
GDA/powersqrt-10       77.8MB ± 0%     5.1MB ± 0%   -93.44%  (p=0.000 n=10+10)
GDA/quantize-10        76.4kB ± 0%     8.9kB ± 0%   -88.33%  (p=0.000 n=8+8)
GDA/multiply-10        55.4kB ± 0%    10.7kB ± 0%   -80.71%  (p=0.000 n=10+10)
GDA/tointegralx-10     27.9kB ± 0%     6.2kB ± 0%   -77.89%  (p=0.000 n=10+10)
GDA/tointegral-10      27.2kB ± 0%     6.2kB ± 0%   -77.34%  (p=0.000 n=9+10)
GDA/subtract-10         131kB ± 0%      34kB ± 0%   -73.82%  (p=0.000 n=10+7)
GDA/cuberoot-apd-10     265kB ± 0%      82kB ± 0%   -68.97%  (p=0.000 n=10+10)
GDA/base-10            60.3kB ± 0%    26.2kB ± 0%   -56.61%  (p=0.000 n=10+10)
GDA/ln-10              10.2MB ± 0%     5.0MB ± 0%   -50.58%  (p=0.000 n=10+10)
GDA/log10-10           12.5MB ± 0%     6.4MB ± 0%   -49.20%  (p=0.000 n=10+10)
GDA/add-10              811kB ± 0%     422kB ± 0%   -47.95%  (p=0.000 n=10+10)
GDA/power-10           27.0MB ± 0%    14.3MB ± 0%   -47.21%  (p=0.000 n=10+10)
GDA/plus-10            52.3kB ± 0%    40.0kB ± 0%   -23.47%  (p=0.000 n=10+8)
GDA/exp-10             61.3MB ± 0%    55.7MB ± 0%    -9.12%  (p=0.000 n=10+10)

name                 old allocs/op  new allocs/op  delta
GDA/abs-10                238 ± 0%         0       -100.00%  (p=0.000 n=10+10)
GDA/compare-10          1.24k ± 0%     0.00k       -100.00%  (p=0.000 n=10+10)
GDA/comparetotal-10       881 ± 0%         0       -100.00%  (p=0.000 n=10+10)
GDA/minus-10              274 ± 0%         0       -100.00%  (p=0.000 n=10+10)
GDA/reduce-10             352 ± 0%         0       -100.00%  (p=0.000 n=10+10)
GDA/remainder-10        1.40k ± 0%     0.00k ± 0%   -99.93%  (p=0.000 n=10+10)
GDA/quantize-10         2.41k ± 0%     0.04k ± 0%   -98.18%  (p=0.000 n=10+10)
GDA/squareroot-10        278k ± 0%        5k ± 0%   -98.09%  (p=0.000 n=10+10)
GDA/randoms-10          42.2k ± 0%      0.9k ± 0%   -97.81%  (p=0.000 n=10+10)
GDA/rounding-10         10.2k ± 0%      0.3k ± 0%   -96.83%  (p=0.000 n=10+10)
GDA/divide-10           3.08k ± 0%     0.10k ± 0%   -96.72%  (p=0.000 n=10+10)
GDA/tointegralx-10        857 ± 0%        35 ± 0%   -95.92%  (p=0.000 n=10+10)
GDA/tointegral-10         833 ± 0%        35 ± 0%   -95.80%  (p=0.000 n=10+10)
GDA/subtract-10         3.51k ± 0%     0.18k ± 0%   -94.76%  (p=0.000 n=10+10)
GDA/powersqrt-10        2.63M ± 0%     0.17M ± 0%   -93.49%  (p=0.000 n=10+10)
GDA/multiply-10         1.35k ± 0%     0.16k ± 0%   -88.41%  (p=0.000 n=10+10)
GDA/cuberoot-apd-10     7.75k ± 0%     1.19k ± 0%   -84.59%  (p=0.000 n=10+10)
GDA/add-10              15.2k ± 0%      3.0k ± 0%   -79.96%  (p=0.000 n=10+10)
GDA/divideint-10          704 ± 0%       142 ± 0%   -79.83%  (p=0.000 n=10+10)
GDA/ln-10                269k ± 0%       80k ± 0%   -70.21%  (p=0.000 n=10+10)
GDA/log10-10             327k ± 0%      101k ± 0%   -69.04%  (p=0.000 n=10+10)
GDA/power-10             700k ± 0%      226k ± 0%   -67.69%  (p=0.000 n=10+10)
GDA/plus-10               706 ± 0%       229 ± 0%   -67.56%  (p=0.000 n=10+10)
GDA/exp-10               911k ± 0%      544k ± 0%   -40.32%  (p=0.000 n=10+10)
GDA/base-10             2.86k ± 0%     2.52k ± 0%   -12.06%  (p=0.000 n=10+10)
```

cc. @mjibson

They cause the benchmarks to run for a very long time.
See golang/go#27217.

Adjust the benchmarks to have an explicit setup phase
and run phase, separated by `b.ResetTimer`.
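
A minimal sketch of that setup/run split (the benchCase type and helper functions are illustrative stand-ins, not the actual benchmark code):

```go
package apdsketch

import "testing"

type benchCase struct{ op string } // stand-in for one parsed GDA operation

func loadBenchCases(b *testing.B) []benchCase { return nil } // parse the GDA file
func runBenchCase(bc benchCase)               {}             // evaluate one operation

func BenchmarkGDA(b *testing.B) {
	bcs := loadBenchCases(b) // setup phase: not timed
	b.ResetTimer()           // start measuring here
	for i := 0; i < b.N; i++ {
		for _, bc := range bcs {
			runBenchCase(bc) // run phase: the whole suite runs once per iteration
		}
	}
}
```
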
This reduces the size of the Decimal struct from 48 bytes to 40 bytes.
This commit introduces a performance optimization that embeds small
coefficient values directly in their `Decimal` struct, instead of
storing these values in a separate heap allocation. It does so by
replacing `math/big.Int` with a new wrapper type called `BigInt` that
provides an "inline" compact representation optimization.

Each `BigInt` maintains (through `big.Int`) an internal reference to a
variable-length integer value, which is represented by a []big.Word. The
_inline field and lazyInit method combine to allow BigInt to inline this
variable-length integer array within the BigInt struct when its value is
sufficiently small. In lazyInit, we point the _inner field's slice at the
_inline array. big.Int will avoid re-allocating this array until it is
provided with a value that exceeds the initial capacity.

We set the capacity of the inline array to accommodate any value that
would fit in a 128-bit integer (i.e. values up to 2^128 - 1).

This is an alternative to an optimization that many other arbitrary
precision decimal libraries have where small coefficient values are
stored as numeric fields in their data type's struct. Only when this
coefficient value gets large enough do these libraries fall back to a
variable-length coefficient with internal indirection. We can see the
optimization in practice in the `ericlagergren/decimal` library, where
each struct contains a `compact uint64` and an `unscaled big.Int`. Prior
concern from the authors of `cockroachdb/apd` regarding this form of
compact representation optimization was that it traded simplicity for
performance. The optimization fractured control flow, leaking out across
the library and leading to more complex, error-prone code.

The approach taken in this commit does not have the same issue. All
arithmetic on the decimal's coefficient is still deferred to `big.Int`.
In fact, the entire optimization is best-effort, and bugs that lead to
missed calls to `lazyInit` are merely missed opportunities to avoid a
heap allocation, and nothing more serious.

However, one major complication with this approach is that Go's escape
analysis struggles with understanding self-referential pointers. A naive
implementation of this optimization would force all BigInt structs to
escape to the heap. To work around this, we employ a similar trick to
`sync.Cond` and `strings.Builder`: we trick escape analysis into allowing
the self-referential pointer without causing the struct to escape.

This works but it introduces complexity if BigInt structs are copied by
value. So to avoid nasty bugs, we disallow copying of BigInt structs.
The self-referencing pointer from _inner to _inline makes this unsafe,
as it could allow aliasing between two BigInt structs which would be
hidden from escape analysis. If the first BigInt then fell out of scope
and was GCed, this could corrupt the state of the second BigInt.
`sync.Cond` and `strings.Builder` also prevent copying to avoid this
kind of issue. In fact, `big.Int` itself says that "shallow copies are
not supported and may lead to errors", but it doesn't enforce this.

Microbenchmarks:
```
BenchmarkBigIntBinomial-10         	 1460434	       804.5 ns/op	    1024 B/op	      38 allocs/op
BenchmarkBigIntQuoRem-10           	 1212969	       985.9 ns/op	       0 B/op	       0 allocs/op
BenchmarkBigIntExp-10              	     454	   2623567 ns/op	   10969 B/op	      21 allocs/op
BenchmarkBigIntExp2-10             	     456	   2613395 ns/op	   11223 B/op	      22 allocs/op
BenchmarkBigIntBitset-10           	152604634	         7.856 ns/op	       0 B/op	       0 allocs/op
BenchmarkBigIntBitsetNeg-10        	45926347	        25.68 ns/op	       0 B/op	       0 allocs/op
BenchmarkBigIntBitsetOrig-10       	27844972	        41.64 ns/op	      55 B/op	       0 allocs/op
BenchmarkBigIntBitsetNegOrig-10    	12631069	        94.04 ns/op	     168 B/op	       1 allocs/op
BenchmarkBigIntModInverse-10       	 1913102	       622.4 ns/op	    1280 B/op	      11 allocs/op
BenchmarkBigIntSqrt-10             	   90704	     13227 ns/op	    5538 B/op	      12 allocs/op
BenchmarkBigIntDiv/20/10-10        	41788064	        27.61 ns/op	       0 B/op	       0 allocs/op
BenchmarkBigIntDiv/40/20-10        	42714760	        27.57 ns/op	       0 B/op	       0 allocs/op
BenchmarkBigIntDiv/100/50-10       	25163826	        47.14 ns/op	       0 B/op	       0 allocs/op
BenchmarkBigIntDiv/200/100-10      	 7893946	       151.7 ns/op	       0 B/op	       0 allocs/op
BenchmarkBigIntDiv/400/200-10      	 7052482	       169.5 ns/op	       0 B/op	       0 allocs/op
BenchmarkBigIntDiv/1000/500-10     	 4212556	       283.9 ns/op	       0 B/op	       0 allocs/op
BenchmarkBigIntDiv/2000/1000-10    	 2126505	       563.8 ns/op	       0 B/op	       0 allocs/op
BenchmarkBigIntDiv/20000/10000-10  	   71372	     16754 ns/op	     128 B/op	       1 allocs/op
BenchmarkBigIntDiv/200000/100000-10         	    1910	    618446 ns/op	     264 B/op	       1 allocs/op
BenchmarkBigIntDiv/2000000/1000000-10       	      45	  25053395 ns/op	   88072 B/op	       2 allocs/op
BenchmarkBigIntDiv/20000000/10000000-10     	       2	 937879542 ns/op	13384992 B/op	      48 allocs/op
```
This change replaces many calls to `new` for `BigInt` and `Decimal`
values with stack-allocated values.

This has less of an effect than it may initially seem on its own, because
Go's escape analysis can keep `new` "allocations" on the stack in some
cases. The larger benefit of this change is that it makes the cases
where a value does escape and is heap allocated more obvious, because
they now show up as "moved to heap" lines in escape analysis logs.
These were useful, but they created escape analysis barriers that
resulted in unnecessary heap allocations. Remove them.

Before
```
➜ goescape . | grep moved
./decimal.go:325:8: moved to heap: integ
./round.go:73:7: moved to heap: y
./context.go:287:6: moved to heap: quo
./context.go:493:6: moved to heap: f
./context.go:503:6: moved to heap: approx
./context.go:519:6: moved to heap: tmp
./context.go:570:6: moved to heap: ax
./context.go:570:10: moved to heap: z
./context.go:599:6: moved to heap: z0
./context.go:1008:6: moved to heap: n
./context.go:902:6: moved to heap: tmp1
./context.go:912:6: moved to heap: tmp2
./context.go:939:9: moved to heap: r
./context.go:965:6: moved to heap: sum
./context.go:697:6: moved to heap: tmp1
./context.go:697:12: moved to heap: tmp2
./context.go:697:18: moved to heap: tmp3
./context.go:697:24: moved to heap: tmp4
./context.go:697:30: moved to heap: z
./context.go:697:33: moved to heap: resAdjust
./context.go:1045:13: moved to heap: frac
./context.go:1067:6: moved to heap: tmp
```

After
```
➜ goescape . | grep moved
./decimal.go:325:8: moved to heap: integ
./round.go:73:7: moved to heap: y
./context.go:287:6: moved to heap: quo
```
This commit reworks the `Rounder` API to eliminate the escape analysis
barrier that it was creating, which resulted in unnecessary heap
allocations. The commit replaces the opaque functions used for dynamic
dispatch with a switch statement, which escape analysis is more easily
able to understand.

Unfortunately, to do this, we need to make the roundings a closed set
and remove the ability for users to supply their own rounding routines.
I think this is a reasonable trade-off, given that we are not aware of
anyone actually using the extra flexibility.
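
As a rough sketch of the shape of that change (the type, constants, and function here are illustrative, not the actual apd API), dispatching on a closed set of rounding modes with a switch gives escape analysis a call graph it can fully see:

```go
package apdsketch

// Rounding is a closed set of rounding modes, replacing the previous
// function-valued Rounder that escape analysis could not see through.
type Rounding int

const (
	RoundDown Rounding = iota
	RoundHalfUp
	RoundHalfEven
)

// shouldRoundUp reports whether the truncated coefficient should be
// incremented, given how the discarded remainder compares to one half of a
// unit in the last place (-1, 0, +1) and whether the truncated value is odd.
func shouldRoundUp(r Rounding, halfCmp int, odd bool) bool {
	switch r {
	case RoundDown:
		return false
	case RoundHalfUp:
		return halfCmp >= 0
	case RoundHalfEven:
		return halfCmp > 0 || (halfCmp == 0 && odd)
	default:
		return false
	}
}
```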

Before
```
➜ goescape . | grep moved
./decimal.go:325:8: moved to heap: integ
./round.go:73:7: moved to heap: y
./context.go:287:6: moved to heap: quo
```

After
```
➜ goescape . | grep moved | wc -l
       0
```
@nvanbenschoten
Member Author

nvanbenschoten commented Jan 4, 2022

Updates

Integration into CRDB

I took a stab at integrating this into CRDB in this nightmare of a prototype.

The restriction that apd.Decimal can no longer be copied by value wasn't too much of a hurdle in most places. In fact, it forced some improvements in the key and value decoding code, because it prevented us from writing code that performed needless memcpys and wasted allocations. For example, DecodeDecimalAscending's signature changed from:

```go
func DecodeDecimalAscending(buf []byte, tmp []byte) ([]byte, apd.Decimal, error)
```

to

```go
func DecodeDecimalAscending(dec *apd.Decimal, buf []byte, tmp []byte) ([]byte, error)
```

This in turn forces the code to be slightly more deliberate about where it decodes Decimals into, so we end up using the rowenc.DatumAlloc more reliably. All in all, I think that's fine.

The more complex part of the change was in the vectorized execution engine. The complexity wasn't even entirely due to this change, but instead due to the need to first address this todo:

```go
// TODO(yuzefovich): consider whether Get and Set on Decimals should operate on
// pointers to apd.Decimal.
```

Since we can no longer copy Decimals by value, the handling of Decimal in the vectorized execution needs to become even more custom. This is because we don't currently have any data types that the vectorized execution engine interacts with by reference.

I spent a while learning about and playing with execgen. I made some progress, but eventually gave up and made changes manually. I'm sure someone who actually knows how execgen works would have more luck, but I really struggled with the conditional templating.

After making that change and getting CRDB compiling, I took it for a spin. First, I confirmed that we see the same reduction in heap allocations that we saw in cockroachdb/cockroach#74369. Next, I played around a bit with the tpcds dataset, which includes many DECIMAL columns. I didn't run any tpcds queries, but I did do a few custom full-table aggregations over the DECIMAL columns in ~2GB large tables.

Nothing was scientific, but there did seem to be something around a 5% speedup compared to master on these aggregations. Or maybe I just wanted there to be. Again, not scientific.


More compact memory representation

After writing this change I looked into how other systems like Postgres, MySQL, SQL Server, and Materialize handle Decimals. It turns out that only Postgres (not even Materialize, which strives for close to full PG support) supports arbitrary precision decimals (100k+ precision). Most other systems cap the precision of decimals at somewhere around 40, meaning that they don't even need to support a variable-length memory representation. They can instead inline the entire value in a single 24-byte struct.

We do want to maintain compatibility with PG, so I think we need to continue to support arbitrary precision, but this made me question why we were tailoring so much of this code to arbitrary precision. Specifically, I started questioning why we were spending so many bytes of the Decimal memory representation to support absurdly large precision values. Out of the 64 bytes in a Decimal, 32 were going to the big.Int (bool + slice header) and 8 were going to the self-referential pointer used to dynamically detect value copies, which are unsafe as a second-order consequence of supporting arbitrary precision.

I drafted another prototype that further optimizes for small values at the expense of large values: nvanbenschoten@654be62.

The key idea is that instead of a memory representation that looks like:

```go
type Decimal struct {
    Form     int8
    Negative bool
    Exponent int32
    Coeff    BigInt {
        _inner big.Int {
            neg bool
            abs []uint
        }
        _inline [2]uint
        _addr   uintptr
    }
} // sizeof = 64 bytes
```

We have a memory representation that looks like:

```go
type Decimal struct {
    Form     int8
    Negative bool
    Exponent int32
    Coeff    BigInt {
        _inner *big.Int // nil when value fits in _inline
        _inline [2]uint
    }
} // sizeof = 32
```

There are a few tradeoffs here:

Pros:

  • reduces memory size from 64 bytes to 32 bytes, allowing us to pack two Decimals into a single cache line
  • removes the self-referential pointer, making the Decimal struct copyable again

Cons:

  • two heap allocations for large values instead of one (allocate big.Int then allocate large backing array)
  • need to inflate stack-allocated big.Int on demand during arithmetic if we wanted to continue to defer all arithmetic to big.Int

This second con leads to code that looks like:

```go
func (z *BigInt) And(x, y *BigInt) *BigInt {
	var tmp1, tmp2, tmp3 big.Int
	zi := z.inner(&tmp1)
	zi.And(x.inner(&tmp2), y.inner(&tmp3))
	z.updateInner(zi)
	return z
}
```

This approach compared favorably to the master implementation on benchmarks. For instance, the reduction in heap allocations (see PR description) is mostly the same between the two prototype approaches. For the most part, this still led to net speedups across the board compared to master.

However, even with the reduction in memory size, and on datasets large enough not to fit entirely in my CPU's cache, benchmarks did not favor this new approach over the approach taken by this PR. It appears that in this performance-sensitive code, even with everything inlined, the on-demand inflation and deflation of stack-allocated big.Int structs on every operation was too expensive. Here's the comparison between this PR (old) and the approach taken by the compact representation prototype (new):

```
name                 old time/op    new time/op    delta
GDA/base-10             113µs ± 1%     114µs ± 1%    +0.51%  (p=0.005 n=10+10)
GDA/plus-10            38.3µs ± 0%    41.9µs ± 0%    +9.30%  (p=0.000 n=9+10)
GDA/divideint-10       23.1µs ± 0%    25.7µs ± 1%   +11.03%  (p=0.000 n=10+10)
GDA/remainder-10       41.7µs ± 0%    48.2µs ± 0%   +15.64%  (p=0.000 n=9+9)
GDA/minus-10           8.95µs ± 0%   10.82µs ± 0%   +20.88%  (p=0.000 n=10+10)
GDA/subtract-10         112µs ± 0%     136µs ± 0%   +21.82%  (p=0.000 n=10+9)
GDA/exp-10              130ms ± 0%     158ms ± 0%   +22.03%  (p=0.000 n=10+10)
GDA/add-10              655µs ± 0%     800µs ± 0%   +22.12%  (p=0.000 n=10+10)
GDA/comparetotal-10    24.4µs ± 1%    30.1µs ± 1%   +23.33%  (p=0.000 n=10+10)
GDA/abs-10             7.11µs ± 0%    8.81µs ± 0%   +23.79%  (p=0.000 n=10+10)
GDA/tointegralx-10     22.6µs ± 0%    28.1µs ± 0%   +24.59%  (p=0.000 n=10+7)
GDA/tointegral-10      22.2µs ± 0%    27.7µs ± 0%   +24.75%  (p=0.000 n=10+9)
GDA/multiply-10        56.3µs ± 0%    71.8µs ± 0%   +27.62%  (p=0.000 n=10+8)
GDA/divide-10           377µs ± 0%     486µs ± 0%   +28.87%  (p=0.000 n=10+10)
GDA/randoms-10         2.29ms ± 0%    2.96ms ± 0%   +29.00%  (p=0.000 n=10+9)
GDA/reduce-10          14.2µs ± 0%    18.4µs ± 0%   +29.25%  (p=0.000 n=10+10)
GDA/log10-10            114ms ± 0%     148ms ± 0%   +29.61%  (p=0.000 n=10+9)
GDA/quantize-10        85.0µs ± 0%   110.1µs ± 0%   +29.62%  (p=0.000 n=10+10)
GDA/ln-10              90.1ms ± 0%   117.0ms ± 0%   +29.82%  (p=0.000 n=10+9)
GDA/compare-10         34.5µs ± 1%    44.9µs ± 1%   +30.18%  (p=0.000 n=10+10)
GDA/power-10            233ms ± 0%     304ms ± 0%   +30.39%  (p=0.000 n=9+10)
GDA/squareroot-10      28.7ms ± 0%    37.4ms ± 0%   +30.45%  (p=0.000 n=8+9)
GDA/rounding-10         492µs ± 0%     645µs ± 0%   +31.13%  (p=0.000 n=10+9)
GDA/cuberoot-apd-10    2.14ms ± 0%    2.81ms ± 0%   +31.14%  (p=0.000 n=10+9)
GDA/powersqrt-10        433ms ± 0%     571ms ± 0%   +31.96%  (p=0.000 n=10+10)
```

So I think we'll want to stick with this PR.

Member

@yuzefovich yuzefovich left a comment


Wow, impressive work and a very thorough analysis! I haven't looked into the integration into CRDB, but the change to the apd library makes sense to me.

Reviewed 2 of 2 files at r1, 1 of 1 files at r2, 2 of 2 files at r3, 11 of 11 files at r4, 8 of 8 files at r5, 1 of 1 files at r6, 4 of 4 files at r7, all commit messages.
Reviewable status: all files reviewed, 9 unresolved discussions (waiting on @jordanlewis and @nvanbenschoten)


bigint.go, line 29 at r4 (raw file):

// The zero value is ready to use.
// The value must not be copied after first use.
type BigInt struct {

What will happen if a new method is added to big.Int and we don't add the corresponding method to BigInt?


bigint.go, line 70 at r4 (raw file):

}

func (b *BigInt) inner() *big.Int {

nit: a quick comment describing when inner vs innerOrNil should be used would be helpful.


bigint.go, line 118 at r4 (raw file):

		// Before doing so, zero out the inline array, in case it had a value
		// previously. This is necessary in edge cases where _inner initially

I didn't get to the tests yet, but this sounds like a good regression test.


bigint.go, line 193 at r4 (raw file):

func (b *BigInt) Bits() []big.Word {
	// Don't expose direct access to the big.Int's word slice.
	panic("unimplemented")

nit: maybe also a regression test for this?


const.go, line 33 at r5 (raw file):

	decimalEight     = New(8, 0)

	decimalMax = New(math.MaxInt64, 0)

nit: maybe calling these decimalMaxInt64 and decimalMinInt64 would be more descriptive?


decimal.go, line 362 at r5 (raw file):

// them with this scaling, along with the scaling. An error can be produced
// if the resulting scale factor is out of range.
func upscale(a, b *Decimal, tmp *BigInt) (*BigInt, *BigInt, int32, error) {

nit: maybe mention what tmp should be and what it'll be used for.


gda_test.go, line 330 at r1 (raw file):

			b.ResetTimer()
			for i := 0; i < b.N; i++ {

Not your change, but I don't understand the order of these two for loops - currently for every benchmark iteration we ran many bench cases, but I would expect the order to be the opposite so that for a particular bench case we'd run as many iterations as needed for that bench case. Am I missing something?


gda_test.go, line 333 at r1 (raw file):

				for _, bc := range bcs {
					// Ignore errors here because the full tests catch them.
					var res Decimal

nit: not sure if it matters, but we could pull out the definition of res to be outside of the for loops.


table.go, line 91 at r5 (raw file):

	n := int64(float64(bl) / digitsToBitsRatio)
	var tpmE BigInt

nit: s/tpm/tmp/.

@yuzefovich
Member

yuzefovich commented Jan 5, 2022

I spent some time trying to address the TODO you mentioned above in order to make the integration easier, and, indeed, it is quite annoying to do.

Another difficulty (apart from decimals becoming the only data type that is operated on by reference) is that in some cases we do want the decimal to be stored by value (for example, this is the case for aggregate functions, for which we want to inline the decimal into the function itself).

I'm curious to hear @jordanlewis's thoughts about this PR, and if we all think it's worthwhile, I'll figure out how to address that TODO.

@nvanbenschoten
Member Author

nvanbenschoten commented Jan 5, 2022

Thanks for taking a look, Yahor. I agree with everything you said. The no-copy limitation appeared to make the integration into the vectorized execution engine difficult for a number of different reasons, mostly stemming from the need to operate on references instead of values. It looked possible, but would likely be a lot of work.

In light of your exploration, I took another look at the "More compact memory representation" alternative. This alternative approach is appealing because it does not use a self-referential pointer, so it can be copied by value safely (though Decimal.Set should still be used to avoid aliasing when that is a problem). I mentioned above that the repeat inflation and deflation of stack-allocated big.Int structs on every operation was too expensive. We could see this in the benchmark results when compared to this PR.

I explored what it would take to bypass big.Int for basic arithmetic on small values, similar to what the ericlagergren/decimal library does. As it turns out, the separation between apd.Decimal and apd.BigInt made this all fairly straightforward: apd.Decimal remains entirely oblivious to these optimizations, and apd.BigInt can continue to fall back to big.Int whenever the arithmetic gets hard. Furthermore, it also means that even the complex arithmetic at the Decimal level (e.g. Decimal.Sqrt) still benefits from fast-paths on simple arithmetic, because these complex Decimal operations are composed of many incremental operations on BigInts.

Here's the change to add these fast-paths to the compact memory representation approach: c5c0544. It's not a lot of code, and it benefits from the existing BigInt and Decimal test coverage.
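
For a flavor of what such a fast-path looks like, here is a sketch of 128-bit magnitude addition with an overflow check that decides whether to fall back to big.Int (this assumes 64-bit words and is not the code in c5c0544):

```go
package apdsketch

import "math/bits"

// addSmall adds two 128-bit magnitudes held as little-endian uint64 pairs and
// reports whether the result still fits in 128 bits. When ok is false, the
// caller falls back to big.Int arithmetic.
func addSmall(x, y [2]uint64) (sum [2]uint64, ok bool) {
	var carry uint64
	sum[0], carry = bits.Add64(x[0], y[0], 0)
	sum[1], carry = bits.Add64(x[1], y[1], carry)
	return sum, carry == 0
}
```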

With that commit, the comparison between the two approaches shifts in favor of the compact (and copyable) memory representation (old is this PR, new is c5c0544):

```
name                 old time/op    new time/op    delta
GDA/squareroot-10      29.6ms ± 0%    16.1ms ± 1%   -45.48%  (p=0.000 n=10+9)
GDA/powersqrt-10        447ms ± 0%     248ms ± 0%   -44.68%  (p=0.000 n=9+9)
GDA/rounding-10         509µs ± 1%     306µs ± 0%   -39.90%  (p=0.000 n=10+9)
GDA/divideint-10       23.9µs ± 1%    15.6µs ± 1%   -34.56%  (p=0.000 n=10+10)
GDA/divide-10           390µs ± 1%     268µs ± 1%   -31.27%  (p=0.000 n=10+10)
GDA/remainder-10       42.9µs ± 1%    32.2µs ± 1%   -24.94%  (p=0.000 n=10+9)
GDA/tointegral-10      22.4µs ± 0%    17.7µs ± 1%   -21.13%  (p=0.000 n=10+10)
GDA/tointegralx-10     22.7µs ± 1%    18.1µs ± 1%   -20.36%  (p=0.000 n=10+10)
GDA/randoms-10         2.35ms ± 0%    1.89ms ± 1%   -19.68%  (p=0.000 n=10+10)
GDA/compare-10         35.8µs ± 1%    28.9µs ± 1%   -19.17%  (p=0.000 n=10+9)
GDA/minus-10           9.23µs ± 1%    7.61µs ± 1%   -17.62%  (p=0.000 n=10+10)
GDA/abs-10             7.37µs ± 1%    6.10µs ± 1%   -17.24%  (p=0.000 n=8+9)
GDA/subtract-10         113µs ± 2%      95µs ± 1%   -15.61%  (p=0.000 n=10+10)
GDA/cuberoot-apd-10    2.22ms ± 1%    1.96ms ± 0%   -11.90%  (p=0.000 n=10+10)
GDA/exp-10              137ms ± 0%     124ms ± 0%    -9.05%  (p=0.000 n=10+10)
GDA/ln-10              93.6ms ± 1%    85.5ms ± 0%    -8.63%  (p=0.000 n=10+10)
GDA/quantize-10        86.5µs ± 0%    80.0µs ± 1%    -7.48%  (p=0.000 n=10+10)
GDA/log10-10            119ms ± 1%     110ms ± 0%    -7.13%  (p=0.000 n=10+10)
GDA/reduce-10          14.8µs ± 1%    13.8µs ± 1%    -6.75%  (p=0.000 n=10+10)
GDA/power-10            242ms ± 0%     228ms ± 0%    -5.89%  (p=0.000 n=8+10)
GDA/plus-10            37.6µs ± 1%    36.2µs ± 0%    -3.95%  (p=0.000 n=10+10)
GDA/multiply-10        56.9µs ± 1%    56.0µs ± 0%    -1.59%  (p=0.000 n=9+10)
GDA/add-10              603µs ± 1%     598µs ± 1%    -0.91%  (p=0.000 n=10+9)
GDA/comparetotal-10    25.2µs ± 1%    25.2µs ± 1%      ~     (p=0.912 n=10+10)
GDA/base-10             114µs ± 1%     115µs ± 1%    +1.04%  (p=0.000 n=10+10)
```

I'll look into packaging up that alternative approach and addressing the existing comments of yours that apply to it.

@yuzefovich
Member

Thanks Nathan, this looks very promising! I think I was able to make the switch from the value to the pointer in the vectorized engine (in cockroachdb/cockroach#74469), at least the unit tests seem to work, but I'll be happy if we end up not needing that change :)

@cockroach-teamcity
Member

This change is Reviewable

@nvanbenschoten nvanbenschoten changed the title from "apd: embed small coefficient values in Decimal struct" to "[DNM] apd: embed small coefficient values in Decimal struct" on Jan 5, 2022
Member Author

@nvanbenschoten nvanbenschoten left a comment


I opened #103 as an alternative to this PR. In that one, I've addressed each of your questions.

Reviewable status: all files reviewed, 5 unresolved discussions (waiting on @jordanlewis and @yuzefovich)


bigint.go, line 29 at r4 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

What will happen if a new method is added to big.Int and we don't add the corresponding method to BigInt?

(The answer is the same for either implementation.) We're not embedding the big.Int, so its methods won't be promoted to this type. If we need them, we can add them after the fact.


bigint.go, line 70 at r4 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: a quick comment describing when inner vs innerOrNil should be used would be helpful.

Done.


bigint.go, line 118 at r4 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

I didn't get to the tests yet, but this sounds like a good regression test.

Done.


bigint.go, line 193 at r4 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: maybe also a regression test for this?

Done.


const.go, line 33 at r5 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: maybe calling these decimalMaxInt64 and decimalMinInt64 would be more descriptive?

Good point, done.


decimal.go, line 362 at r5 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: maybe mention what tmp should be and what it'll be used for.

Done.


gda_test.go, line 330 at r1 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

Not your change, but I don't understand the order of these two for loops - currently for every benchmark iteration we ran many bench cases, but I would expect the order to be the opposite so that for a particular bench case we'd run as many iterations as needed for that bench case. Am I missing something?

I don't think it makes a difference in practice. Either way, we're running b.N iterations of each bench case.


gda_test.go, line 333 at r1 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: not sure if it matters, but we could pull out the definition of res to be outside of the for loops.

Good point. I pulled this into a vector of Decimal inputs and outputs to more realistically model how the vectorized execution engine will evaluate arithmetic. As it turns out, doing so caused the benchmarks to skew even further (by a few percent) in favor of the compact representation, likely because the improved cache locality highlighted the size difference (2 Decimals per cache line vs. 1).
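
A rough sketch of that vectorized shape (illustrative only, not the actual gda_test.go code; it assumes it lives in the apd package alongside the benchmarks): with inputs and results held in slices, two 32-byte Decimals share each 64-byte cache line, which is what made the size difference show up.

```go
package apd

import "testing"

// benchVectorizedAdd runs an Add over pre-populated input slices, writing into
// a slice of results, roughly the way the vectorized engine lays values out.
func benchVectorizedAdd(b *testing.B, ctx *Context, xs, ys []Decimal) {
	res := make([]Decimal, len(xs))
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		for j := range xs {
			_, _ = ctx.Add(&res[j], &xs[j], &ys[j]) // errors are caught by the full tests
		}
	}
}
```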


table.go, line 91 at r5 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: s/tpm/tmp/.

Done.

Member

@yuzefovich yuzefovich left a comment


Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @jordanlewis and @nvanbenschoten)


bigint.go, line 29 at r4 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

(answer the same for either implementation) We're not embedding the big.Int, so the methods won't be exported by this type. If we need them, we can add them after the fact.

Indeed, good point.


gda_test.go, line 330 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

I don't think it makes a difference in practice. Either way, we're running b.N iterations of each bench case.

I thought that the value of b.N changes for each bench case depending on the variability and length of the benchmark runs with the same name. Is that not the case?

Member Author

@nvanbenschoten nvanbenschoten left a comment


Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @jordanlewis and @yuzefovich)


gda_test.go, line 330 at r1 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

I thought that the value of b.N changes for each bench case depending on the variability and length of the benchmark runs with the same name. Is that not the case?

Yeah, it's kind of confusing how this works. Each of these "GDA" files contains a series of roughly 1000 operations of varying complexity and we run the entire suite of operations in a given file per benchmark iteration. So at the file level, we have sub-benchmarks (see b.Run above), but within a test, we loop over all operations (benchCases) for each benchmark iteration.

I do think this is what we want though because it leads to stable results even with individual operations that vary wildly in cost.

Member

@yuzefovich yuzefovich left a comment


Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @jordanlewis and @nvanbenschoten)


gda_test.go, line 330 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Yeah, it's kind of confusing how this works. Each of these "GDA" files contains a series of roughly 1000 operations of varying complexity and we run the entire suite of operations in a given file per benchmark iteration. So at the file level, we have sub-benchmarks (see b.Run above), but within a test, we loop over all operations (benchCases) for each benchmark iteration.

I do think this is what we want though because it leads to stable results even with individual operations that vary wildly in cost.

I see, thanks for explaining.

@nvanbenschoten nvanbenschoten deleted the nvanbenschoten/bigInt branch January 7, 2022 16:32
craig bot pushed a commit to cockroachdb/cockroach that referenced this pull request Jan 11, 2022
74590: colexec: integrate flat, compact decimal datums r=nvanbenschoten a=nvanbenschoten

Replaces #74369 and #57593.

This PR picks up the following changes to `cockroachdb/apd`:
- cockroachdb/apd#103
- cockroachdb/apd#104
- cockroachdb/apd#107
- cockroachdb/apd#108
- cockroachdb/apd#109
- cockroachdb/apd#110
- cockroachdb/apd#111

Release note (performance improvement): The memory representation of DECIMAL datums has been optimized to save space, avoid heap allocations, and eliminate indirection. This increases the speed of DECIMAL arithmetic and aggregation by up to 20% on large data sets.

----

At a high level, those changes implement the "compact memory representation" for Decimals described in cockroachdb/apd#102 (comment) and later implemented in cockroachdb/apd#103.

Compared to the approach on master, the approach in cockroachdb/apd#103 is a) faster, b) avoids indirection + heap allocation, c) smaller.

Compared to the alternate approach in cockroachdb/apd#102, the approach in cockroachdb/apd#103 is a) [faster for most operations](cockroachdb/apd#102 (comment)), b) more usable because values can be safely copied, c) half the memory size (32 bytes per `Decimal`, vs. 64). 

The memory representation of the Decimal struct in this approach looks like:
```go
type Decimal struct {
    Form     int8
    Negative bool
    Exponent int32
    Coeff    BigInt {
        _inner  *big.Int // nil when value fits in _inline
        _inline [2]uint
    }
} // sizeof = 32
```

With a two-word inline array, any value that would fit in a 128-bit integer (i.e. decimals with a scale-adjusted absolute value up to 2^128 - 1) fits in `_inline`. The indirection through `_inner` is only used for values larger than this.

Before this change, the memory representation of the `Decimal` struct looked like:
```go
type Decimal struct {
    Form     int64
    Negative bool
    Exponent int32
    Coeff    big.Int {
        neg bool
        abs []big.Word {
            data uintptr ---------------. 
            len  int64                  v
            cap  int64         [uint, uint, ...] // sizeof = variable, but around cap = 4, so 32 bytes
        }
    }
} // sizeof = 48 flat bytes + variable-length heap allocated array
```

----

## Performance impact

### Speedup on TPC-DS dataset

The TPC-DS dataset is full of decimal columns, so it's a good playground to test this change. Unfortunately, the variance in the runtime performance of the TPC-DS queries themselves is high (many queries varied by 30-40% per attempt), so it was hard to get signal out of them. Instead, I imported the TPC-DS dataset with a scale factor of 10 and ran some custom aggregation queries against the largest table (`web_sales`, row count = 7,197,566):

Queries
```sql
# q1
select sum(ws_wholesale_cost + ws_ext_list_price) from web_sales;

# q2
select sum(2 * ws_wholesale_cost + ws_ext_list_price) - max(4 * ws_ext_ship_cost), min(ws_net_profit) from web_sales;

# q3
select max(ws_bill_customer_sk + ws_bill_cdemo_sk + ws_bill_hdemo_sk + ws_bill_addr_sk + ws_ship_customer_sk + ws_ship_cdemo_sk + ws_ship_hdemo_sk + ws_ship_addr_sk + ws_web_page_sk + ws_web_site_sk + ws_ship_mode_sk + ws_warehouse_sk + ws_promo_sk + ws_order_number + ws_quantity + ws_wholesale_cost + ws_list_price + ws_sales_price + ws_ext_discount_amt + ws_ext_sales_price + ws_ext_wholesale_cost + ws_ext_list_price + ws_ext_tax + ws_coupon_amt + ws_ext_ship_cost + ws_net_paid + ws_net_paid_inc_tax + ws_net_paid_inc_ship + ws_net_paid_inc_ship_tax + ws_net_profit) from web_sales;
```

Here's the difference in runtime of these three queries before and after this change on an `n2-standard-4` instance:
```
name              old s/op   new s/op   delta
TPC-DS/custom/q1  7.21 ± 3%  6.59 ± 0%   -8.57%  (p=0.000 n=10+10)
TPC-DS/custom/q2  10.2 ± 0%   9.7 ± 3%   -5.42%  (p=0.000 n=10+10)
TPC-DS/custom/q3  21.9 ± 1%  17.3 ± 0%  -21.13%  (p=0.000 n=10+10)
```

### Heap allocation reduction in TPC-DS

Part of the reason for this speedup is that the change significantly reduces heap allocations, because most decimal values are now stored inline. We can see this in q3 from above. Before the change, a heap profile looks like:

<img width="1751" alt="Screen Shot 2022-01-07 at 7 12 49 PM" src="https://user-images.githubusercontent.com/5438456/148625159-9ceb470a-0742-4f75-a533-530d9944143c.png">

After the change, a heap profile looks like:

<img width="1749" alt="Screen Shot 2022-01-07 at 7 17 32 PM" src="https://user-images.githubusercontent.com/5438456/148625174-629f4b47-07cc-4ef6-8723-2e556f7fc00d.png">

_(the dominant source of heap allocations is now `coldata.(*Nulls).Or`. #74592 should help here)_

### Heap allocation reduction in TPC-E

On the read-only portion of the TPC-E (77% of the full workload, in terms of txn mix), this change has a significant impact on total heap allocations. Before the change, `math/big.nat.make` was responsible for **51.07%** of total heap allocations:

<img width="1587" alt="Screen Shot 2021-12-31 at 8 01 00 PM" src="https://user-images.githubusercontent.com/5438456/147842722-965d649d-b29a-4f66-aa07-1b05e52e97af.png">

After the change, `math/big.nat.make` is responsible for only **1.1%** of total heap allocations:

<img width="1580" alt="Screen Shot 2021-12-31 at 9 04 24 PM" src="https://user-images.githubusercontent.com/5438456/147842727-a881a5a3-d038-48bb-bd44-4ade665afe73.png">

That equates to roughly a **50%** reduction in heap allocations.

### Microbenchmarks

```
name                                                                   old time/op    new time/op     delta
Aggregator/MIN/ordered/decimal/groupSize=1/numInputRows=1024-10          65.6µs ± 2%     42.5µs ± 0%  -35.15%  (p=0.000 n=9+8)
Aggregator/MIN/ordered/decimal/groupSize=2/numInputRows=1024-10          68.4µs ± 1%     48.4µs ± 1%  -29.20%  (p=0.000 n=9+10)
Aggregator/MIN/ordered/decimal/groupSize=1/numInputRows=32768-10         1.65ms ± 1%     1.20ms ± 1%  -27.31%  (p=0.000 n=9+10)
Aggregator/MIN/ordered/decimal/groupSize=1/numInputRows=1048576-10       51.4ms ± 1%     38.3ms ± 1%  -25.59%  (p=0.000 n=10+10)
Aggregator/MIN/ordered/decimal/groupSize=1/numInputRows=32-10            12.5µs ± 1%      9.4µs ± 2%  -24.72%  (p=0.000 n=9+10)
Aggregator/MIN/ordered/decimal/groupSize=2/numInputRows=32-10            12.5µs ± 1%      9.6µs ± 2%  -23.24%  (p=0.000 n=8+10)
Aggregator/MIN/ordered/decimal/groupSize=1/numInputRows=1-10             10.5µs ± 1%      8.0µs ± 1%  -23.22%  (p=0.000 n=9+9)
Aggregator/MIN/ordered/decimal/groupSize=32/numInputRows=32-10           12.4µs ± 1%      9.6µs ± 1%  -22.70%  (p=0.000 n=8+10)
Aggregator/MIN/ordered/decimal/groupSize=1024/numInputRows=1024-10       60.5µs ± 1%     47.1µs ± 2%  -22.24%  (p=0.000 n=10+10)
Aggregator/MIN/ordered/decimal/groupSize=128/numInputRows=1024-10        61.2µs ± 1%     47.7µs ± 1%  -22.09%  (p=0.000 n=9+10)
Aggregator/MIN/ordered/decimal/groupSize=32/numInputRows=1024-10         62.3µs ± 1%     48.7µs ± 2%  -21.91%  (p=0.000 n=9+10)
Aggregator/MIN/ordered/decimal/groupSize=2/numInputRows=32768-10         1.31ms ± 0%     1.03ms ± 1%  -21.53%  (p=0.000 n=9+10)
Aggregator/MIN/hash/decimal/groupSize=1024/numInputRows=1024-10          82.3µs ± 1%     64.9µs ± 1%  -21.12%  (p=0.000 n=10+10)
Aggregator/MIN/hash/decimal/groupSize=128/numInputRows=1024-10           86.6µs ± 1%     68.5µs ± 1%  -20.93%  (p=0.000 n=10+10)
Aggregator/MIN/hash/decimal/groupSize=32/numInputRows=1024-10            96.0µs ± 1%     77.1µs ± 1%  -19.73%  (p=0.000 n=10+10)
Aggregator/MIN/ordered/decimal/groupSize=2/numInputRows=1048576-10       41.2ms ± 0%     33.1ms ± 0%  -19.64%  (p=0.000 n=8+10)
Aggregator/MIN/hash/decimal/groupSize=32/numInputRows=32-10              17.5µs ± 1%     14.3µs ± 2%  -18.59%  (p=0.000 n=9+10)
Aggregator/MIN/hash/decimal/groupSize=1/numInputRows=1-10                14.8µs ± 3%     12.1µs ± 3%  -18.26%  (p=0.000 n=10+10)
Aggregator/MIN/hash/decimal/groupSize=2/numInputRows=32-10               20.0µs ± 1%     16.4µs ± 1%  -18.04%  (p=0.000 n=9+9)
Aggregator/MIN/hash/decimal/groupSize=1/numInputRows=32-10               20.9µs ± 1%     17.2µs ± 3%  -17.80%  (p=0.000 n=9+10)
Aggregator/MIN/ordered/decimal/groupSize=1024/numInputRows=32768-10       884µs ± 0%      731µs ± 0%  -17.30%  (p=0.000 n=10+9)
Aggregator/MIN/ordered/decimal/groupSize=1024/numInputRows=1048576-10    27.9ms ± 0%     23.1ms ± 0%  -17.27%  (p=0.000 n=9+9)
Aggregator/MIN/hash/decimal/groupSize=1/numInputRows=1024-10              218µs ± 2%      181µs ± 2%  -17.23%  (p=0.000 n=10+10)
Aggregator/MIN/ordered/decimal/groupSize=128/numInputRows=32768-10        911µs ± 1%      755µs ± 1%  -17.10%  (p=0.000 n=10+10)
Aggregator/MIN/ordered/decimal/groupSize=32/numInputRows=32768-10         957µs ± 1%      798µs ± 0%  -16.66%  (p=0.000 n=9+9)
Aggregator/MIN/hash/decimal/groupSize=1024/numInputRows=32768-10         1.54ms ± 1%     1.29ms ± 1%  -16.56%  (p=0.000 n=10+10)
Aggregator/MIN/hash/decimal/groupSize=2/numInputRows=1024-10              188µs ± 1%      157µs ± 2%  -16.33%  (p=0.000 n=10+10)
Aggregator/MIN/ordered/decimal/groupSize=128/numInputRows=1048576-10     28.8ms ± 0%     24.1ms ± 0%  -16.14%  (p=0.000 n=9+9)
Aggregator/MIN/ordered/decimal/groupSize=32/numInputRows=1048576-10      30.4ms ± 0%     25.7ms ± 1%  -15.26%  (p=0.000 n=10+10)
Aggregator/MIN/hash/decimal/groupSize=32/numInputRows=1048576-10          135ms ± 1%      114ms ± 1%  -15.21%  (p=0.000 n=10+9)
Aggregator/MIN/hash/decimal/groupSize=128/numInputRows=32768-10          1.79ms ± 1%     1.52ms ± 1%  -15.14%  (p=0.000 n=10+10)
Aggregator/MIN/hash/decimal/groupSize=1/numInputRows=32768-10            6.29ms ± 1%     5.50ms ± 1%  -12.62%  (p=0.000 n=10+10)
Aggregator/MIN/hash/decimal/groupSize=1024/numInputRows=1048576-10       62.2ms ± 0%     54.7ms ± 0%  -12.08%  (p=0.000 n=10+10)
Aggregator/MIN/hash/decimal/groupSize=32/numInputRows=32768-10           2.46ms ± 1%     2.17ms ± 1%  -11.88%  (p=0.000 n=10+9)
Aggregator/MIN/hash/decimal/groupSize=2/numInputRows=32768-10            5.64ms ± 0%     4.98ms ± 0%  -11.76%  (p=0.000 n=9+10)
Aggregator/MIN/hash/decimal/groupSize=1/numInputRows=1048576-10           354ms ± 2%      318ms ± 1%  -10.18%  (p=0.000 n=10+8)
Aggregator/MIN/hash/decimal/groupSize=128/numInputRows=1048576-10        91.8ms ± 1%     83.3ms ± 0%   -9.25%  (p=0.000 n=9+10)
Aggregator/MIN/hash/decimal/groupSize=2/numInputRows=1048576-10           396ms ± 1%      369ms ± 1%   -6.83%  (p=0.000 n=8+8)

name                                                                   old speed      new speed       delta
Aggregator/MIN/ordered/decimal/groupSize=1/numInputRows=1024-10         125MB/s ± 2%    193MB/s ± 0%  +54.20%  (p=0.000 n=9+8)
Aggregator/MIN/ordered/decimal/groupSize=2/numInputRows=1024-10         120MB/s ± 1%    169MB/s ± 1%  +41.24%  (p=0.000 n=9+10)
Aggregator/MIN/ordered/decimal/groupSize=1/numInputRows=32768-10        159MB/s ± 1%    219MB/s ± 1%  +37.57%  (p=0.000 n=9+10)
Aggregator/MIN/ordered/decimal/groupSize=1/numInputRows=1048576-10      163MB/s ± 1%    219MB/s ± 1%  +34.39%  (p=0.000 n=10+10)
Aggregator/MIN/ordered/decimal/groupSize=1/numInputRows=32-10          20.4MB/s ± 1%   27.2MB/s ± 2%  +32.85%  (p=0.000 n=9+10)
Aggregator/MIN/ordered/decimal/groupSize=1/numInputRows=1-10            764kB/s ± 2%    997kB/s ± 1%  +30.45%  (p=0.000 n=10+9)
Aggregator/MIN/ordered/decimal/groupSize=2/numInputRows=32-10          20.5MB/s ± 1%   26.8MB/s ± 2%  +30.28%  (p=0.000 n=8+10)
Aggregator/MIN/ordered/decimal/groupSize=32/numInputRows=32-10         20.7MB/s ± 1%   26.8MB/s ± 1%  +29.37%  (p=0.000 n=8+10)
Aggregator/MIN/ordered/decimal/groupSize=1024/numInputRows=1024-10      135MB/s ± 1%    174MB/s ± 2%  +28.61%  (p=0.000 n=10+10)
Aggregator/MIN/ordered/decimal/groupSize=128/numInputRows=1024-10       134MB/s ± 1%    172MB/s ± 1%  +28.35%  (p=0.000 n=9+10)
Aggregator/MIN/ordered/decimal/groupSize=32/numInputRows=1024-10        131MB/s ± 1%    168MB/s ± 2%  +28.06%  (p=0.000 n=9+10)
Aggregator/MIN/ordered/decimal/groupSize=2/numInputRows=32768-10        200MB/s ± 0%    255MB/s ± 1%  +27.45%  (p=0.000 n=9+10)
Aggregator/MIN/hash/decimal/groupSize=1024/numInputRows=1024-10         100MB/s ± 1%    126MB/s ± 1%  +26.78%  (p=0.000 n=10+10)
Aggregator/MIN/hash/decimal/groupSize=128/numInputRows=1024-10         94.6MB/s ± 1%  119.6MB/s ± 1%  +26.47%  (p=0.000 n=10+10)
Aggregator/MIN/hash/decimal/groupSize=32/numInputRows=1024-10          85.3MB/s ± 1%  106.3MB/s ± 1%  +24.58%  (p=0.000 n=10+10)
Aggregator/MIN/ordered/decimal/groupSize=2/numInputRows=1048576-10      204MB/s ± 0%    254MB/s ± 0%  +24.44%  (p=0.000 n=8+10)
Aggregator/MIN/hash/decimal/groupSize=32/numInputRows=32-10            14.6MB/s ± 1%   18.0MB/s ± 2%  +22.83%  (p=0.000 n=9+10)
Aggregator/MIN/hash/decimal/groupSize=1/numInputRows=1-10               544kB/s ± 3%    664kB/s ± 2%  +22.06%  (p=0.000 n=10+10)
Aggregator/MIN/hash/decimal/groupSize=2/numInputRows=32-10             12.8MB/s ± 1%   15.6MB/s ± 1%  +22.02%  (p=0.000 n=9+9)
Aggregator/MIN/hash/decimal/groupSize=1/numInputRows=32-10             12.3MB/s ± 1%   14.9MB/s ± 3%  +21.67%  (p=0.000 n=9+10)
Aggregator/MIN/ordered/decimal/groupSize=1024/numInputRows=32768-10     296MB/s ± 0%    358MB/s ± 0%  +20.92%  (p=0.000 n=10+9)
Aggregator/MIN/ordered/decimal/groupSize=1024/numInputRows=1048576-10   300MB/s ± 0%    363MB/s ± 0%  +20.87%  (p=0.000 n=9+9)
Aggregator/MIN/hash/decimal/groupSize=1/numInputRows=1024-10           37.5MB/s ± 2%   45.4MB/s ± 2%  +20.82%  (p=0.000 n=10+10)
Aggregator/MIN/ordered/decimal/groupSize=128/numInputRows=32768-10      288MB/s ± 1%    347MB/s ± 1%  +20.62%  (p=0.000 n=10+10)
Aggregator/MIN/ordered/decimal/groupSize=32/numInputRows=32768-10       274MB/s ± 1%    329MB/s ± 0%  +19.99%  (p=0.000 n=9+9)
Aggregator/MIN/hash/decimal/groupSize=1024/numInputRows=32768-10        170MB/s ± 1%    204MB/s ± 1%  +19.85%  (p=0.000 n=10+10)
Aggregator/MIN/hash/decimal/groupSize=2/numInputRows=1024-10           43.6MB/s ± 1%   52.1MB/s ± 2%  +19.52%  (p=0.000 n=10+10)
Aggregator/MIN/ordered/decimal/groupSize=128/numInputRows=1048576-10    292MB/s ± 0%    348MB/s ± 0%  +19.25%  (p=0.000 n=9+9)
Aggregator/MIN/ordered/decimal/groupSize=32/numInputRows=1048576-10     276MB/s ± 0%    326MB/s ± 1%  +18.00%  (p=0.000 n=10+10)
Aggregator/MIN/hash/decimal/groupSize=32/numInputRows=1048576-10       62.1MB/s ± 1%   73.3MB/s ± 1%  +17.94%  (p=0.000 n=10+9)
Aggregator/MIN/hash/decimal/groupSize=128/numInputRows=32768-10         147MB/s ± 1%    173MB/s ± 1%  +17.83%  (p=0.000 n=10+10)
Aggregator/MIN/hash/decimal/groupSize=1/numInputRows=32768-10          41.7MB/s ± 1%   47.7MB/s ± 1%  +14.44%  (p=0.000 n=10+10)
Aggregator/MIN/hash/decimal/groupSize=1024/numInputRows=1048576-10      135MB/s ± 0%    153MB/s ± 0%  +13.74%  (p=0.000 n=10+10)
Aggregator/MIN/hash/decimal/groupSize=32/numInputRows=32768-10          106MB/s ± 1%    121MB/s ± 1%  +13.48%  (p=0.000 n=10+9)
Aggregator/MIN/hash/decimal/groupSize=2/numInputRows=32768-10          46.5MB/s ± 0%   52.7MB/s ± 0%  +13.34%  (p=0.000 n=9+10)
Aggregator/MIN/hash/decimal/groupSize=1/numInputRows=1048576-10        23.7MB/s ± 2%   26.3MB/s ± 2%  +11.02%  (p=0.000 n=10+9)
Aggregator/MIN/hash/decimal/groupSize=128/numInputRows=1048576-10      91.3MB/s ± 0%  100.7MB/s ± 0%  +10.27%  (p=0.000 n=8+10)
Aggregator/MIN/hash/decimal/groupSize=2/numInputRows=1048576-10        21.2MB/s ± 1%   22.7MB/s ± 1%   +7.32%  (p=0.000 n=8+8)

name                                                                   old alloc/op   new alloc/op    delta
Aggregator/MIN/ordered/decimal/groupSize=1/numInputRows=32768-10          354kB ± 0%      239kB ± 0%  -32.39%  (p=0.000 n=9+9)
Aggregator/MIN/ordered/decimal/groupSize=2/numInputRows=32768-10          348kB ± 0%      239kB ± 0%  -31.23%  (p=0.000 n=10+10)
Aggregator/MIN/ordered/decimal/groupSize=1/numInputRows=1024-10           251kB ± 0%      177kB ± 0%  -29.44%  (p=0.000 n=10+10)
Aggregator/MIN/ordered/decimal/groupSize=2/numInputRows=1024-10           246kB ± 0%      177kB ± 0%  -28.28%  (p=0.000 n=10+10)
Aggregator/MIN/ordered/decimal/groupSize=32/numInputRows=32768-10         275kB ± 0%      198kB ± 0%  -28.06%  (p=0.000 n=10+10)
Aggregator/MIN/ordered/decimal/groupSize=32/numInputRows=1024-10          243kB ± 0%      177kB ± 0%  -27.15%  (p=0.000 n=9+10)
Aggregator/MIN/ordered/decimal/groupSize=128/numInputRows=1024-10         242kB ± 0%      177kB ± 0%  -27.09%  (p=0.000 n=10+10)
Aggregator/MIN/ordered/decimal/groupSize=1024/numInputRows=1024-10        242kB ± 0%      177kB ± 0%  -27.06%  (p=0.000 n=10+10)
Aggregator/MIN/ordered/decimal/groupSize=128/numInputRows=32768-10        268kB ± 0%      198kB ± 0%  -26.05%  (p=0.000 n=10+10)
Aggregator/MIN/ordered/decimal/groupSize=1024/numInputRows=32768-10       264kB ± 0%      198kB ± 0%  -25.04%  (p=0.000 n=10+10)
Aggregator/MIN/ordered/decimal/groupSize=1/numInputRows=32-10            75.1kB ± 0%     56.9kB ± 0%  -24.25%  (p=0.000 n=10+10)
Aggregator/MIN/ordered/decimal/groupSize=2/numInputRows=32-10            74.9kB ± 0%     56.9kB ± 0%  -24.12%  (p=0.000 n=10+10)
Aggregator/MIN/ordered/decimal/groupSize=32/numInputRows=32-10           74.8kB ± 0%     56.9kB ± 0%  -23.99%  (p=0.000 n=10+10)
Aggregator/MIN/ordered/decimal/groupSize=1/numInputRows=1-10             69.6kB ± 0%     53.1kB ± 0%  -23.66%  (p=0.000 n=10+10)
Aggregator/MIN/hash/decimal/groupSize=1/numInputRows=1-10                95.2kB ± 0%     75.9kB ± 0%  -20.23%  (p=0.000 n=10+10)
Aggregator/MIN/hash/decimal/groupSize=2/numInputRows=32-10                102kB ± 0%       82kB ± 0%  -20.04%  (p=0.000 n=8+10)
Aggregator/MIN/hash/decimal/groupSize=1/numInputRows=32-10                103kB ± 0%       83kB ± 0%  -19.95%  (p=0.000 n=7+10)
Aggregator/MIN/hash/decimal/groupSize=32/numInputRows=32-10               100kB ± 0%       80kB ± 0%  -19.90%  (p=0.000 n=10+10)
Aggregator/MIN/ordered/decimal/groupSize=32/numInputRows=1048576-10      1.14MB ± 0%     0.92MB ± 0%  -18.80%  (p=0.000 n=10+10)
Aggregator/MIN/hash/decimal/groupSize=1024/numInputRows=1024-10           271kB ± 0%      227kB ± 0%  -16.16%  (p=0.000 n=9+9)
Aggregator/MIN/ordered/decimal/groupSize=2/numInputRows=1048576-10       1.10MB ± 0%     0.92MB ± 0%  -15.92%  (p=0.000 n=10+10)
Aggregator/MIN/hash/decimal/groupSize=128/numInputRows=1024-10            280kB ± 1%      235kB ± 1%  -15.91%  (p=0.000 n=10+10)
Aggregator/MIN/ordered/decimal/groupSize=128/numInputRows=1048576-10     1.09MB ± 1%     0.92MB ± 0%  -15.67%  (p=0.000 n=10+10)
Aggregator/MIN/hash/decimal/groupSize=32/numInputRows=1024-10             291kB ± 0%      245kB ± 1%  -15.53%  (p=0.000 n=9+10)
Aggregator/MIN/hash/decimal/groupSize=1024/numInputRows=32768-10         1.11MB ± 0%     0.95MB ± 0%  -15.14%  (p=0.000 n=8+10)
Aggregator/MIN/hash/decimal/groupSize=128/numInputRows=32768-10          1.22MB ± 0%     1.04MB ± 0%  -14.77%  (p=0.000 n=10+10)
Aggregator/MIN/hash/decimal/groupSize=32/numInputRows=32768-10           1.65MB ± 0%     1.42MB ± 0%  -13.56%  (p=0.000 n=10+10)
Aggregator/MIN/hash/decimal/groupSize=1/numInputRows=1024-10              593kB ± 0%      513kB ± 0%  -13.36%  (p=0.000 n=9+8)
Aggregator/MIN/hash/decimal/groupSize=2/numInputRows=1024-10              520kB ± 0%      454kB ± 0%  -12.82%  (p=0.000 n=9+8)
Aggregator/MIN/ordered/decimal/groupSize=1/numInputRows=1048576-10       1.04MB ± 0%     0.92MB ± 0%  -11.06%  (p=0.000 n=10+10)
Aggregator/MIN/hash/decimal/groupSize=1024/numInputRows=1048576-10       2.48MB ± 0%     2.25MB ± 0%   -9.32%  (p=0.000 n=9+10)
Aggregator/MIN/ordered/decimal/groupSize=1024/numInputRows=1048576-10     967kB ± 0%      881kB ± 0%   -8.89%  (p=0.000 n=10+10)
Aggregator/MIN/hash/decimal/groupSize=128/numInputRows=1048576-10        7.86MB ± 0%     7.36MB ± 0%   -6.44%  (p=0.000 n=9+10)
Aggregator/MIN/hash/decimal/groupSize=1/numInputRows=32768-10            14.2MB ± 1%     13.4MB ± 1%   -5.83%  (p=0.000 n=10+10)
Aggregator/MIN/hash/decimal/groupSize=2/numInputRows=32768-10            12.3MB ± 0%     11.7MB ± 0%   -5.03%  (p=0.001 n=7+7)
Aggregator/MIN/hash/decimal/groupSize=32/numInputRows=1048576-10         27.2MB ± 1%     25.9MB ± 1%   -4.84%  (p=0.000 n=10+10)
Aggregator/MIN/hash/decimal/groupSize=1/numInputRows=1048576-10           465MB ± 0%      445MB ± 0%   -4.32%  (p=0.000 n=10+10)
Aggregator/MIN/hash/decimal/groupSize=2/numInputRows=1048576-10           403MB ± 0%      390MB ± 0%   -3.44%  (p=0.000 n=10+10)

name                                                                   old allocs/op  new allocs/op   delta
Aggregator/MIN/ordered/decimal/groupSize=1/numInputRows=1024-10           1.07k ± 0%      0.05k ± 0%  -95.70%  (p=0.000 n=10+10)
Aggregator/MIN/hash/decimal/groupSize=1/numInputRows=1048576-10            702k ± 0%        32k ± 0%  -95.46%  (p=0.000 n=10+10)
Aggregator/MIN/hash/decimal/groupSize=2/numInputRows=1048576-10            489k ± 0%        28k ± 0%  -94.33%  (p=0.000 n=10+10)
Aggregator/MIN/ordered/decimal/groupSize=1/numInputRows=32768-10          4.40k ± 0%      0.30k ± 0%  -93.15%  (p=0.000 n=10+10)
Aggregator/MIN/hash/decimal/groupSize=1024/numInputRows=1024-10           1.11k ± 0%      0.09k ± 0%  -92.02%  (p=0.000 n=9+10)
Aggregator/MIN/ordered/decimal/groupSize=2/numInputRows=1024-10             561 ± 0%         46 ± 0%  -91.80%  (p=0.000 n=10+10)
Aggregator/MIN/ordered/decimal/groupSize=2/numInputRows=32768-10          3.45k ± 0%      0.30k ± 0%  -91.28%  (p=0.000 n=10+10)
Aggregator/MIN/hash/decimal/groupSize=128/numInputRows=1024-10            1.19k ± 0%      0.15k ± 1%  -87.31%  (p=0.000 n=10+10)
Aggregator/MIN/hash/decimal/groupSize=1024/numInputRows=32768-10          4.87k ± 0%      0.70k ± 0%  -85.69%  (p=0.000 n=9+10)
Aggregator/MIN/hash/decimal/groupSize=1/numInputRows=32768-10             32.2k ± 0%       6.3k ± 0%  -80.40%  (p=0.000 n=10+10)
Aggregator/MIN/ordered/decimal/groupSize=32/numInputRows=32768-10         1.45k ± 3%      0.29k ± 0%  -79.66%  (p=0.000 n=10+10)
Aggregator/MIN/hash/decimal/groupSize=32/numInputRows=1024-10             1.39k ± 0%      0.30k ± 1%  -78.64%  (p=0.000 n=10+10)
Aggregator/MIN/hash/decimal/groupSize=2/numInputRows=32768-10             26.2k ± 0%       6.8k ± 1%  -73.95%  (p=0.000 n=10+10)
Aggregator/MIN/hash/decimal/groupSize=128/numInputRows=32768-10           6.64k ± 0%      1.95k ± 0%  -70.67%  (p=0.000 n=10+10)
Aggregator/MIN/hash/decimal/groupSize=1/numInputRows=1024-10              3.44k ± 1%      1.12k ± 1%  -67.48%  (p=0.000 n=10+10)
Aggregator/MIN/hash/decimal/groupSize=32/numInputRows=1048576-10          62.4k ± 0%      20.4k ± 0%  -67.32%  (p=0.000 n=10+10)
Aggregator/MIN/hash/decimal/groupSize=2/numInputRows=1024-10              2.95k ± 1%      1.05k ± 1%  -64.52%  (p=0.000 n=9+10)
Aggregator/MIN/hash/decimal/groupSize=32/numInputRows=32768-10            10.8k ± 0%       4.5k ± 0%  -58.21%  (p=0.000 n=10+10)
Aggregator/MIN/ordered/decimal/groupSize=128/numInputRows=32768-10          628 ± 3%        294 ± 0%  -53.21%  (p=0.000 n=10+10)
Aggregator/MIN/hash/decimal/groupSize=128/numInputRows=1048576-10         36.1k ± 0%      20.2k ± 0%  -44.06%  (p=0.000 n=9+10)
Aggregator/MIN/ordered/decimal/groupSize=32/numInputRows=1024-10           81.7 ± 3%       46.0 ± 0%  -43.67%  (p=0.000 n=9+10)
Aggregator/MIN/ordered/decimal/groupSize=32/numInputRows=1048576-10       14.4k ± 1%       8.2k ± 0%  -42.97%  (p=0.000 n=10+10)
Aggregator/MIN/ordered/decimal/groupSize=1/numInputRows=32-10              79.0 ± 0%       46.0 ± 0%  -41.77%  (p=0.000 n=10+10)
Aggregator/MIN/ordered/decimal/groupSize=2/numInputRows=1048576-10        13.7k ± 1%       8.2k ± 0%  -40.05%  (p=0.000 n=10+10)
Aggregator/MIN/hash/decimal/groupSize=1/numInputRows=32-10                  191 ± 1%        120 ± 1%  -37.52%  (p=0.000 n=7+10)
Aggregator/MIN/ordered/decimal/groupSize=128/numInputRows=1048576-10      12.9k ± 2%       8.2k ± 0%  -36.17%  (p=0.000 n=10+10)
Aggregator/MIN/hash/decimal/groupSize=2/numInputRows=32-10                  176 ± 2%        115 ± 1%  -34.33%  (p=0.000 n=10+9)
Aggregator/MIN/ordered/decimal/groupSize=1/numInputRows=1048576-10        12.3k ± 0%       8.2k ± 0%  -33.21%  (p=0.000 n=10+10)
Aggregator/MIN/hash/decimal/groupSize=1024/numInputRows=1048576-10        21.8k ± 0%      15.2k ± 0%  -30.13%  (p=0.000 n=10+10)
Aggregator/MIN/hash/decimal/groupSize=32/numInputRows=32-10                 118 ± 0%         84 ± 0%  -28.81%  (p=0.000 n=10+10)
Aggregator/MIN/ordered/decimal/groupSize=2/numInputRows=32-10              63.0 ± 0%       46.0 ± 0%  -26.98%  (p=0.000 n=10+10)
Aggregator/MIN/ordered/decimal/groupSize=128/numInputRows=1024-10          57.2 ±14%       46.0 ± 0%  -19.58%  (p=0.000 n=10+10)
Aggregator/MIN/ordered/decimal/groupSize=1024/numInputRows=1048576-10     9.69k ± 1%      8.23k ± 0%  -15.07%  (p=0.000 n=10+10)
Aggregator/MIN/ordered/decimal/groupSize=1024/numInputRows=32768-10         340 ± 2%        294 ± 0%  -13.43%  (p=0.000 n=10+10)
Aggregator/MIN/ordered/decimal/groupSize=1/numInputRows=1-10               48.0 ± 0%       46.0 ± 0%   -4.17%  (p=0.000 n=10+10)
Aggregator/MIN/ordered/decimal/groupSize=32/numInputRows=32-10             48.0 ± 0%       46.0 ± 0%   -4.17%  (p=0.000 n=10+10)
Aggregator/MIN/ordered/decimal/groupSize=1024/numInputRows=1024-10         48.0 ± 0%       46.0 ± 0%   -4.17%  (p=0.000 n=10+10)
Aggregator/MIN/hash/decimal/groupSize=1/numInputRows=1-10                  82.0 ± 0%       79.0 ± 0%   -3.66%  (p=0.000 n=10+10)
```

Co-authored-by: Nathan VanBenschoten <[email protected]>