[DNM] apd: embed small coefficient values in Decimal struct #102
Conversation
They cause the benchmarks to run for a very long time. See golang/go#27217. Adjust the benchmarks to have an explicit setup phase and run phase, separated by `b.ResetTimer`.
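To illustrate the shape of that change, here is a rough sketch with hypothetical names (`parseBenchCases`, `bc.run`, and the surrounding structure are illustrative, not the actual test code): the expensive parsing happens once, `b.ResetTimer` excludes it, and only the run phase is measured.

```go
func BenchmarkGDASuite(b *testing.B) {
	// Setup phase: parse the GDA test file once, outside the measured loop.
	bcs := parseBenchCases(b) // hypothetical setup helper
	b.ResetTimer()            // exclude the setup cost from the measurement

	// Run phase: each of the b.N iterations evaluates the full suite of cases.
	for i := 0; i < b.N; i++ {
		for _, bc := range bcs {
			var res Decimal
			bc.run(&res) // hypothetical: evaluate one operation, ignoring errors
		}
	}
}
```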
Cleans up a bit of code.
This reduces the size of the Decimal struct from 48 bytes to 40 bytes.
This commit introduces a performance optimization that embeds small coefficient values directly in their `Decimal` struct, instead of storing these values in a separate heap allocation. It does so by replacing `math/big.Int` with a new wrapper type called `BigInt` that provides an "inline" compact representation optimization.

Each `BigInt` maintains (through `big.Int`) an internal reference to a variable-length integer value, which is represented by a `[]big.Word`. The `_inline` field and `lazyInit` method combine to allow `BigInt` to inline this variable-length integer array within the `BigInt` struct when its value is sufficiently small. In `lazyInit`, we point the `_inner` field's slice at the `_inline` array. `big.Int` will avoid re-allocating this array until it is provided with a value that exceeds the initial capacity. We set the capacity of the inline array to accommodate any value that would fit in a 128-bit integer (i.e. values up to 2^128 - 1).

This is an alternative to an optimization that many other arbitrary-precision decimal libraries have, where small coefficient values are stored as numeric fields in their data type's struct. Only when this coefficient value gets large enough do these libraries fall back to a variable-length coefficient with internal indirection. We can see the optimization in practice in the `ericlagergren/decimal` library, where each struct contains a `compact uint64` and an `unscaled big.Int`. Prior concern from the authors of `cockroachdb/apd` regarding this form of compact representation optimization was that it traded performance for complexity. The optimization fractured control flow, leaking out across the library and leading to more complex, error-prone code.

The approach taken in this commit does not have the same issue. All arithmetic on the decimal's coefficient is still deferred to `big.Int`. In fact, the entire optimization is best-effort, and bugs that lead to missed calls to `lazyInit` are merely missed opportunities to avoid a heap allocation, and nothing more serious.

However, one major complication with this approach is that Go's escape analysis struggles with understanding self-referential pointers. A naive implementation of this optimization would force all `BigInt` structs to escape to the heap. To work around this, we employ a similar trick to `sync.Cond` and `strings.Builder`: we trick escape analysis into allowing the self-referential pointer without causing the struct to escape.

This works, but it introduces complexity if `BigInt` structs are copied by value. So to avoid nasty bugs, we disallow copying of `BigInt` structs. The self-referencing pointer from `_inner` to `_inline` makes this unsafe, as it could allow aliasing between two `BigInt` structs which would be hidden from escape analysis. If the first `BigInt` then fell out of scope and was GCed, this could corrupt the state of the second `BigInt`. `sync.Cond` and `strings.Builder` also prevent copying to avoid this kind of issue. In fact, `big.Int` itself says that "shallow copies are not supported and may lead to errors", but it doesn't enforce this.
Microbenchmarks:

```
BenchmarkBigIntBinomial-10                  1460434       804.5 ns/op     1024 B/op   38 allocs/op
BenchmarkBigIntQuoRem-10                    1212969       985.9 ns/op        0 B/op    0 allocs/op
BenchmarkBigIntExp-10                           454     2623567 ns/op    10969 B/op   21 allocs/op
BenchmarkBigIntExp2-10                          456     2613395 ns/op    11223 B/op   22 allocs/op
BenchmarkBigIntBitset-10                  152604634       7.856 ns/op        0 B/op    0 allocs/op
BenchmarkBigIntBitsetNeg-10                45926347       25.68 ns/op        0 B/op    0 allocs/op
BenchmarkBigIntBitsetOrig-10               27844972       41.64 ns/op       55 B/op    0 allocs/op
BenchmarkBigIntBitsetNegOrig-10            12631069       94.04 ns/op      168 B/op    1 allocs/op
BenchmarkBigIntModInverse-10                1913102       622.4 ns/op     1280 B/op   11 allocs/op
BenchmarkBigIntSqrt-10                        90704       13227 ns/op     5538 B/op   12 allocs/op
BenchmarkBigIntDiv/20/10-10                41788064       27.61 ns/op        0 B/op    0 allocs/op
BenchmarkBigIntDiv/40/20-10                42714760       27.57 ns/op        0 B/op    0 allocs/op
BenchmarkBigIntDiv/100/50-10               25163826       47.14 ns/op        0 B/op    0 allocs/op
BenchmarkBigIntDiv/200/100-10               7893946       151.7 ns/op        0 B/op    0 allocs/op
BenchmarkBigIntDiv/400/200-10               7052482       169.5 ns/op        0 B/op    0 allocs/op
BenchmarkBigIntDiv/1000/500-10              4212556       283.9 ns/op        0 B/op    0 allocs/op
BenchmarkBigIntDiv/2000/1000-10             2126505       563.8 ns/op        0 B/op    0 allocs/op
BenchmarkBigIntDiv/20000/10000-10             71372       16754 ns/op      128 B/op    1 allocs/op
BenchmarkBigIntDiv/200000/100000-10            1910      618446 ns/op      264 B/op    1 allocs/op
BenchmarkBigIntDiv/2000000/1000000-10            45    25053395 ns/op    88072 B/op    2 allocs/op
BenchmarkBigIntDiv/20000000/10000000-10           2   937879542 ns/op 13384992 B/op   48 allocs/op
```
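For readers skimming the diff, here is a minimal sketch of the mechanism the commit message describes. Field names follow the description above, `math/big` is assumed to be imported, `big.Word` is used for the inline words for simplicity, and the escape-analysis workaround and copy checks are omitted, so this is an illustration rather than the actual implementation:

```go
type BigInt struct {
	_inner  big.Int     // all arithmetic is still deferred to big.Int
	_inline [2]big.Word // room for any coefficient up to 2^128 - 1 on 64-bit platforms
}

// lazyInit points _inner's word slice at the _inline array. big.Int only
// re-allocates its backing array once a value no longer fits in that
// capacity, so small coefficients never require a separate heap allocation.
func (b *BigInt) lazyInit() {
	if b._inner.Bits() == nil {
		b._inner.SetBits(b._inline[:0])
	}
}

// SetInt64 shows the usage pattern: call lazyInit before delegating to big.Int.
func (b *BigInt) SetInt64(x int64) *BigInt {
	b.lazyInit()
	b._inner.SetInt64(x)
	return b
}
```

Note that, as the commit message explains, this naive version would cause the struct to escape to the heap because of the self-referential pointer; the real change adds an escape-analysis workaround and forbids copying.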
This change replaces many calls to `new` for `BigInt` and `Decimal` values with stack-allocated values. This has less of an effect than it may initially seem on its own, because Go's escape analysis can keep `new` "allocations" on the stack in some cases. The larger benefit of this change is that it makes the cases where a value does escape and is heap allocated more obvious, because they now show up as "moved to heap" lines in escape analysis logs.
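As a hypothetical before/after illustration of that pattern (not code from this commit):

```go
// Before: the value is created behind a pointer, and it is not obvious from
// the code whether the allocation stays on the stack or escapes to the heap.
func sumBefore(xs []Decimal) *Decimal {
	sum := new(Decimal)
	// ... accumulate xs into sum ...
	return sum
}

// After: the value is declared on the stack. If it still escapes, the
// compiler's escape analysis output reports "moved to heap: sum", which
// makes the heap allocation easy to spot.
func sumAfter(xs []Decimal) Decimal {
	var sum Decimal
	// ... accumulate xs into &sum ...
	return sum
}
```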
These were useful, but they created escape analysis barriers that resulted in unnecessary heap allocations. Remove them.

Before:
```
➜ goescape . | grep moved
./decimal.go:325:8: moved to heap: integ
./round.go:73:7: moved to heap: y
./context.go:287:6: moved to heap: quo
./context.go:493:6: moved to heap: f
./context.go:503:6: moved to heap: approx
./context.go:519:6: moved to heap: tmp
./context.go:570:6: moved to heap: ax
./context.go:570:10: moved to heap: z
./context.go:599:6: moved to heap: z0
./context.go:1008:6: moved to heap: n
./context.go:902:6: moved to heap: tmp1
./context.go:912:6: moved to heap: tmp2
./context.go:939:9: moved to heap: r
./context.go:965:6: moved to heap: sum
./context.go:697:6: moved to heap: tmp1
./context.go:697:12: moved to heap: tmp2
./context.go:697:18: moved to heap: tmp3
./context.go:697:24: moved to heap: tmp4
./context.go:697:30: moved to heap: z
./context.go:697:33: moved to heap: resAdjust
./context.go:1045:13: moved to heap: frac
./context.go:1067:6: moved to heap: tmp
```

After:
```
➜ goescape . | grep moved
./decimal.go:325:8: moved to heap: integ
./round.go:73:7: moved to heap: y
./context.go:287:6: moved to heap: quo
```
This commit reworks the `Rounder` API to eliminate the escape analysis barrier that it was creating, which resulted in unnecessary heap allocations. The commit replaces the opaque functions used for dynamic dispatch with a switch statement, which escape analysis is more easily able to understand. Unfortunately, to do this, we need to make the set of roundings closed and remove the ability for users to supply their own rounding routines. I think this is a reasonable trade-off, given that we are not aware of anyone actually using the extra flexibility.

Before:
```
➜ goescape . | grep moved
./decimal.go:325:8: moved to heap: integ
./round.go:73:7: moved to heap: y
./context.go:287:6: moved to heap: quo
```

After:
```
➜ goescape . | grep moved | wc -l
0
```
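A rough sketch of the kind of rework described (names and semantics are illustrative, not necessarily the library's exact API): the opaque function value stored on the context is replaced by a switch over a small, closed set of named rounders, which escape analysis can see through.

```go
// Before (sketch): an open set of rounding routines, dispatched dynamically.
//
//	type Rounder func(...) // some opaque rounding function, stored on the Context
//	c.Rounding = myCustomRounder // opaque to escape analysis

// After (sketch): a closed set of named rounders, dispatched with a switch.
type Rounder string

const (
	RoundDown   Rounder = "down"
	RoundHalfUp Rounder = "half_up"
)

// roundsUp reports whether the discarded remainder should round the result
// away from zero, given whether that remainder was at least half a unit.
func (r Rounder) roundsUp(discardedAtLeastHalf bool) bool {
	switch r {
	case RoundDown:
		return false // truncate
	default: // RoundHalfUp
		return discardedAtLeastHalf
	}
}
```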
**Integration into CRDB**

I took a stab at integrating this into CRDB in this nightmare of a prototype. The restriction that `Decimal` values can no longer be copied forces the code to pass them around by reference. This in turn forces the code to be slightly more deliberate about where it is decoding decimal values.

The more complex part of the change was in the vectorized execution engine. The complexity wasn't even entirely due to this change, but instead due to the need to first address this todo:

Since we can no longer copy `Decimal` values, I spent a while learning about and playing with that part of the vectorized engine. After making that change and getting CRDB compiling, I took it for a spin. First, I confirmed that we see the same reduction in heap allocations that we saw in cockroachdb/cockroach#74369. Next, I played around a bit with the tpcds dataset, which includes many DECIMAL columns. I didn't run any tpcds queries, but I did do a few custom full-table aggregations over the DECIMAL columns in ~2GB tables. Nothing was scientific, but there did seem to be something around a 5% speedup compared to master on these aggregations. Or maybe I just wanted there to be. Again, not scientific.

**More compact memory representation**

After writing this change I looked into how other systems like Postgres, MySQL, SQL Server, and Materialize handle decimals. It turns out that only Postgres (not even Materialize, which strives for close to full PG support) supports arbitrary precision decimals (100k+ precision). Most other systems cap the precision of decimals at somewhere around 40, meaning that they don't even need to support a variable-length memory representation. They can instead inline the entire value in a single 24-byte struct. We do want to maintain compatibility with PG, and so I think we need to continue to support arbitrary precision, but this made me question why we were tailoring so much of this code to arbitrary precision. Specifically, I started questioning why we were spending so many bytes of the `Decimal` struct on the rare, arbitrarily large case.

I drafted another prototype that further optimizes for small values at the expense of large values: nvanbenschoten@654be62. The key idea is that instead of a memory representation that looks like:

```go
type Decimal struct {
	Form     int8
	Negative bool
	Exponent int32
	Coeff    BigInt {
		_inner big.Int {
			neg bool
			abs []uint
		}
		_inline [2]uint
		_addr   uintptr
	}
} // sizeof = 64 bytes
```

we have a memory representation that looks like:

```go
type Decimal struct {
	Form     int8
	Negative bool
	Exponent int32
	Coeff    BigInt {
		_inner  *big.Int // nil when value fits in _inline
		_inline [2]uint
	}
} // sizeof = 32
```

There are a few tradeoffs here:

Pros:
- The struct shrinks from 64 bytes to 32 bytes.
- There is no self-referential pointer, so values can be copied safely.

Cons:
- Values too large for the inline array pay for a pointer indirection through `_inner`.
- Every operation has to "inflate" the compact value into a `big.Int` before use and "deflate" the result back into the inline array afterwards.
This second con leads to code that looks like:

```go
func (z *BigInt) And(x, y *BigInt) *BigInt {
	var tmp1, tmp2, tmp3 big.Int
	zi := z.inner(&tmp1)
	zi.And(x.inner(&tmp2), y.inner(&tmp3))
	z.updateInner(zi)
	return z
}
```

This approach compared favorably to the existing representation on master. However, even with the reduction in memory size and on datasets large enough to not fit entirely in my CPU's cache, benchmarks did not present this new approach favorably compared to the approach taken by this PR. It appears that in this performance-sensitive code, even with everything inlined, the on-demand inflation and deflation of stack-allocated `big.Int` values still carries a real cost.
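For context, a hypothetical sketch of what `inner` and `updateInner` could look like under this compact representation (sign handling and other details are omitted, `big.Word` is used for the inline words, and this is not the prototype's actual code):

```go
type BigInt struct {
	_inner  *big.Int // nil when the value fits in _inline
	_inline [2]big.Word
}

// inner "inflates" b into a usable *big.Int: either the heap-allocated value,
// or the caller-provided scratch value loaded with the inline words.
func (b *BigInt) inner(tmp *big.Int) *big.Int {
	if b._inner != nil {
		return b._inner
	}
	return tmp.SetBits(b._inline[:])
}

// updateInner "deflates" src back into b: small results are stored inline,
// while larger results are spilled to a heap-allocated big.Int.
func (b *BigInt) updateInner(src *big.Int) {
	if words := src.Bits(); len(words) <= len(b._inline) && src.Sign() >= 0 {
		b._inner = nil
		b._inline = [2]big.Word{}
		copy(b._inline[:], words)
		return
	}
	b._inner = new(big.Int).Set(src)
}
```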
So I think we'll want to stick with this PR.
Wow, impressive work and a very thorough analysis! I haven't looked into the integration into CRDB, but the change to the `apd` library makes sense to me.
Reviewed 2 of 2 files at r1, 1 of 1 files at r2, 2 of 2 files at r3, 11 of 11 files at r4, 8 of 8 files at r5, 1 of 1 files at r6, 4 of 4 files at r7, all commit messages.
Reviewable status: all files reviewed, 9 unresolved discussions (waiting on @jordanlewis and @nvanbenschoten)
bigint.go, line 29 at r4 (raw file):
```go
// The zero value is ready to use.
// The value must not be copied after first use.
type BigInt struct {
```

What will happen if a new method is added to `big.Int` and we don't add the corresponding method to `BigInt`?
bigint.go, line 70 at r4 (raw file):
```go
}

func (b *BigInt) inner() *big.Int {
```

nit: a quick comment describing when `inner` vs `innerOrNil` should be used would be helpful.
bigint.go, line 118 at r4 (raw file):
```go
// Before doing so, zero out the inline array, in case it had a value
// previously. This is necessary in edge cases where _inner initially
```
I didn't get to the tests yet, but this sounds like a good regression test.
bigint.go, line 193 at r4 (raw file):
```go
func (b *BigInt) Bits() []big.Word {
	// Don't expose direct access to the big.Int's word slice.
	panic("unimplemented")
```
nit: maybe also a regression test for this?
const.go, line 33 at r5 (raw file):
```go
decimalEight = New(8, 0)
decimalMax = New(math.MaxInt64, 0)
```

nit: maybe calling these `decimalMaxInt64` and `decimalMinInt64` would be more descriptive?
decimal.go, line 362 at r5 (raw file):
```go
// them with this scaling, along with the scaling. An error can be produced
// if the resulting scale factor is out of range.
func upscale(a, b *Decimal, tmp *BigInt) (*BigInt, *BigInt, int32, error) {
```

nit: maybe mention what `tmp` should be and what it'll be used for.
gda_test.go, line 330 at r1 (raw file):
```go
b.ResetTimer()
for i := 0; i < b.N; i++ {
```

Not your change, but I don't understand the order of these two `for` loops - currently for every benchmark iteration we run many bench cases, but I would expect the order to be the opposite so that for a particular bench case we'd run as many iterations as needed for that bench case. Am I missing something?
gda_test.go, line 333 at r1 (raw file):
```go
for _, bc := range bcs {
	// Ignore errors here because the full tests catch them.
	var res Decimal
```

nit: not sure if it matters, but we could pull out the definition of `res` to be outside of the `for` loops.
table.go, line 91 at r5 (raw file):
```go
n := int64(float64(bl) / digitsToBitsRatio)
var tpmE BigInt
```

nit: s/tpm/tmp/.
I spent some time trying to address the TODO you mentioned above in order to make the integration easier, and, indeed, it is quite annoying to do. Another difficulty (apart from decimals becoming the only data type that is operated on by reference) is that in some cases we do want the decimal to be stored by value (for example, this is the case for aggregate functions, for which we want to inline the decimal into the function itself). I'm curious to hear @jordanlewis's thoughts about this PR, and if we all think it's worthwhile, I'll figure out how to address that TODO.
Thanks for taking a look, Yahor. I agree with everything you said. The no-copy limitation appeared to make the integration into the vectorized execution engine difficult for a number of different reasons, mostly stemming from the need to operate on references instead of values. It looked possible, but would likely be a lot of work.

In light of your exploration, I took another look at the "More compact memory representation" alternative. This alternative approach is appealing because it does not use a self-referential pointer, so it can be copied by value safely (though I had explored what it would take to work around the no-copy restriction). Here's the change to add fast-paths for small, inline values to the compact memory representation approach: c5c0544. It's not a lot of code, and it benefits from the existing tests. With that commit, the comparison between the two approaches shifts in favor of the compact (and copyable) memory representation (old is this PR, new is c5c0544):
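To give a sense of what such a fast-path can look like, here is a hypothetical sketch layered on the compact representation above (assuming a 64-bit platform, non-negative coefficients, `math/bits` imported, and the `_inline [2]uint` layout shown earlier; this is not the code in c5c0544):

```go
// addSmall adds x and y without touching math/big when both coefficients are
// stored inline. It reports whether the fast path applied; if not (a spilled
// operand or a result wider than 128 bits), the caller falls back to big.Int.
func (z *BigInt) addSmall(x, y *BigInt) bool {
	if x._inner != nil || y._inner != nil {
		return false // at least one operand has spilled to a heap-allocated big.Int
	}
	lo, carry := bits.Add64(uint64(x._inline[0]), uint64(y._inline[0]), 0)
	hi, carryOut := bits.Add64(uint64(x._inline[1]), uint64(y._inline[1]), carry)
	if carryOut != 0 {
		return false // result would exceed the two-word inline capacity
	}
	z._inner = nil
	z._inline[0], z._inline[1] = uint(lo), uint(hi)
	return true
}
```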
I'll look into packaging up that alternative approach and addressing the existing comments of yours that apply to it.
Thanks Nathan, this looks very promising! I think I was able to make the switch from the value to the pointer in the vectorized engine (in cockroachdb/cockroach#74469), at least the unit tests seem to work, but I'll be happy if we end up not needing that change :)
I opened #103 as an alternative to this PR. In that one, I've addressed each of your questions.
Reviewable status: all files reviewed, 5 unresolved discussions (waiting on @jordanlewis and @yuzefovich)
bigint.go, line 29 at r4 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
What will happen if a new method is added to `big.Int` and we don't add the corresponding method to `BigInt`?
(answer the same for either implementation) We're not embedding the `big.Int`, so the methods won't be exported by this type. If we need them, we can add them after the fact.
bigint.go, line 70 at r4 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
nit: a quick comment describing when `inner` vs `innerOrNil` should be used would be helpful.
Done.
bigint.go, line 118 at r4 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
I didn't get to the tests yet, but this sounds like a good regression test.
Done.
bigint.go, line 193 at r4 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
nit: maybe also a regression test for this?
Done.
const.go, line 33 at r5 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
nit: maybe calling these `decimalMaxInt64` and `decimalMinInt64` would be more descriptive?
Good point, done.
decimal.go, line 362 at r5 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
nit: maybe mention what `tmp` should be and what it'll be used for.
Done.
gda_test.go, line 330 at r1 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
Not your change, but I don't understand the order of these two `for` loops - currently for every benchmark iteration we run many bench cases, but I would expect the order to be the opposite so that for a particular bench case we'd run as many iterations as needed for that bench case. Am I missing something?
I don't think it makes a difference in practice. Either way, we're running `b.N` iterations of each bench case.
gda_test.go, line 333 at r1 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
nit: not sure if it matters, but we could pull out the definition of `res` to be outside of the `for` loops.
Good point. I pulled this into a vector of `Decimal` inputs and outputs to more realistically model how the vectorized execution engine will evaluate arithmetic. As it turns out, doing so caused the benchmarks to skew even further (by a few percent) in favor of the compact representation, likely because the improved cache locality highlighted the size difference (2 `Decimal`s per cache line vs. 1).
table.go, line 91 at r5 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
nit: s/tpm/tmp/.
Done.
Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @jordanlewis and @nvanbenschoten)
bigint.go, line 29 at r4 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
(answer the same for either implementation) We're not embedding the `big.Int`, so the methods won't be exported by this type. If we need them, we can add them after the fact.
Indeed, good point.
gda_test.go, line 330 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
I don't think it makes a difference in practice. Either way, we're running `b.N` iterations of each bench case.
I thought that the value of `b.N` changes for each bench case depending on the variability and length of the benchmark runs with the same name. Is that not the case?
Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @jordanlewis and @yuzefovich)
gda_test.go, line 330 at r1 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
I thought that the value of
b.N
changes for each bench case depending on the variability and length of the benchmark runs with the same name. Is that not the case?
Yeah, it's kind of confusing how this works. Each of these "GDA" files contains a series of roughly 1000 operations of varying complexity and we run the entire suite of operations in a given file per benchmark iteration. So at the file level, we have sub-benchmarks (see `b.Run` above), but within a test, we loop over all operations (`benchCase`s) for each benchmark iteration.
I do think this is what we want though because it leads to stable results even with individual operations that vary wildly in cost.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @jordanlewis and @nvanbenschoten)
gda_test.go, line 330 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Yeah, it's kind of confusing how this works. Each of these "GDA" files contains a series of roughly 1000 operations of varying complexity and we run the entire suite of operations in a given file per benchmark iteration. So at the file level, we have sub-benchmarks (see
b.Run
above), but within a test, we loop over all operations (benchCase
s) for each benchmark iteration.I do think this is what we want though because it leads to stable results even with individual operations that vary wildly in cost.
I see, thanks for explaining.
74590: colexec: integrate flat, compact decimal datums r=nvanbenschoten a=nvanbenschoten

Replaces #74369 and #57593.

This PR picks up the following changes to `cockroachdb/apd`:
- cockroachdb/apd#103
- cockroachdb/apd#104
- cockroachdb/apd#107
- cockroachdb/apd#108
- cockroachdb/apd#109
- cockroachdb/apd#110
- cockroachdb/apd#111

Release note (performance improvement): The memory representation of DECIMAL datums has been optimized to save space, avoid heap allocations, and eliminate indirection. This increases the speed of DECIMAL arithmetic and aggregation by up to 20% on large data sets.

----

At a high level, those changes implement the "compact memory representation" for Decimals described in cockroachdb/apd#102 (comment) and later implemented in cockroachdb/apd#103.

Compared to the approach on master, the approach in cockroachdb/apd#103 is a) faster, b) avoids indirection + heap allocation, c) smaller. Compared to the alternate approach in cockroachdb/apd#102, the approach in cockroachdb/apd#103 is a) [faster for most operations](cockroachdb/apd#102 (comment)), b) more usable because values can be safely copied, c) half the memory size (32 bytes per `Decimal`, vs. 64).

The memory representation of the Decimal struct in this approach looks like:

```go
type Decimal struct {
	Form     int8
	Negative bool
	Exponent int32
	Coeff    BigInt {
		_inner  *big.Int // nil when value fits in _inline
		_inline [2]uint
	}
} // sizeof = 32
```

With a two-word inline array, any value that would fit in a 128-bit integer (i.e. decimals with a scale-adjusted absolute value up to 2^128 - 1) fits in `_inline`. The indirection through `_inner` is only used for values larger than this.

Before this change, the memory representation of the `Decimal` struct looked like:

```go
type Decimal struct {
	Form     int64
	Negative bool
	Exponent int32
	Coeff    big.Int {
		neg bool
		abs []big.Word {
			data uintptr ---------------.
			len  int64                  v
			cap  int64          [uint, uint, ...] // sizeof = variable, but around cap = 4, so 32 bytes
		}
	}
} // sizeof = 48 flat bytes + variable-length heap allocated array
```

----

## Performance impact

### Speedup on TPC-DS dataset

The TPC-DS dataset is full of decimal columns, so it's a good playground to test this change. Unfortunately, the variance in the runtime performance of the TPC-DS queries themselves is high (many queries varied by 30-40% per attempt), so it was hard to get signal out of them.
Instead, I imported the TPC-DS dataset with a scale factor of 10 and ran some custom aggregation queries against the largest table (`web_sales`, row count = 7,197,566):

Queries:
```sql
# q1
select sum(ws_wholesale_cost + ws_ext_list_price) from web_sales;

# q2
select sum(2 * ws_wholesale_cost + ws_ext_list_price) - max(4 * ws_ext_ship_cost), min(ws_net_profit) from web_sales;

# q3
select max(ws_bill_customer_sk + ws_bill_cdemo_sk + ws_bill_hdemo_sk + ws_bill_addr_sk + ws_ship_customer_sk + ws_ship_cdemo_sk + ws_ship_hdemo_sk + ws_ship_addr_sk + ws_web_page_sk + ws_web_site_sk + ws_ship_mode_sk + ws_warehouse_sk + ws_promo_sk + ws_order_number + ws_quantity + ws_wholesale_cost + ws_list_price + ws_sales_price + ws_ext_discount_amt + ws_ext_sales_price + ws_ext_wholesale_cost + ws_ext_list_price + ws_ext_tax + ws_coupon_amt + ws_ext_ship_cost + ws_net_paid + ws_net_paid_inc_tax + ws_net_paid_inc_ship + ws_net_paid_inc_ship_tax + ws_net_profit) from web_sales;
```

Here's the difference in runtime of these three queries before and after this change on an `n2-standard-4` instance:

```
name               old s/op   new s/op   delta
TPC-DS/custom/q1   7.21 ± 3%  6.59 ± 0%   -8.57%  (p=0.000 n=10+10)
TPC-DS/custom/q2   10.2 ± 0%   9.7 ± 3%   -5.42%  (p=0.000 n=10+10)
TPC-DS/custom/q3   21.9 ± 1%  17.3 ± 0%  -21.13%  (p=0.000 n=10+10)
```

### Heap allocation reduction in TPC-DS

Part of the reason for this speedup was that it significantly reduces heap allocations because most decimal values are stored inline. We can see this in q3 from above. Before the change, a heap profile looks like:

<img width="1751" alt="Screen Shot 2022-01-07 at 7 12 49 PM" src="https://user-images.githubusercontent.com/5438456/148625159-9ceb470a-0742-4f75-a533-530d9944143c.png">

After the change, a heap profile looks like:

<img width="1749" alt="Screen Shot 2022-01-07 at 7 17 32 PM" src="https://user-images.githubusercontent.com/5438456/148625174-629f4b47-07cc-4ef6-8723-2e556f7fc00d.png">

_(the dominant source of heap allocations is now `coldata.(*Nulls).Or`. #74592 should help here)_

### Heap allocation reduction in TPC-E

On the read-only portion of TPC-E (77% of the full workload, in terms of txn mix), this change has a significant impact on total heap allocations. Before the change, `math/big.nat.make` was responsible for **51.07%** of total heap allocations:

<img width="1587" alt="Screen Shot 2021-12-31 at 8 01 00 PM" src="https://user-images.githubusercontent.com/5438456/147842722-965d649d-b29a-4f66-aa07-1b05e52e97af.png">

After the change, `math/big.nat.make` is responsible for only **1.1%** of total heap allocations:

<img width="1580" alt="Screen Shot 2021-12-31 at 9 04 24 PM" src="https://user-images.githubusercontent.com/5438456/147842727-a881a5a3-d038-48bb-bd44-4ade665afe73.png">

That equates to roughly a **50%** reduction in heap allocations.
### Microbenchmarks ``` name old time/op new time/op delta Aggregator/MIN/ordered/decimal/groupSize=1/numInputRows=1024-10 65.6µs ± 2% 42.5µs ± 0% -35.15% (p=0.000 n=9+8) Aggregator/MIN/ordered/decimal/groupSize=2/numInputRows=1024-10 68.4µs ± 1% 48.4µs ± 1% -29.20% (p=0.000 n=9+10) Aggregator/MIN/ordered/decimal/groupSize=1/numInputRows=32768-10 1.65ms ± 1% 1.20ms ± 1% -27.31% (p=0.000 n=9+10) Aggregator/MIN/ordered/decimal/groupSize=1/numInputRows=1048576-10 51.4ms ± 1% 38.3ms ± 1% -25.59% (p=0.000 n=10+10) Aggregator/MIN/ordered/decimal/groupSize=1/numInputRows=32-10 12.5µs ± 1% 9.4µs ± 2% -24.72% (p=0.000 n=9+10) Aggregator/MIN/ordered/decimal/groupSize=2/numInputRows=32-10 12.5µs ± 1% 9.6µs ± 2% -23.24% (p=0.000 n=8+10) Aggregator/MIN/ordered/decimal/groupSize=1/numInputRows=1-10 10.5µs ± 1% 8.0µs ± 1% -23.22% (p=0.000 n=9+9) Aggregator/MIN/ordered/decimal/groupSize=32/numInputRows=32-10 12.4µs ± 1% 9.6µs ± 1% -22.70% (p=0.000 n=8+10) Aggregator/MIN/ordered/decimal/groupSize=1024/numInputRows=1024-10 60.5µs ± 1% 47.1µs ± 2% -22.24% (p=0.000 n=10+10) Aggregator/MIN/ordered/decimal/groupSize=128/numInputRows=1024-10 61.2µs ± 1% 47.7µs ± 1% -22.09% (p=0.000 n=9+10) Aggregator/MIN/ordered/decimal/groupSize=32/numInputRows=1024-10 62.3µs ± 1% 48.7µs ± 2% -21.91% (p=0.000 n=9+10) Aggregator/MIN/ordered/decimal/groupSize=2/numInputRows=32768-10 1.31ms ± 0% 1.03ms ± 1% -21.53% (p=0.000 n=9+10) Aggregator/MIN/hash/decimal/groupSize=1024/numInputRows=1024-10 82.3µs ± 1% 64.9µs ± 1% -21.12% (p=0.000 n=10+10) Aggregator/MIN/hash/decimal/groupSize=128/numInputRows=1024-10 86.6µs ± 1% 68.5µs ± 1% -20.93% (p=0.000 n=10+10) Aggregator/MIN/hash/decimal/groupSize=32/numInputRows=1024-10 96.0µs ± 1% 77.1µs ± 1% -19.73% (p=0.000 n=10+10) Aggregator/MIN/ordered/decimal/groupSize=2/numInputRows=1048576-10 41.2ms ± 0% 33.1ms ± 0% -19.64% (p=0.000 n=8+10) Aggregator/MIN/hash/decimal/groupSize=32/numInputRows=32-10 17.5µs ± 1% 14.3µs ± 2% -18.59% (p=0.000 n=9+10) Aggregator/MIN/hash/decimal/groupSize=1/numInputRows=1-10 14.8µs ± 3% 12.1µs ± 3% -18.26% (p=0.000 n=10+10) Aggregator/MIN/hash/decimal/groupSize=2/numInputRows=32-10 20.0µs ± 1% 16.4µs ± 1% -18.04% (p=0.000 n=9+9) Aggregator/MIN/hash/decimal/groupSize=1/numInputRows=32-10 20.9µs ± 1% 17.2µs ± 3% -17.80% (p=0.000 n=9+10) Aggregator/MIN/ordered/decimal/groupSize=1024/numInputRows=32768-10 884µs ± 0% 731µs ± 0% -17.30% (p=0.000 n=10+9) Aggregator/MIN/ordered/decimal/groupSize=1024/numInputRows=1048576-10 27.9ms ± 0% 23.1ms ± 0% -17.27% (p=0.000 n=9+9) Aggregator/MIN/hash/decimal/groupSize=1/numInputRows=1024-10 218µs ± 2% 181µs ± 2% -17.23% (p=0.000 n=10+10) Aggregator/MIN/ordered/decimal/groupSize=128/numInputRows=32768-10 911µs ± 1% 755µs ± 1% -17.10% (p=0.000 n=10+10) Aggregator/MIN/ordered/decimal/groupSize=32/numInputRows=32768-10 957µs ± 1% 798µs ± 0% -16.66% (p=0.000 n=9+9) Aggregator/MIN/hash/decimal/groupSize=1024/numInputRows=32768-10 1.54ms ± 1% 1.29ms ± 1% -16.56% (p=0.000 n=10+10) Aggregator/MIN/hash/decimal/groupSize=2/numInputRows=1024-10 188µs ± 1% 157µs ± 2% -16.33% (p=0.000 n=10+10) Aggregator/MIN/ordered/decimal/groupSize=128/numInputRows=1048576-10 28.8ms ± 0% 24.1ms ± 0% -16.14% (p=0.000 n=9+9) Aggregator/MIN/ordered/decimal/groupSize=32/numInputRows=1048576-10 30.4ms ± 0% 25.7ms ± 1% -15.26% (p=0.000 n=10+10) Aggregator/MIN/hash/decimal/groupSize=32/numInputRows=1048576-10 135ms ± 1% 114ms ± 1% -15.21% (p=0.000 n=10+9) Aggregator/MIN/hash/decimal/groupSize=128/numInputRows=32768-10 1.79ms ± 1% 1.52ms ± 1% -15.14% (p=0.000 n=10+10) 
Aggregator/MIN/hash/decimal/groupSize=1/numInputRows=32768-10 6.29ms ± 1% 5.50ms ± 1% -12.62% (p=0.000 n=10+10) Aggregator/MIN/hash/decimal/groupSize=1024/numInputRows=1048576-10 62.2ms ± 0% 54.7ms ± 0% -12.08% (p=0.000 n=10+10) Aggregator/MIN/hash/decimal/groupSize=32/numInputRows=32768-10 2.46ms ± 1% 2.17ms ± 1% -11.88% (p=0.000 n=10+9) Aggregator/MIN/hash/decimal/groupSize=2/numInputRows=32768-10 5.64ms ± 0% 4.98ms ± 0% -11.76% (p=0.000 n=9+10) Aggregator/MIN/hash/decimal/groupSize=1/numInputRows=1048576-10 354ms ± 2% 318ms ± 1% -10.18% (p=0.000 n=10+8) Aggregator/MIN/hash/decimal/groupSize=128/numInputRows=1048576-10 91.8ms ± 1% 83.3ms ± 0% -9.25% (p=0.000 n=9+10) Aggregator/MIN/hash/decimal/groupSize=2/numInputRows=1048576-10 396ms ± 1% 369ms ± 1% -6.83% (p=0.000 n=8+8) name old speed new speed delta Aggregator/MIN/ordered/decimal/groupSize=1/numInputRows=1024-10 125MB/s ± 2% 193MB/s ± 0% +54.20% (p=0.000 n=9+8) Aggregator/MIN/ordered/decimal/groupSize=2/numInputRows=1024-10 120MB/s ± 1% 169MB/s ± 1% +41.24% (p=0.000 n=9+10) Aggregator/MIN/ordered/decimal/groupSize=1/numInputRows=32768-10 159MB/s ± 1% 219MB/s ± 1% +37.57% (p=0.000 n=9+10) Aggregator/MIN/ordered/decimal/groupSize=1/numInputRows=1048576-10 163MB/s ± 1% 219MB/s ± 1% +34.39% (p=0.000 n=10+10) Aggregator/MIN/ordered/decimal/groupSize=1/numInputRows=32-10 20.4MB/s ± 1% 27.2MB/s ± 2% +32.85% (p=0.000 n=9+10) Aggregator/MIN/ordered/decimal/groupSize=1/numInputRows=1-10 764kB/s ± 2% 997kB/s ± 1% +30.45% (p=0.000 n=10+9) Aggregator/MIN/ordered/decimal/groupSize=2/numInputRows=32-10 20.5MB/s ± 1% 26.8MB/s ± 2% +30.28% (p=0.000 n=8+10) Aggregator/MIN/ordered/decimal/groupSize=32/numInputRows=32-10 20.7MB/s ± 1% 26.8MB/s ± 1% +29.37% (p=0.000 n=8+10) Aggregator/MIN/ordered/decimal/groupSize=1024/numInputRows=1024-10 135MB/s ± 1% 174MB/s ± 2% +28.61% (p=0.000 n=10+10) Aggregator/MIN/ordered/decimal/groupSize=128/numInputRows=1024-10 134MB/s ± 1% 172MB/s ± 1% +28.35% (p=0.000 n=9+10) Aggregator/MIN/ordered/decimal/groupSize=32/numInputRows=1024-10 131MB/s ± 1% 168MB/s ± 2% +28.06% (p=0.000 n=9+10) Aggregator/MIN/ordered/decimal/groupSize=2/numInputRows=32768-10 200MB/s ± 0% 255MB/s ± 1% +27.45% (p=0.000 n=9+10) Aggregator/MIN/hash/decimal/groupSize=1024/numInputRows=1024-10 100MB/s ± 1% 126MB/s ± 1% +26.78% (p=0.000 n=10+10) Aggregator/MIN/hash/decimal/groupSize=128/numInputRows=1024-10 94.6MB/s ± 1% 119.6MB/s ± 1% +26.47% (p=0.000 n=10+10) Aggregator/MIN/hash/decimal/groupSize=32/numInputRows=1024-10 85.3MB/s ± 1% 106.3MB/s ± 1% +24.58% (p=0.000 n=10+10) Aggregator/MIN/ordered/decimal/groupSize=2/numInputRows=1048576-10 204MB/s ± 0% 254MB/s ± 0% +24.44% (p=0.000 n=8+10) Aggregator/MIN/hash/decimal/groupSize=32/numInputRows=32-10 14.6MB/s ± 1% 18.0MB/s ± 2% +22.83% (p=0.000 n=9+10) Aggregator/MIN/hash/decimal/groupSize=1/numInputRows=1-10 544kB/s ± 3% 664kB/s ± 2% +22.06% (p=0.000 n=10+10) Aggregator/MIN/hash/decimal/groupSize=2/numInputRows=32-10 12.8MB/s ± 1% 15.6MB/s ± 1% +22.02% (p=0.000 n=9+9) Aggregator/MIN/hash/decimal/groupSize=1/numInputRows=32-10 12.3MB/s ± 1% 14.9MB/s ± 3% +21.67% (p=0.000 n=9+10) Aggregator/MIN/ordered/decimal/groupSize=1024/numInputRows=32768-10 296MB/s ± 0% 358MB/s ± 0% +20.92% (p=0.000 n=10+9) Aggregator/MIN/ordered/decimal/groupSize=1024/numInputRows=1048576-10 300MB/s ± 0% 363MB/s ± 0% +20.87% (p=0.000 n=9+9) Aggregator/MIN/hash/decimal/groupSize=1/numInputRows=1024-10 37.5MB/s ± 2% 45.4MB/s ± 2% +20.82% (p=0.000 n=10+10) Aggregator/MIN/ordered/decimal/groupSize=128/numInputRows=32768-10 288MB/s ± 
1% 347MB/s ± 1% +20.62% (p=0.000 n=10+10) Aggregator/MIN/ordered/decimal/groupSize=32/numInputRows=32768-10 274MB/s ± 1% 329MB/s ± 0% +19.99% (p=0.000 n=9+9) Aggregator/MIN/hash/decimal/groupSize=1024/numInputRows=32768-10 170MB/s ± 1% 204MB/s ± 1% +19.85% (p=0.000 n=10+10) Aggregator/MIN/hash/decimal/groupSize=2/numInputRows=1024-10 43.6MB/s ± 1% 52.1MB/s ± 2% +19.52% (p=0.000 n=10+10) Aggregator/MIN/ordered/decimal/groupSize=128/numInputRows=1048576-10 292MB/s ± 0% 348MB/s ± 0% +19.25% (p=0.000 n=9+9) Aggregator/MIN/ordered/decimal/groupSize=32/numInputRows=1048576-10 276MB/s ± 0% 326MB/s ± 1% +18.00% (p=0.000 n=10+10) Aggregator/MIN/hash/decimal/groupSize=32/numInputRows=1048576-10 62.1MB/s ± 1% 73.3MB/s ± 1% +17.94% (p=0.000 n=10+9) Aggregator/MIN/hash/decimal/groupSize=128/numInputRows=32768-10 147MB/s ± 1% 173MB/s ± 1% +17.83% (p=0.000 n=10+10) Aggregator/MIN/hash/decimal/groupSize=1/numInputRows=32768-10 41.7MB/s ± 1% 47.7MB/s ± 1% +14.44% (p=0.000 n=10+10) Aggregator/MIN/hash/decimal/groupSize=1024/numInputRows=1048576-10 135MB/s ± 0% 153MB/s ± 0% +13.74% (p=0.000 n=10+10) Aggregator/MIN/hash/decimal/groupSize=32/numInputRows=32768-10 106MB/s ± 1% 121MB/s ± 1% +13.48% (p=0.000 n=10+9) Aggregator/MIN/hash/decimal/groupSize=2/numInputRows=32768-10 46.5MB/s ± 0% 52.7MB/s ± 0% +13.34% (p=0.000 n=9+10) Aggregator/MIN/hash/decimal/groupSize=1/numInputRows=1048576-10 23.7MB/s ± 2% 26.3MB/s ± 2% +11.02% (p=0.000 n=10+9) Aggregator/MIN/hash/decimal/groupSize=128/numInputRows=1048576-10 91.3MB/s ± 0% 100.7MB/s ± 0% +10.27% (p=0.000 n=8+10) Aggregator/MIN/hash/decimal/groupSize=2/numInputRows=1048576-10 21.2MB/s ± 1% 22.7MB/s ± 1% +7.32% (p=0.000 n=8+8) name old alloc/op new alloc/op delta Aggregator/MIN/ordered/decimal/groupSize=1/numInputRows=32768-10 354kB ± 0% 239kB ± 0% -32.39% (p=0.000 n=9+9) Aggregator/MIN/ordered/decimal/groupSize=2/numInputRows=32768-10 348kB ± 0% 239kB ± 0% -31.23% (p=0.000 n=10+10) Aggregator/MIN/ordered/decimal/groupSize=1/numInputRows=1024-10 251kB ± 0% 177kB ± 0% -29.44% (p=0.000 n=10+10) Aggregator/MIN/ordered/decimal/groupSize=2/numInputRows=1024-10 246kB ± 0% 177kB ± 0% -28.28% (p=0.000 n=10+10) Aggregator/MIN/ordered/decimal/groupSize=32/numInputRows=32768-10 275kB ± 0% 198kB ± 0% -28.06% (p=0.000 n=10+10) Aggregator/MIN/ordered/decimal/groupSize=32/numInputRows=1024-10 243kB ± 0% 177kB ± 0% -27.15% (p=0.000 n=9+10) Aggregator/MIN/ordered/decimal/groupSize=128/numInputRows=1024-10 242kB ± 0% 177kB ± 0% -27.09% (p=0.000 n=10+10) Aggregator/MIN/ordered/decimal/groupSize=1024/numInputRows=1024-10 242kB ± 0% 177kB ± 0% -27.06% (p=0.000 n=10+10) Aggregator/MIN/ordered/decimal/groupSize=128/numInputRows=32768-10 268kB ± 0% 198kB ± 0% -26.05% (p=0.000 n=10+10) Aggregator/MIN/ordered/decimal/groupSize=1024/numInputRows=32768-10 264kB ± 0% 198kB ± 0% -25.04% (p=0.000 n=10+10) Aggregator/MIN/ordered/decimal/groupSize=1/numInputRows=32-10 75.1kB ± 0% 56.9kB ± 0% -24.25% (p=0.000 n=10+10) Aggregator/MIN/ordered/decimal/groupSize=2/numInputRows=32-10 74.9kB ± 0% 56.9kB ± 0% -24.12% (p=0.000 n=10+10) Aggregator/MIN/ordered/decimal/groupSize=32/numInputRows=32-10 74.8kB ± 0% 56.9kB ± 0% -23.99% (p=0.000 n=10+10) Aggregator/MIN/ordered/decimal/groupSize=1/numInputRows=1-10 69.6kB ± 0% 53.1kB ± 0% -23.66% (p=0.000 n=10+10) Aggregator/MIN/hash/decimal/groupSize=1/numInputRows=1-10 95.2kB ± 0% 75.9kB ± 0% -20.23% (p=0.000 n=10+10) Aggregator/MIN/hash/decimal/groupSize=2/numInputRows=32-10 102kB ± 0% 82kB ± 0% -20.04% (p=0.000 n=8+10) 
Aggregator/MIN/hash/decimal/groupSize=1/numInputRows=32-10 103kB ± 0% 83kB ± 0% -19.95% (p=0.000 n=7+10) Aggregator/MIN/hash/decimal/groupSize=32/numInputRows=32-10 100kB ± 0% 80kB ± 0% -19.90% (p=0.000 n=10+10) Aggregator/MIN/ordered/decimal/groupSize=32/numInputRows=1048576-10 1.14MB ± 0% 0.92MB ± 0% -18.80% (p=0.000 n=10+10) Aggregator/MIN/hash/decimal/groupSize=1024/numInputRows=1024-10 271kB ± 0% 227kB ± 0% -16.16% (p=0.000 n=9+9) Aggregator/MIN/ordered/decimal/groupSize=2/numInputRows=1048576-10 1.10MB ± 0% 0.92MB ± 0% -15.92% (p=0.000 n=10+10) Aggregator/MIN/hash/decimal/groupSize=128/numInputRows=1024-10 280kB ± 1% 235kB ± 1% -15.91% (p=0.000 n=10+10) Aggregator/MIN/ordered/decimal/groupSize=128/numInputRows=1048576-10 1.09MB ± 1% 0.92MB ± 0% -15.67% (p=0.000 n=10+10) Aggregator/MIN/hash/decimal/groupSize=32/numInputRows=1024-10 291kB ± 0% 245kB ± 1% -15.53% (p=0.000 n=9+10) Aggregator/MIN/hash/decimal/groupSize=1024/numInputRows=32768-10 1.11MB ± 0% 0.95MB ± 0% -15.14% (p=0.000 n=8+10) Aggregator/MIN/hash/decimal/groupSize=128/numInputRows=32768-10 1.22MB ± 0% 1.04MB ± 0% -14.77% (p=0.000 n=10+10) Aggregator/MIN/hash/decimal/groupSize=32/numInputRows=32768-10 1.65MB ± 0% 1.42MB ± 0% -13.56% (p=0.000 n=10+10) Aggregator/MIN/hash/decimal/groupSize=1/numInputRows=1024-10 593kB ± 0% 513kB ± 0% -13.36% (p=0.000 n=9+8) Aggregator/MIN/hash/decimal/groupSize=2/numInputRows=1024-10 520kB ± 0% 454kB ± 0% -12.82% (p=0.000 n=9+8) Aggregator/MIN/ordered/decimal/groupSize=1/numInputRows=1048576-10 1.04MB ± 0% 0.92MB ± 0% -11.06% (p=0.000 n=10+10) Aggregator/MIN/hash/decimal/groupSize=1024/numInputRows=1048576-10 2.48MB ± 0% 2.25MB ± 0% -9.32% (p=0.000 n=9+10) Aggregator/MIN/ordered/decimal/groupSize=1024/numInputRows=1048576-10 967kB ± 0% 881kB ± 0% -8.89% (p=0.000 n=10+10) Aggregator/MIN/hash/decimal/groupSize=128/numInputRows=1048576-10 7.86MB ± 0% 7.36MB ± 0% -6.44% (p=0.000 n=9+10) Aggregator/MIN/hash/decimal/groupSize=1/numInputRows=32768-10 14.2MB ± 1% 13.4MB ± 1% -5.83% (p=0.000 n=10+10) Aggregator/MIN/hash/decimal/groupSize=2/numInputRows=32768-10 12.3MB ± 0% 11.7MB ± 0% -5.03% (p=0.001 n=7+7) Aggregator/MIN/hash/decimal/groupSize=32/numInputRows=1048576-10 27.2MB ± 1% 25.9MB ± 1% -4.84% (p=0.000 n=10+10) Aggregator/MIN/hash/decimal/groupSize=1/numInputRows=1048576-10 465MB ± 0% 445MB ± 0% -4.32% (p=0.000 n=10+10) Aggregator/MIN/hash/decimal/groupSize=2/numInputRows=1048576-10 403MB ± 0% 390MB ± 0% -3.44% (p=0.000 n=10+10) name old allocs/op new allocs/op delta Aggregator/MIN/ordered/decimal/groupSize=1/numInputRows=1024-10 1.07k ± 0% 0.05k ± 0% -95.70% (p=0.000 n=10+10) Aggregator/MIN/hash/decimal/groupSize=1/numInputRows=1048576-10 702k ± 0% 32k ± 0% -95.46% (p=0.000 n=10+10) Aggregator/MIN/hash/decimal/groupSize=2/numInputRows=1048576-10 489k ± 0% 28k ± 0% -94.33% (p=0.000 n=10+10) Aggregator/MIN/ordered/decimal/groupSize=1/numInputRows=32768-10 4.40k ± 0% 0.30k ± 0% -93.15% (p=0.000 n=10+10) Aggregator/MIN/hash/decimal/groupSize=1024/numInputRows=1024-10 1.11k ± 0% 0.09k ± 0% -92.02% (p=0.000 n=9+10) Aggregator/MIN/ordered/decimal/groupSize=2/numInputRows=1024-10 561 ± 0% 46 ± 0% -91.80% (p=0.000 n=10+10) Aggregator/MIN/ordered/decimal/groupSize=2/numInputRows=32768-10 3.45k ± 0% 0.30k ± 0% -91.28% (p=0.000 n=10+10) Aggregator/MIN/hash/decimal/groupSize=128/numInputRows=1024-10 1.19k ± 0% 0.15k ± 1% -87.31% (p=0.000 n=10+10) Aggregator/MIN/hash/decimal/groupSize=1024/numInputRows=32768-10 4.87k ± 0% 0.70k ± 0% -85.69% (p=0.000 n=9+10) 
Aggregator/MIN/hash/decimal/groupSize=1/numInputRows=32768-10 32.2k ± 0% 6.3k ± 0% -80.40% (p=0.000 n=10+10) Aggregator/MIN/ordered/decimal/groupSize=32/numInputRows=32768-10 1.45k ± 3% 0.29k ± 0% -79.66% (p=0.000 n=10+10) Aggregator/MIN/hash/decimal/groupSize=32/numInputRows=1024-10 1.39k ± 0% 0.30k ± 1% -78.64% (p=0.000 n=10+10) Aggregator/MIN/hash/decimal/groupSize=2/numInputRows=32768-10 26.2k ± 0% 6.8k ± 1% -73.95% (p=0.000 n=10+10) Aggregator/MIN/hash/decimal/groupSize=128/numInputRows=32768-10 6.64k ± 0% 1.95k ± 0% -70.67% (p=0.000 n=10+10) Aggregator/MIN/hash/decimal/groupSize=1/numInputRows=1024-10 3.44k ± 1% 1.12k ± 1% -67.48% (p=0.000 n=10+10) Aggregator/MIN/hash/decimal/groupSize=32/numInputRows=1048576-10 62.4k ± 0% 20.4k ± 0% -67.32% (p=0.000 n=10+10) Aggregator/MIN/hash/decimal/groupSize=2/numInputRows=1024-10 2.95k ± 1% 1.05k ± 1% -64.52% (p=0.000 n=9+10) Aggregator/MIN/hash/decimal/groupSize=32/numInputRows=32768-10 10.8k ± 0% 4.5k ± 0% -58.21% (p=0.000 n=10+10) Aggregator/MIN/ordered/decimal/groupSize=128/numInputRows=32768-10 628 ± 3% 294 ± 0% -53.21% (p=0.000 n=10+10) Aggregator/MIN/hash/decimal/groupSize=128/numInputRows=1048576-10 36.1k ± 0% 20.2k ± 0% -44.06% (p=0.000 n=9+10) Aggregator/MIN/ordered/decimal/groupSize=32/numInputRows=1024-10 81.7 ± 3% 46.0 ± 0% -43.67% (p=0.000 n=9+10) Aggregator/MIN/ordered/decimal/groupSize=32/numInputRows=1048576-10 14.4k ± 1% 8.2k ± 0% -42.97% (p=0.000 n=10+10) Aggregator/MIN/ordered/decimal/groupSize=1/numInputRows=32-10 79.0 ± 0% 46.0 ± 0% -41.77% (p=0.000 n=10+10) Aggregator/MIN/ordered/decimal/groupSize=2/numInputRows=1048576-10 13.7k ± 1% 8.2k ± 0% -40.05% (p=0.000 n=10+10) Aggregator/MIN/hash/decimal/groupSize=1/numInputRows=32-10 191 ± 1% 120 ± 1% -37.52% (p=0.000 n=7+10) Aggregator/MIN/ordered/decimal/groupSize=128/numInputRows=1048576-10 12.9k ± 2% 8.2k ± 0% -36.17% (p=0.000 n=10+10) Aggregator/MIN/hash/decimal/groupSize=2/numInputRows=32-10 176 ± 2% 115 ± 1% -34.33% (p=0.000 n=10+9) Aggregator/MIN/ordered/decimal/groupSize=1/numInputRows=1048576-10 12.3k ± 0% 8.2k ± 0% -33.21% (p=0.000 n=10+10) Aggregator/MIN/hash/decimal/groupSize=1024/numInputRows=1048576-10 21.8k ± 0% 15.2k ± 0% -30.13% (p=0.000 n=10+10) Aggregator/MIN/hash/decimal/groupSize=32/numInputRows=32-10 118 ± 0% 84 ± 0% -28.81% (p=0.000 n=10+10) Aggregator/MIN/ordered/decimal/groupSize=2/numInputRows=32-10 63.0 ± 0% 46.0 ± 0% -26.98% (p=0.000 n=10+10) Aggregator/MIN/ordered/decimal/groupSize=128/numInputRows=1024-10 57.2 ±14% 46.0 ± 0% -19.58% (p=0.000 n=10+10) Aggregator/MIN/ordered/decimal/groupSize=1024/numInputRows=1048576-10 9.69k ± 1% 8.23k ± 0% -15.07% (p=0.000 n=10+10) Aggregator/MIN/ordered/decimal/groupSize=1024/numInputRows=32768-10 340 ± 2% 294 ± 0% -13.43% (p=0.000 n=10+10) Aggregator/MIN/ordered/decimal/groupSize=1/numInputRows=1-10 48.0 ± 0% 46.0 ± 0% -4.17% (p=0.000 n=10+10) Aggregator/MIN/ordered/decimal/groupSize=32/numInputRows=32-10 48.0 ± 0% 46.0 ± 0% -4.17% (p=0.000 n=10+10) Aggregator/MIN/ordered/decimal/groupSize=1024/numInputRows=1024-10 48.0 ± 0% 46.0 ± 0% -4.17% (p=0.000 n=10+10) Aggregator/MIN/hash/decimal/groupSize=1/numInputRows=1-10 82.0 ± 0% 79.0 ± 0% -3.66% (p=0.000 n=10+10) ``` Co-authored-by: Nathan VanBenschoten <[email protected]>
Replaces cockroachdb/cockroach#74369.
Replaces #101.
Impact on benchmarks:
cc. @mjibson