[DNM] col/coldata: batch allocate big.Int coefficients for Decimals vectors #74369
Conversation
Force-pushed from 79970c5 to c1597da
This commit updates `defaultColumnFactory.MakeColumn` to batch allocate the coefficients of each decimal in a `Decimals` vector. Each `Decimal` maintains (through an embedded `big.Int`) an internal reference to a variable-length coefficient, which is represented by a `[]big.Word`. This commit attempts to minimize heap allocations by pre-allocating a single large `[]big.Word` and distributing chunks of this slice to each `Decimal` in a `Decimals` vector. The `big.Int` will avoid re-allocating unless it is provided with a coefficient that exceeds the initial capacity. We set this capacity to accommodate any coefficient that would fit in a 64-bit integer (i.e. up to 2^64).
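A rough sketch of that idea (assuming apd v2's `Decimal`, whose coefficient is the exported `Coeff big.Int` field; `newDecimalVector` and `wordsPerDecimal` are illustrative names, not the actual `MakeColumn` code):

```go
package sketch

import (
	"math/big"
	"math/bits"

	"github.com/cockroachdb/apd/v2"
)

// wordsPerDecimal is enough capacity for any coefficient that fits in a
// 64-bit integer: one big.Word on 64-bit platforms, two on 32-bit.
const wordsPerDecimal = 64 / bits.UintSize

// newDecimalVector pre-allocates one large []big.Word and points each
// Decimal's coefficient at its own zero-length, fixed-capacity chunk. The
// embedded big.Int only re-allocates if a coefficient later overflows its
// chunk.
func newDecimalVector(n int) []apd.Decimal {
	decs := make([]apd.Decimal, n)
	words := make([]big.Word, n*wordsPerDecimal)
	for i := range decs {
		// Three-index slice: len 0, cap wordsPerDecimal, no overlap with
		// neighboring chunks.
		chunk := words[i*wordsPerDecimal : i*wordsPerDecimal : (i+1)*wordsPerDecimal]
		decs[i].Coeff.SetBits(chunk)
	}
	return decs
}
```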
Similar to the previous commit, but for non-vectorized execution. Needs polish.
Force-pushed from c1597da to 9512cb4
After thinking about this a bit further, I wonder whether we should be pushing this optimization into the `cockroachdb/apd` library itself. I recall @mjibson expressing caution when discussing the "inline" representation that other similar libraries use. If that was the main concern, then the approach used in this PR could instead inline the coefficient storage into the `Decimal` struct, something like:

```go
 type Decimal struct {
 	Form     Form
 	Negative bool
 	Exponent int32
 	Coeff    big.Int

+	coeffInline [1]big.Word
 }

+// lazyInit lazily initializes a zero Decimal value.
+func (d *Decimal) lazyInit() {
+	if d.Coeff.Bits() == nil {
+		d.Coeff.SetBits(d.coeffInline[:0])
+	}
+}
```

We could then add `lazyInit` calls in the appropriate places. If I get the chance, I'll try to play around with this idea. Also, while we're at it, we should change …
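To illustrate (purely hypothetically, building on the fields sketched above rather than any actual apd API), a setter could then reuse the inline word for small coefficients:

```go
// Hypothetical setter, not actual apd code: with coeffInline backing Coeff,
// a coefficient that fits in a single big.Word is stored without any heap
// allocation.
func (d *Decimal) SetCoeffUint64(c uint64) *Decimal {
	d.lazyInit()
	d.Coeff.SetUint64(c) // reuses d.coeffInline when c fits in one word
	return d
}
```

One caveat I'd expect with this layout: copying a `Decimal` by value would leave the copy's `Coeff` still pointing at the source's `coeffInline` array, so the two values would alias until the next re-allocation. That seems related to the "values can be safely copied" point made about the representation that eventually landed (see below).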
74590: colexec: integrate flat, compact decimal datums r=nvanbenschoten a=nvanbenschoten

Replaces #74369 and #57593.

This PR picks up the following changes to `cockroachdb/apd`:
- cockroachdb/apd#103
- cockroachdb/apd#104
- cockroachdb/apd#107
- cockroachdb/apd#108
- cockroachdb/apd#109
- cockroachdb/apd#110
- cockroachdb/apd#111

Release note (performance improvement): The memory representation of DECIMAL datums has been optimized to save space, avoid heap allocations, and eliminate indirection. This increases the speed of DECIMAL arithmetic and aggregation by up to 20% on large data sets.

----

At a high level, those changes implement the "compact memory representation" for Decimals described in cockroachdb/apd#102 (comment) and later implemented in cockroachdb/apd#103. Compared to the approach on master, the approach in cockroachdb/apd#103 is a) faster, b) avoids indirection + heap allocation, c) smaller. Compared to the alternate approach in cockroachdb/apd#102, the approach in cockroachdb/apd#103 is a) [faster for most operations](cockroachdb/apd#102 (comment)), b) more usable because values can be safely copied, c) half the memory size (32 bytes per `Decimal`, vs. 64).

The memory representation of the `Decimal` struct in this approach looks like:

```go
type Decimal struct {
    Form     int8
    Negative bool
    Exponent int32
    Coeff    BigInt {
        _inner  *big.Int // nil when value fits in _inline
        _inline [2]uint
    }
} // sizeof = 32
```

With a two-word inline array, any value that would fit in a 128-bit integer (i.e. decimals with a scale-adjusted absolute value up to 2^128 - 1) fits in `_inline`. The indirection through `_inner` is only used for values larger than this.

Before this change, the memory representation of the `Decimal` struct looked like:

```go
type Decimal struct {
    Form     int64
    Negative bool
    Exponent int32
    Coeff    big.Int {
        neg bool
        abs []big.Word {
            data uintptr ---------------.
            len  int64                  v
            cap  int64                  [uint, uint, ...] // sizeof = variable, but around cap = 4, so 32 bytes
        }
    }
} // sizeof = 48 flat bytes + variable-length heap allocated array
```
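As a quick sanity check of the 32-byte vs. 48-byte figures above, here is a standalone sketch using stand-in structs that mirror the two layouts (illustrative types only, not the actual apd definitions):

```go
package main

import (
	"fmt"
	"math/big"
	"unsafe"
)

// compactDecimal mirrors the new layout: a small header plus a two-word
// inline coefficient and a pointer that is only set for large values.
type compactDecimal struct {
	Form     int8
	Negative bool
	Exponent int32
	Coeff    struct {
		_inner  *big.Int // nil when the value fits in _inline
		_inline [2]uint
	}
}

// heapDecimal mirrors the old layout, which embeds a big.Int and therefore
// carries a slice header pointing at a heap-allocated []big.Word.
type heapDecimal struct {
	Form     int64
	Negative bool
	Exponent int32
	Coeff    big.Int
}

func main() {
	// On 64-bit platforms this prints "32 48": the compact form is smaller
	// even before counting the variable-length heap array behind big.Int.
	fmt.Println(unsafe.Sizeof(compactDecimal{}), unsafe.Sizeof(heapDecimal{}))
}
```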
----

## Performance impact

### Speedup on TPC-DS dataset

The TPC-DS dataset is full of decimal columns, so it's a good playground to test this change. Unfortunately, the variance in the runtime performance of the TPC-DS queries themselves is high (many queries varied by 30-40% per attempt), so it was hard to get signal out of them. Instead, I imported the TPC-DS dataset with a scale factor of 10 and ran some custom aggregation queries against the largest table (`web_sales`, row count = 7,197,566).

Queries:

```sql
# q1
select sum(ws_wholesale_cost + ws_ext_list_price) from web_sales;

# q2
select sum(2 * ws_wholesale_cost + ws_ext_list_price) - max(4 * ws_ext_ship_cost), min(ws_net_profit) from web_sales;

# q3
select max(ws_bill_customer_sk + ws_bill_cdemo_sk + ws_bill_hdemo_sk + ws_bill_addr_sk + ws_ship_customer_sk + ws_ship_cdemo_sk + ws_ship_hdemo_sk + ws_ship_addr_sk + ws_web_page_sk + ws_web_site_sk + ws_ship_mode_sk + ws_warehouse_sk + ws_promo_sk + ws_order_number + ws_quantity + ws_wholesale_cost + ws_list_price + ws_sales_price + ws_ext_discount_amt + ws_ext_sales_price + ws_ext_wholesale_cost + ws_ext_list_price + ws_ext_tax + ws_coupon_amt + ws_ext_ship_cost + ws_net_paid + ws_net_paid_inc_tax + ws_net_paid_inc_ship + ws_net_paid_inc_ship_tax + ws_net_profit) from web_sales;
```

Here's the difference in runtime of these three queries before and after this change on an `n2-standard-4` instance:

```
name              old s/op   new s/op   delta
TPC-DS/custom/q1  7.21 ± 3%  6.59 ± 0%   -8.57%  (p=0.000 n=10+10)
TPC-DS/custom/q2  10.2 ± 0%   9.7 ± 3%   -5.42%  (p=0.000 n=10+10)
TPC-DS/custom/q3  21.9 ± 1%  17.3 ± 0%  -21.13%  (p=0.000 n=10+10)
```

### Heap allocation reduction in TPC-DS

Part of the reason for this speedup is that the change significantly reduces heap allocations, because most decimal values are now stored inline. We can see this in q3 from above. Before the change, a heap profile looks like:

<img width="1751" alt="Screen Shot 2022-01-07 at 7 12 49 PM" src="https://user-images.githubusercontent.com/5438456/148625159-9ceb470a-0742-4f75-a533-530d9944143c.png">

After the change, a heap profile looks like:

<img width="1749" alt="Screen Shot 2022-01-07 at 7 17 32 PM" src="https://user-images.githubusercontent.com/5438456/148625174-629f4b47-07cc-4ef6-8723-2e556f7fc00d.png">

_(the dominant source of heap allocations is now `coldata.(*Nulls).Or`. #74592 should help here)_

### Heap allocation reduction in TPC-E

On the read-only portion of the TPC-E workload (77% of the full workload, in terms of txn mix), this change has a significant impact on total heap allocations. Before the change, `math/big.nat.make` was responsible for **51.07%** of total heap allocations:

<img width="1587" alt="Screen Shot 2021-12-31 at 8 01 00 PM" src="https://user-images.githubusercontent.com/5438456/147842722-965d649d-b29a-4f66-aa07-1b05e52e97af.png">

After the change, `math/big.nat.make` is responsible for only **1.1%** of total heap allocations:

<img width="1580" alt="Screen Shot 2021-12-31 at 9 04 24 PM" src="https://user-images.githubusercontent.com/5438456/147842727-a881a5a3-d038-48bb-bd44-4ade665afe73.png">

That equates to roughly a **50%** reduction in heap allocations.
### Microbenchmarks ``` name old time/op new time/op delta Aggregator/MIN/ordered/decimal/groupSize=1/numInputRows=1024-10 65.6µs ± 2% 42.5µs ± 0% -35.15% (p=0.000 n=9+8) Aggregator/MIN/ordered/decimal/groupSize=2/numInputRows=1024-10 68.4µs ± 1% 48.4µs ± 1% -29.20% (p=0.000 n=9+10) Aggregator/MIN/ordered/decimal/groupSize=1/numInputRows=32768-10 1.65ms ± 1% 1.20ms ± 1% -27.31% (p=0.000 n=9+10) Aggregator/MIN/ordered/decimal/groupSize=1/numInputRows=1048576-10 51.4ms ± 1% 38.3ms ± 1% -25.59% (p=0.000 n=10+10) Aggregator/MIN/ordered/decimal/groupSize=1/numInputRows=32-10 12.5µs ± 1% 9.4µs ± 2% -24.72% (p=0.000 n=9+10) Aggregator/MIN/ordered/decimal/groupSize=2/numInputRows=32-10 12.5µs ± 1% 9.6µs ± 2% -23.24% (p=0.000 n=8+10) Aggregator/MIN/ordered/decimal/groupSize=1/numInputRows=1-10 10.5µs ± 1% 8.0µs ± 1% -23.22% (p=0.000 n=9+9) Aggregator/MIN/ordered/decimal/groupSize=32/numInputRows=32-10 12.4µs ± 1% 9.6µs ± 1% -22.70% (p=0.000 n=8+10) Aggregator/MIN/ordered/decimal/groupSize=1024/numInputRows=1024-10 60.5µs ± 1% 47.1µs ± 2% -22.24% (p=0.000 n=10+10) Aggregator/MIN/ordered/decimal/groupSize=128/numInputRows=1024-10 61.2µs ± 1% 47.7µs ± 1% -22.09% (p=0.000 n=9+10) Aggregator/MIN/ordered/decimal/groupSize=32/numInputRows=1024-10 62.3µs ± 1% 48.7µs ± 2% -21.91% (p=0.000 n=9+10) Aggregator/MIN/ordered/decimal/groupSize=2/numInputRows=32768-10 1.31ms ± 0% 1.03ms ± 1% -21.53% (p=0.000 n=9+10) Aggregator/MIN/hash/decimal/groupSize=1024/numInputRows=1024-10 82.3µs ± 1% 64.9µs ± 1% -21.12% (p=0.000 n=10+10) Aggregator/MIN/hash/decimal/groupSize=128/numInputRows=1024-10 86.6µs ± 1% 68.5µs ± 1% -20.93% (p=0.000 n=10+10) Aggregator/MIN/hash/decimal/groupSize=32/numInputRows=1024-10 96.0µs ± 1% 77.1µs ± 1% -19.73% (p=0.000 n=10+10) Aggregator/MIN/ordered/decimal/groupSize=2/numInputRows=1048576-10 41.2ms ± 0% 33.1ms ± 0% -19.64% (p=0.000 n=8+10) Aggregator/MIN/hash/decimal/groupSize=32/numInputRows=32-10 17.5µs ± 1% 14.3µs ± 2% -18.59% (p=0.000 n=9+10) Aggregator/MIN/hash/decimal/groupSize=1/numInputRows=1-10 14.8µs ± 3% 12.1µs ± 3% -18.26% (p=0.000 n=10+10) Aggregator/MIN/hash/decimal/groupSize=2/numInputRows=32-10 20.0µs ± 1% 16.4µs ± 1% -18.04% (p=0.000 n=9+9) Aggregator/MIN/hash/decimal/groupSize=1/numInputRows=32-10 20.9µs ± 1% 17.2µs ± 3% -17.80% (p=0.000 n=9+10) Aggregator/MIN/ordered/decimal/groupSize=1024/numInputRows=32768-10 884µs ± 0% 731µs ± 0% -17.30% (p=0.000 n=10+9) Aggregator/MIN/ordered/decimal/groupSize=1024/numInputRows=1048576-10 27.9ms ± 0% 23.1ms ± 0% -17.27% (p=0.000 n=9+9) Aggregator/MIN/hash/decimal/groupSize=1/numInputRows=1024-10 218µs ± 2% 181µs ± 2% -17.23% (p=0.000 n=10+10) Aggregator/MIN/ordered/decimal/groupSize=128/numInputRows=32768-10 911µs ± 1% 755µs ± 1% -17.10% (p=0.000 n=10+10) Aggregator/MIN/ordered/decimal/groupSize=32/numInputRows=32768-10 957µs ± 1% 798µs ± 0% -16.66% (p=0.000 n=9+9) Aggregator/MIN/hash/decimal/groupSize=1024/numInputRows=32768-10 1.54ms ± 1% 1.29ms ± 1% -16.56% (p=0.000 n=10+10) Aggregator/MIN/hash/decimal/groupSize=2/numInputRows=1024-10 188µs ± 1% 157µs ± 2% -16.33% (p=0.000 n=10+10) Aggregator/MIN/ordered/decimal/groupSize=128/numInputRows=1048576-10 28.8ms ± 0% 24.1ms ± 0% -16.14% (p=0.000 n=9+9) Aggregator/MIN/ordered/decimal/groupSize=32/numInputRows=1048576-10 30.4ms ± 0% 25.7ms ± 1% -15.26% (p=0.000 n=10+10) Aggregator/MIN/hash/decimal/groupSize=32/numInputRows=1048576-10 135ms ± 1% 114ms ± 1% -15.21% (p=0.000 n=10+9) Aggregator/MIN/hash/decimal/groupSize=128/numInputRows=32768-10 1.79ms ± 1% 1.52ms ± 1% -15.14% (p=0.000 n=10+10) 
Aggregator/MIN/hash/decimal/groupSize=1/numInputRows=32768-10 6.29ms ± 1% 5.50ms ± 1% -12.62% (p=0.000 n=10+10) Aggregator/MIN/hash/decimal/groupSize=1024/numInputRows=1048576-10 62.2ms ± 0% 54.7ms ± 0% -12.08% (p=0.000 n=10+10) Aggregator/MIN/hash/decimal/groupSize=32/numInputRows=32768-10 2.46ms ± 1% 2.17ms ± 1% -11.88% (p=0.000 n=10+9) Aggregator/MIN/hash/decimal/groupSize=2/numInputRows=32768-10 5.64ms ± 0% 4.98ms ± 0% -11.76% (p=0.000 n=9+10) Aggregator/MIN/hash/decimal/groupSize=1/numInputRows=1048576-10 354ms ± 2% 318ms ± 1% -10.18% (p=0.000 n=10+8) Aggregator/MIN/hash/decimal/groupSize=128/numInputRows=1048576-10 91.8ms ± 1% 83.3ms ± 0% -9.25% (p=0.000 n=9+10) Aggregator/MIN/hash/decimal/groupSize=2/numInputRows=1048576-10 396ms ± 1% 369ms ± 1% -6.83% (p=0.000 n=8+8) name old speed new speed delta Aggregator/MIN/ordered/decimal/groupSize=1/numInputRows=1024-10 125MB/s ± 2% 193MB/s ± 0% +54.20% (p=0.000 n=9+8) Aggregator/MIN/ordered/decimal/groupSize=2/numInputRows=1024-10 120MB/s ± 1% 169MB/s ± 1% +41.24% (p=0.000 n=9+10) Aggregator/MIN/ordered/decimal/groupSize=1/numInputRows=32768-10 159MB/s ± 1% 219MB/s ± 1% +37.57% (p=0.000 n=9+10) Aggregator/MIN/ordered/decimal/groupSize=1/numInputRows=1048576-10 163MB/s ± 1% 219MB/s ± 1% +34.39% (p=0.000 n=10+10) Aggregator/MIN/ordered/decimal/groupSize=1/numInputRows=32-10 20.4MB/s ± 1% 27.2MB/s ± 2% +32.85% (p=0.000 n=9+10) Aggregator/MIN/ordered/decimal/groupSize=1/numInputRows=1-10 764kB/s ± 2% 997kB/s ± 1% +30.45% (p=0.000 n=10+9) Aggregator/MIN/ordered/decimal/groupSize=2/numInputRows=32-10 20.5MB/s ± 1% 26.8MB/s ± 2% +30.28% (p=0.000 n=8+10) Aggregator/MIN/ordered/decimal/groupSize=32/numInputRows=32-10 20.7MB/s ± 1% 26.8MB/s ± 1% +29.37% (p=0.000 n=8+10) Aggregator/MIN/ordered/decimal/groupSize=1024/numInputRows=1024-10 135MB/s ± 1% 174MB/s ± 2% +28.61% (p=0.000 n=10+10) Aggregator/MIN/ordered/decimal/groupSize=128/numInputRows=1024-10 134MB/s ± 1% 172MB/s ± 1% +28.35% (p=0.000 n=9+10) Aggregator/MIN/ordered/decimal/groupSize=32/numInputRows=1024-10 131MB/s ± 1% 168MB/s ± 2% +28.06% (p=0.000 n=9+10) Aggregator/MIN/ordered/decimal/groupSize=2/numInputRows=32768-10 200MB/s ± 0% 255MB/s ± 1% +27.45% (p=0.000 n=9+10) Aggregator/MIN/hash/decimal/groupSize=1024/numInputRows=1024-10 100MB/s ± 1% 126MB/s ± 1% +26.78% (p=0.000 n=10+10) Aggregator/MIN/hash/decimal/groupSize=128/numInputRows=1024-10 94.6MB/s ± 1% 119.6MB/s ± 1% +26.47% (p=0.000 n=10+10) Aggregator/MIN/hash/decimal/groupSize=32/numInputRows=1024-10 85.3MB/s ± 1% 106.3MB/s ± 1% +24.58% (p=0.000 n=10+10) Aggregator/MIN/ordered/decimal/groupSize=2/numInputRows=1048576-10 204MB/s ± 0% 254MB/s ± 0% +24.44% (p=0.000 n=8+10) Aggregator/MIN/hash/decimal/groupSize=32/numInputRows=32-10 14.6MB/s ± 1% 18.0MB/s ± 2% +22.83% (p=0.000 n=9+10) Aggregator/MIN/hash/decimal/groupSize=1/numInputRows=1-10 544kB/s ± 3% 664kB/s ± 2% +22.06% (p=0.000 n=10+10) Aggregator/MIN/hash/decimal/groupSize=2/numInputRows=32-10 12.8MB/s ± 1% 15.6MB/s ± 1% +22.02% (p=0.000 n=9+9) Aggregator/MIN/hash/decimal/groupSize=1/numInputRows=32-10 12.3MB/s ± 1% 14.9MB/s ± 3% +21.67% (p=0.000 n=9+10) Aggregator/MIN/ordered/decimal/groupSize=1024/numInputRows=32768-10 296MB/s ± 0% 358MB/s ± 0% +20.92% (p=0.000 n=10+9) Aggregator/MIN/ordered/decimal/groupSize=1024/numInputRows=1048576-10 300MB/s ± 0% 363MB/s ± 0% +20.87% (p=0.000 n=9+9) Aggregator/MIN/hash/decimal/groupSize=1/numInputRows=1024-10 37.5MB/s ± 2% 45.4MB/s ± 2% +20.82% (p=0.000 n=10+10) Aggregator/MIN/ordered/decimal/groupSize=128/numInputRows=32768-10 288MB/s ± 
1% 347MB/s ± 1% +20.62% (p=0.000 n=10+10) Aggregator/MIN/ordered/decimal/groupSize=32/numInputRows=32768-10 274MB/s ± 1% 329MB/s ± 0% +19.99% (p=0.000 n=9+9) Aggregator/MIN/hash/decimal/groupSize=1024/numInputRows=32768-10 170MB/s ± 1% 204MB/s ± 1% +19.85% (p=0.000 n=10+10) Aggregator/MIN/hash/decimal/groupSize=2/numInputRows=1024-10 43.6MB/s ± 1% 52.1MB/s ± 2% +19.52% (p=0.000 n=10+10) Aggregator/MIN/ordered/decimal/groupSize=128/numInputRows=1048576-10 292MB/s ± 0% 348MB/s ± 0% +19.25% (p=0.000 n=9+9) Aggregator/MIN/ordered/decimal/groupSize=32/numInputRows=1048576-10 276MB/s ± 0% 326MB/s ± 1% +18.00% (p=0.000 n=10+10) Aggregator/MIN/hash/decimal/groupSize=32/numInputRows=1048576-10 62.1MB/s ± 1% 73.3MB/s ± 1% +17.94% (p=0.000 n=10+9) Aggregator/MIN/hash/decimal/groupSize=128/numInputRows=32768-10 147MB/s ± 1% 173MB/s ± 1% +17.83% (p=0.000 n=10+10) Aggregator/MIN/hash/decimal/groupSize=1/numInputRows=32768-10 41.7MB/s ± 1% 47.7MB/s ± 1% +14.44% (p=0.000 n=10+10) Aggregator/MIN/hash/decimal/groupSize=1024/numInputRows=1048576-10 135MB/s ± 0% 153MB/s ± 0% +13.74% (p=0.000 n=10+10) Aggregator/MIN/hash/decimal/groupSize=32/numInputRows=32768-10 106MB/s ± 1% 121MB/s ± 1% +13.48% (p=0.000 n=10+9) Aggregator/MIN/hash/decimal/groupSize=2/numInputRows=32768-10 46.5MB/s ± 0% 52.7MB/s ± 0% +13.34% (p=0.000 n=9+10) Aggregator/MIN/hash/decimal/groupSize=1/numInputRows=1048576-10 23.7MB/s ± 2% 26.3MB/s ± 2% +11.02% (p=0.000 n=10+9) Aggregator/MIN/hash/decimal/groupSize=128/numInputRows=1048576-10 91.3MB/s ± 0% 100.7MB/s ± 0% +10.27% (p=0.000 n=8+10) Aggregator/MIN/hash/decimal/groupSize=2/numInputRows=1048576-10 21.2MB/s ± 1% 22.7MB/s ± 1% +7.32% (p=0.000 n=8+8) name old alloc/op new alloc/op delta Aggregator/MIN/ordered/decimal/groupSize=1/numInputRows=32768-10 354kB ± 0% 239kB ± 0% -32.39% (p=0.000 n=9+9) Aggregator/MIN/ordered/decimal/groupSize=2/numInputRows=32768-10 348kB ± 0% 239kB ± 0% -31.23% (p=0.000 n=10+10) Aggregator/MIN/ordered/decimal/groupSize=1/numInputRows=1024-10 251kB ± 0% 177kB ± 0% -29.44% (p=0.000 n=10+10) Aggregator/MIN/ordered/decimal/groupSize=2/numInputRows=1024-10 246kB ± 0% 177kB ± 0% -28.28% (p=0.000 n=10+10) Aggregator/MIN/ordered/decimal/groupSize=32/numInputRows=32768-10 275kB ± 0% 198kB ± 0% -28.06% (p=0.000 n=10+10) Aggregator/MIN/ordered/decimal/groupSize=32/numInputRows=1024-10 243kB ± 0% 177kB ± 0% -27.15% (p=0.000 n=9+10) Aggregator/MIN/ordered/decimal/groupSize=128/numInputRows=1024-10 242kB ± 0% 177kB ± 0% -27.09% (p=0.000 n=10+10) Aggregator/MIN/ordered/decimal/groupSize=1024/numInputRows=1024-10 242kB ± 0% 177kB ± 0% -27.06% (p=0.000 n=10+10) Aggregator/MIN/ordered/decimal/groupSize=128/numInputRows=32768-10 268kB ± 0% 198kB ± 0% -26.05% (p=0.000 n=10+10) Aggregator/MIN/ordered/decimal/groupSize=1024/numInputRows=32768-10 264kB ± 0% 198kB ± 0% -25.04% (p=0.000 n=10+10) Aggregator/MIN/ordered/decimal/groupSize=1/numInputRows=32-10 75.1kB ± 0% 56.9kB ± 0% -24.25% (p=0.000 n=10+10) Aggregator/MIN/ordered/decimal/groupSize=2/numInputRows=32-10 74.9kB ± 0% 56.9kB ± 0% -24.12% (p=0.000 n=10+10) Aggregator/MIN/ordered/decimal/groupSize=32/numInputRows=32-10 74.8kB ± 0% 56.9kB ± 0% -23.99% (p=0.000 n=10+10) Aggregator/MIN/ordered/decimal/groupSize=1/numInputRows=1-10 69.6kB ± 0% 53.1kB ± 0% -23.66% (p=0.000 n=10+10) Aggregator/MIN/hash/decimal/groupSize=1/numInputRows=1-10 95.2kB ± 0% 75.9kB ± 0% -20.23% (p=0.000 n=10+10) Aggregator/MIN/hash/decimal/groupSize=2/numInputRows=32-10 102kB ± 0% 82kB ± 0% -20.04% (p=0.000 n=8+10) 
Aggregator/MIN/hash/decimal/groupSize=1/numInputRows=32-10 103kB ± 0% 83kB ± 0% -19.95% (p=0.000 n=7+10) Aggregator/MIN/hash/decimal/groupSize=32/numInputRows=32-10 100kB ± 0% 80kB ± 0% -19.90% (p=0.000 n=10+10) Aggregator/MIN/ordered/decimal/groupSize=32/numInputRows=1048576-10 1.14MB ± 0% 0.92MB ± 0% -18.80% (p=0.000 n=10+10) Aggregator/MIN/hash/decimal/groupSize=1024/numInputRows=1024-10 271kB ± 0% 227kB ± 0% -16.16% (p=0.000 n=9+9) Aggregator/MIN/ordered/decimal/groupSize=2/numInputRows=1048576-10 1.10MB ± 0% 0.92MB ± 0% -15.92% (p=0.000 n=10+10) Aggregator/MIN/hash/decimal/groupSize=128/numInputRows=1024-10 280kB ± 1% 235kB ± 1% -15.91% (p=0.000 n=10+10) Aggregator/MIN/ordered/decimal/groupSize=128/numInputRows=1048576-10 1.09MB ± 1% 0.92MB ± 0% -15.67% (p=0.000 n=10+10) Aggregator/MIN/hash/decimal/groupSize=32/numInputRows=1024-10 291kB ± 0% 245kB ± 1% -15.53% (p=0.000 n=9+10) Aggregator/MIN/hash/decimal/groupSize=1024/numInputRows=32768-10 1.11MB ± 0% 0.95MB ± 0% -15.14% (p=0.000 n=8+10) Aggregator/MIN/hash/decimal/groupSize=128/numInputRows=32768-10 1.22MB ± 0% 1.04MB ± 0% -14.77% (p=0.000 n=10+10) Aggregator/MIN/hash/decimal/groupSize=32/numInputRows=32768-10 1.65MB ± 0% 1.42MB ± 0% -13.56% (p=0.000 n=10+10) Aggregator/MIN/hash/decimal/groupSize=1/numInputRows=1024-10 593kB ± 0% 513kB ± 0% -13.36% (p=0.000 n=9+8) Aggregator/MIN/hash/decimal/groupSize=2/numInputRows=1024-10 520kB ± 0% 454kB ± 0% -12.82% (p=0.000 n=9+8) Aggregator/MIN/ordered/decimal/groupSize=1/numInputRows=1048576-10 1.04MB ± 0% 0.92MB ± 0% -11.06% (p=0.000 n=10+10) Aggregator/MIN/hash/decimal/groupSize=1024/numInputRows=1048576-10 2.48MB ± 0% 2.25MB ± 0% -9.32% (p=0.000 n=9+10) Aggregator/MIN/ordered/decimal/groupSize=1024/numInputRows=1048576-10 967kB ± 0% 881kB ± 0% -8.89% (p=0.000 n=10+10) Aggregator/MIN/hash/decimal/groupSize=128/numInputRows=1048576-10 7.86MB ± 0% 7.36MB ± 0% -6.44% (p=0.000 n=9+10) Aggregator/MIN/hash/decimal/groupSize=1/numInputRows=32768-10 14.2MB ± 1% 13.4MB ± 1% -5.83% (p=0.000 n=10+10) Aggregator/MIN/hash/decimal/groupSize=2/numInputRows=32768-10 12.3MB ± 0% 11.7MB ± 0% -5.03% (p=0.001 n=7+7) Aggregator/MIN/hash/decimal/groupSize=32/numInputRows=1048576-10 27.2MB ± 1% 25.9MB ± 1% -4.84% (p=0.000 n=10+10) Aggregator/MIN/hash/decimal/groupSize=1/numInputRows=1048576-10 465MB ± 0% 445MB ± 0% -4.32% (p=0.000 n=10+10) Aggregator/MIN/hash/decimal/groupSize=2/numInputRows=1048576-10 403MB ± 0% 390MB ± 0% -3.44% (p=0.000 n=10+10) name old allocs/op new allocs/op delta Aggregator/MIN/ordered/decimal/groupSize=1/numInputRows=1024-10 1.07k ± 0% 0.05k ± 0% -95.70% (p=0.000 n=10+10) Aggregator/MIN/hash/decimal/groupSize=1/numInputRows=1048576-10 702k ± 0% 32k ± 0% -95.46% (p=0.000 n=10+10) Aggregator/MIN/hash/decimal/groupSize=2/numInputRows=1048576-10 489k ± 0% 28k ± 0% -94.33% (p=0.000 n=10+10) Aggregator/MIN/ordered/decimal/groupSize=1/numInputRows=32768-10 4.40k ± 0% 0.30k ± 0% -93.15% (p=0.000 n=10+10) Aggregator/MIN/hash/decimal/groupSize=1024/numInputRows=1024-10 1.11k ± 0% 0.09k ± 0% -92.02% (p=0.000 n=9+10) Aggregator/MIN/ordered/decimal/groupSize=2/numInputRows=1024-10 561 ± 0% 46 ± 0% -91.80% (p=0.000 n=10+10) Aggregator/MIN/ordered/decimal/groupSize=2/numInputRows=32768-10 3.45k ± 0% 0.30k ± 0% -91.28% (p=0.000 n=10+10) Aggregator/MIN/hash/decimal/groupSize=128/numInputRows=1024-10 1.19k ± 0% 0.15k ± 1% -87.31% (p=0.000 n=10+10) Aggregator/MIN/hash/decimal/groupSize=1024/numInputRows=32768-10 4.87k ± 0% 0.70k ± 0% -85.69% (p=0.000 n=9+10) 
Aggregator/MIN/hash/decimal/groupSize=1/numInputRows=32768-10 32.2k ± 0% 6.3k ± 0% -80.40% (p=0.000 n=10+10) Aggregator/MIN/ordered/decimal/groupSize=32/numInputRows=32768-10 1.45k ± 3% 0.29k ± 0% -79.66% (p=0.000 n=10+10) Aggregator/MIN/hash/decimal/groupSize=32/numInputRows=1024-10 1.39k ± 0% 0.30k ± 1% -78.64% (p=0.000 n=10+10) Aggregator/MIN/hash/decimal/groupSize=2/numInputRows=32768-10 26.2k ± 0% 6.8k ± 1% -73.95% (p=0.000 n=10+10) Aggregator/MIN/hash/decimal/groupSize=128/numInputRows=32768-10 6.64k ± 0% 1.95k ± 0% -70.67% (p=0.000 n=10+10) Aggregator/MIN/hash/decimal/groupSize=1/numInputRows=1024-10 3.44k ± 1% 1.12k ± 1% -67.48% (p=0.000 n=10+10) Aggregator/MIN/hash/decimal/groupSize=32/numInputRows=1048576-10 62.4k ± 0% 20.4k ± 0% -67.32% (p=0.000 n=10+10) Aggregator/MIN/hash/decimal/groupSize=2/numInputRows=1024-10 2.95k ± 1% 1.05k ± 1% -64.52% (p=0.000 n=9+10) Aggregator/MIN/hash/decimal/groupSize=32/numInputRows=32768-10 10.8k ± 0% 4.5k ± 0% -58.21% (p=0.000 n=10+10) Aggregator/MIN/ordered/decimal/groupSize=128/numInputRows=32768-10 628 ± 3% 294 ± 0% -53.21% (p=0.000 n=10+10) Aggregator/MIN/hash/decimal/groupSize=128/numInputRows=1048576-10 36.1k ± 0% 20.2k ± 0% -44.06% (p=0.000 n=9+10) Aggregator/MIN/ordered/decimal/groupSize=32/numInputRows=1024-10 81.7 ± 3% 46.0 ± 0% -43.67% (p=0.000 n=9+10) Aggregator/MIN/ordered/decimal/groupSize=32/numInputRows=1048576-10 14.4k ± 1% 8.2k ± 0% -42.97% (p=0.000 n=10+10) Aggregator/MIN/ordered/decimal/groupSize=1/numInputRows=32-10 79.0 ± 0% 46.0 ± 0% -41.77% (p=0.000 n=10+10) Aggregator/MIN/ordered/decimal/groupSize=2/numInputRows=1048576-10 13.7k ± 1% 8.2k ± 0% -40.05% (p=0.000 n=10+10) Aggregator/MIN/hash/decimal/groupSize=1/numInputRows=32-10 191 ± 1% 120 ± 1% -37.52% (p=0.000 n=7+10) Aggregator/MIN/ordered/decimal/groupSize=128/numInputRows=1048576-10 12.9k ± 2% 8.2k ± 0% -36.17% (p=0.000 n=10+10) Aggregator/MIN/hash/decimal/groupSize=2/numInputRows=32-10 176 ± 2% 115 ± 1% -34.33% (p=0.000 n=10+9) Aggregator/MIN/ordered/decimal/groupSize=1/numInputRows=1048576-10 12.3k ± 0% 8.2k ± 0% -33.21% (p=0.000 n=10+10) Aggregator/MIN/hash/decimal/groupSize=1024/numInputRows=1048576-10 21.8k ± 0% 15.2k ± 0% -30.13% (p=0.000 n=10+10) Aggregator/MIN/hash/decimal/groupSize=32/numInputRows=32-10 118 ± 0% 84 ± 0% -28.81% (p=0.000 n=10+10) Aggregator/MIN/ordered/decimal/groupSize=2/numInputRows=32-10 63.0 ± 0% 46.0 ± 0% -26.98% (p=0.000 n=10+10) Aggregator/MIN/ordered/decimal/groupSize=128/numInputRows=1024-10 57.2 ±14% 46.0 ± 0% -19.58% (p=0.000 n=10+10) Aggregator/MIN/ordered/decimal/groupSize=1024/numInputRows=1048576-10 9.69k ± 1% 8.23k ± 0% -15.07% (p=0.000 n=10+10) Aggregator/MIN/ordered/decimal/groupSize=1024/numInputRows=32768-10 340 ± 2% 294 ± 0% -13.43% (p=0.000 n=10+10) Aggregator/MIN/ordered/decimal/groupSize=1/numInputRows=1-10 48.0 ± 0% 46.0 ± 0% -4.17% (p=0.000 n=10+10) Aggregator/MIN/ordered/decimal/groupSize=32/numInputRows=32-10 48.0 ± 0% 46.0 ± 0% -4.17% (p=0.000 n=10+10) Aggregator/MIN/ordered/decimal/groupSize=1024/numInputRows=1024-10 48.0 ± 0% 46.0 ± 0% -4.17% (p=0.000 n=10+10) Aggregator/MIN/hash/decimal/groupSize=1/numInputRows=1-10 82.0 ± 0% 79.0 ± 0% -3.66% (p=0.000 n=10+10) ``` Co-authored-by: Nathan VanBenschoten <[email protected]>
Related to #57593. Not a full replacement, but maybe a medium-term mitigation.
This commit updates `defaultColumnFactory.MakeColumn` to batch allocate the coefficients of each decimal in a `Decimals` vector. Each `Decimal` maintains (through an embedded `big.Int`) an internal reference to a variable-length coefficient, which is represented by a `[]big.Word`. This commit attempts to minimize heap allocations by pre-allocating a single large `[]big.Word` and distributing chunks of this slice to each `Decimal` in a `Decimals` vector. The `big.Int` will avoid re-allocating unless it is provided with a coefficient that exceeds the initial capacity. We set this capacity to accommodate any coefficient that would fit in a 64-bit integer (i.e. up to 2^64).
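To see why a pre-sized chunk is enough, here is a small standalone check (a sketch, not code from this PR) showing that `big.Int` reuses an existing backing array with sufficient capacity instead of re-allocating:

```go
package main

import (
	"fmt"
	"math/big"
	"testing"
)

func main() {
	// Seed a big.Int with a zero-length slice that has capacity for one word,
	// mimicking the chunk each Decimal coefficient would receive.
	var x big.Int
	x.SetBits(make([]big.Word, 0, 1))

	// Setting a coefficient that fits within the existing capacity should not
	// allocate: big.Int's internal nat.make reuses the backing array.
	allocs := testing.AllocsPerRun(100, func() {
		x.SetUint64(12345)
	})
	fmt.Println("allocations per SetUint64:", allocs) // expected to print 0
}
```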
On the read-only portion of the TPC-E workload (77% of the full workload, in terms of txn mix), this change has a significant impact on total heap allocations. Before the change, `math/big.nat.make` was responsible for 51.07% of total heap allocations. After the change, `math/big.nat.make` is responsible for only 1.1% of total heap allocations. That equates to roughly a 50% reduction in heap allocations.
The PR also contains a second commit that does something similar for non-vectorized execution. That commit needs more work and can be split from the first commit if we'd like.