colexec: begin to implement flat decimal columns #57593
I fixed up most things since I was curious about the benchmark numbers, and here is what I got (note that I modified the benchmark to use the decimals):
Can you push a patch with your benchmark change too?
Sure. I switched one place from value to pointer and am rerunning the benchmarks.
That didn't help (benchmarks only of the switching commit):
This commit is a hybrid, where we store a slice of
This commit changes the representation of the Decimals column in the colexec package to be a wrapped flat Bytes representation. Instead of storing values of apd.Decimal in a slice (which contain heap pointers), we now store a serialized form of apd.Decimal in a flat bytes slice, without any heap pointers. Then, to access the apd.Decimals at runtime, we "deserialize" them in a close to zero-copy fashion, inflating an apd.Decimal by pointing its internal varlen Coeff field directly at the serialized bytes.

Release note: None
The latest commit, with a completely flat representation, has the following benchmark results:
* origin/master: (113 commits)
  update external contributor hall of fame
  ccl/sqlproxyccl: idle connection timout support
  builtins: add fuzzystrmatch soundex and difference builtin functions
  sql,log: productionize the event logging
  kv: fix a snapshot error test matcher
  server: fix decomm status replica overcounting for r1
  sql: fix bug allowing FKs referencing columns with no unique constraint
  sql: fix bug preventing rollback of ALTER TABLE ADD FOREIGN KEY
  pkg/sql: implement levenshtein
  sql: populated pg_depend with table and view dependencies
  sql: use constraint name when adding a primary key constraint
  bazel: generate a few SQL files within the sandbox
  execinfrapb: include complete component ID in stats proto
  build/teamcity-support.sh: re-instate the github-post install
  kvserver: disallow racing replicate queue during tests
  ui: DB Console branding refresh
  sql/catalog/descs: return appropriate type from Get(Table|Type)ByName
  sql: add sql.trace.stmt.enable_threshold
  build: tweak the TC test runner to detect package fails
  sql: add unique constraints to table descriptor for UNIQUE WITHOUT INDEX
  ...
* origin/master:
  colexec: remove almost all usages of execgen.SLICE
  sql: initial support for virtual columns
  kv/kvserver: skip TestReplicateAfterTruncation
  optbuilder: reduce redundant building of arbiter filter expressions
  opt: build all partial index predicate expressions in TableMeta
74590: colexec: integrate flat, compact decimal datums r=nvanbenschoten a=nvanbenschoten

Replaces #74369 and #57593.

This PR picks up the following changes to `cockroachdb/apd`:
- cockroachdb/apd#103
- cockroachdb/apd#104
- cockroachdb/apd#107
- cockroachdb/apd#108
- cockroachdb/apd#109
- cockroachdb/apd#110
- cockroachdb/apd#111

Release note (performance improvement): The memory representation of DECIMAL datums has been optimized to save space, avoid heap allocations, and eliminate indirection. This increases the speed of DECIMAL arithmetic and aggregation by up to 20% on large data sets.

----

At a high-level, those changes implement the "compact memory representation" for Decimals described in cockroachdb/apd#102 (comment) and later implemented in cockroachdb/apd#103.

Compared to the approach on master, the approach in cockroachdb/apd#103 is a) faster, b) avoids indirection + heap allocation, c) smaller.

Compared to the alternate approach in cockroachdb/apd#102, the approach in cockroachdb/apd#103 is a) [faster for most operations](cockroachdb/apd#102 (comment)), b) more usable because values can be safely copied, c) half the memory size (32 bytes per `Decimal`, vs. 64).

The memory representation of the Decimal struct in this approach looks like:

```go
type Decimal struct {
    Form     int8
    Negative bool
    Exponent int32
    Coeff    BigInt {
        _inner  *big.Int // nil when value fits in _inline
        _inline [2]uint
    }
} // sizeof = 32
```

With a two-word inline array, any value that would fit in a 128-bit integer (i.e. decimals with a scale-adjusted absolute value up to 2^128 - 1) fit in `_inline`. The indirection through `_inner` is only used for values larger than this.

Before this change, the memory representation of the `Decimal` struct looked like:

```go
type Decimal struct {
    Form     int64
    Negative bool
    Exponent int32
    Coeff    big.Int {
        neg bool
        abs []big.Word {
            data uintptr ---------------.
            len  int64                  v
            cap  int64         [uint, uint, ...] // sizeof = variable, but around cap = 4, so 32 bytes
        }
    }
} // sizeof = 48 flat bytes + variable-length heap allocated array
```

----

## Performance impact

### Speedup on TPC-DS dataset

The TPC-DS dataset is full of decimal columns, so it's a good playground to test this change. Unfortunately, the variance in the runtime performance of the TPC-DS queries themselves is high (many queries varied by 30-40% per attempt), so it was hard to get signal out of them. Instead, I imported the TPC-DS dataset with a scale factor of 10 and ran some custom aggregation queries against the largest table (`web_sales`, row count = 7,197,566):

Queries

```sql
# q1
select sum(ws_wholesale_cost + ws_ext_list_price) from web_sales;

# q2
select sum(2 * ws_wholesale_cost + ws_ext_list_price) - max(4 * ws_ext_ship_cost), min(ws_net_profit) from web_sales;

# q3
select max(ws_bill_customer_sk + ws_bill_cdemo_sk + ws_bill_hdemo_sk + ws_bill_addr_sk + ws_ship_customer_sk + ws_ship_cdemo_sk + ws_ship_hdemo_sk + ws_ship_addr_sk + ws_web_page_sk + ws_web_site_sk + ws_ship_mode_sk + ws_warehouse_sk + ws_promo_sk + ws_order_number + ws_quantity + ws_wholesale_cost + ws_list_price + ws_sales_price + ws_ext_discount_amt + ws_ext_sales_price + ws_ext_wholesale_cost + ws_ext_list_price + ws_ext_tax + ws_coupon_amt + ws_ext_ship_cost + ws_net_paid + ws_net_paid_inc_tax + ws_net_paid_inc_ship + ws_net_paid_inc_ship_tax + ws_net_profit) from web_sales;
```

Here's the difference in runtime of these three queries before and after this change on an `n2-standard-4` instance:

```
name              old s/op   new s/op   delta
TPC-DS/custom/q1  7.21 ± 3%  6.59 ± 0%   -8.57%  (p=0.000 n=10+10)
TPC-DS/custom/q2  10.2 ± 0%   9.7 ± 3%   -5.42%  (p=0.000 n=10+10)
TPC-DS/custom/q3  21.9 ± 1%  17.3 ± 0%  -21.13%  (p=0.000 n=10+10)
```

### Heap allocation reduction in TPC-DS

Part of the reason for this speedup was that it significantly reduces heap allocations because most decimal values are stored inline. We can see this in q3 from above.
Before the change, a heap profile looks like:

<img width="1751" alt="Screen Shot 2022-01-07 at 7 12 49 PM" src="https://user-images.githubusercontent.com/5438456/148625159-9ceb470a-0742-4f75-a533-530d9944143c.png">

After the change, a heap profile looks like:

<img width="1749" alt="Screen Shot 2022-01-07 at 7 17 32 PM" src="https://user-images.githubusercontent.com/5438456/148625174-629f4b47-07cc-4ef6-8723-2e556f7fc00d.png">

_(the dominant source of heap allocations is now `coldata.(*Nulls).Or`. #74592 should help here)_

### Heap allocation reduction in TPC-E

On the read-only portion of the TPC-E (77% of the full workload, in terms of txn mix), this change has a significant impact on total heap allocations.

Before the change, `math/big.nat.make` was responsible for **51.07%** of total heap allocations:

<img width="1587" alt="Screen Shot 2021-12-31 at 8 01 00 PM" src="https://user-images.githubusercontent.com/5438456/147842722-965d649d-b29a-4f66-aa07-1b05e52e97af.png">

After the change, `math/big.nat.make` is responsible for only **1.1%** of total heap allocations:

<img width="1580" alt="Screen Shot 2021-12-31 at 9 04 24 PM" src="https://user-images.githubusercontent.com/5438456/147842727-a881a5a3-d038-48bb-bd44-4ade665afe73.png">

That equates to roughly a **50%** reduction in heap allocations.
### Microbenchmarks ``` name old time/op new time/op delta Aggregator/MIN/ordered/decimal/groupSize=1/numInputRows=1024-10 65.6µs ± 2% 42.5µs ± 0% -35.15% (p=0.000 n=9+8) Aggregator/MIN/ordered/decimal/groupSize=2/numInputRows=1024-10 68.4µs ± 1% 48.4µs ± 1% -29.20% (p=0.000 n=9+10) Aggregator/MIN/ordered/decimal/groupSize=1/numInputRows=32768-10 1.65ms ± 1% 1.20ms ± 1% -27.31% (p=0.000 n=9+10) Aggregator/MIN/ordered/decimal/groupSize=1/numInputRows=1048576-10 51.4ms ± 1% 38.3ms ± 1% -25.59% (p=0.000 n=10+10) Aggregator/MIN/ordered/decimal/groupSize=1/numInputRows=32-10 12.5µs ± 1% 9.4µs ± 2% -24.72% (p=0.000 n=9+10) Aggregator/MIN/ordered/decimal/groupSize=2/numInputRows=32-10 12.5µs ± 1% 9.6µs ± 2% -23.24% (p=0.000 n=8+10) Aggregator/MIN/ordered/decimal/groupSize=1/numInputRows=1-10 10.5µs ± 1% 8.0µs ± 1% -23.22% (p=0.000 n=9+9) Aggregator/MIN/ordered/decimal/groupSize=32/numInputRows=32-10 12.4µs ± 1% 9.6µs ± 1% -22.70% (p=0.000 n=8+10) Aggregator/MIN/ordered/decimal/groupSize=1024/numInputRows=1024-10 60.5µs ± 1% 47.1µs ± 2% -22.24% (p=0.000 n=10+10) Aggregator/MIN/ordered/decimal/groupSize=128/numInputRows=1024-10 61.2µs ± 1% 47.7µs ± 1% -22.09% (p=0.000 n=9+10) Aggregator/MIN/ordered/decimal/groupSize=32/numInputRows=1024-10 62.3µs ± 1% 48.7µs ± 2% -21.91% (p=0.000 n=9+10) Aggregator/MIN/ordered/decimal/groupSize=2/numInputRows=32768-10 1.31ms ± 0% 1.03ms ± 1% -21.53% (p=0.000 n=9+10) Aggregator/MIN/hash/decimal/groupSize=1024/numInputRows=1024-10 82.3µs ± 1% 64.9µs ± 1% -21.12% (p=0.000 n=10+10) Aggregator/MIN/hash/decimal/groupSize=128/numInputRows=1024-10 86.6µs ± 1% 68.5µs ± 1% -20.93% (p=0.000 n=10+10) Aggregator/MIN/hash/decimal/groupSize=32/numInputRows=1024-10 96.0µs ± 1% 77.1µs ± 1% -19.73% (p=0.000 n=10+10) Aggregator/MIN/ordered/decimal/groupSize=2/numInputRows=1048576-10 41.2ms ± 0% 33.1ms ± 0% -19.64% (p=0.000 n=8+10) Aggregator/MIN/hash/decimal/groupSize=32/numInputRows=32-10 17.5µs ± 1% 14.3µs ± 2% -18.59% (p=0.000 n=9+10) 
Aggregator/MIN/hash/decimal/groupSize=1/numInputRows=1-10 14.8µs ± 3% 12.1µs ± 3% -18.26% (p=0.000 n=10+10) Aggregator/MIN/hash/decimal/groupSize=2/numInputRows=32-10 20.0µs ± 1% 16.4µs ± 1% -18.04% (p=0.000 n=9+9) Aggregator/MIN/hash/decimal/groupSize=1/numInputRows=32-10 20.9µs ± 1% 17.2µs ± 3% -17.80% (p=0.000 n=9+10) Aggregator/MIN/ordered/decimal/groupSize=1024/numInputRows=32768-10 884µs ± 0% 731µs ± 0% -17.30% (p=0.000 n=10+9) Aggregator/MIN/ordered/decimal/groupSize=1024/numInputRows=1048576-10 27.9ms ± 0% 23.1ms ± 0% -17.27% (p=0.000 n=9+9) Aggregator/MIN/hash/decimal/groupSize=1/numInputRows=1024-10 218µs ± 2% 181µs ± 2% -17.23% (p=0.000 n=10+10) Aggregator/MIN/ordered/decimal/groupSize=128/numInputRows=32768-10 911µs ± 1% 755µs ± 1% -17.10% (p=0.000 n=10+10) Aggregator/MIN/ordered/decimal/groupSize=32/numInputRows=32768-10 957µs ± 1% 798µs ± 0% -16.66% (p=0.000 n=9+9) Aggregator/MIN/hash/decimal/groupSize=1024/numInputRows=32768-10 1.54ms ± 1% 1.29ms ± 1% -16.56% (p=0.000 n=10+10) Aggregator/MIN/hash/decimal/groupSize=2/numInputRows=1024-10 188µs ± 1% 157µs ± 2% -16.33% (p=0.000 n=10+10) Aggregator/MIN/ordered/decimal/groupSize=128/numInputRows=1048576-10 28.8ms ± 0% 24.1ms ± 0% -16.14% (p=0.000 n=9+9) Aggregator/MIN/ordered/decimal/groupSize=32/numInputRows=1048576-10 30.4ms ± 0% 25.7ms ± 1% -15.26% (p=0.000 n=10+10) Aggregator/MIN/hash/decimal/groupSize=32/numInputRows=1048576-10 135ms ± 1% 114ms ± 1% -15.21% (p=0.000 n=10+9) Aggregator/MIN/hash/decimal/groupSize=128/numInputRows=32768-10 1.79ms ± 1% 1.52ms ± 1% -15.14% (p=0.000 n=10+10) Aggregator/MIN/hash/decimal/groupSize=1/numInputRows=32768-10 6.29ms ± 1% 5.50ms ± 1% -12.62% (p=0.000 n=10+10) Aggregator/MIN/hash/decimal/groupSize=1024/numInputRows=1048576-10 62.2ms ± 0% 54.7ms ± 0% -12.08% (p=0.000 n=10+10) Aggregator/MIN/hash/decimal/groupSize=32/numInputRows=32768-10 2.46ms ± 1% 2.17ms ± 1% -11.88% (p=0.000 n=10+9) Aggregator/MIN/hash/decimal/groupSize=2/numInputRows=32768-10 5.64ms ± 0% 4.98ms 
± 0% -11.76% (p=0.000 n=9+10) Aggregator/MIN/hash/decimal/groupSize=1/numInputRows=1048576-10 354ms ± 2% 318ms ± 1% -10.18% (p=0.000 n=10+8) Aggregator/MIN/hash/decimal/groupSize=128/numInputRows=1048576-10 91.8ms ± 1% 83.3ms ± 0% -9.25% (p=0.000 n=9+10) Aggregator/MIN/hash/decimal/groupSize=2/numInputRows=1048576-10 396ms ± 1% 369ms ± 1% -6.83% (p=0.000 n=8+8) name old speed new speed delta Aggregator/MIN/ordered/decimal/groupSize=1/numInputRows=1024-10 125MB/s ± 2% 193MB/s ± 0% +54.20% (p=0.000 n=9+8) Aggregator/MIN/ordered/decimal/groupSize=2/numInputRows=1024-10 120MB/s ± 1% 169MB/s ± 1% +41.24% (p=0.000 n=9+10) Aggregator/MIN/ordered/decimal/groupSize=1/numInputRows=32768-10 159MB/s ± 1% 219MB/s ± 1% +37.57% (p=0.000 n=9+10) Aggregator/MIN/ordered/decimal/groupSize=1/numInputRows=1048576-10 163MB/s ± 1% 219MB/s ± 1% +34.39% (p=0.000 n=10+10) Aggregator/MIN/ordered/decimal/groupSize=1/numInputRows=32-10 20.4MB/s ± 1% 27.2MB/s ± 2% +32.85% (p=0.000 n=9+10) Aggregator/MIN/ordered/decimal/groupSize=1/numInputRows=1-10 764kB/s ± 2% 997kB/s ± 1% +30.45% (p=0.000 n=10+9) Aggregator/MIN/ordered/decimal/groupSize=2/numInputRows=32-10 20.5MB/s ± 1% 26.8MB/s ± 2% +30.28% (p=0.000 n=8+10) Aggregator/MIN/ordered/decimal/groupSize=32/numInputRows=32-10 20.7MB/s ± 1% 26.8MB/s ± 1% +29.37% (p=0.000 n=8+10) Aggregator/MIN/ordered/decimal/groupSize=1024/numInputRows=1024-10 135MB/s ± 1% 174MB/s ± 2% +28.61% (p=0.000 n=10+10) Aggregator/MIN/ordered/decimal/groupSize=128/numInputRows=1024-10 134MB/s ± 1% 172MB/s ± 1% +28.35% (p=0.000 n=9+10) Aggregator/MIN/ordered/decimal/groupSize=32/numInputRows=1024-10 131MB/s ± 1% 168MB/s ± 2% +28.06% (p=0.000 n=9+10) Aggregator/MIN/ordered/decimal/groupSize=2/numInputRows=32768-10 200MB/s ± 0% 255MB/s ± 1% +27.45% (p=0.000 n=9+10) Aggregator/MIN/hash/decimal/groupSize=1024/numInputRows=1024-10 100MB/s ± 1% 126MB/s ± 1% +26.78% (p=0.000 n=10+10) Aggregator/MIN/hash/decimal/groupSize=128/numInputRows=1024-10 94.6MB/s ± 1% 119.6MB/s ± 1% 
+26.47% (p=0.000 n=10+10) Aggregator/MIN/hash/decimal/groupSize=32/numInputRows=1024-10 85.3MB/s ± 1% 106.3MB/s ± 1% +24.58% (p=0.000 n=10+10) Aggregator/MIN/ordered/decimal/groupSize=2/numInputRows=1048576-10 204MB/s ± 0% 254MB/s ± 0% +24.44% (p=0.000 n=8+10) Aggregator/MIN/hash/decimal/groupSize=32/numInputRows=32-10 14.6MB/s ± 1% 18.0MB/s ± 2% +22.83% (p=0.000 n=9+10) Aggregator/MIN/hash/decimal/groupSize=1/numInputRows=1-10 544kB/s ± 3% 664kB/s ± 2% +22.06% (p=0.000 n=10+10) Aggregator/MIN/hash/decimal/groupSize=2/numInputRows=32-10 12.8MB/s ± 1% 15.6MB/s ± 1% +22.02% (p=0.000 n=9+9) Aggregator/MIN/hash/decimal/groupSize=1/numInputRows=32-10 12.3MB/s ± 1% 14.9MB/s ± 3% +21.67% (p=0.000 n=9+10) Aggregator/MIN/ordered/decimal/groupSize=1024/numInputRows=32768-10 296MB/s ± 0% 358MB/s ± 0% +20.92% (p=0.000 n=10+9) Aggregator/MIN/ordered/decimal/groupSize=1024/numInputRows=1048576-10 300MB/s ± 0% 363MB/s ± 0% +20.87% (p=0.000 n=9+9) Aggregator/MIN/hash/decimal/groupSize=1/numInputRows=1024-10 37.5MB/s ± 2% 45.4MB/s ± 2% +20.82% (p=0.000 n=10+10) Aggregator/MIN/ordered/decimal/groupSize=128/numInputRows=32768-10 288MB/s ± 1% 347MB/s ± 1% +20.62% (p=0.000 n=10+10) Aggregator/MIN/ordered/decimal/groupSize=32/numInputRows=32768-10 274MB/s ± 1% 329MB/s ± 0% +19.99% (p=0.000 n=9+9) Aggregator/MIN/hash/decimal/groupSize=1024/numInputRows=32768-10 170MB/s ± 1% 204MB/s ± 1% +19.85% (p=0.000 n=10+10) Aggregator/MIN/hash/decimal/groupSize=2/numInputRows=1024-10 43.6MB/s ± 1% 52.1MB/s ± 2% +19.52% (p=0.000 n=10+10) Aggregator/MIN/ordered/decimal/groupSize=128/numInputRows=1048576-10 292MB/s ± 0% 348MB/s ± 0% +19.25% (p=0.000 n=9+9) Aggregator/MIN/ordered/decimal/groupSize=32/numInputRows=1048576-10 276MB/s ± 0% 326MB/s ± 1% +18.00% (p=0.000 n=10+10) Aggregator/MIN/hash/decimal/groupSize=32/numInputRows=1048576-10 62.1MB/s ± 1% 73.3MB/s ± 1% +17.94% (p=0.000 n=10+9) Aggregator/MIN/hash/decimal/groupSize=128/numInputRows=32768-10 147MB/s ± 1% 173MB/s ± 1% +17.83% (p=0.000 
n=10+10) Aggregator/MIN/hash/decimal/groupSize=1/numInputRows=32768-10 41.7MB/s ± 1% 47.7MB/s ± 1% +14.44% (p=0.000 n=10+10) Aggregator/MIN/hash/decimal/groupSize=1024/numInputRows=1048576-10 135MB/s ± 0% 153MB/s ± 0% +13.74% (p=0.000 n=10+10) Aggregator/MIN/hash/decimal/groupSize=32/numInputRows=32768-10 106MB/s ± 1% 121MB/s ± 1% +13.48% (p=0.000 n=10+9) Aggregator/MIN/hash/decimal/groupSize=2/numInputRows=32768-10 46.5MB/s ± 0% 52.7MB/s ± 0% +13.34% (p=0.000 n=9+10) Aggregator/MIN/hash/decimal/groupSize=1/numInputRows=1048576-10 23.7MB/s ± 2% 26.3MB/s ± 2% +11.02% (p=0.000 n=10+9) Aggregator/MIN/hash/decimal/groupSize=128/numInputRows=1048576-10 91.3MB/s ± 0% 100.7MB/s ± 0% +10.27% (p=0.000 n=8+10) Aggregator/MIN/hash/decimal/groupSize=2/numInputRows=1048576-10 21.2MB/s ± 1% 22.7MB/s ± 1% +7.32% (p=0.000 n=8+8) name old alloc/op new alloc/op delta Aggregator/MIN/ordered/decimal/groupSize=1/numInputRows=32768-10 354kB ± 0% 239kB ± 0% -32.39% (p=0.000 n=9+9) Aggregator/MIN/ordered/decimal/groupSize=2/numInputRows=32768-10 348kB ± 0% 239kB ± 0% -31.23% (p=0.000 n=10+10) Aggregator/MIN/ordered/decimal/groupSize=1/numInputRows=1024-10 251kB ± 0% 177kB ± 0% -29.44% (p=0.000 n=10+10) Aggregator/MIN/ordered/decimal/groupSize=2/numInputRows=1024-10 246kB ± 0% 177kB ± 0% -28.28% (p=0.000 n=10+10) Aggregator/MIN/ordered/decimal/groupSize=32/numInputRows=32768-10 275kB ± 0% 198kB ± 0% -28.06% (p=0.000 n=10+10) Aggregator/MIN/ordered/decimal/groupSize=32/numInputRows=1024-10 243kB ± 0% 177kB ± 0% -27.15% (p=0.000 n=9+10) Aggregator/MIN/ordered/decimal/groupSize=128/numInputRows=1024-10 242kB ± 0% 177kB ± 0% -27.09% (p=0.000 n=10+10) Aggregator/MIN/ordered/decimal/groupSize=1024/numInputRows=1024-10 242kB ± 0% 177kB ± 0% -27.06% (p=0.000 n=10+10) Aggregator/MIN/ordered/decimal/groupSize=128/numInputRows=32768-10 268kB ± 0% 198kB ± 0% -26.05% (p=0.000 n=10+10) Aggregator/MIN/ordered/decimal/groupSize=1024/numInputRows=32768-10 264kB ± 0% 198kB ± 0% -25.04% (p=0.000 n=10+10) 
Aggregator/MIN/ordered/decimal/groupSize=1/numInputRows=32-10 75.1kB ± 0% 56.9kB ± 0% -24.25% (p=0.000 n=10+10) Aggregator/MIN/ordered/decimal/groupSize=2/numInputRows=32-10 74.9kB ± 0% 56.9kB ± 0% -24.12% (p=0.000 n=10+10) Aggregator/MIN/ordered/decimal/groupSize=32/numInputRows=32-10 74.8kB ± 0% 56.9kB ± 0% -23.99% (p=0.000 n=10+10) Aggregator/MIN/ordered/decimal/groupSize=1/numInputRows=1-10 69.6kB ± 0% 53.1kB ± 0% -23.66% (p=0.000 n=10+10) Aggregator/MIN/hash/decimal/groupSize=1/numInputRows=1-10 95.2kB ± 0% 75.9kB ± 0% -20.23% (p=0.000 n=10+10) Aggregator/MIN/hash/decimal/groupSize=2/numInputRows=32-10 102kB ± 0% 82kB ± 0% -20.04% (p=0.000 n=8+10) Aggregator/MIN/hash/decimal/groupSize=1/numInputRows=32-10 103kB ± 0% 83kB ± 0% -19.95% (p=0.000 n=7+10) Aggregator/MIN/hash/decimal/groupSize=32/numInputRows=32-10 100kB ± 0% 80kB ± 0% -19.90% (p=0.000 n=10+10) Aggregator/MIN/ordered/decimal/groupSize=32/numInputRows=1048576-10 1.14MB ± 0% 0.92MB ± 0% -18.80% (p=0.000 n=10+10) Aggregator/MIN/hash/decimal/groupSize=1024/numInputRows=1024-10 271kB ± 0% 227kB ± 0% -16.16% (p=0.000 n=9+9) Aggregator/MIN/ordered/decimal/groupSize=2/numInputRows=1048576-10 1.10MB ± 0% 0.92MB ± 0% -15.92% (p=0.000 n=10+10) Aggregator/MIN/hash/decimal/groupSize=128/numInputRows=1024-10 280kB ± 1% 235kB ± 1% -15.91% (p=0.000 n=10+10) Aggregator/MIN/ordered/decimal/groupSize=128/numInputRows=1048576-10 1.09MB ± 1% 0.92MB ± 0% -15.67% (p=0.000 n=10+10) Aggregator/MIN/hash/decimal/groupSize=32/numInputRows=1024-10 291kB ± 0% 245kB ± 1% -15.53% (p=0.000 n=9+10) Aggregator/MIN/hash/decimal/groupSize=1024/numInputRows=32768-10 1.11MB ± 0% 0.95MB ± 0% -15.14% (p=0.000 n=8+10) Aggregator/MIN/hash/decimal/groupSize=128/numInputRows=32768-10 1.22MB ± 0% 1.04MB ± 0% -14.77% (p=0.000 n=10+10) Aggregator/MIN/hash/decimal/groupSize=32/numInputRows=32768-10 1.65MB ± 0% 1.42MB ± 0% -13.56% (p=0.000 n=10+10) Aggregator/MIN/hash/decimal/groupSize=1/numInputRows=1024-10 593kB ± 0% 513kB ± 0% -13.36% (p=0.000 
n=9+8) Aggregator/MIN/hash/decimal/groupSize=2/numInputRows=1024-10 520kB ± 0% 454kB ± 0% -12.82% (p=0.000 n=9+8) Aggregator/MIN/ordered/decimal/groupSize=1/numInputRows=1048576-10 1.04MB ± 0% 0.92MB ± 0% -11.06% (p=0.000 n=10+10) Aggregator/MIN/hash/decimal/groupSize=1024/numInputRows=1048576-10 2.48MB ± 0% 2.25MB ± 0% -9.32% (p=0.000 n=9+10) Aggregator/MIN/ordered/decimal/groupSize=1024/numInputRows=1048576-10 967kB ± 0% 881kB ± 0% -8.89% (p=0.000 n=10+10) Aggregator/MIN/hash/decimal/groupSize=128/numInputRows=1048576-10 7.86MB ± 0% 7.36MB ± 0% -6.44% (p=0.000 n=9+10) Aggregator/MIN/hash/decimal/groupSize=1/numInputRows=32768-10 14.2MB ± 1% 13.4MB ± 1% -5.83% (p=0.000 n=10+10) Aggregator/MIN/hash/decimal/groupSize=2/numInputRows=32768-10 12.3MB ± 0% 11.7MB ± 0% -5.03% (p=0.001 n=7+7) Aggregator/MIN/hash/decimal/groupSize=32/numInputRows=1048576-10 27.2MB ± 1% 25.9MB ± 1% -4.84% (p=0.000 n=10+10) Aggregator/MIN/hash/decimal/groupSize=1/numInputRows=1048576-10 465MB ± 0% 445MB ± 0% -4.32% (p=0.000 n=10+10) Aggregator/MIN/hash/decimal/groupSize=2/numInputRows=1048576-10 403MB ± 0% 390MB ± 0% -3.44% (p=0.000 n=10+10) name old allocs/op new allocs/op delta Aggregator/MIN/ordered/decimal/groupSize=1/numInputRows=1024-10 1.07k ± 0% 0.05k ± 0% -95.70% (p=0.000 n=10+10) Aggregator/MIN/hash/decimal/groupSize=1/numInputRows=1048576-10 702k ± 0% 32k ± 0% -95.46% (p=0.000 n=10+10) Aggregator/MIN/hash/decimal/groupSize=2/numInputRows=1048576-10 489k ± 0% 28k ± 0% -94.33% (p=0.000 n=10+10) Aggregator/MIN/ordered/decimal/groupSize=1/numInputRows=32768-10 4.40k ± 0% 0.30k ± 0% -93.15% (p=0.000 n=10+10) Aggregator/MIN/hash/decimal/groupSize=1024/numInputRows=1024-10 1.11k ± 0% 0.09k ± 0% -92.02% (p=0.000 n=9+10) Aggregator/MIN/ordered/decimal/groupSize=2/numInputRows=1024-10 561 ± 0% 46 ± 0% -91.80% (p=0.000 n=10+10) Aggregator/MIN/ordered/decimal/groupSize=2/numInputRows=32768-10 3.45k ± 0% 0.30k ± 0% -91.28% (p=0.000 n=10+10) 
Aggregator/MIN/hash/decimal/groupSize=128/numInputRows=1024-10 1.19k ± 0% 0.15k ± 1% -87.31% (p=0.000 n=10+10) Aggregator/MIN/hash/decimal/groupSize=1024/numInputRows=32768-10 4.87k ± 0% 0.70k ± 0% -85.69% (p=0.000 n=9+10) Aggregator/MIN/hash/decimal/groupSize=1/numInputRows=32768-10 32.2k ± 0% 6.3k ± 0% -80.40% (p=0.000 n=10+10) Aggregator/MIN/ordered/decimal/groupSize=32/numInputRows=32768-10 1.45k ± 3% 0.29k ± 0% -79.66% (p=0.000 n=10+10) Aggregator/MIN/hash/decimal/groupSize=32/numInputRows=1024-10 1.39k ± 0% 0.30k ± 1% -78.64% (p=0.000 n=10+10) Aggregator/MIN/hash/decimal/groupSize=2/numInputRows=32768-10 26.2k ± 0% 6.8k ± 1% -73.95% (p=0.000 n=10+10) Aggregator/MIN/hash/decimal/groupSize=128/numInputRows=32768-10 6.64k ± 0% 1.95k ± 0% -70.67% (p=0.000 n=10+10) Aggregator/MIN/hash/decimal/groupSize=1/numInputRows=1024-10 3.44k ± 1% 1.12k ± 1% -67.48% (p=0.000 n=10+10) Aggregator/MIN/hash/decimal/groupSize=32/numInputRows=1048576-10 62.4k ± 0% 20.4k ± 0% -67.32% (p=0.000 n=10+10) Aggregator/MIN/hash/decimal/groupSize=2/numInputRows=1024-10 2.95k ± 1% 1.05k ± 1% -64.52% (p=0.000 n=9+10) Aggregator/MIN/hash/decimal/groupSize=32/numInputRows=32768-10 10.8k ± 0% 4.5k ± 0% -58.21% (p=0.000 n=10+10) Aggregator/MIN/ordered/decimal/groupSize=128/numInputRows=32768-10 628 ± 3% 294 ± 0% -53.21% (p=0.000 n=10+10) Aggregator/MIN/hash/decimal/groupSize=128/numInputRows=1048576-10 36.1k ± 0% 20.2k ± 0% -44.06% (p=0.000 n=9+10) Aggregator/MIN/ordered/decimal/groupSize=32/numInputRows=1024-10 81.7 ± 3% 46.0 ± 0% -43.67% (p=0.000 n=9+10) Aggregator/MIN/ordered/decimal/groupSize=32/numInputRows=1048576-10 14.4k ± 1% 8.2k ± 0% -42.97% (p=0.000 n=10+10) Aggregator/MIN/ordered/decimal/groupSize=1/numInputRows=32-10 79.0 ± 0% 46.0 ± 0% -41.77% (p=0.000 n=10+10) Aggregator/MIN/ordered/decimal/groupSize=2/numInputRows=1048576-10 13.7k ± 1% 8.2k ± 0% -40.05% (p=0.000 n=10+10) Aggregator/MIN/hash/decimal/groupSize=1/numInputRows=32-10 191 ± 1% 120 ± 1% -37.52% (p=0.000 n=7+10) 
Aggregator/MIN/ordered/decimal/groupSize=128/numInputRows=1048576-10 12.9k ± 2% 8.2k ± 0% -36.17% (p=0.000 n=10+10) Aggregator/MIN/hash/decimal/groupSize=2/numInputRows=32-10 176 ± 2% 115 ± 1% -34.33% (p=0.000 n=10+9) Aggregator/MIN/ordered/decimal/groupSize=1/numInputRows=1048576-10 12.3k ± 0% 8.2k ± 0% -33.21% (p=0.000 n=10+10) Aggregator/MIN/hash/decimal/groupSize=1024/numInputRows=1048576-10 21.8k ± 0% 15.2k ± 0% -30.13% (p=0.000 n=10+10) Aggregator/MIN/hash/decimal/groupSize=32/numInputRows=32-10 118 ± 0% 84 ± 0% -28.81% (p=0.000 n=10+10) Aggregator/MIN/ordered/decimal/groupSize=2/numInputRows=32-10 63.0 ± 0% 46.0 ± 0% -26.98% (p=0.000 n=10+10) Aggregator/MIN/ordered/decimal/groupSize=128/numInputRows=1024-10 57.2 ±14% 46.0 ± 0% -19.58% (p=0.000 n=10+10) Aggregator/MIN/ordered/decimal/groupSize=1024/numInputRows=1048576-10 9.69k ± 1% 8.23k ± 0% -15.07% (p=0.000 n=10+10) Aggregator/MIN/ordered/decimal/groupSize=1024/numInputRows=32768-10 340 ± 2% 294 ± 0% -13.43% (p=0.000 n=10+10) Aggregator/MIN/ordered/decimal/groupSize=1/numInputRows=1-10 48.0 ± 0% 46.0 ± 0% -4.17% (p=0.000 n=10+10) Aggregator/MIN/ordered/decimal/groupSize=32/numInputRows=32-10 48.0 ± 0% 46.0 ± 0% -4.17% (p=0.000 n=10+10) Aggregator/MIN/ordered/decimal/groupSize=1024/numInputRows=1024-10 48.0 ± 0% 46.0 ± 0% -4.17% (p=0.000 n=10+10) Aggregator/MIN/hash/decimal/groupSize=1/numInputRows=1-10 82.0 ± 0% 79.0 ± 0% -3.66% (p=0.000 n=10+10) ``` Co-authored-by: Nathan VanBenschoten <[email protected]>
This is 1 year late but thanks for taking this over the finish line @nvanbenschoten!