Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
apacheGH-44084: [C++] Improve merge step in chunked sorting (apache#4…
…4217) ### Rationale for this change When merge-sorting the chunks of a chunked array or table, we would currently repeatedly resolve the chunk indices for each individual value lookup. This requires `O(n*log k)` chunk resolutions with `n` being the chunked array or table length, and `k` the number of chunks. Instead, this PR translates the logical indices to physical all at once, without even requiring expensive chunk resolution as the logical indices are initially chunk-partitioned. This change yields significant speedups on chunked array and table sorting: ``` benchmark baseline contender change % counters ChunkedArraySortIndicesInt64Narrow/1048576/100 345.419 MiB/sec 628.334 MiB/sec 81.905 {'family_index': 0, 'per_family_instance_index': 6, 'run_name': 'ChunkedArraySortIndicesInt64Narrow/1048576/100', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 242, 'null_percent': 1.0} TableSortIndicesInt64Narrow/1048576/0/1/32 25.997M items/sec 44.550M items/sec 71.366 {'family_index': 3, 'per_family_instance_index': 11, 'run_name': 'TableSortIndicesInt64Narrow/1048576/0/1/32', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 17, 'chunks': 32.0, 'columns': 1.0, 'null_percent': 0.0} ChunkedArraySortIndicesInt64Wide/32768/10000 91.182 MiB/sec 153.756 MiB/sec 68.625 {'family_index': 1, 'per_family_instance_index': 0, 'run_name': 'ChunkedArraySortIndicesInt64Wide/32768/10000', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 2067, 'null_percent': 0.01} ChunkedArraySortIndicesInt64Wide/32768/10 96.536 MiB/sec 161.648 MiB/sec 67.449 {'family_index': 1, 'per_family_instance_index': 2, 'run_name': 'ChunkedArraySortIndicesInt64Wide/32768/10', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 2238, 'null_percent': 10.0} TableSortIndicesInt64Narrow/1048576/100/1/32 24.290M items/sec 40.513M items/sec 66.791 {'family_index': 3, 'per_family_instance_index': 9, 'run_name': 'TableSortIndicesInt64Narrow/1048576/100/1/32', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 16, 'chunks': 32.0, 'columns': 1.0, 'null_percent': 1.0} ChunkedArraySortIndicesInt64Wide/32768/100 90.030 MiB/sec 149.633 MiB/sec 66.203 {'family_index': 1, 'per_family_instance_index': 1, 'run_name': 'ChunkedArraySortIndicesInt64Wide/32768/100', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 2017, 'null_percent': 1.0} ChunkedArraySortIndicesInt64Wide/32768/0 91.982 MiB/sec 152.840 MiB/sec 66.163 {'family_index': 1, 'per_family_instance_index': 5, 'run_name': 'ChunkedArraySortIndicesInt64Wide/32768/0', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 2115, 'null_percent': 0.0} ChunkedArraySortIndicesInt64Narrow/8388608/100 240.335 MiB/sec 387.423 MiB/sec 61.201 {'family_index': 0, 'per_family_instance_index': 7, 'run_name': 'ChunkedArraySortIndicesInt64Narrow/8388608/100', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 21, 'null_percent': 1.0} ChunkedArraySortIndicesInt64Wide/32768/2 172.376 MiB/sec 274.133 MiB/sec 59.032 {'family_index': 1, 'per_family_instance_index': 3, 'run_name': 'ChunkedArraySortIndicesInt64Wide/32768/2', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 3770, 'null_percent': 50.0} TableSortIndicesInt64Wide/1048576/4/1/32 7.407M items/sec 11.621M items/sec 56.904 {'family_index': 4, 'per_family_instance_index': 10, 'run_name': 'TableSortIndicesInt64Wide/1048576/4/1/32', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 5, 'chunks': 32.0, 'columns': 1.0, 'null_percent': 25.0} TableSortIndicesInt64Wide/1048576/100/1/32 5.788M items/sec 9.062M items/sec 56.565 {'family_index': 4, 'per_family_instance_index': 9, 'run_name': 'TableSortIndicesInt64Wide/1048576/100/1/32', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 4, 'chunks': 32.0, 'columns': 1.0, 'null_percent': 1.0} TableSortIndicesInt64Wide/1048576/0/1/32 5.785M items/sec 9.049M items/sec 56.409 {'family_index': 4, 'per_family_instance_index': 11, 'run_name': 'TableSortIndicesInt64Wide/1048576/0/1/32', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 4, 'chunks': 32.0, 'columns': 1.0, 'null_percent': 0.0} ChunkedArraySortIndicesInt64Narrow/32768/2 194.743 MiB/sec 291.432 MiB/sec 49.649 {'family_index': 0, 'per_family_instance_index': 3, 'run_name': 'ChunkedArraySortIndicesInt64Narrow/32768/2', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 4340, 'null_percent': 50.0} TableSortIndicesInt64Narrow/1048576/4/1/32 25.686M items/sec 38.087M items/sec 48.279 {'family_index': 3, 'per_family_instance_index': 10, 'run_name': 'TableSortIndicesInt64Narrow/1048576/4/1/32', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 17, 'chunks': 32.0, 'columns': 1.0, 'null_percent': 25.0} TableSortIndicesInt64Wide/1048576/0/8/32 5.766M items/sec 8.374M items/sec 45.240 {'family_index': 4, 'per_family_instance_index': 5, 'run_name': 'TableSortIndicesInt64Wide/1048576/0/8/32', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 4, 'chunks': 32.0, 'columns': 8.0, 'null_percent': 0.0} TableSortIndicesInt64Wide/1048576/0/16/32 5.752M items/sec 8.352M items/sec 45.202 {'family_index': 4, 'per_family_instance_index': 2, 'run_name': 'TableSortIndicesInt64Wide/1048576/0/16/32', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 4, 'chunks': 32.0, 'columns': 16.0, 'null_percent': 0.0} ChunkedArraySortIndicesInt64Narrow/32768/10000 121.253 MiB/sec 175.286 MiB/sec 44.562 {'family_index': 0, 'per_family_instance_index': 0, 'run_name': 'ChunkedArraySortIndicesInt64Narrow/32768/10000', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 2673, 'null_percent': 0.01} TableSortIndicesInt64Wide/1048576/100/2/32 5.549M items/sec 7.984M items/sec 43.876 {'family_index': 4, 'per_family_instance_index': 6, 'run_name': 'TableSortIndicesInt64Wide/1048576/100/2/32', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 4, 'chunks': 32.0, 'columns': 2.0, 'null_percent': 1.0} ChunkedArraySortIndicesInt64Wide/1048576/100 69.599 MiB/sec 99.666 MiB/sec 43.200 {'family_index': 1, 'per_family_instance_index': 6, 'run_name': 'ChunkedArraySortIndicesInt64Wide/1048576/100', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 49, 'null_percent': 1.0} TableSortIndicesInt64Narrow/1048576/0/1/4 55.940M items/sec 79.984M items/sec 42.982 {'family_index': 3, 'per_family_instance_index': 23, 'run_name': 'TableSortIndicesInt64Narrow/1048576/0/1/4', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 37, 'chunks': 4.0, 'columns': 1.0, 'null_percent': 0.0} TableSortIndicesInt64Wide/1048576/100/16/32 5.554M items/sec 7.909M items/sec 42.417 {'family_index': 4, 'per_family_instance_index': 0, 'run_name': 'TableSortIndicesInt64Wide/1048576/100/16/32', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 4, 'chunks': 32.0, 'columns': 16.0, 'null_percent': 1.0} ChunkedArraySortIndicesInt64Narrow/32768/10 127.758 MiB/sec 181.407 MiB/sec 41.992 {'family_index': 0, 'per_family_instance_index': 2, 'run_name': 'ChunkedArraySortIndicesInt64Narrow/32768/10', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 2856, 'null_percent': 10.0} TableSortIndicesInt64Wide/1048576/100/8/32 5.572M items/sec 7.775M items/sec 39.548 {'family_index': 4, 'per_family_instance_index': 3, 'run_name': 'TableSortIndicesInt64Wide/1048576/100/8/32', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 4, 'chunks': 32.0, 'columns': 8.0, 'null_percent': 1.0} ChunkedArraySortIndicesInt64Narrow/32768/100 119.600 MiB/sec 166.454 MiB/sec 39.176 {'family_index': 0, 'per_family_instance_index': 1, 'run_name': 'ChunkedArraySortIndicesInt64Narrow/32768/100', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 2667, 'null_percent': 1.0} TableSortIndicesInt64Wide/1048576/0/2/32 5.781M items/sec 8.016M items/sec 38.669 {'family_index': 4, 'per_family_instance_index': 8, 'run_name': 'TableSortIndicesInt64Wide/1048576/0/2/32', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 4, 'chunks': 32.0, 'columns': 2.0, 'null_percent': 0.0} TableSortIndicesInt64Narrow/1048576/100/1/4 52.252M items/sec 72.193M items/sec 38.162 {'family_index': 3, 'per_family_instance_index': 21, 'run_name': 'TableSortIndicesInt64Narrow/1048576/100/1/4', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 35, 'chunks': 4.0, 'columns': 1.0, 'null_percent': 1.0} ChunkedArraySortIndicesInt64Narrow/32768/0 121.868 MiB/sec 168.364 MiB/sec 38.152 {'family_index': 0, 'per_family_instance_index': 5, 'run_name': 'ChunkedArraySortIndicesInt64Narrow/32768/0', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 2691, 'null_percent': 0.0} TableSortIndicesInt64Wide/1048576/4/2/32 5.017M items/sec 6.720M items/sec 33.934 {'family_index': 4, 'per_family_instance_index': 7, 'run_name': 'TableSortIndicesInt64Wide/1048576/4/2/32', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 3, 'chunks': 32.0, 'columns': 2.0, 'null_percent': 25.0} ChunkedArraySortIndicesInt64Wide/8388608/100 54.785 MiB/sec 72.642 MiB/sec 32.593 {'family_index': 1, 'per_family_instance_index': 7, 'run_name': 'ChunkedArraySortIndicesInt64Wide/8388608/100', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 5, 'null_percent': 1.0} TableSortIndicesInt64Wide/1048576/4/8/32 4.222M items/sec 5.483M items/sec 29.861 {'family_index': 4, 'per_family_instance_index': 4, 'run_name': 'TableSortIndicesInt64Wide/1048576/4/8/32', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 3, 'chunks': 32.0, 'columns': 8.0, 'null_percent': 25.0} ChunkedArraySortIndicesString/32768/10 146.866 MiB/sec 190.314 MiB/sec 29.583 {'family_index': 2, 'per_family_instance_index': 2, 'run_name': 'ChunkedArraySortIndicesString/32768/10', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 3494, 'null_percent': 10.0} TableSortIndicesInt64Wide/1048576/4/16/32 4.225M items/sec 5.433M items/sec 28.599 {'family_index': 4, 'per_family_instance_index': 1, 'run_name': 'TableSortIndicesInt64Wide/1048576/4/16/32', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 3, 'chunks': 32.0, 'columns': 16.0, 'null_percent': 25.0} TableSortIndicesInt64Narrow/1048576/100/16/32 2.193M items/sec 2.711M items/sec 23.652 {'family_index': 3, 'per_family_instance_index': 0, 'run_name': 'TableSortIndicesInt64Narrow/1048576/100/16/32', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 2, 'chunks': 32.0, 'columns': 16.0, 'null_percent': 1.0} ChunkedArraySortIndicesString/32768/100 156.401 MiB/sec 191.910 MiB/sec 22.704 {'family_index': 2, 'per_family_instance_index': 1, 'run_name': 'ChunkedArraySortIndicesString/32768/100', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 3488, 'null_percent': 1.0} TableSortIndicesInt64Narrow/1048576/4/1/4 47.342M items/sec 58.062M items/sec 22.644 {'family_index': 3, 'per_family_instance_index': 22, 'run_name': 'TableSortIndicesInt64Narrow/1048576/4/1/4', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 32, 'chunks': 4.0, 'columns': 1.0, 'null_percent': 25.0} ChunkedArraySortIndicesString/32768/0 161.457 MiB/sec 195.782 MiB/sec 21.259 {'family_index': 2, 'per_family_instance_index': 5, 'run_name': 'ChunkedArraySortIndicesString/32768/0', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 3644, 'null_percent': 0.0} TableSortIndicesInt64Narrow/1048576/4/16/32 1.915M items/sec 2.309M items/sec 20.561 {'family_index': 3, 'per_family_instance_index': 1, 'run_name': 'TableSortIndicesInt64Narrow/1048576/4/16/32', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 1, 'chunks': 32.0, 'columns': 16.0, 'null_percent': 25.0} TableSortIndicesInt64Narrow/1048576/0/16/32 2.561M items/sec 3.079M items/sec 20.208 {'family_index': 3, 'per_family_instance_index': 2, 'run_name': 'TableSortIndicesInt64Narrow/1048576/0/16/32', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 2, 'chunks': 32.0, 'columns': 16.0, 'null_percent': 0.0} ChunkedArraySortIndicesString/32768/10000 157.786 MiB/sec 189.412 MiB/sec 20.043 {'family_index': 2, 'per_family_instance_index': 0, 'run_name': 'ChunkedArraySortIndicesString/32768/10000', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 3539, 'null_percent': 0.01} ChunkedArraySortIndicesString/32768/2 139.241 MiB/sec 164.172 MiB/sec 17.904 {'family_index': 2, 'per_family_instance_index': 3, 'run_name': 'ChunkedArraySortIndicesString/32768/2', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 3155, 'null_percent': 50.0} TableSortIndicesInt64Narrow/1048576/0/8/32 2.595M items/sec 3.038M items/sec 17.081 {'family_index': 3, 'per_family_instance_index': 5, 'run_name': 'TableSortIndicesInt64Narrow/1048576/0/8/32', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 2, 'chunks': 32.0, 'columns': 8.0, 'null_percent': 0.0} TableSortIndicesInt64Narrow/1048576/4/8/32 1.999M items/sec 2.298M items/sec 14.936 {'family_index': 3, 'per_family_instance_index': 4, 'run_name': 'TableSortIndicesInt64Narrow/1048576/4/8/32', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 1, 'chunks': 32.0, 'columns': 8.0, 'null_percent': 25.0} ChunkedArraySortIndicesString/8388608/100 81.026 MiB/sec 93.120 MiB/sec 14.926 {'family_index': 2, 'per_family_instance_index': 7, 'run_name': 'ChunkedArraySortIndicesString/8388608/100', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 7, 'null_percent': 1.0} TableSortIndicesInt64Narrow/1048576/100/8/32 2.382M items/sec 2.719M items/sec 14.168 {'family_index': 3, 'per_family_instance_index': 3, 'run_name': 'TableSortIndicesInt64Narrow/1048576/100/8/32', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 2, 'chunks': 32.0, 'columns': 8.0, 'null_percent': 1.0} ChunkedArraySortIndicesString/1048576/100 107.722 MiB/sec 122.229 MiB/sec 13.467 {'family_index': 2, 'per_family_instance_index': 6, 'run_name': 'ChunkedArraySortIndicesString/1048576/100', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 77, 'null_percent': 1.0} TableSortIndicesInt64Narrow/1048576/100/2/32 4.019M items/sec 4.477M items/sec 11.383 {'family_index': 3, 'per_family_instance_index': 6, 'run_name': 'TableSortIndicesInt64Narrow/1048576/100/2/32', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 3, 'chunks': 32.0, 'columns': 2.0, 'null_percent': 1.0} TableSortIndicesInt64Wide/1048576/4/1/4 11.595M items/sec 12.791M items/sec 10.314 {'family_index': 4, 'per_family_instance_index': 22, 'run_name': 'TableSortIndicesInt64Wide/1048576/4/1/4', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 8, 'chunks': 4.0, 'columns': 1.0, 'null_percent': 25.0} TableSortIndicesInt64Wide/1048576/0/1/4 9.231M items/sec 10.181M items/sec 10.294 {'family_index': 4, 'per_family_instance_index': 23, 'run_name': 'TableSortIndicesInt64Wide/1048576/0/1/4', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 6, 'chunks': 4.0, 'columns': 1.0, 'null_percent': 0.0} ``` However, performance also regresses when the input is all-nulls (which is probably rare): ``` benchmark baseline contender change % counters ChunkedArraySortIndicesString/32768/1 5.636 GiB/sec 4.336 GiB/sec -23.068 {'family_index': 2, 'per_family_instance_index': 4, 'run_name': 'ChunkedArraySortIndicesString/32768/1', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 127778, 'null_percent': 100.0} ChunkedArraySortIndicesInt64Narrow/32768/1 3.963 GiB/sec 2.852 GiB/sec -28.025 {'family_index': 0, 'per_family_instance_index': 4, 'run_name': 'ChunkedArraySortIndicesInt64Narrow/32768/1', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 91209, 'null_percent': 100.0} ChunkedArraySortIndicesInt64Wide/32768/1 4.038 GiB/sec 2.869 GiB/sec -28.954 {'family_index': 1, 'per_family_instance_index': 4, 'run_name': 'ChunkedArraySortIndicesInt64Wide/32768/1', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 94090, 'null_percent': 100.0} ``` ### Are these changes tested? Yes, by existing tests. ### Are there any user-facing changes? No. * GitHub Issue: apache#44084 Authored-by: Antoine Pitrou <[email protected]> Signed-off-by: Antoine Pitrou <[email protected]>
- Loading branch information