GH-44084: [C++] Improve merge step in chunked sorting #44217

pitrou · 2024-09-24T17:16:07Z

Rationale for this change

When merge-sorting the chunks of a chunked array or table, we currently resolve the chunk index again for each individual value lookup. This requires O(n*log k) chunk resolutions, with n being the chunked array or table length and k the number of chunks.

Instead, this PR translates the logical indices to physical indices all at once. This does not even require expensive chunk resolution, because the logical indices are initially chunk-partitioned.
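To illustrate the difference, here is a minimal sketch, not the code in this PR. `ChunkLocation`, `MakeOffsets`, `ResolveOne` and `ResolveAllPartitioned` are hypothetical names chosen for the example. It also assumes that "chunk-partitioned" means the index buffer holds the logical indices of chunk 0 first, then those of chunk 1, and so on (for example after each chunk has been sorted on its own).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical physical location of a value: which chunk, and where in it.
struct ChunkLocation {
  int64_t chunk_index;
  int64_t index_in_chunk;
  bool operator==(const ChunkLocation& other) const {
    return chunk_index == other.chunk_index &&
           index_in_chunk == other.index_in_chunk;
  }
};

// offsets[i] is the logical index of the first value of chunk i and
// offsets.back() is the total length, e.g. lengths {3, 2, 4} -> {0, 3, 5, 9}.
std::vector<int64_t> MakeOffsets(const std::vector<int64_t>& chunk_lengths) {
  std::vector<int64_t> offsets{0};
  for (int64_t length : chunk_lengths) offsets.push_back(offsets.back() + length);
  return offsets;
}

// Per-lookup resolution: one binary search over the chunk offsets, O(log k).
ChunkLocation ResolveOne(const std::vector<int64_t>& offsets,
                         int64_t logical_index) {
  assert(logical_index >= 0 && logical_index < offsets.back());
  auto it = std::upper_bound(offsets.begin(), offsets.end(), logical_index);
  int64_t chunk = static_cast<int64_t>(it - offsets.begin()) - 1;
  return {chunk, logical_index - offsets[chunk]};
}

// Bulk translation, ASSUMING `indices` is chunk-partitioned: positions
// [offsets[c], offsets[c + 1]) hold, in any order, the logical indices of
// chunk c. The chunk is known from the position, so no search is needed
// and the whole pass is O(n).
std::vector<ChunkLocation> ResolveAllPartitioned(
    const std::vector<int64_t>& offsets, const std::vector<int64_t>& indices) {
  assert(static_cast<int64_t>(indices.size()) == offsets.back());
  std::vector<ChunkLocation> out;
  out.reserve(indices.size());
  for (std::size_t c = 0; c + 1 < offsets.size(); ++c) {
    for (int64_t pos = offsets[c]; pos < offsets[c + 1]; ++pos) {
      out.push_back({static_cast<int64_t>(c), indices[pos] - offsets[c]});
    }
  }
  return out;
}
```

With this layout, the merge step can compare values through the precomputed locations instead of calling something like `ResolveOne` for every comparison.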

This change yields significant speedups on chunked array and table sorting:

                                           benchmark          baseline         contender  change %                                                                                                                                                                                                                                       counters
      ChunkedArraySortIndicesInt64Narrow/1048576/100   345.419 MiB/sec   628.334 MiB/sec    81.905                               {'family_index': 0, 'per_family_instance_index': 6, 'run_name': 'ChunkedArraySortIndicesInt64Narrow/1048576/100', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 242, 'null_percent': 1.0}
          TableSortIndicesInt64Narrow/1048576/0/1/32 25.997M items/sec 44.550M items/sec    71.366   {'family_index': 3, 'per_family_instance_index': 11, 'run_name': 'TableSortIndicesInt64Narrow/1048576/0/1/32', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 17, 'chunks': 32.0, 'columns': 1.0, 'null_percent': 0.0}
        ChunkedArraySortIndicesInt64Wide/32768/10000    91.182 MiB/sec   153.756 MiB/sec    68.625                               {'family_index': 1, 'per_family_instance_index': 0, 'run_name': 'ChunkedArraySortIndicesInt64Wide/32768/10000', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 2067, 'null_percent': 0.01}
           ChunkedArraySortIndicesInt64Wide/32768/10    96.536 MiB/sec   161.648 MiB/sec    67.449                                  {'family_index': 1, 'per_family_instance_index': 2, 'run_name': 'ChunkedArraySortIndicesInt64Wide/32768/10', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 2238, 'null_percent': 10.0}
        TableSortIndicesInt64Narrow/1048576/100/1/32 24.290M items/sec 40.513M items/sec    66.791  {'family_index': 3, 'per_family_instance_index': 9, 'run_name': 'TableSortIndicesInt64Narrow/1048576/100/1/32', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 16, 'chunks': 32.0, 'columns': 1.0, 'null_percent': 1.0}
          ChunkedArraySortIndicesInt64Wide/32768/100    90.030 MiB/sec   149.633 MiB/sec    66.203                                  {'family_index': 1, 'per_family_instance_index': 1, 'run_name': 'ChunkedArraySortIndicesInt64Wide/32768/100', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 2017, 'null_percent': 1.0}
            ChunkedArraySortIndicesInt64Wide/32768/0    91.982 MiB/sec   152.840 MiB/sec    66.163                                    {'family_index': 1, 'per_family_instance_index': 5, 'run_name': 'ChunkedArraySortIndicesInt64Wide/32768/0', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 2115, 'null_percent': 0.0}
      ChunkedArraySortIndicesInt64Narrow/8388608/100   240.335 MiB/sec   387.423 MiB/sec    61.201                                {'family_index': 0, 'per_family_instance_index': 7, 'run_name': 'ChunkedArraySortIndicesInt64Narrow/8388608/100', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 21, 'null_percent': 1.0}
            ChunkedArraySortIndicesInt64Wide/32768/2   172.376 MiB/sec   274.133 MiB/sec    59.032                                   {'family_index': 1, 'per_family_instance_index': 3, 'run_name': 'ChunkedArraySortIndicesInt64Wide/32768/2', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 3770, 'null_percent': 50.0}
            TableSortIndicesInt64Wide/1048576/4/1/32  7.407M items/sec 11.621M items/sec    56.904     {'family_index': 4, 'per_family_instance_index': 10, 'run_name': 'TableSortIndicesInt64Wide/1048576/4/1/32', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 5, 'chunks': 32.0, 'columns': 1.0, 'null_percent': 25.0}
          TableSortIndicesInt64Wide/1048576/100/1/32  5.788M items/sec  9.062M items/sec    56.565     {'family_index': 4, 'per_family_instance_index': 9, 'run_name': 'TableSortIndicesInt64Wide/1048576/100/1/32', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 4, 'chunks': 32.0, 'columns': 1.0, 'null_percent': 1.0}
            TableSortIndicesInt64Wide/1048576/0/1/32  5.785M items/sec  9.049M items/sec    56.409      {'family_index': 4, 'per_family_instance_index': 11, 'run_name': 'TableSortIndicesInt64Wide/1048576/0/1/32', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 4, 'chunks': 32.0, 'columns': 1.0, 'null_percent': 0.0}
          ChunkedArraySortIndicesInt64Narrow/32768/2   194.743 MiB/sec   291.432 MiB/sec    49.649                                 {'family_index': 0, 'per_family_instance_index': 3, 'run_name': 'ChunkedArraySortIndicesInt64Narrow/32768/2', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 4340, 'null_percent': 50.0}
          TableSortIndicesInt64Narrow/1048576/4/1/32 25.686M items/sec 38.087M items/sec    48.279  {'family_index': 3, 'per_family_instance_index': 10, 'run_name': 'TableSortIndicesInt64Narrow/1048576/4/1/32', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 17, 'chunks': 32.0, 'columns': 1.0, 'null_percent': 25.0}
            TableSortIndicesInt64Wide/1048576/0/8/32  5.766M items/sec  8.374M items/sec    45.240       {'family_index': 4, 'per_family_instance_index': 5, 'run_name': 'TableSortIndicesInt64Wide/1048576/0/8/32', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 4, 'chunks': 32.0, 'columns': 8.0, 'null_percent': 0.0}
           TableSortIndicesInt64Wide/1048576/0/16/32  5.752M items/sec  8.352M items/sec    45.202     {'family_index': 4, 'per_family_instance_index': 2, 'run_name': 'TableSortIndicesInt64Wide/1048576/0/16/32', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 4, 'chunks': 32.0, 'columns': 16.0, 'null_percent': 0.0}
      ChunkedArraySortIndicesInt64Narrow/32768/10000   121.253 MiB/sec   175.286 MiB/sec    44.562                             {'family_index': 0, 'per_family_instance_index': 0, 'run_name': 'ChunkedArraySortIndicesInt64Narrow/32768/10000', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 2673, 'null_percent': 0.01}
          TableSortIndicesInt64Wide/1048576/100/2/32  5.549M items/sec  7.984M items/sec    43.876     {'family_index': 4, 'per_family_instance_index': 6, 'run_name': 'TableSortIndicesInt64Wide/1048576/100/2/32', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 4, 'chunks': 32.0, 'columns': 2.0, 'null_percent': 1.0}
        ChunkedArraySortIndicesInt64Wide/1048576/100    69.599 MiB/sec    99.666 MiB/sec    43.200                                  {'family_index': 1, 'per_family_instance_index': 6, 'run_name': 'ChunkedArraySortIndicesInt64Wide/1048576/100', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 49, 'null_percent': 1.0}
           TableSortIndicesInt64Narrow/1048576/0/1/4 55.940M items/sec 79.984M items/sec    42.982     {'family_index': 3, 'per_family_instance_index': 23, 'run_name': 'TableSortIndicesInt64Narrow/1048576/0/1/4', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 37, 'chunks': 4.0, 'columns': 1.0, 'null_percent': 0.0}
         TableSortIndicesInt64Wide/1048576/100/16/32  5.554M items/sec  7.909M items/sec    42.417   {'family_index': 4, 'per_family_instance_index': 0, 'run_name': 'TableSortIndicesInt64Wide/1048576/100/16/32', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 4, 'chunks': 32.0, 'columns': 16.0, 'null_percent': 1.0}
         ChunkedArraySortIndicesInt64Narrow/32768/10   127.758 MiB/sec   181.407 MiB/sec    41.992                                {'family_index': 0, 'per_family_instance_index': 2, 'run_name': 'ChunkedArraySortIndicesInt64Narrow/32768/10', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 2856, 'null_percent': 10.0}
          TableSortIndicesInt64Wide/1048576/100/8/32  5.572M items/sec  7.775M items/sec    39.548     {'family_index': 4, 'per_family_instance_index': 3, 'run_name': 'TableSortIndicesInt64Wide/1048576/100/8/32', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 4, 'chunks': 32.0, 'columns': 8.0, 'null_percent': 1.0}
        ChunkedArraySortIndicesInt64Narrow/32768/100   119.600 MiB/sec   166.454 MiB/sec    39.176                                {'family_index': 0, 'per_family_instance_index': 1, 'run_name': 'ChunkedArraySortIndicesInt64Narrow/32768/100', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 2667, 'null_percent': 1.0}
            TableSortIndicesInt64Wide/1048576/0/2/32  5.781M items/sec  8.016M items/sec    38.669       {'family_index': 4, 'per_family_instance_index': 8, 'run_name': 'TableSortIndicesInt64Wide/1048576/0/2/32', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 4, 'chunks': 32.0, 'columns': 2.0, 'null_percent': 0.0}
         TableSortIndicesInt64Narrow/1048576/100/1/4 52.252M items/sec 72.193M items/sec    38.162   {'family_index': 3, 'per_family_instance_index': 21, 'run_name': 'TableSortIndicesInt64Narrow/1048576/100/1/4', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 35, 'chunks': 4.0, 'columns': 1.0, 'null_percent': 1.0}
          ChunkedArraySortIndicesInt64Narrow/32768/0   121.868 MiB/sec   168.364 MiB/sec    38.152                                  {'family_index': 0, 'per_family_instance_index': 5, 'run_name': 'ChunkedArraySortIndicesInt64Narrow/32768/0', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 2691, 'null_percent': 0.0}
            TableSortIndicesInt64Wide/1048576/4/2/32  5.017M items/sec  6.720M items/sec    33.934      {'family_index': 4, 'per_family_instance_index': 7, 'run_name': 'TableSortIndicesInt64Wide/1048576/4/2/32', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 3, 'chunks': 32.0, 'columns': 2.0, 'null_percent': 25.0}
        ChunkedArraySortIndicesInt64Wide/8388608/100    54.785 MiB/sec    72.642 MiB/sec    32.593                                   {'family_index': 1, 'per_family_instance_index': 7, 'run_name': 'ChunkedArraySortIndicesInt64Wide/8388608/100', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 5, 'null_percent': 1.0}
            TableSortIndicesInt64Wide/1048576/4/8/32  4.222M items/sec  5.483M items/sec    29.861      {'family_index': 4, 'per_family_instance_index': 4, 'run_name': 'TableSortIndicesInt64Wide/1048576/4/8/32', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 3, 'chunks': 32.0, 'columns': 8.0, 'null_percent': 25.0}
              ChunkedArraySortIndicesString/32768/10   146.866 MiB/sec   190.314 MiB/sec    29.583                                     {'family_index': 2, 'per_family_instance_index': 2, 'run_name': 'ChunkedArraySortIndicesString/32768/10', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 3494, 'null_percent': 10.0}
           TableSortIndicesInt64Wide/1048576/4/16/32  4.225M items/sec  5.433M items/sec    28.599    {'family_index': 4, 'per_family_instance_index': 1, 'run_name': 'TableSortIndicesInt64Wide/1048576/4/16/32', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 3, 'chunks': 32.0, 'columns': 16.0, 'null_percent': 25.0}
       TableSortIndicesInt64Narrow/1048576/100/16/32  2.193M items/sec  2.711M items/sec    23.652 {'family_index': 3, 'per_family_instance_index': 0, 'run_name': 'TableSortIndicesInt64Narrow/1048576/100/16/32', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 2, 'chunks': 32.0, 'columns': 16.0, 'null_percent': 1.0}
             ChunkedArraySortIndicesString/32768/100   156.401 MiB/sec   191.910 MiB/sec    22.704                                     {'family_index': 2, 'per_family_instance_index': 1, 'run_name': 'ChunkedArraySortIndicesString/32768/100', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 3488, 'null_percent': 1.0}
           TableSortIndicesInt64Narrow/1048576/4/1/4 47.342M items/sec 58.062M items/sec    22.644    {'family_index': 3, 'per_family_instance_index': 22, 'run_name': 'TableSortIndicesInt64Narrow/1048576/4/1/4', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 32, 'chunks': 4.0, 'columns': 1.0, 'null_percent': 25.0}
               ChunkedArraySortIndicesString/32768/0   161.457 MiB/sec   195.782 MiB/sec    21.259                                       {'family_index': 2, 'per_family_instance_index': 5, 'run_name': 'ChunkedArraySortIndicesString/32768/0', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 3644, 'null_percent': 0.0}
         TableSortIndicesInt64Narrow/1048576/4/16/32  1.915M items/sec  2.309M items/sec    20.561  {'family_index': 3, 'per_family_instance_index': 1, 'run_name': 'TableSortIndicesInt64Narrow/1048576/4/16/32', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 1, 'chunks': 32.0, 'columns': 16.0, 'null_percent': 25.0}
         TableSortIndicesInt64Narrow/1048576/0/16/32  2.561M items/sec  3.079M items/sec    20.208   {'family_index': 3, 'per_family_instance_index': 2, 'run_name': 'TableSortIndicesInt64Narrow/1048576/0/16/32', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 2, 'chunks': 32.0, 'columns': 16.0, 'null_percent': 0.0}
           ChunkedArraySortIndicesString/32768/10000   157.786 MiB/sec   189.412 MiB/sec    20.043                                  {'family_index': 2, 'per_family_instance_index': 0, 'run_name': 'ChunkedArraySortIndicesString/32768/10000', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 3539, 'null_percent': 0.01}
               ChunkedArraySortIndicesString/32768/2   139.241 MiB/sec   164.172 MiB/sec    17.904                                      {'family_index': 2, 'per_family_instance_index': 3, 'run_name': 'ChunkedArraySortIndicesString/32768/2', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 3155, 'null_percent': 50.0}
          TableSortIndicesInt64Narrow/1048576/0/8/32  2.595M items/sec  3.038M items/sec    17.081     {'family_index': 3, 'per_family_instance_index': 5, 'run_name': 'TableSortIndicesInt64Narrow/1048576/0/8/32', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 2, 'chunks': 32.0, 'columns': 8.0, 'null_percent': 0.0}
          TableSortIndicesInt64Narrow/1048576/4/8/32  1.999M items/sec  2.298M items/sec    14.936    {'family_index': 3, 'per_family_instance_index': 4, 'run_name': 'TableSortIndicesInt64Narrow/1048576/4/8/32', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 1, 'chunks': 32.0, 'columns': 8.0, 'null_percent': 25.0}
           ChunkedArraySortIndicesString/8388608/100    81.026 MiB/sec    93.120 MiB/sec    14.926                                      {'family_index': 2, 'per_family_instance_index': 7, 'run_name': 'ChunkedArraySortIndicesString/8388608/100', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 7, 'null_percent': 1.0}
        TableSortIndicesInt64Narrow/1048576/100/8/32  2.382M items/sec  2.719M items/sec    14.168   {'family_index': 3, 'per_family_instance_index': 3, 'run_name': 'TableSortIndicesInt64Narrow/1048576/100/8/32', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 2, 'chunks': 32.0, 'columns': 8.0, 'null_percent': 1.0}
           ChunkedArraySortIndicesString/1048576/100   107.722 MiB/sec   122.229 MiB/sec    13.467                                     {'family_index': 2, 'per_family_instance_index': 6, 'run_name': 'ChunkedArraySortIndicesString/1048576/100', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 77, 'null_percent': 1.0}
        TableSortIndicesInt64Narrow/1048576/100/2/32  4.019M items/sec  4.477M items/sec    11.383   {'family_index': 3, 'per_family_instance_index': 6, 'run_name': 'TableSortIndicesInt64Narrow/1048576/100/2/32', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 3, 'chunks': 32.0, 'columns': 2.0, 'null_percent': 1.0}
             TableSortIndicesInt64Wide/1048576/4/1/4 11.595M items/sec 12.791M items/sec    10.314       {'family_index': 4, 'per_family_instance_index': 22, 'run_name': 'TableSortIndicesInt64Wide/1048576/4/1/4', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 8, 'chunks': 4.0, 'columns': 1.0, 'null_percent': 25.0}
             TableSortIndicesInt64Wide/1048576/0/1/4  9.231M items/sec 10.181M items/sec    10.294        {'family_index': 4, 'per_family_instance_index': 23, 'run_name': 'TableSortIndicesInt64Wide/1048576/0/1/4', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 6, 'chunks': 4.0, 'columns': 1.0, 'null_percent': 0.0}

However, performance also regresses when the input is all-nulls (which is probably rare):

                                       benchmark           baseline          contender  change %                                                                                                                                                                                                                                      counters
           ChunkedArraySortIndicesString/32768/1      5.636 GiB/sec      4.336 GiB/sec   -23.068                                  {'family_index': 2, 'per_family_instance_index': 4, 'run_name': 'ChunkedArraySortIndicesString/32768/1', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 127778, 'null_percent': 100.0}
      ChunkedArraySortIndicesInt64Narrow/32768/1      3.963 GiB/sec      2.852 GiB/sec   -28.025                              {'family_index': 0, 'per_family_instance_index': 4, 'run_name': 'ChunkedArraySortIndicesInt64Narrow/32768/1', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 91209, 'null_percent': 100.0}
        ChunkedArraySortIndicesInt64Wide/32768/1      4.038 GiB/sec      2.869 GiB/sec   -28.954                                {'family_index': 1, 'per_family_instance_index': 4, 'run_name': 'ChunkedArraySortIndicesInt64Wide/32768/1', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 94090, 'null_percent': 100.0}
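
For reading the tables above: the change % column appears to be the relative throughput difference between contender and baseline. This reading is inferred from the numbers themselves rather than stated in the benchmark output. The first row of the speedup table works out as follows:

```latex
\[
\text{change \%} = \frac{\text{contender} - \text{baseline}}{\text{baseline}} \times 100,
\qquad
\frac{628.334 - 345.419}{345.419} \times 100 \approx 81.905
\]
```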

Are these changes tested?

Yes, by existing tests.

Are there any user-facing changes?

No.

GitHub Issue: [C++][Compute] Pre-solve chunked indices before merging chunks in Sort kernels #44084

github-actions · 2024-09-24T17:16:33Z

Thanks for opening a pull request!

If this is not a minor PR, could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose

Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename the pull request title in the following format?

GH-${GITHUB_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}

In the case of PARQUET issues on JIRA the title also supports:

PARQUET-${JIRA_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

See also:

pitrou · 2024-09-24T17:17:55Z

@ursabot please benchmark lang=C++

ursabot · 2024-09-24T17:18:04Z

Benchmark runs are scheduled for commit a24e70a. Watch https://buildkite.com/apache-arrow and https://conbench.ursa.dev for updates. A comment will be posted here when the runs are complete.

pitrou · 2024-09-24T17:18:26Z

@ursabot please benchmark lang=C++

ursabot · 2024-09-24T17:18:31Z

Benchmark runs are scheduled for commit 45566ce. Watch https://buildkite.com/apache-arrow and https://conbench.ursa.dev for updates. A comment will be posted here when the runs are complete.

conbench-apache-arrow · 2024-09-25T00:49:39Z

Thanks for your patience. Conbench analyzed the 0 benchmarking runs that have been run so far on PR commit a24e70a.

None of the specified runs were found on the Conbench server.

The full Conbench report has more details.

conbench-apache-arrow · 2024-09-25T00:51:18Z

Thanks for your patience. Conbench analyzed the 3 benchmarking runs that have been run so far on PR commit 45566ce.

There were 21 benchmark results indicating a performance regression:

Pull Request Run on amd64-c6a-4xlarge-linux at 2024-09-24 18:13:17Z
- RoundDerivativesArrayBenchmark (C++) with params=<Floor, FloatType>/size:524288/inverse_null_proportion:0, source=cpp-micro, suite=arrow-compute-scalar-round-benchmark
- ChunkedArraySortIndicesString (C++) with params=32768/1, source=cpp-micro, suite=arrow-compute-vector-sort-benchmark
and 19 more (see the report linked below)

The full Conbench report has more details.

pitrou · 2024-09-25T07:18:42Z

@ursabot benchmark help

ursabot · 2024-09-25T07:18:44Z

Supported benchmark command examples:

@ursabot benchmark help

To run all benchmarks:
@ursabot please benchmark

To filter benchmarks by language:
@ursabot please benchmark lang=Python
@ursabot please benchmark lang=C++
@ursabot please benchmark lang=R
@ursabot please benchmark lang=Java
@ursabot please benchmark lang=JavaScript

To filter Python and R benchmarks by name:
@ursabot please benchmark name=file-write
@ursabot please benchmark name=file-write lang=Python
@ursabot please benchmark name=file-.*

To filter C++ benchmarks by archery --suite-filter and --benchmark-filter:
@ursabot please benchmark command=cpp-micro --suite-filter=arrow-compute-vector-selection-benchmark --benchmark-filter=TakeStringRandomIndicesWithNulls/262144/2

For other command=cpp-micro options, please see https://github.com/voltrondata-labs/benchmarks/blob/main/benchmarks/cpp_micro_benchmarks.py

pitrou · 2024-09-25T08:40:21Z

@ursabot please benchmark command=cpp-micro --suite-filter=vector-sort

ursabot · 2024-09-25T08:40:27Z

Benchmark runs are scheduled for commit df0f691. Watch https://buildkite.com/apache-arrow and https://conbench.ursa.dev for updates. A comment will be posted here when the runs are complete.

pitrou · 2024-09-25T08:56:14Z

@ursabot please benchmark lang=C++

ursabot · 2024-09-25T08:56:17Z

Commit df0f691 already has scheduled benchmark runs.

pitrou · 2024-09-25T08:56:36Z

@ursabot please benchmark lang=C++

ursabot · 2024-09-25T08:56:43Z

Benchmark runs are scheduled for commit 275871b. Watch https://buildkite.com/apache-arrow and https://conbench.ursa.dev for updates. A comment will be posted here when the runs are complete.

github-actions · 2024-09-25T09:35:55Z

⚠️ GitHub issue #44084 has been automatically assigned in GitHub to PR creator.

pitrou · 2024-09-25T11:09:12Z

Set back to draft because some things can be further improved.

pitrou · 2024-09-25T11:42:59Z

@ursabot please benchmark lang=C++

ursabot · 2024-09-25T11:43:05Z

Benchmark runs are scheduled for commit f69b3b8. Watch https://buildkite.com/apache-arrow and https://conbench.ursa.dev for updates. A comment will be posted here when the runs are complete.

cpp/src/arrow/compute/kernels/chunked_internal.h

pitrou · 2024-11-18T15:07:35Z

@felipecrv Would you like to give this another look (assuming CI passes, which it should :-))?

zanmato1984 · 2024-11-21T08:43:27Z

Sorry I will be fully occupied until the end of this week. I'll help review next week.

zanmato1984

Just some questions and nits.

cpp/src/arrow/compute/kernels/chunked_internal.h

cpp/src/arrow/compute/kernels/chunked_internal.cc

cpp/src/arrow/compute/kernels/vector_sort_internal.h

cpp/src/arrow/compute/kernels/chunked_internal.cc

cpp/src/arrow/compute/kernels/chunked_internal.h

cpp/src/arrow/compute/kernels/chunked_internal.cc

pitrou · 2024-11-26T10:23:02Z

@github-actions crossbow submit -g cpp

github-actions · 2024-11-26T10:25:34Z

Revision: 4f2fff4

Submitted crossbow builds: ursacomputing/crossbow @ actions-fa64807be3

Task	Status
example-cpp-minimal-build-static
example-cpp-minimal-build-static-system-dependency
example-cpp-tutorial
test-alpine-linux-cpp
test-build-cpp-fuzz
test-conda-cpp
test-conda-cpp-valgrind
test-cuda-cpp-ubuntu-20.04-cuda-11.2.2
test-cuda-cpp-ubuntu-22.04-cuda-11.7.1
test-debian-12-cpp-amd64
test-debian-12-cpp-i386
test-fedora-39-cpp
test-ubuntu-20.04-cpp
test-ubuntu-20.04-cpp-bundled
test-ubuntu-22.04-cpp
test-ubuntu-22.04-cpp-20
test-ubuntu-22.04-cpp-emscripten
test-ubuntu-22.04-cpp-no-threading
test-ubuntu-24.04-cpp
test-ubuntu-24.04-cpp-bundled-offline
test-ubuntu-24.04-cpp-gcc-13-bundled
test-ubuntu-24.04-cpp-gcc-14
test-ubuntu-24.04-cpp-minimal-with-formats
test-ubuntu-24.04-cpp-thread-sanitizer

zanmato1984

+1

Thanks for the improvement!

conbench-apache-arrow · 2024-11-26T19:06:49Z

After merging your PR, Conbench analyzed the 3 benchmarking runs that have been run so far on merge-commit d5cda4a.

There were 132 benchmark results with an error:

Commit Run on arm64-t4g-2xlarge-linux at 2024-11-26 14:35:43Z
- tpch (R) with engine=arrow, format=parquet, language=R, memory_map=False, query_id=TPCH-07, scale_factor=1
- tpch (R) with engine=arrow, format=native, language=R, memory_map=False, query_id=TPCH-09, scale_factor=1
and 130 more (see the report linked below)

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 5 possible false positives for unstable benchmarks that are known to sometimes produce them.

assignUser · 2024-11-30T02:42:58Z

It seems like this change causes issues on gcc 8 https://github.com/ursacomputing/crossbow/actions/runs/12081500439/job/33690725970#step:7:2058 probably the change from std::vector to span in chunkresolver?

pitrou · 2024-12-02T09:23:17Z

@assignUser Thanks for the heads up, I'll take a look.

pitrou · 2024-12-02T13:39:16Z

@assignUser See #44898 and #44899

felipecrv · 2024-12-05T15:30:40Z

> @felipecrv Would you like to give this another look (assuming CI passes, which it should :-))?

Sorry. I've been away. This looks great. Really nice improvements.

github-actions bot added the Component: C++ and awaiting review labels Sep 24, 2024

pitrou force-pushed the gh44084-chunked-sort branch from a24e70a to 45566ce Compare September 24, 2024 17:18

pitrou force-pushed the gh44084-chunked-sort branch from 674407b to 562d1f9 Compare September 25, 2024 07:42

pitrou force-pushed the gh44084-chunked-sort branch from df0f691 to 275871b Compare September 25, 2024 08:56

pitrou force-pushed the gh44084-chunked-sort branch from 275871b to 24979f3 Compare September 25, 2024 09:32

pitrou changed the title ~~EXP: GH-44084: [C++] Improve merge step in chunked sorting~~ GH-44084: [C++] Improve merge step in chunked sorting Sep 25, 2024

pitrou marked this pull request as ready for review September 25, 2024 09:40

pitrou requested a review from felipecrv September 25, 2024 09:41

pitrou marked this pull request as draft September 25, 2024 11:08

pitrou force-pushed the gh44084-chunked-sort branch from 24979f3 to f69b3b8 Compare September 25, 2024 11:41

felipecrv mentioned this pull request Oct 12, 2024

GH-34535: [C++] Move ChunkResolver to the public API #44357

Merged

2 tasks

felipecrv reviewed Oct 30, 2024

cpp/src/arrow/compute/kernels/chunked_internal.h (outdated, resolved)

pitrou force-pushed the gh44084-chunked-sort branch from f69b3b8 to bdfe17b Compare November 18, 2024 14:29

github-actions bot added the awaiting committer review label and removed the awaiting review label Nov 18, 2024

pitrou requested a review from felipecrv November 18, 2024 15:07

pitrou requested a review from zanmato1984 November 21, 2024 08:28

zanmato1984 reviewed Nov 26, 2024

cpp/src/arrow/compute/kernels/chunked_internal.h (resolved)

cpp/src/arrow/compute/kernels/chunked_internal.cc (resolved)

zanmato1984 reviewed Nov 26, 2024

cpp/src/arrow/compute/kernels/vector_sort_internal.h (outdated, resolved)

pitrou added 2 commits November 26, 2024 10:53

GH-44084: [C++] Improve merge step in chunked sorting

88a92b3

Rename ResolvedChunkIndex to CompressedChunkLocation

c85bb4c

zanmato1984 reviewed Nov 26, 2024

cpp/src/arrow/compute/kernels/chunked_internal.cc (resolved)

cpp/src/arrow/compute/kernels/chunked_internal.h (resolved)

cpp/src/arrow/compute/kernels/chunked_internal.cc (resolved)

Apply suggestion of using a regular method

13e5cc2

pitrou force-pushed the gh44084-chunked-sort branch from ee712a6 to 13e5cc2 Compare November 26, 2024 10:11

Add suggested assertion and a comment

4f2fff4

zanmato1984 approved these changes Nov 26, 2024

pitrou merged commit d5cda4a into apache:main Nov 26, 2024
39 of 40 checks passed

pitrou removed the awaiting committer review label Nov 26, 2024

pitrou mentioned this pull request Nov 26, 2024

[C++][Compute] Pre-solve chunked indices before merging chunks in Sort kernels #44084

Closed

pitrou deleted the gh44084-chunked-sort branch December 2, 2024 09:23

pitrou mentioned this pull request Dec 2, 2024

[C++] Compilation error on gcc 8 #44898

Closed
