
Optimize groupby::scan #9754

Merged: 8 commits merged into rapidsai:branch-22.02 on Jan 10, 2022

Conversation

PointKernel (Member) commented Nov 22, 2021

Closes #8522

This PR gets rid of redundant rearranging processes in groupby::scan if input values are presorted. Instead of a short circuit in sort_helper, it adds an early exit in the scan functor to avoid materializing sorted_values/grouped_values thus reducing memory footprint. This optimization brings a 1.6x speedup for presorted scan operations.

  • Baseline
-----------------------------------------------------------------------------------------
Benchmark                                               Time             CPU   Iterations
-----------------------------------------------------------------------------------------
Groupby/BasicSumScan/1000000/manual_time            0.455 ms        0.472 ms         1388
Groupby/BasicSumScan/10000000/manual_time            8.80 ms         8.81 ms           61
Groupby/BasicSumScan/100000000/manual_time            543 ms          543 ms            1
Groupby/PreSortedSumScan/1000000/manual_time        0.217 ms        0.236 ms         3319
Groupby/PreSortedSumScan/10000000/manual_time        1.45 ms         1.47 ms          479
Groupby/PreSortedSumScan/100000000/manual_time       14.0 ms         14.0 ms           47
  • After optimization
-----------------------------------------------------------------------------------------
Benchmark                                               Time             CPU   Iterations
-----------------------------------------------------------------------------------------
Groupby/BasicSumScan/1000000/manual_time            0.455 ms        0.472 ms         1393
Groupby/BasicSumScan/10000000/manual_time            8.81 ms         8.82 ms           60
Groupby/BasicSumScan/100000000/manual_time            546 ms          546 ms            1
Groupby/PreSortedSumScan/1000000/manual_time        0.129 ms        0.148 ms         5389
Groupby/PreSortedSumScan/10000000/manual_time       0.901 ms        0.921 ms          769
Groupby/PreSortedSumScan/100000000/manual_time       8.68 ms         8.70 ms           74
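The early-exit idea described above can be illustrated with a minimal, self-contained host-C++ sketch (hypothetical names and plain `std::vector`s, not the actual libcudf code, which operates on device columns): when the caller guarantees the keys are presorted, the grouped scan reads the input directly instead of first materializing a rearranged copy of the values.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Hedged sketch of the early-exit optimization (hypothetical API, not cudf's):
// a grouped cumulative sum over (keys, values). When keys_are_presorted is
// true, we skip the sort/gather step that would materialize a reordered copy
// of `values` and scan runs of equal keys directly over the input.
std::vector<int> grouped_sum_scan(std::vector<int> keys,
                                  std::vector<int> values,
                                  bool keys_are_presorted)
{
  if (!keys_are_presorted) {
    // Slow path: stable-sort (key, value) pairs by key. This is the path
    // that materializes rearranged copies of the keys and values.
    std::vector<std::size_t> order(keys.size());
    std::iota(order.begin(), order.end(), 0);
    std::stable_sort(order.begin(), order.end(),
                     [&](std::size_t a, std::size_t b) { return keys[a] < keys[b]; });
    std::vector<int> k(keys.size()), v(values.size());
    for (std::size_t i = 0; i < order.size(); ++i) {
      k[i] = keys[order[i]];
      v[i] = values[order[i]];
    }
    keys   = std::move(k);
    values = std::move(v);
  }
  // Early exit for presorted input: scan each run of equal keys in place.
  std::vector<int> out(values.size());
  for (std::size_t i = 0; i < values.size(); ++i) {
    out[i] = values[i];
    if (i > 0 && keys[i] == keys[i - 1]) out[i] += out[i - 1];
  }
  return out;
}
```

In the presorted case the only allocation is the output column itself, which is the reduced memory footprint the PR description refers to.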

@PointKernel PointKernel added feature request New feature or request 2 - In Progress Currently a work in progress libcudf Affects libcudf (C++/CUDA) code. Performance Performance related issue tech debt non-breaking Non-breaking change labels Nov 22, 2021
@PointKernel PointKernel self-assigned this Nov 22, 2021
codecov bot commented Nov 23, 2021

Codecov Report

Merging #9754 (d6959f2) into branch-22.02 (967a333) will decrease coverage by 0.06%.
The diff coverage is n/a.


@@               Coverage Diff                @@
##           branch-22.02    #9754      +/-   ##
================================================
- Coverage         10.49%   10.42%   -0.07%     
================================================
  Files               119      119              
  Lines             20305    20471     +166     
================================================
+ Hits               2130     2134       +4     
- Misses            18175    18337     +162     
Impacted Files Coverage Δ
python/custreamz/custreamz/kafka.py 29.16% <0.00%> (-0.63%) ⬇️
python/dask_cudf/dask_cudf/sorting.py 92.30% <0.00%> (-0.61%) ⬇️
python/cudf/cudf/__init__.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/frame.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/index.py 0.00% <0.00%> (ø)
python/cudf/cudf/io/parquet.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/series.py 0.00% <0.00%> (ø)
python/cudf/cudf/utils/utils.py 0.00% <0.00%> (ø)
python/cudf/cudf/utils/dtypes.py 0.00% <0.00%> (ø)
python/cudf/cudf/utils/ioutils.py 0.00% <0.00%> (ø)
... and 19 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 61199ea...d6959f2.

github-actions bot commented

This PR has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this PR if it is no longer required. Otherwise, please respond with a comment indicating any updates. This PR will be labeled inactive-90d if there is no activity in the next 60 days.

revans2 (Contributor) commented Jan 5, 2022

This would be nice to have work, as we use this in a number of areas.

@PointKernel PointKernel added 3 - Ready for Review Ready for review by team and removed 2 - In Progress Currently a work in progress labels Jan 6, 2022
@PointKernel PointKernel marked this pull request as ready for review January 6, 2022 23:18
@PointKernel PointKernel requested a review from a team as a code owner January 6, 2022 23:18
vuule (Contributor) left a comment

Looks good, just have an idea to potentially make this even simpler.

Review thread on cpp/src/groupby/sort/functors.hpp (resolved)
vuule (Contributor) commented Jan 7, 2022

Is there a benchmark that shows the perf benefit?

devavret (Contributor) commented Jan 7, 2022

I think the original plan to short-circuit in the sort_helper was better. It is also used in rolling; otherwise, we'd have to fix it again for rolling.

Would it be possible or would it require a bigger refactor?

PointKernel (Member, Author) commented Jan 7, 2022

I think the original plan to short-circuit in the sort_helper was better. It is also used in rolling; otherwise, we'd have to fix it again for rolling.

Would it be possible or would it require a bigger refactor?

@devavret It's definitely possible, and it actually involves fewer code changes than the current refactor. The question is really whether we want to materialize sorted_values/grouped_values or not. A short circuit in sort_helper would materialize/memoize those two, while the current refactor won't (thus a smaller memory footprint).
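The trade-off being discussed can be sketched in a few lines of self-contained C++ (hypothetical types, not cudf's actual sort_helper): a helper that memoizes an owning, grouped copy of the values still holds that allocation for later reuse, whereas the early exit in the scan functor never requests it at all.

```cpp
#include <optional>
#include <vector>

// Hedged sketch of the short-circuit-in-sort_helper alternative (hypothetical
// types): the helper lazily materializes and caches (memoizes) a grouped copy
// of the values. With the early-exit design, the scan functor reads the input
// directly and this cache is never populated, so no extra copy is held.
struct sort_helper {
  std::vector<int> const* input;            // non-owning view of the values
  std::optional<std::vector<int>> grouped;  // memoized owning copy

  // Returns grouped values, materializing and caching them on first use.
  // For presorted input the "grouping" is just a copy of the input.
  std::vector<int> const& grouped_values()
  {
    if (!grouped) grouped = *input;
    return *grouped;
  }

  bool materialized() const { return grouped.has_value(); }
};
```

The memory-footprint difference in the comment above corresponds to whether `grouped` ever becomes populated: memoization keeps the owning copy alive for reuse, while the early exit avoids it entirely.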

devavret (Contributor) commented Jan 7, 2022

A short circuit in sort_helper would materialize/memoize those two, while the current refactor won't (thus a smaller memory footprint).

You can memoize both the view and the owning column. Anyway, I now think your method would be cleaner.

@github-actions github-actions bot added the CMake CMake build issue label Jan 7, 2022
PointKernel (Member, Author) commented Jan 7, 2022

@vuule I've just added the benchmark results. This PR brings a roughly 1.6x speedup for presorted scans.

@PointKernel PointKernel requested a review from vuule January 7, 2022 20:32
Review thread on cpp/benchmarks/groupby/group_scan_benchmark.cu (outdated, resolved)
BENCHMARK_DEFINE_F(Groupby, BasicSumScan)(::benchmark::State& state) { BM_basic_sum_scan(state); }
A contributor left a comment

Using nvbench is preferred over googlebench for any new benchmarks.
We want to move all our benchmarks to nvbench.
If it's not much effort, please use nvbench here.

PointKernel (Member, Author) replied Jan 10, 2022

nvbench requires target kernels to take an explicit stream (see here), but a CUDA stream is not exposed in the public groupby::scan API. What I did in JOIN_NVBENCH was to expose stream arguments in the hash join APIs. Do we want to do the same for groupby::scan? @jrhemstad Any comments?

Contributor replied

I'd rather do this: NVIDIA/nvbench#13

PointKernel (Member, Author) replied

OK. I will leave the new scan benchmark as gbench for this PR and look into NVIDIA/nvbench#13.
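The stream-exposure pattern mentioned in this thread can be sketched in self-contained C++ (all names here are hypothetical stand-ins, not cudf's or nvbench's real API): the public entry point gains an explicit stream parameter with a default, so existing callers are unaffected while a benchmark harness can supply its own stream.

```cpp
#include <string>

// Hedged sketch of exposing an explicit stream on a public API (hypothetical
// names). `stream_view` stands in for a real CUDA stream handle such as a
// non-owning stream view type; id 0 plays the role of the default stream.
struct stream_view { int id = 0; };

namespace detail {
// The detail layer always takes an explicit stream.
inline std::string scan(stream_view stream)
{
  return "scan on stream " + std::to_string(stream.id);
}
}  // namespace detail

// Public entry point: the stream parameter defaults to the default stream,
// so existing callers compile unchanged, while benchmarks can pass their own.
inline std::string scan(stream_view stream = {})
{
  return detail::scan(stream);
}
```

A defaulted trailing parameter like this is one way to keep the public API source-compatible; the alternative discussed in NVIDIA/nvbench#13 is to avoid requiring the explicit stream on the benchmarked API at all.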

PointKernel (Member, Author) commented

@gpucibot merge

@rapids-bot rapids-bot bot merged commit cee55fd into rapidsai:branch-22.02 Jan 10, 2022
@PointKernel PointKernel deleted the optimize-groupby-scan branch May 26, 2022 17:43
Labels
3 - Ready for Review Ready for review by team CMake CMake build issue feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change Performance Performance related issue
Development

Successfully merging this pull request may close these issues.

[FEA] Optimizations for groupby::scan
6 participants