
Optimize groupby::scan #9754

Merged: 8 commits merged into rapidsai:branch-22.02 on Jan 10, 2022

Conversation

PointKernel (Member) commented Nov 22, 2021

Closes #8522

This PR gets rid of redundant rearranging processes in groupby::scan if input values are presorted. Instead of a short circuit in sort_helper, it adds an early exit in the scan functor to avoid materializing sorted_values/grouped_values thus reducing memory footprint. This optimization brings a 1.6x speedup for presorted scan operations.

  • Baseline
-----------------------------------------------------------------------------------------
Benchmark                                               Time             CPU   Iterations
-----------------------------------------------------------------------------------------
Groupby/BasicSumScan/1000000/manual_time            0.455 ms        0.472 ms         1388
Groupby/BasicSumScan/10000000/manual_time            8.80 ms         8.81 ms           61
Groupby/BasicSumScan/100000000/manual_time            543 ms          543 ms            1
Groupby/PreSortedSumScan/1000000/manual_time        0.217 ms        0.236 ms         3319
Groupby/PreSortedSumScan/10000000/manual_time        1.45 ms         1.47 ms          479
Groupby/PreSortedSumScan/100000000/manual_time       14.0 ms         14.0 ms           47
  • After optimization
-----------------------------------------------------------------------------------------
Benchmark                                               Time             CPU   Iterations
-----------------------------------------------------------------------------------------
Groupby/BasicSumScan/1000000/manual_time            0.455 ms        0.472 ms         1393
Groupby/BasicSumScan/10000000/manual_time            8.81 ms         8.82 ms           60
Groupby/BasicSumScan/100000000/manual_time            546 ms          546 ms            1
Groupby/PreSortedSumScan/1000000/manual_time        0.129 ms        0.148 ms         5389
Groupby/PreSortedSumScan/10000000/manual_time       0.901 ms        0.921 ms          769
Groupby/PreSortedSumScan/100000000/manual_time       8.68 ms         8.70 ms           74
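The early-exit idea described above can be illustrated with a minimal, self-contained host-C++ sketch (hypothetical names and plain `std::vector`s, not the actual libcudf code, which operates on device columns): when the caller guarantees the keys are presorted, the grouped scan reads the input directly instead of first materializing a rearranged copy of the values.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Hedged sketch of the early-exit optimization (hypothetical API, not cudf's):
// a grouped cumulative sum over (keys, values). When keys_are_presorted is
// true, we skip the sort/gather step that would materialize a reordered copy
// of `values` and scan runs of equal keys directly over the input.
std::vector<int> grouped_sum_scan(std::vector<int> keys,
                                  std::vector<int> values,
                                  bool keys_are_presorted)
{
  if (!keys_are_presorted) {
    // Slow path: stable-sort (key, value) pairs by key. This is the path
    // that materializes rearranged copies of the keys and values.
    std::vector<std::size_t> order(keys.size());
    std::iota(order.begin(), order.end(), 0);
    std::stable_sort(order.begin(), order.end(),
                     [&](std::size_t a, std::size_t b) { return keys[a] < keys[b]; });
    std::vector<int> k(keys.size()), v(values.size());
    for (std::size_t i = 0; i < order.size(); ++i) {
      k[i] = keys[order[i]];
      v[i] = values[order[i]];
    }
    keys   = std::move(k);
    values = std::move(v);
  }
  // Early exit for presorted input: scan each run of equal keys in place.
  std::vector<int> out(values.size());
  for (std::size_t i = 0; i < values.size(); ++i) {
    out[i] = values[i];
    if (i > 0 && keys[i] == keys[i - 1]) out[i] += out[i - 1];
  }
  return out;
}
```

In the presorted case the only allocation is the output column itself, which is the reduced memory footprint the PR description refers to.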

@PointKernel PointKernel added feature request New feature or request 2 - In Progress Currently a work in progress libcudf Affects libcudf (C++/CUDA) code. Performance Performance related issue tech debt non-breaking Non-breaking change labels Nov 22, 2021
@PointKernel PointKernel self-assigned this Nov 22, 2021
codecov bot commented Nov 23, 2021

Codecov Report

Merging #9754 (d6959f2) into branch-22.02 (967a333) will decrease coverage by 0.06%.
The diff coverage is n/a.


@@               Coverage Diff                @@
##           branch-22.02    #9754      +/-   ##
================================================
- Coverage         10.49%   10.42%   -0.07%     
================================================
  Files               119      119              
  Lines             20305    20471     +166     
================================================
+ Hits               2130     2134       +4     
- Misses            18175    18337     +162     
Impacted Files Coverage Δ
python/custreamz/custreamz/kafka.py 29.16% <0.00%> (-0.63%) ⬇️
python/dask_cudf/dask_cudf/sorting.py 92.30% <0.00%> (-0.61%) ⬇️
python/cudf/cudf/__init__.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/frame.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/index.py 0.00% <0.00%> (ø)
python/cudf/cudf/io/parquet.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/series.py 0.00% <0.00%> (ø)
python/cudf/cudf/utils/utils.py 0.00% <0.00%> (ø)
python/cudf/cudf/utils/dtypes.py 0.00% <0.00%> (ø)
python/cudf/cudf/utils/ioutils.py 0.00% <0.00%> (ø)
... and 19 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 61199ea...d6959f2.

github-actions bot commented

This PR has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this PR if it is no longer required. Otherwise, please respond with a comment indicating any updates. This PR will be labeled inactive-90d if there is no activity in the next 60 days.

revans2 (Contributor) commented Jan 5, 2022

This would be nice to have work, as we use this in a number of areas.

@PointKernel PointKernel added 3 - Ready for Review Ready for review by team and removed 2 - In Progress Currently a work in progress labels Jan 6, 2022
@PointKernel PointKernel marked this pull request as ready for review January 6, 2022 23:18
@PointKernel PointKernel requested a review from a team as a code owner January 6, 2022 23:18
vuule (Contributor) left a comment

Looks good, just have an idea to potentially make this even simpler.

Review thread on cpp/src/groupby/sort/functors.hpp (resolved)
vuule (Contributor) commented Jan 7, 2022

Is there a benchmark that shows the perf benefit?

devavret (Contributor) commented Jan 7, 2022

I think the original plan to short-circuit in the sort_helper was better. It is also used in rolling; otherwise, we'd have to fix it again for rolling.

Would it be possible or would it require a bigger refactor?

PointKernel (Member, Author) commented Jan 7, 2022

I think the original plan to short-circuit in the sort_helper was better. It is also used in rolling; otherwise, we'd have to fix it again for rolling.

Would it be possible or would it require a bigger refactor?

@devavret It's definitely possible, and it actually involves fewer code changes than the current refactor. The question is really whether we want to materialize sorted_values/grouped_values or not. A short circuit in sort_helper would materialize/memoize those two, while the current refactor won't (thus a smaller memory footprint).
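The trade-off being discussed can be sketched in a few lines of self-contained C++ (hypothetical types, not cudf's actual sort_helper): a helper that memoizes an owning, grouped copy of the values still holds that allocation for later reuse, whereas the early exit in the scan functor never requests it at all.

```cpp
#include <optional>
#include <vector>

// Hedged sketch of the short-circuit-in-sort_helper alternative (hypothetical
// types): the helper lazily materializes and caches (memoizes) a grouped copy
// of the values. With the early-exit design, the scan functor reads the input
// directly and this cache is never populated, so no extra copy is held.
struct sort_helper {
  std::vector<int> const* input;            // non-owning view of the values
  std::optional<std::vector<int>> grouped;  // memoized owning copy

  // Returns grouped values, materializing and caching them on first use.
  // For presorted input the "grouping" is just a copy of the input.
  std::vector<int> const& grouped_values()
  {
    if (!grouped) grouped = *input;
    return *grouped;
  }

  bool materialized() const { return grouped.has_value(); }
};
```

The memory-footprint difference in the comment above corresponds to whether `grouped` ever becomes populated: memoization keeps the owning copy alive for reuse, while the early exit avoids it entirely.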

devavret (Contributor) commented Jan 7, 2022

A short circuit in sort_helper would materialize/memoize those two, while the current refactor won't (thus a smaller memory footprint).

You can memoize both the view and the owning column. Anyway, I now think your method would be cleaner.

@github-actions github-actions bot added the CMake CMake build issue label Jan 7, 2022
PointKernel (Member, Author) commented Jan 7, 2022

@vuule I've just added the benchmark results. This PR brings a roughly 1.6x speedup for presorted scans.

@PointKernel PointKernel requested a review from vuule January 7, 2022 20:32
Review thread on cpp/benchmarks/groupby/group_scan_benchmark.cu (outdated, resolved)
BENCHMARK_DEFINE_F(Groupby, BasicSumScan)(::benchmark::State& state) { BM_basic_sum_scan(state); }
A contributor left a comment

Using nvbench is preferred over googlebench for any new benchmarks.
We want to move all our benchmarks to nvbench.
If it's not much effort, please use nvbench here.

PointKernel (Member, Author) replied Jan 10, 2022

nvbench requires target kernels to take an explicit stream (see here), but a CUDA stream is not exposed in the public groupby::scan API. What I did in JOIN_NVBENCH was to expose stream arguments in the hash join APIs. Do we want to do the same for groupby::scan? @jrhemstad Any comments?

Contributor replied

I'd rather do this: NVIDIA/nvbench#13

PointKernel (Member, Author) replied

OK. I will leave the new scan benchmark as gbench for this PR and look into NVIDIA/nvbench#13.
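The stream-exposure pattern mentioned in this thread can be sketched in self-contained C++ (all names here are hypothetical stand-ins, not cudf's or nvbench's real API): the public entry point gains an explicit stream parameter with a default, so existing callers are unaffected while a benchmark harness can supply its own stream.

```cpp
#include <string>

// Hedged sketch of exposing an explicit stream on a public API (hypothetical
// names). `stream_view` stands in for a real CUDA stream handle such as a
// non-owning stream view type; id 0 plays the role of the default stream.
struct stream_view { int id = 0; };

namespace detail {
// The detail layer always takes an explicit stream.
inline std::string scan(stream_view stream)
{
  return "scan on stream " + std::to_string(stream.id);
}
}  // namespace detail

// Public entry point: the stream parameter defaults to the default stream,
// so existing callers compile unchanged, while benchmarks can pass their own.
inline std::string scan(stream_view stream = {})
{
  return detail::scan(stream);
}
```

A defaulted trailing parameter like this is one way to keep the public API source-compatible; the alternative discussed in NVIDIA/nvbench#13 is to avoid requiring the explicit stream on the benchmarked API at all.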

PointKernel (Member, Author) commented

@gpucibot merge

@rapids-bot rapids-bot bot merged commit cee55fd into rapidsai:branch-22.02 Jan 10, 2022
@PointKernel PointKernel deleted the optimize-groupby-scan branch May 26, 2022 17:43
Labels
3 - Ready for Review Ready for review by team CMake CMake build issue feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change Performance Performance related issue
Development

Successfully merging this pull request may close these issues.

[FEA] Optimizations for groupby::scan
6 participants