
Refactor dictionary support for reductions any/all #7242

Merged
merged 10 commits into rapidsai:branch-0.19 from compile-reduction-anyall on Feb 8, 2021

Conversation

davidwendt
Contributor

Here are the current top 10 compile-time offenders:

2107683 CMakeFiles/cudf_reductions.dir/src/reductions/any.cu.o
2106547 CMakeFiles/cudf_reductions.dir/src/reductions/all.cu.o
1538794 CMakeFiles/cudf_reductions.dir/src/reductions/sum_of_squares.cu.o
1522533 CMakeFiles/cudf_reductions.dir/src/reductions/product.cu.o
1519147 CMakeFiles/cudf_reductions.dir/src/reductions/sum.cu.o
1188127 CMakeFiles/cudf_base.dir/src/groupby/sort/group_sum.cu.o
1006601 CMakeFiles/cudf_base.dir/src/groupby/hash/groupby.cu.o
 789776 CMakeFiles/cudf_reductions.dir/src/reductions/mean.cu.o
 651817 CMakeFiles/cudf_join.dir/src/join/semi_join.cu.o
 539513 CMakeFiles/cudf_hash.dir/src/hash/hashing.cu.o
...

Times are in milliseconds, so any.cu and all.cu take about 35 minutes each to build on my machine with CUDA 10.1.

The times have increased with the addition of dictionary and fixed-point type support. The large times are directly related to aggressive inlining of the iterators in cub::DeviceReduce::Reduce, which is used by all the reduction aggregations. For simple iterators this is not an issue, but the dictionary iterator is more complex since it must type-dispatch the indices and then access the keys data. The resulting code is very fast but causes long compile times when used by CUB Reduce.

This PR creates new specialization logic for dictionary columns to call thrust::all_of for all() and thrust::any_of for any() instead of CUB Reduce. This reduces the compile time significantly with little effect on the runtime. In fact, the thrust algorithms appear to have an early-exit behavior which can be faster than a generic reduce depending on the data.
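
To illustrate the shape of that specialization, here is a minimal, hypothetical sketch using plain Thrust over raw keys/indices arrays rather than the actual libcudf column classes (note that, per the review discussion below, the merged version ended up using thrust::for_each_n with atomics instead):

```cpp
// Hypothetical sketch only -- not the libcudf implementation. The idea is to
// walk the dictionary indices, look up each key value, and let the Thrust
// logical algorithm decide the result instead of going through cub::DeviceReduce.
// Requires nvcc with --extended-lambda for the device lambda.
#include <cstdint>
#include <thrust/device_vector.h>
#include <thrust/execution_policy.h>
#include <thrust/logical.h>

// keys: the dictionary's distinct values; indices: one key offset per row
bool dictionary_any_via_any_of(thrust::device_vector<int32_t> const& keys,
                               thrust::device_vector<int32_t> const& indices)
{
  auto const keys_ptr = keys.data().get();
  return thrust::any_of(thrust::device,
                        indices.begin(),
                        indices.end(),
                        [keys_ptr] __device__(int32_t key_index) {
                          return keys_ptr[key_index] != 0;  // true if this row's value is non-zero
                        });
}
```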

The compile time for any.cu and all.cu is now around 3 minutes each.

Also in this PR, I've changed the dictionary_pair_iterator to convert the has_nulls template parameter to a runtime parameter. This adds very little overhead to the iterator but improves the compile time for all the other reductions source files. A more general process for applying this to other iterators and operators is mentioned in #6952. The compile time for the other reductions source files is now about half the original time.
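
As a rough illustration of that template-to-runtime conversion (hypothetical names, not the actual libcudf iterator), the null check becomes a cheap runtime branch inside the accessor, so the surrounding reduction code is instantiated only once instead of once per has_nulls value:

```cpp
// Hypothetical sketch of the pattern, not the real dictionary_pair_iterator.
#include <cstdint>
#include <thrust/pair.h>

struct pair_accessor {
  int32_t const* values;  // element values
  bool const* validity;   // per-row validity flags (only read when has_nulls is true)
  bool has_nulls;         // runtime flag replacing a has_nulls template parameter

  // Returns {value, is_valid} for row i; invalid rows yield a default value.
  __device__ thrust::pair<int32_t, bool> operator()(int32_t i) const
  {
    bool const valid = !has_nulls || validity[i];  // runtime branch instead of a template
    return {valid ? values[i] : int32_t{0}, valid};
  }
};
```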

Finally, this PR includes gbenchmarks for dictionary columns in reduction operations. These were necessary to measure how changes impacted the runtime.

@davidwendt added the 3 - Ready for Review, libcudf, improvement, and non-breaking labels on Jan 28, 2021
@davidwendt davidwendt requested a review from a team as a code owner January 28, 2021 16:44
@davidwendt davidwendt self-assigned this Jan 28, 2021
@raydouglass
Member

rerun tests

@davidwendt
Contributor Author

rerun tests

@jrhemstad
Contributor

Thrust any_of and all_of are very slow: https://github.com/NVIDIA/thrust/issues/1016

If using CUB directly is a problem, then I'd suggest using thrust::reduce directly.

@davidwendt
Contributor Author

thrust::reduce has the same compile-time issue as CUB DeviceReduce; it likely just calls CUB DeviceReduce anyway. The main problem appears to be the aggressive inlining of the iterator logic. Unfortunately, globally reducing the inlining significantly affects the performance of reductions for the other column types, so I was trying to find a solution that did not use DeviceReduce -- just for dictionary.

It turns out the anyall_benchmark I created was not representative, because thrust::all_of and thrust::any_of appeared faster than DeviceReduce, especially for large columns. I found this was because the data I used was randomly generated and therefore always contained both zero and non-zero rows. This meant thrust::any_of and thrust::all_of could exit early, while the reduce functions processed every row regardless. Changing the data to force thrust::any_of (only zeros) and thrust::all_of (only non-zeros) to process every row showed they were in fact about 4x slower than DeviceReduce.
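
For reference, forcing that worst case is purely a matter of the input data; a trivial sketch of what the benchmark inputs look like under that change (hypothetical, not the gbenchmark code itself):

```cpp
// Hypothetical sketch of the worst-case benchmark inputs described above.
#include <cstddef>
#include <cstdint>
#include <thrust/device_vector.h>

int main()
{
  std::size_t const num_rows = 100'000'000;
  // any(): all zeros, so thrust::any_of can never exit early
  thrust::device_vector<int32_t> any_worst_case(num_rows, 0);
  // all(): all non-zeros, so thrust::all_of can never exit early
  thrust::device_vector<int32_t> all_worst_case(num_rows, 1);
  return 0;
}
```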

Fortunately, I was able to find another solution for dictionary that does not use DeviceReduce, so it is still faster to compile and now also runs almost 2x faster. This uses a simple functor with thrust::for_each_n and sets the result using atomic operations. I'm sure this could be further improved with some warp shuffle operations, which may come in a future PR.
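
A minimal sketch of that for_each_n-plus-atomics idea, again with hypothetical names and raw arrays rather than the actual libcudf code:

```cpp
// Hypothetical sketch only -- not the merged libcudf code. Every thread checks
// one row and ORs its truth value into a single device flag, avoiding
// cub::DeviceReduce entirely. Requires nvcc with --extended-lambda.
#include <cstdint>
#include <thrust/device_vector.h>
#include <thrust/execution_policy.h>
#include <thrust/for_each.h>
#include <thrust/iterator/counting_iterator.h>

bool dictionary_any_via_for_each(thrust::device_vector<int32_t> const& keys,
                                 thrust::device_vector<int32_t> const& indices)
{
  thrust::device_vector<int> result(1, 0);  // 0 == false
  auto const keys_ptr    = keys.data().get();
  auto const indices_ptr = indices.data().get();
  auto const result_ptr  = result.data().get();
  thrust::for_each_n(thrust::device,
                     thrust::counting_iterator<int32_t>(0),
                     indices.size(),
                     [=] __device__(int32_t row) {
                       if (keys_ptr[indices_ptr[row]] != 0) { atomicOr(result_ptr, 1); }
                     });
  return result[0] != 0;  // one device-to-host copy of the flag
}
```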

Also, I did try this approach for fixed-width types, but DeviceReduce is still faster there, so I'm only using it for dictionary columns.

@davidwendt
Contributor Author

Some benchmark results for dictionary columns.

reduction | row count | baseline (ms) | this PR (ms) | factor
--------- | --------- | ------------- | ------------ | ------
all       | 1M        | 47            | 32           | 1.5x
all       | 10M       | 182           | 106          | 1.7x
all       | 100M      | 1565          | 822          | 1.9x
any       | 1M        | 48            | 31           | 1.5x
any       | 10M       | 180           | 109          | 1.7x
any       | 100M      | 1557          | 852          | 1.8x

The dictionary keys type for these results is int32_t.
The times are from my Ubuntu 18.04 Linux desktop with NVIDIA driver 455.45, built with CUDA 10.1 and running on a Quadro GV100.

@davidwendt
Contributor Author

rerun tests

@codecov

codecov bot commented Feb 4, 2021

Codecov Report

❗ No coverage uploaded for pull request base (branch-0.19@8215b5b).
The diff coverage is n/a.

@@              Coverage Diff               @@
##             branch-0.19    #7242   +/-   ##
==============================================
  Coverage               ?   82.22%           
==============================================
  Files                  ?      100           
  Lines                  ?    16969           
  Branches               ?        0           
==============================================
  Hits                   ?    13953           
  Misses                 ?     3016           
  Partials               ?        0           


@nvdbaranec
Contributor

Nice work! Love to see these files get tamed :)

@mythrocks
Contributor

> Thrust any_of and all_of are very slow: NVIDIA/cccl#720

Would this hold for none_of as well? It would have to, I'm guessing.

@jrhemstad
Contributor

> Thrust any_of and all_of are very slow: NVIDIA/cccl#720
>
> Would this hold for none_of as well? It would have to, I'm guessing.

Yes, any of the logical algorithms.

@davidwendt
Contributor Author

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 366573d into rapidsai:branch-0.19 Feb 8, 2021
@davidwendt davidwendt deleted the compile-reduction-anyall branch February 8, 2021 22:16