
Refactor dictionary support for reductions any/all #7242

Merged
merged 10 commits into rapidsai:branch-0.19 from compile-reduction-anyall on Feb 8, 2021

Conversation

davidwendt
Contributor

Here are the current top 10 compile-time offenders:

2107683 CMakeFiles/cudf_reductions.dir/src/reductions/any.cu.o
2106547 CMakeFiles/cudf_reductions.dir/src/reductions/all.cu.o
1538794 CMakeFiles/cudf_reductions.dir/src/reductions/sum_of_squares.cu.o
1522533 CMakeFiles/cudf_reductions.dir/src/reductions/product.cu.o
1519147 CMakeFiles/cudf_reductions.dir/src/reductions/sum.cu.o
1188127 CMakeFiles/cudf_base.dir/src/groupby/sort/group_sum.cu.o
1006601 CMakeFiles/cudf_base.dir/src/groupby/hash/groupby.cu.o
 789776 CMakeFiles/cudf_reductions.dir/src/reductions/mean.cu.o
 651817 CMakeFiles/cudf_join.dir/src/join/semi_join.cu.o
 539513 CMakeFiles/cudf_hash.dir/src/hash/hashing.cu.o
...

Times are in milliseconds, so any.cu and all.cu take about 35 minutes each to build on my machine with CUDA 10.1.

The times have increased with the addition of dictionary and fixed-point type support. The large times are directly related to aggressive inlining of the iterators in cub::DeviceReduce::Reduce, which is used by all the reduction aggregations. For simple iterators this is not an issue, but the dictionary iterator is more complex since it must type-dispatch the indices and then access the keys data. The resulting code is very fast but causes long compile times when used by CUB Reduce.

This PR creates new specialization logic for dictionary columns to call thrust::all_of for all() and thrust::any_of for any() instead of CUB Reduce. This reduces the compile time significantly with little effect on the runtime. In fact, the thrust algorithms appear to have an early-exit behavior which can be faster than a generic reduce depending on the data.
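
To illustrate the shape of that specialization, here is a minimal, hypothetical sketch using plain Thrust over raw keys/indices arrays rather than the actual libcudf column classes (note that, per the review discussion below, the merged version ended up using thrust::for_each_n with atomics instead):

```cpp
// Hypothetical sketch only -- not the libcudf implementation. The idea is to
// walk the dictionary indices, look up each key value, and let the Thrust
// logical algorithm decide the result instead of going through cub::DeviceReduce.
// Requires nvcc with --extended-lambda for the device lambda.
#include <cstdint>
#include <thrust/device_vector.h>
#include <thrust/execution_policy.h>
#include <thrust/logical.h>

// keys: the dictionary's distinct values; indices: one key offset per row
bool dictionary_any_via_any_of(thrust::device_vector<int32_t> const& keys,
                               thrust::device_vector<int32_t> const& indices)
{
  auto const keys_ptr = keys.data().get();
  return thrust::any_of(thrust::device,
                        indices.begin(),
                        indices.end(),
                        [keys_ptr] __device__(int32_t key_index) {
                          return keys_ptr[key_index] != 0;  // true if this row's value is non-zero
                        });
}
```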

The compile time for any.cu and all.cu is now around 3 minutes each.

Also in this PR, I've changed the dictionary_pair_iterator to convert the has_nulls template parameter to a runtime parameter. This adds very little overhead to the iterator but improves the compile time for all the other reductions source files. A more general process for applying this to other iterators and operators is mentioned in #6952. The compile time for the other reductions source files is now about half the original time.
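
As a rough illustration of that template-to-runtime conversion (hypothetical names, not the actual libcudf iterator), the null check becomes a cheap runtime branch inside the accessor, so the surrounding reduction code is instantiated only once instead of once per has_nulls value:

```cpp
// Hypothetical sketch of the pattern, not the real dictionary_pair_iterator.
#include <cstdint>
#include <thrust/pair.h>

struct pair_accessor {
  int32_t const* values;  // element values
  bool const* validity;   // per-row validity flags (only read when has_nulls is true)
  bool has_nulls;         // runtime flag replacing a has_nulls template parameter

  // Returns {value, is_valid} for row i; invalid rows yield a default value.
  __device__ thrust::pair<int32_t, bool> operator()(int32_t i) const
  {
    bool const valid = !has_nulls || validity[i];  // runtime branch instead of a template
    return {valid ? values[i] : int32_t{0}, valid};
  }
};
```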

Finally, this PR includes gbenchmarks for dictionary columns in reduction operations. These were necessary to measure how changes impacted the runtime.

@davidwendt added the 3 - Ready for Review, libcudf, improvement, and non-breaking labels on Jan 28, 2021
@davidwendt davidwendt requested a review from a team as a code owner January 28, 2021 16:44
@davidwendt davidwendt self-assigned this Jan 28, 2021
@raydouglass
Member

rerun tests

@davidwendt
Contributor Author

rerun tests

@jrhemstad
Contributor

Thrust any_of and all_of are very slow: https://github.com/NVIDIA/thrust/issues/1016

If using CUB directly is a problem, then I'd suggest using thrust::reduce directly.

@davidwendt
Contributor Author

thrust::reduce has the same compile-time issue as CUB DeviceReduce; it likely just calls CUB DeviceReduce anyway. The main problem appears to be the aggressive inlining of the iterator logic. Unfortunately, globally reducing the inlining significantly affects the performance of reductions for the other column types, so I was trying to find a solution that did not use DeviceReduce -- just for dictionary.

It turns out the anyall_benchmark I created was not representative, because thrust::all_of and thrust::any_of appeared faster than DeviceReduce, especially for large columns. I found this was because the data I used was randomly generated and therefore always contained both zero and non-zero rows. This meant thrust::any_of and thrust::all_of could exit early, while the reduce functions processed every row regardless. Changing the data to force thrust::any_of (only zeros) and thrust::all_of (only non-zeros) to process every row showed they were in fact about 4x slower than DeviceReduce.
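
For reference, forcing that worst case is purely a matter of the input data; a trivial sketch of what the benchmark inputs look like under that change (hypothetical, not the gbenchmark code itself):

```cpp
// Hypothetical sketch of the worst-case benchmark inputs described above.
#include <cstddef>
#include <cstdint>
#include <thrust/device_vector.h>

int main()
{
  std::size_t const num_rows = 100'000'000;
  // any(): all zeros, so thrust::any_of can never exit early
  thrust::device_vector<int32_t> any_worst_case(num_rows, 0);
  // all(): all non-zeros, so thrust::all_of can never exit early
  thrust::device_vector<int32_t> all_worst_case(num_rows, 1);
  return 0;
}
```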

Fortunately, I was able to find another solution for dictionary that does not use DeviceReduce, so it is still faster to compile and now also runs almost 2x faster. This uses a simple functor with thrust::for_each_n and sets the result using atomic operations. I'm sure this could be further improved with some warp shuffle operations, which may come in a future PR.
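
A minimal sketch of that for_each_n-plus-atomics idea, again with hypothetical names and raw arrays rather than the actual libcudf code:

```cpp
// Hypothetical sketch only -- not the merged libcudf code. Every thread checks
// one row and ORs its truth value into a single device flag, avoiding
// cub::DeviceReduce entirely. Requires nvcc with --extended-lambda.
#include <cstdint>
#include <thrust/device_vector.h>
#include <thrust/execution_policy.h>
#include <thrust/for_each.h>
#include <thrust/iterator/counting_iterator.h>

bool dictionary_any_via_for_each(thrust::device_vector<int32_t> const& keys,
                                 thrust::device_vector<int32_t> const& indices)
{
  thrust::device_vector<int> result(1, 0);  // 0 == false
  auto const keys_ptr    = keys.data().get();
  auto const indices_ptr = indices.data().get();
  auto const result_ptr  = result.data().get();
  thrust::for_each_n(thrust::device,
                     thrust::counting_iterator<int32_t>(0),
                     indices.size(),
                     [=] __device__(int32_t row) {
                       if (keys_ptr[indices_ptr[row]] != 0) { atomicOr(result_ptr, 1); }
                     });
  return result[0] != 0;  // one device-to-host copy of the flag
}
```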

Also, I did try this approach for fixed-width types, but DeviceReduce is still faster there, so I'm only using it for dictionary columns.

@davidwendt
Contributor Author

Some benchmark results for dictionary columns.

reduction | row count | baseline (ms) | this PR (ms) | factor
--------- | --------- | ------------- | ------------ | ------
all       | 1M        | 47            | 32           | 1.5x
all       | 10M       | 182           | 106          | 1.7x
all       | 100M      | 1565          | 822          | 1.9x
any       | 1M        | 48            | 31           | 1.5x
any       | 10M       | 180           | 109          | 1.7x
any       | 100M      | 1557          | 852          | 1.8x

The dictionary keys type for these results is int32_t.
The times are from my Ubuntu 18.04 Linux desktop with NVIDIA driver 455.45, built with CUDA 10.1 and running on a Quadro GV100.

@davidwendt
Contributor Author

rerun tests

@codecov

codecov bot commented Feb 4, 2021

Codecov Report

❗ No coverage uploaded for pull request base (branch-0.19@8215b5b).
The diff coverage is n/a.

@@              Coverage Diff               @@
##             branch-0.19    #7242   +/-   ##
==============================================
  Coverage               ?   82.22%           
==============================================
  Files                  ?      100           
  Lines                  ?    16969           
  Branches               ?        0           
==============================================
  Hits                   ?    13953           
  Misses                 ?     3016           
  Partials               ?        0           


@nvdbaranec
Contributor

Nice work! Love to see these files get tamed :)

@mythrocks
Contributor

> Thrust any_of and all_of are very slow: NVIDIA/cccl#720

Would this hold for none_of as well? It would have to, I'm guessing.

@jrhemstad
Contributor

> Thrust any_of and all_of are very slow: NVIDIA/cccl#720
>
> Would this hold for none_of as well? It would have to, I'm guessing.

Yes, any of the logical algorithms.

@davidwendt
Contributor Author

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 366573d into rapidsai:branch-0.19 Feb 8, 2021
@davidwendt davidwendt deleted the compile-reduction-anyall branch February 8, 2021 22:16