Refactor dictionary support for reductions any/all #7242
rerun tests |
Thrust any_of and all_of are very slow: https://github.com/NVIDIA/thrust/issues/1016
If using CUB directly is a problem, then I'd suggest using thrust::reduce directly.
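To illustrate the suggestion, any() and all() can each be written as a plain reduction: a logical-OR fold starting from false, and a logical-AND fold starting from true. The sketch below is host-only C++ that uses std::accumulate as a stand-in for thrust::reduce; the function names any_by_reduce and all_by_reduce are hypothetical and are not taken from the PR.

```cpp
#include <cstdint>
#include <numeric>
#include <vector>

// any(): fold with logical OR, starting from false.
// Note that a reduction visits every element; it has no early exit.
inline bool any_by_reduce(std::vector<int32_t> const& values) {
  return std::accumulate(values.begin(), values.end(), false,
                         [](bool acc, int32_t x) { return acc || (x != 0); });
}

// all(): fold with logical AND, starting from true.
inline bool all_by_reduce(std::vector<int32_t> const& values) {
  return std::accumulate(values.begin(), values.end(), true,
                         [](bool acc, int32_t x) { return acc && (x != 0); });
}
```

With this formulation an empty input yields false for any() and true for all(), which are the identities of the two operators.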
So It looks like the Fortunately, I was able to find another solution for dictionary that does not use Also, I did try this solution for fixed-width types too and |
Some benchmark results for dictionary columns.
The keys type of the dictionary for the results are |
rerun tests |
Codecov Report
@@ Coverage Diff @@
## branch-0.19 #7242 +/- ##
==============================================
Coverage ? 82.22%
==============================================
Files ? 100
Lines ? 16969
Branches ? 0
==============================================
Hits ? 13953
Misses ? 3016
Partials ? 0
Nice work! Love to see these files get tamed :)
Would this hold for |
Yes, any of the logical algorithms. |
@gpucibot merge |
Here are the current top 10 compile time offenders
Times are in milliseconds, so any.cu and all.cu take 35 minutes each to build on my machine with CUDA 10.1.
The times have increased with the addition of dictionary and fixed-point types support. The large times are directly related to some aggressive inlining of the iterators in the cub::DeviceReduce::Reduce used by all the reduction aggregations. For small iterators, this is not an issue. The dictionary iterator is more complex since it must type-dispatch the indices and then access the keys data. The code is very fast but causes large compile times when used by CUB Reduce.
This PR creates new specialization logic for dictionary columns to call thrust::all_of for all() and thrust::any_of for any() instead of CUB Reduce. This reduces the compile time significantly with little effect on the runtime. In fact, the thrust algorithms appear to have an early-out feature which can be faster than a generic reduce depending on the data.
The compile time for any.cu and all.cu is now around 3 minutes each.
Also in this PR, I've changed the dictionary_pair_iterator to convert the has_nulls template parameter to a runtime parameter. This adds very little overhead to the iterator but improves the compile time for all the other reductions source files. A more general process for applying this to other iterators and operators is mentioned in #6952. The compile time for the other reductions source files is now about half their original time.
Finally, this PR includes gbenchmarks for dictionary columns in reduction operations. These were necessary to measure how changes impacted the runtime.
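The two ideas in this description can be sketched together in host-only C++. This is not the code from the PR: std::any_of and std::all_of stand in for thrust::any_of and thrust::all_of, and the names dictionary_column, row_is_true, row_numbers, dictionary_any and dictionary_all are hypothetical. The sketch assumes that a null row contributes the identity of the operation (false for any(), true for all()).

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

// Hypothetical host-side stand-in for a dictionary column (not the cudf type).
struct dictionary_column {
  std::vector<int32_t> keys;         // distinct values
  std::vector<std::size_t> indices;  // one entry per row, each < keys.size()
  std::vector<bool> valid;           // per-row validity; empty if there are no nulls
};

// Row predicate with has_nulls held as a runtime member rather than a
// template parameter, so each algorithm using it is instantiated only once.
struct row_is_true {
  dictionary_column const* col;
  bool has_nulls;      // runtime flag
  bool null_identity;  // what a null row contributes: false for any(), true for all()

  bool operator()(std::size_t row) const {
    if (has_nulls && !col->valid[row]) return null_identity;
    return col->keys[col->indices[row]] != 0;  // decode the index, then test the key
  }
};

// Row numbers 0..n-1; device code would more likely use a counting iterator.
inline std::vector<std::size_t> row_numbers(dictionary_column const& col) {
  std::vector<std::size_t> rows(col.indices.size());
  std::iota(rows.begin(), rows.end(), std::size_t{0});
  return rows;
}

// any(): std::any_of can return as soon as one row is true (early-out).
inline bool dictionary_any(dictionary_column const& col) {
  auto const rows = row_numbers(col);
  return std::any_of(rows.begin(), rows.end(),
                     row_is_true{&col, !col.valid.empty(), false});
}

// all(): std::all_of can return as soon as one row is false.
inline bool dictionary_all(dictionary_column const& col) {
  auto const rows = row_numbers(col);
  return std::all_of(rows.begin(), rows.end(),
                     row_is_true{&col, !col.valid.empty(), true});
}
```

The trade-off is the one described above: the runtime has_nulls check costs one extra branch per row, while a template parameter would require a separate instantiation of every algorithm for each value of the flag.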