[FEA] Improve cudf::distinct with cuco reduction map
#13157
Labels: feature request, libcudf, Performance
Is your feature request related to a problem? Please describe.
#11052 introduces the keep control option into cudf::distinct and makes it possible for users to perform a more efficient hash-based drop_duplicates. The PR uses a single hash map together with thrust algorithms to mimic the behavior of a reduction map. This whole process can be largely simplified once NVIDIA/cuCollections#98 is ready.
TODO: static_map + thrust algos with cuco::static_reduction_map + cudf::sort
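For context, a reduction map differs from an ordinary hash map in that inserting a key that is already present combines the stored value with the new one through a binary operator. The host-side sketch below models that behavior with std::unordered_map. It is not cuco or libcudf code: distinct_indices and keep_option are hypothetical names, rows are reduced to single int values for brevity, and the final sort is only there to make the output order deterministic.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for the keep option; not the libcudf enum.
enum class keep_option { first, last };

// Host-side model of insert-or-reduce: the key is the row value, the mapped
// value is a row index, and duplicates are merged with min (keep first) or
// max (keep last) instead of being rejected.
std::vector<std::size_t> distinct_indices(std::vector<int> const& rows, keep_option keep)
{
  std::unordered_map<int, std::size_t> reduced;
  for (std::size_t i = 0; i < rows.size(); ++i) {
    auto res = reduced.emplace(rows[i], i);  // no-op if the key already exists
    if (!res.second) {
      std::size_t& stored = res.first->second;
      stored = (keep == keep_option::first) ? std::min(stored, i) : std::max(stored, i);
    }
  }

  // Collect one surviving index per distinct row.
  std::vector<std::size_t> out;
  out.reserve(reduced.size());
  for (auto const& kv : reduced) { out.push_back(kv.second); }
  assert(out.size() <= rows.size());  // at most one survivor per input row

  std::sort(out.begin(), out.end());  // deterministic output order
  return out;
}
```

Calling distinct_indices on the rows {3, 1, 3, 2, 1} keeps one index for each of the three distinct values; which index survives for the duplicated values depends on the keep argument.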
Describe the solution you'd like
Use a cuco::static_reduction_map where the key is the row index and the value is the min/max index of equivalent rows (depending on the keep option).
Describe alternatives you've considered
We could also use a pair of (row hash value, row index) as the key, so that the expensive row hash is computed only once per row, which should improve runtime performance. The wider key requires more memory, though. To be evaluated.
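A minimal host-side sketch of this alternative, under stated assumptions: the key carries a precomputed hash next to the row index, the hasher returns the cached value, and the equality check compares row contents only when the cached hashes match. The names hashed_row and distinct_first_cached_hash are hypothetical, rows are modelled as strings, and std::unordered_map stands in for the device-side map, so this illustrates the idea rather than the eventual implementation.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical key type: the row hash is computed once and stored with the index.
// It is twice as wide as a bare index, which is the extra memory cost.
struct hashed_row {
  std::size_t hash;
  std::size_t index;
};

// Returns the first index of every distinct row (keep-first semantics only).
std::vector<std::size_t> distinct_first_cached_hash(std::vector<std::string> const& rows)
{
  // The hasher only reads the cached value; it never rehashes the row data.
  auto hasher = [](hashed_row const& k) { return k.hash; };
  // Cheap rejection on the cached hash, full row comparison only on a match.
  auto equal = [&rows](hashed_row const& a, hashed_row const& b) {
    return a.hash == b.hash && rows[a.index] == rows[b.index];
  };
  std::unordered_map<hashed_row, std::size_t, decltype(hasher), decltype(equal)> reduced(
    rows.size(), hasher, equal);

  for (std::size_t i = 0; i < rows.size(); ++i) {
    hashed_row key{std::hash<std::string>{}(rows[i]), i};  // only hash computation for row i
    auto res = reduced.emplace(key, i);
    if (!res.second) { res.first->second = std::min(res.first->second, i); }
  }

  std::vector<std::size_t> out;
  out.reserve(reduced.size());
  for (auto const& kv : reduced) { out.push_back(kv.second); }
  assert(out.size() <= rows.size());  // at most one survivor per input row

  std::sort(out.begin(), out.end());  // deterministic output order
  return out;
}
```

The trade-off mentioned above is visible in the key type: every slot stores a hash in addition to the index, in exchange for never recomputing a row hash during probing or rehashing.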
Additional context
#11656 may not be required by the new reduction map implementation.