[FEA] Improve `cudf::distinct` with cuco reduction map #13157

PointKernel · 2023-04-17T20:43:29Z

Is your feature request related to a problem? Please describe.
#11052 introduces the keep control option into cudf::distinct and makes it possible for users to perform a more efficient hash-based drop_duplicates. The PR uses a single hash map together with thrust algorithms to mimic the behavior of a reduction map. This whole process can be largely simplified once NVIDIA/cuCollections#98 is ready. TODO:

Replace static_map + thrust algos with cuco::static_reduction_map + cudf::sort
Update Python bindings to use the hash-based algorithm
Investigate the performance impact with various map occupancy and sort-based algo v.s. hash-based algo
Minimize memory footprint

Describe the solution you'd like
Uses a cuco::static_reduction_map where the key is the row index and the value is the min/max index of equivalent rows (depending on the keep option).

Describe alternatives you've considered
We could also take a pair of row hash value and row index as the key which performs the expensive row hash computation only once for better runtime performance. This requires more memory footprint though. To be evaluated.

Additional context
#11656 may not be required by the new reduction map implementation.

The text was updated successfully, but these errors were encountered:

PointKernel · 2023-04-17T20:46:50Z

cc @sleeepyjack

bdice · 2023-05-19T02:38:42Z

@PointKernel Let's discuss this -- I'm targeting #11656 for 23.06 and it also affects downstream work like #13244. I am surprised that the new cuco work would eliminate the need for a stable/unstable path here. The stability is with respect to key collection, not key insertion (that's controlled by "keep" behavior). Can you give more context about the ordering of the keys that are used to gather distinct results?

wence- · 2023-11-17T10:45:41Z

@PointKernel Let's discuss this -- I'm targeting #11656 for 23.06 and it also affects downstream work like #13244. I am surprised that the new cuco work would eliminate the need for a stable/unstable path here. The stability is with respect to key collection, not key insertion (that's controlled by "keep" behavior). Can you give more context about the ordering of the keys that are used to gather distinct results?

The current stable_distinct implementation would still be needed I think. This approach using the proposed cuco insert_or_apply interface in NVIDIA/cuCollections#384 (which supersedes NVIDIA/cuCollections#98) would avoid the need for the thrust-based post-processing of the result that is currently done in the keep != ANY case.

wence- · 2023-12-04T11:00:42Z

Just to note, I would like any putative API that exposes the insert_or_apply interface in libcudf to allow user-provided binops. Any value type (within the usual cuco constraints) with monoid structure overlaid should be allowable. This would allow, for example, implementation of value_counts (count multiplicity of unique rows) which is needed in cudf's Index.union implementation (or will be soon), by using (0, +). Right now, we must compute a groupby over the rows and then compute groupby.size() on a dummy column to deduce multiplicity

As noted, the distinct flavours use (MAX_INT, min) (for keep=first), (MIN_INT, max) (for keep=last).

PointKernel added feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. Performance Performance related issue tech debt labels Apr 17, 2023

PointKernel added this to the Refactor using cuco containers milestone Apr 17, 2023

PointKernel mentioned this issue May 2, 2023

[FEA] Refactor hash-based algorithms with new cuco data structures #12261

Open

VibhuJawa mentioned this issue Nov 17, 2023

[FEA] Scalable and optimal dask_cudf based drop duplicates #14439

Closed

vyasr removed the tech debt label Feb 23, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[FEA] Improve `cudf::distinct` with cuco reduction map #13157

[FEA] Improve `cudf::distinct` with cuco reduction map #13157

PointKernel commented Apr 17, 2023 •

edited

Loading

PointKernel commented Apr 17, 2023

bdice commented May 19, 2023 •

edited

Loading

wence- commented Nov 17, 2023

wence- commented Dec 4, 2023

[FEA] Improve cudf::distinct with cuco reduction map #13157

[FEA] Improve cudf::distinct with cuco reduction map #13157

Comments

PointKernel commented Apr 17, 2023 • edited Loading

PointKernel commented Apr 17, 2023

bdice commented May 19, 2023 • edited Loading

wence- commented Nov 17, 2023

wence- commented Dec 4, 2023

[FEA] Improve `cudf::distinct` with cuco reduction map #13157

[FEA] Improve `cudf::distinct` with cuco reduction map #13157

PointKernel commented Apr 17, 2023 •

edited

Loading

bdice commented May 19, 2023 •

edited

Loading