Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEA] Improve cudf::distinct with cuco reduction map #13157

Open
4 tasks
PointKernel opened this issue Apr 17, 2023 · 4 comments
Open
4 tasks

[FEA] Improve cudf::distinct with cuco reduction map #13157

PointKernel opened this issue Apr 17, 2023 · 4 comments
Labels
feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. Performance Performance related issue

Comments

@PointKernel
Copy link
Member

PointKernel commented Apr 17, 2023

Is your feature request related to a problem? Please describe.
#11052 introduces the keep control option into cudf::distinct and makes it possible for users to perform a more efficient hash-based drop_duplicates. The PR uses a single hash map together with thrust algorithms to mimic the behavior of a reduction map. This whole process can be largely simplified once NVIDIA/cuCollections#98 is ready. TODO:

  • Replace static_map + thrust algos with cuco::static_reduction_map + cudf::sort
  • Update Python bindings to use the hash-based algorithm
  • Investigate the performance impact with various map occupancy and sort-based algo v.s. hash-based algo
  • Minimize memory footprint

Describe the solution you'd like
Uses a cuco::static_reduction_map where the key is the row index and the value is the min/max index of equivalent rows (depending on the keep option).

Describe alternatives you've considered
We could also take a pair of row hash value and row index as the key which performs the expensive row hash computation only once for better runtime performance. This requires more memory footprint though. To be evaluated.

Additional context
#11656 may not be required by the new reduction map implementation.

@PointKernel PointKernel added feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. Performance Performance related issue tech debt labels Apr 17, 2023
@PointKernel
Copy link
Member Author

cc @sleeepyjack

@bdice
Copy link
Contributor

bdice commented May 19, 2023

@PointKernel Let's discuss this -- I'm targeting #11656 for 23.06 and it also affects downstream work like #13244. I am surprised that the new cuco work would eliminate the need for a stable/unstable path here. The stability is with respect to key collection, not key insertion (that's controlled by "keep" behavior). Can you give more context about the ordering of the keys that are used to gather distinct results?

@wence-
Copy link
Contributor

wence- commented Nov 17, 2023

@PointKernel Let's discuss this -- I'm targeting #11656 for 23.06 and it also affects downstream work like #13244. I am surprised that the new cuco work would eliminate the need for a stable/unstable path here. The stability is with respect to key collection, not key insertion (that's controlled by "keep" behavior). Can you give more context about the ordering of the keys that are used to gather distinct results?

The current stable_distinct implementation would still be needed I think. This approach using the proposed cuco insert_or_apply interface in NVIDIA/cuCollections#384 (which supersedes NVIDIA/cuCollections#98) would avoid the need for the thrust-based post-processing of the result that is currently done in the keep != ANY case.

@wence-
Copy link
Contributor

wence- commented Dec 4, 2023

Just to note, I would like any putative API that exposes the insert_or_apply interface in libcudf to allow user-provided binops. Any value type (within the usual cuco constraints) with monoid structure overlaid should be allowable. This would allow, for example, implementation of value_counts (count multiplicity of unique rows) which is needed in cudf's Index.union implementation (or will be soon), by using (0, +). Right now, we must compute a groupby over the rows and then compute groupby.size() on a dummy column to deduce multiplicity

As noted, the distinct flavours use (MAX_INT, min) (for keep=first), (MIN_INT, max) (for keep=last).

@vyasr vyasr removed the tech debt label Feb 23, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. Performance Performance related issue
Projects
None yet
Development

No branches or pull requests

4 participants