[REVIEW] Implement cudf::label_bins() #7554
…intervals (stop assuming contiguity).
Co-authored-by: David <[email protected]>
I think it's better to be explicit here. Any choice we make would feel arbitrary, so it is better to make the caller be explicit.
I'm unsure about adding the cudf/binning directory. Do we expect other "binning" functionality? Maybe histograms?
That said, I don't have a great suggestion of where else to put it. Maybe cudf/transform?
That's probably a question for @harrism or @kkraus14. The only other related pandas function I'm aware of is Even if we did choose to implement |
TLDR: I think that the group for this algorithm should be called "labeling". I also think we should consider renaming it to
First, look at the file https://github.com/rapidsai/cudf/blob/branch-0.19/cpp/include/doxygen_groups.h and see if this new algorithm fits into one of those algorithmic groups. My first reaction was that building a histogram belongs in the "reordering" group, like sorted_order. But then I read your documentation and realized that this doesn't actually reorder; it just labels each element with its bin ID. This output can't even be directly used to reorder into bins. To do that you would have to use the bin labels as a key in a key-index sort and then use the sorted indices to gather into the new order. So it's not reordering.
My next thought is that this is similar to the new (upcoming) join APIs, which will just return a gather mask. But it's not, because it doesn't return a gather mask, just something that could be used (see above) to compute a gather mask. Even so, this bin membership does sort of look like a type of join.
But in general, when I look at an algorithm I try to generalize what it does, so that when we add other algorithms that also do that thing we can group them. It feels like this is a labeling algorithm. And it doesn't actually "bin" the inputs (it doesn't reorder them or help you reorder them); it simply labels the elements by their bin IDs. So I think in libcudf's tradition of explicit naming and generic, participle-based group naming,
Another way to think of it is that it's just a vectorized lower_bound -- for each element, it finds the lower_bound of the element in the sorted array of bin edges. The twist is that the bins have explicit left and right edges rather than being contiguous.
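To make the "vectorized lower_bound with explicit edges" description concrete, here is a minimal host-side sketch in plain Python. It is not the libcudf implementation or its API: the function names `label_bins` and `order_by_label`, the `left_inclusive`/`right_inclusive` parameters, and the use of `None` for "no bin" are all assumptions made for illustration, and the sketch assumes the bins are sorted by left edge and do not overlap.

```python
from bisect import bisect_left, bisect_right


def label_bins(values, left_edges, right_edges,
               left_inclusive=True, right_inclusive=False):
    """Label each value with the index of the bin containing it, or None.

    Hypothetical helper: assumes bins are sorted by left edge and do not
    overlap. Bins do not have to be contiguous.
    """
    labels = []
    for v in values:
        # Binary search for the last bin whose left edge admits v.
        if left_inclusive:
            i = bisect_right(left_edges, v) - 1  # last left edge <= v
        else:
            i = bisect_left(left_edges, v) - 1   # last left edge < v
        # Bins have explicit right edges, so v may fall into a gap.
        in_bin = i >= 0 and (
            v < right_edges[i] or (right_inclusive and v == right_edges[i])
        )
        labels.append(i if in_bin else None)
    return labels


def order_by_label(labels):
    """Key-index sort: indices that group elements by bin label.

    Unlabeled elements (None) are placed last.
    """
    return sorted(
        range(len(labels)),
        key=lambda i: float("inf") if labels[i] is None else labels[i],
    )


values = [7.0, 0.5, 2.0, 3.5, 1.0]
left_edges = [0.0, 3.0, 6.0]
right_edges = [2.0, 5.0, 8.0]  # gaps at [2, 3) and [5, 6)

labels = label_bins(values, left_edges, right_edges)
order = order_by_label(labels)
binned = [values[i] for i in order]  # gather into bin order
```

The labels alone do not reorder anything; grouping the values by bin takes the extra key-index sort and gather shown in the last two lines, which is the distinction the comment above draws between labeling and reordering.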
Please remember to update the PR title since this is what gets copied into the CHANGELOG.md
@gpucibot merge
This PR resolves #7517, implementing a binning feature in libcudf to support pandas.cut in cudf.