Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[REVIEW] Implement cudf::label_bins() #7554

Merged
merged 87 commits into from
Mar 24, 2021

Conversation

vyasr
Copy link
Contributor

@vyasr vyasr commented Mar 10, 2021

This PR resolves #7517, implementing a binning feature in libcudf to support pandas.cut in cudf.

vyasr added 30 commits March 5, 2021 10:32
@jrhemstad
Copy link
Contributor

@jrhemstad should we consider reordering the arguments so that we can give defaults to the two inclusion flags? While we may need this libcudf API to support the Python cudf API, I have to imagine that most users of this C++ API will want contiguous half-open (or perhaps half-closed) bins. On the other hand, this may not be that important if we don't anticipate any significant consumers other than cudf.

I think it's better to be explicit here. Any choice we make would feel arbitrary. Better to make the caller be explicit.

Copy link
Contributor

@jrhemstad jrhemstad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I unsure about adding the cudf/binning directory. Do we expect other "binning" functionality? Maybe histograms?

That said, I don't have a great suggestion of where else to put it. Maybe cudf/transform
?

@vyasr
Copy link
Contributor Author

vyasr commented Mar 22, 2021

I unsure about adding the cudf/binning directory. Do we expect other "binning" functionality? Maybe histograms?

That said, I don't have a great suggestion of where else to put it. Maybe cudf/transform
?

That's probably a question for @harrism or @kkraus14. The only other related pandas function I'm aware of is pandas.qcut, which splits into quantiles, i.e. instead of specifying a number of bins or bin locations the user specifies either a number of percentiles or the percentile boundaries. I don't know if implementing that is a priority or even on the roadmap; offhand I'd guess it's something for which we don't have a timeline but which we might do eventually.

Even if we did choose to implement qcut eventually, depending on how critical performance is we might implement that purely in the Cython/Python layer of cudf rather than in libcudf. Since it requires determining the quantile values first, and pandas has DataFrame.quantile, the simplest implementation of pandas.qcut in cudf would be implementing something like cudf.DataFrame.quantile (AFAIK this doesn't exist yet in cudf) to compute the quantiles and then just calling cudf::bin with the predetermined bin edges. Then we wouldn't need any new libcudf code at all. Whether or not that's the direction we'd actually want to take is something @kkraus14 would have a better idea about.

@harrism
Copy link
Member

harrism commented Mar 22, 2021

TLDR: I think that the group for this algorithm should be called "labeling". I also think we should consider renaming it to label_bins.

First, look at the file https://github.com/rapidsai/cudf/blob/branch-0.19/cpp/include/doxygen_groups.h and see if this new algorithm fits into one of those algorithmic groups.

My first reaction was that building a histogram belongs in the "reordering" group, like sorted_order. But then I read your documentation and realized that this doesn't actually reorder, it just labels the element with their bin ID. This output can't even be directly used to reorder into bins. To do that you would have to use the bin labels as a key in a key-index sort and then use the sorted indices to gather into the new order. So it's not reordering.

My next thought is that this is similar to the new (upcoming) join APIs which will just return a gather mask. But it's not because it doesn't return a gather mask, just something that could be used (see above) to compute a gather mask. Even so, this bin membership does sort of look like a type of join.

But in general, when I look at an algorithm I try to generalize what it does so that when we add other algorithms that also do that thing we can group them. It feels like this is a labeling algorithm. And it doesn't actually "bin" the inputs (doesn't reorder them or help you reorder them), it simply labels the elements by their bin IDs. So I think in libcudf's tradition of explicit naming and generic, participle-based group naming, label_bins would be a good name and labeling would be a good group.

Another way to think of it is that it's just a vectorized lower_bound -- for each element, it finds the lower_bound of the element in the sorted array of bin edges. The twist is that the bins have explicit left and right edges rather than being contiguous.

@vyasr vyasr requested a review from jrhemstad March 23, 2021 22:01
@harrism
Copy link
Member

harrism commented Mar 23, 2021

Please remember to update the PR title since this is what gets copied into the CHANGELOG.md

@jrhemstad jrhemstad changed the title [REVIEW] Implement cudf::bin() [REVIEW] Implement cudf::label_bins() Mar 23, 2021
@jrhemstad
Copy link
Contributor

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 2aa9f5b into rapidsai:branch-0.19 Mar 24, 2021
rapids-bot bot pushed a commit that referenced this pull request Mar 26, 2021
This PR is a follow-up to #7554 and exposes the feature implemented there via Cython for consumption in cudf's Python API.

Authors:
  - Vyas Ramasubramani (@vyasr)

Approvers:
  - Keith Kraus (@kkraus14)

URL: #7724
@vyasr vyasr deleted the feature/libcudf_bin branch January 14, 2022 18:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3 - Ready for Review Ready for review by team CMake CMake build issue feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FEA] Implement cudf::bin
7 participants