Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEA] Implement cudf::bin #7517

Closed
jrhemstad opened this issue Mar 5, 2021 · 5 comments · Fixed by #7554
Closed

[FEA] Implement cudf::bin #7517

jrhemstad opened this issue Mar 5, 2021 · 5 comments · Fixed by #7554
Assignees
Labels
feature request New feature or request libcudf Affects libcudf (C++/CUDA) code.

Comments

@jrhemstad
Copy link
Contributor

jrhemstad commented Mar 5, 2021

Is your feature request related to a problem? Please describe.

I would like a libcudf API that would allow for supporting the functionality of Pandas cut.

Describe the solution you'd like

There are a few ways that Pandas supports cut, but I think the most generic way we can implement it is something like this:

enum class inclusive{YES, NO};

/**
 * @brief Labels elements based on membership in the specified bins.
 *
 * A bin `i` is defined by `left_edges[i], right_edges[i]`. Whether the edges are inclusive or
 * not is determined by `left_inclusive` and `right_inclusive`, respectively.
 *
 * A value `input[j]` belongs to bin `i` if `value[j]` is contained in the range `left_edges[i],
 * right_edges[i]` (with the specified inclusiveness) and `label[j] == i`. If  `input[j]` does not
 * belong to any bin, then `label[j]` is NULL.
 *
 * If two or more bins overlap, behavior is undefined.
 *
 * NULL elements in `input` belong to no bin and their corresponding label is NULL.
 *
 * @throws if `input.type() == left_edges.type() == right_edges.type()` is violated.
 * @throws if `left_edges.size() != right_edges.size()`
 *
 * @param input The input elements to label according to the specified bins
 * @param left_edges Values of the left edge of each bin
 * @param left_inclusive Whether or not the left edge is inclusive
 * @param right_edges Value of the ridge edge of each bin
 * @param right_inclusive Whether or not the right edge is inclusive
 * @return The labels of the elements in `input` according to the specified bins
 */
std::unique_ptr<column> bin(column_view const& input, 
                            column_view const& left_edges,
                            inclusive left_inclusive
                            column_view const& right_edges,
                            inclusive right_inclusive
                            rmm::mr::device_memory_resource * mr = rmm::get_current_device_resource());
@jrhemstad jrhemstad added feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. labels Mar 5, 2021
@vyasr
Copy link
Contributor

vyasr commented Mar 8, 2021

@jrhemstad I've implemented a working prototype with the basic binning functionality, but before I move forward I'd like to suggest a couple of changes to the API:

  1. Is there much benefit to specifying left and right edges independently? Having both introduces UB if the user provides overlapping edges, and doesn't allow anything that a single set of edges wouldn't AFAICT. Assuming the expected use case is a user manually creating a column of edges prior to calling this function, I propose to instead accept a single argument edges (or bin_edges) of size num_bins + 1 that includes both boundaries. In addition to avoiding ill-defined behavior, it's also a bit simpler and is a minor reduction in device memory usage (although the latter point is probably irrelevant since in any realistic use case num_bins << num_inputs).
  2. Specifying left and right inclusivity using this API also introduces the possibility of some undesirable behavior. Unless we enforce that left_inclusive != right_inclusive, we would either double count points on the boundary (if both bounds are inclusive) or entirely miss them (if both bounds are exclusive). What we really want is for the edges to specify a sequence of half-open intervals with the option to flip between which side is open and which is closed.
  3. Additionally, we need a way to support something like the include_lowest argument of pd.cut. Note that pd.cut has a bit of problem in this respect, because include_lowest=True only makes sense when right=True (in which case the upper bound of each interval is inclusive while the lower bound is exclusive). So far as I know, there's no corresponding include_highest for when right=False, which means that there's no way to specify the set of intervals [0, 1), [1, 2)... [9, 10]. While this isn't really that limiting, it does introduce an awkward asymmetry in the API. To resolve points (2) and (3), I'd suggest replacing [left|right]_inclusive with two different arguments (I'm fine with whatever naming, just using names that are hopefully clear here): closed_side=[LEFT|RIGHT] (with LEFT/RIGHT defined by an enum, but we could also use a simple boolean flag) and include_axis_bounds=[true|false] to indicate whether we want both boundaries of the axis to be included regardless of the direction of each individual interval.

@jrhemstad
Copy link
Contributor Author

  1. I'd defer to @kkraus14 here. My understanding is that the input for the bin boundaries is usually going to be an IntervalIndex whose shape will more naturally match two independent input columns. Obviously you could always interleave two columns into a single column, but that necessitates an additional kernel and intermediate materialization. I don't really care either way.

  2. Talking with @kkraus14 I think we need to allow left_inclusive and right_inclusive to be independent as both edges can be inclusive if bins are non-overlapping.

  3. I'm not really following what include_lowest does or the impact on the API. One thing to note is that in libcudf we do not beholden ourselves to matching 100% exact Pandas behavior in a single feature so long as there is a way to provide that behavior via composition of other libcudf functions.

@kkraus14
Copy link
Collaborator

kkraus14 commented Mar 8, 2021

  1. The bins do not need to be contiguous, my bins for example could be (left and right inclusive): [(0, 1), (3, 4), (7, 8)]. Items that do not fall into any bin should be returned as null.

  2. Similarly to above, the bins do not need to be contiguous, so we need the ability to specify the inclusivity on both sides as they could both be inclusive, or neither could be inclusive.

  3. We shouldn't strive to match pandas.cut exactly, especially the include_lowest option. The bins columns should be treated uniformly where we shouldn't try to build a mechanism to specify an inclusivity on a per row basis as opposed to for the column entirely. From the python side we could raise if include_lowest is set to an option that conflicts with the input IntervalIndex for example.

@vyasr
Copy link
Contributor

vyasr commented Mar 8, 2021

@kkraus14 thanks for the info. Just met with Jake and we ironed out that key point that bins don't need to be contiguous, which I was missing. Under that constraint the original API seems like what we want, and I'm happy to forgo include_lowest in libcudf. pandas seems to handle this by just extending the lowest bin by a small amount, so we can mimic that when constructing the IntervalIndex in Python if necessary.

@kkraus14
Copy link
Collaborator

kkraus14 commented Mar 8, 2021

I think for the flavor of contiguous bins, because the libcudf api takes in column_view objects, someone can still handle it relatively efficiently by having a single column of num_bins + 1 size and then constructing the two column_view objects of size num_bins from that underlying column.

rapids-bot bot pushed a commit that referenced this issue Mar 24, 2021
This PR resolves #7517, implementing a binning feature in `libcudf` to support [pandas.cut](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.cut.html) in `cudf`.

Authors:
  - Vyas Ramasubramani (@vyasr)

Approvers:
  - AJ Schmidt (@ajschmidt8)
  - David (@davidwendt)
  - Jake Hemstad (@jrhemstad)
  - Nghia Truong (@ttnghia)

URL: #7554
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature request New feature or request libcudf Affects libcudf (C++/CUDA) code.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants