Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

COLLECT_LIST support returning empty output columns. #8279

Merged
merged 3 commits into from
May 21, 2021

Conversation

mythrocks
Copy link
Contributor

Fixes the group-by portion of #7611.

When COLLECT_LIST() or COLLECT_SET() aggregations are called on a grouped input, if the input column is empty, then one sees the following failure:

C++ exception with description "cuDF failure at: .../cpp/src/column/column_factories.cpp:67: 
make_empty_column is invalid to call on nested types" thrown in the test body.

The operation should have resulted in an empty LIST column. make_empty_column() does not support LIST types (in part because the data_type parameter does not capture the types of the child columns).

This commit fixes this by constructing the output column from the specified values input, but only for COLLECT_LIST() and COLLECT_SET(); other aggregation types are unchanged.

@mythrocks mythrocks requested a review from a team as a code owner May 18, 2021 23:38
@mythrocks mythrocks self-assigned this May 18, 2021
@github-actions github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label May 18, 2021
@mythrocks mythrocks added bug Something isn't working non-breaking Non-breaking change and removed libcudf Affects libcudf (C++/CUDA) code. labels May 18, 2021
@harrism harrism changed the title COLLECT_LIST must support returning empty output columns. COLLECT_LIST support returning empty output columns. May 19, 2021
@codecov
Copy link

codecov bot commented May 19, 2021

Codecov Report

❗ No coverage uploaded for pull request base (branch-21.06@2da8473). Click here to learn what that means.
The diff coverage is n/a.

Impacted file tree graph

@@               Coverage Diff               @@
##             branch-21.06    #8279   +/-   ##
===============================================
  Coverage                ?   82.84%           
===============================================
  Files                   ?      105           
  Lines                   ?    17865           
  Branches                ?        0           
===============================================
  Hits                    ?    14800           
  Misses                  ?     3065           
  Partials                ?        0           

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 2da8473...2cc49f4. Read the comment docs.

Copy link
Contributor

@codereport codereport left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@mythrocks
Copy link
Contributor Author

gpuCI/cudf/build/python — Build failure/error

I'm running in circles through the CI links, trying to figure out what the failure is. It doesn't seem obvious. I'll re-kick it in time for the second code review.

cpp/src/groupby/groupby.cu Outdated Show resolved Hide resolved
Handle aggregations that return empty list columns.
@github-actions github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label May 20, 2021
@ttnghia
Copy link
Contributor

ttnghia commented May 20, 2021

Currently, we have this factory function to make an empty column in column_factories.*:

/**
 * @brief Creates an empty column of the specified @p type
 *
 * An empty column contains zero elements and no validity mask.
 *
 * @param[in] type The column data type
 * @return Empty column with desired type
 */
std::unique_ptr<column> make_empty_column(data_type type);

I would like to have a similar function for other types as well, putting at the same place with the existing function (column_factories.*). Because we need to know type of children columns for lists/structs column, I would suggest to have these new functions (or something better):

template<class ChildTypes>
std::unique_ptr<column> make_empty_column(data_type type, ChildTypes const& child_types);

template<>
std::unique_ptr<column> make_empty_column<std::vector<type>>(data_type type, std::vector<type> const& child_types);

So, for constructing an empty column of nested type, just pass in the additional array of child types. That array can be an array of array of array of... of cudf type. The first function above recursively calls itself. The second is the base case.

Having the factory APIs above, we can use it elsewhere in many other places.

PS 1: The proposed functions above are not good enough as they cannot handle uneven nested levels. We can use a customized struct instead of std::vector.

PS 2: Maybe this is out of scope of this PR. We can make it a new PR for the next release instead.

@mythrocks
Copy link
Contributor Author

We can use a customized struct instead of std::vector...

I was about to mention that nesting need not be even, till I read this part.

I know there have been prior discussions on conveying nested type information. I think the consensus was to populate child columns all the way down , and use exemplars when constructing empties (a la empty_like()).

PS 2: Maybe this is out of scope of this PR.

Thanks, I concur. :/

@ttnghia
Copy link
Contributor

ttnghia commented May 20, 2021

Reference: #8178

@mythrocks
Copy link
Contributor Author

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 5c6b92a into rapidsai:branch-21.06 May 21, 2021
@mythrocks mythrocks deleted the gby-collect-empty-list branch May 21, 2021 17:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants