Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement interleave_columns for lists with arbitrary nested type #9130

Merged

Conversation

ttnghia
Copy link
Contributor

@ttnghia ttnghia commented Aug 26, 2021

This PR adds more support for lists column in the interleave_column API. In particular, it adds nested types support for the list entries. As such, now we can call interleave_column on a lists column of any type such as lists of structs, lists of lists, lists of lists of lists and so on. In addition, this PR also does a simple refactor of the existing overload functions with a new style of SFINAE implementation.

Closes #9106.

@ttnghia ttnghia added feature request New feature or request 2 - In Progress Currently a work in progress libcudf Affects libcudf (C++/CUDA) code. Spark Functionality that helps Spark RAPIDS non-breaking Non-breaking change labels Aug 26, 2021
@ttnghia ttnghia self-assigned this Aug 26, 2021
@codecov
Copy link

codecov bot commented Aug 26, 2021

Codecov Report

❗ No coverage uploaded for pull request base (branch-21.10@d29c441). Click here to learn what that means.
The diff coverage is n/a.

Impacted file tree graph

@@               Coverage Diff               @@
##             branch-21.10    #9130   +/-   ##
===============================================
  Coverage                ?   10.83%           
===============================================
  Files                   ?      114           
  Lines                   ?    19101           
  Branches                ?        0           
===============================================
  Hits                    ?     2070           
  Misses                  ?    17031           
  Partials                ?        0           

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d29c441...ca656db. Read the comment docs.

@ttnghia ttnghia added 3 - Ready for Review Ready for review by team and removed 2 - In Progress Currently a work in progress labels Aug 27, 2021
@ttnghia ttnghia marked this pull request as ready for review August 27, 2021 00:26
@ttnghia ttnghia requested a review from a team as a code owner August 27, 2021 00:26
@jrhemstad
Copy link
Contributor

Clever implementation to make use of the fact that gather already handles all the nastiness of arbitrarily nested columns.

@sperlingxx
Copy link
Contributor

Learnt some advanced skills on how to use cuDF developer API (like gather) from this PR :)

Copy link
Contributor

@nvdbaranec nvdbaranec left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great.

CUDF_EXPECTS(entry_type == child_col.type(),
"The types of entries in the input columns must be the same.");
}

if (input.num_rows() == 0) { return cudf::empty_like(input.column(0)); }
if (input.num_columns() == 1) { return std::make_unique<column>(*(input.begin()), stream, mr); }

// For nested types, we rely on the `concatenate_and_gather` method, which costs more memory due
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would be another good candidate for a hypothetical multi_gather function if we had it. Each entry in the gather map would be a pair {column_index, row_index} selecting rows from different columns in an input table.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Dave. The idea is great. I have filed an issue so we can track it here: #9175

@ttnghia
Copy link
Contributor Author

ttnghia commented Sep 2, 2021

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 4556b5b into rapidsai:branch-21.10 Sep 2, 2021
@ttnghia ttnghia deleted the interleave_columns_for_lists branch September 3, 2021 20:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3 - Ready for Review Ready for review by team feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change Spark Functionality that helps Spark RAPIDS
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FEA] Support interleave_columns for lists of lists and lists of structs
5 participants