Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adds explode API #7607

Merged
merged 17 commits into from
Mar 18, 2021
Merged

Adds explode API #7607

merged 17 commits into from
Mar 18, 2021

Conversation

isVoid
Copy link
Contributor

@isVoid isVoid commented Mar 16, 2021

Closes #2975

This PR introduces explode API, which flattens list columns and turns list elements into rows. Example:

>>> s = cudf.Series([[1, 2, 3], [], None, [4, 5]])
>>> s
0    [1, 2, 3]
1           []
2         None
3       [4, 5]
dtype: list
>>> s.explode()
0       1
0       2
0       3
1    <NA>
2    <NA>
3       4
3       5
dtype: int64

Supersedes #7538

@github-actions github-actions bot added the Python Affects Python cuDF API. label Mar 16, 2021
@isVoid isVoid added 2 - In Progress Currently a work in progress feature request New feature or request non-breaking Non-breaking change labels Mar 16, 2021
@isVoid isVoid mentioned this pull request Mar 16, 2021
@isVoid isVoid marked this pull request as ready for review March 17, 2021 05:55
@isVoid isVoid requested a review from a team as a code owner March 17, 2021 05:55
@isVoid isVoid added 3 - Ready for Review Ready for review by team and removed 2 - In Progress Currently a work in progress labels Mar 17, 2021
@isVoid
Copy link
Contributor Author

isVoid commented Mar 17, 2021

rerun tests

@isVoid isVoid self-assigned this Mar 17, 2021
@codecov
Copy link

codecov bot commented Mar 17, 2021

Codecov Report

Merging #7607 (e5e83b1) into branch-0.19 (7871e7a) will increase coverage by 0.62%.
The diff coverage is 93.33%.

❗ Current head e5e83b1 differs from pull request most recent head 6ce751a. Consider uploading reports for the commit 6ce751a to get more accurate results
Impacted file tree graph

@@               Coverage Diff               @@
##           branch-0.19    #7607      +/-   ##
===============================================
+ Coverage        81.86%   82.49%   +0.62%     
===============================================
  Files              101      101              
  Lines            16884    17397     +513     
===============================================
+ Hits             13822    14351     +529     
+ Misses            3062     3046      -16     
Impacted Files Coverage Δ
python/cudf/cudf/core/index.py 93.45% <ø> (+0.59%) ⬆️
python/cudf/cudf/core/series.py 91.69% <ø> (+0.90%) ⬆️
python/cudf/cudf/core/tools/datetimes.py 84.53% <ø> (+0.08%) ⬆️
python/cudf/cudf/utils/cudautils.py 52.94% <ø> (+2.55%) ⬆️
python/cudf/cudf/utils/dtypes.py 89.88% <ø> (+0.37%) ⬆️
python/dask_cudf/dask_cudf/io/orc.py 91.04% <ø> (+0.13%) ⬆️
python/cudf/cudf/core/column/numerical.py 94.83% <87.50%> (-0.20%) ⬇️
python/cudf/cudf/core/column/column.py 87.86% <90.00%> (+0.10%) ⬆️
python/cudf/cudf/core/column/decimal.py 92.75% <90.32%> (-2.12%) ⬇️
python/cudf/cudf/core/frame.py 89.23% <91.83%> (+0.21%) ⬆️
... and 58 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0b766c5...6ce751a. Read the comment docs.

@kkraus14
Copy link
Collaborator

Pandas also allows calling explode against non-list columns which seems to return the Series / DataFrame by copy. We may want to handle the same for compatibility.

@isVoid isVoid requested a review from kkraus14 March 17, 2021 23:26
python/cudf/cudf/core/frame.py Outdated Show resolved Hide resolved
python/cudf/cudf/core/frame.py Outdated Show resolved Hide resolved
Comment on lines +599 to +600
res._data.multiindex = self._data.multiindex
res._data._level_names = self._data._level_names
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we not have a helper function that handles all of the "gotchas" in this?

Copy link
Contributor Author

@isVoid isVoid Mar 18, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Checked with @shwina, and can't think of any.. ._data is constructed inside from_unique_ptr, which has no information about multiindex and _level_names.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree it's not ideal. One alternative I've considered in the past is a custom ColumnMeta type containing information about column names:

class ColumnMeta:
    names: Tuple[Any]
    multiindex: bool
    level_names: Tuple[Any]

And we pass objects of that type as arguments to ._from_unique_ptr:

cdef from_unique_ptr(unique_ptr[table] c_tbl, ColumnMeta data_meta, ColumnMeta index_meta=None):
    pass

Maybe a bit out of scope for this PR though.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Definitely out of scope for this PR, but I like it a lot.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is ColumnMeta owned and maintained by some class? Or is it a message interface that gets created on the fly and passes around classes?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question -- I think we should have a separate discussion about this outside the context of this PR. Happy to sync offline and then raise an issue?

@isVoid isVoid requested a review from kkraus14 March 18, 2021 04:59
python/cudf/cudf/core/frame.py Outdated Show resolved Hide resolved
python/cudf/cudf/core/frame.py Outdated Show resolved Hide resolved
@isVoid isVoid requested a review from kkraus14 March 18, 2021 06:04
Copy link
Collaborator

@kkraus14 kkraus14 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One minor doc issue, otherwise LGTM!

python/cudf/cudf/core/dataframe.py Outdated Show resolved Hide resolved
python/cudf/cudf/core/series.py Outdated Show resolved Hide resolved
@galipremsagar
Copy link
Contributor

Minor doc fixes needed, else LGTM

@kkraus14
Copy link
Collaborator

@gpucibot merge

@rapids-bot rapids-bot bot merged commit ec5364c into rapidsai:branch-0.19 Mar 18, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3 - Ready for Review Ready for review by team feature request New feature or request non-breaking Non-breaking change Python Affects Python cuDF API.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FEA] Support exploding nested type columns
4 participants