Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Python bindings for explode #7538

Closed

Conversation

firestarman
Copy link
Contributor

This PR is to add Python bindings for explode. Since the native has already supported this feature over a table.

closes #2975

Signed-off-by: Firestarman [email protected]

since native already supports this feature over a table.

Signed-off-by: Firestarman <[email protected]>
@firestarman firestarman requested a review from a team as a code owner March 9, 2021 07:02
@github-actions github-actions bot added the Python Affects Python cuDF API. label Mar 9, 2021
@@ -41,3 +42,24 @@ def tile(Table source_table, size_type count):
column_names=source_table._column_names,
index_names=source_table._index_names
)


def explode(Table input_table, explode_column_name, ignore_index, nlevels):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe we need explode_outer to properly match Pandas behavior here:

import pandas as pd

test = pd.Series([[1, 2, 3], [], None, [4, 5]])
print(test.explode())
0       1
0       2
0       3
1     NaN
2    None
3       4
3       5
dtype: object

Comment on lines +8450 to +8458
@pytest.mark.xfail(
reason="nulls are dropped by cudf, but pandas casts it to NaN"
)
def test_explode_with_nulls():
gdf = cudf.DataFrame({
"a": [[1, 2, 3], [4, 5], None],
"b": [11, 22, 33],
"c": [111, 222, 333]
})
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can't silently return different results than Pandas here. We need to wait for explode_outer support in libcudf and use that instead.

Also need to cover empty lists here as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for reivew. will draft this to wait for explode_outer support in libcudf.

@kkraus14 kkraus14 added feature request New feature or request non-breaking Non-breaking change labels Mar 9, 2021
@@ -7425,6 +7425,60 @@ def equals(self, other):
return False
return super().equals(other)

def explode(self, column, ignore_index=False):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally this implementation could be moved down to frame to share an implementation between DataFrame and Series as well.

Parameters
----------
column : str or tuple
Column to explode. Now only supports one column
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need to indicate it only supports one column, passing a tuple is because names of columns are allowed to be tuples.

@firestarman firestarman marked this pull request as draft March 9, 2021 07:23
"but given multiple columns or no column"
)
else:
raise TypeError("column should be str or tuple of str")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even though the Pandas docstring says it needs to be a string, Pandas happily allows any arbitrary object that it allows to be a column name here, which includes numbers and other objects.

else:
raise TypeError("column should be str or tuple of str")
if exp_column not in self._column_names:
raise ValueError("Can not find the column: " + exp_column)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs to raise a KeyError as opposed to a ValueError to match Pandas semantics properly

Comment on lines +7462 to +7473
if isinstance(column, str):
exp_column = column
elif isinstance(column, tuple):
if len(column) == 1:
exp_column = column[0]
if not isinstance(exp_column, str):
raise TypeError("column should be str or tuple of str")
else:
raise ValueError(
"Now only supports one column,"
"but given multiple columns or no column"
)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs to be reworked to handle the generic types Pandas allows for column names. If someone gives ('a',) that's an entirely different name than 'a'.

@isVoid isVoid self-assigned this Mar 15, 2021
@isVoid isVoid mentioned this pull request Mar 16, 2021
@isVoid
Copy link
Contributor

isVoid commented Mar 16, 2021

Superseded by #7606

@isVoid isVoid closed this Mar 16, 2021
@isVoid isVoid mentioned this pull request Mar 16, 2021
rapids-bot bot pushed a commit that referenced this pull request Mar 18, 2021
Closes #2975 

This PR introduces `explode` API, which flattens list columns and turns list elements into rows. Example:

```python
>>> s = cudf.Series([[1, 2, 3], [], None, [4, 5]])
>>> s
0    [1, 2, 3]
1           []
2         None
3       [4, 5]
dtype: list
>>> s.explode()
0       1
0       2
0       3
1    <NA>
2    <NA>
3       4
3       5
dtype: int64
```

Supersedes #7538

Authors:
  - Michael Wang (@isVoid)

Approvers:
  - Keith Kraus (@kkraus14)
  - GALI PREM SAGAR (@galipremsagar)

URL: #7607
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature request New feature or request non-breaking Non-breaking change Python Affects Python cuDF API.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FEA] Support exploding nested type columns
3 participants