-
Notifications
You must be signed in to change notification settings - Fork 908
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add Python bindings for explode #7538
Add Python bindings for explode #7538
Conversation
since native already supports this feature over a table. Signed-off-by: Firestarman <[email protected]>
@@ -41,3 +42,24 @@ def tile(Table source_table, size_type count): | |||
column_names=source_table._column_names, | |||
index_names=source_table._index_names | |||
) | |||
|
|||
|
|||
def explode(Table input_table, explode_column_name, ignore_index, nlevels): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe we need explode_outer
to properly match Pandas behavior here:
import pandas as pd
test = pd.Series([[1, 2, 3], [], None, [4, 5]])
print(test.explode())
0 1
0 2
0 3
1 NaN
2 None
3 4
3 5
dtype: object
@pytest.mark.xfail( | ||
reason="nulls are dropped by cudf, but pandas casts it to NaN" | ||
) | ||
def test_explode_with_nulls(): | ||
gdf = cudf.DataFrame({ | ||
"a": [[1, 2, 3], [4, 5], None], | ||
"b": [11, 22, 33], | ||
"c": [111, 222, 333] | ||
}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can't silently return different results than Pandas here. We need to wait for explode_outer
support in libcudf and use that instead.
Also need to cover empty lists here as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for reivew. will draft this to wait for explode_outer
support in libcudf.
Signed-off-by: Firestarman <[email protected]>
@@ -7425,6 +7425,60 @@ def equals(self, other): | |||
return False | |||
return super().equals(other) | |||
|
|||
def explode(self, column, ignore_index=False): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ideally this implementation could be moved down to frame
to share an implementation between DataFrame
and Series
as well.
Parameters | ||
---------- | ||
column : str or tuple | ||
Column to explode. Now only supports one column |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't need to indicate it only supports one column, passing a tuple is because names of columns are allowed to be tuples.
"but given multiple columns or no column" | ||
) | ||
else: | ||
raise TypeError("column should be str or tuple of str") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Even though the Pandas docstring says it needs to be a string, Pandas happily allows any arbitrary object that it allows to be a column name here, which includes numbers and other objects.
else: | ||
raise TypeError("column should be str or tuple of str") | ||
if exp_column not in self._column_names: | ||
raise ValueError("Can not find the column: " + exp_column) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This needs to raise a KeyError
as opposed to a ValueError
to match Pandas semantics properly
if isinstance(column, str): | ||
exp_column = column | ||
elif isinstance(column, tuple): | ||
if len(column) == 1: | ||
exp_column = column[0] | ||
if not isinstance(exp_column, str): | ||
raise TypeError("column should be str or tuple of str") | ||
else: | ||
raise ValueError( | ||
"Now only supports one column," | ||
"but given multiple columns or no column" | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This needs to be reworked to handle the generic types Pandas allows for column names. If someone gives ('a',)
that's an entirely different name than 'a'
.
Superseded by #7606 |
Closes #2975 This PR introduces `explode` API, which flattens list columns and turns list elements into rows. Example: ```python >>> s = cudf.Series([[1, 2, 3], [], None, [4, 5]]) >>> s 0 [1, 2, 3] 1 [] 2 None 3 [4, 5] dtype: list >>> s.explode() 0 1 0 2 0 3 1 <NA> 2 <NA> 3 4 3 5 dtype: int64 ``` Supersedes #7538 Authors: - Michael Wang (@isVoid) Approvers: - Keith Kraus (@kkraus14) - GALI PREM SAGAR (@galipremsagar) URL: #7607
This PR is to add Python bindings for
explode
. Since the native has already supported this feature over a table.closes #2975
Signed-off-by: Firestarman [email protected]