Add Python bindings for explode #7538

Closed

firestarman wants to merge 2 commits into rapidsai:branch-0.19 from firestarman:explode-python-bindings

Contributor

firestarman commented Mar 9, 2021

This PR adds Python bindings for explode, since the native library already supports this feature over a table.

closes #2975

Signed-off-by: Firestarman [email protected]


          Add Python bindings for explode

f3aa7b2

since native already supports this feature over a table.

Signed-off-by: Firestarman <[email protected]>

firestarman requested a review from a team as a code owner

March 9, 2021 07:02

firestarman requested review from kkraus14 and isVoid

March 9, 2021 07:02

github-actions bot added the Python label

kkraus14 requested changes

python/cudf/cudf/_lib/reshape.pyx

@@ -41,3 +42,24 @@ def tile(Table source_table, size_type count):
                       column_names=source_table._column_names,
                       index_names=source_table._index_names
                   )
+              def explode(Table input_table, explode_column_name, ignore_index, nlevels):

Collaborator

kkraus14 Mar 9, 2021

I believe we need explode_outer to properly match Pandas behavior here:

import pandas as pd

test = pd.Series([[1, 2, 3], [], None, [4, 5]])
print(test.explode())

0       1
0       2
0       3
1     NaN
2    None
3       4
3       5
dtype: object
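
The difference between the two modes can be sketched in plain Python. This is only an illustration, not libcudf's implementation: it assumes the inner `explode` drops rows whose list is null or empty, while `explode_outer` keeps each such row as a single null entry. The function names mirror the libcudf names used in this thread, but the bodies and the `(index, value)` pair representation are assumptions made for the example.

```python
def explode(rows):
    """Inner explode (assumed semantics): null or empty lists yield no rows."""
    out = []
    for idx, values in enumerate(rows):
        # `values or []` treats both None and [] as "nothing to emit".
        for value in values or []:
            out.append((idx, value))
    return out


def explode_outer(rows):
    """Outer explode (assumed semantics): null or empty lists yield one null row."""
    out = []
    for idx, values in enumerate(rows):
        if not values:
            # Keep the row so the original index label survives in the output.
            out.append((idx, None))
        else:
            out.extend((idx, value) for value in values)
    return out


data = [[1, 2, 3], [], None, [4, 5]]
inner = explode(data)
outer = explode_outer(data)
```

With the input above, `outer` keeps index labels 1 and 2 with a null value, which lines up with the pandas output shown in this comment, while `inner` skips them entirely.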

python/cudf/cudf/tests/test_dataframe.py

Comment on lines +8450 to +8458

+              @pytest.mark.xfail(
+                  reason="nulls are dropped by cudf, but pandas casts it to NaN"
+              )
+              def test_explode_with_nulls():
+                  gdf = cudf.DataFrame({
+                      "a": [[1, 2, 3], [4, 5], None],
+                      "b": [11, 22, 33],
+                      "c": [111, 222, 333]
+                  })

Collaborator

kkraus14 Mar 9, 2021

We can't silently return different results than Pandas here. We need to wait for explode_outer support in libcudf and use that instead.

Also need to cover empty lists here as well.

Contributor Author

firestarman Mar 9, 2021

Thanks for the review. I will convert this to a draft and wait for explode_outer support in libcudf.

kkraus14 added the feature request and non-breaking labels


          use self instead of class name

0662ac6

Signed-off-by: Firestarman <[email protected]>

kkraus14 reviewed

python/cudf/cudf/core/dataframe.py

@@ -7425,6 +7425,60 @@ def equals(self, other):
                               return False
                       return super().equals(other)
+                  def explode(self, column, ignore_index=False):

Collaborator

kkraus14 Mar 9, 2021

Ideally this implementation could be moved down to frame to share an implementation between DataFrame and Series as well.
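
One way to picture this suggestion is a toy class hierarchy in which the shared logic lives on the base class and the two public methods only adapt their arguments. Everything below is hypothetical: the `Frame`, `DataFrame`, and `Series` classes are stand-ins written for this sketch, `_explode` is an invented helper name, and index handling is left out entirely.

```python
class Frame:
    """Toy stand-in for a shared base class; not cudf's Frame."""

    def __init__(self, data):
        self._data = dict(data)

    def _explode(self, column):
        # Hypothetical shared helper used by both subclasses.
        if column not in self._data:
            raise KeyError(column)
        # A null or empty list still occupies one output row.
        lengths = [max(len(v or []), 1) for v in self._data[column]]
        out = {}
        for name, values in self._data.items():
            col = []
            for value, n in zip(values, lengths):
                if name == column:
                    col.extend(value if value else [None])
                else:
                    # Repeat the other columns to line up with the new rows.
                    col.extend([value] * n)
            out[name] = col
        return out


class DataFrame(Frame):
    def explode(self, column):
        return DataFrame(self._explode(column))


class Series(Frame):
    def __init__(self, values, name=None):
        super().__init__({name: values})
        self.name = name

    def explode(self):
        return Series(self._explode(self.name)[self.name], name=self.name)


df = DataFrame({"a": [[1, 2], []], "b": [10, 20]}).explode("a")
s = Series([[1, 2], None, [3]]).explode()
```

The point of the layout is that null handling and row repetition are written once, so `DataFrame.explode` and `Series.explode` cannot drift apart.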

kkraus14 reviewed

python/cudf/cudf/core/dataframe.py

+                      Parameters
+                      ----------
+                      column : str or tuple
+                          Column to explode. Now only supports one column

Collaborator

kkraus14 Mar 9, 2021

We don't need to indicate that it only supports one column. A tuple is accepted because column names are allowed to be tuples.

firestarman marked this pull request as draft

March 9, 2021 07:23

kkraus14 reviewed

python/cudf/cudf/core/dataframe.py

+                                  "but given multiple columns or no column"
+                              )
+                      else:
+                          raise TypeError("column should be str or tuple of str")

Collaborator

kkraus14 Mar 9, 2021

Even though the Pandas docstring says it needs to be a string, Pandas happily allows any arbitrary object that it allows to be a column name here, which includes numbers and other objects.

kkraus14 reviewed

python/cudf/cudf/core/dataframe.py

+                      else:
+                          raise TypeError("column should be str or tuple of str")
+                      if exp_column not in self._column_names:
+                          raise ValueError("Can not find the column: " + exp_column)

Collaborator

kkraus14 Mar 9, 2021

This needs to raise a KeyError as opposed to a ValueError to match Pandas semantics properly

kkraus14 reviewed

python/cudf/cudf/core/dataframe.py

Comment on lines +7462 to +7473

+                      if isinstance(column, str):
+                          exp_column = column
+                      elif isinstance(column, tuple):
+                          if len(column) == 1:
+                              exp_column = column[0]
+                              if not isinstance(exp_column, str):
+                                  raise TypeError("column should be str or tuple of str")
+                          else:
+                              raise ValueError(
+                                  "Now only supports one column,"
+                                  "but given multiple columns or no column"
+                              )

Collaborator

kkraus14 Mar 9, 2021

This needs to be reworked to handle the generic types Pandas allows for column names. If someone gives ('a',) that's an entirely different name than 'a'.
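
A minimal sketch of the lookup these comments point toward, assuming the label is treated as an opaque key rather than unpacked or type-checked. `resolve_column` is a hypothetical helper written for this example and is not part of cudf or pandas.

```python
def resolve_column(column_names, column):
    """Return `column` unchanged if it is an existing label, else raise KeyError.

    Hypothetical helper: the label is never unpacked or type-checked, so
    "a", ("a",), and 0 are three distinct names.
    """
    if column not in column_names:
        # KeyError rather than ValueError, as requested in the review.
        raise KeyError(column)
    return column


names = ["a", ("a",), 0]
tuple_label = resolve_column(names, ("a",))
string_label = resolve_column(names, "a")
int_label = resolve_column(names, 0)
```

Because the tuple is compared as a whole, `("a",)` only matches a column that is literally named `("a",)`, and a frame containing just `"a"` would raise `KeyError` for it.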

isVoid self-assigned this

isVoid mentioned this pull request

Add explode API #7606

Closed

Contributor

isVoid commented Mar 16, 2021

Superseded by #7606

isVoid closed this

isVoid mentioned this pull request

Adds explode API #7607

Merged

rapids-bot bot pushed a commit that referenced this pull request


          Adds explode API (#7607)

ec5364c

Closes #2975 

This PR introduces the `explode` API, which flattens list columns and turns list elements into rows. Example:

```python
>>> s = cudf.Series([[1, 2, 3], [], None, [4, 5]])
>>> s
0    [1, 2, 3]
1           []
2         None
3       [4, 5]
dtype: list
>>> s.explode()
0       1
0       2
0       3
1    <NA>
2    <NA>
3       4
3       5
dtype: int64
```

Supersedes #7538

Authors:
  - Michael Wang (@isVoid)

Approvers:
  - Keith Kraus (@kkraus14)
  - GALI PREM SAGAR (@galipremsagar)

URL: #7607

Labels

feature request non-breaking Python