Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Explode Series with Dask-cuDF #8872

Closed

Conversation

sarahyurick
Copy link
Contributor

Addresses feature request #8660. Invoking explode() on a cuDF or Dask Series outputs a DataFrame.

Similar to #8729, except now instead of having to call a_cudf_series.struct.explode(), we can do just a_cudf_series.explode(). Added Dask functionality allows for a_dask_cudf_series.compute().explode().

Minor detail - I'm not a huge fan of how I dealt with datatypes here. In order to obtain the values within the Series, I use to_arrow() (as seen in #8675), but I need to convert the PyArrow datatypes back to the original datatypes. So I convert them to strings and after creating the DataFrame, iterate through the columns to recast them.

@sarahyurick sarahyurick requested a review from a team as a code owner July 27, 2021 20:48
@github-actions github-actions bot added the Python Affects Python cuDF API. label Jul 27, 2021
@sarahyurick sarahyurick added dask Dask issue feature request New feature or request non-breaking Non-breaking change labels Jul 27, 2021
Comment on lines +6040 to +6044
for row in self.to_arrow():
row_results = [str(row[col]) for col in cols]
results.append(row_results)

out = cudf.DataFrame(results, columns=cols)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the structure of the out you want here? The device to host transfer + nested loop may be quite slow

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the idea was that it would essentially look the same as the struct.explode() functionality that's already been implemented, so like:
image

Copy link
Member

@beckernick beckernick Jul 27, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You may be able to accomplish that without going to the CPU:

import cudfs = cudf.Series([
    {"a":5, "b":10},
    {"a":3, "b":7},
    {"a":-3, "b":11},
])
​
results = []
for key in s.dtype.fields:
    results.append(s.struct.field(key))
​
out = cudf.concat(results, axis=1)
out.columns = s.dtype.fields
print(out)
   a   b
0  5  10
1  3   7
2 -3   11

Separately, we may want to think more broadly about what we want the behavior to be for functionality that "kind of" exists in pandas. Pandas doesn't have struct columns, but does allow exploding of an object column containing dictionaries. However, the explode does not behave like exploding a struct column in Hive, Spark, etc. Instead, it behaves like exploding a list column (which it doesn't technically have either), where every element becomes a new row in a single column. This is a traditional, rather than a lateral, explode.

Because of that, we might need to special case an explode operator in dask-cuDF anyway, rather than rely on Dask to appropriately delegate to the cuDF explode from the existing one in Dask.DataFrame.

For now, I'd suggest we consider holding off on series.explode() natively doing a "lateral explode" for struct columns, and instead building the lateral view explode functionality as dask_series.struct.explode() once we land #8658

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cc @shwina @VibhuJawa (as they might disagree)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, I don't think we want the Series.explode() to support this just yet (unless Pandas does so). dask_series.struct.explode() can work off of series.struct.explode().

Copy link
Contributor Author

@sarahyurick sarahyurick Jul 29, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in Pandas, you just get something like:

import pandas as pd

s = pd.Series([{"a": 1, "b": "x"},
                {"a": 2, "b": "y"},
                {"a": 3, "b": "z"},
                {"a": 4, "b": "a"}])
s.explode()
0    a
0    b
1    a
1    b
2    a
2    b
3    a
3    b
dtype: object

Copy link
Member

@beckernick beckernick Jul 29, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the example 👍 .

This is the "traditional explode" mentioned above and what we do with lists. The "lateral explode" is particularly common for structs, as the field names often map to actual features.

We may want to support this kind of traditional explode for structs, but in general this would be less common

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sounds good - I'll close this PR then, as it has a different functionality than the desired dask_series.struct.explode()

@sarahyurick sarahyurick changed the title Explode struct column into multiple columns with Dask-cuDF Explode Series with Dask-cuDF Jul 29, 2021
rapids-bot bot pushed a commit that referenced this pull request Sep 22, 2021
Closes #8660 

Per discussions in thread #8872 , this PR adds a struct-accessor member function to provide a lateral view to a struct type series.

Example: 
```python
>>> import cudf, dask_cudf as dgd
>>> ds = dgd.from_cudf(cudf.Series(
...     [{'a': 42, 'b': 'str1', 'c': [-1]},
...      {'a': 0,  'b': 'str2', 'c': [400, 500]},
...      {'a': 7,  'b': '',     'c': []}]), npartitions=2)
>>> ds.struct.explode().compute()
    a     b           c
0  42  str1        [-1]
1   0  str2  [400, 500]
2   7                []
```

Authors:
  - Michael Wang (https://github.com/isVoid)

Approvers:
  - Richard (Rick) Zamora (https://github.com/rjzamora)

URL: #9086
@sarahyurick sarahyurick deleted the dask_cudf_explode branch September 21, 2022 23:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
dask Dask issue feature request New feature or request non-breaking Non-breaking change Python Affects Python cuDF API.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants