Explode Series with Dask-cuDF #8872

sarahyurick · 2021-07-27T20:48:47Z

Addresses feature request #8660. Invoking explode() on a cuDF or Dask Series outputs a DataFrame.

Similar to #8729, except now instead of having to call a_cudf_series.struct.explode(), we can do just a_cudf_series.explode(). Added Dask functionality allows for a_dask_cudf_series.compute().explode().

Minor detail: I'm not a huge fan of how I dealt with datatypes here. To obtain the values within the Series, I use to_arrow() (as seen in #8675), but I then need to convert the PyArrow datatypes back to the original datatypes. So I convert the values to strings and, after creating the DataFrame, iterate through the columns to recast them.

beckernick · 2021-07-27T21:12:32Z

python/cudf/cudf/core/series.py

+            for row in self.to_arrow():
+                row_results = [str(row[col]) for col in cols]
+                results.append(row_results)
+
+            out = cudf.DataFrame(results, columns=cols)


What is the structure of the out you want here? The device to host transfer + nested loop may be quite slow

the idea was that it would essentially look the same as the struct.explode() functionality that's already been implemented, so like:

You may be able to accomplish that without going to the CPU:

```python
import cudf

s = cudf.Series([
    {"a":5, "b":10},
    {"a":3, "b":7},
    {"a":-3, "b":11},
])

results = []
for key in s.dtype.fields:
    results.append(s.struct.field(key))

out = cudf.concat(results, axis=1)
out.columns = s.dtype.fields
print(out)
```

```
   a   b
0  5  10
1  3   7
2 -3  11
```

Separately, we may want to think more broadly about what we want the behavior to be for functionality that "kind of" exists in pandas. Pandas doesn't have struct columns, but does allow exploding of an object column containing dictionaries. However, the explode does not behave like exploding a struct column in Hive, Spark, etc. Instead, it behaves like exploding a list column (which it doesn't technically have either), where every element becomes a new row in a single column. This is a traditional, rather than a lateral, explode.

Because of that, we might need to special case an explode operator in dask-cuDF anyway, rather than rely on Dask to appropriately delegate to the cuDF explode from the existing one in Dask.DataFrame.

For now, I'd suggest we consider holding off on series.explode() natively doing a "lateral explode" for struct columns, and instead building the lateral view explode functionality as dask_series.struct.explode() once we land #8658

cc @shwina @VibhuJawa (as they might disagree)

I agree, I don't think we want the Series.explode() to support this just yet (unless Pandas does so). dask_series.struct.explode() can work off of series.struct.explode().
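The delegation idea above (a Dask-level struct explode built on the per-partition one) can be illustrated with a library-free sketch. Everything here is an assumption made for illustration: plain lists of dicts stand in for the partitions of a Dask series, and `lateral_explode` and `partitioned_lateral_explode` are hypothetical helpers, not cuDF or Dask APIs.

```python
# Library-free sketch of partition-wise delegation. Lists of dicts stand in
# for the partitions of a Dask series; none of these names are real
# cuDF or Dask APIs.


def lateral_explode(rows, fields):
    """Per-partition operation: build one column per struct field."""
    return {f: [row[f] for row in rows] for f in fields}


def partitioned_lateral_explode(partitions, fields):
    """Apply lateral_explode to each partition, then append the
    per-partition columns in partition order."""
    out = {f: [] for f in fields}
    for part in partitions:
        exploded = lateral_explode(part, fields)
        for f in fields:
            out[f].extend(exploded[f])
    return out


partitions = [
    [{"a": 5, "b": 10}, {"a": 3, "b": 7}],
    [{"a": -3, "b": 11}],
]
result = partitioned_lateral_explode(partitions, fields=["a", "b"])
print(result)
```

The field names are passed in explicitly rather than read from the first row, so an empty partition still produces the right (empty) columns. This loosely mirrors how a struct dtype carries its field names independently of the data.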

in Pandas, you just get something like:

```python
import pandas as pd

s = pd.Series([{"a": 1, "b": "x"}, {"a": 2, "b": "y"}, {"a": 3, "b": "z"}, {"a": 4, "b": "a"}])
s.explode()
```

```
0    a
0    b
1    a
1    b
2    a
2    b
3    a
3    b
dtype: object
```

Thanks for the example 👍 .

This is the "traditional explode" mentioned above and what we do with lists. The "lateral explode" is particularly common for structs, as the field names often map to actual features.

We may want to support this kind of traditional explode for structs, but in general this would be less common
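To make the two behaviors concrete, here is a minimal, library-free sketch. It assumes plain Python dicts as stand-ins for struct rows, and `traditional_explode` and `lateral_explode` are hypothetical helper names rather than cuDF or pandas functions.

```python
# Library-free sketch: plain dicts stand in for struct rows, so this runs
# without cuDF or pandas. The helper names are hypothetical.
rows = [{"a": 5, "b": 10}, {"a": 3, "b": 7}, {"a": -3, "b": 11}]


def traditional_explode(rows):
    """One output row per element, keeping the source row's index.
    Iterating a dict yields its keys, which is why exploding an object
    column of dicts in pandas produces the keys."""
    return [(i, key) for i, row in enumerate(rows) for key in row]


def lateral_explode(rows):
    """One output column per field; the number of rows is unchanged.
    Assumes every row has the same fields, as a struct dtype would."""
    fields = list(rows[0])
    return {field: [row[field] for row in rows] for field in fields}


print(traditional_explode(rows))
print(lateral_explode(rows))
```

The traditional form lengthens the data (six index/key pairs for three rows) and drops the values, while the lateral form widens it into one column per field.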

sounds good - I'll close this PR then, as it has a different functionality than the desired dask_series.struct.explode()

Closes #8660

Per discussions in thread #8872, this PR adds a struct-accessor member function to provide a lateral view to a struct type series.

Example:

```python
>>> import cudf, dask_cudf as dgd
>>> ds = dgd.from_cudf(cudf.Series(
...     [{'a': 42, 'b': 'str1', 'c': [-1]},
...      {'a': 0, 'b': 'str2', 'c': [400, 500]},
...      {'a': 7, 'b': '', 'c': []}]), npartitions=2)
>>> ds.struct.explode().compute()
    a     b           c
0  42  str1        [-1]
1   0  str2  [400, 500]
2   7                []
```

Authors:
- Michael Wang (https://github.com/isVoid)

Approvers:
- Richard (Rick) Zamora (https://github.com/rjzamora)

URL: #9086

sarahyurick added 2 commits July 27, 2021 16:34

adds explode() and tests

1437434

Merge branch 'rapidsai:branch-21.10' into dask_cudf_explode

0165b9a

sarahyurick requested a review from a team as a code owner July 27, 2021 20:48

sarahyurick requested review from cwharris and vyasr July 27, 2021 20:48

github-actions bot added the Python (Affects Python cuDF API) label Jul 27, 2021

sarahyurick added the dask (Dask issue), feature request (New feature or request), and non-breaking (Non-breaking change) labels Jul 27, 2021

beckernick reviewed Jul 27, 2021

style update

1d243f2

sarahyurick changed the title ~~Explode struct column into multiple columns with Dask-cuDF~~ Explode Series with Dask-cuDF Jul 29, 2021

sarahyurick closed this Jul 29, 2021

isVoid mentioned this pull request Aug 20, 2021

Add dseries.struct.explode #9086

Merged

sarahyurick deleted the dask_cudf_explode branch September 21, 2022 23:42

sarahyurick commented Jul 27, 2021

beckernick Jul 27, 2021

sarahyurick Jul 27, 2021

beckernick Jul 27, 2021

beckernick Jul 27, 2021

shwina Jul 28, 2021

sarahyurick Jul 29, 2021

beckernick Jul 29, 2021

sarahyurick Jul 29, 2021
