[FEA] Support exploding nested type columns #2975

Closed
beckernick opened this issue Oct 4, 2019 · 3 comments · Fixed by #7607
Labels
feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. Python Affects Python cuDF API. Spark Functionality that helps Spark RAPIDS

Comments

@beckernick
Member

beckernick commented Oct 4, 2019

When processing a nested type column, I'd like to be able to explode it into a non-nested type column, as in Spark SQL or pandas (see the Spark API doc for `explode`).

PySpark:

import pandas as pd

from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql import functions as F

sc = SparkContext()
sqlContext = SQLContext(sc)

df = pd.DataFrame()
df['col1'] = [0, 1]
df['col2'] = [[12,12,38], [42,93]]

sdf = sqlContext.createDataFrame(df)
sdf.withColumn("exploded", F.explode("col2")).select("exploded").show()
+--------+
|exploded|
+--------+
|      12|
|      12|
|      38|
|      42|
|      93|
+--------+

Pandas:

import pandas as pd

df = pd.DataFrame()
df['col1'] = [0, 1]
df['col2'] = [[12,12,38], [42,93]]

df.col2.explode()
0    12
0    12
0    38
1    42
1    93
Name: col2, dtype: object
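
For reference, the row-level behavior that both snippets above rely on can be sketched in plain Python. This only illustrates the requested semantics; it is not how Spark, pandas, or libcudf implement it, and `explode_rows` is a hypothetical helper name, not an existing API.

```python
def explode_rows(rows, column):
    """Return one output row per element of row[column], copying the other fields."""
    out = []
    for row in rows:
        for element in row[column]:
            new_row = dict(row)        # shallow copy keeps the non-list fields
            new_row[column] = element  # replace the list with a single element
            out.append(new_row)
    return out


# Same data as the issue's example: col1 is scalar, col2 holds lists.
rows = [
    {"col1": 0, "col2": [12, 12, 38]},
    {"col1": 1, "col2": [42, 93]},
]
exploded = explode_rows(rows, "col2")
for r in exploded:
    print(r)
```

Each input row is repeated once per list element, which is why the pandas output above shows the original index labels (0, 0, 0, 1, 1) repeated.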
@beckernick beckernick added feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. labels Oct 4, 2019
@beckernick beckernick changed the title [FEA] Support exploding nested type column [FEA] Support exploding nested type columns Oct 4, 2019
@kkraus14 kkraus14 added Python Affects Python cuDF API. Spark Functionality that helps Spark RAPIDS labels Feb 25, 2021
@kkraus14
Collaborator

Relevant libcudf PR that implemented the functionality: https://github.com/rapidsai/cudf/pull/7140/files

@skirui-source
Contributor

Looks like Marlene is already working on this issue:
#7227

@kkraus14 kkraus14 reopened this Mar 1, 2021
@kkraus14 kkraus14 assigned marlenezw and unassigned skirui-source Mar 1, 2021
@kkraus14 kkraus14 assigned isVoid and unassigned marlenezw Mar 15, 2021
This was referenced Mar 16, 2021
rapids-bot bot pushed a commit that referenced this issue Mar 18, 2021
Closes #2975 

This PR introduces the `explode` API, which flattens list columns by turning each list element into its own row. Empty lists and null entries each produce a single `<NA>` row. Example:

```python
>>> s = cudf.Series([[1, 2, 3], [], None, [4, 5]])
>>> s
0    [1, 2, 3]
1           []
2         None
3       [4, 5]
dtype: list
>>> s.explode()
0       1
0       2
0       3
1    <NA>
2    <NA>
3       4
3       5
dtype: int64
```
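
The handling of empty and null lists in the example above can be mirrored in a small plain-Python sketch. It assumes `None` stands in for `<NA>` and uses a hypothetical helper, `explode_series`, that returns `(index_label, value)` pairs; it illustrates the observable behavior only and says nothing about the cuDF or libcudf implementation.

```python
def explode_series(values, index=None):
    """Flatten a list of lists into (index_label, element) pairs.

    Empty lists and None entries each yield one (label, None) pair,
    matching the <NA> rows in the example output.
    """
    if index is None:
        index = range(len(values))
    out = []
    for label, item in zip(index, values):
        if item is None or len(item) == 0:
            out.append((label, None))  # keep the row, value becomes null
        else:
            out.extend((label, element) for element in item)
    return out


pairs = explode_series([[1, 2, 3], [], None, [4, 5]])
for label, value in pairs:
    print(label, value)
```

Keeping a placeholder row for empty and null lists means every original index label still appears at least once in the result.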

Supersedes #7538

Authors:
  - Michael Wang (@isVoid)

Approvers:
  - Keith Kraus (@kkraus14)
  - GALI PREM SAGAR (@galipremsagar)

URL: #7607