[FEA] Support exploding nested type columns #2975

beckernick · 2019-10-04T18:11:24Z

When processing a nested type column, I'd like to be able to explode it into a non-nested type column, as in Spark SQL or pandas (see the Spark API doc for `explode`).

PySpark:

import pandas as pd

from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql import functions as F

sc = SparkContext()
sqlContext = SQLContext(sc)

df = pd.DataFrame()
df['col1'] = [0, 1]
df['col2'] = [[12,12,38], [42,93]]

sdf = sqlContext.createDataFrame(df)
sdf.withColumn("exploded", F.explode("col2")).select("exploded").show()
+--------+
|exploded|
+--------+
|      12|
|      12|
|      38|
|      42|
|      93|
+--------+

Pandas:

import pandas as pd

df = pd.DataFrame()
df['col1'] = [0, 1]
df['col2'] = [[12,12,38], [42,93]]

df.col2.explode()
0    12
0    12
0    38
1    42
1    93
Name: col2, dtype: object
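For reference, the row-level behavior both examples share can be sketched in plain Python. This is only an illustration of the requested semantics, not how Spark, pandas, or libcudf implement it; `explode_rows` is a hypothetical helper name.

```python
def explode_rows(index, values):
    """Return one (label, element) pair per list element.

    Hypothetical helper: the row label is repeated for every element
    of that row's list, mirroring the pandas output above.
    """
    exploded = []
    for label, items in zip(index, values):
        for item in items:
            exploded.append((label, item))
    return exploded


# Same data as col2 in the examples above.
rows = explode_rows([0, 1], [[12, 12, 38], [42, 93]])
for label, item in rows:
    print(label, item)
```

The repeated labels (0, 0, 0, 1, 1) correspond to the repeated index in the pandas result; the Spark output shows only the element column.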


kkraus14 · 2021-02-25T19:47:36Z

Relevant Pandas API: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.explode.html

kkraus14 · 2021-02-25T19:50:41Z

Relevant libcudf PR that implemented the functionality: https://github.com/rapidsai/cudf/pull/7140/files

skirui-source · 2021-02-26T22:43:40Z

Looks like Marlene is already working on this issue:
#7227

@isVoid

Closes #2975

This PR introduces `explode` API, which flattens list columns and turns list elements into rows.

Example:

```python
>>> s = cudf.Series([[1, 2, 3], [], None, [4, 5]])
>>> s
0    [1, 2, 3]
1           []
2         None
3       [4, 5]
dtype: list
>>> s.explode()
0       1
0       2
0       3
1    <NA>
2    <NA>
3       4
3       5
dtype: int64
```

Supersedes #7538

Authors:
- Michael Wang (@isVoid)

Approvers:
- Keith Kraus (@kkraus14)
- GALI PREM SAGAR (@galipremsagar)

URL: #7607
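The example above also shows that an empty list and a null row each produce a single `<NA>` row rather than disappearing. A plain-Python sketch of that rule follows, assuming `None` stands in for `<NA>`; `explode_with_nulls` is a hypothetical name and this is not the cuDF implementation.

```python
def explode_with_nulls(values):
    """Explode a list of lists, keeping one null row for empty or missing lists.

    Hypothetical helper: None stands in for <NA>, and the positional
    row number is used as the label.
    """
    exploded = []
    for label, items in enumerate(values):
        if items is None or len(items) == 0:
            # Empty and null rows are kept as a single null element.
            exploded.append((label, None))
        else:
            exploded.extend((label, item) for item in items)
    return exploded


# Same input as the Series in the example above.
result = explode_with_nulls([[1, 2, 3], [], None, [4, 5]])
for label, item in result:
    print(label, item)
```

Keeping a placeholder row means every input row label still appears at least once in the output.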

beckernick added feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. labels Oct 4, 2019

beckernick changed the title ~~[FEA] Support exploding nested type column~~ [FEA] Support exploding nested type columns Oct 4, 2019

This was referenced Aug 18, 2020

[FEA] Support posexplode in nested type columns. #6025

Closed

[FEA] Audit GenerateExec NVIDIA/spark-rapids#228

Closed

revans2 mentioned this issue Sep 3, 2020

[FEA] memory efficient explode and pos_explode implementations #6151

Closed

revans2 mentioned this issue Nov 16, 2020

[FEA] explode() can take expressions that generate arrays NVIDIA/spark-rapids#1125

Closed

kkraus14 added Python Affects Python cuDF API. Spark Functionality that helps Spark RAPIDS labels Feb 25, 2021

kkraus14 assigned skirui-source Feb 25, 2021

skirui-source closed this as completed Feb 26, 2021

kkraus14 reopened this Mar 1, 2021

kkraus14 assigned marlenezw and unassigned skirui-source Mar 1, 2021

firestarman mentioned this issue Mar 9, 2021

Add Python bindings for explode #7538

Closed

kkraus14 assigned isVoid and unassigned marlenezw Mar 15, 2021

This was referenced Mar 16, 2021

Add explode API #7606

Closed

Adds explode API #7607

Merged

rapids-bot bot closed this as completed in #7607 Mar 18, 2021
