[FEA] Add support for org.apache.spark.sql.functions.flatten #8525

tgravescs · 2023-06-07T18:16:05Z

Is your feature request related to a problem? Please describe.
A customer job was falling back ObjectHashAggregate tot he CPU. Within that the user was using the expression org.apache.spark.sql.functions.flatten, which we don't support. It would be nice to add support for this.

https://spark.apache.org/docs/3.1.2/api/python/reference/api/pyspark.sql.functions.flatten.html

https://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/functions$.html

Tasks

Give feedback

JNI: Add JNI for lists::concatenate_list_elements rapidsai/cudf#13547
Additional libcudf work: Enable nested types for lists::concatenate_list_elements rapidsai/cudf#13545
Plugin work: Support flatten SQL function #8555
Options

The text was updated successfully, but these errors were encountered:

revans2 · 2023-06-07T19:51:47Z

The concept of this looks really simple to implement. It takes an Array[Array[SOMETHING]] and turns it into an Array[SOMETHING]. It looks like all we would have to do is to take the data column that holds SOMETHING, and update the offsets from the top level array to point to the beginning and end elements pointed to by the entries in the child array.

For example If I had something like

[[1,2, 3], [4, 5, 6]]
[[],[]]
[[7],[8,9]]

It would have a data column of
1, 2, 3, 4, 5, 6, 7, 8, 9
It would have a child offsets column of
0, 3, 6, 6, 6, 7, 9
And a top level offsets column of
0, 2, 4, 6

We would then do a simple lookup kernel only on the offsets columns.

The top level offset of 0 points to the second level offset of 0.
The top level offset of 2 points to the second level offset of 6.
The top level offset of 4 points to the second level offset of 6.
The top level offset of 6 points to the second level offset of 9.

So the result would keep the data column the same
1, 2, 3, 4, 5, 6, 7, 8, 9
But would just have a new offset column that we just computed
0, 6, 6, 9

Which would result in

[1, 2, 3, 4, 5, 6],
[],
[7, 8, 9]

The one hard part is that nulls in the second level arrays turn the output column to a null.

[[1, 2, 3], null],
[[4], [5, 6]]

results in

null,
[4, 5, 6]

That does not look too hard to make work in a similar way to how we calculated the offsets. The big problem would be cleaning up the non-empty nulls afterwards. Which is not that big of a deal.

ttnghia · 2023-06-09T19:10:47Z

This was already implemented in cudf: https://github.com/rapidsai/cudf/blob/branch-23.08/cpp/include/cudf/lists/combine.hpp#L95
In the best case, we just need JNI.

ttnghia · 2023-06-24T03:09:58Z

This should be closed by #8555.

tgravescs added feature request New feature or request ? - Needs Triage Need team to review and classify labels Jun 7, 2023

ttnghia self-assigned this Jun 9, 2023

mattahrens removed the ? - Needs Triage Need team to review and classify label Jun 9, 2023

ttnghia closed this as completed Jun 24, 2023

ttnghia mentioned this issue Jun 24, 2023

Support flatten SQL function #8555

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[FEA] Add support for org.apache.spark.sql.functions.flatten #8525

[FEA] Add support for org.apache.spark.sql.functions.flatten #8525

tgravescs commented Jun 7, 2023 •

edited by ttnghia

Loading

Tasks

revans2 commented Jun 7, 2023

ttnghia commented Jun 9, 2023

ttnghia commented Jun 24, 2023

[FEA] Add support for org.apache.spark.sql.functions.flatten #8525

[FEA] Add support for org.apache.spark.sql.functions.flatten #8525

Comments

tgravescs commented Jun 7, 2023 • edited by ttnghia Loading

Tasks

revans2 commented Jun 7, 2023

ttnghia commented Jun 9, 2023

ttnghia commented Jun 24, 2023

tgravescs commented Jun 7, 2023 •

edited by ttnghia

Loading