[FEA] Add support for org.apache.spark.sql.functions.flatten #8525

Closed
3 tasks done
tgravescs opened this issue Jun 7, 2023 · 3 comments · Fixed by #8555
Labels
feature request New feature or request

Comments

@tgravescs
Collaborator

tgravescs commented Jun 7, 2023

Is your feature request related to a problem? Please describe.
A customer job was falling back to the CPU for ObjectHashAggregate. Within that aggregate, the user was using the expression org.apache.spark.sql.functions.flatten, which we don't support. It would be nice to add support for this.

https://spark.apache.org/docs/3.1.2/api/python/reference/api/pyspark.sql.functions.flatten.html

https://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/functions$.html


tgravescs added the "feature request" (New feature or request) and "? - Needs Triage" (Need team to review and classify) labels Jun 7, 2023
@revans2
Collaborator

revans2 commented Jun 7, 2023

The concept of this looks really simple to implement. It takes an Array[Array[SOMETHING]] and turns it into an Array[SOMETHING]. It looks like all we would have to do is to take the data column that holds SOMETHING, and update the offsets from the top level array to point to the beginning and end elements pointed to by the entries in the child array.

For example, if I had something like

[[1,2, 3], [4, 5, 6]]
[[],[]]
[[7],[8,9]]

It would have a data column of
1, 2, 3, 4, 5, 6, 7, 8, 9
It would have a child offsets column of
0, 3, 6, 6, 6, 7, 9
And a top level offsets column of
0, 2, 4, 6

We would then do a simple lookup kernel only on the offsets columns.

The top level offset of 0 points to the second level offset of 0.
The top level offset of 2 points to the second level offset of 6.
The top level offset of 4 points to the second level offset of 6.
The top level offset of 6 points to the second level offset of 9.

So the result would keep the data column the same
1, 2, 3, 4, 5, 6, 7, 8, 9
But would just have a new offset column that we just computed
0, 6, 6, 9

Which would result in

[1, 2, 3, 4, 5, 6],
[],
[7, 8, 9]
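
The offsets lookup described above can be sketched in plain Python. This is only an illustration of the idea on the example from this comment, not the plugin's or cudf's implementation; the helper names `flatten_offsets` and `materialize` are made up for the sketch, and nulls are ignored here.

```python
def flatten_offsets(top_offsets, child_offsets):
    """Compose two levels of list offsets into a single level.

    Each top-level offset is an index into the child offsets column, so the
    new offset is simply the child offset found at that index.
    (Hypothetical helper, for illustration only.)
    """
    return [child_offsets[o] for o in top_offsets]


def materialize(offsets, values):
    """Turn an offsets column plus a data column back into Python lists."""
    return [values[offsets[i]:offsets[i + 1]] for i in range(len(offsets) - 1)]


# Columns from the example above.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9]
child_offsets = [0, 3, 6, 6, 6, 7, 9]
top_offsets = [0, 2, 4, 6]

# The data column is untouched; only a new offsets column is computed.
new_offsets = flatten_offsets(top_offsets, child_offsets)
rows = materialize(new_offsets, data)
```

Here `new_offsets` comes out as `[0, 6, 6, 9]` and `rows` matches the flattened result shown above. On the GPU the list comprehension in `flatten_offsets` would correspond to a gather over the offsets columns.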

The one hard part is that a null in the second level arrays turns the corresponding output row into a null.

[[1, 2, 3], null],
[[4], [5, 6]]

results in

null,
[4, 5, 6]

That does not look too hard to make work in a similar way to how we calculated the offsets. The main remaining problem would be cleaning up the non-empty nulls afterwards, which is not that big of a deal.
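
A rough sketch of the null handling, again in plain Python and only under assumptions: the second level validity is modeled as a list of booleans (`child_valid`), top level rows are assumed to be non-null, and the function name `flatten_with_nulls` is hypothetical. The sketch marks an output row as null when any of its child lists is null, and avoids non-empty nulls by giving null rows a zero-length range and leaving their data out.

```python
def flatten_with_nulls(top_offsets, child_offsets, child_valid, data):
    """Flatten Array[Array[T]] to Array[T], propagating second-level nulls.

    Returns (offsets, validity, data) for the flattened column.
    Assumes every top-level row is itself non-null (illustrative only).
    """
    out_offsets = [0]
    out_valid = []
    out_data = []
    for i in range(len(top_offsets) - 1):
        first, last = top_offsets[i], top_offsets[i + 1]
        # The output row is null if any child list in this row is null.
        row_valid = all(child_valid[first:last])
        out_valid.append(row_valid)
        if row_valid:
            # Copy the contiguous data range covered by this row's children.
            out_data.extend(data[child_offsets[first]:child_offsets[last]])
        # Null rows contribute no data, so they end up with empty ranges.
        out_offsets.append(len(out_data))
    return out_offsets, out_valid, out_data


# [[1, 2, 3], null], [[4], [5, 6]]
data = [1, 2, 3, 4, 5, 6]
child_offsets = [0, 3, 3, 4, 6]
child_valid = [True, False, True, True]
top_offsets = [0, 2, 4]

offsets, valid, flat = flatten_with_nulls(top_offsets, child_offsets, child_valid, data)
```

For this input the first output row is null with an empty range and the second is `[4, 5, 6]`. Unlike the offsets-only approach, this version rewrites the data column, which is the cost of the cleanup step mentioned above.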

@ttnghia
Collaborator

ttnghia commented Jun 9, 2023

This was already implemented in cudf: https://github.com/rapidsai/cudf/blob/branch-23.08/cpp/include/cudf/lists/combine.hpp#L95
In the best case, we just need JNI.

ttnghia self-assigned this Jun 9, 2023
mattahrens removed the "? - Needs Triage" (Need team to review and classify) label Jun 9, 2023
@ttnghia
Collaborator

ttnghia commented Jun 24, 2023

This should be closed by #8555.
