Is your feature request related to a problem? Please describe.
A customer job had an ObjectHashAggregate falling back to the CPU. Within that aggregate the user was using the expression org.apache.spark.sql.functions.flatten, which we don't support. It would be nice to add support for this.
Conceptually this looks simple to implement. It takes an Array[Array[SOMETHING]] and turns it into an Array[SOMETHING]. It looks like all we would have to do is keep the data column that holds SOMETHING, and update the offsets of the top-level array to point to the beginning and end elements pointed to by the entries in the child offsets column.
For example, if I had something like
[[1,2, 3], [4, 5, 6]]
[[],[]]
[[7],[8,9]]
It would have a data column of
1, 2, 3, 4, 5, 6, 7, 8, 9
It would have a child offsets column of
0, 3, 6, 6, 6, 7, 9
And a top level offsets column of
0, 2, 4, 6
We would then do a simple lookup kernel only on the offsets columns.
The top level offset of 0 points to the second level offset of 0.
The top level offset of 2 points to the second level offset of 6.
The top level offset of 4 points to the second level offset of 6.
The top level offset of 6 points to the second level offset of 9.
So the result would keep the data column the same
1, 2, 3, 4, 5, 6, 7, 8, 9
But would have the new offsets column we just computed
0, 6, 6, 9
Which would result in
[1, 2, 3, 4, 5, 6],
[],
[7, 8, 9]
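The lookup described above can be sketched with plain Python lists standing in for the GPU columns (the function name and layout here are illustrative, not a real API):

```python
def flatten_offsets(top_offsets, child_offsets):
    # Each top-level offset is an index into the child offsets column, so
    # the new offsets column is simply child_offsets gathered at top_offsets.
    return [child_offsets[i] for i in top_offsets]

# Columns for [[1,2,3],[4,5,6]], [[],[]], [[7],[8,9]] from the example above.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9]
child_offsets = [0, 3, 6, 6, 6, 7, 9]
top_offsets = [0, 2, 4, 6]

new_offsets = flatten_offsets(top_offsets, child_offsets)
# new_offsets == [0, 6, 6, 9]; the data column is reused unchanged.
```

The data column never moves; only a small gather over the offsets columns is needed.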
The one hard part is that a null in any of the second-level arrays turns the corresponding output row into a null.
[[1, 2, 3], null],
[[4], [5, 6]]
results in
null,
[4, 5, 6]
That does not look too hard to handle in a similar way to how we calculated the offsets. The main remaining work would be cleaning up the non-empty nulls afterwards, which is not that big of a deal.
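The null propagation could be sketched the same way, again with plain Python lists standing in for the columns and an assumed per-child validity mask (all names here are illustrative):

```python
def flatten_null_mask(top_offsets, child_valid):
    # child_valid[j] is False when the j-th second-level array is null.
    # Output row i becomes null if any of its child arrays is null.
    return [all(child_valid[top_offsets[i]:top_offsets[i + 1]])
            for i in range(len(top_offsets) - 1)]

# Columns for [[1,2,3], null] and [[4],[5,6]] from the example above.
top_offsets = [0, 2, 4]
child_valid = [True, False, True, True]

valid = flatten_null_mask(top_offsets, child_valid)
# valid == [False, True]: the first output row is null, the second is valid.
```

The rows marked invalid here may still have non-zero extents in the computed offsets, which is where the cleanup pass mentioned above would come in.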
https://spark.apache.org/docs/3.1.2/api/python/reference/api/pyspark.sql.functions.flatten.html
https://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/functions$.html