Support "WindowGroupLimit" optimization on GPU [databricks] (#10500)
Fixes #8208.

This commit adds support for running `WindowGroupLimitExec` on the GPU. This optimization was added in Apache Spark 3.5 to reduce the number of rows that participate in shuffles, for queries that filter on the result of ranking functions. For example:

```sql
SELECT foo, bar FROM (
  SELECT foo, bar,
         RANK() OVER (PARTITION BY foo ORDER BY bar) AS rnk
  FROM mytable
) WHERE rnk < 10
```

Such a query requires a shuffle to bring all the rows of a window group into the same task. In Spark 3.5, [SPARK-37099](https://issues.apache.org/jira/browse/SPARK-37099) added an optimization that exploits the `rnk < 10` predicate to reduce shuffle load. Specifically, since only the first 9 (i.e. 10-1) ranks can satisfy the predicate, at most that many ranks need to be shuffled into the task, per input batch. By pre-filtering rows that cannot possibly satisfy the condition, the number of shuffled records is reduced.

The GPU implementation (`GpuWindowGroupLimitExec`) differs slightly from the CPU implementation, because it needs to operate on entire input column batches. As a result, `GpuWindowGroupLimitExec` runs the rank scan on each input batch, and then filters out rows whose rank exceeds the limit specified in the predicate (`rnk < 10`). After the shuffle, the `RANK()` is calculated again by `GpuRunningWindowExec` to produce the final result.

The current implementation supports the `RANK()` and `DENSE_RANK()` window functions. Other ranking functions (like `ROW_NUMBER()`) can be added at a later date.

Signed-off-by: MithunR <[email protected]>
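To make the per-batch pre-filter concrete, here is a minimal sketch in plain Python (the plugin itself is Scala running on GPU columnar batches). The function names `rank_scan` and `group_limit_filter` are hypothetical, chosen only for illustration; the sketch assumes each batch arrives sorted by (partition key, order key), mirroring the rank scan followed by a filter on the rank limit:

```python
def rank_scan(partitions, order_keys):
    """Compute 1-based RANK() over a batch sorted by (partition key, order key).

    Ties (equal order keys within a partition) share a rank; the next distinct
    key jumps to its row position, matching SQL RANK() semantics.
    """
    ranks = []
    prev_part, prev_key = object(), object()  # sentinels distinct from any value
    rank = row_in_part = 0
    for part, key in zip(partitions, order_keys):
        if part != prev_part:
            row_in_part = 1
            rank = 1
            prev_part, prev_key = part, key
        else:
            row_in_part += 1
            if key != prev_key:
                rank = row_in_part
                prev_key = key
        ranks.append(rank)
    return ranks


def group_limit_filter(rows, limit):
    """Keep only rows whose rank is below `limit` (the `rnk < 10` predicate).

    `rows` is a sorted list of (partition_key, order_key) pairs standing in
    for one input batch; rows that cannot satisfy the predicate are dropped
    before the shuffle.
    """
    ranks = rank_scan([r[0] for r in rows], [r[1] for r in rows])
    return [row for row, rk in zip(rows, ranks) if rk < limit]


# One batch: partition "a" has ranks 1, 1, 3, 4; partition "b" has ranks 1, 2.
batch = [("a", 1), ("a", 1), ("a", 2), ("a", 3), ("b", 5), ("b", 6)]
print(group_limit_filter(batch, 3))
# → [('a', 1), ('a', 1), ('b', 5), ('b', 6)]
```

Note that this filter is only a reduction, not the final answer: rows surviving from different batches of the same partition still meet in one task after the shuffle, which is why the rank must be recomputed there to produce the final result.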