[SPARK-34581][SQL] Don't optimize out grouping expressions from aggregate expressions without aggregate function #32396

peter-toth · 2021-04-29T12:38:50Z

What changes were proposed in this pull request?

This PR adds a new rule PullOutGroupingExpressions to pull out complex grouping expressions to a Project node under an Aggregate. These expressions are then referenced in both grouping expressions and aggregate expressions without aggregate functions to ensure that optimization rules don't change the aggregate expressions to invalid ones that no longer refer to any grouping expressions.

Why are the changes needed?

If aggregate expressions (without aggregate functions) in an Aggregate node are complex, the Optimizer can optimize grouping expressions out of them, making the aggregate expressions invalid.

Here is a simple example:

SELECT not(t.id IS NULL) , count(*)
FROM t
GROUP BY t.id IS NULL

In this case the BooleanSimplification rule does this:

=== Applying Rule org.apache.spark.sql.catalyst.optimizer.BooleanSimplification ===
!Aggregate [isnull(id#222)], [NOT isnull(id#222) AS (NOT (id IS NULL))#226, count(1) AS c#224L]   Aggregate [isnull(id#222)], [isnotnull(id#222) AS (NOT (id IS NULL))#226, count(1) AS c#224L]
 +- Project [value#219 AS id#222]                                                                 +- Project [value#219 AS id#222]
    +- LocalRelation [value#219]                                                                     +- LocalRelation [value#219]

where NOT isnull(id#222) is optimized to isnotnull(id#222) and so it no longer refers to any grouping expression.

Before this PR:

== Optimized Logical Plan ==
Aggregate [isnull(id#222)], [isnotnull(id#222) AS (NOT (id IS NULL))#234, count(1) AS c#232L]
+- Project [value#219 AS id#222]
   +- LocalRelation [value#219]

and running the query throws an error:

Couldn't find id#222 in [isnull(id#222)#230,count(1)#226L]
java.lang.IllegalStateException: Couldn't find id#222 in [isnull(id#222)#230,count(1)#226L]

After this PR:

== Optimized Logical Plan ==
Aggregate [_groupingexpression#233], [NOT _groupingexpression#233 AS (NOT (id IS NULL))#230, count(1) AS c#228L]
+- Project [isnull(value#219) AS _groupingexpression#233]
   +- LocalRelation [value#219]

and the query works.
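The before/after plans above can be mimicked with a small toy model. This is only a sketch under stated assumptions: `Expr`, `Attr`, `Fn`, `CountStar`, `Aggregate`, `simplify`, `isValid` and `pullOut` are hypothetical stand-ins written for illustration, not Catalyst classes or the code of the actual rule.

```scala
// Toy model only: none of these types are Spark's Catalyst classes.
object PullOutGroupingSketch {
  sealed trait Expr
  final case class Attr(name: String) extends Expr            // column reference
  final case class Fn(name: String, child: Expr) extends Expr // unary scalar function
  case object CountStar extends Expr                          // stands in for count(*)

  // `project` lists the named expressions computed below the aggregate.
  final case class Aggregate(
      grouping: Seq[Expr],
      output: Seq[Expr],
      project: Seq[(String, Expr)] = Nil)

  // A BooleanSimplification-like rewrite: NOT isnull(x) becomes isnotnull(x).
  def simplify(e: Expr): Expr = e match {
    case Fn("not", Fn("isnull", c)) => Fn("isnotnull", simplify(c))
    case Fn(n, c)                   => Fn(n, simplify(c))
    case other                      => other
  }

  // Outside aggregate functions, every attribute in the output must be
  // reachable only through a grouping expression.
  def isValid(agg: Aggregate): Boolean = {
    def ok(e: Expr): Boolean = e match {
      case _ if agg.grouping.contains(e) => true
      case CountStar                     => true
      case Attr(_)                       => false
      case Fn(_, c)                      => ok(c)
    }
    agg.output.forall(ok)
  }

  // Name each complex grouping expression once and reference it by name
  // in both the grouping list and the output list.
  def pullOut(agg: Aggregate): Aggregate = {
    val complex = agg.grouping.filterNot(_.isInstanceOf[Attr]).distinct
    val named = complex.zipWithIndex.map { case (e, i) => s"_groupingexpression$i" -> e }
    val lookup: Map[Expr, Expr] = named.map { case (n, e) => e -> Attr(n) }.toMap
    def replace(e: Expr): Expr = e match {
      case _ if lookup.contains(e) => lookup(e)
      case Fn(n, c)                => Fn(n, replace(c))
      case other                   => other
    }
    Aggregate(agg.grouping.map(replace), agg.output.map(replace), agg.project ++ named)
  }

  // Stand-in for the optimizer: simplifies the output expressions only.
  def optimize(agg: Aggregate): Aggregate = agg.copy(output = agg.output.map(simplify))

  def main(args: Array[String]): Unit = {
    val isNullId = Fn("isnull", Attr("id"))
    val original = Aggregate(Seq(isNullId), Seq(Fn("not", isNullId), CountStar))
    println(isValid(optimize(original)))          // without pulling out
    println(isValid(optimize(pullOut(original)))) // with pulling out
  }
}
```

In this model the simplification turns the output into `isnotnull(id)`, which no longer contains the grouping expression `isnull(id)`, so the validity check fails. After `pullOut`, the output is `not(_groupingexpression0)`, which the simplification leaves alone.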

Does this PR introduce any user-facing change?

Yes, the query works.

How was this patch tested?

Added new UT.


peter-toth · 2021-04-29T12:41:26Z

@sigmod, @cloud-fan this is the alternative PR to #31913

peter-toth · 2021-04-29T12:44:39Z

sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/complexTypesSuite.scala

@@ -405,14 +407,6 @@ class ComplexTypesSuite extends PlanTest with ExpressionEvalHelper {
    val arrayAggRel = relation.groupBy(
      CreateArray(Seq('nullable_id)))(GetArrayItem(CreateArray(Seq('nullable_id)), 0))
    checkRule(arrayAggRel, arrayAggRel)
-


This can be removed now. It is optimized to:

Aggregate [*id#0L], [CASE WHEN (0 = *id#0L) THEN (*id#0L + 1) END AS a#0L]
+- LocalRelation <empty>, [*id#0L, nullable_id#0L]

cloud-fan · 2021-04-29T12:50:23Z

...lyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/PullOutGroupingExpressions.scala

+      case a: Aggregate if a.resolved =>
+        val complexGroupingExpressionMap = mutable.LinkedHashMap.empty[Expression, NamedExpression]
+        val newGroupingExpressions = a.groupingExpressions
+          .filterNot(AggregateExpression.containsAggregate)


is this needed? IIUC the analyzer guarantees that grouping expression can't contain aggregate expressions.

Ah, you are right. I ran some experiments to make this rule part of the Analyzer, as PullOutNondeterministic is, but realized that it would require more changes and reverted them. I forgot to remove this. Fixed in: ba2f0c7

It seems like you only removed it from the following map method, but this filterNot can also be removed.

Indeed, thanks. I made some other mistakes too in that commit.

Should be ok after bfb85de

cloud-fan · 2021-04-29T12:50:29Z

...lyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/PullOutGroupingExpressions.scala

+        val newGroupingExpressions = a.groupingExpressions
+          .filterNot(AggregateExpression.containsAggregate)
+          .map {
+            case e if AggregateExpression.isAggregate(e) => e


Fixed in ba2f0c7.


cloud-fan · 2021-04-29T12:56:44Z

The code change looks good. Can we run a TPCDS benchmark to make sure there is no perf regression?

SparkQA · 2021-04-29T13:18:32Z

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42601/

peter-toth · 2021-04-29T15:47:19Z

The code change looks good. Can we run a TPCDS benchmark to make sure there is no perf regression?

I will run it and post the results soon.

SparkQA · 2021-04-29T16:34:11Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42605/

SparkQA · 2021-04-29T16:38:37Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42605/

SparkQA · 2021-04-29T17:00:52Z

Test build #138081 has finished for PR 32396 at commit 5a6367b.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2021-04-29T19:33:41Z

Test build #138085 has finished for PR 32396 at commit ba2f0c7.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2021-04-30T22:26:03Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42637/

SparkQA · 2021-04-30T22:30:29Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42637/

SparkQA · 2021-05-01T02:17:26Z

Test build #138116 has finished for PR 32396 at commit bfb85de.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan

LGTM if TPCDS result shows no perf regression

sigmod

Thanks, Peter!
LGTM -- I just have one minor comment.

sigmod · 2021-05-01T06:14:29Z

...lyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/PullOutGroupingExpressions.scala

+          }
+
+          val newAggregateExpressions = a.aggregateExpressions
+            .map(replaceComplexGroupingExpressions(_).asInstanceOf[NamedExpression])


Can it be done with a.transformExpressions similar to this one:
https://github.com/databricks/runtime/blob/8aff141a8545c2ea1759230482928e25684acbe2/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/PullOutNondeterministic.scala#L39-L41

We need to do a manual tree traversal if we want to stop recursion earlier, e.g. case _ if AggregateExpression.isAggregate(e) => e

Does the following one work?

a.transformExpressionsWithPruning(e => !(AggregateExpression.isAggregate(e) || e.foldable)) {
....
}

Hmm, yes, this could work with some explicit casting.

But this would traverse on a.groupingExpressions too which is not needed.

But this would traverse on a.groupingExpressions too which is not needed.

You're right. I think the following would behave the same as the manual recursion:

a.aggregateExpressions.map(_.transformWithPruning(e => !(AggregateExpression.isAggregate(e) || e.foldable))({
// the first two original case branches can be skipped here.
.....
}).asInstanceOf....)

Anyway, it's just my small preference -- it seems neater to use framework functions if it works. Feel free to merge whatever you feel comfortable with.

I think I'm leaving this PR as it is now.

That said, I tested that peter-toth@ed374fe could work; I just need to cast TreePatternBits to Expression.
I also wonder whether it would make sense to split plan and expression pruning in the future, like this: peter-toth@d817fc7. This pruning (and probably other similar use cases where we want to stop traversal) would then become simpler: peter-toth@d817fc7#diff-57201016f79912c165715811d7f7f37e2acbef2ae7b241c3c8a0b928d0052eb5R61

Thanks a lot for exploring this, Peter! I'll think more about such use cases.
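The two approaches discussed in this thread, a manual recursion that stops early versus a generic transform with a pruning predicate, can be compared on a toy expression tree. This is a sketch only: `Expr`, `AggFn`, `transformPruned` and the other names below are hypothetical and do not reproduce the signatures of Catalyst's `transformWithPruning` (which, as noted above, takes a predicate over `TreePatternBits`).

```scala
// Toy model only: not Catalyst's TreeNode API.
object EarlyStopTraversalSketch {
  sealed trait Expr { def foldable: Boolean = false }
  final case class Attr(name: String) extends Expr
  final case class Lit(value: Int) extends Expr {
    override def foldable: Boolean = true
  }
  final case class Fn(name: String, child: Expr) extends Expr {
    override def foldable: Boolean = child.foldable
  }
  final case class AggFn(name: String, child: Expr) extends Expr

  def isAggregate(e: Expr): Boolean = e.isInstanceOf[AggFn]

  // Manual recursion: the first two cases stop the traversal early.
  def replaceManually(e: Expr, pulledOut: Map[Expr, Attr]): Expr = e match {
    case _ if isAggregate(e)         => e
    case _ if e.foldable             => e
    case _ if pulledOut.contains(e)  => pulledOut(e)
    case Fn(n, c)                    => Fn(n, replaceManually(c, pulledOut))
    case other                       => other
  }

  // Generic top-down transform that enters a subtree only if `descend` holds.
  def transformPruned(e: Expr)(descend: Expr => Boolean)(
      rule: PartialFunction[Expr, Expr]): Expr =
    if (!descend(e)) e
    else rule.applyOrElse(e, (x: Expr) => x) match {
      case Fn(n, c)    => Fn(n, transformPruned(c)(descend)(rule))
      case AggFn(n, c) => AggFn(n, transformPruned(c)(descend)(rule))
      case other       => other
    }

  // Same replacement expressed through the generic helper; the early-stop
  // cases move into the predicate, so the rule only needs the lookup case.
  def replaceViaPruning(e: Expr, pulledOut: Map[Expr, Attr]): Expr =
    transformPruned(e)(x => !(isAggregate(x) || x.foldable)) {
      case x if pulledOut.contains(x) => pulledOut(x)
    }
}
```

In this model both functions leave anything under an aggregate function untouched, which matches the point made earlier in the thread that the first two case branches can be dropped once the predicate handles them.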

peter-toth · 2021-05-01T08:03:02Z

LGTM if TPCDS result shows no perf regression

Running it, will post the results today.

I forgot to make PullOutGroupingExpressions non-excludable. Fixed it in fcc4005

SparkQA · 2021-05-01T08:28:12Z

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42646/

SparkQA · 2021-05-01T12:09:56Z

Test build #138125 has finished for PR 32396 at commit fcc4005.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

peter-toth · 2021-05-01T17:59:58Z

TPCDS benchmark on scaleFactor=5 data looks good, no significant change: TPCDSQueryBenchmark-results.txt

cloud-fan · 2021-05-02T05:52:25Z

thanks, merging to master!

peter-toth · 2021-05-02T08:54:32Z

Thanks all for the review.

maropu · 2021-05-04T07:57:47Z

NOTE: if anything, it seems this change has improved TPCDS performance, e.g., 250230 => 228723 in q23a (sf=20). Nice work, @peter-toth

peter-toth · 2021-05-04T08:09:19Z

Thanks @maropu for the extended test.

… expressions from aggregate expressions without aggregate function (#941)

* [SPARK-34581][SQL] Don't optimize out grouping expressions from aggregate expressions without aggregate function

Closes #32396 from peter-toth/SPARK-34581-keep-grouping-expressions-2.
Authored-by: Peter Toth
Signed-off-by: Wenchen Fan
(cherry picked from commit cfc0495)

* [SPARK-34037][SQL] Remove unnecessary upcasting for Avg & Sum which handle by themself internally

What changes were proposed in this pull request? The type-coercion for numeric types of average and sum is not necessary at all, as the resultType and sumType can prevent the overflow.
Why are the changes needed? rm unnecessary logic which may cause potential performance regressions
Does this PR introduce any user-facing change? no
How was this patch tested? tpcds tests for plan

Closes #31079 from yaooqinn/SPARK-34037.
Authored-by: Kent Yao
Signed-off-by: Liang-Chi Hsieh
(cherry picked from commit a235c3b)

Co-authored-by: Peter Toth

[SPARK-34581][SQL] Don't optimize out grouping expressions from aggregate expressions without aggregate function

5a6367b

github-actions bot added the SQL label Apr 29, 2021

peter-toth mentioned this pull request Apr 29, 2021

[SPARK-34581][SQL] Don't optimize out grouping expressions from aggregate expressions without aggregate function #31913

Closed

peter-toth commented Apr 29, 2021

cloud-fan reviewed Apr 29, 2021

...lyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/PullOutGroupingExpressions.scala

minor fixes

ba2f0c7

fix the fix

bfb85de

cloud-fan approved these changes May 1, 2021

sigmod reviewed May 1, 2021

make PullOutGroupingExpressions non-excludable

fcc4005

sigmod approved these changes May 1, 2021

cloud-fan closed this in cfc0495 May 2, 2021
