
[SPARK-27255][SQL] Report error when illegal expressions are hosted by a plan operator. #24209

Closed
wants to merge 9 commits

Conversation

@dilipbiswal (Contributor) commented Mar 25, 2019

What changes were proposed in this pull request?

In this PR, we raise an AnalysisException when we detect the presence of aggregate expressions in a WHERE clause. Here is the problem description from the JIRA:

Aggregate functions should not be allowed in the WHERE clause. But Spark SQL throws an exception only when generating code. It is supposed to throw an exception during parsing or analysis.

Here is an example:

val df = spark.sql("select * from t where sum(ta) > 0")
df.explain(true)
df.show()

Resulting exception:

Exception in thread "main" java.lang.UnsupportedOperationException: Cannot generate code for expression: sum(cast(input[0, int, false] as bigint))
	at org.apache.spark.sql.catalyst.expressions.Unevaluable.doGenCode(Expression.scala:291)
	at org.apache.spark.sql.catalyst.expressions.Unevaluable.doGenCode$(Expression.scala:290)
	at org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression.doGenCode(interfaces.scala:87)
	at org.apache.spark.sql.catalyst.expressions.Expression.$anonfun$genCode$3(Expression.scala:138)
	at scala.Option.getOrElse(Option.scala:138)

I checked the behaviour of other databases, and all of them return an error:
Postgres

select * from foo where max(c1) > 0;
Error
ERROR: aggregate functions are not allowed in WHERE Position: 25

DB2

db2 => select * from foo where max(c1) > 0;
SQL0120N  Invalid use of an aggregate function or OLAP function.  

Oracle

select * from foo where max(c1) > 0;
ORA-00934: group function is not allowed here

MySQL

select * from foo where max(c1) > 0;
Invalid use of group function

Update
This PR has been enhanced to report an error when expressions such as Aggregate, Window, and Generate are hosted by operators where they are invalid.
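
The check this PR adds to CheckAnalysis can be sketched as a standalone toy model (the names below, such as `Expr`, `AggExpr`, and `collectAggregates`, are hypothetical stand-ins, not Spark's actual Catalyst classes):

```scala
// Toy stand-ins for Catalyst expressions; all names here are hypothetical.
sealed trait Expr { def children: Seq[Expr] }
case class Attr(name: String) extends Expr { def children: Seq[Expr] = Nil }
case class IntLit(v: Int) extends Expr { def children: Seq[Expr] = Nil }
case class GreaterThan(l: Expr, r: Expr) extends Expr { def children: Seq[Expr] = Seq(l, r) }
case class AggExpr(fn: String, arg: Expr) extends Expr { def children: Seq[Expr] = Seq(arg) }

// Collect every aggregate expression nested anywhere in a condition. A Filter
// whose condition yields a non-empty result hosts an illegal expression, and
// the analysis check can fail the query with a targeted message.
def collectAggregates(e: Expr): Seq[AggExpr] = e match {
  case a: AggExpr => a +: a.children.flatMap(collectAggregates)
  case other      => other.children.flatMap(collectAggregates)
}

// WHERE sum(ta) > 0 -> an aggregate is found, so analysis should fail.
val bad = GreaterThan(AggExpr("sum", Attr("ta")), IntLit(0))
// WHERE ta > 0 -> nothing found, the plan is fine.
val ok = GreaterThan(Attr("ta"), IntLit(0))
```

The real implementation checks for window functions and generators in the same traversal; this sketch shows only the aggregate case for brevity.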

How was this patch tested?

Added tests in AnalysisErrorSuite and group-by.sql

@dilipbiswal changed the title from "[SPARK-27255] Aggregate functions should not be allowed in WHERE" to "[SPARK-27255][SQL] Aggregate functions should not be allowed in WHERE" on Mar 25, 2019
@SparkQA commented Mar 25, 2019

Test build #103924 has finished for PR 24209 at commit 61ec443.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dilipbiswal (Contributor, Author)

cc @cloud-fan @maropu @chakravarthiT

@maropu (Member) commented Mar 26, 2019

Just to make sure this is the right approach, one question: is CheckAnalysis the best place to fix this issue? For example, the analyzer itself throws an exception when a filter has generators:

scala> sql("select * from x where explode(array(1, 2, 3)) < id").show
org.apache.spark.sql.AnalysisException: Generators are not supported outside the SELECT clause, but got: 'Filter (explode(array(1, 2, 3)) < id#10L);
  at org.apache.spark.sql.catalyst.analysis.Analyzer$ExtractGenerator$$anonfun$apply$17.applyOrElse(Analyzer.scala:1796)
  at org.apache.spark.sql.catalyst.analysis.Analyzer$ExtractGenerator$$anonfun$apply$17.applyOrElse(Analyzer.scala:1751)
  at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsUp$3(AnalysisHelper.scala:90)
  at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:72)

@dilipbiswal (Contributor, Author) commented Mar 26, 2019

@maropu In my understanding, the goal was to handle most semantic errors in checkAnalysis. In cases where that is not doable, they are handled in the analyzer itself. That's my understanding. What do you think, Wenchen?
Update: One such example is in PR

@cloud-fan (Contributor)

I vaguely remember we did a similar check for window functions, how was that done? cc @rednaxelafx

@dilipbiswal (Contributor, Author) commented Mar 26, 2019

@cloud-fan I checked the code. The validation for the presence of window expressions in FILTER and HAVING is done in the analyzer, in the rule ExtractWindowExpressions. I don't know how difficult it would be to do that check in checkAnalysis. Perhaps we would run into issues while processing the window expression later in the analyzer.

However, for our current case, I think it makes sense to do it in checkAnalysis, since HAVING can contain an aggregate and is planned as a Filter. I think somewhere in the Analyzer we turn this into a Filter over an Aggregate, with the Filter referencing the output of the Aggregate as a named expression. So after all that work, if we reach checkAnalysis and still have aggregate expressions in the Filter node, we raise the error. Please let me know.
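
The distinction can be illustrated with a toy model (the case classes below are hypothetical stand-ins, not Catalyst's API): after the analyzer rewrites HAVING, the Filter condition only references an attribute produced by the child Aggregate, whereas a WHERE clause with an aggregate still carries the aggregate expression itself when the check runs.

```scala
// Minimal stand-ins for logical plan nodes and expressions (hypothetical names).
sealed trait Expr
case class Attr(name: String) extends Expr
case class IntLit(v: Int) extends Expr
case class AggExpr(fn: String, arg: String) extends Expr
case class GreaterThan(l: Expr, r: Expr) extends Expr

sealed trait Plan
case class Relation(name: String) extends Plan
case class Aggregate(aggs: Seq[(AggExpr, String)], child: Plan) extends Plan
case class Filter(condition: Expr, child: Plan) extends Plan

def containsAgg(e: Expr): Boolean = e match {
  case _: AggExpr        => true
  case GreaterThan(l, r) => containsAgg(l) || containsAgg(r)
  case _                 => false
}

// A Filter is illegal only if an aggregate survives in its own condition.
def filterHostsAggregate(p: Plan): Boolean = p match {
  case Filter(cond, _) => containsAgg(cond)
  case _               => false
}

// HAVING count(*) > 1 after analysis: the Filter references the alias `cnt`,
// and the aggregate itself lives in the Aggregate node below, so it is legal.
val havingPlan = Filter(
  GreaterThan(Attr("cnt"), IntLit(1)),
  Aggregate(Seq((AggExpr("count", "*"), "cnt")), Relation("test_agg")))

// WHERE count(*) > 1: the aggregate is still inside the Filter condition
// when checkAnalysis runs, so the error is raised.
val wherePlan = Filter(
  GreaterThan(AggExpr("count", "*"), IntLit(1)),
  Relation("test_agg"))
```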

@dilipbiswal (Contributor, Author)

@cloud-fan For your reference, this PR has added the check.

@cloud-fan (Contributor)

Can you check #23658?

@dilipbiswal (Contributor, Author)

@cloud-fan Thanks for pointing me to this. I had seen it before but had completely forgotten about it :-). So Wenchen, in my understanding, that PR ensures the integrity of the plan is not broken in the optimizer between applications of rules. Is there something specific you wanted me to focus on? One thing I noticed is that there, we reject plans which have aggregate expressions where the plan node is not an Aggregate, whereas in this PR we check only Filter. Did you want the fix to target other operators as well, just like the integrity-check PR? I feel it's better to do checks specific to the problematic operators so we can issue targeted error messages. What do you think?

@cloud-fan (Contributor)

I'm wondering why we still hit this problem after #23658

@dilipbiswal (Contributor, Author) commented Mar 26, 2019

@cloud-fan Oh, because the integrity check is done only in testing mode, right? So it's basically an internal thing?
When I try the following test, I do get an integrity-check error:

test("test agg") {
    withSQLConf("spark.testing" -> "true") {
      withTempView("a", "b") {
        Seq("a" -> 2, "b" -> 1).toDF("id", "num").createTempView("a")
        Seq("a" -> 2, "b" -> 1).toDF("id", "num").createTempView("b")

        sql("select * from a where max(id) > 0").show

      }
    }
  }

@cloud-fan (Contributor)

cc @rednaxelafx @gatorsmile shall we make that check official?

@gatorsmile (Member)

I think we should verify it in the analyzer stage.

The plan integrity verification is just for ensuring that optimizer rules do not change the plan in a way that breaks its integrity.

@@ -172,16 +176,15 @@ trait CheckAnalysis extends PredicateHelper {
failAnalysis("Null-aware predicate sub-queries cannot be used in nested " +
s"conditions: $condition")

case Filter(condition, _) if condition.find(isAggregateExpression(_)).isDefined =>
@gatorsmile (Member)

call plan integrity checking here?

@dilipbiswal (Author)

@gatorsmile Thanks for reviewing. I will try it and get back.

@dilipbiswal (Author)

@gatorsmile @cloud-fan I just made a change to use the integrity-check method. I had to refactor it out, as it's a private method in the optimizer. Is this what you had in mind? Please let me know.

@SparkQA commented Mar 28, 2019

Test build #104030 has finished for PR 24209 at commit 6df1ac1.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Mar 28, 2019

Test build #104031 has finished for PR 24209 at commit af419e2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -172,13 +172,17 @@ trait CheckAnalysis extends PredicateHelper {
failAnalysis("Null-aware predicate sub-queries cannot be used in nested " +
s"conditions: $condition")

case f @ Filter(condition, _) if PlanHelper.specialExpressionInUnsupportedOperator(f) =>
@cloud-fan (Contributor)

why only check filter here? I think we should do it for any operator.

@dilipbiswal (Author) Mar 28, 2019

@cloud-fan I thought having a targeted error message for the user would be better. If we do it generically, we would say "Filter" instead of "WHERE clause", right? Should I handle Filter here and have a general catch-all as the last case statement that issues a generic error message like "The operator xxxx contains an unsupported expression type"?

(Contributor)

"The operator xxxx contains an unsupported expression type" sounds good

(Contributor)

Then we can move it closer to the general case?

@dilipbiswal (Author)

@cloud-fan Sure.

SELECT * FROM (SELECT COUNT(*) AS cnt FROM test_agg) WHERE cnt > 1;

-- Error when aggregate expressions are in where clause directly
SELECT count(*) FROM test_agg WHERE count(*) > 1;
@cloud-fan (Contributor)

can we test HAVING as well?

@dilipbiswal (Author)

@cloud-fan OK, I will add HAVING.

@@ -70,9 +70,10 @@ Resolved attribute(s) t2b#x missing from min(t2a)#x,t2c#x in operator !Filter t2
SELECT t1a
FROM t1
GROUP BY 1
HAVING EXISTS (SELECT 1
HAVING EXISTS (SELECT t2a
@cloud-fan (Contributor)

why this change?

@dilipbiswal (Author)

@cloud-fan Because the child subquery plan cannot be resolved due to missing attributes.

+- Project [1#81]
:     +- !Filter (t2a#9 < min((outer(t1a#6) + t2a#9))#86)
:        +- Aggregate [1], [1 AS 1#81, min((outer(t1a#6) + t2a#9)) AS min((outer(t1a#6) + t2a#9))#86]
:           +- SubqueryAlias `t2`
:              +- Project [t2a#9, t2b#10, t2c#11]
:                 +- SubqueryAlias `t2`
:                    +- LocalRelation [t2a#9, t2b#10, t2c#11]

@cloud-fan (Contributor)

But we do expect analysis exception for this query, don't we?

@dilipbiswal (Author)

@cloud-fan Yeah, this particular test is in the subquery suite and checks for a specific analysis exception that disallows a combination of local and outer references inside an aggregate expression. However, this test had the aggregate expression in the filter clause, which we are disallowing now. So I changed it to use HAVING instead. But when I did that, I hit a different analysis exception due to missing attributes (the test in question is trying to exercise a different analysis exception). So I fixed the projection here to include the missing attribute, so that we get the desired subquery-related exception.

(Contributor)

makes sense

@SparkQA commented Mar 29, 2019

Test build #104065 has finished for PR 24209 at commit 49f04d4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dilipbiswal changed the title from "[SPARK-27255][SQL] Aggregate functions should not be allowed in WHERE" to "[SPARK-27255][SQL] Report error when illegal expressions are hosted by a plan operator." on Mar 29, 2019
@@ -172,13 +172,17 @@ trait CheckAnalysis extends PredicateHelper {
failAnalysis("Null-aware predicate sub-queries cannot be used in nested " +
s"conditions: $condition")

case f @ Filter(condition, _) if PlanHelper.specialExpressionInUnsupportedOperator(f) =>
@cloud-fan (Contributor)

do we still need this?

@dilipbiswal (Author)

@cloud-fan Actually, we don't need it. However, since it's a user-facing message, I feel it's better if we issue a targeted error message. I was thinking we should add more individual cases here as we learn about them, while still keeping a generic catch-all.
If you think otherwise, I will remove it. Let me know, please.

* - A Generator but the plan is not Generate
* Returns true when this operator hosts illegal expressions. This can happen when
* 1. The input query from users contain invalid expressions.
* Example : SELECT * FROM tab WHERE max(c1) > 0
@cloud-fan (Contributor)

BTW how about SELECT max(c1) FROM tab WHERE max(c1) > 0? Does it work?

@dilipbiswal (Author) Mar 29, 2019

@cloud-fan No, this case does not work. But this should work, and I have a test case for it:

SELECT max FROM (SELECT max(v1) AS max FROM tab) WHERE max > 1

@SparkQA commented Mar 30, 2019

Test build #104107 has finished for PR 24209 at commit b0ab3e2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

""".stripMargin)

case other if PlanHelper.specialExpressionInUnsupportedOperator(other) =>
failAnalysis(s"The query operator `${other.nodeName}` contains " +
@maropu (Member)

nit: remove an unnecessary single space.

@dilipbiswal (Author)

@maropu Will remove.

@@ -52,29 +52,7 @@ abstract class Optimizer(sessionCatalog: SessionCatalog)
* Returns true when all operators are integral.
*/
private def checkSpecialExpressionIntegrity(plan: LogicalPlan): Boolean = {
@maropu (Member)

It seems this private function is called by isPlanIntegral only, so do we need to define it separately?

@dilipbiswal (Author)

@maropu Ok... Let me remove it.

@SparkQA commented Apr 1, 2019

Test build #104156 has finished for PR 24209 at commit a743b49.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dilipbiswal (Contributor, Author)

@cloud-fan Does this look OK now?

import org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression

/**
* [[PlanHelper]] Contains utility methods that can be used by Analyzer and Optimizer.
(Member)

super nit: Contains -> contains.

Don't need to wait for another round test for this.

/**
* [[PlanHelper]] Contains utility methods that can be used by Analyzer and Optimizer.
* It can also be container of methods that are common across multiple rules in Analyzer
* and optimizer.
(Member)

super nit: optimizer -> Optimizer. As consistent to above line.

* Example : SELECT * FROM tab WHERE max(c1) > 0
* 2. Query rewrites inadvertently produce plans that are invalid.
*/
def specialExpressionInUnsupportedOperator(plan: LogicalPlan): Boolean = {
@mgaido91 (Contributor)

what if we return the unsupported operator here and we report the specific one which is problematic to the user, rather than printing the whole filter condition? I think this would help the user to understand which specific thing caused the issue and it would be more user friendly.

@dilipbiswal (Author)

@mgaido91 The caller already knows about the problematic operator, no? As for the problematic expression, I suppose we could return the first occurrence of the problematic expression and report it in the error text.

@mgaido91 (Contributor) Apr 1, 2019

yes, sorry, I meant expression, not operator. Sorry for the mistake.

i suppose we could return the first occurence of the problematic expression and report it in the error text.

Yes, this is what I meant. I think it would be easier for the user to understand and fix the problem, especially if the filter condition is very big and hard to read as a whole.
It would be even better if we could return the list of all the problematic expressions, if possible, to save the user from iterating fixing -> trying -> fixing -> ...
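
A rough sketch of what listing every offending expression in one message could look like (the helper `buildErrorMessage` is hypothetical; the message format mirrors the one under review in this PR):

```scala
// Hypothetical helper: given the SQL text of every invalid expression found
// in an operator, build one error message listing them all, so the user can
// fix everything in a single pass instead of iterating fix -> try -> fix.
def buildErrorMessage(nodeName: String, invalidExprSqls: Seq[String]): String =
  s"""The query operator `$nodeName` contains one or more unsupported
     |expression types Aggregate, Window or Generate.
     |Invalid expressions: [${invalidExprSqls.mkString(", ")}]""".stripMargin

// e.g. a Filter hosting two illegal aggregates reports both at once.
val msg = buildErrorMessage("Filter", Seq("max(c1)", "sum(c2)"))
```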

@SparkQA commented Apr 1, 2019

Test build #104166 has finished for PR 24209 at commit c9d1f59.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Apr 2, 2019

Test build #104176 has finished for PR 24209 at commit a3f4b91.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mgaido91 (Contributor) left a comment

Only a few minor comments; otherwise LGTM, thanks.

|The query operator `${other.nodeName}` contains one or more unsupported
|expression types Aggregate, Window or Generate.
|Invalid expressions: [${invalidExprSqls.mkString(", ")}]
|
@mgaido91 (Contributor)

nit: can we remove this line? and also close the string on the line above, in order to avoid weird things like those ; in the middle of nowhere in the output

@dilipbiswal (Author)

@mgaido91 Thanks, will remove. By closing on the line above, you mean like:

|Invalid expressions: [${invalidExprSqls.mkString(", ")}]""".stripMargin

?

@mgaido91 (Contributor)

yes, I meant that

* Example : SELECT * FROM tab WHERE max(c1) > 0
* 2. Query rewrites inadvertently produce plans that are invalid.
*/
def specialExpressionInUnsupportedOperator(plan: LogicalPlan): Seq[Expression] = {
@mgaido91 (Contributor)

nit: specialExpressions?

@dilipbiswal (Author)

@mgaido91 Will change.

@SparkQA commented Apr 2, 2019

Test build #104217 has finished for PR 24209 at commit 15b0a98.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Apr 2, 2019

Test build #104219 has finished for PR 24209 at commit bbab4c1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dilipbiswal (Contributor, Author)

@cloud-fan Please let me know if this looks okay now.

@cloud-fan (Contributor)

thanks, merging to master!

@cloud-fan closed this in 3286bff on Apr 3, 2019
@dilipbiswal (Contributor, Author)

@cloud-fan Thank you very much !!

7 participants