
[SPARK-27255][SQL] Report error when illegal expressions are hosted by a plan operator. #24209

Closed
wants to merge 9 commits

Conversation

@dilipbiswal (Contributor) commented Mar 25, 2019

What changes were proposed in this pull request?

In this PR, we raise an AnalysisException when we detect the presence of aggregate expressions in a WHERE clause. Here is the problem description from the JIRA:

Aggregate functions should not be allowed in the WHERE clause. But Spark SQL throws an exception only when generating code. It is supposed to throw an exception during parsing or analysis.

Here is an example:

val df = spark.sql("select * from t where sum(ta) > 0")
df.explain(true)
df.show()

Resulting exception:

Exception in thread "main" java.lang.UnsupportedOperationException: Cannot generate code for expression: sum(cast(input[0, int, false] as bigint))
	at org.apache.spark.sql.catalyst.expressions.Unevaluable.doGenCode(Expression.scala:291)
	at org.apache.spark.sql.catalyst.expressions.Unevaluable.doGenCode$(Expression.scala:290)
	at org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression.doGenCode(interfaces.scala:87)
	at org.apache.spark.sql.catalyst.expressions.Expression.$anonfun$genCode$3(Expression.scala:138)
	at scala.Option.getOrElse(Option.scala:138)

I checked the behaviour of other databases, and all of them return an error:
Postgres

select * from foo where max(c1) > 0;
Error
ERROR: aggregate functions are not allowed in WHERE Position: 25

DB2

db2 => select * from foo where max(c1) > 0;
SQL0120N  Invalid use of an aggregate function or OLAP function.  

Oracle

select * from foo where max(c1) > 0;
ORA-00934: group function is not allowed here

MySQL

select * from foo where max(c1) > 0;
Invalid use of group function

Update
This PR has been enhanced to report an error when expressions such as Aggregate, Window, and Generate are hosted by operators where they are invalid.
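
The check this PR adds to CheckAnalysis can be sketched as a standalone toy model (the names below, such as `Expr`, `AggExpr`, and `collectAggregates`, are hypothetical stand-ins, not Spark's actual Catalyst classes):

```scala
// Toy stand-ins for Catalyst expressions; all names here are hypothetical.
sealed trait Expr { def children: Seq[Expr] }
case class Attr(name: String) extends Expr { def children: Seq[Expr] = Nil }
case class IntLit(v: Int) extends Expr { def children: Seq[Expr] = Nil }
case class GreaterThan(l: Expr, r: Expr) extends Expr { def children: Seq[Expr] = Seq(l, r) }
case class AggExpr(fn: String, arg: Expr) extends Expr { def children: Seq[Expr] = Seq(arg) }

// Collect every aggregate expression nested anywhere in a condition. A Filter
// whose condition yields a non-empty result hosts an illegal expression, and
// the analysis check can fail the query with a targeted message.
def collectAggregates(e: Expr): Seq[AggExpr] = e match {
  case a: AggExpr => a +: a.children.flatMap(collectAggregates)
  case other      => other.children.flatMap(collectAggregates)
}

// WHERE sum(ta) > 0 -> an aggregate is found, so analysis should fail.
val bad = GreaterThan(AggExpr("sum", Attr("ta")), IntLit(0))
// WHERE ta > 0 -> nothing found, the plan is fine.
val ok = GreaterThan(Attr("ta"), IntLit(0))
```

The real implementation checks for window functions and generators in the same traversal; this sketch shows only the aggregate case for brevity.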

How was this patch tested?

Added tests in AnalysisErrorSuite and group-by.sql

@dilipbiswal changed the title from "[SPARK-27255] Aggregate functions should not be allowed in WHERE" to "[SPARK-27255][SQL] Aggregate functions should not be allowed in WHERE" on Mar 25, 2019
@SparkQA commented Mar 25, 2019

Test build #103924 has finished for PR 24209 at commit 61ec443.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dilipbiswal (Contributor, Author)

cc @cloud-fan @maropu @chakravarthiT

@maropu (Member) commented Mar 26, 2019

Just to make sure this is the right approach, one question: is CheckAnalysis the best place to fix this issue? For example, the analyzer itself throws an exception when a filter has generators:

scala> sql("select * from x where explode(array(1, 2, 3)) < id").show
org.apache.spark.sql.AnalysisException: Generators are not supported outside the SELECT clause, but got: 'Filter (explode(array(1, 2, 3)) < id#10L);
  at org.apache.spark.sql.catalyst.analysis.Analyzer$ExtractGenerator$$anonfun$apply$17.applyOrElse(Analyzer.scala:1796)
  at org.apache.spark.sql.catalyst.analysis.Analyzer$ExtractGenerator$$anonfun$apply$17.applyOrElse(Analyzer.scala:1751)
  at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsUp$3(AnalysisHelper.scala:90)
  at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:72)

@dilipbiswal (Contributor, Author) commented Mar 26, 2019

@maropu In my understanding, the goal was to handle most semantic errors in checkAnalysis. In cases where that is not doable, they are handled in the analyzer itself. That's my understanding. What do you think, Wenchen?
Update: One such example is in PR

@cloud-fan (Contributor)

I vaguely remember we did a similar check for window functions, how was that done? cc @rednaxelafx

@dilipbiswal (Contributor, Author) commented Mar 26, 2019

@cloud-fan I checked the code. The validation for the presence of window expressions in FILTER and HAVING is done in the analyzer, in the rule ExtractWindowExpressions. I don't know how difficult it would be to do that check in checkAnalysis. Perhaps we would run into issues while processing the window expression later in the analyzer.

However, for our current case, I think it makes sense to do it in checkAnalysis, since HAVING can contain an aggregate and is planned as a Filter. I think somewhere in the Analyzer we turn this into a Filter over an Aggregate, with the Filter referencing the output of the Aggregate as a named expression. So after all that work, if we reach checkAnalysis and still have aggregate expressions in the Filter node, we raise the error. Please let me know.
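
The distinction can be illustrated with a toy model (the case classes below are hypothetical stand-ins, not Catalyst's API): after the analyzer rewrites HAVING, the Filter condition only references an attribute produced by the child Aggregate, whereas a WHERE clause with an aggregate still carries the aggregate expression itself when the check runs.

```scala
// Minimal stand-ins for logical plan nodes and expressions (hypothetical names).
sealed trait Expr
case class Attr(name: String) extends Expr
case class IntLit(v: Int) extends Expr
case class AggExpr(fn: String, arg: String) extends Expr
case class GreaterThan(l: Expr, r: Expr) extends Expr

sealed trait Plan
case class Relation(name: String) extends Plan
case class Aggregate(aggs: Seq[(AggExpr, String)], child: Plan) extends Plan
case class Filter(condition: Expr, child: Plan) extends Plan

def containsAgg(e: Expr): Boolean = e match {
  case _: AggExpr        => true
  case GreaterThan(l, r) => containsAgg(l) || containsAgg(r)
  case _                 => false
}

// A Filter is illegal only if an aggregate survives in its own condition.
def filterHostsAggregate(p: Plan): Boolean = p match {
  case Filter(cond, _) => containsAgg(cond)
  case _               => false
}

// HAVING count(*) > 1 after analysis: the Filter references the alias `cnt`,
// and the aggregate itself lives in the Aggregate node below, so it is legal.
val havingPlan = Filter(
  GreaterThan(Attr("cnt"), IntLit(1)),
  Aggregate(Seq((AggExpr("count", "*"), "cnt")), Relation("test_agg")))

// WHERE count(*) > 1: the aggregate is still inside the Filter condition
// when checkAnalysis runs, so the error is raised.
val wherePlan = Filter(
  GreaterThan(AggExpr("count", "*"), IntLit(1)),
  Relation("test_agg"))
```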

@dilipbiswal (Contributor, Author)

@cloud-fan For your reference, this PR has added the check.

@cloud-fan (Contributor)

Can you check #23658?

@dilipbiswal (Contributor, Author)

@cloud-fan Thanks for pointing me to this. I had seen it before but had completely forgotten about it :-). So Wenchen, in my understanding, that PR ensures the integrity of the plan is not broken in the optimizer between applications of rules. Is there something specific you wanted me to focus on? One thing I noticed is that there, we reject plans which have aggregate expressions where the plan node is not an Aggregate, whereas in this PR we check only Filter. Did you want the fix to target other operators as well, just like the integrity-check PR? I feel it's better to do checks specific to the problematic operators so we can issue targeted error messages. What do you think?

@cloud-fan (Contributor)

I'm wondering why we still hit this problem after #23658

@dilipbiswal (Contributor, Author) commented Mar 26, 2019

@cloud-fan Oh, because the integrity check is done only in testing mode, right? So it's basically an internal thing?
When I try the following test, I do get an integrity-check error:

test("test agg") {
    withSQLConf("spark.testing" -> "true") {
      withTempView("a", "b") {
        Seq("a" -> 2, "b" -> 1).toDF("id", "num").createTempView("a")
        Seq("a" -> 2, "b" -> 1).toDF("id", "num").createTempView("b")

        sql("select * from a where max(id) > 0").show

      }
    }
  }

@cloud-fan (Contributor)

cc @rednaxelafx @gatorsmile shall we make that check official?

@gatorsmile (Member)

I think we should verify it in the analyzer stage.

The plan integrity verification is just for ensuring that optimizer rules do not change the plan in a way that breaks its integrity.

@@ -172,16 +176,15 @@ trait CheckAnalysis extends PredicateHelper {
failAnalysis("Null-aware predicate sub-queries cannot be used in nested " +
s"conditions: $condition")

case Filter(condition, _) if condition.find(isAggregateExpression(_)).isDefined =>
@gatorsmile (Member)

call plan integrity checking here?

@dilipbiswal (Author)

@gatorsmile Thanks for reviewing. I will try it and get back.

@dilipbiswal (Author)

@gatorsmile @cloud-fan I just made a change to use the integrity-check method. I had to refactor it out, as it's a private method in the optimizer. Is this what you had in mind? Please let me know.

@SparkQA commented Mar 28, 2019

Test build #104030 has finished for PR 24209 at commit 6df1ac1.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Mar 28, 2019

Test build #104031 has finished for PR 24209 at commit af419e2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -172,13 +172,17 @@ trait CheckAnalysis extends PredicateHelper {
failAnalysis("Null-aware predicate sub-queries cannot be used in nested " +
s"conditions: $condition")

case f @ Filter(condition, _) if PlanHelper.specialExpressionInUnsupportedOperator(f) =>
@cloud-fan (Contributor)

why only check filter here? I think we should do it for any operator.

@dilipbiswal (Author) Mar 28, 2019

@cloud-fan I thought having a targeted error message for the user would be better. If we do it generically, we would say "Filter" instead of "WHERE clause", right? Should I handle Filter here and have a general catch-all as the last case statement that issues a generic error message like "The operator xxxx contains an unsupported expression type"?

(Contributor)

"The operator xxxx contains an unsupported expression type" sounds good

(Contributor)

Then we can move it closer to the general case?

@dilipbiswal (Author)

@cloud-fan Sure.

SELECT * FROM (SELECT COUNT(*) AS cnt FROM test_agg) WHERE cnt > 1;

-- Error when aggregate expressions are in where clause directly
SELECT count(*) FROM test_agg WHERE count(*) > 1;
@cloud-fan (Contributor)

can we test HAVING as well?

@dilipbiswal (Author)

@cloud-fan OK, I will add HAVING.

@@ -70,9 +70,10 @@ Resolved attribute(s) t2b#x missing from min(t2a)#x,t2c#x in operator !Filter t2
SELECT t1a
FROM t1
GROUP BY 1
HAVING EXISTS (SELECT 1
HAVING EXISTS (SELECT t2a
@cloud-fan (Contributor)

why this change?

@dilipbiswal (Author)

@cloud-fan Because the child subquery plan cannot be resolved due to missing attributes.

+- Project [1#81]
:     +- !Filter (t2a#9 < min((outer(t1a#6) + t2a#9))#86)
:        +- Aggregate [1], [1 AS 1#81, min((outer(t1a#6) + t2a#9)) AS min((outer(t1a#6) + t2a#9))#86]
:           +- SubqueryAlias `t2`
:              +- Project [t2a#9, t2b#10, t2c#11]
:                 +- SubqueryAlias `t2`
:                    +- LocalRelation [t2a#9, t2b#10, t2c#11]

@cloud-fan (Contributor)

But we do expect analysis exception for this query, don't we?

@dilipbiswal (Author)

@cloud-fan Yeah, this particular test is in the subquery suite and checks for a specific analysis exception that disallows a combination of local and outer references inside an aggregate expression. However, this test had the aggregate expression in the filter clause, which we are disallowing now. So I changed it to use HAVING instead. But when I did that, I hit a different analysis exception due to missing attributes (the test in question is trying to exercise a different analysis exception). So I fixed the projection here to include the missing attribute, so that we get the desired subquery-related exception.

(Contributor)

makes sense

@SparkQA commented Mar 29, 2019

Test build #104065 has finished for PR 24209 at commit 49f04d4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dilipbiswal changed the title from "[SPARK-27255][SQL] Aggregate functions should not be allowed in WHERE" to "[SPARK-27255][SQL] Report error when illegal expressions are hosted by a plan operator." on Mar 29, 2019
@@ -172,13 +172,17 @@ trait CheckAnalysis extends PredicateHelper {
failAnalysis("Null-aware predicate sub-queries cannot be used in nested " +
s"conditions: $condition")

case f @ Filter(condition, _) if PlanHelper.specialExpressionInUnsupportedOperator(f) =>
@cloud-fan (Contributor)

do we still need this?

@dilipbiswal (Author)

@cloud-fan Actually, we don't need it. However, since it's a user-facing message, I feel it's better if we issue a targeted error message. I was thinking we should add more individual cases here as we learn about them, while still keeping a generic catch-all.
If you think otherwise, I will remove it. Let me know, please.

* - A Generator but the plan is not Generate
* Returns true when this operator hosts illegal expressions. This can happen when
* 1. The input query from users contain invalid expressions.
* Example : SELECT * FROM tab WHERE max(c1) > 0
@cloud-fan (Contributor)

BTW how about SELECT max(c1) FROM tab WHERE max(c1) > 0? Does it work?

@dilipbiswal (Author) Mar 29, 2019

@cloud-fan No, this case does not work. But this should work, and I have a test case for it:

SELECT max FROM (SELECT max(v1) AS max FROM tab) WHERE max > 1

@SparkQA commented Mar 30, 2019

Test build #104107 has finished for PR 24209 at commit b0ab3e2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

""".stripMargin)

case other if PlanHelper.specialExpressionInUnsupportedOperator(other) =>
failAnalysis(s"The query operator `${other.nodeName}` contains " +
@maropu (Member)

nit: remove an unnecessary single space.

@dilipbiswal (Author)

@maropu Will remove.

@@ -52,29 +52,7 @@ abstract class Optimizer(sessionCatalog: SessionCatalog)
* Returns true when all operators are integral.
*/
private def checkSpecialExpressionIntegrity(plan: LogicalPlan): Boolean = {
@maropu (Member)

It seems this private function is called by isPlanIntegral only, so do we need to define it separately?

@dilipbiswal (Author)

@maropu Ok... Let me remove it.

@SparkQA commented Apr 1, 2019

Test build #104156 has finished for PR 24209 at commit a743b49.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dilipbiswal (Contributor, Author)

@cloud-fan Does this look OK now?

import org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression

/**
* [[PlanHelper]] Contains utility methods that can be used by Analyzer and Optimizer.
(Member)

super nit: Contains -> contains.

Don't need to wait for another round test for this.

/**
* [[PlanHelper]] Contains utility methods that can be used by Analyzer and Optimizer.
* It can also be container of methods that are common across multiple rules in Analyzer
* and optimizer.
(Member)

super nit: optimizer -> Optimizer. As consistent to above line.

* Example : SELECT * FROM tab WHERE max(c1) > 0
* 2. Query rewrites inadvertently produce plans that are invalid.
*/
def specialExpressionInUnsupportedOperator(plan: LogicalPlan): Boolean = {
@mgaido91 (Contributor)

what if we return the unsupported operator here and we report the specific one which is problematic to the user, rather than printing the whole filter condition? I think this would help the user to understand which specific thing caused the issue and it would be more user friendly.

@dilipbiswal (Author)

@mgaido91 The caller already knows about the problematic operator, no? As for the problematic expression, I suppose we could return the first occurrence of the problematic expression and report it in the error text.

@mgaido91 (Contributor) Apr 1, 2019

yes, sorry, I meant expression, not operator. Sorry for the mistake.

i suppose we could return the first occurence of the problematic expression and report it in the error text.

Yes, this is what I meant. I think it would be easier for the user to understand and fix the problem, especially if the filter condition is very big and hard to read as a whole.
It would be even better if we could return the list of all the problematic expressions, if possible, to save the user from iterating fixing -> trying -> fixing -> ...
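
A rough sketch of what listing every offending expression in one message could look like (the helper `buildErrorMessage` is hypothetical; the message format mirrors the one under review in this PR):

```scala
// Hypothetical helper: given the SQL text of every invalid expression found
// in an operator, build one error message listing them all, so the user can
// fix everything in a single pass instead of iterating fix -> try -> fix.
def buildErrorMessage(nodeName: String, invalidExprSqls: Seq[String]): String =
  s"""The query operator `$nodeName` contains one or more unsupported
     |expression types Aggregate, Window or Generate.
     |Invalid expressions: [${invalidExprSqls.mkString(", ")}]""".stripMargin

// e.g. a Filter hosting two illegal aggregates reports both at once.
val msg = buildErrorMessage("Filter", Seq("max(c1)", "sum(c2)"))
```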

@SparkQA commented Apr 1, 2019

Test build #104166 has finished for PR 24209 at commit c9d1f59.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Apr 2, 2019

Test build #104176 has finished for PR 24209 at commit a3f4b91.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mgaido91 (Contributor) left a comment

Only a few minor comments; otherwise LGTM, thanks.

|The query operator `${other.nodeName}` contains one or more unsupported
|expression types Aggregate, Window or Generate.
|Invalid expressions: [${invalidExprSqls.mkString(", ")}]
|
@mgaido91 (Contributor)

nit: can we remove this line? and also close the string on the line above, in order to avoid weird things like those ; in the middle of nowhere in the output

@dilipbiswal (Author)

@mgaido91 Thanks, will remove. By closing on the line above, you mean like:

|Invalid expressions: [${invalidExprSqls.mkString(", ")}]""".stripMargin

?

@mgaido91 (Contributor)

yes, I meant that

* Example : SELECT * FROM tab WHERE max(c1) > 0
* 2. Query rewrites inadvertently produce plans that are invalid.
*/
def specialExpressionInUnsupportedOperator(plan: LogicalPlan): Seq[Expression] = {
@mgaido91 (Contributor)

nit: specialExpressions?

@dilipbiswal (Author)

@mgaido91 Will change.

@SparkQA commented Apr 2, 2019

Test build #104217 has finished for PR 24209 at commit 15b0a98.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Apr 2, 2019

Test build #104219 has finished for PR 24209 at commit bbab4c1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dilipbiswal (Contributor, Author)

@cloud-fan Please let me know if this looks okay now.

@cloud-fan (Contributor)

thanks, merging to master!

@cloud-fan closed this in 3286bff on Apr 3, 2019
@dilipbiswal (Contributor, Author)

@cloud-fan Thank you very much !!

7 participants