-
Notifications
You must be signed in to change notification settings - Fork 28.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-27255][SQL] Report error when illegal expressions are hosted by a plan operator. #24209
Conversation
Test build #103924 has finished for PR 24209 at commit
|
To just make sure this is a right approach, one question;
|
I vaguely remember we did a similar check for window functions, how was that done? cc @rednaxelafx |
@cloud-fan I checked the code. The validation for presence of window expression(s) in FILTER and HAVING is done in analyzer in rule However, for our current case, i think it makes sense to do it in checkAnalysis since HAVING can have aggregate and is planned as a Filter. I think somewhere in Analyzer we turn this into a Filter over Aggregate with Filter referencing to the output of Aggregate as a named expression. So after we do all that work, when we come to checkAnalysis and we still have aggregate expressions in the Filter node, we raise the error. Please let me know .. |
@cloud-fan For you reference this pr has added the check. |
can you check #23658 ? |
@cloud-fan Thanks for pointing me to this. I had seen it before but had completely forgotten about it :-). So Wenchen, in my understanding, this PR ensures that the integrity of the plan is not broken in the optimizer between application of rules. Is there something in specific you wanted me to focus on ? One thing i noticed is that here we reject plans which have aggregate expressions and plan node is not Aggregate. In this PR, we check for only Filter .. Did you want to target the fix to other operators as well just like the integrity check pr ? I feel its better to do check specific to problematic operators so we can issue targeted error message. What do you think ? |
I'm wondering why we still hit this problem after #23658 |
@cloud-fan Oh.. because.. the integrity check is done only in testing mode, right ? So its basically an internal thing ?
|
cc @rednaxelafx @gatorsmile shall we make that check official? |
I think we should verify it in the analyzer stage. The plan integrity verification is just for ensuring the optimizer rules do not change the plan and then break the integrity. |
@@ -172,16 +176,15 @@ trait CheckAnalysis extends PredicateHelper { | |||
failAnalysis("Null-aware predicate sub-queries cannot be used in nested " + | |||
s"conditions: $condition") | |||
|
|||
case Filter(condition, _) if condition.find(isAggregateExpression(_)).isDefined => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
call plan integrity checking here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gatorsmile Thanks for reviewing. I will try it and get back.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gatorsmile @cloud-fan Just made change to make use of the integrity check method. I had to refactor it out as its a private method in optimizer. Is this what you had in mind ? Please let me know.
Test build #104030 has finished for PR 24209 at commit
|
Test build #104031 has finished for PR 24209 at commit
|
@@ -172,13 +172,17 @@ trait CheckAnalysis extends PredicateHelper { | |||
failAnalysis("Null-aware predicate sub-queries cannot be used in nested " + | |||
s"conditions: $condition") | |||
|
|||
case f @ Filter(condition, _) if PlanHelper.specialExpressionInUnsupportedOperator(f) => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why only check filter here? I think we should do it for any operator.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cloud-fan I thought, having a targeted error message for the user would be better. If we do it generically, we would say "Filter" instead of "WHERE clause", right ? Should i handle for Filter here and have a general catcher as the last case statement to issue a generic error message like "The operator xxxx contains an unsupported expression type"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"The operator xxxx contains an unsupported expression type" sounds good
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Then we can move it closer to the general case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cloud-fan Sure.
SELECT * FROM (SELECT COUNT(*) AS cnt FROM test_agg) WHERE cnt > 1; | ||
|
||
-- Error when aggregate expressions are in where clause directly | ||
SELECT count(*) FROM test_agg WHERE count(*) > 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we test HAVING as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cloud-fan ok.. i will add having.
@@ -70,9 +70,10 @@ Resolved attribute(s) t2b#x missing from min(t2a)#x,t2c#x in operator !Filter t2 | |||
SELECT t1a | |||
FROM t1 | |||
GROUP BY 1 | |||
HAVING EXISTS (SELECT 1 | |||
HAVING EXISTS (SELECT t2a |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why this change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cloud-fan because, the child subquery plan can not be resolved due to missing attributes.
+- Project [1#81]
: +- !Filter (t2a#9 < min((outer(t1a#6) + t2a#9))#86)
: +- Aggregate [1], [1 AS 1#81, min((outer(t1a#6) + t2a#9)) AS min((outer(t1a#6) + t2a#9))#86]
: +- SubqueryAlias `t2`
: +- Project [t2a#9, t2b#10, t2c#11]
: +- SubqueryAlias `t2`
: +- LocalRelation [t2a#9, t2b#10, t2c#11]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But we do expect analysis exception for this query, don't we?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cloud-fan Yeah.. this particular test is in subquery suite and checking for a specific analysis expression that disallows a combination of local and outer references inside an aggregate expression. However, this test had the aggregate expression in the Filter clause which we are disallowing now. So i changed it to use "having" instead of "filter". But when i did that i hit a different analysis exception due to missing attributes (this particular test in question is trying to test a different analysis exception). So i fixed the projection here to include the missing attribute so we get a desired subquery related exception.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
makes sense
Test build #104065 has finished for PR 24209 at commit
|
@@ -172,13 +172,17 @@ trait CheckAnalysis extends PredicateHelper { | |||
failAnalysis("Null-aware predicate sub-queries cannot be used in nested " + | |||
s"conditions: $condition") | |||
|
|||
case f @ Filter(condition, _) if PlanHelper.specialExpressionInUnsupportedOperator(f) => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we still need this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cloud-fan Actually we don't need it. However, since its a user facing message, i feel, its better if we issue targeted error message. I was thinking, we we should add more individual cases as we learn about them here while still having a generic catcher.
If you think otherwise, then i will remove it. Let me know, please.
* - A Generator but the plan is not Generate | ||
* Returns true when this operator hosts illegal expressions. This can happen when | ||
* 1. The input query from users contain invalid expressions. | ||
* Example : SELECT * FROM tab WHERE max(c1) > 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BTW how about SELECT max(c1) FROM tab WHERE max(c1) > 0
? Does it work?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cloud-fan No, this case does not work. But this should work and i have a test case for this.
SELECT max FROM (SELECT max(v1) AS max FROM tab) WHERE max > 1
Test build #104107 has finished for PR 24209 at commit
|
""".stripMargin) | ||
|
||
case other if PlanHelper.specialExpressionInUnsupportedOperator(other) => | ||
failAnalysis(s"The query operator `${other.nodeName}` contains " + |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: remove an unnecessary single space.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@maropu Will remove.
@@ -52,29 +52,7 @@ abstract class Optimizer(sessionCatalog: SessionCatalog) | |||
* Returns true when all operators are integral. | |||
*/ | |||
private def checkSpecialExpressionIntegrity(plan: LogicalPlan): Boolean = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems this private function is called by isPlanIntegral
only, so we need to define this separately?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@maropu Ok... Let me remove it.
Test build #104156 has finished for PR 24209 at commit
|
@cloud-fan does this look ok now ? |
import org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression | ||
|
||
/** | ||
* [[PlanHelper]] Contains utility methods that can be used by Analyzer and Optimizer. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
super nit: Contains -> contains.
Don't need to wait for another round test for this.
/** | ||
* [[PlanHelper]] Contains utility methods that can be used by Analyzer and Optimizer. | ||
* It can also be container of methods that are common across multiple rules in Analyzer | ||
* and optimizer. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
super nit: optimizer -> Optimizer. As consistent to above line.
* Example : SELECT * FROM tab WHERE max(c1) > 0 | ||
* 2. Query rewrites inadvertently produce plans that are invalid. | ||
*/ | ||
def specialExpressionInUnsupportedOperator(plan: LogicalPlan): Boolean = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what if we return the unsupported operator here and we report the specific one which is problematic to the user, rather than printing the whole filter condition? I think this would help the user to understand which specific thing caused the issue and it would be more user friendly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mgaido91 The caller already knows about the problematic operator, no ? About the problematic expression, i suppose we could return the first occurence of the problematic expression and report it in the error text.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, sorry, I meant expression, not operator. Sorry for the mistake.
i suppose we could return the first occurence of the problematic expression and report it in the error text.
Yes, this is what I meant. I think it would be easier for the user to understand and fix the problem, especially if the filter condition is very big and hard to be read as a whole.
It would be even better if we could return the list of all the problematic expressions if possible though, in order to avoid the user to iterate fixing->trying->fixing->...
Test build #104166 has finished for PR 24209 at commit
|
Test build #104176 has finished for PR 24209 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
only few minor comments, otherwise LGTM, thanks.
|The query operator `${other.nodeName}` contains one or more unsupported | ||
|expression types Aggregate, Window or Generate. | ||
|Invalid expressions: [${invalidExprSqls.mkString(", ")}] | ||
| |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: can we remove this line? and also close the string on the line above, in order to avoid weird things like those ;
in the middle of nowhere in the output
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mgaido91 thanks.. will remove. By closing on the line above you mean like ..
|Invalid expressions: [${invalidExprSqls.mkString(", ")}]""".stripMargin
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, I meant that
* Example : SELECT * FROM tab WHERE max(c1) > 0 | ||
* 2. Query rewrites inadvertently produce plans that are invalid. | ||
*/ | ||
def specialExpressionInUnsupportedOperator(plan: LogicalPlan): Seq[Expression] = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: specialExpressions
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mgaido91 Will change.
Test build #104217 has finished for PR 24209 at commit
|
Test build #104219 has finished for PR 24209 at commit
|
@cloud-fan Please let me know if this looks okay now ? |
thanks, merging to master! |
@cloud-fan Thank you very much !! |
What changes were proposed in this pull request?
In the PR, we raise an AnalysisError when we detect the presense of aggregate expressions in where clause. Here is the problem description from the JIRA.
Aggregate functions should not be allowed in WHERE clause. But Spark SQL throws an exception when generating codes. It is supposed to throw an exception during parsing or analyzing.
Here is an example:
Resulting exception:
Checked the behaviour of other database and all of them return an exception:
Postgress
DB2
Oracle
MySql
Update
This PR has been enhanced to report error when expressions such as Aggregate, Window, Generate are hosted by operators where they are invalid.
How was this patch tested?
Added tests in AnalysisErrorSuite and group-by.sql