-
Notifications
You must be signed in to change notification settings - Fork 28.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-18863][SQL] Output non-aggregate expressions without GROUP BY in a subquery does not yield an error #16572
Conversation
…rrect results ## What changes were proposed in this pull request? This patch fixes the incorrect results in the rule ResolveSubquery in Catalyst's Analysis phase. ## How was this patch tested? ./dev/run-tests a new unit test on the problematic pattern.
…rrect results ## What changes were proposed in this pull request? This patch fixes the incorrect results in the rule ResolveSubquery in Catalyst's Analysis phase. ## How was this patch tested? ./dev/run-tests a new unit test on the problematic pattern.
} | ||
checkAnalysis(query) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The best way to view this block of code changes is using a diff
with -b
. The main part is to call checkAnalysis
for both PredicateSubquery and ScalaSubquery.
if (!parent.exists()) { | ||
assert(parent.mkdirs(), "Could not create directory: " + parent) | ||
} | ||
stringToFile(resultFile, goldenOutput) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This addition is ported from the code from PR-16467 of SPARK-19017 reviewed by @hvanhovell.
This change is required after the introduction of test files in sub-directories by SPARK-18871.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This has been merged.
Test build #71314 has finished for PR 16572 at commit
|
cc @hvanhovell. |
Note that the diff is better to read using githubs |
@@ -117,66 +117,72 @@ trait CheckAnalysis extends PredicateHelper { | |||
failAnalysis(s"Window specification $s is not valid because $m") | |||
case None => w | |||
} | |||
case s @ ScalarSubquery(query, conditions, _) | |||
|
|||
case e @ PredicateSubquery(query, _, _, _) => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might be better to add a catch all SubqueryExpression
case after the ScalarSubquery case, instead of adding one specifically aimed at a predicate subquery.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
// Also implement a crude way of masking expression IDs in the error message | ||
// with a generic pattern "###". | ||
(StructType(Seq.empty), | ||
Seq(a.getClass.getName, a.getSimpleMessage.replaceAll("#[0-9]+", "###"))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why not use the same regex/replacement as on line 223?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have two small comments. Looks good overall.
Note the way the plans inside subqueries are not treated as part of the tree traversal is a common problem. Besides this problem, another was reported in SPARK-19093. Also the way Spark needs to implement a nested call to the Optimizer via the rule One possible solution is to overwrite the tree traversal ( |
Thank you for your time reviewing this PR. |
Test build #71960 has finished for PR 16572 at commit
|
@nsyca it would be nice to make subquery expression part of tree traversal. It does seem risky to me, a lot functions using traversal maintain some state and do not account for the existence of subqueries, so I am a bit weary of trying this out (as it might break stuff in very subtle ways). |
…in a subquery does not yield an error ## What changes were proposed in this pull request? This PR will report proper error messages when a subquery expression contain an invalid plan. This problem is fixed by calling CheckAnalysis for the plan inside a subquery. ## How was this patch tested? Existing tests and two new test cases on 2 forms of subquery, namely, scalar subquery and in/exists subquery. ```` -- TC 01.01 -- The column t2b in the SELECT of the subquery is invalid -- because it is neither an aggregate function nor a GROUP BY column. select t1a, t2b from t1, t2 where t1b = t2c and t2b = (select max(avg) from (select t2b, avg(t2b) avg from t2 where t2a = t1.t1b ) ) ; -- TC 01.02 -- Invalid due to the column t2b not part of the output from table t2. select * from t1 where t1a in (select min(t2a) from t2 group by t2c having t2c in (select max(t3c) from t3 group by t3b having t3b > t2b )) ; ```` Author: Nattavut Sutyanyong <[email protected]> Closes #16572 from nsyca/18863. (cherry picked from commit f1ddca5) Signed-off-by: Herman van Hovell <[email protected]>
LGTM - merging to master/2.1/2.0. Thanks! |
@hvanhovell, I agree it does look risky with this approach. There are a lot of dependencies here. I am pitching in the idea to get your initial thought. Let me do some background and I will share once I have a better idea on this approach. Thanks again for your time. |
…in a subquery does not yield an error ## What changes were proposed in this pull request? This PR will report proper error messages when a subquery expression contain an invalid plan. This problem is fixed by calling CheckAnalysis for the plan inside a subquery. ## How was this patch tested? Existing tests and two new test cases on 2 forms of subquery, namely, scalar subquery and in/exists subquery. ```` -- TC 01.01 -- The column t2b in the SELECT of the subquery is invalid -- because it is neither an aggregate function nor a GROUP BY column. select t1a, t2b from t1, t2 where t1b = t2c and t2b = (select max(avg) from (select t2b, avg(t2b) avg from t2 where t2a = t1.t1b ) ) ; -- TC 01.02 -- Invalid due to the column t2b not part of the output from table t2. select * from t1 where t1a in (select min(t2a) from t2 group by t2c having t2c in (select max(t3c) from t3 group by t3b having t3b > t2b )) ; ```` Author: Nattavut Sutyanyong <[email protected]> Closes apache#16572 from nsyca/18863.
…in a subquery does not yield an error ## What changes were proposed in this pull request? This PR will report proper error messages when a subquery expression contain an invalid plan. This problem is fixed by calling CheckAnalysis for the plan inside a subquery. ## How was this patch tested? Existing tests and two new test cases on 2 forms of subquery, namely, scalar subquery and in/exists subquery. ```` -- TC 01.01 -- The column t2b in the SELECT of the subquery is invalid -- because it is neither an aggregate function nor a GROUP BY column. select t1a, t2b from t1, t2 where t1b = t2c and t2b = (select max(avg) from (select t2b, avg(t2b) avg from t2 where t2a = t1.t1b ) ) ; -- TC 01.02 -- Invalid due to the column t2b not part of the output from table t2. select * from t1 where t1a in (select min(t2a) from t2 group by t2c having t2c in (select max(t3c) from t3 group by t3b having t3b > t2b )) ; ```` Author: Nattavut Sutyanyong <[email protected]> Closes apache#16572 from nsyca/18863.
What changes were proposed in this pull request?
This PR will report proper error messages when a subquery expression contain an invalid plan. This problem is fixed by calling CheckAnalysis for the plan inside a subquery.
How was this patch tested?
Existing tests and two new test cases on 2 forms of subquery, namely, scalar subquery and in/exists subquery.