[SPARK-18863][SQL] Output non-aggregate expressions without GROUP BY in a subquery does not yield an error #16572

nsyca · 2017-01-13T11:41:12Z

What changes were proposed in this pull request?

This PR reports a proper error message when a subquery expression contains an invalid plan. The fix is to call CheckAnalysis on the plan inside the subquery.

How was this patch tested?

Existing tests, plus two new test cases covering two forms of subquery: scalar subquery and IN/EXISTS subquery.

-- TC 01.01
-- The column t2b in the SELECT of the subquery is invalid
-- because it is neither an aggregate function nor a GROUP BY column.
select t1a, t2b
from   t1, t2
where  t1b = t2c
and    t2b = (select max(avg)
              from   (select   t2b, avg(t2b) avg
                      from     t2
                      where    t2a = t1.t1b
                     )
             )
;

-- TC 01.02
-- Invalid because the column t2b is not part of the output from table t2.
select *
from   t1
where  t1a in (select   min(t2a)
               from     t2
               group by t2c
               having   t2c in (select   max(t3c)
                                from     t3
                                group by t3b
                                having   t3b > t2b ))
;

…rrect results ## What changes were proposed in this pull request? This patch fixes the incorrect results in the rule ResolveSubquery in Catalyst's Analysis phase. ## How was this patch tested? ./dev/run-tests a new unit test on the problematic pattern.

nsyca · 2017-01-13T11:44:06Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala

            }
+            checkAnalysis(query)


The best way to view this block of code changes is with a diff that ignores whitespace changes (-b). The main part is the call to checkAnalysis for both PredicateSubquery and ScalarSubquery.
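To illustrate why the extra call matters, here is a minimal sketch on a toy plan model. None of these types are Spark's classes; `Plan`, `Expr`, `AnalysisError`, and the `recurseIntoSubqueries` flag are all hypothetical names invented for this example. The point is only that a checker which never descends into the plan carried by a subquery expression cannot see an invalid aggregate inside it.

```scala
// Toy model only: these types are NOT Spark's classes.
object SubqueryCheckSketch {
  sealed trait Expr
  case class Column(name: String) extends Expr
  case class Agg(fn: String, child: Column) extends Expr
  // An expression that carries its own plan, like a scalar or predicate subquery.
  case class Subquery(plan: Plan) extends Expr

  sealed trait Plan
  case class Scan(table: String) extends Plan
  case class Filter(condition: Expr, child: Plan) extends Plan
  case class Aggregate(grouping: Seq[String], output: Seq[Expr], child: Plan) extends Plan

  final class AnalysisError(msg: String) extends Exception(msg)

  // The flag exists only to contrast behavior with and without the nested call.
  def checkAnalysis(plan: Plan, recurseIntoSubqueries: Boolean = true): Unit = plan match {
    case Scan(_) => ()
    case Filter(cond, child) =>
      cond match {
        // The nested call: validate the plan held by the subquery expression.
        case Subquery(inner) if recurseIntoSubqueries =>
          checkAnalysis(inner, recurseIntoSubqueries)
        case _ => ()
      }
      checkAnalysis(child, recurseIntoSubqueries)
    case Aggregate(grouping, output, child) =>
      output.foreach {
        case Column(name) if !grouping.contains(name) =>
          throw new AnalysisError(s"'$name' is neither grouped nor aggregated")
        case _ => ()
      }
      checkAnalysis(child, recurseIntoSubqueries)
  }

  // Shape of TC 01.01: the inner aggregate outputs t2b without grouping on it.
  val invalidInner: Plan =
    Aggregate(Nil, Seq(Column("t2b"), Agg("avg", Column("t2b"))), Scan("t2"))
  val outer: Plan = Filter(Subquery(invalidInner), Scan("t1"))

  def fails(recurse: Boolean): Boolean =
    try { checkAnalysis(outer, recurse); false }
    catch { case _: AnalysisError => true }
}
```

With `recurse = true` the invalid inner aggregate is reported; with `recurse = false` the outer plan passes the check even though the subquery plan is invalid.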

nsyca · 2017-01-13T11:49:43Z

sql/core/src/test/scala/org/apache/spark/sql/SQLQueryTestSuite.scala

+      if (!parent.exists()) {
+        assert(parent.mkdirs(), "Could not create directory: " + parent)
+      }
+      stringToFile(resultFile, goldenOutput)


This addition is ported from the code from PR-16467 of SPARK-19017 reviewed by @hvanhovell.

This change is required after the introduction of test files in sub-directories by SPARK-18871.

This has been merged.
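A standalone sketch of the same guard, for readers who want to try it outside the test suite. `writeGolden` is a hypothetical helper standing in for the suite's `stringToFile`; it uses only `java.io` and `java.nio` calls.

```scala
import java.io.File
import java.nio.charset.StandardCharsets
import java.nio.file.Files

object GoldenFileSketch {
  // Hypothetical stand-in for the stringToFile helper used in the diff.
  def writeGolden(resultFile: File, goldenOutput: String): Unit = {
    val parent = resultFile.getParentFile
    // mkdirs() returns false when the directory already exists,
    // so only call it (and assert on it) when the directory is missing.
    if (parent != null && !parent.exists()) {
      assert(parent.mkdirs(), "Could not create directory: " + parent)
    }
    Files.write(resultFile.toPath, goldenOutput.getBytes(StandardCharsets.UTF_8))
  }
}
```

Without the guard, writing a golden file for a test input that lives in a new sub-directory would fail because the output directory does not exist yet.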

SparkQA · 2017-01-13T14:39:28Z

Test build #71314 has finished for PR 16572 at commit 24397cf.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

nsyca · 2017-01-16T15:49:36Z

cc @hvanhovell.

hvanhovell · 2017-01-24T22:46:56Z

Note that the diff is easier to read with GitHub's w=1 flag: https://github.com/apache/spark/pull/16572/files?w=1

hvanhovell · 2017-01-24T23:17:55Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala

@@ -117,66 +117,72 @@ trait CheckAnalysis extends PredicateHelper {
                failAnalysis(s"Window specification $s is not valid because $m")
              case None => w
            }
-          case s @ ScalarSubquery(query, conditions, _)
+
+          case e @ PredicateSubquery(query, _, _, _) =>


It might be better to add a catch-all SubqueryExpression case after the ScalarSubquery case, instead of adding one specifically aimed at a predicate subquery.
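A rough sketch of what such an ordering looks like, using toy types rather than Spark's expression classes (`SubqueryExpr`, `ScalarSub`, `PredicateSub`, and `ExistsSub` are hypothetical names). The specific case is matched first, and every other subquery kind falls through to the generic case.

```scala
// Toy model only: these are not Spark's expression classes.
object CatchAllSketch {
  sealed trait SubqueryExpr { def plan: String }
  case class ScalarSub(plan: String, conditions: Seq[String]) extends SubqueryExpr
  case class PredicateSub(plan: String) extends SubqueryExpr
  case class ExistsSub(plan: String) extends SubqueryExpr

  // Returns the list of checks that would run, in order.
  def check(e: SubqueryExpr): Seq[String] = e match {
    // Specific case first: scalar subqueries need extra validation.
    case ScalarSub(plan, _) => Seq("scalar-specific checks", s"checkAnalysis($plan)")
    // Catch-all: any other subquery kind still gets its plan checked.
    case other => Seq(s"checkAnalysis(${other.plan})")
  }
}
```

The benefit of the catch-all is that a subquery kind added later is still checked without touching this match.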

hvanhovell · 2017-01-24T23:22:09Z

sql/core/src/test/scala/org/apache/spark/sql/SQLQueryTestSuite.scala

+        // Also implement a crude way of masking expression IDs in the error message
+        // with a generic pattern "###".
+        (StructType(Seq.empty),
+          Seq(a.getClass.getName, a.getSimpleMessage.replaceAll("#[0-9]+", "###")))


why not use the same regex/replacement as on line 223?
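For reference, a small sketch of what the masking in the diff above does to an error message. The sample message is made up for illustration, and `maskExprIds` is a hypothetical wrapper around the same `replaceAll` call.

```scala
object MaskExprIdSketch {
  // Replace every "#<digits>" expression ID with the generic pattern "###".
  def maskExprIds(message: String): String =
    message.replaceAll("#[0-9]+", "###")
}
```

For example, `maskExprIds("cannot resolve 't2b#12' given [t2a#10, t2c#14]")` yields `"cannot resolve 't2b###' given [t2a###, t2c###]"`, so the golden file stays stable even when expression IDs differ between runs.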

hvanhovell

I have two small comments. Looks good overall.

nsyca · 2017-01-24T23:32:47Z

Note that plans inside subqueries are not treated as part of the tree traversal, and this is a recurring source of problems. Besides this issue, another was reported in SPARK-19093. The nested call to the Optimizer that Spark makes via the rule OptimizeSubqueries, just to work on those plans, is another instance.

One possible solution is to override the tree traversal (transformUp, transformDown) of class LogicalPlan to include those subqueries' plans.
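To make the idea concrete, here is a minimal sketch of a bottom-up traversal that also visits plans held by subquery expressions. It is a toy model, not Spark's `LogicalPlan` or `TreeNode` API; `transformUpWithSubqueries` and all the types are hypothetical.

```scala
// Toy model only: not Spark's LogicalPlan or TreeNode API.
object SubqueryTraversalSketch {
  sealed trait Expr
  case class Lit(value: Int) extends Expr
  case class Subquery(plan: Plan) extends Expr

  sealed trait Plan
  case class Scan(table: String) extends Plan
  case class Filter(condition: Expr, child: Plan) extends Plan

  // Bottom-up rewrite that also descends into plans held by subquery expressions.
  def transformUpWithSubqueries(plan: Plan)(rule: PartialFunction[Plan, Plan]): Plan = {
    val rewritten: Plan = plan match {
      case s: Scan => s
      case Filter(cond, child) =>
        val newCond: Expr = cond match {
          case Subquery(inner) => Subquery(transformUpWithSubqueries(inner)(rule))
          case other => other
        }
        Filter(newCond, transformUpWithSubqueries(child)(rule))
    }
    // Apply the rule to this node after its children (and subqueries) are rewritten.
    rule.applyOrElse(rewritten, (p: Plan) => p)
  }
}
```

A rule such as `{ case Scan(t) => Scan(t.toUpperCase) }` would then rewrite the scan inside the subquery as well as the outer one. A rule that carries state across nodes would see the subquery's nodes interleaved with the outer plan's nodes, which it may not expect.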

nsyca · 2017-01-25T02:01:44Z

Thank you for your time reviewing this PR.

SparkQA · 2017-01-25T04:29:19Z

Test build #71960 has finished for PR 16572 at commit 010d27a.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

hvanhovell · 2017-01-25T16:03:22Z

@nsyca it would be nice to make subquery expressions part of tree traversal. It does seem risky to me: a lot of functions that use traversal maintain some state and do not account for the existence of subqueries, so I am a bit wary of trying this out (as it might break stuff in very subtle ways).

…in a subquery does not yield an error

## What changes were proposed in this pull request?

This PR will report proper error messages when a subquery expression contain an invalid plan. This problem is fixed by calling CheckAnalysis for the plan inside a subquery.

## How was this patch tested?

Existing tests and two new test cases on 2 forms of subquery, namely, scalar subquery and in/exists subquery.

````
-- TC 01.01
-- The column t2b in the SELECT of the subquery is invalid
-- because it is neither an aggregate function nor a GROUP BY column.
select t1a, t2b
from   t1, t2
where  t1b = t2c
and    t2b = (select max(avg)
              from   (select   t2b, avg(t2b) avg
                      from     t2
                      where    t2a = t1.t1b
                     )
             )
;

-- TC 01.02
-- Invalid due to the column t2b not part of the output from table t2.
select *
from   t1
where  t1a in (select   min(t2a)
               from     t2
               group by t2c
               having   t2c in (select   max(t3c)
                                from     t3
                                group by t3b
                                having   t3b > t2b ))
;
````

Author: Nattavut Sutyanyong <[email protected]>

Closes #16572 from nsyca/18863.

(cherry picked from commit f1ddca5)
Signed-off-by: Herman van Hovell <[email protected]>

hvanhovell · 2017-01-25T16:09:13Z

LGTM - merging to master/2.1/2.0. Thanks!

nsyca · 2017-01-25T16:12:47Z

@hvanhovell, I agree this approach does look risky. There are a lot of dependencies here. I pitched the idea to get your initial thoughts. Let me do some background research, and I will share more once I have a better idea of this approach.

Thanks again for your time.


nsyca added 22 commits July 29, 2016 17:43

New positive test cases

edca333

Fix unit test case failure

64184fd

blocking TABLESAMPLE

29f82b0

Fixing code styling

ac43ab4

Correcting Scala test style

631d396

One (last) attempt to correct the Scala style tests

7eb9b2d

Merge remote-tracking branch 'upstream/master'

1387cf5

Merge remote-tracking branch 'upstream/master'

3faa2d5

Merge remote-tracking branch 'upstream/master'

a308634

first fix (incomplete)

2f463de

first attempt

6e2f686

Merge remote-tracking branch 'upstream/master'

f1524b9

Merge branch 'master' into 18863

6dfa8e5

New test cases

e9bdde6

Masking exprIDs

deec874

Merge remote-tracking branch 'upstream/master'

5c36dce

Merge branch 'master' into 18863

98cbd60

reverse back accidental change

bcae336

port from SPARK-19017

51f7fb9

remove unrelated comment

24397cf

hvanhovell reviewed Jan 24, 2017

hvanhovell requested changes Jan 24, 2017

nsyca added 4 commits January 24, 2017 20:38

address comment #1

ced19c7

Merge remote-tracking branch 'upstream/master'

862b2b8

Merge branch 'master' into 18863

203ad7d

remove blank line

010d27a

asfgit closed this in f1ddca5 Jan 25, 2017

nsyca deleted the 18863 branch March 14, 2017 21:08
