Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-18863][SQL] Output non-aggregate expressions without GROUP BY in a subquery does not yield an error #16572

Closed
wants to merge 26 commits into from

Conversation

nsyca
Copy link
Contributor

@nsyca nsyca commented Jan 13, 2017

What changes were proposed in this pull request?

This PR will report proper error messages when a subquery expression contain an invalid plan. This problem is fixed by calling CheckAnalysis for the plan inside a subquery.

How was this patch tested?

Existing tests and two new test cases on 2 forms of subquery, namely, scalar subquery and in/exists subquery.

-- TC 01.01
-- The column t2b in the SELECT of the subquery is invalid
-- because it is neither an aggregate function nor a GROUP BY column.
select t1a, t2b
from   t1, t2
where  t1b = t2c
and    t2b = (select max(avg)
              from   (select   t2b, avg(t2b) avg
                      from     t2
                      where    t2a = t1.t1b
                     )
             )
;

-- TC 01.02
-- Invalid due to the column t2b not part of the output from table t2.
select *
from   t1
where  t1a in (select   min(t2a)
               from     t2
               group by t2c
               having   t2c in (select   max(t3c)
                                from     t3
                                group by t3b
                                having   t3b > t2b ))
;

nsyca added 22 commits July 29, 2016 17:43
…rrect results

## What changes were proposed in this pull request?

This patch fixes the incorrect results in the rule ResolveSubquery in Catalyst's Analysis phase.

## How was this patch tested?
./dev/run-tests
a new unit test on the problematic pattern.
…rrect results

## What changes were proposed in this pull request?

This patch fixes the incorrect results in the rule ResolveSubquery in Catalyst's Analysis phase.

## How was this patch tested?
./dev/run-tests
a new unit test on the problematic pattern.
}
checkAnalysis(query)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The best way to view this block of code changes is using a diff with -b. The main part is to call checkAnalysis for both PredicateSubquery and ScalaSubquery.

if (!parent.exists()) {
assert(parent.mkdirs(), "Could not create directory: " + parent)
}
stringToFile(resultFile, goldenOutput)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This addition is ported from the code from PR-16467 of SPARK-19017 reviewed by @hvanhovell.

This change is required after the introduction of test files in sub-directories by SPARK-18871.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This has been merged.

@SparkQA
Copy link

SparkQA commented Jan 13, 2017

Test build #71314 has finished for PR 16572 at commit 24397cf.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@nsyca
Copy link
Contributor Author

nsyca commented Jan 16, 2017

cc @hvanhovell.

@hvanhovell
Copy link
Contributor

Note that the diff is better to read using githubs w=1 flag: https://github.com/apache/spark/pull/16572/files?w=1

@@ -117,66 +117,72 @@ trait CheckAnalysis extends PredicateHelper {
failAnalysis(s"Window specification $s is not valid because $m")
case None => w
}
case s @ ScalarSubquery(query, conditions, _)

case e @ PredicateSubquery(query, _, _, _) =>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be better to add a catch all SubqueryExpression case after the ScalarSubquery case, instead of adding one specifically aimed at a predicate subquery.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

// Also implement a crude way of masking expression IDs in the error message
// with a generic pattern "###".
(StructType(Seq.empty),
Seq(a.getClass.getName, a.getSimpleMessage.replaceAll("#[0-9]+", "###")))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not use the same regex/replacement as on line 223?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

Copy link
Contributor

@hvanhovell hvanhovell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have two small comments. Looks good overall.

@nsyca
Copy link
Contributor Author

nsyca commented Jan 24, 2017

Note the way the plans inside subqueries are not treated as part of the tree traversal is a common problem. Besides this problem, another was reported in SPARK-19093. Also the way Spark needs to implement a nested call to the Optimizer via the rule OptimizedSubqueries to work on those plans is another instance.

One possible solution is to overwrite the tree traversal (transformUp, transformDown) of class LogicalPlan to include those subqueries' plans.

@nsyca
Copy link
Contributor Author

nsyca commented Jan 25, 2017

Thank you for your time reviewing this PR.

@SparkQA
Copy link

SparkQA commented Jan 25, 2017

Test build #71960 has finished for PR 16572 at commit 010d27a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@hvanhovell
Copy link
Contributor

@nsyca it would be nice to make subquery expression part of tree traversal. It does seem risky to me, a lot functions using traversal maintain some state and do not account for the existence of subqueries, so I am a bit weary of trying this out (as it might break stuff in very subtle ways).

asfgit pushed a commit that referenced this pull request Jan 25, 2017
…in a subquery does not yield an error

## What changes were proposed in this pull request?
This PR will report proper error messages when a subquery expression contain an invalid plan. This problem is fixed by calling CheckAnalysis for the plan inside a subquery.

## How was this patch tested?
Existing tests and two new test cases on 2 forms of subquery, namely, scalar subquery and in/exists subquery.

````
-- TC 01.01
-- The column t2b in the SELECT of the subquery is invalid
-- because it is neither an aggregate function nor a GROUP BY column.
select t1a, t2b
from   t1, t2
where  t1b = t2c
and    t2b = (select max(avg)
              from   (select   t2b, avg(t2b) avg
                      from     t2
                      where    t2a = t1.t1b
                     )
             )
;

-- TC 01.02
-- Invalid due to the column t2b not part of the output from table t2.
select *
from   t1
where  t1a in (select   min(t2a)
               from     t2
               group by t2c
               having   t2c in (select   max(t3c)
                                from     t3
                                group by t3b
                                having   t3b > t2b ))
;
````

Author: Nattavut Sutyanyong <[email protected]>

Closes #16572 from nsyca/18863.

(cherry picked from commit f1ddca5)
Signed-off-by: Herman van Hovell <[email protected]>
@asfgit asfgit closed this in f1ddca5 Jan 25, 2017
@hvanhovell
Copy link
Contributor

LGTM - merging to master/2.1/2.0. Thanks!

@nsyca
Copy link
Contributor Author

nsyca commented Jan 25, 2017

@hvanhovell, I agree it does look risky with this approach. There are a lot of dependencies here. I am pitching in the idea to get your initial thought. Let me do some background and I will share once I have a better idea on this approach.

Thanks again for your time.

uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017
…in a subquery does not yield an error

## What changes were proposed in this pull request?
This PR will report proper error messages when a subquery expression contain an invalid plan. This problem is fixed by calling CheckAnalysis for the plan inside a subquery.

## How was this patch tested?
Existing tests and two new test cases on 2 forms of subquery, namely, scalar subquery and in/exists subquery.

````
-- TC 01.01
-- The column t2b in the SELECT of the subquery is invalid
-- because it is neither an aggregate function nor a GROUP BY column.
select t1a, t2b
from   t1, t2
where  t1b = t2c
and    t2b = (select max(avg)
              from   (select   t2b, avg(t2b) avg
                      from     t2
                      where    t2a = t1.t1b
                     )
             )
;

-- TC 01.02
-- Invalid due to the column t2b not part of the output from table t2.
select *
from   t1
where  t1a in (select   min(t2a)
               from     t2
               group by t2c
               having   t2c in (select   max(t3c)
                                from     t3
                                group by t3b
                                having   t3b > t2b ))
;
````

Author: Nattavut Sutyanyong <[email protected]>

Closes apache#16572 from nsyca/18863.
cmonkey pushed a commit to cmonkey/spark that referenced this pull request Feb 15, 2017
…in a subquery does not yield an error

## What changes were proposed in this pull request?
This PR will report proper error messages when a subquery expression contain an invalid plan. This problem is fixed by calling CheckAnalysis for the plan inside a subquery.

## How was this patch tested?
Existing tests and two new test cases on 2 forms of subquery, namely, scalar subquery and in/exists subquery.

````
-- TC 01.01
-- The column t2b in the SELECT of the subquery is invalid
-- because it is neither an aggregate function nor a GROUP BY column.
select t1a, t2b
from   t1, t2
where  t1b = t2c
and    t2b = (select max(avg)
              from   (select   t2b, avg(t2b) avg
                      from     t2
                      where    t2a = t1.t1b
                     )
             )
;

-- TC 01.02
-- Invalid due to the column t2b not part of the output from table t2.
select *
from   t1
where  t1a in (select   min(t2a)
               from     t2
               group by t2c
               having   t2c in (select   max(t3c)
                                from     t3
                                group by t3b
                                having   t3b > t2b ))
;
````

Author: Nattavut Sutyanyong <[email protected]>

Closes apache#16572 from nsyca/18863.
@nsyca nsyca deleted the 18863 branch March 14, 2017 21:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants