-
Notifications
You must be signed in to change notification settings - Fork 28.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-20094][SQL] Preventing push down of IN subquery to Join operator #17428
Conversation
Test build #75210 has finished for PR 17428 at commit
|
Test build #75212 has finished for PR 17428 at commit
|
also cc @cloud-fan |
@@ -90,6 +90,7 @@ trait PredicateHelper { | |||
* Returns true iff `expr` could be evaluated as a condition within join. | |||
*/ | |||
protected def canEvaluateWithinJoin(expr: Expression): Boolean = expr match { | |||
case l: ListQuery => false |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doesn't the following case e: SubqueryExpression
prevent it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add some comments to explain why ListQuery
should always return false even the children is empty.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@viirya Now pulling up IN predicates is deferred to RewritePredicateSubquery
. As a result, ListQuery
's children can be empty before that rule and falls into case e: SubqueryExpression
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK. The comment looks good.
Test build #75257 has finished for PR 17428 at commit
|
@@ -289,6 +289,9 @@ abstract class QueryPlan[PlanType <: QueryPlan[PlanType]] extends TreeNode[PlanT | |||
case m: Map[_, _] => m | |||
case d: DataType => d // Avoid unpacking Structs | |||
case seq: Traversable[_] => seq.map(recursiveTransform) | |||
case ExistenceJoin(exists: Attribute) => | |||
// `exists` must be an Attribute in ExistenceJoin | |||
ExistenceJoin(transformExpression(exists).asInstanceOf[Attribute]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a particular reason we need to cast the return type of transformExpression(<instance_of_Attribute>)
to type Attribute
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because ExistenceJoin
only accepts Attribute
as its parameter type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you give an example of transformExpression
of an instance of Attribute
not returning a value of type Attribute
or its subtypes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this change necessary? This smells for two reasons:
QueryPlan
shouldn't have know about such a detail. This means something is wrong with the abstraction.- Why do we want to change the attribute in an ExistenceJoin? It is produced by the join, and we really should not replace it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes I feel the same. Fortunately, after modification as suggested by @nsyca , this change can also be removed. Because I move the test case to FilterPushdownSuite
, we don't have to touch the RewritePredicateSubquery
rule.
case l: ListQuery => | ||
// Now pulling up IN predicates is deferred to `RewritePredicateSubquery`. As a result, | ||
// `ListQuery`'s children can be empty before that rule and falls into the next case. | ||
false |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggested text in the comment:
[NOT] IN () expression cannot be evaluated as part of a Join operator. Currently the only way to evaluate a [NOT] IN subquery is to convert it to a LeftSemi or LeftAnti by RewritePredicateSubquery
rule.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, thanks!
.join(w, ExistenceJoin(exists), Some("z.c".attr === "w.d".attr)) | ||
.where("x.a".attr > 1 || exists) | ||
.select("x.a".attr, "x.b".attr, "x.c".attr, "z.a".attr, "z.b".attr, "z.c".attr) | ||
.join(y, Inner, Some("y.d".attr === "z.a".attr)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a side question. It's not a review comment.
Is there any mechanism in this test suite that ensures that the plan Join( Join(x, z), y) is picked by the join reorder algorithm? Isn't a Join( Join(y, z), x) is also a viable plan if the join predicate (y.d = z.a) is estimated to be more selective than the predicate sets between x and z?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test suite is for JoinReorder
rule. This rule is not cost based and it's not related to estimation. It only searches for joinable tables and put cartesian product to the end of the plan.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My question was why Optimizer does not pick a plan of Join( Join(y, z), x).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess your test case depends on the fact that currently we don't have an optimization to push down table y below the join of (x, z).
Essentially, this is not a problem specifically to ReorderJoin
as implied from the title of the PR. The problem is at the definition canEvaluateWithinJoin
. I can demonstrate the same problem without the table y in this test case.
A slight modification of your test case as shown below also reveals the problem.
val queryPlan = x.join(z)
.where(("x.b".attr === "z.b".attr) &&
("x.a".attr > 1 || "z.c".attr.in(ListQuery(w.select("w.d".attr)))))
val expectedPlan = x.join(z, Inner, Some("x.b".attr === "z.b".attr))
.join(w, ExistenceJoin(exists), Some("z.c".attr === "w.d".attr))
.where("x.a".attr > 1 || exists)
.select("x.a".attr, "x.b".attr, "x.c".attr, "z.a".attr, "z.b".attr, "z.c".attr)
which has a call to canEvaluateWithinJoin
from the rule PushPredicateThroughJoin
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
May I suggest the following:
- Change the title of the PR to "Preventing push down of IN subquery to Join operator"
- Update the test case to its bare minimum by removing the join to table y
- Change the title of the test case to reflect the new test
On the last point, removing the "reorder inner joins - " should do.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea, thanks for the suggestion. Also I will move the test case to FilterPushdownSuite
since now it will not go into ReorderJoin
rule.
Test build #75265 has finished for PR 17428 at commit
|
c32599d
to
bf7d202
Compare
LGTM - pending jenkins. |
Test build #75308 has finished for PR 17428 at commit
|
Merging to master. Thanks! |
Test build #75309 has finished for PR 17428 at commit
|
What changes were proposed in this pull request?
TPCDS q45 fails becuase:
ReorderJoin
collects all predicates and try to put them into join condition when creating ordered join. If a predicate with an IN subquery (ListQuery
) is in a join condition instead of a filter condition,RewritePredicateSubquery.rewriteExistentialExpr
would fail to convert the subquery to anExistenceJoin
, and thus result in error.We should prevent push down of IN subquery to Join operator.
How was this patch tested?
Add a new test case in
FilterPushdownSuite
.