-
Notifications
You must be signed in to change notification settings - Fork 28.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-23405] Generate additional constraints for Join's children #20670
Conversation
Test build #87648 has finished for PR 20670 at commit
|
@SparkQA i think this error is not caused by my patch. retest this please |
Test build #87651 has finished for PR 20670 at commit
|
This is still lacking detail about 'why'. It's not my area either. I think you should not have reopened this. |
@srowen i redescribe the problem. Now i hive a small table |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That description makes more sense, but then that's how this PR/JIRA should be described from the start. Does the order number column have a non-null constraint? sounds like that filter can be added to other tables in certain types of joins like an inner join or left join.
I don't know enough to review this change though.
@@ -29,12 +29,26 @@ trait QueryPlanConstraints { self: LogicalPlan => | |||
*/ | |||
lazy val constraints: ExpressionSet = { | |||
if (conf.constraintPropagationEnabled) { | |||
var relevantOutPutSet: AttributeSet = outputSet | |||
constraints.foreach { | |||
case eq @ EqualTo(l: Attribute, r: Attribute) => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
eq
isn't used
var relevantOutPutSet: AttributeSet = outputSet | ||
constraints.foreach { | ||
case eq @ EqualTo(l: Attribute, r: Attribute) => | ||
if (l.references.subsetOf(relevantOutPutSet) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can avoid computing each subsetOf
twice here.
case eq @ EqualTo(l: Attribute, r: Attribute) => | ||
if (l.references.subsetOf(relevantOutPutSet) | ||
&& !r.references.subsetOf(relevantOutPutSet)) { | ||
relevantOutPutSet = relevantOutPutSet.++(r.references) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use ++
syntax, rather than write it as a method invocation.
Good catch! This is a real problem, but the fix looks hacky. By definition, I think
Then we can call |
Agree with that @cloud-fan proposed to have constraints for a plan and the children. However, that requires a relative wider change as well as a fine set of test cases, please don't be hesitate to ask for help if you run into any issues working on this. |
Also, a better title for this PR would be:
|
@cloud-fan @srowen @jiangxb1987 i have changed the code and title , please help me review. Thanks. |
You shall also add test cases. |
Test build #87691 has finished for PR 20670 at commit
|
@@ -27,16 +27,15 @@ trait QueryPlanConstraints { self: LogicalPlan => | |||
* example, if this set contains the expression `a = 2` then that expression is guaranteed to | |||
* evaluate to `true` for all rows produced. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The comment belongs to constraints
not allConstraints
LGTM except we should add a test |
Test build #87726 has finished for PR 20670 at commit
|
Test build #87727 has finished for PR 20670 at commit
|
* An [[ExpressionSet]] that contains an additional set of constraints about equality | ||
* constraints and `isNotNull` constraints. | ||
*/ | ||
lazy val allConstraints: ExpressionSet = ExpressionSet(validConstraints |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should also be guarded by constraintPropagationEnabled
ExpressionSet(Set.empty) | ||
} | ||
} | ||
|
||
/** | ||
* An [[ExpressionSet]] that contains invariants about the rows output by this operator. For | ||
* example, if this set contains the expression `a = 2` then that expression is guaranteed to | ||
* evaluate to `true` for all rows produced. | ||
*/ | ||
lazy val constraints: ExpressionSet = { | ||
if (conf.constraintPropagationEnabled) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
now we don't need this if.
@@ -192,4 +192,17 @@ class InferFiltersFromConstraintsSuite extends PlanTest { | |||
|
|||
comparePlans(Optimize.execute(original.analyze), correct.analyze) | |||
} | |||
|
|||
test("SPARK-23405:single left-semi join, filter out nulls on either side on equi-join keys") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: SPARK-23405: left-semi equal-join should filter out null join keys on both sides
val x = testRelation.subquery('x) | ||
val y = testRelation.subquery('y) | ||
val originalQuery = x.join(y, LeftSemi, | ||
condition = Some("x.a".attr === "y.a".attr)).analyze |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: we can create a val condition = Some("x.a".attr === "y.a".attr)
to reduce duplicated code
LGTM except several minor comments |
@@ -22,21 +22,30 @@ import org.apache.spark.sql.catalyst.expressions._ | |||
|
|||
trait QueryPlanConstraints { self: LogicalPlan => | |||
|
|||
/** | |||
* An [[ExpressionSet]] that contains an additional set of constraints about equality |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The comment is not acute, we may have various kinds of constraints.
LGTM only nits |
Test build #87766 has finished for PR 20670 at commit
|
val left = x.where(IsNotNull('a)) | ||
val right = y.where(IsNotNull('a)) | ||
val correctAnswer = left.join(right, LeftSemi, condition) | ||
.analyze |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this doesn't need to be in a new line
Test build #87772 has finished for PR 20670 at commit
|
Test build #87804 has finished for PR 20670 at commit
|
*/ | ||
lazy val allConstraints: ExpressionSet = ExpressionSet(validConstraints | ||
.union(inferAdditionalConstraints(validConstraints)) | ||
.union(constructIsNotNullConstraints(validConstraints))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: indents
* An [[ExpressionSet]] that contains an additional set of constraints, such as equality | ||
* constraints and `isNotNull` constraints, etc. | ||
*/ | ||
lazy val allConstraints: ExpressionSet = ExpressionSet(validConstraints |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We still need if (conf.constraintPropagationEnabled)
@gatorsmile thanks. |
LGTM |
Test build #87817 has finished for PR 20670 at commit
|
retest this please |
Test build #87836 has finished for PR 20670 at commit
|
thanks, merging to master! |
…'s children ## What changes were proposed in this pull request? The existing query constraints framework has 2 steps: 1. propagate constraints bottom up. 2. use constraints to infer additional filters for better data pruning. For step 2, it mostly helps with Join, because we can connect the constraints from children to the join condition and infer powerful filters to prune the data of the join sides. e.g., the left side has constraints `a = 1`, the join condition is `left.a = right.a`, then we can infer `right.a = 1` to the right side and prune the right side a lot. However, the current logic of inferring filters from constraints for Join is pretty weak. It infers the filters from Join's constraints. Some joins like left semi/anti exclude output from right side and the right side constraints will be lost here. This PR propose to check the left and right constraints individually, expand the constraints with join condition and add filters to children of join directly, instead of adding to the join condition. This reverts apache#20670 , covers apache#20717 and apache#20816 This is inspired by the original PRs and the tests are all from these PRs. Thanks to the authors mgaido91 maryannxue KaiXinXiaoLei ! ## How was this patch tested? new tests Author: Wenchen Fan <[email protected]> Closes apache#21083 from cloud-fan/join.
What changes were proposed in this pull request?
(Please fill in changes proposed in this fix)
I run a sql:
select ls.cs_order_number from ls left semi join catalog_sales cs on ls.cs_order_number = cs.cs_order_number
, Thels
table is a small table ,and the number is one. Thecatalog_sales
table is a big table, and the number is 10 billion. The task will be hang up. And i find the many null values ofcs_order_number
in thecatalog_sales
table. I think the null value should be removed in the logical plan.Now, use this patch, the plan will be:
How was this patch tested?
(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)
Please review http://spark.apache.org/contributing.html before opening a pull request.