[SPARK-21979][SQL]Improve QueryPlanConstraints framework #19201

gengliangwang · 2017-09-12T09:13:14Z

What changes were proposed in this pull request?

Improve the QueryPlanConstraints framework to make it more robust and simpler.
In #15319, constraints for expressions like a = f(b, c) were resolved.
However, for expressions like

a = f(b, c) && c = g(a, b)

the current QueryPlanConstraints framework produces non-converging constraints.
Essentially, the problem is caused by having both the name and the child of an alias in the same constraint set. We infer constraints and push them down as predicates in filters; later, these predicates are propagated as constraints again, and so on.
Using only the alias names resolves these problems. The size of the constraint set is reduced without losing any information: the inferred constraints on the children of aliases can always be recovered when pushing down filters.
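
The non-convergence can be reproduced in a toy model. The sketch below is not Catalyst code: `Expr`, `Attr`, `Func`, `EqualTo`, `substitute` and `inferOnce` are hypothetical stand-ins, and the inference rule (substitute each equality into every other constraint) is a simplified assumption about how inference behaves. It only shows that, for `a = f(b, c) && c = g(a, b)`, every round yields a strictly deeper constraint, so the set never reaches a fixed point.

```scala
object NonConvergingSketch {
  // Toy expression tree; these are NOT Catalyst classes.
  sealed trait Expr
  final case class Attr(name: String) extends Expr
  final case class Func(name: String, args: List[Expr]) extends Expr

  // A constraint of the form `attr = expr`.
  final case class EqualTo(left: Attr, right: Expr)

  // Replace every occurrence of `from` inside `e` with `to`.
  def substitute(e: Expr, from: Attr, to: Expr): Expr = e match {
    case `from`           => to
    case Func(name, args) => Func(name, args.map(substitute(_, from, to)))
    case other            => other
  }

  // One round of naive inference: use each equality to rewrite the
  // right-hand side of every other constraint, and keep the old ones.
  def inferOnce(constraints: Set[EqualTo]): Set[EqualTo] = {
    val inferred = constraints.flatMap { source =>
      (constraints - source).map { other =>
        EqualTo(other.left, substitute(other.right, source.left, source.right))
      }
    }
    constraints ++ inferred
  }

  // a = f(b, c) && c = g(a, b)
  val start: Set[EqualTo] = Set(
    EqualTo(Attr("a"), Func("f", List(Attr("b"), Attr("c")))),
    EqualTo(Attr("c"), Func("g", List(Attr("a"), Attr("b"))))
  )

  // Size of the constraint set before inference and after each round.
  def sizes(rounds: Int): List[Int] =
    Iterator.iterate(start)(inferOnce).take(rounds + 1).map(_.size).toList
}
```

Calling `NonConvergingSketch.sizes(3)` returns a strictly increasing list, because `c = g(a, b)` and `a = f(b, c)` keep feeding each other deeper right-hand sides.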

Also, the EqualNullSafe constraint between the name and the child that is added when propagating an alias is meaningless:

allConstraints += EqualNullSafe(e, a.toAttribute)

It just produces redundant constraints.
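
As a rough illustration of keeping only alias names, the following sketch rewrites constraints on an aliased expression so that they mention the alias attribute instead. It is a toy model under stated assumptions: the `Expr` classes and the `replace` helper are hypothetical, and `replaceConstraints` here only borrows its name from the diff in this PR; it is not the actual implementation.

```scala
object AliasNameSketch {
  // Toy expression tree; these are NOT Catalyst classes.
  sealed trait Expr
  final case class Attr(name: String) extends Expr
  final case class Lit(value: Int) extends Expr
  final case class Func(name: String, args: List[Expr]) extends Expr
  final case class GreaterThan(left: Expr, right: Expr) extends Expr

  // Top-down replacement: any subtree equal to `source` becomes `destination`.
  def replace(e: Expr, source: Expr, destination: Attr): Expr =
    if (e == source) destination
    else e match {
      case Func(name, args) =>
        Func(name, args.map(replace(_, source, destination)))
      case GreaterThan(l, r) =>
        GreaterThan(replace(l, source, destination), replace(r, source, destination))
      case leaf => leaf
    }

  // Rewrite every constraint so that the aliased expression is referred to
  // by its alias name only; no `source <=> destination` constraint is added.
  def replaceConstraints(
      constraints: Set[Expr],
      source: Expr,
      destination: Attr): Set[Expr] =
    constraints.map(replace(_, source, destination))
}
```

For a projection `f(b) AS a`, a child constraint `f(b) > 1` becomes `a > 1`, while a constraint such as `b > 0` that does not mention `f(b)` is left unchanged.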

How was this patch tested?

Unit test

SparkQA · 2017-09-12T12:00:43Z

Test build #81668 has finished for PR 19201 at commit 036e846.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-09-12T13:17:38Z

Test build #81671 has finished for PR 19201 at commit 7b414fa.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

gatorsmile · 2017-09-12T16:04:54Z

...talyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/QueryPlanConstraints.scala

+  private def replaceConstraints(
+    constraints: Set[Expression],
+    source: Expression,
+    destination: Attribute): Set[Expression] = constraints.map(_ transform {


Please use 4-space indentation for the parameter list.

https://github.com/databricks/scala-style-guide#spacing-and-indentation

gatorsmile · 2017-09-12T16:05:47Z

LGTM except a minor comment.

cloud-fan

LGTM except 2 minor comments

cloud-fan · 2017-09-12T15:33:21Z

...talyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/QueryPlanConstraints.scala

-      case _ => // Skip
+  private def eliminateAliasedExpressionInConstraints(constraints: Set[Expression])
+    : Set[Expression] = {
+    val attributesInEqualTo = constraints.flatMap {


make it a set?

It is a set :)

cloud-fan · 2017-09-12T16:34:16Z

...talyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/QueryPlanConstraints.scala

-   * to an selected attribute.
+   * Replace the aliased expression in [[Alias]] with the alias name if both exist in constraints.
+   * Thus non-converging inference can be prevented.
+   * E.g. `a = f(a, b)`,  `a = f(b, c) && c = g(a, b)`.


This example doesn't even have an alias...

jiangxb1987

LGTM

SparkQA · 2017-09-12T19:56:38Z

Test build #81689 has finished for PR 19201 at commit d456876.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

gatorsmile · 2017-09-12T20:02:46Z

Thanks! Merged to master.

…onstraints

## What changes were proposed in this pull request?

#19201 introduced the following regression: given something like `df.withColumn("c", lit(2))`, we're no longer picking up `c === 2` as a constraint and inferring filters from it when joins are involved, which may lead to noticeable performance degradation.

This patch re-enables this optimization by picking up Aliases of Literals in Projection lists as constraints and making sure they're not treated as aliased columns.

## How was this patch tested?

Unit test was added.

Author: Adrian Ionescu <[email protected]>

Closes #20155 from adrian-ionescu/constant_constraints.

(cherry picked from commit 51c33bd)
Signed-off-by: gatorsmile <[email protected]>
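
The idea of picking up aliases of literals can be sketched in a toy model. Everything below is hypothetical (the `Expr` classes, `Alias`, and `constantConstraints` are illustration-only names, not the code of #20155): a projection list is scanned, and each alias whose child is a literal contributes a constraint `name = literal`, while aliases of other expressions contribute nothing here.

```scala
object ConstantAliasSketch {
  // Toy expression tree; these are NOT Catalyst classes.
  sealed trait Expr
  final case class Attr(name: String) extends Expr
  final case class Lit(value: Int) extends Expr
  final case class Add(left: Expr, right: Expr) extends Expr

  // A named projection such as `2 AS c`.
  final case class Alias(child: Expr, name: String)

  final case class EqualTo(left: Expr, right: Expr)

  // Only aliases of literals yield a constant constraint.
  def constantConstraints(projectList: Seq[Alias]): Set[EqualTo] =
    projectList.collect {
      case Alias(lit: Lit, name) => EqualTo(Attr(name), lit)
    }.toSet
}
```

With a projection list containing `2 AS c` and `id + 1 AS xid`, only `c = 2` is produced.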

## What changes were proposed in this pull request?

Previously, PR #19201 fixed the problem of non-converging constraints. After that, PR #19149 improved the loop so that constraints are inferred only once, so the problem of non-converging constraints is gone.

However, the case below will fail.

```
spark.range(5).write.saveAsTable("t")
val t = spark.read.table("t")
val left = t.withColumn("xid", $"id" + lit(1)).as("x")
val right = t.withColumnRenamed("id", "xid").as("y")
val df = left.join(right, "xid").filter("id = 3").toDF()
checkAnswer(df, Row(4, 3))
```

This is because `aliasMap` replaces all the aliased children. See the test case in the PR for details.

This PR fixes the bug by removing the now-useless code for preventing non-converging constraints. It can also be fixed with #20270, but this approach is much simpler and cleans up the code.

## How was this patch tested?

Unit test

Author: Wang Gengliang <[email protected]>

Closes #20278 from gengliangwang/FixConstraintSimple.

(cherry picked from commit 8598a98)
Signed-off-by: Wenchen Fan <[email protected]>

## What changes were proposed in this pull request?

How to reproduce:

```scala
val df1 = spark.createDataFrame(Seq(
  (1, 1)
)).toDF("a", "b").withColumn("c", lit(null).cast("int"))
val df2 = df1.union(df1).withColumn("d", spark_partition_id).filter($"c".isNotNull)
df2.show
+---+---+----+---+
|  a|  b|   c|  d|
+---+---+----+---+
|  1|  1|null|  0|
|  1|  1|null|  1|
+---+---+----+---+
```

`filter($"c".isNotNull)` was transformed to `(null <=> c#10)` before #19201, but it is transformed to `(c#10 = null)` since #20155. This PR reverts it to `(null <=> c#10)` to fix this issue.

## How was this patch tested?

Unit tests

Closes #22368 from wangyum/SPARK-25368.

Authored-by: Yuming Wang <[email protected]>
Signed-off-by: gatorsmile <[email protected]>
(cherry picked from commit 77c9964)
Signed-off-by: gatorsmile <[email protected]>
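
The two predicate forms mentioned in this commit message differ in how they treat NULL. The sketch below models SQL NULL with `None`; `equalTo` and `equalNullSafe` are hypothetical helpers written for illustration, not Spark APIs. Under standard SQL semantics, `=` yields NULL when either side is NULL, whereas `<=>` always yields true or false.

```scala
object NullSafeEqualitySketch {
  // `None` stands for SQL NULL.

  // SQL `=`: the result is NULL (None) if either operand is NULL.
  def equalTo(l: Option[Int], r: Option[Int]): Option[Boolean] =
    for (a <- l; b <- r) yield a == b

  // SQL `<=>`: NULL <=> NULL is true, NULL <=> value is false.
  def equalNullSafe(l: Option[Int], r: Option[Int]): Boolean = l == r
}
```

For a column that is always NULL, `equalTo(None, None)` is `None` (unknown), while `equalNullSafe(None, None)` is `true`, so the two forms are not interchangeable as constraints.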

gengliangwang added 2 commits September 12, 2017 17:06

improve QueryPlanConstraints

036e846

revise naming and comments

7b414fa

gatorsmile reviewed Sep 12, 2017

cloud-fan approved these changes Sep 12, 2017

revise as per comments

d456876

jiangxb1987 approved these changes Sep 12, 2017

asfgit closed this in 1a98574 Sep 12, 2017

adrian-ionescu mentioned this pull request Sep 30, 2017

[SPARK-21652][SQL][FOLLOW-UP] Fix rule conflict caused by InferFiltersFromConstraints #19149

Closed

adrian-ionescu mentioned this pull request Jan 4, 2018

[SPARK-22961][REGRESSION] Constant columns should generate QueryPlanConstraints #20155

Closed

This was referenced Jan 15, 2018

[SPARK-23079][SQL] Fix query constraints propagation with aliases #20270

Closed

[SPARK-23079][SQL]Fix query constraints propagation with aliases #20278

Closed

wangyum mentioned this pull request Sep 9, 2018

[SPARK-25368][SQL] Incorrect predicate pushdown returns wrong result #22368

Closed
