
[SPARK-12161][SQL] Ignore order of predicates in cache matching #10163

Closed · wants to merge 26 commits

Conversation

@codingjaguar

This PR improves LogicalPlan.sameResult so that semantically equivalent queries whose predicates appear in a different order are still matched.

Consider an example:
Query 1: CACHE TABLE first AS SELECT * FROM table A WHERE A.id > 100 AND A.id < 200;
Query 2: SELECT * FROM table A WHERE A.id < 200 AND A.id > 100;
Currently in Spark SQL, Query 2 cannot use the cached result of Query 1, even though the two queries are identical once the order of the predicates is ignored.
We modified the comparison function LogicalPlan.sameResult. The idea is to split a Filter's condition into a sequence of conjunct expressions and wrap them in a set. We can then compare the sets rather than literally compare the conditions, thereby ignoring the order of the predicates.
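The set-based comparison described above can be sketched as follows (a self-contained toy model with illustrative names, not the actual Catalyst Expression classes):

```scala
// Toy stand-ins for Catalyst's Expression tree (names are illustrative).
sealed trait Expr
case class And(left: Expr, right: Expr) extends Expr
case class Pred(sql: String) extends Expr // a leaf predicate such as "A.id > 100"

// Flatten a conjunction into its leaf predicates.
def splitConjuncts(e: Expr): Seq[Expr] = e match {
  case And(l, r) => splitConjuncts(l) ++ splitConjuncts(r)
  case other     => Seq(other)
}

// Order-insensitive comparison: wrap the conjuncts in sets before comparing.
def sameFilter(a: Expr, b: Expr): Boolean =
  splitConjuncts(a).toSet == splitConjuncts(b).toSet

// The two filters from the example match even though predicate order differs:
val q1 = And(Pred("A.id > 100"), Pred("A.id < 200"))
val q2 = And(Pred("A.id < 200"), Pred("A.id > 100"))
// sameFilter(q1, q2) == true
```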

@cloud-fan (Contributor)

This is a great feature! Can we implement it in the individual expressions instead of centralizing it in LogicalPlan.sameResult? A lot of commutative operators need it, like And, Multiply, Max, etc. Maybe we can improve Expression.semanticEquals?

@codingjaguar (Author)

Thanks for the feedback! We think it would be nice to support all commutative operators in Expression.semanticEquals, but that doesn't seem to directly help cache matching: LogicalPlan.sameResult doesn't use Expression.semanticEquals; instead it simply checks cleanRight.cleanArgs == cleanLeft.cleanArgs. Here we refactored the code by moving equivalentConditions to PredicateHelper.equivalentPredicates, so that this feature can also be used in other scenarios.
We didn't support other commutative operators because they rarely show up in a WHERE clause as the root expression. For example, WHERE Max(a.id, 5) is not a meaningful SQL statement.

@liancheng (Contributor)

ok to test

@SparkQA

SparkQA commented Dec 8, 2015

Test build #47328 has finished for PR 10163 at commit da46b1c.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
      • abstract class LogicalPlan extends QueryPlan[LogicalPlan] with PredicateHelper with Logging

@SparkQA

SparkQA commented Dec 8, 2015

Test build #47342 has finished for PR 10163 at commit c81aa46.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
      • abstract class LogicalPlan extends QueryPlan[LogicalPlan] with PredicateHelper with Logging

@SparkQA

SparkQA commented Dec 8, 2015

Test build #47364 has finished for PR 10163 at commit cc8fdfe.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
      • public class JavaIndexToStringExample
      • abstract class LogicalPlan extends QueryPlan[LogicalPlan] with PredicateHelper with Logging

@@ -127,33 +127,41 @@ abstract class LogicalPlan extends QueryPlan[LogicalPlan] with Logging {
cleanLeft.children.size == cleanRight.children.size && {
logDebug(
s"[${cleanRight.cleanArgs.mkString(", ")}] == [${cleanLeft.cleanArgs.mkString(", ")}]")
cleanRight.cleanArgs == cleanLeft.cleanArgs
Contributor

how about we just change this to:

cleanRight.cleanArgs.zip(cleanLeft.cleanArgs).forall {
  case (e1: Expression, e2: Expression) => e1 semanticEquals e2
  case (a1, a2) => a1 == a2
}

then we can just improve Expression.semanticEquals

@cloud-fan (Contributor)

We didn't support other commutative operators because they rarely show up in a WHERE clause as the root expression.

How about something like WHERE a + b = c? I think it's quite common that we have other expressions inside predicates.

@windscope

To improve semanticEquals, we tried to implement a template function Expression.splitWithCommutativeOperator[T: Manifest](): Seq[Expression] so that we wouldn't need to implement a separate split function for each commutative operator. However, we cannot perform pattern matching on T.
Should we simply call PredicateHelper.splitConjunctivePredicates in Expression.semanticEquals and implement a few similar split functions for the other commutative operators?
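For what it's worth, the erasure problem can be worked around by matching on a runtime class tag rather than the type parameter itself. The sketch below uses ClassTag (the modern replacement for Manifest) and toy expression types, not the actual Catalyst ones:

```scala
import scala.reflect.ClassTag

// Toy expression types (illustrative, not the Catalyst classes).
sealed trait Expr
case class And(left: Expr, right: Expr) extends Expr
case class Or(left: Expr, right: Expr) extends Expr
case class Leaf(name: String) extends Expr

// Generic split over any binary node type T. The ClassTag context bound lets
// the `case t: T` pattern check the runtime class that erasure would hide.
def splitWith[T <: Expr : ClassTag](e: Expr)(children: T => (Expr, Expr)): Seq[Expr] =
  e match {
    case t: T =>
      val (l, r) = children(t)
      splitWith[T](l)(children) ++ splitWith[T](r)(children)
    case other => Seq(other)
  }

// splitWith[And](And(Leaf("a"), And(Leaf("b"), Leaf("c"))))(a => (a.left, a.right))
// flattens to Seq(Leaf("a"), Leaf("b"), Leaf("c"))
```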

@codingjaguar (Author)

In the last change we deleted equivalentPredicates and moved its functionality into Expression.semanticEquals.

checkSemantic(splitDisjunctivePredicates(left).toSet.toSeq,
splitDisjunctivePredicates(right).toSet.toSeq)
case _ => checkSemantic(elements1, elements2)
}
Contributor

Sorry, I didn't explain it clearly. I mean we can override semanticEquals in concrete expressions like Or, And, etc. And we don't need to support all commutative operators at once: you can finish just the predicate parts in this PR and open follow-up PRs for the other parts (like Add, Multiply). Let's do it step-by-step :)

@SparkQA

SparkQA commented Dec 9, 2015

Test build #47419 has finished for PR 10163 at commit 07128b3.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
      • abstract class Expression extends TreeNode[Expression] with PredicateHelper
      • abstract class LogicalPlan extends QueryPlan[LogicalPlan] with PredicateHelper with Logging

@codingjaguar (Author)

We updated semanticEquals. Now And and Or override semanticEquals and ignore ordering.
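An order-insensitive semanticEquals along these lines might look roughly like the following (a self-contained toy sketch with illustrative names, not the PR's exact code; duplicates are handled by removing one matched conjunct at a time instead of using a set):

```scala
// Illustrative toy sketch; `Expr`, `Leaf`, `And` stand in for Catalyst classes.
sealed trait Expr {
  def semanticEquals(other: Expr): Boolean = this == other
}
case class Leaf(name: String) extends Expr
case class And(left: Expr, right: Expr) extends Expr {
  // Flatten nested conjunctions, so And(a, And(b, c)) and And(And(a, b), c)
  // both yield Seq(a, b, c).
  private def conjuncts(e: Expr): Seq[Expr] = e match {
    case And(l, r) => conjuncts(l) ++ conjuncts(r)
    case leaf      => Seq(leaf)
  }
  override def semanticEquals(other: Expr): Boolean = other match {
    case o: And =>
      var remaining = conjuncts(this)
      val theirs = conjuncts(o)
      // Remove one semantically equal conjunct per element; if every element
      // finds a match, the conjunctions are equivalent regardless of order.
      theirs.size == remaining.size && theirs.forall { e =>
        val i = remaining.indexWhere(_.semanticEquals(e))
        if (i >= 0) { remaining = remaining.patch(i, Nil, 1); true } else false
      }
    case _ => false
  }
}
```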

@SparkQA

SparkQA commented Dec 9, 2015

Test build #47446 has finished for PR 10163 at commit 99626a4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
      • abstract class Expression extends TreeNode[Expression]
      • case class And(left: Expression, right: Expression) extends BinaryOperator
      • case class Or(left: Expression, right: Expression) extends BinaryOperator

@SparkQA

SparkQA commented Dec 10, 2015

Test build #47452 has finished for PR 10163 at commit 2efca2f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
      • abstract class Expression extends TreeNode[Expression]
      • case class And(left: Expression, right: Expression) extends BinaryOperator
      • case class Or(left: Expression, right: Expression) extends BinaryOperator

// elements1. If they are semantically equivalent, elements1 should be empty at the end.
elements1.size == elements2.size && {
for (e <- elements2) elements1 = removeFirstSemanticEquivalent(elements1, e)
elements1.isEmpty
Contributor

Sorry, I may have missed something here, but can we just write:

override def semanticEquals(other: Expression): Boolean = other match {
  case And(otherLeft, otherRight) =>
    (left.semanticEquals(otherLeft) && right.semanticEquals(otherRight)) ||
    (left.semanticEquals(otherRight) && right.semanticEquals(otherLeft))
  case _ => false
}

Author

Consider this example:
e1 = And(a, And(b, c))
e2 = And(And(a, b), c)
They are semantically equivalent, but your code would return false for them.
splitConjunctivePredicates flattens the expression tree into the sequence (a, b, c).

Contributor

ah I see, this makes sense.
But I think a better way is to add an optimization rule that turns all predicates into CNF before we begin to check the semantics; otherwise it will be hard to cover all cases, like a || (b && c) == (a || b) && (a || c)

cc @liancheng
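A minimal CNF rewrite of the kind suggested here is sketched below (a textbook distribution of Or over And on toy types, not Spark code; note that this rewrite can grow the predicate exponentially, which is a practical concern for such a rule):

```scala
// Toy expression types (illustrative only).
sealed trait Expr
case class Leaf(name: String) extends Expr
case class And(left: Expr, right: Expr) extends Expr
case class Or(left: Expr, right: Expr) extends Expr

// Push Or below And by distribution: a || (b && c) => (a || b) && (a || c).
def toCnf(e: Expr): Expr = e match {
  case And(l, r) => And(toCnf(l), toCnf(r))
  case Or(l, r) =>
    (toCnf(l), toCnf(r)) match {
      case (And(a, b), c) => And(toCnf(Or(a, c)), toCnf(Or(b, c)))
      case (a, And(b, c)) => And(toCnf(Or(a, b)), toCnf(Or(a, c)))
      case (a, b)         => Or(a, b)
    }
  case leaf => leaf
}

// toCnf(Or(Leaf("a"), And(Leaf("b"), Leaf("c"))))
//   == And(Or(Leaf("a"), Leaf("b")), Or(Leaf("a"), Leaf("c")))
```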

Author

It seems there is an open PR that implements CNF normalization. Is there any reason why it hasn't been merged?
#8200

cc @yjshen

@AmplabJenkins

Can one of the admins verify this patch?

@rxin (Contributor)

rxin commented Jun 15, 2016

Thanks for the pull request. I'm going through a list of pull requests to cut them down since the sheer number is breaking some of the tooling we have. Due to lack of activity on this pull request, I'm going to push a commit to close it. Feel free to reopen it or create a new one.

@asfgit asfgit closed this in 1a33f2e Jun 15, 2016
7 participants