
[SPARK-19993][SQL] Caching logical plans containing subquery expressions does not work. #17330

Closed
wants to merge 4 commits

Conversation

@dilipbiswal (Contributor) commented Mar 17, 2017

What changes were proposed in this pull request?

The sameResult() method does not work when the logical plan contains subquery expressions.

Before the fix

scala> val ds = spark.sql("select * from s1 where s1.c1 in (select s2.c1 from s2 where s1.c1 = s2.c1)")
ds: org.apache.spark.sql.DataFrame = [c1: int]

scala> ds.cache
res13: ds.type = [c1: int]

scala> spark.sql("select * from s1 where s1.c1 in (select s2.c1 from s2 where s1.c1 = s2.c1)").explain(true)
== Analyzed Logical Plan ==
c1: int
Project [c1#86]
+- Filter c1#86 IN (list#78 [c1#86])
   :  +- Project [c1#87]
   :     +- Filter (outer(c1#86) = c1#87)
   :        +- SubqueryAlias s2
   :           +- Relation[c1#87] parquet
   +- SubqueryAlias s1
      +- Relation[c1#86] parquet

== Optimized Logical Plan ==
Join LeftSemi, ((c1#86 = c1#87) && (c1#86 = c1#87))
:- Relation[c1#86] parquet
+- Relation[c1#87] parquet

Plan after fix

== Analyzed Logical Plan ==
c1: int
Project [c1#22]
+- Filter c1#22 IN (list#14 [c1#22])
   :  +- Project [c1#23]
   :     +- Filter (outer(c1#22) = c1#23)
   :        +- SubqueryAlias s2
   :           +- Relation[c1#23] parquet
   +- SubqueryAlias s1
      +- Relation[c1#22] parquet

== Optimized Logical Plan ==
InMemoryRelation [c1#22], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas)
   +- *BroadcastHashJoin [c1#1, c1#1], [c1#2, c1#2], LeftSemi, BuildRight
      :- *FileScan parquet default.s1[c1#1] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/Users/dbiswal/mygit/apache/spark/bin/spark-warehouse/s1], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<c1:int>
      +- BroadcastExchange HashedRelationBroadcastMode(List((shiftleft(cast(input[0, int, true] as bigint), 32) | (cast(input[0, int, true] as bigint) & 4294967295))))
         +- *FileScan parquet default.s2[c1#2] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/Users/dbiswal/mygit/apache/spark/bin/spark-warehouse/s2], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<c1:int>

How was this patch tested?

New tests are added to CachedTableSuite.
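A minimal sketch of the kind of test this adds (illustrative only; the actual tests live in CachedTableSuite, and the harness/helper names below are assumed from that suite, not verbatim):

```scala
// Sketch of a SPARK-19993-style test, assuming the SQLTestUtils/SharedSQLContext
// harness used by CachedTableSuite; test name and helpers are illustrative.
test("SPARK-19993 caching a query containing a subquery expression") {
  withTempView("s1", "s2") {
    Seq(1).toDF("c1").createOrReplaceTempView("s1")
    Seq(1).toDF("c1").createOrReplaceTempView("s2")

    val query = "select * from s1 where s1.c1 in " +
      "(select s2.c1 from s2 where s1.c1 = s2.c1)"
    spark.sql(query).cache()

    // Re-issuing the same query should now hit the cache, i.e. its plan
    // should resolve to the cached InMemoryRelation.
    val cached = spark.sql(query)
    assert(getNumInMemoryRelations(cached) == 1)
  }
}
```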

SparkQA commented Mar 17, 2017

Test build #74729 has finished for PR 17330 at commit dfa76da.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class CanonicalizedSubqueryExpr(expr: SubqueryExpression)

@dilipbiswal (Contributor Author)

cc @hvanhovell @gatorsmile

@dilipbiswal dilipbiswal changed the title [SPARK-19993][SQL] Caching logical plans containing subquery expressions does not work. [SPARK-19993][SQL][WIP] Caching logical plans containing subquery expressions does not work. Mar 17, 2017
// TODO : improve the hashcode generation by considering the plan info.
override def hashCode(): Int = {
val state = Seq(children, this.getClass.getName)
state.map(Objects.hashCode).foldLeft(0)((a, b) => 31 * a + b)
Contributor

Can we write this imperatively?

Contributor Author

@rxin Sure Reynold.
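For illustration, an imperative equivalent of the functional fold quoted above might look like this (a self-contained sketch; `combineHashes` is a hypothetical name, not part of the patch):

```scala
import java.util.Objects

// Imperative equivalent of:
//   state.map(Objects.hashCode).foldLeft(0)((a, b) => 31 * a + b)
// Combines the hash codes of the given values with the usual 31-multiplier scheme.
def combineHashes(state: Seq[AnyRef]): Int = {
  var result = 0
  for (s <- state) {
    result = 31 * result + Objects.hashCode(s)
  }
  result
}
```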

SparkQA commented Mar 18, 2017

Test build #74772 has finished for PR 17330 at commit 2b9d717.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dilipbiswal (Contributor Author)

retest this please

SparkQA commented Mar 18, 2017

Test build #74775 has finished for PR 17330 at commit 2b9d717.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dilipbiswal dilipbiswal changed the title [SPARK-19993][SQL][WIP] Caching logical plans containing subquery expressions does not work. [SPARK-19993][SQL] Caching logical plans containing subquery expressions does not work. Mar 19, 2017
def canonicalize(e: SubqueryExpression, attrs: AttributeSeq): CanonicalizedSubqueryExpr = {
// Normalize the outer references in the subquery plan.
val subPlan = e.plan.transformAllExpressions {
case o @ OuterReference(e) => BindReferences.bindReference(e, attrs, allowFailures = true)
Member

Nit: case OuterReference(r) => BindReferences.bindReference(r, attrs, allowFailures = true)

Contributor Author

@gatorsmile Will change. Thanks.


override def equals(o: Any): Boolean = o match {
case n: CanonicalizedSubqueryExpr => expr.semanticEquals(n.expr)
case other => false
Member

    case CanonicalizedSubqueryExpr(e) => expr.semanticEquals(e)
    case _ => false

Contributor Author

@gatorsmile Will change. Thanks.


/**
* Clean the outer references by normalizing them to BindReference in the same way
* we clean up the arguments during LogicalPlan.sameResult. This enables to compare two
Member

Also replace SubqueryExpression by CanonicalizedSubqueryExpr


assert(getNumInMemoryRelations(cachedDs2) == 1)

spark.catalog.cacheTable("t1")
ds1.unpersist()
Member

How about splitting the test case into multiple individual ones? The cache will be cleared after each test case, which avoids any extra checking like assert(getNumInMemoryRelations(cachedMissDs) == 0) or ds1.unpersist():

  override def afterEach(): Unit = {
    try {
      spark.catalog.clearCache()
    } finally {
      super.afterEach()
    }
  }

Contributor Author

@gatorsmile Sure. will do.

@gatorsmile (Member)

Generally, it looks good to me. cc @hvanhovell @rxin @cloud-fan

SparkQA commented Mar 19, 2017

Test build #74813 has finished for PR 17330 at commit 0c64245.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

// re-order them based on hashcode.
// TODO : improve the hashcode generation by considering the plan info.
override def hashCode(): Int = {
val h = Objects.hashCode(expr.children)
Contributor

can we use expr.semanticHash here?

Contributor Author (dilipbiswal, Mar 20, 2017)

@cloud-fan Hi Wenchen, we are unable to use expr.semanticHash here; I actually tried that. When we canonicalize the expressions we re-order them based on their hash code, and that creates a problem. I have a test case with three subquery expressions (ScalarSubquery, ListQuery and Exists), and while re-ordering them the order becomes unpredictable. So the hashcode I generate here is deliberately conservative.
Let me know what you think.

Contributor

Why is the reordering unpredictable?

Contributor Author (dilipbiswal, Mar 20, 2017)

@cloud-fan In my understanding, it is because a SubqueryExpression contains a LogicalPlan, and the logical plan carries all sorts of expression ids that contribute to this unpredictability. To remove it, we may have to implement a canonicalized() on SubqueryExpression that also canonicalizes all the expressions in the subquery plan recursively. I had marked that as a TODO in my comment. Please let me know if my understanding is wrong.

Contributor

After canonicalization, the ordering should be deterministic, right?

Contributor Author

@cloud-fan Hi Wenchen, let me take an example; perhaps that will help.

Assume there is a Filter operator that has 3 subquery expressions, like the following:

Filter Scalar-subquery(.., splan1, ..) || Exists(.., splan2, ..) || In(.., ListQuery(.., splan3, ..))
  1. During sameResult on this plan, we perform this logic.
  2. We clean the args from both the source and target plans.
    2.1 We change each subquery expression to a CanonicalizedSubqueryExpression, so that we can do a customized equals that performs a semantic comparison of two subquery expressions.
    2.2 Additionally, we transform the subquery plans (splan1, splan2, splan3) to change outer attribute references to BindReference. Since these are outer references, the inner plan is not aware of these attributes, so we fix them up as part of canonicalize. After this, when equals is called on CanonicalizedSubqueryExpression and sameResult is called on splan1, splan2, splan3, it works fine, because the outer references have been normalized to BindReference. The local attributes referenced in the plan are handled in sameResult itself.
    2.3 As part of cleanArgs, after cleaning the expressions we call canonicalized. Here we try to re-order the 3 CanonicalizedSubqueryExpressions on the basis of their hashCode. When hashCode delegated to expr.semanticHash(), I observed that the ordering of the expressions between source and target was not consistent. I debugged this and found that expr.semanticHash() considers the subquery plans, since they are class arguments, and the plans contain arbitrary expression ids (at this point we have only normalized the outer references, not all of them). Thus, in the current implementation I exclude the subplans while computing the hashcode and have marked it as a TODO.

Please let me know your thoughts, and whether I have misunderstood anything.

* optimizer or planner never sees this expression type during transformation of
* plans.
*/
case class CanonicalizedSubqueryExpr(expr: SubqueryExpression)
Member (viirya, Mar 22, 2017)

Do we really need this new abstraction?

Actually I don't think we should compare SubqueryExpressions and wonder how to canonicalize them.

Basically, in sameResult, by adding new condition (left.subqueries, right.subqueries).zipped.forall(_ sameResult _), we can achieve the same goal and all added tests are passed.

Btw, all PlanExpression should be erased when doing cleanArg.
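The plan-level alternative suggested here would look roughly like the following (a hypothetical fragment inside QueryPlan.sameResult; `left`, `right`, and a `subqueries` collector for each side are assumed names, not verbatim from the patch):

```scala
// Sketch of the reviewer's suggestion: instead of canonicalizing the subquery
// expressions themselves, compare the subquery plans of both sides pairwise
// as an extra condition in sameResult.
val subqueriesMatch =
  left.subqueries.length == right.subqueries.length &&
    (left.subqueries, right.subqueries).zipped.forall(_ sameResult _)
```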

Contributor Author

@viirya Thank you very much. Let me try this.

Contributor Author (dilipbiswal, Mar 22, 2017)

@viirya Thank you for your suggestion. I thought about this and went through the notes I had prepared while debugging. The reason I opted for comparing subquery expressions, as opposed to just the plans, is that I wanted to take advantage of expr.canonicalized, which re-orders expressions nicely to maximize cache hits.
Example: plan1 - Filter In || Exists || Scalar
plan2 - Filter Scalar || In || Exists
Comparing the above two plans, all other things being equal, should cause a cache hit; I have added a test case now to make sure. One other aspect is that in the future the subquery expression may evolve to hold more attributes, and not considering them didn't feel safe. Also, I suspect we may still have to deal with the outer references in the same way I am handling them now (I haven't tried it yet). Please let me know your thoughts.

Member

Sounds reasonable. The benefit of this expression approach is expression canonicalization.

def canonicalize(e: SubqueryExpression, attrs: AttributeSeq): CanonicalizedSubqueryExpr = {
// Normalize the outer references in the subquery plan.
val subPlan = e.plan.transformAllExpressions {
case OuterReference(r) => BindReferences.bindReference(r, attrs, allowFailures = true)
Contributor

Shall we also bind references for the other Attributes in the subquery here?

Contributor Author (dilipbiswal, Mar 22, 2017)

@cloud-fan The local attributes would be bound when we call sameResult on the subquery plan as part of SubqueryExpression.equals(). We certainly could bind them here, but if you don't mind I wanted to explore that in a follow-up, in an effort to start considering the sub-plan in the hashCode computation (marked as a TODO in a comment now), as I wanted more testing on that. Please let me know what you think.

Contributor

If we bind all references here, can we solve the problem of nondeterministic reordering?

Contributor Author (dilipbiswal, Mar 23, 2017)

@cloud-fan Hi Wenchen, if we are able to normalize all the expression ids, then I think we can solve the nondeterminism. I actually tried this for the new test cases I added, and they worked fine, but I didn't go with it, as I felt it needed more testing. In my testing I only normalized AttributeReference, but we can also have generated expression ids for other expressions (Aliases, SubqueryExpression etc.), right? We would need to normalize those as well, no?

Contributor Author

@cloud-fan Hi Wenchen, I tried to normalize all the expressions in the plan, but the hashcodes are still not stable: on two instances of the same plan I get two different hashcodes. Please advise.

@cloud-fan (Contributor)

Hi @dilipbiswal can you update now?

@dilipbiswal (Contributor Author)

@cloud-fan Sure Wenchen.

@@ -401,8 +401,9 @@ abstract class QueryPlan[PlanType <: QueryPlan[PlanType]] extends TreeNode[PlanT
* do not use `BindReferences` here as the plan may take the expression as a parameter with type
* `Attribute`, and replace it with `BoundReference` will cause error.
*/
protected def normalizeExprId[T <: Expression](e: T, input: AttributeSeq = allAttributes): T = {
def normalizeExprId[T <: Expression](e: T, input: AttributeSeq = allAttributes): T = {
Contributor

oh, seems we can move this method to object QueryPlan?

Contributor Author

@cloud-fan OK, sounds good Wenchen.

@@ -236,6 +244,12 @@ case class ScalarSubquery(
override def nullable: Boolean = true
override def withNewPlan(plan: LogicalPlan): ScalarSubquery = copy(plan = plan)
override def toString: String = s"scalar-subquery#${exprId.id} $conditionString"
override lazy val canonicalized: Expression = {
Contributor

I think we can just override preCanonicalize to turn exprId to 0

Contributor Author

@cloud-fan preCanonicalize is a method on QueryPlan? Can we override it here?

Contributor

Oh sorry, it's on Expression.
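For reference, the canonicalized override under discussion could look roughly like this (a sketch only, assuming a ScalarSubquery(plan, children, exprId) shape; the exact constructor arguments may differ):

```scala
override lazy val canonicalized: Expression = {
  // Canonicalize the subquery plan and the outer expressions, and zero out
  // the expression id so that semantically equal scalar subqueries compare
  // (and hash) equal.
  ScalarSubquery(
    plan.canonicalized,
    children.map(_.canonicalized),
    ExprId(0))
}
```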

SparkQA commented Apr 10, 2017

Test build #75647 has finished for PR 17330 at commit 7346dca.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Apr 10, 2017

Test build #75654 has finished for PR 17330 at commit 22db44a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -59,6 +58,13 @@ abstract class SubqueryExpression(
children.zip(p.children).forall(p => p._1.semanticEquals(p._2))
case _ => false
}
def canonicalize(attrs: AttributeSeq): SubqueryExpression = {
// Normalize the outer references in the subquery plan.
val subPlan = plan.transformAllExpressions {
Contributor (cloud-fan, Apr 10, 2017)

normalizedPlan

def canonicalize(attrs: AttributeSeq): SubqueryExpression = {
// Normalize the outer references in the subquery plan.
val subPlan = plan.transformAllExpressions {
case OuterReference(r) => QueryPlan.normalizeExprId(r, attrs)
Contributor

The OuterReferences will all be removed. Is that expected?

Contributor Author

@cloud-fan Actually you are right. Preserving the OuterReference would be good.
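Preserving the wrapper while still normalizing the id underneath could look like this (a sketch of the canonicalize method quoted above; the surrounding class is SubqueryExpression):

```scala
def canonicalize(attrs: AttributeSeq): SubqueryExpression = {
  // Normalize the outer references in the subquery plan, but keep the
  // OuterReference wrapper so the attribute is still marked as outer.
  val normalizedPlan = plan.transformAllExpressions {
    case OuterReference(r) => OuterReference(QueryPlan.normalizeExprId(r, attrs))
  }
  withNewPlan(normalizedPlan)
}
```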

SparkQA commented Apr 10, 2017

Test build #75664 has finished for PR 17330 at commit f1e63c8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

test("SPARK-19993 simple subquery caching") {
withTempView("t1", "t2") {
Seq(1).toDF("c1").createOrReplaceTempView("t1")
Seq(1).toDF("c1").createOrReplaceTempView("t2")
Contributor

where is t2 used?

Contributor Author

@cloud-fan Sorry, I had some of these tests combined, and when I split them I forgot to remove this. Will fix it.

test("SPARK-19993 subquery with cached underlying relation") {
withTempView("t1", "t2") {
Seq(1).toDF("c1").createOrReplaceTempView("t1")
Seq(1).toDF("c1").createOrReplaceTempView("t2")
Contributor

where is t2 used?

Contributor Author

@cloud-fan Sorry, I had some of these tests combined, and when I split them I forgot to remove this. Will fix it.

@@ -76,6 +76,13 @@ class CachedTableSuite extends QueryTest with SQLTestUtils with SharedSQLContext
sum
}

private def getNumInMemoryTableScanExecs(plan: SparkPlan): Int = {
Contributor

We need a better name; this one actually gets in-memory tables recursively, which is different from getNumInMemoryRelations.

Contributor Author

@cloud-fan In this method we operate at the physical plan level, whereas getNumInMemoryRelations operates at the logical plan level, and here we simply count the InMemoryTableScanExec nodes in the plan. I have changed the function name to getNumInMemoryTablesRecursively. Does that look OK to you?
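The renamed helper might look roughly like this (a sketch based on the discussion; it counts InMemoryTableScanExec nodes recursively, following cached plans nested inside other cached plans):

```scala
private def getNumInMemoryTablesRecursively(plan: SparkPlan): Int = {
  plan.collect {
    case inMemoryTable: InMemoryTableScanExec =>
      // Count this scan plus any in-memory tables inside its cached plan.
      getNumInMemoryTablesRecursively(inMemoryTable.relation.cachedPlan) + 1
  }.sum
}
```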

@cloud-fan (Contributor)

LGTM except some minor comments about the tests.

SparkQA commented Apr 11, 2017

Test build #75710 has finished for PR 17330 at commit 362d62f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor)

thanks, merging to master!

@asfgit asfgit closed this in b14bfc3 Apr 12, 2017
@dilipbiswal (Contributor Author) commented Apr 12, 2017

@cloud-fan @gatorsmile Thanks a lot!!

peter-toth pushed a commit to peter-toth/spark that referenced this pull request Oct 6, 2018
Author: Dilip Biswal <[email protected]>

Closes apache#17330 from dilipbiswal/subquery_cache_final.