[SPARK-7026] [SQL] fix left semi join with equi key and non-equi condition #5643

adrian-wang · 2015-04-23T02:47:06Z

When the condition extracted by ExtractEquiJoinKeys contains a join predicate for a left semi join, we cannot plan it as a plain semi join. For example:

SELECT * FROM testData2 x
LEFT SEMI JOIN testData2 y 
ON x.b = y.b
AND x.a >= y.a + 2

The condition x.a >= y.a + 2 cannot be evaluated on table x alone, so it throws an error.
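To illustrate the problem, here is a minimal sketch in plain Scala collections, not Spark's actual operators. The names SemiJoinSketch, Rec, semiJoinKeysOnly and semiJoinWithCondition are hypothetical and exist only for this example. It contrasts a key-only semi join with one that keeps the build-side rows so the extra condition can be checked.

```scala
object SemiJoinSketch {
  // Hypothetical row type with the two columns used in the query (a and b).
  final case class Rec(a: Int, b: Int)

  // Key-only approach: remembers just the build-side join keys, so it has
  // no build-side value of `a` left to evaluate x.a >= y.a + 2 against.
  def semiJoinKeysOnly(left: Seq[Rec], right: Seq[Rec]): Seq[Rec] = {
    val keys = right.map(_.b).toSet
    left.filter(x => keys.contains(x.b))
  }

  // Key-plus-rows approach: keeps the build-side rows per key so the
  // condition can be checked against every candidate pair (x, y).
  def semiJoinWithCondition(left: Seq[Rec], right: Seq[Rec])(
      cond: (Rec, Rec) => Boolean): Seq[Rec] = {
    val rowsByKey = right.groupBy(_.b)
    left.filter(x => rowsByKey.getOrElse(x.b, Nil).exists(y => cond(x, y)))
  }
}
```

With a condition such as (x, y) => x.a >= y.a + 2, the second function keeps a left row only if some right row with the same b also satisfies the condition, which is the behaviour the query above asks for.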

SparkQA · 2015-04-23T02:52:42Z

Test build #30798 has finished for PR 5643 at commit 6eb62d2.

This patch fails to build.
This patch merges cleanly.
This patch adds no public classes.
This patch does not change any dependencies.

scwf · 2015-04-23T03:46:03Z

I am not sure it is suitable to broadcast a hashmap that contains the keys and their related rows. It may be much bigger than the old hashset and may cause OOM issues.

adrian-wang · 2015-04-23T04:13:03Z

@scwf Of course we can go the old way when there are no additional conditions.

scwf · 2015-04-23T04:36:44Z

sql/core/src/main/scala/org/apache/spark/sql/execution/joins/BroadcastLeftSemiJoinHash.scala

  override def execute(): RDD[Row] = {
    val buildIter= buildPlan.execute().map(_.copy()).collect().toIterator
-    val hashSet = new java.util.HashSet[Row]()
+    val hashMap = new java.util.HashMap[Row, scala.collection.mutable.ArrayBuffer[Row]]()


Why was this changed to an ArrayBuffer?

This is a consideration for performance. Anyway I'm changing it back to HashSet.

SparkQA · 2015-04-23T04:44:19Z

Test build #30800 has finished for PR 5643 at commit a99f492.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class FreqItemset(namedtuple("FreqItemset", ["items", "freq"])):
This patch does not change any dependencies.

scwf · 2015-04-23T05:21:09Z

Yes, I understand that when there is no additional condition it goes the old way. I mean that when there are additional conditions, the hashmap you broadcast may be much bigger since you also keep the related rows, which may lead to OOM.
I fixed my PR, please have a look.

SparkQA · 2015-04-23T05:52:20Z

Test build #30805 has finished for PR 5643 at commit cf435db.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.
This patch does not change any dependencies.

SparkQA · 2015-04-23T06:42:00Z

Test build #30808 has finished for PR 5643 at commit 19201e0.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.
This patch does not change any dependencies.

scwf · 2015-04-23T06:45:51Z

sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala

@@ -298,6 +298,13 @@ class SQLQuerySuite extends QueryTest with BeforeAndAfterAll {
    )
  }

+  test("left semi greater than predicate and equal operator") {


@adrian-wang I suggest you add the case chenghao described in my PR to the unit test.

create a pr for your branch

closed since you have added the test

SparkQA · 2015-04-23T06:56:11Z

Test build #30812 has finished for PR 5643 at commit 41c20d5.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.
This patch does not change any dependencies.

scwf · 2015-04-23T07:11:20Z

This LGTM

SparkQA · 2015-04-23T07:25:16Z

Test build #30818 has finished for PR 5643 at commit 75b8a64.

This patch fails MiMa tests.
This patch merges cleanly.
This patch adds no public classes.
This patch does not change any dependencies.

adrian-wang · 2015-04-23T07:39:23Z

retest this please.

SparkQA · 2015-04-23T09:22:19Z

Test build #30822 has finished for PR 5643 at commit 75b8a64.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.
This patch does not change any dependencies.

SparkQA · 2015-04-23T12:37:07Z

Test build #30834 has finished for PR 5643 at commit a7c6cc4.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.
This patch does not change any dependencies.

chenghao-intel · 2015-04-24T02:24:13Z

sql/core/src/main/scala/org/apache/spark/sql/execution/joins/BroadcastLeftSemiJoinHash.scala

+            }
+          }
+        }
+      case Some(_) =>


I am wondering if this would be simpler if we use the HashedRelation instead.

SparkQA · 2015-04-24T04:52:51Z

Test build #30905 has finished for PR 5643 at commit d29f9a6.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.
This patch does not change any dependencies.

chenghao-intel · 2015-04-24T05:18:02Z

sql/core/src/main/scala/org/apache/spark/sql/execution/joins/BroadcastLeftSemiJoinHash.scala


  override val buildSide: BuildSide = BuildRight

  override def output: Seq[Attribute] = left.output

+  @transient private lazy val boundCondition =
+    condition.map(newPredicate(_, left.output ++ right.output)).getOrElse((row: Row) => true)


newPredicate(condition.getOrElse(Literal(true)), left.output ++ right.output)?
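The two styles being compared here can be sketched in plain Scala. This is an illustration under stated assumptions: Expr, TrueLiteral, GreaterOrEqual, Row and compile are hypothetical stand-ins, not Spark's Expression, Literal, Row or newPredicate.

```scala
object PredicateDefaultSketch {
  // Hypothetical stand-ins for an expression tree and a row.
  sealed trait Expr
  case object TrueLiteral extends Expr
  final case class GreaterOrEqual(leftIdx: Int, rightIdx: Int) extends Expr

  type Row = Vector[Int]

  // Hypothetical "compiler" from an expression to a row predicate.
  def compile(e: Expr): Row => Boolean = e match {
    case TrueLiteral          => (_: Row) => true
    case GreaterOrEqual(l, r) => (row: Row) => row(l) >= row(r)
  }

  // Style in the diff: compile when present, else fall back to a constant function.
  def viaMap(condition: Option[Expr]): Row => Boolean =
    condition.map(compile).getOrElse((_: Row) => true)

  // Suggested style: default the expression first, then compile once.
  def viaDefault(condition: Option[Expr]): Row => Boolean =
    compile(condition.getOrElse(TrueLiteral))
}
```

Both forms accept every row when no condition is present. The second keeps a single code path through the compiler, at the cost of compiling a trivial expression.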

SparkQA · 2015-04-24T09:04:34Z

Test build #30924 has finished for PR 5643 at commit 90a69ec.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.
This patch does not change any dependencies.

adrian-wang · 2015-04-27T11:31:49Z

retest this please.

SparkQA · 2015-04-27T18:19:08Z

Test build #30988 has started for PR 5643 at commit ddadf9f.

adrian-wang · 2015-04-28T00:46:37Z

retest this please.

SparkQA · 2015-04-28T02:45:32Z

Test build #31091 has finished for PR 5643 at commit ddadf9f.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.
This patch does not change any dependencies.

scwf · 2015-05-19T06:30:55Z

Ping, can you update this?

SparkQA · 2015-05-19T08:43:42Z

Test build #33063 has finished for PR 5643 at commit 8ef50d4.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

Sephiroth-Lin · 2015-05-29T07:27:28Z

sql/core/src/main/scala/org/apache/spark/sql/execution/joins/BroadcastLeftSemiJoinHash.scala

+          val joinedRow = new JoinedRow
+
+          streamIter.filter(current => {
+            val rowBuffer = broadcastedRelation.value.get(joinKeys.currentValue)


We need to call apply first before reading currentValue; otherwise we will get null for the first row.
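The ordering issue can be sketched with a small stateful extractor. KeyExtractor and semiFilter below are hypothetical and only mimic the apply/currentValue pattern seen in the diff; they are not Spark's projection classes.

```scala
object CurrentValueSketch {
  // Hypothetical stateful key extractor: apply(row) computes and caches the key,
  // while currentValue only returns whatever the last apply call cached.
  final class KeyExtractor(keyOf: Seq[Int] => Seq[Int]) {
    private var cached: Seq[Int] = null
    def apply(row: Seq[Int]): Seq[Int] = { cached = keyOf(row); cached }
    def currentValue: Seq[Int] = cached
  }

  // Correct order: populate the cache for the current row, then read it.
  def semiFilter(rows: Seq[Seq[Int]], keys: Set[Seq[Int]]): Seq[Seq[Int]] = {
    val joinKeys = new KeyExtractor(r => Seq(r(1)))
    rows.filter { row =>
      joinKeys(row) // must run before currentValue is read
      keys.contains(joinKeys.currentValue)
    }
  }
}
```

In this sketch, reading currentValue on a fresh extractor returns null, which matches the symptom described for the first row.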

marmbrus · 2015-06-17T22:13:29Z

Mind bringing this up to date?

SparkQA · 2015-06-18T01:46:44Z

Test build #35076 has finished for PR 5643 at commit 8b8a992.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class IsNull(child: Expression) extends UnaryExpression with Predicate
- case class IsNotNull(child: Expression) extends UnaryExpression with Predicate

SparkQA · 2015-06-18T02:25:13Z

Test build #35084 has finished for PR 5643 at commit 592794d.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2015-06-18T02:52:36Z

Test build #35086 has finished for PR 5643 at commit 455b890.

This patch fails to build.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2015-06-18T06:04:07Z

Test build #35090 has finished for PR 5643 at commit 15f9707.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2015-06-19T08:51:01Z

Test build #35241 has finished for PR 5643 at commit ad0ad59.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

marmbrus · 2015-06-22T22:53:20Z

sql/core/src/main/scala/org/apache/spark/sql/execution/joins/BroadcastLeftSemiJoinHash.scala

+            !joinKeys(current).anyNull && broadcastedRelation.value.contains(joinKeys.currentValue)
+          })
+        }
+      case _ =>


Using pattern matching here makes this a little hard to understand, as I don't think it's very obvious that case _ => implies there is a non-equijoin condition and thus we need to build a full hash table instead of a hash set. Perhaps name the variable nonEquiJoinCondition and use isDefined in an if statement. Some more comments would also be helpful.
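A rough sketch of the suggested shape, using plain Scala collections rather than the operator's real code. NamedConditionSketch, Rec and leftSemi are hypothetical; only the name nonEquiJoinCondition and the isDefined check come from the comment above.

```scala
object NamedConditionSketch {
  // Hypothetical row type: `b` is the equi-join key, `a` feeds the extra condition.
  final case class Rec(a: Int, b: Int)

  def leftSemi(
      left: Seq[Rec],
      right: Seq[Rec],
      nonEquiJoinCondition: Option[(Rec, Rec) => Boolean]): Seq[Rec] = {
    if (nonEquiJoinCondition.isDefined) {
      // A residual condition needs the build-side rows, so keep them grouped by key.
      val cond = nonEquiJoinCondition.get
      val rowsByKey = right.groupBy(_.b)
      left.filter(x => rowsByKey.getOrElse(x.b, Nil).exists(y => cond(x, y)))
    } else {
      // Only key equality matters, so a set of keys is enough.
      val keySet = right.map(_.b).toSet
      left.filter(x => keySet.contains(x.b))
    }
  }
}
```

Naming the option and branching on it explicitly makes each branch state why it builds a map of rows or just a set of keys.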

marmbrus · 2015-06-22T22:57:25Z

Thanks for working on this and sorry for the delay reviewing. Would be great if this could be updated in time for Spark 1.5!

SparkQA · 2015-07-15T10:50:50Z

Test build #37346 has finished for PR 5643 at commit cc09809.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- trait HashSemiJoin

adrian-wang · 2015-07-15T11:34:57Z

@marmbrus

marmbrus · 2015-07-17T23:45:09Z

Thanks! Merging to master.

marmbrus · 2015-07-17T23:49:04Z

sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashSemiJoin.scala

+
+  protected def buildKeyHashSet(
+      buildIter: Iterator[InternalRow],
+      copy: Boolean): java.util.Set[InternalRow] = {


Long term I wonder if it's actually a win for us to build just a set instead of using HashedRelation everywhere. We have done a bunch of optimization on HashedRelation to make it serialize faster.

Maybe we need to implement a version of HashedRelation which only stores the keys.

adrian-wang mentioned this pull request Apr 23, 2015

[SPARK-7026][SQL] Fix bugs when there are non equal join predicates for left semi join #5612

Closed

adrian-wang force-pushed the spark7026 branch from 6eb62d2 to a99f492 Compare April 23, 2015 03:01

scwf reviewed Apr 23, 2015

chenghao-intel reviewed Apr 24, 2015

adrian-wang force-pushed the spark7026 branch from ddadf9f to 8ef50d4 Compare May 19, 2015 06:45

Sephiroth-Lin reviewed May 29, 2015

adrian-wang force-pushed the spark7026 branch from 8ef50d4 to 8b8a992 Compare June 18, 2015 01:39

marmbrus reviewed Jun 22, 2015

adrian-wang added 6 commits July 13, 2015 22:23

merge commits for rebase

8e0afca

fix style

72baa02

fix style

10bf124

fix rebase

27841de

fix notserializable

575a7c8

refactor semijoin and add plan test

cc09809

adrian-wang force-pushed the spark7026 branch from ad0ad59 to cc09809 Compare July 15, 2015 09:04

asfgit closed this in 1707238 Jul 17, 2015

marmbrus reviewed Jul 17, 2015