[SPARK-7026] [SQL] fix left semi join with equi key and non-equi condition #5643

Closed
wants to merge 6 commits into from


adrian-wang
Contributor

When the condition extracted by ExtractEquiJoinKeys contains a join predicate for a left semi join, we cannot plan it as a semi join. For example:

SELECT * FROM testData2 x
LEFT SEMI JOIN testData2 y 
ON x.b = y.b
AND x.a >= y.a + 2

The condition x.a >= y.a + 2 cannot be evaluated on table x alone, so it throws an error.
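To make the intended semantics concrete, here is a minimal sketch in plain Scala collections, not Spark code. The `Rec` type, the `leftSemiJoin` helper, and the six sample rows are assumptions for illustration only; the point is that the extra condition needs a row from both sides, so the build side has to keep the matching rows and not just the keys.

```scala
object SemiJoinSketch {
  // Hypothetical two-column record standing in for a row of testData2.
  final case class Rec(a: Int, b: Int)

  // Left semi join: keep each left row that has at least one right row
  // matching both the equi-join key (b) and the extra condition.
  def leftSemiJoin(
      left: Seq[Rec],
      right: Seq[Rec],
      extra: (Rec, Rec) => Boolean): Seq[Rec] = {
    // Build side: group the right rows by key so the extra condition
    // can be evaluated against every candidate row.
    val byKey: Map[Int, Seq[Rec]] = right.groupBy(_.b)
    left.filter { x =>
      byKey.getOrElse(x.b, Seq.empty).exists(y => extra(x, y))
    }
  }

  // Assumed sample data: a in 1..3 crossed with b in 1..2.
  val testData: Seq[Rec] = for (a <- 1 to 3; b <- 1 to 2) yield Rec(a, b)

  // ON x.b = y.b AND x.a >= y.a + 2
  val result: Seq[Rec] =
    leftSemiJoin(testData, testData, (x, y) => x.a >= y.a + 2)
}
```

With this assumed data, only the rows with a = 3 survive, because they are the only ones for which some y with the same b satisfies x.a >= y.a + 2.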

@SparkQA

SparkQA commented Apr 23, 2015

Test build #30798 has finished for PR 5643 at commit 6eb62d2.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.
  • This patch does not change any dependencies.

@scwf
Contributor

scwf commented Apr 23, 2015

I am not sure it is suitable to broadcast a hash map that contains each key and its related rows. It may be much bigger than the old hash set and may cause OOM issues.

@adrian-wang
Contributor Author

@scwf Of course we can go the old way when there are no additional conditions.

override def execute(): RDD[Row] = {
val buildIter= buildPlan.execute().map(_.copy()).collect().toIterator
val hashSet = new java.util.HashSet[Row]()
val hashMap = new java.util.HashMap[Row, scala.collection.mutable.ArrayBuffer[Row]]()
Contributor

Why was this changed to an ArrayBuffer?

Contributor Author

This was a performance consideration. Anyway, I'm changing it back to HashSet.

@SparkQA

SparkQA commented Apr 23, 2015

Test build #30800 has finished for PR 5643 at commit a99f492.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class FreqItemset(namedtuple("FreqItemset", ["items", "freq"])):
  • This patch does not change any dependencies.

@scwf
Contributor

scwf commented Apr 23, 2015

Yes, I understand that it goes the old way when there is no additional condition. I mean that when there are additional conditions, the broadcast hash map may be much bigger, since you also keep the related rows, which may lead to OOM.
I fixed my PR, please have a look.
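The size concern can be illustrated with a small sketch. It assumes rows are modeled as `Seq[Any]` and uses hypothetical helpers `buildKeySet` and `buildKeyToRows`; neither is the code in this PR. The key-only set holds one entry per distinct key, while the key-to-rows map retains every build-side row.

```scala
import java.util.{HashMap => JHashMap, HashSet => JHashSet}
import scala.collection.mutable.ArrayBuffer

object BuildSideSketch {
  type Row = Seq[Any]

  // Key-only structure: enough when there is no extra condition.
  def buildKeySet(rows: Iterator[Row], key: Row => Row): JHashSet[Row] = {
    val set = new JHashSet[Row]()
    rows.foreach(r => set.add(key(r)))
    set
  }

  // Key-to-rows structure: needed to evaluate an extra condition,
  // but it keeps every build-side row in memory.
  def buildKeyToRows(
      rows: Iterator[Row],
      key: Row => Row): JHashMap[Row, ArrayBuffer[Row]] = {
    val map = new JHashMap[Row, ArrayBuffer[Row]]()
    rows.foreach { r =>
      val k = key(r)
      val existing = map.get(k)
      val bucket =
        if (existing == null) {
          val fresh = new ArrayBuffer[Row]()
          map.put(k, fresh)
          fresh
        } else existing
      bucket.append(r)
    }
    map
  }

  // Assumed sample: three rows share key 1, one row has key 2.
  val sample: List[Row] =
    List(Seq(1, "x"), Seq(1, "y"), Seq(1, "z"), Seq(2, "w"))
  val firstColumn: Row => Row = r => Seq(r(0))

  val keySet = buildKeySet(sample.iterator, firstColumn)
  val keyToRows = buildKeyToRows(sample.iterator, firstColumn)
}
```

In this sample the set stores two keys, while the map stores two keys plus all four rows, which is the growth being discussed for skewed or wide build sides.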

@SparkQA

SparkQA commented Apr 23, 2015

Test build #30805 has finished for PR 5643 at commit cf435db.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
  • This patch does not change any dependencies.

@SparkQA

SparkQA commented Apr 23, 2015

Test build #30808 has finished for PR 5643 at commit 19201e0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
  • This patch does not change any dependencies.

@@ -298,6 +298,13 @@ class SQLQuerySuite extends QueryTest with BeforeAndAfterAll {
)
}

test("left semi greater than predicate and equal operator") {
Contributor

@adrian-wang I suggest you add the case chenghao described in my PR to the unit tests.

Contributor Author

ok

Contributor

create a pr for your branch

Contributor

Closed, since you have added the test.

@SparkQA

SparkQA commented Apr 23, 2015

Test build #30812 has finished for PR 5643 at commit 41c20d5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
  • This patch does not change any dependencies.

@scwf
Contributor

scwf commented Apr 23, 2015

This LGTM

@SparkQA

SparkQA commented Apr 23, 2015

Test build #30818 has finished for PR 5643 at commit 75b8a64.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
  • This patch does not change any dependencies.

@adrian-wang
Contributor Author

retest this please.

@SparkQA

SparkQA commented Apr 23, 2015

Test build #30822 has finished for PR 5643 at commit 75b8a64.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
  • This patch does not change any dependencies.

@SparkQA

SparkQA commented Apr 23, 2015

Test build #30834 has finished for PR 5643 at commit a7c6cc4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
  • This patch does not change any dependencies.

}
}
}
case Some(_) =>
Contributor

I am wondering if this would be simpler if we used HashedRelation instead.

@SparkQA

SparkQA commented Apr 24, 2015

Test build #30905 has finished for PR 5643 at commit d29f9a6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
  • This patch does not change any dependencies.


override val buildSide: BuildSide = BuildRight

override def output: Seq[Attribute] = left.output

@transient private lazy val boundCondition =
condition.map(newPredicate(_, left.output ++ right.output)).getOrElse((row: Row) => true)
Contributor

newPredicate(condition.getOrElse(Literal(true)), left.output ++ right.output)?
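The two spellings being compared can be sketched without Spark. In the sketch below, `Predicate` and `bind` are hypothetical stand-ins (for a condition and for `newPredicate` respectively), so this only shows that defaulting the optional condition before or after binding gives the same behaviour.

```scala
object BoundConditionSketch {
  type Row = Seq[Any]
  type Predicate = Row => Boolean

  // Hypothetical stand-in for binding a condition to the joined output.
  def bind(expr: Predicate): Predicate = expr

  // Form in the diff: bind if present, otherwise accept every row.
  def viaMap(condition: Option[Predicate]): Predicate =
    condition.map(bind).getOrElse((_: Row) => true)

  // Suggested form: default to an always-true condition, then bind once.
  def viaDefault(condition: Option[Predicate]): Predicate =
    bind(condition.getOrElse((_: Row) => true))
}
```

The second form has a single code path, at the cost of binding a trivial condition when none is given.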

@SparkQA

SparkQA commented Apr 24, 2015

Test build #30924 has finished for PR 5643 at commit 90a69ec.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
  • This patch does not change any dependencies.

@adrian-wang
Contributor Author

retest this please.

@SparkQA

SparkQA commented Apr 27, 2015

Test build #30988 has started for PR 5643 at commit ddadf9f.

@adrian-wang
Contributor Author

retest this please.

@SparkQA

SparkQA commented Apr 28, 2015

Test build #31091 has finished for PR 5643 at commit ddadf9f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
  • This patch does not change any dependencies.

@scwf
Contributor

scwf commented May 19, 2015

Ping. Can you update this?

@SparkQA

SparkQA commented May 19, 2015

Test build #33063 has finished for PR 5643 at commit 8ef50d4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

val joinedRow = new JoinedRow

streamIter.filter(current => {
val rowBuffer = broadcastedRelation.value.get(joinKeys.currentValue)
Contributor

We need to call apply before reading currentValue; otherwise we will get null for the first row.
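The ordering problem can be reproduced with a small sketch. `KeyProjection` below is a hypothetical stand-in for a mutable key projection that caches its last result; it is not the Spark class. The filter evaluates the projection on the current row first and only then uses the resulting key.

```scala
object ProjectionSketch {
  type Row = Seq[Any]

  // Hypothetical mutable projection: currentValue is only meaningful
  // after apply has been called at least once.
  final class KeyProjection(keyOrdinals: Seq[Int]) {
    private var current: Row = null
    def currentValue: Row = current
    def apply(row: Row): Row = {
      current = keyOrdinals.map(i => row(i))
      current
    }
  }

  // Apply the projection to each streamed row before probing the build side.
  def semiJoinFilter(
      stream: Iterator[Row],
      buildKeys: Set[Row],
      joinKeys: KeyProjection): List[Row] =
    stream.filter { row =>
      val key = joinKeys(row) // apply first, then use the key
      !key.contains(null) && buildKeys.contains(key)
    }.toList

  // Assumed sample data for illustration.
  val buildKeys: Set[Row] = Set(Seq(1), Seq(3))
  val streamed: List[Row] =
    List(Seq(1, "a"), Seq(2, "b"), Seq(null, "c"), Seq(3, "d"))
  val kept: List[Row] =
    semiJoinFilter(streamed.iterator, buildKeys, new KeyProjection(Seq(0)))
}
```

Reading `currentValue` before the first `apply` returns null in this sketch, which is the failure mode the comment describes.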

@marmbrus
Contributor

Mind bringing this up to date?

@SparkQA

SparkQA commented Jun 18, 2015

Test build #35076 has finished for PR 5643 at commit 8b8a992.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class IsNull(child: Expression) extends UnaryExpression with Predicate
    • case class IsNotNull(child: Expression) extends UnaryExpression with Predicate

@SparkQA

SparkQA commented Jun 18, 2015

Test build #35084 has finished for PR 5643 at commit 592794d.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 18, 2015

Test build #35086 has finished for PR 5643 at commit 455b890.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 18, 2015

Test build #35090 has finished for PR 5643 at commit 15f9707.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 19, 2015

Test build #35241 has finished for PR 5643 at commit ad0ad59.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

!joinKeys(current).anyNull && broadcastedRelation.value.contains(joinKeys.currentValue)
})
}
case _ =>
Contributor

Using pattern matching here makes this a little hard to understand, as I don't think it's very obvious that case _ => implies there is a non-equijoin condition and thus we need to build a full hash table instead of a hash set. Perhaps name the variable nonEquiJoinCondition and use isDefined in an if statement. Some more comments would also be helpful.
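One possible shape for that suggestion, as a sketch only: the `BuildStructure` type and `chooseBuildStructure` function are hypothetical names, and the condition is modeled as an optional string rather than a Spark expression.

```scala
object BuildChoiceSketch {
  sealed trait BuildStructure
  case object KeyHashSet extends BuildStructure
  case object FullHashTable extends BuildStructure

  def chooseBuildStructure(nonEquiJoinCondition: Option[String]): BuildStructure =
    if (nonEquiJoinCondition.isDefined) {
      // The extra condition needs the build-side rows, not just the keys.
      FullHashTable
    } else {
      // Equi-join keys only: a set of keys is sufficient.
      KeyHashSet
    }
}
```

Naming the option and branching on `isDefined` states the reason for each branch directly, which a bare `case _ =>` does not.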

@marmbrus
Contributor

Thanks for working on this and sorry for the delay reviewing. Would be great if this could be updated in time for Spark 1.5!

@SparkQA

SparkQA commented Jul 15, 2015

Test build #37346 has finished for PR 5643 at commit cc09809.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • trait HashSemiJoin

@adrian-wang
Contributor Author

@marmbrus

@marmbrus
Contributor

Thanks! Merging to master.

@asfgit asfgit closed this in 1707238 Jul 17, 2015

protected def buildKeyHashSet(
buildIter: Iterator[InternalRow],
copy: Boolean): java.util.Set[InternalRow] = {
Contributor

Long term, I wonder if it's actually a win for us to build just a set instead of using HashedRelation everywhere. We have done a bunch of optimization on HashedRelation to make it serialize faster.

Contributor Author

Maybe we need to implement a version of HashedRelation that only stores the keys.
