[SPARK-32330][SQL] Preserve shuffled hash join build side partitioning #29130

c21 · 2020-07-16T05:49:49Z

What changes were proposed in this pull request?

Currently ShuffledHashJoin.outputPartitioning inherits from HashJoin.outputPartitioning, which only preserves stream side partitioning (HashJoin.scala):

override def outputPartitioning: Partitioning = streamedPlan.outputPartitioning

This loses the build side partitioning information and causes an extra shuffle if there is another join / group-by after this join.

Example:

withSQLConf(
    SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "50",
    SQLConf.SHUFFLE_PARTITIONS.key -> "2",
    SQLConf.PREFER_SORTMERGEJOIN.key -> "false") {
  val df1 = spark.range(10).select($"id".as("k1"))
  val df2 = spark.range(30).select($"id".as("k2"))
  Seq("inner", "cross").foreach(joinType => {
    val plan = df1.join(df2, $"k1" === $"k2", joinType).groupBy($"k1").count()
      .queryExecution.executedPlan
    assert(plan.collect { case _: ShuffledHashJoinExec => true }.size === 1)
    // No extra shuffle before aggregate
    assert(plan.collect { case _: ShuffleExchangeExec => true }.size === 2)
  })
}

Current physical plan (having an extra shuffle on k1 before aggregate)

*(4) HashAggregate(keys=[k1#220L], functions=[count(1)], output=[k1#220L, count#235L])
+- Exchange hashpartitioning(k1#220L, 2), true, [id=#117]
   +- *(3) HashAggregate(keys=[k1#220L], functions=[partial_count(1)], output=[k1#220L, count#239L])
      +- *(3) Project [k1#220L]
         +- ShuffledHashJoin [k1#220L], [k2#224L], Inner, BuildLeft
            :- Exchange hashpartitioning(k1#220L, 2), true, [id=#109]
            :  +- *(1) Project [id#218L AS k1#220L]
            :     +- *(1) Range (0, 10, step=1, splits=2)
            +- Exchange hashpartitioning(k2#224L, 2), true, [id=#111]
               +- *(2) Project [id#222L AS k2#224L]
                  +- *(2) Range (0, 30, step=1, splits=2)

Ideal physical plan (no shuffle on k1 before aggregate)

*(3) HashAggregate(keys=[k1#220L], functions=[count(1)], output=[k1#220L, count#235L])
+- *(3) HashAggregate(keys=[k1#220L], functions=[partial_count(1)], output=[k1#220L, count#239L])
   +- *(3) Project [k1#220L]
      +- ShuffledHashJoin [k1#220L], [k2#224L], Inner, BuildLeft
         :- Exchange hashpartitioning(k1#220L, 2), true, [id=#107]
         :  +- *(1) Project [id#218L AS k1#220L]
         :     +- *(1) Range (0, 10, step=1, splits=2)
         +- Exchange hashpartitioning(k2#224L, 2), true, [id=#109]
            +- *(2) Project [id#222L AS k2#224L]
               +- *(2) Range (0, 30, step=1, splits=2)

This can be fixed by overriding the outputPartitioning method in ShuffledHashJoinExec, as SortMergeJoinExec already does.
This PR also fixes a typo in HashJoin, since that code path is shared between broadcast hash join and shuffled hash join.
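
To illustrate why reporting only the stream side partitioning forces the extra Exchange, here is a minimal toy model. The types below are hypothetical stand-ins written for this sketch (they only mimic the names of Spark's Partitioning classes), and the clustering check is a simplified assumption about how a required distribution is matched.

```scala
// Toy model only: these types mimic, but are not, Spark's Partitioning classes.
object PartitioningSketch {
  sealed trait Partitioning {
    // True if rows with equal values of `keys` are guaranteed to be co-located.
    def satisfiesClustering(keys: Set[String]): Boolean
  }

  // Rows are hashed on `exprs`; clustering on any superset of `exprs` is satisfied.
  final case class HashPartitioning(exprs: Seq[String], numPartitions: Int)
      extends Partitioning {
    def satisfiesClustering(keys: Set[String]): Boolean =
      exprs.nonEmpty && exprs.forall(keys.contains)
  }

  // The output is simultaneously partitioned in every listed way.
  final case class PartitioningCollection(parts: Seq[Partitioning])
      extends Partitioning {
    def satisfiesClustering(keys: Set[String]): Boolean =
      parts.exists(_.satisfiesClustering(keys))
  }

  val buildSide: Partitioning = HashPartitioning(Seq("k1"), 2)  // left child (BuildLeft)
  val streamSide: Partitioning = HashPartitioning(Seq("k2"), 2) // right child

  // Before: only the stream side partitioning is reported.
  val before: Partitioning = streamSide
  // After (inner join): both children's partitionings are reported.
  val after: Partitioning = PartitioningCollection(Seq(buildSide, streamSide))

  def main(args: Array[String]): Unit = {
    println(before.satisfiesClustering(Set("k1"))) // false: a shuffle on k1 is needed
    println(after.satisfiesClustering(Set("k1")))  // true: the aggregate can reuse the layout
  }
}
```

In this model the group-by on k1 is only satisfied once the join reports the build side's hash partitioning as well, which matches the difference between the two plans above.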

Why are the changes needed?

To avoid an unnecessary shuffle in queries that have multiple joins or a group-by after the join, which saves CPU and IO.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Added a unit test in JoinSuite.

c21 · 2020-07-16T05:51:13Z

cc @maropu, @cloud-fan, @gatorsmile and @sameeragarwal if you guys can help take a look. Thanks!

SparkQA · 2020-07-16T12:37:22Z

Test build #125951 has finished for PR 29130 at commit dface2a.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2020-07-17T10:26:05Z

Is it covered by #28676? cc @imback82

c21 · 2020-07-17T14:27:58Z

@cloud-fan and @imback82 - I was not aware of #28676 before making this PR. After checking #28676, the TL;DR is that I think we are solving different issues and there's no code conflict between the two.

#28676: preserves broadcast hash join build side partitioning for inner join if the stream side is hash partitioned. @imback82 It's a great idea that I had never thought about before. I bet our users have hit this issue in production, but I think our response was just to ask them to disable broadcast join (i.e. run SMJ on the small table instead of broadcasting it; the cost is small, as the table is small enough to be broadcast), so that the partitioning info gets propagated through the query plan and the subsequent shuffle can be saved. #28676 handles this automatically on the Spark side, which should be better.

This PR: preserves shuffled hash join build side partitioning, which is a much smaller, trivial change compared to handling broadcast hash join. Because shuffled hash join has the same required child distribution as sort merge join, its output partitioning should be the same as sort merge join's (except that it cannot handle the full outer join case). We found this issue when our users ran shuffled hash joins on bucketed tables, followed by a join / group-by on the build side. We have run this in production for more than one year. No config is needed for this feature, and IMO it could be enabled by default.

What do you think? Thanks.

cloud-fan · 2020-07-17T14:31:32Z

sql/core/src/main/scala/org/apache/spark/sql/execution/joins/ShuffledHashJoinExec.scala

@@ -47,6 +47,18 @@ case class ShuffledHashJoinExec(
    "buildDataSize" -> SQLMetrics.createSizeMetric(sparkContext, "data size of build side"),
    "buildTime" -> SQLMetrics.createTimingMetric(sparkContext, "time to build hash map"))

+  override def outputPartitioning: Partitioning = joinType match {


This is exactly the same as SMJ. Shall we create a common trait ShuffleJoin to put it?

@cloud-fan - sort merge join has an extra case to handle full outer join. I am thinking of handling all other join types in the parent trait ShuffleJoin, and overriding outputPartitioning in SortMergeJoinExec to handle the extra FullOuter case. What do you think?

To me, though, it's kind of weird for ShuffleJoin not to handle FullOuter, since a shuffled FullOuter join is one kind of ShuffleJoin. But if ShuffleJoin handles FullOuter, it also seems weird for ShuffledHashJoinExec to extend it.

Wondering what you think. The change itself is easy. Thanks.

Why does shuffle hash join not support FullOuter?

> Why does shuffle hash join not support FullOuter?

@cloud-fan sorry if I'm missing anything, but isn't this true today? In the current Spark implementation of hash join, the stream side looks up keys in the build side hash map, so it can handle non-matching keys from the stream side when there is no match in the build side hash map. But it cannot handle non-matching keys from the build side, because no information from the stream side is persisted.

An interesting follow-up could be to handle full outer join in shuffled hash join: when looking up stream side keys in the build side HashedRelation, mark the matched rows inside the HashedRelation; after reading all rows from the stream side, output all non-matching rows from the build side based on the modified HashedRelation.
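
The follow-up idea above can be sketched on plain Scala collections. This is only an illustration of the "mark matched build rows, then emit the unmatched ones" approach; it does not use Spark's HashedRelation, and the function name and row representation are hypothetical.

```scala
import scala.collection.mutable

object FullOuterHashJoinSketch {
  // Returns (buildValue, streamValue) pairs; None marks the side with no match.
  def fullOuterJoin[K, B, S](
      build: Seq[(K, B)],
      stream: Seq[(K, S)]): Seq[(Option[B], Option[S])] = {
    val buildRows = build.toIndexedSeq
    // Build phase: hash map from join key to the indices of the build rows with that key.
    val indexByKey: Map[K, Seq[Int]] =
      buildRows.zipWithIndex
        .groupBy { case ((k, _), _) => k }
        .map { case (k, rows) => k -> rows.map(_._2) }
    // Extra state for full outer join: which build rows have been matched at least once.
    val matched = mutable.BitSet.empty
    val out = mutable.ArrayBuffer.empty[(Option[B], Option[S])]

    // Probe phase: stream rows look up the build side, as in a left/right outer join.
    for ((k, s) <- stream) {
      indexByKey.get(k) match {
        case Some(idxs) =>
          idxs.foreach { i =>
            matched += i
            out += ((Some(buildRows(i)._2), Some(s)))
          }
        case None =>
          out += ((None, Some(s))) // stream row without a build match
      }
    }

    // Final phase: emit build rows that no stream row matched.
    for (i <- buildRows.indices if !matched(i)) {
      out += ((Some(buildRows(i)._2), None))
    }
    out.toSeq
  }

  def main(args: Array[String]): Unit = {
    val result = fullOuterJoin(Seq(1 -> "a", 2 -> "b"), Seq(2 -> "x", 3 -> "y"))
    result.foreach(println)
  }
}
```

The only addition over an ordinary outer hash join is the `matched` bit set, which is the information that the current implementation does not keep.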

Can we simply move SortMergeJoinExec.outputPartitioning to the parent trait? It works for ShuffledHashJoinExec as well, as the planner guarantees ShuffledHashJoinExec.joinType won't be FullOuter

@cloud-fan I agree. Updated.
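
For readers following along, the shape of the agreed refactoring can be pictured with a toy model. Everything below is a stand-in defined for this sketch rather than Spark's actual classes: children are reduced to their partitionings, the semi/anti join cases are omitted, and the FullOuter branch is simply never reached for the hash join because the planner does not produce that combination.

```scala
// Toy model of a shared trait for shuffle-based joins; not Spark's actual code.
object ShuffledJoinSketch {
  sealed trait JoinType
  case object Inner extends JoinType
  case object LeftOuter extends JoinType
  case object RightOuter extends JoinType
  case object FullOuter extends JoinType

  sealed trait Partitioning { def numPartitions: Int }
  final case class HashPartitioning(key: String, numPartitions: Int) extends Partitioning
  final case class UnknownPartitioning(numPartitions: Int) extends Partitioning
  final case class PartitioningCollection(parts: Seq[Partitioning]) extends Partitioning {
    def numPartitions: Int = parts.head.numPartitions
  }

  // Shared logic: the output partitioning depends only on the join type and
  // the partitionings of the two (already shuffled) children.
  trait ShuffledJoin {
    def joinType: JoinType
    def left: Partitioning
    def right: Partitioning

    def outputPartitioning: Partitioning = joinType match {
      case Inner      => PartitioningCollection(Seq(left, right))
      case LeftOuter  => left
      case RightOuter => right
      case FullOuter  => UnknownPartitioning(left.numPartitions)
    }
  }

  final case class SortMergeJoin(joinType: JoinType, left: Partitioning, right: Partitioning)
      extends ShuffledJoin

  // Inherits the same logic; the FullOuter case is unused for this operator.
  final case class ShuffledHashJoin(joinType: JoinType, left: Partitioning, right: Partitioning)
      extends ShuffledJoin

  def main(args: Array[String]): Unit = {
    val l = HashPartitioning("k1", 2)
    val r = HashPartitioning("k2", 2)
    println(ShuffledHashJoin(Inner, l, r).outputPartitioning)
    println(SortMergeJoin(FullOuter, l, r).outputPartitioning)
  }
}
```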

imback82

+1 as well

imback82 · 2020-07-17T17:55:36Z

sql/core/src/test/scala/org/apache/spark/sql/JoinSuite.scala

+
+  test("SPARK-32330: Preserve shuffled hash join build side partitioning") {
+    withSQLConf(
+        SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "50",


nit: set it to "-1" to make the intention (turning off broadcast join) clear?

@imback82 the query planner depends on this config being carefully tuned here to trigger shuffled hash join.

Ah OK. Thanks!
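
Why "50" rather than "-1" matters can be illustrated with a simplified model of join selection. The three predicates below are an assumption about how the planner of that era chose between join operators, reduced to plain size comparisons; the estimated sizes of 80 and 240 bytes (8 bytes per long value for 10 and 30 rows) are also assumptions made for this sketch, and all names are hypothetical.

```scala
// Simplified, assumed model of join selection; not Spark's actual planner code.
object JoinSelectionSketch {
  final case class Conf(
      autoBroadcastJoinThreshold: Long,
      numShufflePartitions: Int,
      preferSortMergeJoin: Boolean)

  // A side can be broadcast if its estimated size is within the threshold.
  def canBroadcast(size: Long, c: Conf): Boolean =
    size >= 0 && size <= c.autoBroadcastJoinThreshold

  // A per-partition hash map is considered small enough if the whole side is
  // below threshold * number of shuffle partitions.
  def canBuildLocalHashMap(size: Long, c: Conf): Boolean =
    size < c.autoBroadcastJoinThreshold * c.numShufflePartitions

  // The build side must be at least three times smaller than the other side.
  def muchSmaller(a: Long, b: Long): Boolean = a * 3 <= b

  def chooseJoin(leftSize: Long, rightSize: Long, c: Conf): String =
    if (canBroadcast(leftSize, c) || canBroadcast(rightSize, c)) {
      "BroadcastHashJoin"
    } else if (!c.preferSortMergeJoin &&
        ((canBuildLocalHashMap(leftSize, c) && muchSmaller(leftSize, rightSize)) ||
          (canBuildLocalHashMap(rightSize, c) && muchSmaller(rightSize, leftSize)))) {
      "ShuffledHashJoin"
    } else {
      "SortMergeJoin"
    }

  def main(args: Array[String]): Unit = {
    // Settings from the test: threshold 50, 2 shuffle partitions, SMJ not preferred.
    println(chooseJoin(80L, 240L, Conf(50L, 2, preferSortMergeJoin = false))) // ShuffledHashJoin
    // With the threshold set to -1 the local hash map check can never pass.
    println(chooseJoin(80L, 240L, Conf(-1L, 2, preferSortMergeJoin = false))) // SortMergeJoin
  }
}
```

Under these assumptions, a threshold of 50 is small enough to rule out broadcasting the 80-byte side but large enough (50 * 2 = 100) to allow a local hash map, whereas -1 would disable both and fall back to sort merge join.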

SparkQA · 2020-07-17T20:51:15Z

Test build #126058 has finished for PR 29130 at commit f9479b6.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
trait ShuffledJoin extends BaseJoinExec

c21 · 2020-07-18T00:08:13Z

retest this please

SparkQA · 2020-07-18T04:54:01Z

Test build #126081 has finished for PR 29130 at commit f9479b6.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
trait ShuffledJoin extends BaseJoinExec

cloud-fan · 2020-07-20T14:38:39Z

thanks, merging to master!

c21 · 2020-07-20T15:10:43Z

Thank you @cloud-fan for review!

Preserve shuffled hash join build side partitioning

dface2a

probot-autolabeler bot added the SQL label Jul 16, 2020

cloud-fan reviewed Jul 17, 2020

Create a trait ShuffledJoin for SMJ and SHJ

f9479b6

cloud-fan approved these changes Jul 17, 2020

imback82 approved these changes Jul 17, 2020

cloud-fan closed this in fe07521 Jul 20, 2020

c21 deleted the shj branch July 20, 2020 19:02

c21 mentioned this pull request Jul 22, 2020

[SPARK-32383][SQL] Preserve hash join (BHJ and SHJ) stream side ordering #29181

Closed
