
[SPARK-17556][SQL] Executor side broadcast for broadcast joins #15178

Closed
wants to merge 12 commits into from

Conversation


@viirya viirya commented Sep 21, 2016

What changes were proposed in this pull request?

Spark's existing broadcast mechanism collects the result of an RDD to the driver and then broadcasts it, which introduces extra latency. We can instead broadcast the RDD directly from the executors. This patch implements broadcast from executors and applies it to broadcast joins in Spark SQL.
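For contrast, a minimal sketch of the driver-side pattern this patch avoids, using the standard collect and sc.broadcast APIs:

    // Driver-side broadcast: the RDD result is first collected to the driver,
    // then redistributed to the executors via sc.broadcast.
    val rdd = sc.parallelize(1 to 4, 2)
    val collected = rdd.collect()              // all data moves through the driver
    val broadcasted = sc.broadcast(collected)
    sc.parallelize(1 to 2, 2).map(_ => broadcasted.value.sum).collect()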

The advantages of executor-side broadcast:

  • The RDD data doesn't need to be collected to the driver before broadcasting
  • The driver is no longer a bottleneck for data transmission at the beginning of the broadcast

Design document: https://issues.apache.org/jira/secure/attachment/12831201/executor-side-broadcast.pdf

Major API changes

  • New API broadcastRDDOnExecutor in SparkContext

    It takes two parameters, rdd: RDD[T] and mode: BroadcastMode[T]. It broadcasts the content of the rdd between executors without collecting it back to the driver. mode is used to convert the content of the rdd into the broadcast object.

    Besides T, this API has another type parameter U, which is the type of the converted object.

  • New Broadcast implementation TorrentExecutorBroadcast

    Unlike TorrentBroadcast, this implementation doesn't divide and store the object data to be broadcast in the driver. The executors use local and remote fetches to retrieve the blocks of the RDD and convert the RDD content into the broadcast object.

  • BroadcastMode is moved from org.apache.spark.sql.catalyst.plans.physical to org.apache.spark.broadcast

    It now takes a type parameter T, which is the type of the data being converted into the broadcast object on the executors. A rough sketch of the reshaped trait is shown below.
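    A minimal sketch of what the reshaped trait looks like to callers, inferred from the usage example below; the exact members in the patch may differ:

    trait BroadcastMode[T] extends Serializable {
      // Converts the rows of the RDD (of element type T) into the object that
      // will be broadcast to the executors, e.g. an array or a hashed relation.
      def transform(rows: Array[T]): Any
    }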

Usage: How to use executor-side broadcast

To broadcast the result of an RDD without collecting it back to the driver and broadcasting it from there, we can use the executor-side broadcast feature proposed here.

  1. Prepare the RDD to be broadcast

    // To broadcast the RDD on executors,
    // we should materialize and cache the result of the RDD
    val rdd = sc.parallelize(1 to 4, 2).cache()
    rdd.count()
    
  2. Define how to transform the result of the RDD with BroadcastMode

    val mode = new BroadcastMode[Int] {
      override def transform(rows: Array[Int]): Array[Int] = rows
    }
    
  3. Broadcast the RDD and use the broadcast variable

    val broadcastedVal = sc.broadcastRDDOnExecutor[Int, Array[Int]](rdd, mode)
    val collected = sc.parallelize(1 to 2, 2).map { _ =>
      broadcastedVal.value.reduce(_ + _) // 1 + 2 + 3 + 4 = 10
    }.collect()
    assert(collected.sum == 20)
    

How was this patch tested?

Jenkins tests.


SparkQA commented Sep 21, 2016

Test build #65709 has finished for PR 15178 at commit 57987d1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • trait BroadcastMode[T]
    • case class BroadcastDistribution(mode: BroadcastMode[InternalRow]) extends Distribution
    • case class BroadcastPartitioning(mode: BroadcastMode[InternalRow]) extends Partitioning
    • case class BroadcastExchangeExec[T: ClassTag](


viirya commented Sep 23, 2016

cc @rxin Can you help review this? Thanks.

@viirya viirya changed the title [SPARK-17556] Executor side broadcast for broadcast joins [SPARK-17556][SQL] Executor side broadcast for broadcast joins Sep 26, 2016

holdenk commented Sep 28, 2016

This is really interesting and might enable some nice improvements for online ML training in structured streaming as well :) I've got a few questions around unpersist behaviour, but I'll dig into this PR more next week. Hopefully @rxin or @JoshRosen can also take a look :)

* [[org.apache.spark.broadcast.Broadcast]] object for reading it in distributed functions.
* The variable will be sent to each cluster only once.
*/
def broadcastRDDOnExecutor[T: ClassTag, U: ClassTag](
Contributor

We might want to mark this as a developer API for now?

Member Author

Yes.

* Remove all persisted state associated with this Torrent broadcast on the executors.
*/
override protected def doUnpersist(blocking: Boolean) {
TorrentBroadcast.unpersist(id, removeFromDriver = false, blocking)
Contributor

@holdenk holdenk Sep 28, 2016

Similar comment to as above - does this do what we want?

val data = b.data.asInstanceOf[Iterator[T]].toArray
// We found the block from remote executors' BlockManager, so put the block
// in this executor's BlockManager.
if (!bm.putIterator(pieceId, data.toIterator,
Contributor

So we're storing an RDD pieceId here, but I think unpersist only removes blocks with a BroadcastBlockId and the correct ID. Maybe it would be good to add a test around unpersisting to verify it's behaving as expected?

Member Author

For RDDs, there is already a cleaning mechanism that removes the persisted pieces once the RDD is no longer referenced. Because we fetch and use RDD pieces here, rather than the broadcast pieces used by driver-side broadcast, I think it's fine to leave the cleanup to that existing mechanism.

Member Author

One solution might be to store the fetched RDD pieces under broadcast piece IDs, so that unpersist can remove all the fetched pieces. However, we would then have to fetch both RDD piece IDs and broadcast piece IDs from other executors under the BitTorrent-like approach. So I'd prefer the approach above and let the current cleaning mechanism do its work.
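A small illustration of the two block-ID kinds being discussed, assuming the existing org.apache.spark.storage classes: broadcast unpersist only matches BroadcastBlockId entries, so blocks kept under an RDDBlockId are left to the RDD's own cleanup path.

    import org.apache.spark.storage.{BroadcastBlockId, RDDBlockId}

    // Broadcast cleanup looks for blocks named like this:
    val broadcastPiece = BroadcastBlockId(0L, field = "piece0")
    // ...while the pieces fetched in this PR stay under RDD block IDs,
    // which only RDD unpersist / the ContextCleaner removes:
    val rddPiece = RDDBlockId(rddId = 1, splitIndex = 0)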


holdenk commented Sep 29, 2016

Actually, another (related) question is what happens when the RDD backing the broadcast is unpersisted? I think it would be good to have tests around this as well.


viirya commented Sep 29, 2016

@holdenk If the RDD pieces have already been fetched on the executors, it won't affect the broadcast object. If the fetching isn't done yet, reading the blocks will fail. Let me think about whether I can add a test for it.


SparkQA commented Sep 29, 2016

Test build #66101 has finished for PR 15178 at commit f50cf31.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


holdenk commented Sep 29, 2016

@viirya what I mean is that right now, I think the executors will fetch the blocks, and those blocks might not get cleaned up once the broadcast is destroyed. You could add a test to see whether the blocks are still present everywhere after unpersist. The other question is that if someone broadcasts a cached RDD and then unpersists it, I'm worried it might clean up the broadcast blocks on the executors as well. You could add a test to see whether the broadcast can still be used after the backing RDD is unpersisted (or, if we don't want to support that use case, add a note about it to the docs and make sure it fails in a clear manner).


viirya commented Sep 30, 2016

@holdenk For the first question: since executor-side broadcast uses the RDD blocks directly instead of creating broadcast blocks, destroying the broadcast cleans up only the broadcast object. I will add a test for this.

For the second question, it depends on whether the broadcast object has already been created on the executors. If it has, unpersisting the RDD doesn't affect it. If it hasn't, the executors will fail when they try to fetch the RDD blocks and create the object.
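A rough sketch of the first scenario (not the actual test added to the patch), assuming a local SparkContext and reusing the rdd and mode definitions from the usage example in the description:

    // Destroying the broadcast should remove only the broadcast object;
    // the cached RDD blocks should remain usable afterwards.
    val rdd = sc.parallelize(1 to 4, 2).cache()
    rdd.count()
    val bc = sc.broadcastRDDOnExecutor[Int, Array[Int]](rdd, mode)
    bc.destroy()
    assert(rdd.count() == 4)  // the backing RDD is still cached and intact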


viirya commented Sep 30, 2016

@holdenk I added the test cases.

@viirya viirya force-pushed the broadcast-on-executors branch from 1339daf to 3494728 Compare September 30, 2016 07:23

SparkQA commented Sep 30, 2016

Test build #66162 has finished for PR 15178 at commit 1339daf.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • trait BroadcastMode[T] extends Serializable


viirya commented Sep 30, 2016

retest this please.


SparkQA commented Sep 30, 2016

Test build #66163 has finished for PR 15178 at commit 3494728.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • trait BroadcastMode[T] extends Serializable


SparkQA commented Sep 30, 2016

Test build #66165 has finished for PR 15178 at commit 3494728.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • trait BroadcastMode[T] extends Serializable


holdenk commented Sep 30, 2016

So if the primary use of this is inside SQL, then that might be OK (because we can just be very careful about it) - but since we are also exposing it to users, it feels like this behaviour will probably catch some people by surprise (and at the very least we should document it). Maybe it would make sense to update the cleaning logic somehow, or store the blocks differently so the current cleaning logic behaves as expected - but it would be really good to hear what @rxin or @JoshRosen think about this because I'm a little uncertain.


viirya commented Sep 30, 2016

@holdenk After rethinking this, I have an idea for avoiding this surprise. Let me refactor it, and please review it then. Thanks!


holdenk commented Sep 30, 2016

Awesome - so excited to see this :)


viirya commented Oct 1, 2016

@holdenk This is updated.

Now the only requirement is that the RDD to be broadcast on the executor side is cached and materialized first. Broadcast blocks are created immediately on the executors. Once you have the broadcast variable, you can unpersist the RDD and the broadcast variable can still be used.
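A minimal sketch of the updated behaviour under the refactored patch, reusing the names from the usage example in the description:

    val rdd = sc.parallelize(1 to 4, 2).cache()
    rdd.count()  // materialize the cached blocks first

    val broadcastedVal = sc.broadcastRDDOnExecutor[Int, Array[Int]](rdd, mode)

    // Broadcast blocks were already created on the executors, so unpersisting
    // the backing RDD no longer invalidates the broadcast variable.
    rdd.unpersist(blocking = true)
    val sums = sc.parallelize(1 to 2, 2).map(_ => broadcastedVal.value.sum).collect()
    assert(sums.forall(_ == 10))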


SparkQA commented Oct 1, 2016

Test build #66196 has finished for PR 15178 at commit 17b4470.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • abstract class RowBroadcastMode extends BroadcastMode[InternalRow]
    • case class BroadcastDistribution(mode: RowBroadcastMode) extends Distribution
    • case class BroadcastPartitioning(mode: RowBroadcastMode) extends Partitioning


SparkQA commented Oct 1, 2016

Test build #66199 has finished for PR 15178 at commit 0440cc7.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


viirya commented Oct 1, 2016

retest this please.


SparkQA commented Oct 1, 2016

Test build #66204 has finished for PR 15178 at commit 0440cc7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Jan 4, 2017

Test build #70877 has finished for PR 15178 at commit 1b499d1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


viirya commented Feb 8, 2017

ping @rxin Do you have any more thoughts or feedback for this? Thanks.


SparkQA commented Jun 13, 2017

Test build #77963 has finished for PR 15178 at commit 34a49d5.

  • This patch fails to generate documentation.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Jun 13, 2017

Test build #77967 has finished for PR 15178 at commit 4b31c7e.

  • This patch fails to generate documentation.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya viirya force-pushed the broadcast-on-executors branch from 4b31c7e to 59d96d7 Compare June 13, 2017 06:14

SparkQA commented Jun 13, 2017

Test build #77968 has finished for PR 15178 at commit 59d96d7.

  • This patch fails to generate documentation.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya viirya force-pushed the broadcast-on-executors branch from 59d96d7 to ecebb3f Compare June 13, 2017 06:37

SparkQA commented Jun 13, 2017

Test build #77972 has started for PR 15178 at commit ecebb3f.


viirya commented Jun 13, 2017

retest this please.


SparkQA commented Jun 13, 2017

Test build #77979 has finished for PR 15178 at commit ecebb3f.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Jun 13, 2017

Test build #77992 has finished for PR 15178 at commit ef987ae.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


viirya commented Jun 13, 2017

retest this please.


SparkQA commented Jun 14, 2017

Test build #78011 has finished for PR 15178 at commit ef987ae.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


viirya commented Sep 11, 2017

@rxin Do we still consider to incorporate this broadcast on executor feature? Thanks.


SparkQA commented Sep 11, 2017

Test build #81636 has finished for PR 15178 at commit 93ecabb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


viirya commented Oct 31, 2017

retest this please.


SparkQA commented Oct 31, 2017

Test build #83255 has finished for PR 15178 at commit a1f2faa.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member

The current solution could OOM the executors, whose memory is normally much smaller than the driver's. We might also see a performance regression when the number of partitions is large and the partition sizes are small.

How about closing this PR now? We can revisit it when we need this feature.


viirya commented Nov 1, 2017

Well, there are many aspects of this feature that could be discussed. I agree we can close this for now, since it has been open for a while and there seems to be no urgent need for it.

@viirya viirya closed this Nov 1, 2017
if (dataSize >= (8L << 30)) {
// Call persist on the RDD because we want to broadcast the RDD blocks on executors.
childRDD = child.execute().mapPartitionsInternal { rowIterator =>
rowIterator.map(_.copy())


I know the PR has been closed, but I was interested in understanding the code.
Why is a copy of each row made before persisting the child RDD?

Member Author

In Spark SQL, the underlying mechanism reuses a single object (the row here) while iterating. We must manually copy the row in the map before persisting; otherwise the persisted result ends up containing the same row repeated.
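A simplified, self-contained illustration of the object-reuse pitfall (plain Scala with a hypothetical MutableRow class, not the actual Spark SQL row types):

    // The iterator below reuses one mutable object, mimicking how Spark SQL
    // reuses a single row while iterating.
    class MutableRow(var value: Int)

    val reused = new MutableRow(0)
    def rows: Iterator[MutableRow] = (1 to 4).iterator.map { i => reused.value = i; reused }

    val withoutCopy = rows.toArray.map(_.value)                                    // Array(4, 4, 4, 4)
    val withCopy    = rows.map(r => new MutableRow(r.value)).toArray.map(_.value)  // Array(1, 2, 3, 4)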


amoghmargoor commented Apr 11, 2019

@viirya Thanks for this diff.
We found one issue here, which I wanted to point out in case somebody wants to use this patch.
There are references to broadcast.value in BroadcastHashJoinExec that are executed on the driver. Because of the code-generation flow, that can pull the broadcast RDD values into the driver's block manager as well. To fix it, we took a shortcut and avoided one hash-join optimization in codegen for the case where the keys on the build side are unique. I'm not sure whether there is a solution that doesn't require sacrificing that optimization.

@viirya viirya deleted the broadcast-on-executors branch December 27, 2023 18:34