[SPARK-17556][SQL] Executor side broadcast for broadcast joins #15178
Conversation
Test build #65709 has finished for PR 15178 at commit
cc @rxin Can you help review this? Thanks. |
This is really interesting and might enable some nice improvements for online ML training in structured streaming as well :) I've got a few questions around unpersist behaviour, but I'll dig into this PR more next week. Hopefully @rxin or @JoshRosen can also take a look :) |
 * [[org.apache.spark.broadcast.Broadcast]] object for reading it in distributed functions.
 * The variable will be sent to each cluster only once.
 */
def broadcastRDDOnExecutor[T: ClassTag, U: ClassTag](
We might want to mark this as a developer API for now?
Yes.
 * Remove all persisted state associated with this Torrent broadcast on the executors.
 */
override protected def doUnpersist(blocking: Boolean) {
  TorrentBroadcast.unpersist(id, removeFromDriver = false, blocking)
Similar comment as above: does this do what we want?
val data = b.data.asInstanceOf[Iterator[T]].toArray
// We found the block from remote executors' BlockManager, so put the block
// in this executor's BlockManager.
if (!bm.putIterator(pieceId, data.toIterator,
So we're storing an RDD pieceId here, but I think unpersist only removes things with a BroadcastBlockId and the correct ID. Maybe it would be good to add a test around unpersistence to verify it's behaving as expected?
For RDDs, there is a cleaning mechanism that removes the persisted pieces once the RDD is no longer referenced. Because we fetch and use RDD pieces here, instead of broadcast pieces as in driver-side broadcast, I think it should be fine to delegate the cleaning to the current mechanism.
One solution might be to store the fetched RDD pieces under broadcast piece IDs, so unpersist could remove all the fetched pieces. However, we would then have to fetch both RDD piece IDs and broadcast piece IDs from other executors under the BitTorrent-like approach. Thus I would prefer the above way and let the current cleaning mechanism do its work.
Actually, another (related) question is what happens when the RDD backing the broadcast is unpersisted? I think it would be good to have tests around this as well. |
@holdenk If the RDD pieces have already been fetched on the executors, it will not affect the broadcasted object. If the fetching is not done yet, reading the blocks will fail. Let me think about whether I can add a test for it. |
Test build #66101 has finished for PR 15178 at commit
@viirya What I mean is that right now, I think the executors will fetch the blocks, and those blocks might not get cleaned up once the broadcast is destroyed. You could add a test to check whether the blocks are still present everywhere after unpersist. The other question is that if someone broadcasts a cached RDD and then unpersists it, I'm worried it might clean up the broadcast blocks on the executors as well. You could add a test here to see if you can use the broadcast after unpersisting the backing RDD (or, if we don't want to support that use case, add a note about it to the docs and make sure it fails in a clear manner). |
@holdenk For the first question: as this executor-side broadcast directly uses RDD blocks instead of creating broadcast blocks, once the broadcast is destroyed, only the broadcasted object is cleaned up. I will add a test for this. For the second question, it depends on whether the broadcasted object has already been created on the executors. If yes, it won't be affected. If no, there will be a failure when the executors try to fetch the RDD blocks and create the object. |
@holdenk I added the test cases. |
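A test along the lines @holdenk suggests might look like the following sketch. It is written against this PR's unmerged `broadcastRDDOnExecutor` API, so it will not compile against stock Spark; the `mode` value is an assumed `BroadcastMode[Int]` that materializes the rows into an `Array[Int]`, and `sc` is the suite's `SparkContext`.

```scala
// Sketch only: broadcastRDDOnExecutor comes from this (unmerged) PR.
test("broadcast variable is usable after unpersisting the backing RDD") {
  val mode: BroadcastMode[Int] = ???             // assumed identity-style mode
  val rdd = sc.parallelize(1 to 100).cache()
  rdd.count()                                    // materialize the RDD blocks first
  val bcast = sc.broadcastRDDOnExecutor[Int, Array[Int]](rdd, mode)
  rdd.unpersist(blocking = true)                 // drop the backing RDD blocks
  // Broadcast blocks were created eagerly on the executors, so the
  // broadcast variable should still be readable there.
  val sums = sc.parallelize(1 to 10, 2).map(_ => bcast.value.sum).collect()
  assert(sums.forall(_ == 5050))
}
```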
Force-pushed 1339daf to 3494728
Test build #66162 has finished for PR 15178 at commit
retest this please. |
Test build #66163 has finished for PR 15178 at commit
Test build #66165 has finished for PR 15178 at commit
So if the primary use of this is inside of SQL, then that might be OK (because we can just be very careful about it) - but since we are also exposing it to the user, it feels like this behaviour will probably catch some people by surprise (and at the very least we should document it). Maybe it would make sense to update the cleaning logic somehow, or store the blocks differently so the current cleaning logic behaves as expected - but it would be really good to hear what @rxin or @JoshRosen think about this because I'm a little uncertain. |
@holdenk After rethinking this, I have an idea for avoiding this surprise. Let me refactor this, and please review it then. Thanks! |
Awesome - so excited to see this :) |
@holdenk This is updated. Now we require that the RDD to be broadcast on the executor side is cached and materialized first. Broadcast blocks are created immediately on the executors. Once you have the broadcast variable, you can unpersist the RDD and the broadcast variable remains usable. |
Test build #66196 has finished for PR 15178 at commit
Test build #66199 has finished for PR 15178 at commit
retest this please. |
Test build #66204 has finished for PR 15178 at commit
Test build #70877 has finished for PR 15178 at commit
ping @rxin Do you have any more thoughts or feedback for this? Thanks. |
Test build #77963 has finished for PR 15178 at commit
Test build #77967 has finished for PR 15178 at commit
Force-pushed 4b31c7e to 59d96d7
Test build #77968 has finished for PR 15178 at commit
Force-pushed 59d96d7 to ecebb3f
Test build #77972 has started for PR 15178 at commit |
retest this please. |
Test build #77979 has finished for PR 15178 at commit
Test build #77992 has finished for PR 15178 at commit
retest this please. |
Test build #78011 has finished for PR 15178 at commit
@rxin Are we still considering incorporating this broadcast-on-executor feature? Thanks. |
Test build #81636 has finished for PR 15178 at commit
retest this please. |
Test build #83255 has finished for PR 15178 at commit
The current solution could OOM the executors, whose memory is normally much smaller than the driver's. We might also see a performance regression when the number of partitions is large and the partition size is small. How about closing this PR now? We can revisit it when we need this feature. |
Well, many aspects of this feature could still be discussed, but I agree we can close this for now, since it has been open for a while and there seems to be no urgent need for it. |
if (dataSize >= (8L << 30)) {
  // Call persist on the RDD because we want to broadcast the RDD blocks on executors.
  childRDD = child.execute().mapPartitionsInternal { rowIterator =>
    rowIterator.map(_.copy())
I know the PR has been closed, but I was interested in understanding the code. Why is a copy of the child RDD made before persisting?
In Spark SQL, the underlying mechanism reuses a single object (the row here) when iterating. We must manually copy the row in `map` before persisting; otherwise you will get results where all the rows are the same.
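The reused-object pitfall described above can be reproduced in plain Scala. This is a sketch: `MutableRow` is a hypothetical stand-in for Spark SQL's reused row object, not an actual Spark class.

```scala
// Hypothetical stand-in for Spark SQL's mutable row reuse: the iterator
// below yields the SAME mutable buffer for every element, just as Spark SQL
// reuses one row object across a partition's iterator.
final class MutableRow(var value: Int) {
  def copy(): MutableRow = new MutableRow(value)
}

object RowReuseDemo {
  def rows(values: Seq[Int]): Iterator[MutableRow] = {
    val shared = new MutableRow(0)                 // one buffer, reused per element
    values.iterator.map { v => shared.value = v; shared }
  }

  def main(args: Array[String]): Unit = {
    // Materializing WITHOUT copy(): every array slot holds the shared buffer,
    // so all entries show the last value written.
    val wrong = rows(Seq(1, 2, 3)).toArray
    assert(wrong.forall(_.value == 3))

    // Materializing WITH copy() (what `rowIterator.map(_.copy())` does in the
    // PR) snapshots each row before the buffer is overwritten.
    val right = rows(Seq(1, 2, 3)).map(_.copy()).toArray
    assert(right.map(_.value).sameElements(Array(1, 2, 3)))
  }
}
```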
@viirya Thanks for this diff. |
What changes were proposed in this pull request?
The mechanism of broadcast in Spark is to collect the result of an RDD and then broadcast it. This introduces some extra latency. We can broadcast the RDD directly from executors. This patch implements broadcast from executors, and applies it on broadcast join of Spark SQL.
The advantages of executor-side broadcast:
Design document: https://issues.apache.org/jira/secure/attachment/12831201/executor-side-broadcast.pdf
Major API changes

New API `broadcastRDDOnExecutor` in `SparkContext`

It takes two parameters, `rdd: RDD[T]` and `mode: BroadcastMode[T]`. It broadcasts the content of the rdd between executors without collecting it back to the driver. `mode` is used to convert the content of the rdd to the broadcasted object. Besides `T`, this API has another type parameter `U`, which is the type of the converted object.

New `Broadcast` implementation `TorrentExecutorBroadcast`

Different from `TorrentBroadcast`, this implementation doesn't divide and store the object data to be broadcast in the driver. The executors use local and remote fetches to fetch the blocks of the RDD and convert the RDD content to the broadcasted object.

`BroadcastMode` is moved from `org.apache.spark.sql.catalyst.plans.physical` to `org.apache.spark.broadcast`

It now has a type parameter `T`, which is the converted type of the broadcasted object on executors.

Usage: How to use executor side broadcast
To broadcast the result of an RDD, instead of collecting the result back to the driver and broadcasting it, we can use the executor side broadcast feature proposed here:

1. Prepare the RDD to be broadcast.
2. Define how to transform the result of the RDD with `BroadcastMode`.
3. Broadcast the RDD and use the broadcast variable.
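These steps might look like the following sketch. It assumes the API proposed in this PR (which was never merged, so it will not compile against stock Spark), a running `SparkContext`, and a hypothetical `BroadcastMode[Int]` left as `???`:

```scala
// Sketch only: broadcastRDDOnExecutor and the relocated BroadcastMode
// are from this (unmerged) PR.
import org.apache.spark.SparkContext
import org.apache.spark.broadcast.BroadcastMode

val sc: SparkContext = ???   // an existing SparkContext

// 1. Prepare the RDD to be broadcast; it must be cached and materialized
//    first so its blocks exist on the executors.
val rdd = sc.parallelize(1 to 1000).cache()
rdd.count()

// 2. Define how to transform the RDD's content into the broadcasted object;
//    here an assumed mode that collects the rows into an Array[Int].
val mode: BroadcastMode[Int] = ???

// 3. Broadcast the RDD between executors (no collect back to the driver)
//    and use the broadcast variable in distributed functions.
val bcast = sc.broadcastRDDOnExecutor[Int, Array[Int]](rdd, mode)
rdd.map(i => bcast.value.length + i).count()
```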
How was this patch tested?
Jenkins tests.