[SPARK-3386] Share and reuse SerializerInstances in shuffle paths #5606
Conversation
@@ -133,7 +134,8 @@ class FileShuffleBlockManager(conf: SparkConf)
           logWarning(s"Failed to remove existing shuffle file $blockFile")
         }
       }
-     blockManager.getDiskWriter(blockId, blockFile, serializer, bufferSize, writeMetrics)
+     blockManager.getDiskWriter(blockId, blockFile, serializerInstance, bufferSize,
Note that this line is executed once for every bucket (reduce task), since it's enclosed in the following (see the sketch below):
Array.tabulate[BlockObjectWriter](numBuckets) { bucketId =>
  ...
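For readers skimming the diff, a minimal sketch of the before/after shape (using simplified stand-in types rather than the real `FileShuffleBlockManager` internals) might look like this:

```scala
// Minimal sketch with simplified stand-in types (not the real Spark classes).
trait SerializerInstance
class Serializer {
  // Imagine this is expensive, e.g. Kryo instantiation plus registrator setup.
  def newInstance(): SerializerInstance = new SerializerInstance {}
}
class BlockObjectWriter(instance: SerializerInstance)

object WriterSetup {
  def openWriters(serializer: Serializer, numBuckets: Int): Array[BlockObjectWriter] = {
    // Before: one new serializer instance per bucket (numBuckets of them).
    //   Array.tabulate(numBuckets) { _ => new BlockObjectWriter(serializer.newInstance()) }

    // After: a single instance created once and shared by all of this task's
    // writers, which is safe because only one thread writes these files at a time.
    val serializerInstance = serializer.newInstance()
    Array.tabulate(numBuckets) { _ => new BlockObjectWriter(serializerInstance) }
  }
}
```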
Test build #30653 has started for PR 5606 at commit
This made a large performance difference in one of my benchmarks (see the spark-shell example below). It would be good to verify that I haven't made any bad thread-safety / single-threadedness assumptions here.
Test build #30654 has started for PR 5606 at commit
@@ -47,6 +48,11 @@ private[spark] class CoarseGrainedExecutorBackend(
   var executor: Executor = null
   @volatile var driver: Option[RpcEndpointRef] = None

+  // This is a thread-local in case we ever decide to change this to a non-thread-safe RpcEndpoint
+  private[this] val ser: ThreadLocal[SerializerInstance] = new ThreadLocal[SerializerInstance] {
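The added line is truncated above; as a sketch of the usual pattern (the initializer shown here is an assumption, not necessarily the exact PR code), a per-thread instance would be created lazily like this:

```scala
import org.apache.spark.serializer.{Serializer, SerializerInstance}

// Sketch only: lazily create one SerializerInstance per thread.
// `closureSerializer` stands in for the executor's serializer factory; the
// exact field used in the PR may differ.
class PerThreadSerializerSketch(closureSerializer: Serializer) {
  private[this] val ser: ThreadLocal[SerializerInstance] = new ThreadLocal[SerializerInstance] {
    override def initialValue(): SerializerInstance = closureSerializer.newInstance()
  }
  def instance: SerializerInstance = ser.get()
}
```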
Actually, now that I think about it: if the underlying thread pool keeps creating new threads (which it might), this thread-local variable might lead to a memory leak. Maybe the best way to handle this is to go back to your old approach and add a comment at the beginning of the class saying that if we ever change it to support multiple threads, this needs to change as well.
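For reference, the alternative being suggested here (and which the later "Remove thread local; add comment instead" commit adopted) can be sketched roughly as a plain field guarded by a class-level comment, assuming the endpoint stays single-threaded:

```scala
import org.apache.spark.serializer.{Serializer, SerializerInstance}

// Sketch only. NOTE: this assumes all message handling happens on a single
// thread; if that ever changes, `ser` must become per-thread or thread-safe.
class SingleThreadedBackendSketch(closureSerializer: Serializer) {
  private[this] val ser: SerializerInstance = closureSerializer.newInstance()
}
```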
Test build #30653 has finished for PR 5606 at commit
Test PASSed.
Test build #30654 has finished for PR 5606 at commit
Test PASSed.
Test build #30681 has started for PR 5606 at commit
As an example of the speedup that this can give, here's a simple benchmark. Launch spark-shell with the following options:
Then, paste the following commands into the shell:
Prior to this patch, this takes about 2.5 seconds to run (after a few warmup runs); after this patch, this same query takes around 600ms (after warmup).
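The exact options and commands are not shown here; as a purely illustrative sketch (not the benchmark Josh ran), a shuffle-heavy spark-shell session that exercises many disk writers might look like this:

```scala
// Hypothetical spark-shell launch options (illustrative only):
//   ./bin/spark-shell --master local[8] \
//     --conf spark.serializer=org.apache.spark.serializer.KryoSerializer

// Hypothetical shuffle-heavy job: many reduce partitions means many disk
// writers, so per-writer serializer creation shows up in the timings.
val pairs = sc.parallelize(1 to 1000000, numSlices = 100).map(x => (x % 10000, 1))
pairs.reduceByKey(_ + _, numPartitions = 2000).count()
```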
Test build #30681 has finished for PR 5606 at commit
Test FAILed.
Jenkins, retest this please.
Test build #30686 has started for PR 5606 at commit
Test build #30686 has finished for PR 5606 at commit
Test PASSed.
LGTM.
Hi. I'm a little late to this party, but I just started using Spark two weeks ago. My concern with this fix is that it shares serializers, but it doesn't call reset on the Kryo serializer between files. If reference tracking is turned on, this could lead to issues where a later reference may be missing because it's in a different file.
Hi @carrino, thanks for spotting this issue. I've managed to write a regression test which exposes this bug: JoshRosen@71845e3. This is only a problem when reference tracking is enabled and auto-reset has been disabled by a user's custom Kryo registrator. I think that this issue should be fairly easy to fix by adding `reset()` calls.
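A rough sketch of the kind of fix described (not the actual `KryoSerializerInstance` code, whose fields and wrappers differ) is to clear Kryo's reference table before each new stream whenever auto-reset is off:

```scala
import java.io.OutputStream
import com.esotericsoftware.kryo.Kryo
import com.esotericsoftware.kryo.io.Output

// Sketch only: when autoReset is disabled, explicitly reset the Kryo instance
// before starting a new stream so references from earlier files can't leak in.
class ReusableKryoStreamWriter(kryo: Kryo, autoReset: Boolean) {
  def newStream(out: OutputStream): Output = {
    if (!autoReset) {
      kryo.reset()  // drops the object-reference table built up by prior writes
    }
    new Output(out)
  }
}
```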
Good times. Thanks for being super responsive.
I've filed https://issues.apache.org/jira/browse/SPARK-7766 for this issue and have submitted #6293 to fix this.
…s disabled

SPARK-3386 / #5606 modified the shuffle write path to re-use serializer instances across multiple calls to DiskBlockObjectWriter. It turns out that this introduced a very rare bug when using `KryoSerializer`: if auto-reset is disabled and reference-tracking is enabled, then we'll end up re-using the same serializer instance to write multiple output streams without calling `reset()` between write calls, which can lead to cases where objects in one file may contain references to objects that are in previous files, causing errors during deserialization.

This patch fixes this bug by calling `reset()` at the start of `serialize()` and `serializeStream()`. I also added a regression test which demonstrates that this problem only occurs when auto-reset is disabled and reference-tracking is enabled.

Author: Josh Rosen <[email protected]>

Closes #6293 from JoshRosen/kryo-instance-reuse-bug and squashes the following commits:

e19726d [Josh Rosen] Add fix for SPARK-7766.
71845e3 [Josh Rosen] Add failing regression test to trigger Kryo re-use bug
This patch modifies several shuffle-related code paths to share and re-use SerializerInstances instead of creating new ones. Some serializers, such as KryoSerializer or SqlSerializer, can be fairly expensive to create or may consume moderate amounts of memory, so it's probably best to avoid unnecessary serializer creation in hot code paths.

The key change in this patch is modifying `getDiskWriter()` / `DiskBlockObjectWriter` to accept `SerializerInstance`s instead of `Serializer`s (which are factories for instances). This allows the disk writer's creator to decide whether the serializer instance can be shared or re-used.

The rest of the patch modifies several write and read paths to use shared serializers. One big win is in `ShuffleBlockFetcherIterator`, where we used to create a new serializer per received block. Similarly, the shuffle write path used to create a new serializer per file even though in many cases only a single thread would be writing to a file at a time.

I made a small serializer reuse optimization in CoarseGrainedExecutorBackend as well, since it seemed like a small and obvious improvement.

Author: Josh Rosen <[email protected]>

Closes apache#5606 from JoshRosen/SPARK-3386 and squashes the following commits:

f661ce7 [Josh Rosen] Remove thread local; add comment instead
64f8398 [Josh Rosen] Use ThreadLocal for serializer instance in CoarseGrainedExecutorBackend
aeb680e [Josh Rosen] [SPARK-3386] Reuse SerializerInstance in shuffle code paths
This patch modifies several shuffle-related code paths to share and re-use SerializerInstances instead of creating new ones. Some serializers, such as KryoSerializer or SqlSerializer, can be fairly expensive to create or may consume moderate amounts of memory, so it's probably best to avoid unnecessary serializer creation in hot code paths.
The key change in this patch is modifying `getDiskWriter()` / `DiskBlockObjectWriter` to accept `SerializerInstance`s instead of `Serializer`s (which are factories for instances). This allows the disk writer's creator to decide whether the serializer instance can be shared or re-used.

The rest of the patch modifies several write and read paths to use shared serializers. One big win is in `ShuffleBlockFetcherIterator`, where we used to create a new serializer per received block. Similarly, the shuffle write path used to create a new serializer per file even though in many cases only a single thread would be writing to a file at a time.

I made a small serializer reuse optimization in CoarseGrainedExecutorBackend as well, since it seemed like a small and obvious improvement.
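To make the API change concrete, here is a minimal sketch (with simplified stand-in signatures, not the actual BlockManager code) of how passing a `SerializerInstance` instead of a `Serializer` lets the caller decide how instances are shared:

```scala
import org.apache.spark.serializer.{Serializer, SerializerInstance}

// Simplified stand-ins for the real writer and block-manager types.
class DiskWriterSketch(instance: SerializerInstance)

class BlockManagerSketch {
  // Before (sketch): the writer received a Serializer factory and called
  // newInstance() itself, so every writer paid the instance-creation cost.
  def getDiskWriterOld(serializer: Serializer): DiskWriterSketch =
    new DiskWriterSketch(serializer.newInstance())

  // After (sketch): the caller passes a SerializerInstance, so it can create
  // one instance and share it across all writers it opens on the same thread.
  def getDiskWriterNew(serializerInstance: SerializerInstance): DiskWriterSketch =
    new DiskWriterSketch(serializerInstance)
}

// Usage sketch: one instance shared across many per-bucket writers.
// val instance = serializer.newInstance()
// val writers = Array.tabulate(numBuckets) { _ => bm.getDiskWriterNew(instance) }
```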