
[SPARK-3386] Share and reuse SerializerInstances in shuffle paths #5606

Closed
JoshRosen wants to merge 3 commits

Conversation

JoshRosen (Contributor)

This patch modifies several shuffle-related code paths to share and re-use `SerializerInstance`s instead of creating new ones. Some serializers, such as `KryoSerializer` or `SqlSerializer`, can be fairly expensive to create or may consume moderate amounts of memory, so it's probably best to avoid unnecessary serializer creation in hot code paths.

The key change in this patch is modifying `getDiskWriter()` / `DiskBlockObjectWriter` to accept `SerializerInstance`s instead of `Serializer`s (which are factories for instances). This allows the disk writer's creator to decide whether the serializer instance can be shared or re-used.
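For illustration, a minimal sketch of the shape of this change (simplified; the parameter types here are assumptions based on the diff below, and the real method may differ):

```scala
import java.io.File
import org.apache.spark.executor.ShuffleWriteMetrics
import org.apache.spark.serializer.{Serializer, SerializerInstance}
import org.apache.spark.storage.{BlockId, BlockObjectWriter}

trait DiskWriterBefore {
  // Before: given a Serializer (a factory), so the writer had to call
  // newInstance() itself, once per writer.
  def getDiskWriter(blockId: BlockId, file: File, serializer: Serializer,
      bufferSize: Int, writeMetrics: ShuffleWriteMetrics): BlockObjectWriter
}

trait DiskWriterAfter {
  // After: given an already-created SerializerInstance, so the caller
  // decides whether one instance is shared across many writers.
  def getDiskWriter(blockId: BlockId, file: File, serializerInstance: SerializerInstance,
      bufferSize: Int, writeMetrics: ShuffleWriteMetrics): BlockObjectWriter
}
```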

The rest of the patch modifies several write and read paths to use shared serializers. One big win is in `ShuffleBlockFetcherIterator`, where we used to create a new serializer per received block. Similarly, the shuffle write path used to create a new serializer per file even though in many cases only a single thread would be writing to a file at a time.
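Schematically (a hypothetical helper class, not the actual iterator code), the fetch path moves from one instance per block to one instance per single-threaded iterator:

```scala
import java.io.InputStream
import org.apache.spark.serializer.Serializer

class BlockDeserializer(serializer: Serializer) {
  // One instance for the lifetime of this (single-threaded) consumer...
  private val serInstance = serializer.newInstance()

  // ...reused for every fetched block, instead of calling
  // serializer.newInstance() inside this method for each block.
  def deserialize(blockData: InputStream): Iterator[Any] =
    serInstance.deserializeStream(blockData).asIterator
}
```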

I made a small serializer reuse optimization in CoarseGrainedExecutorBackend as well, since it seemed like a small and obvious improvement.

@JoshRosen JoshRosen changed the title [SPARK-3386] Share and reuse SerializerInstances in shuffle code paths [SPARK-3386] Share and reuse SerializerInstances in shuffle paths Apr 21, 2015
```diff
@@ -133,7 +134,8 @@ class FileShuffleBlockManager(conf: SparkConf)
         logWarning(s"Failed to remove existing shuffle file $blockFile")
       }
     }
-    blockManager.getDiskWriter(blockId, blockFile, serializer, bufferSize, writeMetrics)
+    blockManager.getDiskWriter(blockId, blockFile, serializerInstance, bufferSize,
```
@JoshRosen (Contributor, Author) commented on this diff:

Note that this line is called once for every bucket (reduce task), since it's enclosed in:

```scala
Array.tabulate[BlockObjectWriter](numBuckets) { bucketId =>
  ...
}
```
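In other words, the instance is now created once and shared by all `numBuckets` writers. A hedged sketch of the pattern, with hypothetical `blockId`/`blockFile` helpers standing in for the surrounding FileShuffleBlockManager code:

```scala
import java.io.File
import org.apache.spark.executor.ShuffleWriteMetrics
import org.apache.spark.serializer.Serializer
import org.apache.spark.storage.{BlockId, BlockManager, BlockObjectWriter}

def openWriters(
    blockManager: BlockManager,
    serializer: Serializer,
    numBuckets: Int,
    bufferSize: Int,
    writeMetrics: ShuffleWriteMetrics)(
    blockId: Int => BlockId,        // hypothetical per-bucket helpers
    blockFile: Int => File): Array[BlockObjectWriter] = {
  // One shared instance for the whole group of writers...
  val serializerInstance = serializer.newInstance()
  Array.tabulate[BlockObjectWriter](numBuckets) { bucketId =>
    // ...instead of serializer.newInstance() once per bucket, as before.
    blockManager.getDiskWriter(blockId(bucketId), blockFile(bucketId),
      serializerInstance, bufferSize, writeMetrics)
  }
}
```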

@SparkQA

SparkQA commented Apr 21, 2015

Test build #30653 has started for PR 5606 at commit aeb680e.

@JoshRosen (Contributor, Author)

This made a large performance difference in a local-mode Spark SQL test that I was running today, cutting one query's time from 30 seconds to 9 seconds. The benefit may be smaller for non-SQL jobs, since they're less likely to use costly-to-construct serializers.

It would be good to verify that I haven't made any bad thread-safety / single-threadedness assumptions here.

@SparkQA

SparkQA commented Apr 21, 2015

Test build #30654 has started for PR 5606 at commit 64f8398.

```diff
@@ -47,6 +48,11 @@ private[spark] class CoarseGrainedExecutorBackend(
   var executor: Executor = null
   @volatile var driver: Option[RpcEndpointRef] = None

+  // This is a thread-local in case we ever decide to change this to a non-thread-safe RpcEndpoint
+  private[this] val ser: ThreadLocal[SerializerInstance] = new ThreadLocal[SerializerInstance] {
```
A reviewer (Contributor) commented on this diff:

Actually, now that I think about it: if the underlying thread pool keeps creating new threads (which it might), this thread-local variable might lead to a memory leak. Maybe the best way to handle this is your old approach, plus a comment at the beginning of the class saying that if we ever change it to support multiple threads, this needs to change as well.
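A hedged sketch of that alternative, which appears to be what commit f661ce7 ("Remove thread local; add comment instead", in the squashed-commit list below) does. This assumes the surrounding class has an `env: SparkEnv` field; details may differ from the actual code:

```scala
// NOTE: single-threaded on purpose. If CoarseGrainedExecutorBackend is ever
// changed to a multi-threaded RpcEndpoint, this shared instance must become
// per-thread (or otherwise synchronized) as well.
private[this] val ser: SerializerInstance = env.closureSerializer.newInstance()
```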

@SparkQA

SparkQA commented Apr 21, 2015

Test build #30653 has finished for PR 5606 at commit aeb680e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
  • This patch does not change any dependencies.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/30653/

@SparkQA

SparkQA commented Apr 21, 2015

Test build #30654 has finished for PR 5606 at commit 64f8398.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
  • This patch does not change any dependencies.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/30654/

@SparkQA

SparkQA commented Apr 21, 2015

Test build #30681 has started for PR 5606 at commit f661ce7.

@JoshRosen (Contributor, Author)

As an example of the speedup that this can give, here's a simple benchmark.

Launch spark-shell with the following options:

```
./bin/spark-shell --conf spark.serializer=org.apache.spark.serializer.KryoSerializer --master=local[8]
```

Then, paste the following commands into the shell:

```scala
val start = System.currentTimeMillis()
sc.parallelize(1 to 10000, 100).map(x => (x, x)).reduceByKey(_ + _, 100).count()
println(System.currentTimeMillis() - start)
```

Prior to this patch, this takes about 2.5 seconds to run (after a few warmup runs); after this patch, this same query takes around 600ms (after warmup).
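For anyone reproducing this, a small hypothetical timing helper (not part of the patch) that handles the warmup runs:

```scala
// Run `body` a few times to warm up the JIT and caches, then report the
// median wall-clock time of the measured runs, in milliseconds.
def bench(warmup: Int = 3, runs: Int = 5)(body: => Unit): Long = {
  (1 to warmup).foreach(_ => body)
  val times = (1 to runs).map { _ =>
    val start = System.currentTimeMillis()
    body
    System.currentTimeMillis() - start
  }.sorted
  times(times.length / 2)
}

// Usage in the shell:
// bench() { sc.parallelize(1 to 10000, 100).map(x => (x, x)).reduceByKey(_ + _, 100).count() }
```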

@SparkQA

SparkQA commented Apr 21, 2015

Test build #30681 has finished for PR 5606 at commit f661ce7.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
  • This patch does not change any dependencies.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/30681/

@JoshRosen (Contributor, Author)

Jenkins, retest this please.

@SparkQA

SparkQA commented Apr 21, 2015

Test build #30686 has started for PR 5606 at commit f661ce7.

@SparkQA

SparkQA commented Apr 21, 2015

Test build #30686 has finished for PR 5606 at commit f661ce7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
  • This patch does not change any dependencies.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/30686/

@rxin (Contributor)

rxin commented Apr 21, 2015

LGTM.

@carrino

carrino commented May 20, 2015

Hi. I'm a little late to this party, but I just started using Spark two weeks ago.

My concern with this fix is that it shares serializers but doesn't call `reset()` on the `KryoSerializer` between files. If reference tracking is turned on, this could lead to issues where a later reference may be missing because it points into a different file.
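To make the failure mode concrete, here is a hypothetical illustration with plain Kryo (not Spark code) of how a retained reference table can corrupt a second stream:

```scala
import java.io.FileOutputStream
import com.esotericsoftware.kryo.Kryo
import com.esotericsoftware.kryo.io.Output

val kryo = new Kryo()
kryo.setReferences(true)   // reference tracking on
kryo.setAutoReset(false)   // auto-reset off, e.g. via a custom registrator

val payload = Array("some", "shared", "object")

val out1 = new Output(new FileOutputStream("file1.bin"))
kryo.writeClassAndObject(out1, payload)
out1.close()

// Without kryo.reset() here, `payload` is still in the reference table, so
// the second write may emit only a back-reference into file1's stream, which
// a reader of file2 alone cannot resolve.
val out2 = new Output(new FileOutputStream("file2.bin"))
kryo.writeClassAndObject(out2, payload)
out2.close()
```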

@JoshRosen (Contributor, Author)

Hi @carrino,

Thanks for spotting this issue. I've managed to write a regression test which exposes this bug: JoshRosen@71845e3

This is only a problem when reference tracking is enabled and auto-reset has been disabled by a user's custom Kryo registrator. I think that this issue should be fairly easy to fix by adding `reset()` calls at the end of `serialize` and `serializeStream`. I'll file a JIRA for this and put together a pull request shortly, then ping you for review.
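A minimal sketch of the shape of that fix, assuming a simplified serializer instance; per the merged commit message below, the actual fix calls `reset()` at the start of `serialize()` and `serializeStream()`:

```scala
import java.nio.ByteBuffer
import com.esotericsoftware.kryo.Kryo
import com.esotericsoftware.kryo.io.Output

class SimpleKryoInstance(kryo: Kryo, bufferSize: Int, maxBufferSize: Int) {
  def serialize[T](t: T): ByteBuffer = {
    // Clear any reference-table state left over from earlier calls on this
    // shared instance (a no-op when auto-reset already ran).
    kryo.reset()
    val output = new Output(bufferSize, maxBufferSize)
    kryo.writeClassAndObject(output, t)
    ByteBuffer.wrap(output.toBytes)
  }
}
```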

@carrino

carrino commented May 20, 2015

Good times. Thanks for being super responsive.

@JoshRosen (Contributor, Author)

I've filed https://issues.apache.org/jira/browse/SPARK-7766 for this issue and have submitted #6293 to fix this.

asfgit pushed a commit that referenced this pull request May 22, 2015
…s disabled

SPARK-3386 / #5606 modified the shuffle write path to re-use serializer instances across multiple calls to DiskBlockObjectWriter. It turns out that this introduced a very rare bug when using `KryoSerializer`: if auto-reset is disabled and reference-tracking is enabled, then we'll end up re-using the same serializer instance to write multiple output streams without calling `reset()` between write calls, which can lead to cases where objects in one file may contain references to objects that are in previous files, causing errors during deserialization.

This patch fixes this bug by calling `reset()` at the start of `serialize()` and `serializeStream()`. I also added a regression test which demonstrates that this problem only occurs when auto-reset is disabled and reference-tracking is enabled.

Author: Josh Rosen <[email protected]>

Closes #6293 from JoshRosen/kryo-instance-reuse-bug and squashes the following commits:

e19726d [Josh Rosen] Add fix for SPARK-7766.
71845e3 [Josh Rosen] Add failing regression test to trigger Kryo re-use bug
asfgit pushed a commit that referenced this pull request May 22, 2015 (same commit message as above)

(cherry picked from commit eac0069)
Signed-off-by: Josh Rosen <[email protected]>
jeanlyn pushed a commit to jeanlyn/spark that referenced this pull request May 28, 2015 (same commit message as above, as apache#5606 / apache#6293)
jeanlyn pushed a commit to jeanlyn/spark that referenced this pull request Jun 12, 2015 (same commit message as above)
nemccarthy pushed a commit to nemccarthy/spark that referenced this pull request Jun 19, 2015 (this PR's squashed commit; its message repeats the PR description above)

Author: Josh Rosen <[email protected]>

Closes apache#5606 from JoshRosen/SPARK-3386 and squashes the following commits:

f661ce7 [Josh Rosen] Remove thread local; add comment instead
64f8398 [Josh Rosen] Use ThreadLocal for serializer instance in CoarseGrainedExecutorBackend
aeb680e [Josh Rosen] [SPARK-3386] Reuse SerializerInstance in shuffle code paths
nemccarthy pushed a commit to nemccarthy/spark that referenced this pull request Jun 19, 2015 (same SPARK-7766 commit message as above)