[SPARK-19525][CORE]Add RDD checkpoint compression support #17789

zsxwing · 2017-04-27T20:48:20Z

What changes were proposed in this pull request?

This PR adds RDD checkpoint compression support and adds a new config, spark.checkpoint.compress, to enable or disable it. Credit goes to @aramesh117.
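For readers who want to try the flag, here is a minimal usage sketch. It assumes Spark 2.2 or later on the classpath; the application name, the local master, and the checkpoint directory are hypothetical choices for illustration, not part of this PR.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CheckpointCompressionExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setMaster("local[2]")                          // assumption: local run for illustration
      .setAppName("checkpoint-compression-example")   // hypothetical app name
      // The config added by this PR; RDD checkpoint files are compressed when it is true.
      .set("spark.checkpoint.compress", "true")

    val sc = new SparkContext(conf)
    try {
      // Hypothetical directory; in production this is typically an HDFS path.
      sc.setCheckpointDir("/tmp/spark-checkpoints")

      val rdd = sc.parallelize(1 to 1000).map(i => (i % 10, i))
      rdd.checkpoint() // mark the RDD for checkpointing
      rdd.count()      // an action triggers the actual checkpoint write
    } finally {
      sc.stop()
    }
  }
}
```

As discussed later in this thread, the codec is the same one Spark already uses for shuffle and cache data, so no separate codec setting is introduced here.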

Closes #17024

How was this patch tested?

The new unit test.

Spark's performance improves greatly if we enable compression of checkpoints.

zsxwing · 2017-04-27T20:48:41Z

cc @mridulm since you reviewed the initial PR.

aramesh117 · 2017-04-27T23:04:21Z

@zsxwing Sorry for the delay! Thank you so much for your review and I saw a bit of your patch - it looks very nice. I have just one question - would it be a good idea to separate the codecs for compressing checkpoints and for network communication? I see that you reuse the same codec. If we wanted to isolate the effect of the RDD checkpointing, we would not be able to do that easily if these are coupled.

mridulm

LGTM, thanks for pushing on this!
I have a question about naming, though, for the case where codecs are enabled.

mridulm · 2017-04-27T23:12:58Z

core/src/main/scala/org/apache/spark/rdd/ReliableCheckpointRDD.scala

-      fs.create(tempOutputPath, false, bufferSize)
+      val fileStream = fs.create(tempOutputPath, false, bufferSize)
+      if (env.conf.get(CHECKPOINT_COMPRESS)) {
+        CompressionCodec.createCodec(env.conf).compressedOutputStream(fileStream)


A question I had even with the earlier PR: should we add an extension to either the directory or the file to indicate the compression type?
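The pattern in the hunk above, wrapping the raw file stream only when a flag is set, can be sketched in isolation. This is a self-contained illustration, not the code from the PR: it substitutes `java.util.zip` GZIP streams for Spark's `CompressionCodec`, uses in-memory byte arrays instead of a Hadoop file system, and `ConditionalCompression` and its methods are hypothetical names.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, InputStream, OutputStream}
import java.util.zip.{GZIPInputStream, GZIPOutputStream}

object ConditionalCompression {
  // Wrap the raw stream only when compression is enabled (GZIP is a stand-in codec).
  def wrapOutput(raw: OutputStream, compress: Boolean): OutputStream =
    if (compress) new GZIPOutputStream(raw) else raw

  // The reader must apply the same decision, otherwise the bytes are unreadable.
  def wrapInput(raw: InputStream, compress: Boolean): InputStream =
    if (compress) new GZIPInputStream(raw) else raw

  def write(payload: Array[Byte], compress: Boolean): Array[Byte] = {
    val sink = new ByteArrayOutputStream()
    val out = wrapOutput(sink, compress)
    try out.write(payload) finally out.close() // close() flushes the compressed trailer
    sink.toByteArray
  }

  def read(stored: Array[Byte], compress: Boolean): Array[Byte] = {
    val in = wrapInput(new ByteArrayInputStream(stored), compress)
    try {
      val buf = new ByteArrayOutputStream()
      val chunk = new Array[Byte](4096)
      var n = in.read(chunk)
      while (n != -1) {
        buf.write(chunk, 0, n)
        n = in.read(chunk)
      }
      buf.toByteArray
    } finally in.close()
  }
}
```

Note that nothing in the stored bytes or the file name records whether compression was applied, which is the concern the extension question raises.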

zsxwing · 2017-04-27T23:19:00Z

> A question I had even with the earlier PR was - should we add the extension to either the directory or the file indicating compression type ?

Shuffle and cache files don't have an extension. I think it's better to be consistent in the whole code base.

> would it be a good idea to separate the codecs for compressing checkpoints and for network communication?

Same as above. Shuffle and cache files use the same codec.

zsxwing · 2017-04-27T23:21:42Z

In addition, I agree that having an extension and separating the codecs are good ideas. But they should be done in other PRs to not introduce multiple features in a large PR.

mridulm · 2017-04-27T23:21:45Z

Shuffle and cache files are not on HDFS :-) They are not expected to survive the application or to be consumed out of band for recovery or inspection.

mridulm · 2017-04-27T23:23:49Z

Doing it in a separate PR sounds good. I am not too worried about shuffle files, block data, etc., since they are private to application execution. Checkpoints, however, tend to be read for other purposes as well (whether by design or not) because they are on HDFS. With this PR, consumers will have to either guess the codec by iterating over the supported codecs or receive it through some other channel.
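To make the "guess the codec" option concrete, here is a rough sketch of what an out-of-band consumer might have to do when the file carries no codec marker. It is an illustration under assumptions: the candidate list uses `java.util.zip` decoders as stand-ins for Spark's codecs, and `CodecGuessing`, `candidates`, and `guess` are hypothetical names.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, InputStream}
import java.util.zip.{GZIPInputStream, InflaterInputStream}
import scala.util.Try

object CodecGuessing {
  // Stand-in codecs; a real consumer would list the codecs Spark supports instead.
  val candidates: Seq[(String, InputStream => InputStream)] = Seq(
    ("gzip", (in: InputStream) => new GZIPInputStream(in)),
    ("deflate", (in: InputStream) => new InflaterInputStream(in))
  )

  private def readFully(in: InputStream): Array[Byte] =
    try {
      val buf = new ByteArrayOutputStream()
      val chunk = new Array[Byte](4096)
      var n = in.read(chunk)
      while (n != -1) {
        buf.write(chunk, 0, n)
        n = in.read(chunk)
      }
      buf.toByteArray
    } finally in.close()

  // Try each candidate in order; the first one that decodes without error wins.
  // Returns None when no candidate can decode the bytes.
  def guess(stored: Array[Byte]): Option[(String, Array[Byte])] =
    candidates.iterator
      .map { case (name, open) =>
        Try(readFully(open(new ByteArrayInputStream(stored)))).toOption.map(bytes => (name, bytes))
      }
      .collectFirst { case Some(hit) => hit }
}
```

Trial decoding is fragile: it costs a failed read per wrong candidate, and a codec without a strong header check could in principle decode the wrong format into garbage. A file extension or other explicit marker would avoid both problems.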

zsxwing · 2017-04-27T23:24:29Z

Streaming checkpoint files are on HDFS but don't have an extension :)

mridulm · 2017-04-27T23:26:13Z

@zsxwing They are compressed? Interesting... I have never worked with Spark Streaming, unfortunately, so I did not know!

zsxwing · 2017-04-27T23:28:26Z

yes. See https://github.com/apache/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/Checkpoint.scala#L138

SparkQA · 2017-04-27T23:39:20Z

Test build #76242 has finished for PR 17789 at commit dab44ed.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

mridulm · 2017-04-28T01:55:26Z

I thought the main reason @aramesh117 did this PR was to enable compression for the Spark Streaming use case.
If compression is already enabled, am I missing something here?

mridulm · 2017-04-28T01:57:46Z

To add: for non-streaming use cases this will definitely help, but was this a recent change for streaming? (Probably after @aramesh117 made the PR?)

zsxwing · 2017-04-28T18:58:22Z

> did this PR was for compression to be enabled for spark streaming usecase.

Streaming checkpoint includes two parts:

1. DStream graph and metadata
2. RDD checkpoints

Right now the first one is compressed. This PR is for RDD checkpoints.

mridulm · 2017-04-28T22:14:27Z

Ah, interesting, thanks for clarifying. It is odd that the first part was compressed and the second was not. But if some of the data is already expected to be compressed, then this change makes things consistent, and there is no need to add an extension (unless we do it uniformly everywhere).

LGTM, thanks for the change @zsxwing and @aramesh117 !
@zsxwing would it be possible for you to merge this ? I am facing some env issues right now. Thanks

zsxwing · 2017-04-28T22:26:22Z

Thanks, @mridulm @aramesh117 Merging to master and 2.2.

## What changes were proposed in this pull request?

This PR adds RDD checkpoint compression support and add a new config `spark.checkpoint.compress` to enable/disable it. Credit goes to aramesh117

Closes #17024

## How was this patch tested?

The new unit test.

Author: Shixiong Zhu <[email protected]>
Author: Aaditya Ramesh <[email protected]>

Closes #17789 from zsxwing/pr17024.

(cherry picked from commit 77bcd77)
Signed-off-by: Shixiong Zhu <[email protected]>

aramesh117 · 2017-04-29T07:16:45Z

@mridulm and @zsxwing thank you so much! This will help us out a lot! Much appreciated. :)

Aaditya Ramesh and others added 4 commits February 21, 2017 21:16

[SPARK-19525][CORE] Compressing checkpoints.

7837b0c

Spark's performance improves greatly if we enable compression of checkpoints.

[SPARK-19525][CORE] Addressing comments.

18e7ba6

Merge remote-tracking branch 'origin/master' into pr17024

0a84178

Finish the PR

dab44ed

zsxwing mentioned this pull request Apr 27, 2017

[SPARK-19525][CORE] Compressing checkpoints. #17024

Closed

mridulm approved these changes Apr 27, 2017

asfgit closed this in 77bcd77 Apr 28, 2017

zsxwing deleted the pr17024 branch April 28, 2017 22:32
