[SPARK-19112][CORE] add codec for ZStandard #17303

Closed
wants to merge 1 commit into from

dongjinleekr
Contributor

@dongjinleekr dongjinleekr commented Mar 15, 2017

What changes were proposed in this pull request?

Hadoop and HBase have supported ZStandard compression since their recent releases. This update enables saving a file in HDFS using the ZStandard codec by implementing ZStandardCodec. It also requires adding a new configuration for the default compression level, for example 'spark.io.compression.zstandard.level'.

How was this patch tested?

3 additional unit tests in CompressionCodecSuite.scala.

@AmplabJenkins

Can one of the admins verify this patch?

@srowen
Member

srowen commented Mar 15, 2017

Same questions as in the last PR: can this be something the user includes if needed, or is there value in integrating it into Spark? Where would it come into play, and with what versions of Hadoop et al.?

@tgravescs
Contributor

This should not be needed just to write to HDFS. The regular Hadoop input/output formats support it if you are using the right version (I think Hadoop 2.8).

This seems to be adding support to spark.io.compression.codec for internal compression. From what I've heard, zstd is better than the other codecs since it gives gzip-level compression with LZ4-level CPU usage. So if you have a job that has a ton of intermediate data or is causing network issues, you may want to use zstd to get gzip compression levels without much CPU penalty.

@dongjinleekr It doesn't look like you ran any manual tests on a real cluster? It would be nice to have some basic performance/compression numbers to show it actually working. Are you planning on actually using zstd in your Spark deployment?

@rxin
Contributor

rxin commented Mar 15, 2017

Yes, it'd be nice to have some benchmarks on this.

@maropu
Member

maropu commented Apr 28, 2017

I ran quick benchmarks using a TPC-DS query (Q4), following the previous work in #10342.
Based on the results, it seems a bit early to implement this?

scaleFactor: 4
AWS instance: c4.4xlarge	

-- zstd
Running execution q4-v1.4 iteration: 1, StandardRun=true
Execution time: 53.315878375s
Running execution q4-v1.4 iteration: 2, StandardRun=true
Execution time: 53.468174668s
Running execution q4-v1.4 iteration: 3, StandardRun=true
Execution time: 57.282403146s 

-- lz4
Running execution q4-v1.4 iteration: 1, StandardRun=true
Execution time: 20.779643053s
Running execution q4-v1.4 iteration: 2, StandardRun=true
Execution time: 16.520911319s
Running execution q4-v1.4 iteration: 3, StandardRun=true
Execution time: 15.897124967s

-- snappy
Running execution q4-v1.4 iteration: 1, StandardRun=true
Execution time: 21.132412036999998s
Running execution q4-v1.4 iteration: 2, StandardRun=true
Execution time: 15.908867743999998s                                             
Running execution q4-v1.4 iteration: 3, StandardRun=true
Execution time: 15.789648712s

-- lzf
Running execution q4-v1.4 iteration: 1, StandardRun=true
Execution time: 21.339518781s
Running execution q4-v1.4 iteration: 2, StandardRun=true
Execution time: 16.881225328s                                                   
Running execution q4-v1.4 iteration: 3, StandardRun=true
Execution time: 15.813455479s

@srowen
Member

srowen commented May 6, 2017

OK, seems like we should close this.

class ZStandardCompressionCodec(conf: SparkConf) extends CompressionCodec {

  override def compressedOutputStream(s: OutputStream): OutputStream = {
    val level = conf.getSizeAsBytes("spark.io.compression.zstandard.level", "3").toInt

@Cyan4973 Cyan4973 May 8, 2017

Use cases which favor speed over size should prefer using level 1.
Compression speed difference can be fairly large.

@maropu
Member

maropu commented May 9, 2017

@Cyan4973 I quickly checked again:

scaleFactor: 4
AWS instance: c4.4xlarge	

// In this benchmark I used `local-cluster` (the benchmark above used `local`)
./bin/spark-shell --master local-cluster[4,4,7500] \
  --conf spark.driver.memory=1g \
  --conf spark.executor.memory=7g \
  --conf spark.io.compression.codec=xxx

--- zstd (level=3)
Running execution q4-v1.4 iteration: 1, StandardRun=true
Execution time: 36.517211838s
Running execution q4-v1.4 iteration: 2, StandardRun=true
Execution time: 25.026869575s                                                   
Running execution q4-v1.4 iteration: 3, StandardRun=true
Execution time: 24.370711575s                                                   

--- zstd (level=1)
Running execution q4-v1.4 iteration: 1, StandardRun=true
Execution time: 29.654705815s
Running execution q4-v1.4 iteration: 2, StandardRun=true
Execution time: 20.638918335s
Running execution q4-v1.4 iteration: 3, StandardRun=true
Execution time: 19.928730758999997s

--- lz4
Running execution q4-v1.4 iteration: 1, StandardRun=true
Execution time: 27.422360631s
Running execution q4-v1.4 iteration: 2, StandardRun=true
Execution time: 17.38519278s
Running execution q4-v1.4 iteration: 3, StandardRun=true
Execution time: 15.779084563s

--- snappy
Running execution q4-v1.4 iteration: 1, StandardRun=true
Execution time: 27.476569521000002s
Running execution q4-v1.4 iteration: 2, StandardRun=true
Execution time: 16.438640631s                                                   
Running execution q4-v1.4 iteration: 3, StandardRun=true
Execution time: 14.949329456s

--- lzf
Running execution q4-v1.4 iteration: 1, StandardRun=true
Execution time: 27.853010073s
Running execution q4-v1.4 iteration: 2, StandardRun=true
Execution time: 17.431232532000003s
Running execution q4-v1.4 iteration: 3, StandardRun=true
Execution time: 15.916569896999999s

zstd was still slower than the others. I'm not sure, though; there might be cases where zstd beats the others on larger data sets.
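
The timings from the local-cluster run above can be condensed into one number per codec. What follows is a minimal sketch, not part of the original benchmark: it assumes iteration 1 is warm-up and drops it (the thread does not state this), rounds the posted times to three decimals, and uses hypothetical helper names `warm_mean` and `summarize`.

```python
from statistics import mean

# Execution times in seconds from the local-cluster run above,
# rounded to three decimals. Order: iterations 1, 2, 3.
timings = {
    "zstd (level=3)": [36.517, 25.027, 24.371],
    "zstd (level=1)": [29.655, 20.639, 19.929],
    "lz4": [27.422, 17.385, 15.779],
    "snappy": [27.477, 16.439, 14.949],
    "lzf": [27.853, 17.431, 15.917],
}


def warm_mean(runs):
    """Mean of all runs except the first, which is treated as warm-up (an assumption)."""
    return mean(runs[1:])


def summarize(timings, baseline="lz4"):
    """Return {codec: (warm mean in seconds, ratio to the baseline's warm mean)}."""
    base = warm_mean(timings[baseline])
    return {
        name: (warm_mean(runs), warm_mean(runs) / base)
        for name, runs in timings.items()
    }


if __name__ == "__main__":
    for name, (avg, ratio) in summarize(timings).items():
        print(f"{name:15s} {avg:7.3f}s  {ratio:.2f}x lz4")
```

On these numbers the warm runs for zstd take roughly 1.5x (level 3) and 1.2x (level 1) the lz4 time. This measures end-to-end query time only and says nothing about compressed size.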

@Cyan4973

Cyan4973 commented May 9, 2017

@maropu: What about compression ratios?
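
Compression ratio per level can be measured alongside compression time. The sketch below shows the shape of such a measurement under loud assumptions: it uses Python's standard-library `zlib` purely as a stand-in because zstd bindings are not assumed to be installed, the payload is synthetic, and `measure` is a hypothetical helper. Its output says nothing about zstd or about Spark shuffle data.

```python
import time
import zlib


def measure(data: bytes, level: int):
    """Compress `data` at `level`; return (ratio, seconds).

    Ratio is original size divided by compressed size, so higher is better.
    zlib is a stand-in here, not the codec discussed in this thread.
    """
    start = time.perf_counter()
    compressed = zlib.compress(data, level)
    elapsed = time.perf_counter() - start
    # Round-trip check so a ratio is never reported for corrupted output.
    assert zlib.decompress(compressed) == data
    return len(data) / len(compressed), elapsed


if __name__ == "__main__":
    # Synthetic, repetitive payload (an assumption; real shuffle blocks will differ).
    payload = b"".join(
        f"row-{i % 1000},value-{i % 97}\n".encode() for i in range(200_000)
    )
    for level in (1, 3, 6, 9):
        ratio, secs = measure(payload, level)
        print(f"level={level} ratio={ratio:.2f} time={secs * 1000:.1f} ms")
```

Reporting ratio and time side by side for each level is what makes the size-versus-speed trade-off visible; the same loop could wrap a zstd binding if one is available.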

7 participants