-
Notifications
You must be signed in to change notification settings - Fork 28.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-19112][CORE] add codec for ZStandard #17303
Conversation
Can one of the admins verify this patch? |
Same questions from last PR -- can this be something the user includes if needed or is there value in integrating it into Spark? where would it come into play and with what versions of Hadoop et al? |
this should not be needed just to use to write to hdfs. The regular hadoop input/output type formats have support for it if you are using the right version (I think hadoop 2.8). This seems to be adding the support to the spark.io.compression.codec for internal compression. From what I've heard zstd is better then the other codecs since it gives Gzip level Compression with Lz4 level CPU usage. So if you have a job that had a ton of intermediate data or was causing network issues you may want to use ztsd to get the gzip compression levels without much cpu penalty. @dongjinleekr It doesn't looks like you ran any manual tests on a real cluster? It would be nice to have some basic performance/compression numbers to show it actually working. Are you planning on actually using zstd in your spark deployment? |
Yes it'd be nice to have some benchmark on this. |
I did quick benchmarks by using a TPCDS query (Q4) (I just referred the previous work in #10342)
|
OK, seems like we should close this. |
class ZStandardCompressionCodec(conf: SparkConf) extends CompressionCodec { | ||
|
||
override def compressedOutputStream(s: OutputStream): OutputStream = { | ||
val level = conf.getSizeAsBytes("spark.io.compression.zstandard.level", "3").toInt |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use cases which favor speed over size should prefer using level 1.
Compression speed difference can be fairly large.
@Cyan4973 I quickly checked again;
|
@maropu : What about compression ratios ? |
What changes were proposed in this pull request?
Hadoop & HBase started to support ZStandard Compression from their recent releases. This update enables saving a file in HDFS using ZStandard Codec, by implementing ZStandardCodec. It also requires adding a new configuration for default compression level, for example, 'spark.io.compression.zstandard.level.'
How was this patch tested?
3 additional unit tests in
CompressionCodecSuite.scala
.