-
Notifications
You must be signed in to change notification settings - Fork 28.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-12388] change default compression to lz4 #10342
Conversation
We also need to update the documentation, don't we? |
Note that LZ4 does not support concatenation of serialized streams, meaning that sort-based shuffle won't be able to take advantage of a special spill-merging optimization. This might not make a huge difference in practice compared to the other speed benefits of LZ4. |
Test build #47876 has finished for PR 10342 at commit
|
@JoshRosen Can you explain that some more? What are we giving up? Would fixing it if we decide to just to add a simple header that consisted of the lengths of each stream? |
I think the main thing this is giving up is that previously we could use file channel's transferTo to copy data directly, but only when the compression blocks can be just concatenated at the byte level. Without that property, we'd have to read them (decompress) and then write them out again. |
I had patched LZ4BlockInputStream to support concated streams. |
Test build #47898 has finished for PR 10342 at commit
|
Test build #47908 has finished for PR 10342 at commit
|
* support {@link #mark(int)}/{@link #reset()}. | ||
* @see LZ4BlockOutputStream | ||
* | ||
* This is based on net.jpountz.lz4.LZ4BlockInputStream |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you comment what is modified? This is otherwise hard to review/maintain if lz4 updates.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See the changes here: davies/lz4-java@cc1fa94
Test build #47961 has finished for PR 10342 at commit
|
Test build #2231 has finished for PR 10342 at commit
|
ping @JoshRosen @rxin @nongli |
LGTM |
One thing I noticed when I was reviewing another pr just now: in the future we should also include test cases when pulling in 3rd party code. Otherwise it's pretty difficult to ensure correctness if we need to change the file. |
According the benchmark [1], LZ4-java could be 80% (or 30%) faster than Snappy.
After changing the compressor to LZ4, I saw 20% improvement on end-to-end time for a TPCDS query (Q4).
[1] https://github.com/ning/jvm-compressor-benchmark/wiki
cc @rxin