Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-12388] change default compression to lz4 #10342

Closed
wants to merge 4 commits into from
Closed

Conversation

davies
Copy link
Contributor

@davies davies commented Dec 17, 2015

According the benchmark [1], LZ4-java could be 80% (or 30%) faster than Snappy.

After changing the compressor to LZ4, I saw 20% improvement on end-to-end time for a TPCDS query (Q4).

[1] https://github.com/ning/jvm-compressor-benchmark/wiki

cc @rxin

@rxin
Copy link
Contributor

rxin commented Dec 17, 2015

We also need to update the documentation, don't we?

@JoshRosen
Copy link
Contributor

Note that LZ4 does not support concatenation of serialized streams, meaning that sort-based shuffle won't be able to take advantage of a special spill-merging optimization. This might not make a huge difference in practice compared to the other speed benefits of LZ4.

@SparkQA
Copy link

SparkQA commented Dec 17, 2015

Test build #47876 has finished for PR 10342 at commit 14940de.

  • This patch fails from timeout after a configured wait of 250m.
  • This patch merges cleanly.
  • This patch adds no public classes.

@nongli
Copy link
Contributor

nongli commented Dec 17, 2015

@JoshRosen Can you explain that some more? What are we giving up? Would fixing it if we decide to just to add a simple header that consisted of the lengths of each stream?

@rxin
Copy link
Contributor

rxin commented Dec 17, 2015

I think the main thing this is giving up is that previously we could use file channel's transferTo to copy data directly, but only when the compression blocks can be just concatenated at the byte level. Without that property, we'd have to read them (decompress) and then write them out again.

@davies
Copy link
Contributor Author

davies commented Dec 17, 2015

I had patched LZ4BlockInputStream to support concated streams.

@SparkQA
Copy link

SparkQA commented Dec 17, 2015

Test build #47898 has finished for PR 10342 at commit f49f1ef.

  • This patch fails from timeout after a configured wait of 250m.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Dec 17, 2015

Test build #47908 has finished for PR 10342 at commit 13390e8.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):\n * public final class LZ4BlockInputStream extends FilterInputStream\n

* support {@link #mark(int)}/{@link #reset()}.
* @see LZ4BlockOutputStream
*
* This is based on net.jpountz.lz4.LZ4BlockInputStream
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you comment what is modified? This is otherwise hard to review/maintain if lz4 updates.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See the changes here: davies/lz4-java@cc1fa94

@SparkQA
Copy link

SparkQA commented Dec 18, 2015

Test build #47961 has finished for PR 10342 at commit 83c77f4.

  • This patch fails from timeout after a configured wait of 250m.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):\n * public final class LZ4BlockInputStream extends FilterInputStream\n

@SparkQA
Copy link

SparkQA commented Dec 18, 2015

Test build #2231 has finished for PR 10342 at commit 83c77f4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):\n * public final class LZ4BlockInputStream extends FilterInputStream\n

@davies
Copy link
Contributor Author

davies commented Dec 18, 2015

ping @JoshRosen @rxin @nongli

@nongli
Copy link
Contributor

nongli commented Dec 21, 2015

LGTM

@asfgit asfgit closed this in 29cecd4 Dec 21, 2015
@rxin
Copy link
Contributor

rxin commented Oct 10, 2016

One thing I noticed when I was reviewing another pr just now: in the future we should also include test cases when pulling in 3rd party code. Otherwise it's pretty difficult to ensure correctness if we need to change the file.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants