Replace lz4 with lz4_flex Allowing Compilation for WASM #4884
Conversation
@@ -386,9 +386,6 @@ fn convert_csv_to_parquet(args: &Args) -> Result<(), ParquetFromCsvError> {
    Compression::BROTLI(_) => {
        Box::new(brotli::Decompressor::new(input_file, 0)) as Box<dyn Read>
    }
    Compression::LZ4 => Box::new(lz4::Decoder::new(input_file).map_err(|e| {
This will decode lz4 data encoded without any framing, which is so niche that I struggle to conceive of people relying on this functionality. Further this is a utility CLI tool, and so I'm not too concerned about this
I agree -- We can update the CSV tool if needed
parquet/src/compression.rs
Outdated
@@ -383,64 +383,6 @@ impl BrotliLevel {
    }
}

#[cfg(any(feature = "lz4", test))]
mod lz4_codec {
This codec has been replaced by LZ4HadoopCodec, so let's just remove it; it isn't used
What do you mean "replaced"? Is there something in the parquet standard?
Basically the standard didn't specify the framing, and so the ecosystem ended up with two 😄
That PR replaced this codec with LZ4HadoopCodec, which has an automatic fallback; that is what has been in use since then
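The fallback mentioned above hinges on telling the two framings apart: Hadoop's LZ4 framing prefixes each block with two big-endian u32 lengths (uncompressed then compressed), while a raw LZ4 block has no such header. A minimal sketch of such a detection check, in plain Rust; the function name and heuristic are illustrative only, not the actual logic in LZ4HadoopCodec:

```rust
// Hypothetical sketch, not the parquet crate's implementation.
// Hadoop-framed LZ4 prefixes each block with two big-endian u32s:
// the uncompressed length and the compressed length of the block.
fn looks_like_hadoop_frame(input: &[u8]) -> bool {
    if input.len() < 8 {
        // Too short to carry a Hadoop header at all.
        return false;
    }
    let uncompressed = u32::from_be_bytes(input[0..4].try_into().unwrap()) as usize;
    let compressed = u32::from_be_bytes(input[4..8].try_into().unwrap()) as usize;
    // Plausible header: the declared compressed length fits in the
    // remaining buffer and the block is non-empty.
    uncompressed > 0 && compressed > 0 && compressed <= input.len() - 8
}

fn main() {
    // A buffer carrying a Hadoop-style header (10 bytes uncompressed,
    // 3 bytes compressed) followed by 3 payload bytes.
    let mut framed = Vec::new();
    framed.extend_from_slice(&10u32.to_be_bytes());
    framed.extend_from_slice(&3u32.to_be_bytes());
    framed.extend_from_slice(&[0xAA, 0xBB, 0xCC]);
    assert!(looks_like_hadoop_frame(&framed));

    // Too short for a Hadoop header: would fall back to raw LZ4.
    assert!(!looks_like_hadoop_frame(&[0x04, 0x22, 0x4D]));
}
```

If a buffer fails a check like this, a codec with automatic fallback can retry decoding it as a raw LZ4 block instead.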
The code looks good to me -- do we have any performance numbers?
Also, I don't understand the "replaced lz4 with lz4hadoopcodec" comment. I am probably missing something.
cc @sunchao who might have more context on parquet / compression formats (or know someone who does)
Thank you for this @tustvold 🙏
I'm excited for this! I maintain https://github.com/kylebarron/parquet-wasm, which up until now hasn't been able to support lz4 for the arrow/parquet bindings. It might be of use to note that arrow2/parquet2 implemented support for both lz4 and lz4_flex, so that the end user could choose which to enable. jorgecarleitao/parquet2#124
Running the benchmarks shows this does appear to regress performance
In particular we see regressions
These benchmarks represent two fairly extreme cases, and it is likely that most realistic workloads sit somewhere in between and would see fairly minor regressions to both decompression and compression. This is consistent with lz4_flex's own benchmarking, which shows lz4_flex tending to perform better than lz4 at one of either compression or decompression for a given corpus. I am personally happy enough with the performance not to take on the complexity of maintaining two possible implementations, especially given how rarely LZ4 is used in the ecosystem (it was only properly standardised a few years ago), but I welcome other opinions
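For reference, the shape of a throughput micro-benchmark behind numbers like these can be sketched with the standard library alone. Here `round_trip` is a stand-in for the codec under test (a real comparison would compress and decompress via lz4 or lz4_flex), so the absolute figure it reports is meaningless; the point is only the measurement structure:

```rust
use std::time::Instant;

// Stand-in for the codec round trip under test; a real benchmark would
// call into lz4 or lz4_flex here instead of copying the buffer.
fn round_trip(data: &[u8]) -> Vec<u8> {
    data.to_vec()
}

// Mebibytes of input processed per second over `iters` iterations.
// black_box keeps the optimizer from eliding the work being timed.
fn throughput_mib_per_sec(data: &[u8], iters: u32) -> f64 {
    let start = Instant::now();
    for _ in 0..iters {
        std::hint::black_box(round_trip(std::hint::black_box(data)));
    }
    let secs = start.elapsed().as_secs_f64();
    (data.len() as f64 * iters as f64) / (1024.0 * 1024.0) / secs
}

fn main() {
    let data = vec![0u8; 64 * 1024];
    let mib_s = throughput_mib_per_sec(&data, 100);
    println!("{mib_s:.1} MiB/s");
}
```

A tool like criterion adds warm-up, outlier rejection, and statistical comparison on top of this basic loop, which is why the numbers above come from the project's benchmark suite rather than a hand-rolled timer.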
@alamb There are now two LZ4 codecs in Parquet: the old/deprecated "Hadoop" LZ4 and the new LZ4_RAW. There is an email thread and discussions on this: https://www.mail-archive.com/[email protected]/msg14529.html
That is about as good a rationale for removing LZ4 as I have heard
Which issue does this PR close?
Relates to apache/datafusion#7652 and apache/datafusion#7653
Rationale for this change
lz4_flex is a pure Rust implementation of LZ4 that achieves performance similar to the C library, with the added benefit of being compatible with WASM
What changes are included in this PR?
Are there any user-facing changes?
No, the only changes are to experimental modules