Replace lz4 with lz4_flex Allowing Compilation for WASM #4884
Conversation
@@ -386,9 +386,6 @@ fn convert_csv_to_parquet(args: &Args) -> Result<(), ParquetFromCsvError> {
    Compression::BROTLI(_) => {
        Box::new(brotli::Decompressor::new(input_file, 0)) as Box<dyn Read>
    }
    Compression::LZ4 => Box::new(lz4::Decoder::new(input_file).map_err(|e| {
This will decode lz4 data encoded without any framing, which is so niche that I struggle to conceive of people relying on this functionality. Further this is a utility CLI tool, and so I'm not too concerned about this
I agree -- We can update the CSV tool if needed
parquet/src/compression.rs
Outdated
@@ -383,64 +383,6 @@ impl BrotliLevel {
    }
}

#[cfg(any(feature = "lz4", test))]
mod lz4_codec {
This codec has been replaced by LZ4HadoopCodec, so let's just remove it; it isn't used
What do you mean "replaced"? Is there something in the parquet standard?
Basically the standard didn't specify the framing, and so the ecosystem ended up with two 😄
That PR replaced this codec with LZ4HadoopCodec, which has an automatic fallback; that is what has been in use since then
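The fallback mentioned above hinges on telling the two framings apart: Hadoop's LZ4 framing prefixes each block with two big-endian u32 lengths (uncompressed then compressed), while a raw LZ4 block has no such header. A minimal sketch of such a detection check, in plain Rust; the function name and heuristic are illustrative only, not the actual logic in LZ4HadoopCodec:

```rust
// Hypothetical sketch, not the parquet crate's implementation.
// Hadoop-framed LZ4 prefixes each block with two big-endian u32s:
// the uncompressed length and the compressed length of the block.
fn looks_like_hadoop_frame(input: &[u8]) -> bool {
    if input.len() < 8 {
        // Too short to carry a Hadoop header at all.
        return false;
    }
    let uncompressed = u32::from_be_bytes(input[0..4].try_into().unwrap()) as usize;
    let compressed = u32::from_be_bytes(input[4..8].try_into().unwrap()) as usize;
    // Plausible header: the declared compressed length fits in the
    // remaining buffer and the block is non-empty.
    uncompressed > 0 && compressed > 0 && compressed <= input.len() - 8
}

fn main() {
    // A buffer carrying a Hadoop-style header (10 bytes uncompressed,
    // 3 bytes compressed) followed by 3 payload bytes.
    let mut framed = Vec::new();
    framed.extend_from_slice(&10u32.to_be_bytes());
    framed.extend_from_slice(&3u32.to_be_bytes());
    framed.extend_from_slice(&[0xAA, 0xBB, 0xCC]);
    assert!(looks_like_hadoop_frame(&framed));

    // Too short for a Hadoop header: would fall back to raw LZ4.
    assert!(!looks_like_hadoop_frame(&[0x04, 0x22, 0x4D]));
}
```

If a buffer fails a check like this, a codec with automatic fallback can retry decoding it as a raw LZ4 block instead.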
The code looks good to me -- do we have any performance numbers?
Also, I don't understand the "replaced lz4 with lz4hadoopcodec" comment. I am probably missing something.
cc @sunchao who might have more context on parquet / compression formats (or know someone who does)
Thank you for this @tustvold 🙏
I'm excited for this! I maintain https://github.com/kylebarron/parquet-wasm, which up until now hasn't been able to support lz4 for the arrow/parquet bindings. It might be of use to note that arrow2/parquet2 implemented support for both lz4 and lz4_flex, so that the end user could choose which to enable. jorgecarleitao/parquet2#124
Running the benchmarks shows this does appear to regress performance
In particular we see regressions
These benchmarks represent two fairly extreme cases, and it is likely that most realistic workloads sit somewhere in between and would see fairly minor regressions to both decompression and compression. This is consistent with lz4_flex's own benchmarking, which shows lz4_flex tending to perform better than lz4 at one of either compression or decompression for a given corpus. I am personally happy enough with the performance not to take on the complexity of maintaining two possible implementations, especially given how rarely LZ4 is used in the ecosystem (it was only properly standardised a few years ago), but I welcome other opinions
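For reference, the shape of a throughput micro-benchmark behind numbers like these can be sketched with the standard library alone. Here `round_trip` is a stand-in for the codec under test (a real comparison would compress and decompress via lz4 or lz4_flex), so the absolute figure it reports is meaningless; the point is only the measurement structure:

```rust
use std::time::Instant;

// Stand-in for the codec round trip under test; a real benchmark would
// call into lz4 or lz4_flex here instead of copying the buffer.
fn round_trip(data: &[u8]) -> Vec<u8> {
    data.to_vec()
}

// Mebibytes of input processed per second over `iters` iterations.
// black_box keeps the optimizer from eliding the work being timed.
fn throughput_mib_per_sec(data: &[u8], iters: u32) -> f64 {
    let start = Instant::now();
    for _ in 0..iters {
        std::hint::black_box(round_trip(std::hint::black_box(data)));
    }
    let secs = start.elapsed().as_secs_f64();
    (data.len() as f64 * iters as f64) / (1024.0 * 1024.0) / secs
}

fn main() {
    let data = vec![0u8; 64 * 1024];
    let mib_s = throughput_mib_per_sec(&data, 100);
    println!("{mib_s:.1} MiB/s");
}
```

A tool like criterion adds warm-up, outlier rejection, and statistical comparison on top of this basic loop, which is why the numbers above come from the project's benchmark suite rather than a hand-rolled timer.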
@alamb There are now two LZ4 codecs in Parquet: the old/deprecated "Hadoop" LZ4 and the new LZ4_RAW. There is an email thread and discussions on this: https://www.mail-archive.com/[email protected]/msg14529.html
That is about as good a rationale for removing LZ4 as I have heard
Which issue does this PR close?
Relates to apache/datafusion#7652 and apache/datafusion#7653
Rationale for this change
lz4_flex is a pure Rust implementation of LZ4 that achieves performance similar to the C library, with the added benefit of being compatible with WASM
What changes are included in this PR?
Are there any user-facing changes?
No, the only changes are to experimental modules