Added optional support for LZ4 via LZ4-flex crate (thus enabling wasm) #124
Conversation
I tried to test this out via this branch, but ran into errors.
Thanks a lot for the feedback! You need to use jorgecarleitao/arrow2#923, which is waiting for this PR to land and a new release of parquet2 to depend on it :)
I updated this branch to test, pointing to a fork of jorgecarleitao/arrow2#923 so that I could point at this PR. When trying to load an LZ4-compressed file written by pyarrow, it still fails. I'm guessing that this is the difference between the two LZ4 Parquet implementations: pyarrow apparently writes files that advertise LZ4 but use a different framing.
Note that this is running locally, not within wasm. It uses this entry point. (I figured I needed a way to debug files outside of wasm, exactly for cases like this 😄) The LZ4 parquet test file is here, made by this python script.
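As a rough sketch of that kind of local debugging, the snippet below reads only the footer and prints the compression codec each column chunk advertises. It assumes parquet2's `read::read_metadata` entry point, a public `row_groups` field, and `columns()` / `compression()` accessors; it is not the entry point referenced above.

```rust
// Sketch: print the compression codec a parquet file's footer advertises.
// Assumes parquet2 exposes `read::read_metadata`, a public `row_groups` field,
// and `columns()` / `compression()` accessors (verify against the crate docs).
use std::fs::File;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let path = std::env::args()
        .nth(1)
        .expect("usage: inspect <file.parquet>");
    let mut reader = File::open(path)?;

    // Only the footer/metadata is read here; no pages are decompressed.
    let metadata = parquet2::read::read_metadata(&mut reader)?;

    for (rg, row_group) in metadata.row_groups.iter().enumerate() {
        for column in row_group.columns() {
            // This codec is whatever the writer (e.g. pyarrow) recorded.
            println!("row group {}: {:?}", rg, column.compression());
        }
    }
    Ok(())
}
```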
@kylebarron, I am able to read it:

```python
import pyarrow as pa  # pyarrow==7.0.0
import pyarrow.parquet

path = "bla.parquet"

t = pa.table(
    [pa.array([0, 1, None, 3, None, 5, 6, 7, None, 9])],
    schema=pa.schema([pa.field("int64", pa.int64(), nullable=True)]),
)

pyarrow.parquet.write_table(
    t,
    path,
    use_dictionary=False,
    compression="LZ4",
)
```

(using `cargo run --example parquet_read --features io_parquet,io_parquet_compression -- bla.parquet`)

Note that the file can't be read by Spark, though:

```python
import pyspark.sql  # pyspark==3.1.2

spark = pyspark.sql.SparkSession.builder.getOrCreate()
result = spark.read.parquet("bla.parquet").select("int64").collect()
```

fails.

I think that there is a big problem with LZ4 in the ecosystem - I can see at least 3 implementations of LZ4 around: https://github.com/apache/arrow/blob/bf18e6e4b5bb6180706b1ba0d597a65a4ce5ca48/cpp/src/arrow/util/compression_lz4.cc
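As a rough illustration of that ambiguity (a sketch using lz4-flex's block API, not the exact framings pyarrow, Spark, or arrow-cpp use), two "LZ4" encodings of the same bytes are different byte streams, and each needs the matching decoder:

```rust
// Sketch: the same bytes encoded two ways with lz4_flex's block API.
// A reader must know which layout to expect; guessing wrong fails to decode.
fn main() {
    let data = b"some bytes that will be lz4-compressed";

    // Raw LZ4 block: the decompressed size must be known out-of-band
    // (for Parquet pages it comes from the page header).
    let raw = lz4_flex::compress(data);
    let back_raw = lz4_flex::decompress(&raw, data.len()).unwrap();
    assert_eq!(back_raw, &data[..]);

    // Size-prepended block: a 4-byte length header precedes the compressed
    // block, so the byte stream differs from the raw one above.
    let prefixed = lz4_flex::compress_prepend_size(data);
    let back_prefixed = lz4_flex::decompress_size_prepended(&prefixed).unwrap();
    assert_eq!(back_prefixed, &data[..]);

    // The two encodings are different byte streams even for identical input.
    assert_ne!(raw, prefixed);
    println!("raw: {} bytes, size-prepended: {} bytes", raw.len(), prefixed.len());
}
```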
Aha! I had pyarrow v6 (I think) installed before, and upgrading to pyarrow v7 did indeed produce different files. Diffing the generated parquet files, it looks like the older file had a different string in its footer. That read command now works for me. Thanks for the PR!
This PR was done together with @kylebarron and adds an optional dependency, lz4-flex by @PSeitz, as an LZ4 compressor/decompressor. This implementation is a bit slower, but it uses no `unsafe` and is written in native Rust, and therefore supports being compiled to wasm.
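For illustration, here is a minimal sketch of how an lz4-flex-backed decompression path could be gated behind a cargo feature. The feature name `lz4_flex`, the function name, and the error handling are assumptions for the example, not the exact code added by this PR.

```rust
// Sketch: gating an lz4-flex-backed decompressor behind a cargo feature.
// Feature name, function name, and error handling are illustrative only.

/// Decompress a raw LZ4 block into `output`, whose length is the uncompressed
/// size taken from the Parquet page header.
#[cfg(feature = "lz4_flex")]
pub fn decompress_lz4(input: &[u8], output: &mut [u8]) -> Result<(), Box<dyn std::error::Error>> {
    // lz4_flex::decompress allocates a Vec with the decompressed bytes.
    let decompressed = lz4_flex::decompress(input, output.len())?;
    if decompressed.len() != output.len() {
        return Err("decompressed length did not match the page header".into());
    }
    output.copy_from_slice(&decompressed);
    Ok(())
}

#[cfg(not(feature = "lz4_flex"))]
pub fn decompress_lz4(_input: &[u8], _output: &mut [u8]) -> Result<(), Box<dyn std::error::Error>> {
    Err("LZ4 support requires enabling the `lz4_flex` feature (or a C-based lz4 feature)".into())
}
```

In `Cargo.toml` the crate would then be listed as an optional dependency tied to that feature, so wasm builds can opt into the pure-Rust path while native builds can keep a C-backed lz4 binding.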