Compression errors with feather/parquet export (lz4/zstd) #2018

Closed
alexander-beedie opened this issue Dec 9, 2021 · 4 comments · Fixed by #2035

Comments

@alexander-beedie
Collaborator

alexander-beedie commented Dec 9, 2021

Versions

Python 3.9 / Polars 0.10.27 / Windows 10

Describe your bug.

Files exported to compressed feather/parquet cannot be read back when use_pyarrow=True; reads succeed only with use_pyarrow=False.
Errors include:

  • OSError: ZSTD decompression failed: Src size is incorrect.
  • OSError: Lz4 compressed input contains more than one frame.
  • OSError: Corrupt Lz4 compressed data.

What are the steps to reproduce the behavior?

# setup simple test data/frame
import polars as pl
from datetime import date

test_data = {
    'key':['abc','mno','xyz'], 
    'date':[date(1973,3,5),date(2000,8,20),date(2022,12,31)],
    'value':[-10.5,None,9999.0],
}
df = pl.DataFrame( data=test_data )

# export to compressed feather/parquet files
df.to_ipc( 'test1.feather', compression='lz4' )
df.to_ipc( 'test2.feather', compression='zstd' )
df.to_parquet( 'test1.parquet', compression='lz4' )
df.to_parquet( 'test2.parquet', compression='zstd' )

What is the actual behavior?

pl.read_ipc( 'test1.feather' )
# OSError: Lz4 compressed input contains more than one frame

pl.read_ipc( 'test2.feather' )
# OSError: ZSTD decompression failed: Src size is incorrect

pl.read_parquet( 'test1.parquet' )
# OSError: Corrupt Lz4 compressed data.

pl.read_parquet( 'test2.parquet' )
# succeeds for parquet with zstd compression

What is the expected behavior?

Successful load into DataFrame for round-trip import/export to compressed feather/parquet from Polars with default settings.

@alexander-beedie alexander-beedie changed the title Compression issues with feather/parquet export (lz4/zstd) Compression errors with feather/parquet export (lz4/zstd) Dec 9, 2021
@ritchie46
Member

Could you open this issue upstream in arrow2? The parquet and IPC implementations are from arrow2.

@jorgecarleitao
Collaborator

Oh, sorry, I missed this one. Looking into it.

@jorgecarleitao
Collaborator

jorgecarleitao commented Dec 11, 2021

With respect to the parquet files, I concluded that the likely cause is in pyarrow itself, as parquet files written by pyarrow with LZ4 and ZSTD compression are unreadable by (py)spark. Filed a bug upstream: https://issues.apache.org/jira/browse/ARROW-15073

Will now look into the feather one.

@alexander-beedie
Collaborator Author

alexander-beedie commented Dec 12, 2021

Many thanks, all. I'll try all other possible permutations and, for now, hide the compression options that don't round-trip successfully on our end, and I'll report any other interesting results (upstream, as requested ;).
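
For anyone wanting to run the same kind of permutation check, here is a minimal sketch of a round-trip harness. It is not the Polars test code: the helper names (`round_trips`, `supported_codecs`) are hypothetical, and standard-library codecs stand in for the lz4/zstd options so the snippet runs without any third-party packages. The "broken" entry deliberately pairs a writer with the wrong reader to mimic the failures reported above.

```python
import bz2
import gzip
import lzma
import tempfile
from pathlib import Path

# Stand-in codecs from the standard library (an assumption for illustration);
# in practice each entry would wrap a real export/read pair for one
# (file format, compression) combination.
CODECS = {
    "gzip": (gzip.compress, gzip.decompress),
    "bz2": (bz2.compress, bz2.decompress),
    "lzma": (lzma.compress, lzma.decompress),
    # Deliberately mismatched pair: simulates a writer/reader disagreement.
    "broken": (gzip.compress, bz2.decompress),
}


def round_trips(write, read, payload, path):
    """Write payload to path, read it back, and return (ok, error_message)."""
    try:
        path.write_bytes(write(payload))
        return read(path.read_bytes()) == payload, None
    except (OSError, ValueError, EOFError, lzma.LZMAError) as exc:
        return False, f"{type(exc).__name__}: {exc}"


def supported_codecs(codecs, payload):
    """Return (names that round-trip, full per-codec results)."""
    results = {}
    with tempfile.TemporaryDirectory() as tmp:
        for name, (write, read) in codecs.items():
            target = Path(tmp) / f"test.{name}"
            results[name] = round_trips(write, read, payload, target)
    ok_names = [name for name, (ok, _) in results.items() if ok]
    return ok_names, results


if __name__ == "__main__":
    sample = b"abc,1973-03-05,-10.5\n" * 10
    ok_names, results = supported_codecs(CODECS, sample)
    print("round-trip OK:", ok_names)
    for name, (ok, err) in results.items():
        if not ok:
            print(f"hide option {name!r}: {err}")
```

Only the names in the first return value would be offered as compression options; the recorded error strings are the ones worth attaching to an upstream report.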

3 participants