
Unable to load Feather v2 files created by pyarrow and pandas. #286

Closed
ghuls opened this issue May 11, 2021 · 15 comments
@ghuls
Contributor

ghuls commented May 11, 2021

Describe the bug

The original bug report is here (against polars, which was using arrow-rs for parsing Feather v2 (IPC) files):
pola-rs/polars#623

Unable to load Feather v2 files created by pyarrow and pandas.

Those files can be loaded fine by pyarrow and pandas themselves.

To Reproduce
Steps to reproduce the behavior:

Try to load the attached Feather files:
test_feather_file.zip

test_pandas.feather: original Feather file, written with pandas.
test_arrow.feather: test_pandas.feather loaded with pyarrow and saved again with pyarrow: df_pa = pa.feather.read_feather('test_pandas.feather')
test_polars.feather: test_pandas.feather loaded with pyarrow and saved with polars (this one can be read by arrow-rs).
test_pandas_from_polars.feather: test_polars.feather loaded with polars, converted with to_pandas(), and saved with pandas.

Expected behavior

Feather v2 files can be opened by arrow-rs.

Additional context

import polars as pl
import pyarrow as pa
import pandas as pd

# Reading the Feather file created with pandas works fine with pyarrow.
# (Note: pa.feather.read_feather returns a pandas DataFrame.)
df_pa = pa.feather.read_feather('test_pandas.feather')

# Write the dataframe back to a Feather file.
df_pa.to_feather('test_arrow.feather')

# Convert the dataframe to a polars dataframe.
df_pl = pl.DataFrame(df_pa)

# Convert the polars dataframe to a pandas dataframe.
df_pd = df_pl.to_pandas()

# Write the pandas dataframe to a Feather file.
df_pd.to_feather('test_pandas_from_polars.feather')


In [88]: df_pa
Out[88]: 
   motif1  motif2  motif3  motif4 regions
0     1.2     3.0     0.3     5.6    reg1
1     6.7     3.0     4.3     5.6    reg2
2     3.5     3.0     0.0     0.0    reg3
3     0.0     3.0     0.0     5.6    reg4
4     2.4     3.0     7.8     1.2    reg5
5     2.4     3.0     0.6     0.0    reg6
6     2.4     3.0     7.7     0.0    reg7

In [89]: df_pl
Out[89]: 
shape: (7, 5)
╭────────┬────────┬────────┬────────┬─────────╮
│ motif1 ┆ motif2 ┆ motif3 ┆ motif4 ┆ regions │
│ ---    ┆ ---    ┆ ---    ┆ ---    ┆ ---     │
│ f64    ┆ f64    ┆ f64    ┆ f64    ┆ str     │
╞════════╪════════╪════════╪════════╪═════════╡
│ 1.2    ┆ 3      ┆ 0.3    ┆ 5.6    ┆ "reg1"  │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 6.7    ┆ 3      ┆ 4.3    ┆ 5.6    ┆ "reg2"  │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 3.5    ┆ 3      ┆ 0.0    ┆ 0.0    ┆ "reg3"  │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 0.0    ┆ 3      ┆ 0.0    ┆ 5.6    ┆ "reg4"  │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 2.4    ┆ 3      ┆ 7.8    ┆ 1.2    ┆ "reg5"  │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 2.4    ┆ 3      ┆ 0.6    ┆ 0.0    ┆ "reg6"  │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 2.4    ┆ 3      ┆ 7.7    ┆ 0.0    ┆ "reg7"  │
╰────────┴────────┴────────┴────────┴─────────╯

In [90]: df_pd
Out[90]: 
   motif1  motif2  motif3  motif4 regions
0     1.2     3.0     0.3     5.6    reg1
1     6.7     3.0     4.3     5.6    reg2
2     3.5     3.0     0.0     0.0    reg3
3     0.0     3.0     0.0     5.6    reg4
4     2.4     3.0     7.8     1.2    reg5
5     2.4     3.0     0.6     0.0    reg6
6     2.4     3.0     7.7     0.0    reg7



In [103]: pl.read_ipc('test_polars.feather')
Out[103]: 
shape: (7, 5)
╭────────┬────────┬────────┬────────┬─────────╮
│ motif1 ┆ motif2 ┆ motif3 ┆ motif4 ┆ regions │
│ ---    ┆ ---    ┆ ---    ┆ ---    ┆ ---     │
│ f64    ┆ f64    ┆ f64    ┆ f64    ┆ str     │
╞════════╪════════╪════════╪════════╪═════════╡
│ 1.2    ┆ 3      ┆ 0.3    ┆ 5.6    ┆ "reg1"  │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 6.7    ┆ 3      ┆ 4.3    ┆ 5.6    ┆ "reg2"  │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 3.5    ┆ 3      ┆ 0.0    ┆ 0.0    ┆ "reg3"  │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 0.0    ┆ 3      ┆ 0.0    ┆ 5.6    ┆ "reg4"  │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 2.4    ┆ 3      ┆ 7.8    ┆ 1.2    ┆ "reg5"  │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 2.4    ┆ 3      ┆ 0.6    ┆ 0.0    ┆ "reg6"  │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 2.4    ┆ 3      ┆ 7.7    ┆ 0.0    ┆ "reg7"  │
╰────────┴────────┴────────┴────────┴─────────╯

In [104]: pl.read_ipc('test_arrow.feather')
thread '<unnamed>' panicked at 'assertion failed: prefix.is_empty() && suffix.is_empty()', /github/home/.cargo/git/checkouts/arrow-rs-3b86e19e889d5acc/d008f31/arrow/src/buffer/immutable.rs:179:9
---------------------------------------------------------------------------
PanicException                            Traceback (most recent call last)
<ipython-input-104-f9a22f9a0eb1> in <module>
----> 1 pl.read_ipc('test_arrow.feather')

~/software/anaconda3/envs/create_cistarget_databases/lib/python3.8/site-packages/polars/functions.py in read_ipc(file)
    278     """
    279     file = _prepare_file_arg(file)
--> 280     return DataFrame.read_ipc(file)
    281 
    282 

~/software/anaconda3/envs/create_cistarget_databases/lib/python3.8/site-packages/polars/frame.py in read_ipc(file)
    235         """
    236         self = DataFrame.__new__(DataFrame)
--> 237         self._df = PyDataFrame.read_ipc(file)
    238         return self
    239 

PanicException: assertion failed: prefix.is_empty() && suffix.is_empty()

In [105]: pl.read_ipc('test_pandas.feather')
thread '<unnamed>' panicked at 'assertion failed: prefix.is_empty() && suffix.is_empty()', /github/home/.cargo/git/checkouts/arrow-rs-3b86e19e889d5acc/d008f31/arrow/src/buffer/immutable.rs:179:9
---------------------------------------------------------------------------
PanicException                            Traceback (most recent call last)
<ipython-input-105-35809d9ae65f> in <module>
----> 1 pl.read_ipc('test_pandas.feather')

~/software/anaconda3/envs/create_cistarget_databases/lib/python3.8/site-packages/polars/functions.py in read_ipc(file)
    278     """
    279     file = _prepare_file_arg(file)
--> 280     return DataFrame.read_ipc(file)
    281 
    282 

~/software/anaconda3/envs/create_cistarget_databases/lib/python3.8/site-packages/polars/frame.py in read_ipc(file)
    235         """
    236         self = DataFrame.__new__(DataFrame)
--> 237         self._df = PyDataFrame.read_ipc(file)
    238         return self
    239 

PanicException: assertion failed: prefix.is_empty() && suffix.is_empty()

In [106]: pl.read_ipc('test_pandas_from_polars.feather')
thread '<unnamed>' panicked at 'assertion failed: prefix.is_empty() && suffix.is_empty()', /github/home/.cargo/git/checkouts/arrow-rs-3b86e19e889d5acc/d008f31/arrow/src/buffer/immutable.rs:179:9
---------------------------------------------------------------------------
PanicException                            Traceback (most recent call last)
<ipython-input-107-d0a17f51c6ac> in <module>
----> 1 pl.read_ipc('test_pandas_from_polars.feather')

~/software/anaconda3/envs/create_cistarget_databases/lib/python3.8/site-packages/polars/functions.py in read_ipc(file)
    278     """
    279     file = _prepare_file_arg(file)
--> 280     return DataFrame.read_ipc(file)
    281 
    282 

~/software/anaconda3/envs/create_cistarget_databases/lib/python3.8/site-packages/polars/frame.py in read_ipc(file)
    235         """
    236         self = DataFrame.__new__(DataFrame)
--> 237         self._df = PyDataFrame.read_ipc(file)
    238         return self
    239 

PanicException: assertion failed: prefix.is_empty() && suffix.is_empty()
@ghuls ghuls added the bug label May 11, 2021
@jorgecarleitao
Member

I did not know this: is feather compatible with IPC?

@ghuls
Contributor Author

ghuls commented May 12, 2021

It should be the IPC format on disk, with optional lz4 or zstd compression:

https://arrow.apache.org/docs/python/feather.html
https://ursalabs.org/blog/2020-feather-v2/

Feather v1 is indeed a totally different format (header bytes: FEA1 instead of ARROW1).
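For illustration, the two formats can be told apart by their leading magic bytes. A minimal sketch (`magic_to_version` and `feather_version` are hypothetical helper names, not part of any library):

```python
# Hypothetical helpers: classify a Feather file by its leading magic bytes.
# Feather v1 files start with b"FEA1"; Feather v2 files are plain Arrow IPC
# files and start with b"ARROW1".

def magic_to_version(head: bytes):
    """Map the first bytes of a file to a Feather version (1, 2, or None)."""
    if head.startswith(b"FEA1"):
        return 1
    if head.startswith(b"ARROW1"):
        return 2
    return None

def feather_version(path: str):
    """Read the first 8 bytes of `path` and classify them."""
    with open(path, "rb") as f:
        return magic_to_version(f.read(8))
```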

@ghuls
Contributor Author

ghuls commented May 12, 2021

Here is the original commit that introduced Feather v2 support in Arrow: apache/arrow@e03251c

@jorgecarleitao
Member

jorgecarleitao commented May 12, 2021

Nice, learnt something new today. Thanks for the explanation

This is indeed a bug, and a dangerous one because that prefix and suffix imply that we allowed misaligned bytes to go to the MutableBuffer (that check is like the last line of defense against UB).

@jorgecarleitao
Member

I investigated this and there is something funny going on: the file reports that there is an array whose buffer of type u8 has 201326592 slots, but the buffers' total length is 51. This happens on the 5th column, which is a Utf8.

This behavior is consistent between test_pandas.feather and test_arrow.feather in the zip.

That number of slots seems incorrect. I need to check whether this is a problem while reading those slots from the file or whether they are already written like that.

@jorgecarleitao
Member

More details: in both files, I am getting the following:

Reading Utf8
field_node: FieldNode { length: 7, null_count: 0 }
offset buffer: Buffer { offset: 200, length: 55 }
offsets: [32, 0, 407708164, 545407072, 8388608, 67108864, 134217728, 201326592]
values buffer: Buffer { offset: 256, length: 51 }
  • offsets[0] != 0 indicates a problem: offsets are expected to start from zero on any array with offsets.
  • offsets[i+1] < offsets[i] for some i, which indicates a problem: offsets are expected to be monotonically increasing.

I do not have a root cause yet, these are just observations.
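As a sketch of those invariants (`validate_utf8_offsets` is a made-up helper name, not actual arrow-rs code), a validator for the offsets buffer of a Utf8 array could look like this:

```python
# Sketch of the invariants noted above for a Utf8 array's offsets buffer
# (hypothetical helper, not arrow-rs code): offsets must start at zero, be
# monotonically non-decreasing, and the last offset must fit in the values
# buffer.

def validate_utf8_offsets(offsets, values_len):
    if not offsets or offsets[0] != 0:
        return False  # offsets must start at zero
    if any(b < a for a, b in zip(offsets, offsets[1:])):
        return False  # offsets must be monotonically increasing
    return offsets[-1] <= values_len  # last offset bounded by values length

# The offsets dumped from the broken file violate all three rules:
bad_offsets = [32, 0, 407708164, 545407072, 8388608, 67108864, 134217728, 201326592]
```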

@ghuls
Contributor Author

ghuls commented May 12, 2021

It makes sense that you see the same in the Feather files created by pyarrow and by pandas, as pandas uses the same pyarrow.feather code: https://github.com/pandas-dev/pandas/blob/059c8bac51e47d6eaaa3e36d6a293a22312925e6/pandas/io/feather_format.py

@ghuls
Contributor Author

ghuls commented May 12, 2021

Could it be that the difference you see is due to the streaming IPC format vs the random-access IPC file format?

For most cases, it is most convenient to use the RecordBatchStreamReader or RecordBatchFileReader class, depending on which variant of the IPC format you want to read. The former requires a InputStream source, while the latter requires a RandomAccessFile.

Reading Arrow IPC data is inherently zero-copy if the source allows it. For example, a BufferReader or MemoryMappedFile can typically be zero-copy. Exceptions are when the data must be transformed on the fly, e.g. when buffer compression has been enabled on the IPC stream or file.

https://arrow.apache.org/docs/cpp/ipc.html

@ghuls
Contributor Author

ghuls commented May 18, 2021

IPC File Format

We define a “file format” supporting random access that is build with the stream format. The file starts and ends with a magic string ARROW1 (plus padding). What follows in the file is identical to the stream format. At the end of the file, we write a footer containing a redundant copy of the schema (which is a part of the streaming format) plus memory offsets and sizes for each of the data blocks in the file. This enables random access any record batch in the file. See File.fbs for the precise details of the file footer.

Schematically we have:

<magic number "ARROW1">
<empty padding bytes [to 8 byte boundary]>
<STREAMING FORMAT with EOS>
<FOOTER>
<FOOTER SIZE: int32>
<magic number "ARROW1">

In the file format, there is no requirement that dictionary keys should be defined in a DictionaryBatch before they are used in a RecordBatch, as long as the keys are defined somewhere in the file. Furthermore, it is invalid to have more than one non-delta dictionary batch per dictionary ID (i.e. dictionary replacement is not supported). Delta dictionaries are applied in the order they appear in the file footer.

https://arrow.apache.org/docs/format/Columnar.html#ipc-file-format
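Following the trailer layout quoted above, a reader locates the footer by walking backwards from the end of the file. A stdlib-only sketch (`read_footer` is a hypothetical helper, not a real pyarrow or arrow-rs API):

```python
# Stdlib-only sketch of locating the footer in an Arrow IPC file, following
# the trailer layout: <FOOTER> <FOOTER SIZE: int32, little-endian> <"ARROW1">.
import struct

MAGIC = b"ARROW1"

def read_footer(data: bytes) -> bytes:
    """Return the raw flatbuffer Footer bytes of an IPC file image."""
    if not (data.startswith(MAGIC) and data.endswith(MAGIC)):
        raise ValueError("missing ARROW1 magic; not an IPC file")
    # The int32 footer size sits immediately before the trailing magic.
    size_pos = len(data) - len(MAGIC) - 4
    (footer_len,) = struct.unpack_from("<i", data, size_pos)
    if footer_len < 0 or size_pos - footer_len < len(MAGIC):
        raise ValueError("corrupt footer length")
    return data[size_pos - footer_len : size_pos]
```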

@ghuls
Contributor Author

ghuls commented Jun 16, 2021

@jorgecarleitao There is a recent commit on arrow that improves the documentation of the arrow IPC file format:
apache/arrow@59c5781#diff-8b68cf6859e881f2357f5df64bb073135d7ff6eeb51f116418660b3856564c60L1011-R1023

IPC File Format
---------------

- We define a "file format" supporting random access that is build with
- the stream format. The file starts and ends with a magic string
- ``ARROW1`` (plus padding). What follows in the file is identical to
- the stream format. At the end of the file, we write a *footer*
- containing a redundant copy of the schema (which is a part of the
- streaming format) plus memory offsets and sizes for each of the data
- blocks in the file. This enables random access any record batch in the
- file. See `File.fbs`_ for the precise details of the file footer.
+ We define a "file format" supporting random access that is an extension of
+ the stream format. The file starts and ends with a magic string ``ARROW1``
+ (plus padding). What follows in the file is identical to the stream format.
+ At the end of the file, we write a *footer* containing a redundant copy of
+ the schema (which is a part of the streaming format) plus memory offsets and
+ sizes for each of the data blocks in the file. This enables random access to
+ any record batch in the file. See `File.fbs`_ for the precise details of the
+ file footer.

@ghuls
Contributor Author

ghuls commented Jun 21, 2021

@jorgecarleitao I think I might have figured out the problem.

import polars as pl
import pyarrow as pa
import pandas as pd

# Read the Feather file written with pandas, with pa.feather.read_feather (wrapped inside pl.read_ipc), into a Polars dataframe.
df_pl = pl.read_ipc('test_pandas.feather', use_pyarrow=True)

# Convert Polars dataframe to arrow table and write to Feather v2 file without compression (with pyarrow).
pa.feather.write_feather(df_pl.to_arrow(), 'test_polars_to_arrow_uncompressed.feather', compression='uncompressed', version=2)

# Convert Polars dataframe to arrow table and write to Feather v2 file with lz4 compression (with pyarrow).
pa.feather.write_feather(df_pl.to_arrow(), 'test_polars_to_arrow_lz4.feather', compression='lz4', version=2)

# Convert Polars dataframe to arrow table and convert arrow table to pandas dataframe and write to Feather v2 file without compression (with pyarrow).
pa.feather.write_feather(df_pl.to_arrow().to_pandas(), 'test_polars_to_arrow_to_pandas_uncompressed.feather', compression='uncompressed', version=2)

# Convert Polars dataframe to arrow table and convert arrow table to pandas dataframe and write to Feather v2 file with lz4 compression (with pyarrow).
pa.feather.write_feather(df_pl.to_arrow().to_pandas(), 'test_polars_to_arrow_to_pandas_lz4.feather', compression='lz4', version=2)


# Now try to read all those files with polars without using the pyarrow Feather reading code, but the arrow-rs code instead.

# Reading Feather v2 file without compression containing saved arrow table data, works.
In [9]: pl.read_ipc('test_polars_to_arrow_uncompressed.feather', use_pyarrow=False)
Out[9]: 
shape: (7, 5)
╭────────────────────┬────────┬─────────────────────┬────────────────────┬─────────╮
│ motif1             ┆ motif2 ┆ motif3              ┆ motif4             ┆ regions │
│ ---                ┆ ---    ┆ ---                 ┆ ---                ┆ ---     │
│ f32                ┆ f32    ┆ f32                 ┆ f32                ┆ str     │
╞════════════════════╪════════╪═════════════════════╪════════════════════╪═════════╡
│ 1.2000000476837158 ┆ 3      ┆ 0.30000001192092896 ┆ 5.599999904632568  ┆ "reg1"  │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 6.699999809265137  ┆ 3      ┆ 4.300000190734863   ┆ 5.599999904632568  ┆ "reg2"  │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 3.5                ┆ 3      ┆ 0.0                 ┆ 0.0                ┆ "reg3"  │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 0.0                ┆ 3      ┆ 0.0                 ┆ 5.599999904632568  ┆ "reg4"  │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 2.4000000953674316 ┆ 3      ┆ 7.800000190734863   ┆ 1.2000000476837158 ┆ "reg5"  │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 2.4000000953674316 ┆ 3      ┆ 0.6000000238418579  ┆ 0.0                ┆ "reg6"  │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 2.4000000953674316 ┆ 3      ┆ 7.699999809265137   ┆ 0.0                ┆ "reg7"  │
╰────────────────────┴────────┴─────────────────────┴────────────────────┴─────────╯


# Reading Feather v2 file without compression containing saved pandas dataframe, works.
In [10]: pl.read_ipc('test_polars_to_arrow_to_pandas_uncompressed.feather', use_pyarrow=False)
Out[10]: 
shape: (7, 5)
╭────────────────────┬────────┬─────────────────────┬────────────────────┬─────────╮
│ motif1             ┆ motif2 ┆ motif3              ┆ motif4             ┆ regions │
│ ---                ┆ ---    ┆ ---                 ┆ ---                ┆ ---     │
│ f32                ┆ f32    ┆ f32                 ┆ f32                ┆ str     │
╞════════════════════╪════════╪═════════════════════╪════════════════════╪═════════╡
│ 1.2000000476837158 ┆ 3      ┆ 0.30000001192092896 ┆ 5.599999904632568  ┆ "reg1"  │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 6.699999809265137  ┆ 3      ┆ 4.300000190734863   ┆ 5.599999904632568  ┆ "reg2"  │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 3.5                ┆ 3      ┆ 0.0                 ┆ 0.0                ┆ "reg3"  │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 0.0                ┆ 3      ┆ 0.0                 ┆ 5.599999904632568  ┆ "reg4"  │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 2.4000000953674316 ┆ 3      ┆ 7.800000190734863   ┆ 1.2000000476837158 ┆ "reg5"  │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 2.4000000953674316 ┆ 3      ┆ 0.6000000238418579  ┆ 0.0                ┆ "reg6"  │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 2.4000000953674316 ┆ 3      ┆ 7.699999809265137   ┆ 0.0                ┆ "reg7"  │
╰────────────────────┴────────┴─────────────────────┴────────────────────┴─────────╯


# Reading Feather v2 file with lz4 compression containing saved pandas dataframe, gives the error from the first post.
In [11]: pl.read_ipc('test_polars_to_arrow_to_pandas_lz4.feather', use_pyarrow=False)
thread '<unnamed>' panicked at 'assertion failed: prefix.is_empty() && suffix.is_empty()', /github/home/.cargo/git/checkouts/arrow-rs-3b86e19e889d5acc/9f56afb/arrow/src/buffer/immutable.rs:179:9
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
---------------------------------------------------------------------------
PanicException                            Traceback (most recent call last)
<ipython-input-11-04613b1d0975> in <module>
----> 1 pl.read_ipc('test_polars_to_arrow_to_pandas_lz4.feather', use_pyarrow=False)
/software/miniconda3/envs/cisTopic/lib/python3.7/site-packages/polars/functions.py in read_ipc(file, use_pyarrow)
    337     """
    338     file = _prepare_file_arg(file)
--> 339     return DataFrame.read_ipc(file, use_pyarrow)
    340 
    341 

/software/miniconda3/envs/cisTopic/lib/python3.7/site-packages/polars/frame.py in read_ipc(file, use_pyarrow)
    302 
    303         self = DataFrame.__new__(DataFrame)
--> 304         self._df = PyDataFrame.read_ipc(file)
    305         return self
    306 

PanicException: assertion failed: prefix.is_empty() && suffix.is_empty()


# Reading the Feather v2 file with lz4 compression containing the saved pyarrow table kills IPython, because it tries to allocate a far too large buffer.
In [12]: pl.read_ipc('test_polars_to_arrow_lz4.feather', use_pyarrow=False)
Out[12]: memory allocation of 2702793507844465093 bytes failed
Aborted

So it looks to me like arrow-rs does not detect that pyarrow saved the Feather file with lz4 compression, and I guess it then reads data (or offsets) from the wrong locations.

In [6]: ?pa.feather.write_feather
Signature:
pa.feather.write_feather(
    df,
    dest,
    compression=None,
    compression_level=None,
    chunksize=None,
    version=2,
)
Docstring:
Write a pandas.DataFrame to Feather format.

Parameters
----------
df : pandas.DataFrame or pyarrow.Table
    Data to write out as Feather format.
dest : str
    Local destination path.
compression : string, default None
    Can be one of {"zstd", "lz4", "uncompressed"}. The default of None uses
    LZ4 for V2 files if it is available, otherwise uncompressed.
compression_level : int, default None
    Use a compression level particular to the chosen compressor. If None
    use the default compression level
chunksize : int, default None
    For V2 files, the internal maximum size of Arrow RecordBatch chunks
    when writing the Arrow IPC file format. None means use the default,
    which is currently 64K
version : int, default 2
    Feather file version. Version 2 is the current. Version 1 is the more
    limited legacy format
File:      /software/miniconda3/envs/cisTopic/lib/python3.7/site-packages/pyarrow/feather.py
Type:      function

Feather files are attached:
test_feather_polars_to_pyarrow.zip

@nevi-me
Contributor

nevi-me commented Jun 22, 2021

@ghuls compression isn't supported, see #70 and https://issues.apache.org/jira/browse/ARROW-8676. I had a PR for this, but struggled with getting integration tests to pass, so I abandoned it as I didn't have more time for it.

Here's the PR: apache/arrow#9137

@ghuls
Contributor Author

ghuls commented Jun 22, 2021

@nevi-me A pity it is not supported (yet), as pandas and pyarrow write Feather files with lz4 compression by default (at least when using the official packages). At the very least, arrow-rs should detect that an unsupported compression codec is in use, instead of silently reading compressed data as if it were uncompressed.

@ghuls
Contributor Author

ghuls commented Feb 15, 2023

I guess it is solved now, if I read https://arrow.apache.org/blog/2023/02/13/rust-32.0.0/

IPC File Compression: Arrow IPC file compression with ZSTD and LZ4 is now fully supported.

correctly.

@tustvold
Contributor

I believe this was closed by #2369, feel free to reopen if I am mistaken.
