
How to read a subset of columns from arrow ipc format file the fastest way? #13827

Open
code1704 opened this issue Aug 9, 2022 · 9 comments
@code1704

code1704 commented Aug 9, 2022

We are building a high-performance training system, and we care a lot about performance. We store the training data in Arrow IPC format files with, say, 100M rows and 1000 columns, and we only need to read 10 of those columns each time for training.

It seems that with the Arrow IPC format we have to read the whole file first to get the 10 columns. We did not use Parquet because it involves serialization and deserialization, and we think the Arrow IPC format will be faster.

Is there any suggestion for how we can read just the 10 columns to get better performance?

@code1704 code1704 changed the title How to read a subset of columns from arrow ipc format file the fast way? How to read a subset of columns from arrow ipc format file the fastest way? Aug 9, 2022
@drin
Contributor

drin commented Aug 9, 2022

I think the absolute fastest way is to break the columns up into separate files, in which case you'll have far fewer inefficiencies.

I think IPC and Parquet are structurally similar, so performance should not drop when switching from Parquet to IPC in terms of how much of a file must be read to select a subset of columns.

I have ideas of experiments to run to make recommendations, but let me try and see if these happen to exist anywhere...

@westonpace
Member

westonpace commented Aug 9, 2022

I think we did some investigation of partial reads in #11616 but I can't remember if we enabled it for the synchronous path or just the asynchronous path (GetRecordBatchGenerator).

Can you try calling PreBufferMetadata before you start reading the record batches? If that doesn't work can you try GetRecordBatchGenerator?

@code1704
Author

code1704 commented Aug 9, 2022

I think the absolute fastest way is to break up the columns into different files, in which case you'll have far fewer inefficiencies.

Maybe not. There would be 1000 files, and we may have 1M such files. That brings more disk I/O, more file-open requests, per-column overhead, and more complexity in maintaining the data.

@code1704
Author

code1704 commented Aug 9, 2022

I think we did some investigation of partial reads in #11616 but I can't remember if we enabled it for the synchronous path or just the asynchronous path (GetRecordBatchGenerator).

Can you try calling PreBufferMetadata before you start reading the record batches? If that doesn't work can you try GetRecordBatchGenerator?

Thanks for your suggestions.

We don't know much about Arrow and have not used the APIs you mentioned. Most of the time we use pyarrow to read/write data and do AI model training, so if there is a Python way, that would be great. We are reading the Arrow code and will give it a try.

@drin
Contributor

drin commented Aug 9, 2022

maybe not. there will be 1000 files, and we may have 1M such files. it brings more disk IOs, file open requests, overheads on each column, and complexity to maintain the data.

Just for clarity, I meant you should group columns into files at the granularity you access them: in this case, 10 columns per file. You can then also fit more rows per batch in the same footprint, improving your useful throughput. But the point was just that any other layout is going to have some inefficiencies related to "partial reads", as Weston mentioned, or some form of having to access extents that contain data for other columns. Since I don't know the exact use case, I agree this may not actually improve performance across various use cases.
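
[Editor's note] A minimal pyarrow sketch of this column-grouping idea (not from the thread; the table, file paths, and group sizes are hypothetical and would follow your real access pattern):

import pyarrow as pa
import pyarrow.ipc as ipc

# Hypothetical table standing in for the training data.
table = pa.table({f"c{i}": list(range(1000)) for i in range(30)})

# Hypothetical grouping: each file holds the ~10 columns that are read together.
column_groups = {
    "/tmp/group_0.arrow": [f"c{i}" for i in range(0, 10)],
    "/tmp/group_1.arrow": [f"c{i}" for i in range(10, 20)],
    "/tmp/group_2.arrow": [f"c{i}" for i in range(20, 30)],
}

for path, cols in column_groups.items():
    subset = table.select(cols)  # zero-copy column projection
    with ipc.new_file(path, subset.schema) as writer:
        writer.write_table(subset)  # one IPC file per column group

A reader that only ever needs one group then opens a single small file instead of touching a 1000-column file.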

@drin
Contributor

drin commented Aug 9, 2022

looks to me like these are the functions Weston mentioned:
ipc::reader.cc#L1310
ipc::reader.cc#L1366

I don't see any obvious Python bindings yet; these might not be exposed directly in pyarrow.

@westonpace
Member

westonpace commented Aug 9, 2022

@drin is correct, these functions are not exposed to pyarrow at the moment. However, from pyarrow, if you use the datasets API to read those files, it should achieve the desired effect:

import pyarrow.dataset as ds
my_dataset = ds.dataset(['/tmp/my_ipc.arrow'], format='arrow')
my_table = my_dataset.to_table(columns=["a", "b"])

Another thing that helps is if you can save your file in multiple row groups. This will allow you to start processing before you load the entire file into memory. I'm not sure whether this is workable for you or not:

import pyarrow.dataset as ds
my_dataset = ds.dataset(['/tmp/my_ipc.arrow'], format='arrow')
for next_batch in my_dataset.to_batches(columns=["a", "b"]):
  # Do something with each batch
  print(next_batch)
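
[Editor's note] For reference, a minimal sketch (not from the thread) of writing an IPC file as multiple record batches so that it can be streamed in chunks like this; the chunk size of 65536 rows is just an illustrative choice:

import pyarrow as pa
import pyarrow.ipc as ipc

table = pa.table({"a": list(range(1_000_000)), "b": list(range(1_000_000))})

with ipc.new_file('/tmp/my_ipc.arrow', table.schema) as writer:
    # Each chunk is written as its own record batch in the file.
    for batch in table.to_batches(max_chunksize=65536):
        writer.write_batch(batch)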

@code1704
Author

Thanks, all. The dataset API works for us.

@ben-da6

ben-da6 commented Dec 6, 2023

It would be useful if there were options to read a subset of columns from the Python side, and at the lower-level open_file API as well. E.g. if we could do

source = pa.OSFile(path, "rb")
file = pa.ipc.open_file(source)
record = file.get_batch(0, columns=subset)

Or to be able to control how dataset.to_batches iterates through the data at the row group level across potentially partitioned data on disk.
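
[Editor's note] As an aside (not part of the original comment), a rough approximation of the second request is already possible with the existing datasets API by iterating fragments yourself. This is only a sketch, assuming roughly one fragment per file on disk and that your pyarrow version supports Fragment.to_batches with a columns argument:

import pyarrow.dataset as ds

# Hypothetical directory of (possibly partitioned) Arrow IPC files.
dataset = ds.dataset('/tmp/my_ipc_dir', format='arrow')

subset = ["a", "b"]
for fragment in dataset.get_fragments():  # roughly one fragment per file
    for batch in fragment.to_batches(columns=subset):
        print(batch.num_rows)  # process each projected batch here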
