
How to read a subset of columns from arrow ipc format file the fastest way? #13827

Open
code1704 opened this issue Aug 9, 2022 · 9 comments
@code1704

code1704 commented Aug 9, 2022

We are building a high-performance training system, and we care a lot about performance. We store the training data in Arrow IPC format files with, say, 100M rows and 1000 columns, and we only need to read 10 of those columns each time for training.

It seems that with the Arrow IPC format we have to read the whole file first to get the 10 columns. We did not use Parquet because it involves serialization and deserialization, and we think the Arrow IPC format will be faster.

Is there any suggestion for how we can read just the 10 columns to get better performance?

@code1704 code1704 changed the title How to read a subset of columns from arrow ipc format file the fast way? How to read a subset of columns from arrow ipc format file the fastest way? Aug 9, 2022
@drin
Contributor

drin commented Aug 9, 2022

I think the absolute fastest way is to break the columns up into separate files, in which case you'll have far fewer inefficiencies.

I think IPC and Parquet are structurally similar, so performance should not drop when switching from Parquet to IPC in terms of how much of a file must be read to select a subset of columns.

I have ideas of experiments to run to make recommendations, but let me try and see if these happen to exist anywhere...

@westonpace
Member

westonpace commented Aug 9, 2022

I think we did some investigation of partial reads in #11616 but I can't remember if we enabled it for the synchronous path or just the asynchronous path (GetRecordBatchGenerator).

Can you try calling PreBufferMetadata before you start reading the record batches? If that doesn't work can you try GetRecordBatchGenerator?

@code1704
Author

code1704 commented Aug 9, 2022

I think the absolute fastest way is to break up the columns into different files, in which case you'll have far fewer inefficiencies.

Maybe not. There would be 1000 files, and we may have 1M such files. That brings more disk I/O, more file-open requests, per-column overhead, and more complexity in maintaining the data.

@code1704
Author

code1704 commented Aug 9, 2022

I think we did some investigation of partial reads in #11616 but I can't remember if we enabled it for the synchronous path or just the asynchronous path (GetRecordBatchGenerator).

Can you try calling PreBufferMetadata before you start reading the record batches? If that doesn't work can you try GetRecordBatchGenerator?

Thanks for your suggestions.

We don't know much about Arrow and have not used the APIs you mentioned. Most of the time we use pyarrow to read/write data and do AI model training, so if there is a Python way, that would be great. We are reading the Arrow code and will give it a try.

@drin
Contributor

drin commented Aug 9, 2022

maybe not. there will be 1000 files, and we may have 1M such files. it brings more disk IOs, file open requests, overheads on each column, and complexity to maintain the data.

Just for clarity, I meant you should group columns into files at the granularity you access them: in this case, 10 columns per file. You can then also fit more rows per batch in the same footprint, improving your useful throughput. But the point was just that any other layout is going to have some inefficiencies related to "partial reads", as Weston mentioned, or some form of having to access extents that contain data for other columns. Since I don't know the exact use case, I agree this may not actually improve performance across various use cases.
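
[Editor's note] A minimal pyarrow sketch of this column-grouping idea (not from the thread; the table, file paths, and group sizes are hypothetical and would follow your real access pattern):

import pyarrow as pa
import pyarrow.ipc as ipc

# Hypothetical table standing in for the training data.
table = pa.table({f"c{i}": list(range(1000)) for i in range(30)})

# Hypothetical grouping: each file holds the ~10 columns that are read together.
column_groups = {
    "/tmp/group_0.arrow": [f"c{i}" for i in range(0, 10)],
    "/tmp/group_1.arrow": [f"c{i}" for i in range(10, 20)],
    "/tmp/group_2.arrow": [f"c{i}" for i in range(20, 30)],
}

for path, cols in column_groups.items():
    subset = table.select(cols)  # zero-copy column projection
    with ipc.new_file(path, subset.schema) as writer:
        writer.write_table(subset)  # one IPC file per column group

A reader that only ever needs one group then opens a single small file instead of touching a 1000-column file.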

@drin
Contributor

drin commented Aug 9, 2022

looks to me like these are the functions Weston mentioned:
ipc::reader.cc#L1310
ipc::reader.cc#L1366

I don't see any obvious Python bindings yet; these might not be exposed directly in pyarrow.

@westonpace
Member

westonpace commented Aug 9, 2022

@drin is correct, these functions are not exposed to pyarrow at the moment. However, from pyarrow, if you use the datasets API to read those files, it should achieve the desired effect:

import pyarrow.dataset as ds
my_dataset = ds.dataset(['/tmp/my_ipc.arrow'], format='arrow')
my_table = my_dataset.to_table(columns=["a", "b"])

Another thing that helps is if you can save your file in multiple row groups. This will allow you to start processing before you load the entire file into memory. I'm not sure whether this is workable for you or not:

import pyarrow.dataset as ds
my_dataset = ds.dataset(['/tmp/my_ipc.arrow'], format='arrow')
for next_batch in my_dataset.to_batches(columns=["a", "b"]):
  # Do something with each batch
  print(next_batch)
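
[Editor's note] For reference, a minimal sketch (not from the thread) of writing an IPC file as multiple record batches so that it can be streamed in chunks like this; the chunk size of 65536 rows is just an illustrative choice:

import pyarrow as pa
import pyarrow.ipc as ipc

table = pa.table({"a": list(range(1_000_000)), "b": list(range(1_000_000))})

with ipc.new_file('/tmp/my_ipc.arrow', table.schema) as writer:
    # Each chunk is written as its own record batch in the file.
    for batch in table.to_batches(max_chunksize=65536):
        writer.write_batch(batch)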

@code1704
Author

Thanks, all. The dataset API works for us.

@ben-da6

ben-da6 commented Dec 6, 2023

It would be useful if there were options to read a subset of columns from the Python side, and at the lower-level open_file API as well. E.g. if we could do

source = pa.OSFile(path, "rb")
file = pa.ipc.open_file(source)
record = file.get_batch(0, columns=subset)

Or to be able to control how dataset.to_batches iterates through the data at the row group level across potentially partitioned data on disk.
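
[Editor's note] As an aside (not part of the original comment), a rough approximation of the second request is already possible with the existing datasets API by iterating fragments yourself. This is only a sketch, assuming roughly one fragment per file on disk and that your pyarrow version supports Fragment.to_batches with a columns argument:

import pyarrow.dataset as ds

# Hypothetical directory of (possibly partitioned) Arrow IPC files.
dataset = ds.dataset('/tmp/my_ipc_dir', format='arrow')

subset = ["a", "b"]
for fragment in dataset.get_fragments():  # roughly one fragment per file
    for batch in fragment.to_batches(columns=subset):
        print(batch.num_rows)  # process each projected batch here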
