How to read a subset of columns from an Arrow IPC format file the fastest way? #13827
I think the absolute fastest way is to break up the columns into different files, in which case you'll have far fewer inefficiencies. I think structurally, IPC and parquet are similar, so performance should not drop when switching to IPC from parquet with regard to how much of a file must be read to select a subset of columns. I have ideas of experiments to run to make recommendations, but let me try and see if these happen to exist anywhere...
I think we did some investigation of partial reads in #11616 but I can't remember if we enabled it for the synchronous path or just the asynchronous path (can you try calling …?)
Maybe not. There will be 1000 files, and we may have 1M such files. It brings more disk I/O, more file-open requests, extra overhead for each column, and more complexity in maintaining the data.
Thanks for your suggestions. We do not know much about Arrow internals and have not used the APIs you mentioned. Most of the time we use pyarrow to read/write data and do AI model training, so if there is a Python way, that would be great. In the meantime we are reading the Arrow code and giving it a try.
Just for clarity, I meant that you should group columns into files sized by how you access them; in this case that would be 10 columns per file, which also lets you fit more rows per batch in the same footprint and thus improves your useful throughput. The point was just that any other layout will have some inefficiency related to the "partial reads" Weston mentioned, or some form of reading extents that contain data for other columns. Since I don't know the exact use case, I agree this may not actually improve performance across all use cases.
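For illustration, a minimal sketch of that layout, assuming the data is already in a pyarrow Table and that 10 columns per group matches the access pattern (the table contents and file names below are made up):

```python
import pyarrow as pa
import pyarrow.feather as feather

# Hypothetical table standing in for the real training data.
table = pa.table({f"col{i}": range(1_000) for i in range(30)})

group_size = 10  # number of columns typically read together
names = table.column_names
for start in range(0, len(names), group_size):
    group = names[start:start + group_size]
    # Feather V2 files are Arrow IPC files, so existing IPC readers still work.
    feather.write_feather(table.select(group), f"columns_{start // group_size}.arrow")
```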
These look to me like the functions Weston mentioned. I don't see any obvious Python bindings yet, so they might not be exposed directly in pyarrow.
@drin is correct, these functions are not exposed in pyarrow at the moment. However, from pyarrow, if you use the datasets API to read those files, it should achieve the desired effect:
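For example, something along these lines should work (the directory path and column names are placeholders, not from this thread):

```python
import pyarrow.dataset as ds

# Point the dataset at the IPC file(s); "data/" is a placeholder path.
dataset = ds.dataset("data/", format="ipc")

# Only the requested columns are materialized in the resulting table.
table = dataset.to_table(columns=["col1", "col2", "col3"])
```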
Another benefit comes if you can save your file as multiple row groups (record batches, in IPC terms). This allows you to start processing before loading the entire file into memory. I'm not sure whether this is workable for you or not:
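As a rough sketch, assuming an IPC file written as several record batches (the file name, schema, and batch contents below are invented for illustration):

```python
import pyarrow as pa
import pyarrow.dataset as ds

# Write the file as several record batches (the IPC analogue of row groups).
schema = pa.schema([("col1", pa.int64()), ("col2", pa.float64())])
with pa.OSFile("data.arrow", "wb") as sink:
    with pa.ipc.new_file(sink, schema) as writer:
        for _ in range(10):
            batch = pa.record_batch(
                [pa.array(range(1_000)), pa.array([0.5] * 1_000)], schema=schema
            )
            writer.write_batch(batch)

# Scan batch by batch instead of loading the whole file at once.
dataset = ds.dataset("data.arrow", format="ipc")
for batch in dataset.to_batches(columns=["col1"]):
    ...  # process each RecordBatch as it arrives
```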
Thanks all. The dataset works for us.
It would be useful if there were options to read a subset of columns from the Python side and at the lower level, or to be able to control how …
We are building a high-performance training system, and we care a lot about performance. We store the training data in Arrow IPC format files; say there are 100M rows and 1000 columns, and we only need to read 10 of the columns each time for training.
It seems that with the Arrow IPC format we have to read the whole file first to get those 10 columns. We decided not to use Parquet because of the serialization and deserialization involved, and we think the Arrow IPC format will be faster.
Is there any suggestion for how to read only the 10 columns to get better performance?
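For reference, the whole-file read described above looks roughly like this (the file path and column names are placeholders):

```python
import pyarrow as pa

# Current approach: read the entire IPC file, then keep only the needed columns.
wanted = [f"feature_{i}" for i in range(10)]
with pa.memory_map("training_data.arrow") as source:
    table = pa.ipc.open_file(source).read_all().select(wanted)
```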