Request to Memmap Arrow IPC files on disk #5153
Comments
I believe this should already be possible if you create a Buffer from an external allocation and then feed it into the IPC decoder. I'm aware some people have done something similar in the past.
I thought that was the case, but I couldn't find a way to do it. The public function that reads record batches in the IPC crate needs a lot of boilerplate to read an IPC file properly, and the functions in the flight crate require you to jump through a few hoops. The simplest pair is this read and write pair. I implemented it using the IPC file reader code here. Did I miss something?
* Blockwise IO in IPC FileReader (#5153)
* Docs
* Clippy
* Update arrow-ipc/src/reader.rs

Co-authored-by: Andrew Lamb <[email protected]>
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
I have machine learning datasets that exceed my system's RAM, which I access in a random pattern. I currently split the data into a flat f32 array written to disk plus a set of Parquet files. This is awkward and requires complex glue code.
Describe the solution you'd like
I'd like to use a memmap like this: https://huggingface.co/docs/datasets/about_arrow
Describe alternatives you've considered
Switching to another format. rkyv works, but is not nearly as interoperable.
Additional context