ARROW-5181: [Rust] Initial support for Arrow File reader #4167
Conversation
Hi @paddyhoran @andygrove @sunchao please review when you get a chance. We can place the reusable parts (validating headers, reading schemas and batches) in a common module when we work on the streaming format.
Codecov Report
@@            Coverage Diff            @@
##             master    #4167   +/-   ##
=========================================
  Coverage          ?   83.52%
=========================================
  Files             ?       87
  Lines             ?    24958
  Branches          ?        0
=========================================
  Hits              ?    20845
  Misses            ?     4113
  Partials          ?        0
=========================================
@nevi-me I'm excited to see this! I will make time to review over the weekend.
Sorry for the late review @nevi-me. Overall looks good.
@@ -61,6 +63,108 @@ fn schema_to_fb(schema: &Schema) -> FlatBufferBuilder {
    fbb
}

/// Deserialize a Schema table from IPC format to Schema data type
pub fn fb_to_schema(fb: ipc::Schema) -> Schema {
Can we have tests for these newly added functions?
I've added one test to rule them all, by converting a large Schema with all the types that we support to flatbuffers and back. Would that suffice?
rust/arrow/src/ipc/file/reader.rs (Outdated)

///
/// Sets the current block to the batch number, and reads the record batch at that
/// block
pub fn read_batch(&mut self, batch_num: usize) -> Result<Option<RecordBatch>> {
nit: wondering if we can change this to something like set_index, which just changes the current_block, and the caller then has to call next() to actually read the batch. Otherwise, this has a side effect of advancing the current block index, which may catch people by surprise. We should at least point this out in the method comments.
Done
Hi @sunchao, I'm currently working on this PR to add support for lists and structs, as I've now generated the data.
Cool. Looking forward to the updated PR!
Update: The buffers from the Arrow file are padded to 64 bits, while the ones in Rust are padded to 8 bits. Due to this difference, I can't test data equality using

Hi @sunchao, I've updated the PR with list and struct reading. Regarding your comment about using

The current commit ab6ff9b (ab6ff9b#diff-cd3519a5e748548b5323b0d05d8e9c8aR504) will fail with the below:

thread 'ipc::file::reader::tests::test_read_struct_file' panicked at 'assertion failed: `(left == right)`
  left: `ArrayData { data_type: Boolean, len: 5, null_count: 2, offset: 0, buffers: [Buffer { data: BufferData { ptr: 0x20a62bba200, len: 8 }, offset: 0 }], child_data: [], null_bitmap: Some(Bitmap { bits: Buffer { data: BufferData { ptr: 0x20a62bb9fc0, len: 8 }, offset: 0 } }) }`,
 right: `ArrayData { data_type: Boolean, len: 5, null_count: 2, offset: 0, buffers: [Buffer { data: BufferData { ptr: 0x20a62bbec00, len: 1 }, offset: 0 }], child_data: [], null_bitmap: Some(Bitmap { bits: Buffer { data: BufferData { ptr: 0x20a62bbf000, len: 1 }, offset: 0 } }) }`', arrow\src\ipc\file\reader.rs:510:13

The difference being the buffer `len` values (8 from the padded file buffer vs 1 in Rust).

There are other differences that I've picked up as I was going along, which might affect our interop compatibility, but I'll list/address them over the coming days.
@nevi-me This is looking great. I think this change is big enough that we should update the README too and explain where those test data files came from and how they were created.
In Rust we also pad buffers to 64 bytes. I think the real reason here is that the
Thanks @sunchao
What is the status of this patch for 0.14.0?
Hi @wesm, this might not make 0.14. I ended up being dependent on a few other changes that we needed to make. I'll be travelling next week, and doubt I'll be able to complete this in time for 0.14.
OK, no problem, I removed it from the milestone
Force-pushed ed180da to 85fe336
Hi @jbabyhacker TL;DR it's not abandoned. The long version is that I haven't had enough bandwidth to work on it. I got stuck because I needed help with some changes, but by the time we had introduced those changes, I was already swamped. I've been working long hours at work for the past few months, including weekends. I'm nearly done with the current project at work, so I'm anticipating having downtime from next weekend, and my time will be split between my studies and catching up here. In terms of work required, I still need to:

I intend on getting IPC (reader, writer, stream vs batch) done before 1.0.0.
If it's interesting at all, I put together a minimal Arrow Flight for Rust proof-of-concept on top of this branch: https://github.com/lihalite/arrow/commit/5ace5b226fb4a3a2a445b11c5b13f847ee3991b1 I used tower-grpc. The impl in this CR is enough to deserialize Flight schema and record batch messages, so we can make a ListFlights call followed by a DoGet.
That's cool. It seems like hardening IPC in Rust is a prerequisite for many other projects. It might need to be a collective effort rather than blocking on one person.
Maybe we can kick off a discussion on the mailing list?
My apologies for dragging this out so much; I've been on one of those "it might end next week" projects that's taken too much of my time. In terms of effort, the reader is complete, supporting all types that Rust can read. I'm mainly left with semantics and moving test cases to

I haven't been able to follow the discussions around buffer alignment, so I might need help there from someone else. As a start, I'll rebase when I get home tonight.
was using an older version of rustfmt
@paddyhoran @andygrove @sunchao @liurenjie1024 PTAL. If anyone has capacity, it would be great to add more data type support so we can read more integration files. I'm going to work on the stream format reader next, which will be very useful for
@nevi-me I don't pretend to understand every detail here, but LGTM overall. There are a couple of unwraps/panics that could probably be changed to use `Result`s, but I think we should get this PR merged to make it easier for other committers (myself included) to start helping you out with this effort.
I agree with @andygrove. Let's get this merged so that others can start to chip in. Thanks @nevi-me, this is great progress!
What is the status of this? I can only see the beginnings of a schema converter in the repo
Hi @andy-thomason, your

I haven't had time to work on the stream reader (as this PR was only for the file reader), and we'd welcome more hands on deck so we can complete IPC in Rust by
Excellent news, @nevi-me. We use Arrow extensively at work for genetic variant analysis, and I have written a number of parsers for the format - all internal, I'm afraid. I'll sync up and take a look.
Is this all using Rust @andy-thomason?
We were using C++, Python and R in the past, but now we are starting to write services in Rust instead of running batch code on a cluster.
We also have a group in Cambridge using Rust in WASM. |
This related work also just got merged.
https://github.com/apache/arrow/tree/master/rust/arrow-flight
Excellent @andygrove, saved me a huge pile of work. I have a deadline of the end of next week to
If you'll be using

I'll certainly look at

JIRA logs 10 minutes to each PR for comments, so the mailing list might be a better place to continue the conversation (as this PR is closed). Be wary though that gRPC has low data limits by default (4MB; I haven't tried modifying it in
This adds initial support for reading Arrow files. Only the file format is supported, with support for the streaming format to follow in a separate PR.
Only Rust-supported datatypes are read and tested; thus files with timestamp[tz], interval, duration, map, and decimal types aren't tested.