
ARROW-5181: [Rust] Initial support for Arrow File reader #4167

Closed
wants to merge 16 commits

Conversation

nevi-me
Contributor

@nevi-me nevi-me commented Apr 17, 2019

This adds initial support for reading Arrow files. Only the file format is supported, with support for the streaming format to follow in a separate PR.

Only datatypes supported by Rust are read and tested, so files with timestamp[tz], interval, duration, map, and decimal types aren't tested.

@nevi-me
Contributor Author

nevi-me commented Apr 17, 2019

Hi @paddyhoran @andygrove @sunchao please review when you get a chance. We can place the reusable parts (validating headers, reading schemas and batches) in a common module when we work on the streaming format.
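The header validation mentioned here amounts to checking the Arrow file magic. A minimal sketch (an illustrative helper, not the PR's actual code), assuming the file-format layout where the 6-byte magic "ARROW1" appears at the start of the file (padded to 8 bytes) and again at the very end:

```rust
// Arrow files begin with the magic "ARROW1" (padded to 8 bytes) and end
// with the same magic; this toy check validates both ends of a byte slice.
const ARROW_MAGIC: &[u8; 6] = b"ARROW1";

fn has_valid_magic(data: &[u8]) -> bool {
    // Minimum plausible size: leading magic + padding + trailing magic.
    data.len() >= 14 && data.starts_with(ARROW_MAGIC) && data.ends_with(ARROW_MAGIC)
}
```

A reader would typically perform this check before attempting to parse the footer and schema.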

@nevi-me nevi-me changed the title ARROW-5180: [Rust] Initial support for Arrow File reader ARROW-5181: [Rust] Initial support for Arrow File reader Apr 17, 2019
@codecov-io

codecov-io commented Apr 17, 2019

Codecov Report

❗ No coverage uploaded for pull request base (master@6d25dfd).
The diff coverage is 88.28%.


@@            Coverage Diff            @@
##             master    #4167   +/-   ##
=========================================
  Coverage          ?   83.52%           
=========================================
  Files             ?       87           
  Lines             ?    24958           
  Branches          ?        0           
=========================================
  Hits              ?    20845           
  Misses            ?     4113           
  Partials          ?        0
Impacted Files Coverage Δ
rust/arrow/src/array/array.rs 91.12% <0%> (ø)
rust/arrow/src/ipc/file/reader.rs 87.34% <87.34%> (ø)
rust/arrow/src/ipc/convert.rs 96.94% <96.72%> (ø)

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 6d25dfd...d555384.

@ghost

ghost commented Apr 18, 2019

@nevi-me I'm excited to see this! I will make time to review over the weekend.

Member

@sunchao sunchao left a comment


Sorry for the late review @nevi-me . Overall looks good.

rust/arrow/src/ipc/convert.rs (outdated review thread, resolved)
@@ -61,6 +63,108 @@ fn schema_to_fb(schema: &Schema) -> FlatBufferBuilder {
fbb
}

/// Deserialize a Schema table from IPC format to Schema data type
pub fn fb_to_schema(fb: ipc::Schema) -> Schema {
Member


Can we have tests for these newly added functions?

Contributor Author


I've added one test to rule them all, by converting a large Schema with all the types that we support, to flatbuffers and back. Would that suffice?
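That kind of roundtrip test can be sketched in miniature (a toy byte encoding standing in for the flatbuffers conversion; all names here are illustrative, not the actual `ipc::convert` API):

```rust
// Toy stand-in for schema <-> bytes conversion, illustrating the
// roundtrip test pattern: encode, decode, and assert equality.
#[derive(Debug, PartialEq)]
struct ToyField {
    name: String,
    nullable: bool,
}

fn to_bytes(fields: &[ToyField]) -> Vec<u8> {
    let mut out = Vec::new();
    for f in fields {
        out.push(f.name.len() as u8); // length-prefixed name
        out.extend_from_slice(f.name.as_bytes());
        out.push(f.nullable as u8);
    }
    out
}

fn from_bytes(mut data: &[u8]) -> Vec<ToyField> {
    let mut fields = Vec::new();
    while !data.is_empty() {
        let n = data[0] as usize;
        let name = String::from_utf8(data[1..1 + n].to_vec()).unwrap();
        let nullable = data[1 + n] != 0;
        fields.push(ToyField { name, nullable });
        data = &data[n + 2..];
    }
    fields
}
```

A single test covering a schema with every supported type follows this shape: build the value, convert to bytes and back, and assert the result equals the input.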

rust/arrow/src/ipc/file/reader.rs (4 outdated review threads, resolved)
///
/// Sets the current block to the batch number, and reads the record batch at that
/// block
pub fn read_batch(&mut self, batch_num: usize) -> Result<Option<RecordBatch>> {
Member


nit: wondering if we can change this to something like set_index, which just changes the current_block; the caller then has to call next() to actually read the batch.

Otherwise, this has the side effect of advancing the current block index, which may catch people by surprise. We should at least point this out in the method comments.
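The suggested split can be sketched with a toy reader (names hypothetical, an i32 standing in for a record batch): set_index only positions the cursor, and next() does the actual reading and advancing:

```rust
// Sketch of the suggested API shape: positioning and reading are separated.
struct ToyReader {
    blocks: Vec<i32>,     // stand-in for record-batch blocks
    current_block: usize,
}

impl ToyReader {
    /// Position the reader without reading anything.
    fn set_index(&mut self, batch_num: usize) -> Result<(), String> {
        if batch_num >= self.blocks.len() {
            return Err(format!("batch {} out of range", batch_num));
        }
        self.current_block = batch_num;
        Ok(())
    }

    /// Read the batch at the current block, then advance the cursor.
    fn next(&mut self) -> Option<i32> {
        let batch = self.blocks.get(self.current_block).copied();
        if batch.is_some() {
            self.current_block += 1;
        }
        batch
    }
}
```

With this shape, only next() has the side effect of advancing the cursor, which matches the usual iterator expectation.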

Contributor Author


Done

rust/arrow/src/ipc/convert.rs (outdated review thread, resolved)
@nevi-me
Contributor Author

nevi-me commented Apr 25, 2019

Hi @sunchao I'm currently working on this PR to add support for lists and structs, as I've now generated the data.

@sunchao
Member

sunchao commented Apr 25, 2019

Cool. Looking forward to the updated PR!

@nevi-me
Contributor Author

nevi-me commented Apr 26, 2019

Update: The buffers from the Arrow file are padded to 64 bits, while the ones in Rust are padded to 8 bits. Due to this difference, I can't test data equality using ArrayData.


Hi @sunchao I've updated the PR with list and struct reading. Regarding your comment about using ArrayData, I've noticed that there's a difference between Rust and Python/CPP (maybe just IPC) with buffer padding to multiples of 8 (and treatment of null buffers, but I'll address this separately).

The current commit ab6ff9b (ab6ff9b#diff-cd3519a5e748548b5323b0d05d8e9c8aR504) will fail with the below:

thread 'ipc::file::reader::tests::test_read_struct_file' panicked at 'assertion failed: `(left == right)`
  left: `ArrayData { data_type: Boolean, len: 5, null_count: 2, offset: 0, buffers: [Buffer { data: BufferData { ptr: 0x20a62bba200, len: 8 }, offset: 0 }], child_data: [], null_bitmap: Some(Bitmap { bits: Buffer { data: BufferData { ptr: 0x20a62bb9fc0, len: 8 }, offset: 0 } }) }`,
 right: `ArrayData { data_type: Boolean, len: 5, null_count: 2, offset: 0, buffers: [Buffer { data: BufferData { ptr: 0x20a62bbec00, len: 1 }, offset: 0 }], child_data: [], null_bitmap: Some(Bitmap { bits: Buffer { data: BufferData { ptr: 0x20a62bbf000, len: 1 }, offset: 0 } }) }`', arrow\src\ipc\file\reader.rs:510:13

The difference is len: 1 vs len: 8 in the BufferData lengths; the above is from a BooleanArray with 5 elements. I've run out of time for now, but I'll compare the binary representation of the buffers to see what's causing the difference, though I suspect it's just the lengths being padded. https://www.diffchecker.com/xU8bBtne shows the diff between the struct_array buffer and the struct_type file that I'm reading; the padding to multiples of 8 is more apparent there.

There are other differences that I've picked up along the way, which might affect our interop compatibility; I'll list/address them over the coming days.
I'll also add the Python scripts that I used to generate the data, but I'd prefer to move the data files to arrow-testing during the course of this PR.

@andygrove
Member

@nevi-me This is looking great. I think this change is big enough that we should update the README too and explain where those test data files came from and how they were created.

@sunchao
Member

sunchao commented May 2, 2019

Update: The buffers from the Arrow file are padded to 64 bits, while the ones in Rust are padded to 8 bits. Due to this difference, I can't test data equality using ArrayData.

In Rust we also pad buffers to 64 bytes. I think the real reason here is that the len in BufferData is the # of valid bytes, instead of the # of total bytes (= # of valid bytes + # of padded bytes). I'm not sure whether it is easy to change the conversion of ipc::Buffer to meet this requirement. Changing the BufferData implementation might be a little involved, but let me take a look.
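The len: 1 vs len: 8 mismatch above is just this arithmetic: 5 boolean values need ceil(5/8) = 1 byte of valid data, and the file-side buffer reports that length rounded up to a padding boundary. A sketch of the two calculations (helper names illustrative, not arrow crate APIs):

```rust
// Bytes of valid data for a boolean bitmap of `num_values` bits.
fn bool_bitmap_bytes(num_values: usize) -> usize {
    (num_values + 7) / 8
}

// Round a byte length up to the next multiple of `alignment`.
fn pad_to(len: usize, alignment: usize) -> usize {
    ((len + alignment - 1) / alignment) * alignment
}
```

Comparing valid-byte lengths (or the unpadded logical contents) rather than total buffer lengths would make the equality check hold across both representations.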

@nevi-me
Contributor Author

nevi-me commented May 2, 2019

Thanks @sunchao

rust/arrow/src/array.rs (outdated review thread, resolved)
@wesm
Member

wesm commented Jun 24, 2019

What is the status of this patch for 0.14.0?

@nevi-me
Contributor Author

nevi-me commented Jun 24, 2019

What is the status of this patch for 0.14.0?

Hi @wesm, this might not make 0.14. I ended up being dependent on a few other changes that we needed to make. I'll be travelling next week, and I doubt I'll be able to complete this in time for 0.14.

@wesm
Member

wesm commented Jun 24, 2019

OK, no problem, I removed it from the milestone.

@kszucs kszucs force-pushed the master branch 2 times, most recently from ed180da to 85fe336 Compare July 22, 2019 19:29
@jbabyhacker

@nevi-me & @wesm what is the status of this PR?

@nevi-me
Contributor Author

nevi-me commented Aug 14, 2019

Hi @jbabyhacker

TL;DR it's not abandoned.

The long version is that I haven't had enough bandwidth to work on it. I got stuck because I needed help with some changes, but by the time we had introduced those changes, I was already swamped. I've been working long hours at work for the past few months, including weekends. I'm nearly done with my current project at work, so I'm anticipating having downtime from next weekend; my time will then be split between my studies and catching up here.

In terms of work required, I still need to:

I intend to get IPC (reader, writer, stream vs batch) done before 1.0.0.

@ghost

ghost commented Sep 9, 2019

If it's interesting at all, I put together a minimal Arrow Flight for Rust proof-of-concept on top of this branch: https://github.com/lihalite/arrow/commit/5ace5b226fb4a3a2a445b11c5b13f847ee3991b1

I used tower-grpc. The impl in this CR is enough to deserialize Flight schema and record batch messages, so we can make a ListFlights call followed by a DoGet.

@wesm
Member

wesm commented Sep 10, 2019

That's cool. It seems like hardening IPC in Rust is a prerequisite for many other projects. It might need to be a collective effort rather than blocking on one person.

@ghost

ghost commented Sep 10, 2019

Maybe we can kick off a discussion on the mailing list?

  • Status of Rust IPC, other subprojects there. Maybe it's reasonable to merge this as a start and build out more test cases and the other APIs?
  • Flight support in Rust: which gRPC implementation, which Protobuf implementation, etc. (tower-grpc isn't quite up to par with the standard gRPC implementations, but the others bind to C or C++ IIRC)

@nevi-me
Contributor Author

nevi-me commented Sep 10, 2019

My apologies for dragging this out for so long; I've been on one of those "it might end next week" projects that's taken too much of my time.

In terms of effort, the reader is complete, supporting all types that Rust can read. I'm mainly left with semantics and moving test cases to arrow-testing. I'll work on wrapping this up by the weekend so we can unblock progress here.

I haven't been able to follow the discussions around buffer alignment, so I might need help there from someone else.

As a start, I'll rebase when I get home tonight.

@nevi-me
Contributor Author

nevi-me commented Nov 19, 2019

@paddyhoran @andygrove @sunchao @liurenjie1024 PTAL. If anyone has capacity, it would be great to add more data type support so we can read more integration files.

I'm going to work on the stream format reader next, which will be very useful for flight.

Member

@andygrove andygrove left a comment


@nevi-me I don't pretend to understand every detail here but LGTM overall. There are a couple of unwrap/panics that could probably be changed to use results but I think we should get this PR merged to make it easier for other committers (myself included) to start helping you out with this effort.

Contributor

@paddyhoran paddyhoran left a comment


I agree with @andygrove. Let's get this merged so that others can start to chip in. Thanks @nevi-me, this is great progress!

@nevi-me nevi-me deleted the ARROW-5180 branch November 19, 2019 12:47
@andy-thomason
Contributor

What is the status of this? I could only see the beginnings of a schema converter in the repo, so I went ahead and wrote a full version of this. I also have batch and dictionary support in internal work projects if it's needed, but I'd be overjoyed if someone had done this already.

https://github.com/andy-thomason/arrow

@nevi-me
Contributor Author

nevi-me commented Nov 24, 2019

Hi @andy-thomason, your master branch is 725 commits behind apache/arrow. It's often a good idea to rebase before doing any work 😃
The reader was merged 5 days ago, so you can review that and perhaps continue from there. Dictionary support is exciting to me! Its relevant JIRA is https://issues.apache.org/jira/browse/ARROW-5949, so perhaps you can open a PR against it as a start.

I haven't had time to work on the stream reader (as this PR was only for the file reader), and we'd welcome more hands on deck so we can complete IPC in Rust by 1.0.0 (not sure when that'll be, but it might be in December).

@andy-thomason
Contributor

Excellent news, @nevi-me. We use Arrow extensively at work for genetic variant analysis, and I have written a number of parsers for the format (all internal, I'm afraid). I'll sync up and take a look.

@nevi-me
Contributor Author

nevi-me commented Nov 24, 2019

Is this all using Rust @andy-thomason?

@andy-thomason
Contributor

We were using C++, Python and R in the past, but now we are starting to write services in Rust instead of running batch code on a cluster.

@andy-thomason
Contributor

We also have a group in Cambridge using Rust in WASM.

@andygrove
Member

andygrove commented Nov 24, 2019 via email

@andy-thomason
Contributor

Excellent @andygrove, that saves me a huge pile of work. I have a deadline of the end of next week to get a file reader working using hyper. Liking this more and more.

@nevi-me
Contributor Author

nevi-me commented Nov 24, 2019

If you'll be using arrow-flight, the stream reader might be the next best thing for us to work on. I can't remember the spec offhand, but most of the work should be refactoring the file reader to check for end of stream.

@andy-thomason
Contributor

I'll certainly look at arrow-flight. At the moment we are using static data servers as our data is largely read-only, but I would like to move to a microservice model. We use record batches of about 1GB and have a second Arrow file as a sparse index.

@nevi-me
Contributor Author

nevi-me commented Nov 24, 2019

JIRA logs 10 minutes to each PR for comments, so the mailing list might be a better place to continue the conversation (as this PR is closed). Be wary though that gRPC has low message size limits by default (4 MB; I haven't tried modifying it in tonic or tower-grpc, but you can change it with a setting in other languages). You might end up having to break your Arrow batches down into smaller chunks.
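Breaking batches down amounts to slicing rows so each slice's approximate encoded size stays under the cap. A rough sketch (the sizes and helper name are illustrative, not a Flight API):

```rust
// Sketch: split `num_rows` rows into (offset, length) slices so that each
// slice's approximate encoded size stays under `max_bytes`.
fn chunk_rows(num_rows: usize, bytes_per_row: usize, max_bytes: usize) -> Vec<(usize, usize)> {
    let rows_per_chunk = (max_bytes / bytes_per_row).max(1);
    let mut chunks = Vec::new();
    let mut offset = 0;
    while offset < num_rows {
        let len = rows_per_chunk.min(num_rows - offset);
        chunks.push((offset, len));
        offset += len;
    }
    chunks
}
```

Each (offset, length) pair would then be sent as its own record batch message, keeping every gRPC frame under the default limit.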

9 participants