
ARROW-5181: [Rust] Initial support for Arrow File reader #4167

Closed
wants to merge 16 commits

Conversation

nevi-me
Contributor

@nevi-me nevi-me commented Apr 17, 2019

This adds initial support for reading Arrow files. Only the file format is supported, with support for the streaming format to follow in a separate PR.

Only datatypes supported by Rust are read and tested, so files with timestamp[tz], interval, duration, map, and decimal types aren't tested.

@nevi-me
Contributor Author

nevi-me commented Apr 17, 2019

Hi @paddyhoran @andygrove @sunchao please review when you get a chance. We can place the reusable parts (validating headers, reading schemas and batches) in a common module when we work on the streaming format.
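The header validation mentioned here amounts to checking the Arrow file magic. A minimal sketch (an illustrative helper, not the PR's actual code), assuming the file-format layout where the 6-byte magic "ARROW1" appears at the start of the file (padded to 8 bytes) and again at the very end:

```rust
// Arrow files begin with the magic "ARROW1" (padded to 8 bytes) and end
// with the same magic; this toy check validates both ends of a byte slice.
const ARROW_MAGIC: &[u8; 6] = b"ARROW1";

fn has_valid_magic(data: &[u8]) -> bool {
    // Minimum plausible size: leading magic + padding + trailing magic.
    data.len() >= 14 && data.starts_with(ARROW_MAGIC) && data.ends_with(ARROW_MAGIC)
}
```

A reader would typically perform this check before attempting to parse the footer and schema.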

@nevi-me nevi-me changed the title ARROW-5180: [Rust] Initial support for Arrow File reader ARROW-5181: [Rust] Initial support for Arrow File reader Apr 17, 2019
@codecov-io

codecov-io commented Apr 17, 2019

Codecov Report

❗ No coverage uploaded for pull request base (master@6d25dfd).
The diff coverage is 88.28%.


@@            Coverage Diff            @@
##             master    #4167   +/-   ##
=========================================
  Coverage          ?   83.52%           
=========================================
  Files             ?       87           
  Lines             ?    24958           
  Branches          ?        0           
=========================================
  Hits              ?    20845           
  Misses            ?     4113           
  Partials          ?        0
Impacted Files Coverage Δ
rust/arrow/src/array/array.rs 91.12% <0%> (ø)
rust/arrow/src/ipc/file/reader.rs 87.34% <87.34%> (ø)
rust/arrow/src/ipc/convert.rs 96.94% <96.72%> (ø)

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 6d25dfd...d555384.

@ghost

ghost commented Apr 18, 2019

@nevi-me I'm excited to see this! I will make time to review over the weekend.

Member

@sunchao sunchao left a comment


Sorry for the late review @nevi-me . Overall looks good.

rust/arrow/src/ipc/convert.rs (outdated review thread, resolved)
@@ -61,6 +63,108 @@ fn schema_to_fb(schema: &Schema) -> FlatBufferBuilder {
fbb
}

/// Deserialize a Schema table from IPC format to Schema data type
pub fn fb_to_schema(fb: ipc::Schema) -> Schema {
Member


Can we have tests for these newly added functions?

Contributor Author


I've added one test to rule them all, by converting a large Schema with all the types that we support, to flatbuffers and back. Would that suffice?
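That kind of roundtrip test can be sketched in miniature (a toy byte encoding standing in for the flatbuffers conversion; all names here are illustrative, not the actual `ipc::convert` API):

```rust
// Toy stand-in for schema <-> bytes conversion, illustrating the
// roundtrip test pattern: encode, decode, and assert equality.
#[derive(Debug, PartialEq)]
struct ToyField {
    name: String,
    nullable: bool,
}

fn to_bytes(fields: &[ToyField]) -> Vec<u8> {
    let mut out = Vec::new();
    for f in fields {
        out.push(f.name.len() as u8); // length-prefixed name
        out.extend_from_slice(f.name.as_bytes());
        out.push(f.nullable as u8);
    }
    out
}

fn from_bytes(mut data: &[u8]) -> Vec<ToyField> {
    let mut fields = Vec::new();
    while !data.is_empty() {
        let n = data[0] as usize;
        let name = String::from_utf8(data[1..1 + n].to_vec()).unwrap();
        let nullable = data[1 + n] != 0;
        fields.push(ToyField { name, nullable });
        data = &data[n + 2..];
    }
    fields
}
```

A single test covering a schema with every supported type follows this shape: build the value, convert to bytes and back, and assert the result equals the input.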

rust/arrow/src/ipc/file/reader.rs (4 outdated review threads, resolved)
///
/// Sets the current block to the batch number, and reads the record batch at that
/// block
pub fn read_batch(&mut self, batch_num: usize) -> Result<Option<RecordBatch>> {
Member


nit: wondering if we can change this to something like set_index, which just changes the current_block; the caller then has to call next() to actually read the batch.

Otherwise, this has the side effect of advancing the current block index, which may catch people by surprise. We should at least point this out in the method comments.
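The suggested split can be sketched with a toy reader (names hypothetical, an i32 standing in for a record batch): set_index only positions the cursor, and next() does the actual reading and advancing:

```rust
// Sketch of the suggested API shape: positioning and reading are separated.
struct ToyReader {
    blocks: Vec<i32>,     // stand-in for record-batch blocks
    current_block: usize,
}

impl ToyReader {
    /// Position the reader without reading anything.
    fn set_index(&mut self, batch_num: usize) -> Result<(), String> {
        if batch_num >= self.blocks.len() {
            return Err(format!("batch {} out of range", batch_num));
        }
        self.current_block = batch_num;
        Ok(())
    }

    /// Read the batch at the current block, then advance the cursor.
    fn next(&mut self) -> Option<i32> {
        let batch = self.blocks.get(self.current_block).copied();
        if batch.is_some() {
            self.current_block += 1;
        }
        batch
    }
}
```

With this shape, only next() has the side effect of advancing the cursor, which matches the usual iterator expectation.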

Contributor Author


Done

rust/arrow/src/ipc/convert.rs (outdated review thread, resolved)
@nevi-me
Contributor Author

nevi-me commented Apr 25, 2019

Hi @sunchao I'm currently working on this PR to add support for lists and structs, as I've now generated the data.

@sunchao
Member

sunchao commented Apr 25, 2019

Cool. Looking forward to the updated PR!

@nevi-me
Contributor Author

nevi-me commented Apr 26, 2019

Update: The buffers from the Arrow file are padded to 64 bits, while the ones in Rust are padded to 8 bits. Due to this difference, I can't test data equality using ArrayData.


Hi @sunchao I've updated the PR with list and struct reading. Regarding your comment about using ArrayData, I've noticed that there's a difference between Rust and Python/CPP (maybe just IPC) with buffer padding to multiples of 8 (and treatment of null buffers, but I'll address this separately).

The current commit ab6ff9b (ab6ff9b#diff-cd3519a5e748548b5323b0d05d8e9c8aR504) will fail with the below:

thread 'ipc::file::reader::tests::test_read_struct_file' panicked at 'assertion failed: `(left == right)`
  left: `ArrayData { data_type: Boolean, len: 5, null_count: 2, offset: 0, buffers: [Buffer { data: BufferData { ptr: 0x20a62bba200, len: 8 }, offset: 0 }], child_data: [], null_bitmap: Some(Bitmap { bits: Buffer { data: BufferData { ptr: 0x20a62bb9fc0, len: 8 }, offset: 0 } }) }`,
 right: `ArrayData { data_type: Boolean, len: 5, null_count: 2, offset: 0, buffers: [Buffer { data: BufferData { ptr: 0x20a62bbec00, len: 1 }, offset: 0 }], child_data: [], null_bitmap: Some(Bitmap { bits: Buffer { data: BufferData { ptr: 0x20a62bbf000, len: 1 }, offset: 0 } }) }`', arrow\src\ipc\file\reader.rs:510:13

The difference is len: 1 vs len: 8 in the BufferData lengths; the above is from a BooleanArray with 5 elements. I've run out of time for now, but I'll compare the binary representation of the buffers to see what's causing the difference, though I suspect it's just the lengths being padded. https://www.diffchecker.com/xU8bBtne shows the diff between the struct_array buffer and the struct_type file that I'm reading; the padding to multiples of 8 is more apparent there.

There are other differences that I've picked up along the way, which might affect our interop compatibility; I'll list/address them over the coming days.
I'll also add the Python scripts that I used to generate the data, but I'd prefer to move the data files to arrow-testing during the course of this PR.

@andygrove
Member

@nevi-me This is looking great. I think this change is big enough that we should update the README too and explain where those test data files came from and how they were created.

@sunchao
Member

sunchao commented May 2, 2019

Update: The buffers from the Arrow file are padded to 64 bits, while the ones in Rust are padded to 8 bits. Due to this difference, I can't test data equality using ArrayData.

In Rust we also pad buffers to 64 bytes. I think the real reason here is that the len in BufferData is the # of valid bytes, instead of the # of total bytes (= # of valid bytes + # of padded bytes). I'm not sure whether it is easy to change the conversion of ipc::Buffer to meet this requirement. Changing the BufferData implementation might be a little involved, but let me take a look.
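The len: 1 vs len: 8 mismatch above is just this arithmetic: 5 boolean values need ceil(5/8) = 1 byte of valid data, and the file-side buffer reports that length rounded up to a padding boundary. A sketch of the two calculations (helper names illustrative, not arrow crate APIs):

```rust
// Bytes of valid data for a boolean bitmap of `num_values` bits.
fn bool_bitmap_bytes(num_values: usize) -> usize {
    (num_values + 7) / 8
}

// Round a byte length up to the next multiple of `alignment`.
fn pad_to(len: usize, alignment: usize) -> usize {
    ((len + alignment - 1) / alignment) * alignment
}
```

Comparing valid-byte lengths (or the unpadded logical contents) rather than total buffer lengths would make the equality check hold across both representations.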

@nevi-me
Contributor Author

nevi-me commented May 2, 2019

Thanks @sunchao

rust/arrow/src/array.rs (outdated review thread, resolved)
@wesm
Member

wesm commented Jun 24, 2019

What is the status of this patch for 0.14.0?

@nevi-me
Contributor Author

nevi-me commented Jun 24, 2019

What is the status of this patch for 0.14.0?

Hi @wesm, this might not make 0.14. I ended up being dependent on a few other changes that we needed to make. I'll be travelling next week, and I doubt I'll be able to complete this in time for 0.14.

@wesm
Member

wesm commented Jun 24, 2019

OK, no problem, I removed it from the milestone.

@kszucs kszucs force-pushed the master branch 2 times, most recently from ed180da to 85fe336 Compare July 22, 2019 19:29
@jbabyhacker

@nevi-me & @wesm what is the status of this PR?

@nevi-me
Contributor Author

nevi-me commented Aug 14, 2019

Hi @jbabyhacker

TL;DR it's not abandoned.

The long version is that I haven't had enough bandwidth to work on it. I got stuck because I needed help with some changes, but by the time we had introduced those changes, I was already swamped. I've been working long hours at work for the past few months, including weekends. I'm nearly done with my current project at work, so I'm anticipating having downtime from next weekend; my time will then be split between my studies and catching up here.

In terms of work required, I still need to:

I intend to get IPC (reader, writer, stream vs batch) done before 1.0.0.

@ghost

ghost commented Sep 9, 2019

If it's interesting at all, I put together a minimal Arrow Flight for Rust proof-of-concept on top of this branch: https://github.com/lihalite/arrow/commit/5ace5b226fb4a3a2a445b11c5b13f847ee3991b1

I used tower-grpc. The impl in this CR is enough to deserialize Flight schema and record batch messages, so we can make a ListFlights call followed by a DoGet.

@wesm
Member

wesm commented Sep 10, 2019

That's cool. It seems like hardening IPC in Rust is a prerequisite for many other projects. It might need to be a collective effort rather than blocking on one person.

@ghost

ghost commented Sep 10, 2019

Maybe we can kick off a discussion on the mailing list?

  • Status of Rust IPC, other subprojects there. Maybe it's reasonable to merge this as a start and build out more test cases and the other APIs?
  • Flight support in Rust: which gRPC implementation, which Protobuf implementation, etc. (tower-grpc isn't quite up to par with the standard gRPC implementations, but the others bind to C or C++ IIRC)

@nevi-me
Contributor Author

nevi-me commented Sep 10, 2019

My apologies for dragging this out for so long; I've been on one of those "it might end next week" projects that's taken too much of my time.

In terms of effort, the reader is complete, supporting all types that Rust can read. I'm mainly left with semantics and moving test cases to arrow-testing. I'll work on wrapping this up by the weekend so we can unblock progress here.

I haven't been able to follow the discussions around buffer alignment, so I might need help there from someone else.

As a start, I'll rebase when I get home tonight.

@nevi-me
Contributor Author

nevi-me commented Nov 19, 2019

@paddyhoran @andygrove @sunchao @liurenjie1024 PTAL. If anyone has capacity, it would be great to add more data type support so we can read more integration files.

I'm going to work on the stream format reader next, which will be very useful for flight.

Member

@andygrove andygrove left a comment


@nevi-me I don't pretend to understand every detail here but LGTM overall. There are a couple of unwrap/panics that could probably be changed to use results but I think we should get this PR merged to make it easier for other committers (myself included) to start helping you out with this effort.

Contributor

@paddyhoran paddyhoran left a comment


I agree with @andygrove. Let's get this merged so that others can start to chip in. Thanks @nevi-me, this is great progress!

@nevi-me nevi-me deleted the ARROW-5180 branch November 19, 2019 12:47
@andy-thomason
Contributor

What is the status of this? I could only see the beginnings of a schema converter in the repo, so I went ahead and wrote a full version of this. I also have batch and dictionary support in internal work projects if it's needed, but I'd be overjoyed if someone had done this already.

https://github.com/andy-thomason/arrow

@nevi-me
Contributor Author

nevi-me commented Nov 24, 2019

Hi @andy-thomason, your master branch is 725 commits behind apache/arrow. It's often a good idea to rebase before doing any work 😃
The reader was merged 5 days ago, so you can review that and perhaps continue from there. Dictionary support is exciting to me! Its relevant JIRA is https://issues.apache.org/jira/browse/ARROW-5949, so perhaps you can open a PR against it as a start.

I haven't had time to work on the stream reader (as this PR was only for the file reader), and we'd welcome more hands on deck so we can complete IPC in Rust by 1.0.0 (not sure when that'll be, but it might be in December).

@andy-thomason
Contributor

Excellent news, @nevi-me. We use Arrow extensively at work for genetic variant analysis, and I have written a number of parsers for the format (all internal, I'm afraid). I'll sync up and take a look.

@nevi-me
Contributor Author

nevi-me commented Nov 24, 2019

Is this all using Rust @andy-thomason?

@andy-thomason
Contributor

We were using C++, Python and R in the past, but now we are starting to write services in Rust instead of running batch code on a cluster.

@andy-thomason
Contributor

We also have a group in Cambridge using Rust in WASM.

@andygrove
Member

andygrove commented Nov 24, 2019 via email

@andy-thomason
Contributor

Excellent @andygrove, that saves me a huge pile of work. I have a deadline of the end of next week to get a file reader working using hyper. Liking this more and more.

@nevi-me
Contributor Author

nevi-me commented Nov 24, 2019

If you'll be using arrow-flight, the stream reader might be the next best thing for us to work on. I can't remember the spec offhand, but most of the work should be refactoring the file reader to check for end of stream.

@andy-thomason
Contributor

I'll certainly look at arrow-flight. At the moment we are using static data servers as our data is largely read-only, but I would like to move to a microservice model. We use record batches of about 1GB and have a second Arrow file as a sparse index.

@nevi-me
Contributor Author

nevi-me commented Nov 24, 2019

JIRA logs 10 minutes to each PR for comments, so the mailing list might be a better place to continue the conversation (as this PR is closed). Be wary though that gRPC has low message size limits by default (4 MB; I haven't tried modifying it in tonic or tower-grpc, but you can change it with a setting in other languages). You might end up having to break your Arrow batches down into smaller chunks.
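Breaking batches down amounts to slicing rows so each slice's approximate encoded size stays under the cap. A rough sketch (the sizes and helper name are illustrative, not a Flight API):

```rust
// Sketch: split `num_rows` rows into (offset, length) slices so that each
// slice's approximate encoded size stays under `max_bytes`.
fn chunk_rows(num_rows: usize, bytes_per_row: usize, max_bytes: usize) -> Vec<(usize, usize)> {
    let rows_per_chunk = (max_bytes / bytes_per_row).max(1);
    let mut chunks = Vec::new();
    let mut offset = 0;
    while offset < num_rows {
        let len = rows_per_chunk.min(num_rows - offset);
        chunks.push((offset, len));
        offset += len;
    }
    chunks
}
```

Each (offset, length) pair would then be sent as its own record batch message, keeping every gRPC frame under the default limit.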

9 participants