Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for reading Arrow files #6337

Merged
merged 10 commits into from
May 16, 2023

Conversation

jonmmease
Copy link
Contributor

Which issue does this PR close?

Closes #5594

What changes are included in this PR?

This PR adds support for registering Arrow files with the SessionContext and reading them. It's largely pattern matched on the existing Avro support, with some inspiration from #1858.

Are these changes tested?

There is one test, but I'd appreciate guidance on what additional testing we'd like to have here.

Are there any user-facing changes?

Yes, there is now a register_arrow method on the SessionContext.

@github-actions github-actions bot added the core Core DataFusion crate label May 11, 2023
@jonmmease
Copy link
Contributor Author

Ok, tests are passing and this is ready for a look. Thanks!

@alamb
Copy link
Contributor

alamb commented May 12, 2023

Thank you @jonmmease -- this looks great. I plan to review it in the next day or so if no one beats me to it.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this looks great @jonmmease -- thank you so much 👍

I think we should add a sqllogictest (I can help with this) if possible to cover SQL as well and file a ticket for the follow on work to avoid reading the file twice, but otherwise I think this PR is good to go.

Really nice work 🏆

// specific language governing permissions and limitations
// under the License.

//! Apache Arrow format abstractions
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you link to the format docs to make it clear this writer writes the IPC format https://arrow.apache.org/docs/format/Columnar.html#ipc-file-format ?

Copy link
Contributor Author

@jonmmease jonmmease May 15, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in f9053a3

let schema = match store.get(&object.location).await? {
GetResult::File(mut file, _) => read_arrow_schema_from_reader(&mut file)?,
r @ GetResult::Stream(_) => {
// TODO: Fetching entire file to get schema is potentially wasteful
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes I agree it is more than potentially 😆

I don't think we need to fix it for this PR, however. Maybe @tustvold has some ideas, and if not then I think we can just file a follow on ticket to track fetching only the parts that are needed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Haha, yeah, I carried that comment over from the Avro reader as the situation is the same here. But maybe there's some way to look at the start of the stream and parse out the schema?


use super::*;

async fn register_arrow(ctx: &mut SessionContext) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good -- I think we should also write a sqllogictest version of this (e.g. https://github.com/apache/arrow-datafusion/blob/b578c5819fc05801f2972dd427fbc13cf4773ea8/datafusion/core/tests/sqllogictests/test_files/ddl.slt#LL259C1-L260C1)

So that we have coverage of creating an arrow backed table via SQL

But I also think that could be added as a follow on PR

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually tried creating an arrow table via SQL and unfortunately it did not work:

datafusion-cli
❯ create external table example stored as arrow location '../datafusion/core/tests/data/example.arrow';
Execution error: Unable to find factory for ARROW

To support that I think all we need is to add another entry to this list:

https://github.com/apache/arrow-datafusion/blob/b578c5819fc05801f2972dd427fbc13cf4773ea8/datafusion/core/src/execution/context.rs#L1358-L1362

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

registration support added in cfad814, and sqllogictest added in 718a21f

@github-actions github-actions bot added the sqllogictest SQL Logic Tests (.slt) label May 15, 2023
@jonmmease
Copy link
Contributor Author

Thanks for taking a look and for the kind works @alamb! I think I've addressed your feedback above. Let me know if you'd like me to open the issue to track the schema-from-stream TODO.

@alamb
Copy link
Contributor

alamb commented May 16, 2023

Let me know if you'd like me to open the issue to track the schema-from-stream TODO.

Yes, please, if you can that would be awesome 🙏 thank you

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great -- thank you @jonmmease 👨‍🍳 👌

@jonmmease
Copy link
Contributor Author

Thanks a lot for the helpful review @alamb! I opened #6368 as discussed.

@Jefffrey Jefffrey mentioned this pull request May 22, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Loading from Arrow files
2 participants