-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add support for reading Arrow files #6337
Conversation
Ok, tests are passing and this is ready for a look. Thanks! |
Thank you @jonmmease -- this looks great. I plan to review it in the next day or so if no one beats me to it. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this looks great @jonmmease -- thank you so much 👍
I think we should add a sqllogictest (I can help with this) if possible to cover SQL as well and file a ticket for the follow on work to avoid reading the file twice, but otherwise I think this PR is good to go.
Really nice work 🏆
// specific language governing permissions and limitations | ||
// under the License. | ||
|
||
//! Apache Arrow format abstractions |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you link to the format docs to make it clear this writer writes the IPC format https://arrow.apache.org/docs/format/Columnar.html#ipc-file-format ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done in f9053a3
let schema = match store.get(&object.location).await? { | ||
GetResult::File(mut file, _) => read_arrow_schema_from_reader(&mut file)?, | ||
r @ GetResult::Stream(_) => { | ||
// TODO: Fetching entire file to get schema is potentially wasteful |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes I agree it is more than potentially 😆
I don't think we need to fix it for this PR, however. Maybe @tustvold has some ideas, and if not then I think we can just file a follow on ticket to track fetching only the parts that are needed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Haha, yeah, I carried that comment over from the Avro reader as the situation is the same here. But maybe there's some way to look at the start of the stream and parse out the schema?
|
||
use super::*; | ||
|
||
async fn register_arrow(ctx: &mut SessionContext) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good -- I think we should also write a sqllogictest version of this (e.g. https://github.com/apache/arrow-datafusion/blob/b578c5819fc05801f2972dd427fbc13cf4773ea8/datafusion/core/tests/sqllogictests/test_files/ddl.slt#LL259C1-L260C1)
So that we have coverage of creating an arrow backed table via SQL
But I also think that could be added as a follow on PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I actually tried creating an arrow table via SQL and unfortunately it did not work:
datafusion-cli
❯ create external table example stored as arrow location '../datafusion/core/tests/data/example.arrow';
Execution error: Unable to find factory for ARROW
To support that I think all we need is to add another entry to this list:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for taking a look and for the kind works @alamb! I think I've addressed your feedback above. Let me know if you'd like me to open the issue to track the schema-from-stream TODO. |
Yes, please, if you can that would be awesome 🙏 thank you |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great -- thank you @jonmmease 👨🍳 👌
Which issue does this PR close?
Closes #5594
What changes are included in this PR?
This PR adds support for registering Arrow files with the
SessionContext
and reading them. It's largely pattern matched on the existing Avro support, with some inspiration from #1858.Are these changes tested?
There is one test, but I'd appreciate guidance on what additional testing we'd like to have here.
Are there any user-facing changes?
Yes, there is now a
register_arrow
method on theSessionContext
.