Loading from Arrow files #5594

Closed
andrewthad opened this issue Mar 14, 2023 · 4 comments · Fixed by #6337
Labels
enhancement New feature or request

Comments

andrewthad commented Mar 14, 2023

This is not necessarily a feature request. Rather, it's a request either for a feature to be added or for improved documentation clarifying that the feature is not available. To my understanding, DataFusion (at the CLI at least) cannot read from Apache Arrow files. There are two different kinds of Arrow files: the .arrow file (which has a footer with metadata about block positions) and the .arrows "streaming" file (which lacks the footer). I've tried out several CREATE EXTERNAL TABLE invocations:

CREATE EXTERNAL TABLE foo stored as ARROW LOCATION foo.arrow
CREATE EXTERNAL TABLE foo stored as ARROWS LOCATION foo.arrow
CREATE EXTERNAL TABLE foo stored as FEATHER LOCATION foo.arrow

They all give an "Unable to find factory for ..." error. After looking through more of the documentation for a while and paying attention to what wasn't explicitly said, I realized that Arrow files are not supported as a form of input. I think that, if this is the case, it should be mentioned explicitly in the documentation. As written, DataFusion's documentation is misleading: it does not make clear that Arrow is an internal implementation detail rather than an external-facing way to communicate with a producer of data. From the readme on GitHub:

> Easy to Connect: Being part of the Apache Arrow ecosystem (Arrow, Parquet and Flight), DataFusion works well with the rest of the big data ecosystem

I read this as meaning that DataFusion can consume data in the Arrow format. Flight is a tool specifically for the purpose of shuffling Arrow-formatted data around on a network, so it's hard to interpret this as meaning anything else. Perhaps this was a goal at some point, or maybe it's possible to do this, but it's undocumented.

Here are three mutually exclusive possibilities for improving this situation:

  • Document that arrow files are not supported.
  • Document that arrow files are supported (maybe they are and I couldn't figure it out!)
  • Support arrow files as a source of data
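
As an aside on the two file kinds described above, here is a minimal sketch of how they can be told apart by their bytes. It relies on the Arrow IPC layout: the file format starts and ends with the magic string ARROW1, while a stream normally begins with the 0xFFFFFFFF continuation marker. The function name sniff_arrow_kind is hypothetical and is not part of DataFusion or any Arrow library; this only inspects bytes and does not validate the contents.

```python
ARROW_MAGIC = b"ARROW1"              # magic string of the Arrow IPC file format
CONTINUATION = b"\xff\xff\xff\xff"   # marker that prefixes each message in a stream

# Smallest plausible file: magic + 2 padding bytes, 4-byte footer length, magic.
MIN_FILE_SIZE = len(ARROW_MAGIC) + 2 + 4 + len(ARROW_MAGIC)


def sniff_arrow_kind(data: bytes) -> str:
    """Guess whether `data` is an Arrow IPC file, an IPC stream, or neither.

    Hypothetical helper for illustration only.
    """
    # The file format is bracketed by the magic string at both ends;
    # the footer with block positions sits just before the trailing magic.
    if (
        len(data) >= MIN_FILE_SIZE
        and data.startswith(ARROW_MAGIC)
        and data.endswith(ARROW_MAGIC)
    ):
        return "file"
    # The streaming format has no magic and no footer. Streams written by
    # very old Arrow versions may lack the continuation marker, in which
    # case this sketch reports "unknown".
    if data.startswith(CONTINUATION):
        return "stream"
    return "unknown"
```

Reading a whole file into memory is only reasonable for small inputs; for large ones, reading the first and last few bytes would be enough for the same check.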
@andrewthad andrewthad added the enhancement New feature or request label Mar 14, 2023

alamb (Contributor) commented Mar 15, 2023

I think @Dandandan had a PR that started to do this: #1858

alamb (Contributor) commented Mar 15, 2023

> Document that arrow files are not supported.

Arrow files are not supported, to the best of my knowledge.

andrewthad (Author) commented

If that's the case, I'd like to PR an addition to the FAQ page that makes this more clear. Would that be alright?

alamb (Contributor) commented Mar 15, 2023

> If that's the case, I'd like to PR an addition to the FAQ page that makes this more clear. Would that be alright?

Please do!
