This is not necessarily a feature request. Rather, it's a request either for the feature to be added or for the documentation to state clearly that the feature is not available. To my understanding, DataFusion (at the CLI at least) cannot read Apache Arrow files. There are two kinds of Arrow files: the .arrow file (which has a footer with metadata about block positions) and the .arrows "streaming" file (which lacks the footer). I've tried several CREATE EXTERNAL TABLE invocations:
CREATE EXTERNAL TABLE foo stored as ARROW LOCATION foo.arrow
CREATE EXTERNAL TABLE foo stored as ARROWS LOCATION foo.arrow
CREATE EXTERNAL TABLE foo stored as FEATHER LOCATION foo.arrow
They all give an "Unable to find factory for ..." error. After looking through more of the documentation and paying attention to what wasn't explicitly said, I realized that Arrow files are not supported as a form of input. If this is the case, I think it should be mentioned explicitly in the documentation. As written, DataFusion's documentation is misleading: it does not make clear that Arrow is an internal implementation detail rather than an external-facing way to communicate with a producer of data. From the README on GitHub:
Easy to Connect: Being part of the Apache Arrow ecosystem (Arrow, Parquet and Flight), DataFusion works well with the rest of the big data ecosystem
I read this as meaning that DataFusion can consume data in the Arrow format. Flight is a tool specifically for moving Arrow-formatted data around on a network, so it's hard to interpret this as meaning anything else. Perhaps this was a goal at some point, or maybe it's possible to do this but it's undocumented.
Here are three mutually exclusive possibilities for improving this situation:
Document that arrow files are not supported.
Document that arrow files are supported (maybe they are and I couldn't figure it out!)
Support arrow files as a source of data
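For anyone trying to reproduce this, it may help to confirm which of the two Arrow file kinds a given file actually is. Below is a minimal sketch that uses only the Python standard library. It relies on the Arrow IPC file format beginning and ending with the magic bytes `ARROW1`, which the streaming format does not have. The function name `looks_like_arrow_file` is hypothetical and not part of any library, and a match only suggests the file format; it does not validate the contents.

```python
# Magic bytes that open and close an Arrow IPC *file* (the footer variant).
# The streaming variant has no such marker.
ARROW_MAGIC = b"ARROW1"


def looks_like_arrow_file(data: bytes) -> bool:
    """Return True if the bytes start and end with the Arrow file magic.

    Hypothetical helper for illustration only: it checks the markers,
    not the schema, footer, or record batches in between.
    """
    if len(data) < 2 * len(ARROW_MAGIC):
        # Too short to hold both the leading and the trailing magic.
        return False
    return data.startswith(ARROW_MAGIC) and data.endswith(ARROW_MAGIC)


# Usage (assumes a local file named foo.arrow exists):
#     with open("foo.arrow", "rb") as f:
#         print(looks_like_arrow_file(f.read()))
```

If the check returns False for a file that is known to contain Arrow data, it is probably in the streaming format.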