Convert `Utf8View`/`BinaryView` --> `Utf8`/`Binary` at output
#12119
Comments
Sounds reasonable. One question -- do we need a flag though?
My rationale is that converting from Utf8View --> Utf8 is not free. Thus users should have the option of not paying the cost if they want performance over compatibility. I will admit I don't have a specific use case in mind.
BTW with this conversion in place, I think we could contemplate more interesting changes, like
It is not. Also, we're transitioning from a state where DF didn't use string views (eager compaction) to a state where DF uses string views (deferred compaction). Thus, if we convert to non-view types on output, we do not risk regressing anything. We can revisit later whether returning view types could be an improvement (and introduce a flag, if needed and worthwhile).
Right -- I agree, which is why in my mind having the default be "Utf8" is important.
I would not be opposed to a PR that just hard-codes the existing behavior (convert to Utf8 on output). I changed the title of this PR to reflect this. I still think a config option offers the most flexibility, but I do agree we could add it later if needed.
@findepi -- are you already working on this ticket? If not, I would like to pick it up.
@wiedld awesome, go for it
take
Is your feature request related to a problem or challenge?
Part of #11752
We are trying to change DataFusion to use `StringViewArray` by default when reading parquet (and in other places where it makes sense, such as the `substr` function). StringView enables many interesting optimization opportunities. However, as StringView is still being adopted across the rest of the arrow ecosystem, if DataFusion begins to emit `StringViewArray` in some places, it may cause issues with other parts of the ecosystem (e.g. flight clients may not be able to interpret data sent by a server using DataFusion).
Describe the solution you'd like
I would like DataFusion to retain maximum compatibility at the interfaces, but be able to use `StringViewArray` internally when it improves performance.
Describe alternatives you've considered
I recommend a config flag that makes it possible to convert `Utf8View`/`BinaryView` --> `Utf8`/`Binary` at the query output, and I think this conversion should be done by default.
For example, we might add this configuration flag:
If this flag is true, and the query output contains `DataType::Utf8View` or `DataType::BinaryView`, add a `ProjectionExec` that converts them to Utf8/Binary (by adding a cast to `DataType::Utf8` or `DataType::Binary` respectively).
Additional context
We already have to do something similar in flight with dictionary arrays
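To make the proposal above a bit more concrete, here is a rough, std-only Rust sketch of the type mapping a final projection would apply. Everything in it is a hypothetical stand-in: the `DataType` enum and `Field` struct are toy models rather than the arrow types, `casts_for_output` is not a DataFusion function, and the `convert_views` parameter merely plays the role of the proposed flag (whose real name is not settled here). It only shows which columns would get a cast; the actual data conversion is out of scope.

```rust
/// Toy stand-in for the relevant arrow `DataType` variants (assumption, not the real enum).
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Utf8,
    Utf8View,
    Binary,
    BinaryView,
    Int64,
}

/// Hypothetical output column: a name plus its type.
#[derive(Debug, Clone, PartialEq)]
struct Field {
    name: String,
    data_type: DataType,
}

/// Type a column should have at the query output:
/// view types map to their non-view equivalents, everything else is unchanged.
fn output_type(dt: &DataType) -> DataType {
    match dt {
        DataType::Utf8View => DataType::Utf8,
        DataType::BinaryView => DataType::Binary,
        other => other.clone(),
    }
}

/// Returns the schema a final cast projection would produce, or `None`
/// when no projection is needed (flag disabled, or no view types present).
fn casts_for_output(schema: &[Field], convert_views: bool) -> Option<Vec<Field>> {
    if !convert_views {
        return None;
    }
    let needs_cast = schema
        .iter()
        .any(|f| output_type(&f.data_type) != f.data_type);
    if !needs_cast {
        return None;
    }
    Some(
        schema
            .iter()
            .map(|f| Field {
                name: f.name.clone(),
                data_type: output_type(&f.data_type),
            })
            .collect(),
    )
}
```

Returning `None` when the schema has no view columns mirrors the idea that the extra projection should only be added when it is actually needed, so plans that never produce view types pay nothing.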