Convert Utf8View/BinaryView --> Utf8 / Binary at output #12119

Closed
Tracked by #11752
alamb opened this issue Aug 22, 2024 · 8 comments · Fixed by #12271

@alamb
Contributor

alamb commented Aug 22, 2024

Is your feature request related to a problem or challenge?

Part of #11752

We are trying to change DataFusion to use StringViewArray by default when reading parquet (and in other places where it makes sense, such as the substr function), because StringView enables many interesting optimization opportunities. However, StringView is still being adopted across the rest of the arrow ecosystem, so if DataFusion begins to emit StringViewArray in some places, it may cause issues with other parts of the ecosystem (e.g. flight clients may not be able to interpret data sent by a server using DataFusion).

Describe the solution you'd like

I would like DataFusion to retain maximum compatibility at the interfaces, but be able to use StringViewArray internally when it improves performance

Describe alternatives you've considered

I recommend a config flag that makes it possible to convert Utf8View/BinaryView --> Utf8 / Binary at the query output and I think this conversion should be done by default.

For example we might add this configuration flag:

datafusion.optimizer.expand_views_at_output=true

If this flag is true,

  1. add code in the Analyzer (maybe in the TypeCoercion code)
  2. check the output columns of a plan, and if any are DataType::Utf8View or DataType::BinaryView, add a ProjectionExec that converts them to Utf8/Binary (by adding a cast to DataType::Utf8 or DataType::Binary, respectively)
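
As a rough illustration of the two steps above, here is a minimal sketch in plain Rust. It assumes a toy DataType enum instead of the real arrow type, and output_cast_target and plan_output_casts are hypothetical names, not DataFusion APIs. It only works out which output columns would need a cast; the real change would apply those casts through a projection on top of the plan.

```rust
/// Toy stand-in for the relevant arrow `DataType` variants (not the real enum).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum DataType {
    Int64,
    Utf8,
    Utf8View,
    Binary,
    BinaryView,
}

/// Hypothetical helper: the type a view column would be cast to at the
/// output, or `None` when the column can pass through unchanged.
fn output_cast_target(data_type: DataType) -> Option<DataType> {
    match data_type {
        DataType::Utf8View => Some(DataType::Utf8),
        DataType::BinaryView => Some(DataType::Binary),
        _ => None,
    }
}

/// Hypothetical pass over the output schema. Returns the casts a final
/// projection would need as (column name, target type) pairs. An empty
/// result means no projection has to be added.
fn plan_output_casts(
    schema: &[(&str, DataType)],
    expand_views_at_output: bool,
) -> Vec<(String, DataType)> {
    if !expand_views_at_output {
        return Vec::new();
    }
    schema
        .iter()
        .filter_map(|(name, dt)| output_cast_target(*dt).map(|t| (name.to_string(), t)))
        .collect()
}
```

For an output schema with a Utf8View column and a BinaryView column, the sketch reports one cast to Utf8 and one to Binary when the flag is true, and no casts when it is false.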

Additional context

We already have to do something similar in flight with dictionary arrays

@alamb alamb added the enhancement New feature or request label Aug 22, 2024
@findepi
Member

findepi commented Aug 23, 2024

> I recommend a config flag that makes it possible to convert Utf8View/BinaryView --> Utf8 / Binary at the query output and I think this conversion should be done by default.

Sounds reasonable.

One question -- do we need a flag though?

@alamb
Contributor Author

alamb commented Aug 24, 2024

> One question -- do we need a flag though?

My rationale is that converting from Utf8View --> Utf8 is not free. Thus users should have the option of not paying the cost if they want performance over compatibility.
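
To make the cost concrete, here is a minimal sketch of what the conversion involves, assuming a simplified model of both layouts. ViewArray, OffsetArray and to_offset_array are hypothetical names for illustration, not arrow-rs types. Real Arrow views are 16 bytes wide and store short strings inline, which this model leaves out.

```rust
/// Simplified view array: each view is (buffer index, offset, length)
/// pointing into one of several shared data buffers.
struct ViewArray {
    buffers: Vec<Vec<u8>>,
    views: Vec<(usize, usize, usize)>,
}

/// Simplified Utf8/Binary array: one contiguous value buffer plus offsets
/// marking where each value starts and ends.
struct OffsetArray {
    offsets: Vec<usize>,
    values: Vec<u8>,
}

/// Every referenced byte is copied into the new contiguous buffer, so the
/// work grows with the total length of the values, not just their count.
fn to_offset_array(array: &ViewArray) -> OffsetArray {
    let mut offsets = vec![0];
    let mut values = Vec::new();
    for &(buffer, offset, len) in &array.views {
        values.extend_from_slice(&array.buffers[buffer][offset..offset + len]);
        offsets.push(values.len());
    }
    OffsetArray { offsets, values }
}
```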

I will admit I don't have a specific use case in mind

@alamb
Contributor Author

alamb commented Aug 24, 2024

BTW with this conversion in place, I think we could contemplate more interesting changes, like substr always outputting Utf8View even when the input was Utf8
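
One reason a view output suits substr is that a substring can be expressed by narrowing a view, without copying any bytes. The following is a minimal sketch of that idea, assuming a simplified view that is just a (start, len) window into a shared buffer. View and substr_view are hypothetical names, and the sketch is byte-based and 0-indexed, unlike the SQL substr function, which is character-based and 1-indexed.

```rust
/// Simplified view: a (start, len) window into a shared byte buffer.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct View {
    start: usize,
    len: usize,
}

/// Hypothetical byte-based substr over a view. It only narrows the window;
/// the underlying buffer is untouched. Out-of-range requests are clamped.
fn substr_view(view: View, from: usize, count: usize) -> View {
    let from = from.min(view.len);
    let len = count.min(view.len - from);
    View {
        start: view.start + from,
        len,
    }
}
```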

@findepi
Member

findepi commented Aug 24, 2024

> My rationale is that converting from Utf8View --> Utf8 is not free.

It is not, but transmitting non-compacted string views isn't free either.

also, we're transitioning from a state where DF didn't use string views (eager compaction) to a state where DF uses string views (deferred compaction). Thus, if we convert to non-view types on output, we do not risk regressing anything. And we can revisit later whether returning view types could be an improvement (and introduce a flag, if needed & worth it).
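
The point about non-compacted views can be sketched as follows, again assuming a simplified single-buffer model. ViewArray, referenced_bytes and compact are hypothetical names, not arrow-rs APIs. A view array that references only a few bytes still carries its whole data buffer until someone compacts it.

```rust
/// Simplified view array: (offset, length) views into one shared buffer.
struct ViewArray {
    buffer: Vec<u8>,
    views: Vec<(usize, usize)>,
}

/// Bytes the views actually point at, which may be far less than
/// `buffer.len()` after filters or substr-like operations.
fn referenced_bytes(array: &ViewArray) -> usize {
    array.views.iter().map(|&(_, len)| len).sum()
}

/// Deferred compaction: copy only the referenced bytes into a fresh buffer
/// and rewrite the views to point into it.
fn compact(array: &ViewArray) -> ViewArray {
    let mut buffer = Vec::with_capacity(referenced_bytes(array));
    let mut views = Vec::with_capacity(array.views.len());
    for &(offset, len) in &array.views {
        views.push((buffer.len(), len));
        buffer.extend_from_slice(&array.buffer[offset..offset + len]);
    }
    ViewArray { buffer, views }
}
```

Sending the array before compaction ships the whole buffer; compacting first pays for a copy but ships only the referenced bytes.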

@alamb alamb changed the title from "Add config flag to convert Utf8View/BinaryView --> Utf8 / Binary at output" to "Convert Utf8View/BinaryView --> Utf8 / Binary at output" Aug 26, 2024
@alamb
Contributor Author

alamb commented Aug 26, 2024

> Thus, if we convert to non-view types on output, we do not risk regressing anything.

Right -- I agree, which is why in my mind having the default be "Utf8" is important

> Thus, if we convert to non-view types on output, we do not risk regressing anything. And we can revisit later whether returning view types could be an improvement (and introduce a flag, if needed & worth it).

I would not be opposed to a PR that just hard codes the existing behavior (convert to Utf8 on output). I changed the title of this issue to reflect this.

I still think a config option offers the most flexibility, but I do agree we could add it later if needed

@wiedld
Contributor

wiedld commented Aug 27, 2024

@findepi -- are you already working on this ticket? If not, I would like to pick it up.

@findepi
Member

findepi commented Aug 27, 2024

@wiedld awesome, go for it

@wiedld
Contributor

wiedld commented Aug 28, 2024

take
