Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove ScalarValue::Dictionary #12488

Conversation

findepi
Copy link
Member

@findepi findepi commented Sep 16, 2024

ScalarValue should be a container for a single nullable logical type and should not be concerned by various physical encodings used in arrays. It doesn't involve arrays even as part of internal representation (like arrow Scalar does).

Remove Dictionary option from the ScalarValue. Note that ScalarValue doesn't have mapping for RLE arrays, while there are very similar type to dictionaries.

Having ScalarValue follow complexity of various physical representations poses a burden on UDF implementers. They need to handle all equivalent scalar values.

`ScalarValue` should be a container for a single nullable logical type
and should not be concerned by various physical encodings used in
arrays. It doesn't involve arrays even as part of internal
representation (like arrow `Scalar` does).

Remove `Dictionary` option from the `ScalarValue`. Note that
`ScalarValue` doesn't have mapping for RLE arrays, while there are very
similar type to dictionaries.

Having `ScalarValue` follow complexity of various physical
representations poses a burden on UDF implementers. They need to handle
all equivalent scalar values.
@github-actions github-actions bot added sql SQL Planner physical-expr Physical Expressions optimizer Optimizer rules core Core DataFusion crate common Related to common crate proto Related to proto crate labels Sep 16, 2024
@findepi
Copy link
Member Author

findepi commented Sep 16, 2024

This won't pass the test yet, but creating the PR to have a conversation first. cc @alamb @comphead

especially if we go with @notfilippo's #11513 , we need to have an answer to: what is a type? what isn't a type? what's logical and what's physical.
but even if we don't do the decoupling linked above, we need to answer this question.

Do we need scalar value to represent "would be dictionary but is single value" (being removed here)?
Do we need scalar value to represent "would be RLE but is single value" (similar, doesn't exist today)?
Do we need scalar value to represent "a string", "a string but actually maybe long one" or "a string but IF it was encoded in array, it would use SSO / German strings", " a string but IF it was encoded in array it would use prefix compression"?
Do we need a scalar to represent "an integer", "an integer but stored as varint", etc.?

To me those are properties of physical representation of series of values, that are not attributes of a single value, so ScalarValue doesn't have to be concerned about them. I think ScalarValue reflects some of them because we expect it to define type, and we expect type to sometimes define physical representation. This is blurry.

Side note:
if we want scalar value to represent all possible aspects of array representation then arrow's builtin Scalar seems to be ready for that.

cc @milevin @sadboy

@findepi findepi marked this pull request as draft September 16, 2024 15:13
@notfilippo
Copy link
Contributor

notfilippo commented Sep 16, 2024

Just posting it here for reference as this PR overlaps with some of the work in this other PR: #11978. I'm currently in the process of resolving merge conflicts accumulated during my vacation time.

@findepi
Copy link
Member Author

findepi commented Sep 17, 2024

This is awesome @notfilippo! I wasn't aware of that work!
Your PR is awesome!
Do you think it could help to have it land faster if it was done in small steps?
For example when removing ScalarValue::Dictionary I did some non-obvious chances that require actual review.

In any case, how can i help with this work?

@notfilippo
Copy link
Contributor

Thanks @findepi

Do you think it could help to have it land faster if it was done in small steps?

Definitely as it would make it easier to review. I opened the PR just to make sure that the same approach would be applicable as we remove more types (which i presume was very successful as we were able to identify some key issues that need attention). Once we reach consensus I was planning splitting my work into smaller PRs.

In any case, how can i help with this work?

Would really appreciate to get more eyes on my PR :)

@alamb
Copy link
Contributor

alamb commented Sep 18, 2024

Thanks @findepi

Do you think it could help to have it land faster if it was done in small steps?

Definitely as it would make it easier to review. I opened the PR just to make sure that the same approach would be applicable as we remove more types (which i presume was very successful as we were able to identify some key issues that need attention). Once we reach consensus I was planning splitting my work into smaller PRs.

From my point of view, there is pretty broad consensus that having ScalarValue mirror the physical representation (e.g. ScalarValue::Utf8, ScalarValue::UtfView, ScalarValue::LargeUtf8) is non ideal

The challenge is, as @findepi implies, figuring out how to make the changes incrementally (both to keep the change possible to review as well as to give downstream consumers time to absorb the change)

@alamb
Copy link
Contributor

alamb commented Sep 18, 2024

For example when removing ScalarValue::Dictionary I did some non-obvious chances that require actual review.

BTW I think the most important potential concern for this change is that arrays will be cast at runtime (e.g. when comparing a Dictionary(String) column and a String scalar, it is very important that the Dictionary column is not cast to a StringArray

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @findepi and @notfilippo . TLDIR is I think this is absolutely the right direction.

Sorry for the delay in reviewing.

To me those are properties of physical representation of series of values, that are not attributes of a single value, so ScalarValue doesn't have to be concerned about them.

I agree

I think ScalarValue reflects some of them because we expect it to define type, and we expect type to sometimes define physical representation. This is blurry.

I also agree it is blurry. I believe that DataFusion uses the Arrow type system for logical types initially for convenience -- at that time there was only Dictionary (REE, StringView were added later). So while it was annoying to handle one special case (Dictionary), it was manageable compared to all the other stuff going on

In my opinion if DataFusion was being implemented again starting today, it would not use Arrow::DataType directly for its logical type system but instead would have a more generic LogicalDataType similarly to what is proposed by @notfilippo

As you show in this PR, even the one special case for Dictionary is quite substantial. Also our experience adding support for StringView involved many changes that have nothing to do with the physical representation (#12503 from @my-vegetable-has-exploded just this morning is a good example)

if we want scalar value to represent all possible aspects of array representation then arrow's builtin Scalar seems to be ready for that.

Yes, I agree. The reason that DataFusion doesn't use the arrow Datum yet is twofold from my perspective:

  1. ScalarValue predates Datum so Datum wasn't an option
  2. Datum (single row array) is quite a bit less efficient than ScalarValue (e.g. a single row StringArray will have several allocations for buffers, offsets, etc)

@jayzhan211
Copy link
Contributor

@findepi Note that ScalarValue in #11978 is not the final state of #11513, it is a step toward the #11513. I think we could have LogicalType without any Arrow's DataType contained in it in the future, and it would be the ideal type what you mentioned. ScalarValue will be a physical type similar to Arrow's Scalar but optimized for native type like (i32, i64).

@findepi
Copy link
Member Author

findepi commented Sep 20, 2024

2. Datum (single row array) is quite a bit less efficient than ScalarValue (e.g. a single row StringArray will have several allocations for buffers, offsets, etc)

for function execution that sounds like a dead end.

in my free time i am trying to cook something for function calls.
that's where my interest in ScalarValue various options comes from

@findepi
Copy link
Member Author

findepi commented Sep 20, 2024

@notfilippo @alamb do you see any value in this PR, or should we just close it?

@alamb
Copy link
Contributor

alamb commented Sep 21, 2024

@notfilippo @alamb do you see any value in this PR, or should we just close it?

I am not sure about the code in the PR, but the idea seems like a step in the right direction (after #12536)

@findepi findepi closed this Oct 9, 2024
@findepi findepi deleted the findepi/remove-scalarvalue-dictionary-2907e2 branch October 9, 2024 10:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
common Related to common crate core Core DataFusion crate optimizer Optimizer rules physical-expr Physical Expressions proto Related to proto crate sql SQL Planner
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants