-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Remove LargeUtf8|Binary, Utf8|BinaryView, and Dictionary from ScalarValue #11978
Conversation
} | ||
|
||
fn create_output_array(val: &ScalarValue, len: usize) -> Result<ArrayRef> { | ||
// TODO(@notfilippo): should we reintroduce a way to encode as dictionaries? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm missing some context for partitions so I'm not really sure if it's the right call to remove the Dictionary output array. IIUC it seems like it's very useful to save space and optimise columns added this way.
02)--Filter: test.column1_utf8view = Utf8View("Andrew") | ||
02)--Filter: test.column1_utf8view = CAST(Utf8("Andrew") AS Utf8View) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Casts are not simplified to retain the type information to fix the optimize_projection
issue above.
let col_val = match phys_expr.evaluate(&self.input_batch) { | ||
Ok(v) => v, | ||
Err(err) => return ConstSimplifyResult::SimplifyRuntimeError(err, expr), | ||
}; | ||
|
||
// TODO(@notfilippo): a fix for the select_arrow_cast error |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This check partially fixes the issue with select_arrow_cast
.
datafusion/optimizer/src/simplify_expressions/expr_simplifier.rs
Outdated
Show resolved
Hide resolved
I run this test and fail. It success on main branch. I guess we need schema for converting ScalarValue to ArrayRef Given that we have ScalarValue::Utf8, and we have StringView in schema. We can then get the corresponding StrginViewArray. |
I feel like this is a little bit more complex because casting a scalar value to a logically equal type (
But I'm not really sure about it... |
Since you have no commit yet, so CI is not running. You could run these command to pass CI |
Thanks! Don't worry about turning it on as I can still get results from my fork. |
I've pushed a fairly big experiment. I've tried to change ColumnarValue to #[derive(Clone, Debug)]
pub enum ColumnarValue {
/// Array of values
Array(ArrayRef),
/// A single value
Scalar(ScalarValue),
Scalar(Scalar),
}
#[derive(Clone, Debug)]
pub struct Scalar {
value: ScalarValue,
data_type: DataType,
} which follows the approache that was discussed with @jayzhan211 in the comments above. I've opted for this hybrid solution to retain most of the flexibility of the original ColumnarValue and I'm mostly satisfied with how it turned out. Curious to hear your thoughts @jayzhan211 and @alamb |
It seems the trick you did is to get the first index of ArrayRef (instead of keeping it as arrow::Scalar) as ScalarValue but we still ends up require DataType to keep the type information. However, I think we could move on with this approach, we could figure out if there is better approach later on |
I'm happy to report that I've got most sqllogictests to run successfully (albeit there is still the issue pointed out by @alamb, which i plan to address after I've got all tests passing). The only errors I'm seeing are the following: Aggregates using ScalarValues as stateAggregates use ScalarValue to represent state and evaluate to a result. Should I look into restricting their return type to the subset of types which can be represented by ScalarValues?
Weird error I don't quite understandRemoving this query doesn't yield any other error in the slt file. I don't have any other clue and I'm not sure where to start 🤷 .
|
The issue is that we have only
I fail to reproduce the error 😕 UIpd: It seems there is only one place that calls
|
It seems like this solution is not that easy as the state needs to also be accounted for, but it's definitely a good start! |
Marking as "ready for review" because all tests pass on my end. Some casting issues still remain that I'll look into soon but in the meantime I'm looking forward to some feedback on this huge change ❤️ |
Will |
Yes that's the plan! |
I'm back from vacation and I've rebased my PR to the latest upstream. |
pub struct Scalar { | ||
value: ScalarValue, | ||
data_type: DataType, | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe this is the main change of this PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking that I could open a new PR introducing just introducing this columnar value type, and then little by little removing the scalar value variants (Dictionary, Large* and *View...)
I'm mentioning this because of the discussion in #12488 (comment), as I think it will be very beneficial to split this PR in multiple changes, both for reviewers and datafusion users.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with @findepi 's suggestion to make the smallest PR as possible to begin. I think splitting this PR out into a small part to begin, along with comments explaining the planned end state (or linking back to the ticket that describes this) would be very helpful
To make it clear this PR is not waiting on more review (we are discussing on #12536) marking this as draft |
Closing in favour of #12622 |
This PR removes LargeUtf8|Binary, Utf8|BinaryView, and Dictionary from ScalarValue, following the discussion on #11513
Open questions
Top level ScalarValue cast
This change initially failed the
select_arrow_cast
test (see review comments for updates). The test fails because of an interaction ofexpression_simplifier
+optimize_projections
.The query is:
SELECT arrow_cast(1234, 'Float64') as f64, arrow_cast('foo', 'LargeUtf8') as large
arrow_cast('foo', 'LargeUtf8')
cast (here)Utf8('foo')
(scalar) →LargeUtf8('foo')
(array) →Utf8('foo')
(scalar)optimize_projections
rewrites the Projection and updates the schema, seeing theUtf8('foo')
it correctly assumes that the LogicalPlan's schema field for this value should have DataType == Utf8This check is the one raising this error but I guess it should instead check if schema fields are logically equivalent to eachother. I'm not totally convinced this is the correct solution because it removes some guarantees that might be expected by users downstream. Happy to hear everyone's opinion on this.
arrow_typeof
https://github.com/apache/datafusion/blob/main/datafusion/functions/src/core/arrowtypeof.rs#L59-L72 uses
column_value.data_type()
to determine the type of the argument but this information is not really accurate. If the ColumnValue is a ScalarValue the data_type() will be "logical". e.g.arrow_typeof(arrow_cast('hello', 'Utf8View'))
would yieldUtf8
.Type info
Let's take this expr as an example a_function_that_takes_utf8_view(arrow_cast('test', 'Utf8View'))
the cast expression currently evaluates to a ColumnValue::Scalar(Utf8View("test")) and the function is happy to receive that. With this change the cast expression instead evaluates to ColumnValue::Scalar(Utf8("test")) (as ScalarValue::Utf8View doesn't exist it produces a logically equal value) and the cast expression data_type() returns Utf8View.