fix: preserve more dictionaries when coercing types #10221

erratic-pattern
Contributor

@erratic-pattern erratic-pattern commented Apr 24, 2024

Which issue does this PR close?

Closes #10220

Rationale for this change

Loosen the restriction on when type coercion will preserve dictionary types to prevent slow casting of columns with dictionary type.

What changes are included in this PR?

Are these changes tested?

There are already tests that would fail on type coercion. Let me know if there are additional tests you'd like to see.

Are there any user-facing changes?

Certain coercion operations will now preserve dictionary encoding where before they would downcast to the value type

@github-actions github-actions bot added the logical-expr Logical plan and expressions label Apr 24, 2024
@erratic-pattern
Contributor Author

erratic-pattern commented Apr 24, 2024

There is a test failure in sqllogictests

External error: query failed: DataFusion error: Error during planning: No function matches the given name and argument types 'coalesce(Int64, Dictionary(Int32, Int8))'. You might need to add explicit type casts.
        Candidate functions:
        coalesce(CoercibleT, .., CoercibleT)
[SQL] select coalesce(34, arrow_cast(123, 'Dictionary(Int32, Int8)'));
at test_files/scalar.slt:1794

here is the test case:
https://github.com/erratic-pattern/arrow-datafusion/blob/3a6c04ec8f989ddcf3b13a871a782c0bf56b9f8e/datafusion/sqllogictest/test_files/scalar.slt#L1795

There could be unintended consequences to this change for casting so I will look into it

@alamb
Contributor

alamb commented Apr 24, 2024

Can we please add a test to https://github.com/apache/datafusion/blob/main/datafusion/sqllogictest/test_files/dictionary.slt with the explain plan showing the correct filter being added?

@erratic-pattern erratic-pattern force-pushed the type-coercion-preserve-more-dictionaries branch from 3a6c04e to 7fe8eeb Compare April 24, 2024 20:59
@erratic-pattern
Contributor Author

The test failure is from comparison_coercion being used in function argument type coercion for the coalesce function and other Signature::VariadicEqual functions.

TypeSignature::VariadicEqual => {
    let new_type = current_types.iter().skip(1).try_fold(
        current_types.first().unwrap().clone(),
        |acc, x| {
            // The coerced types found by `comparison_coercion` are not guaranteed to be
            // coercible for the arguments. `comparison_coercion` returns more loose
            // types that can be coerced to both `acc` and `x` for comparison purpose.
            // See `maybe_data_types` for the actual coercion.
            let coerced_type = comparison_coercion(&acc, x);
            if let Some(coerced_type) = coerced_type {
                Ok(coerced_type)
            } else {
                internal_err!("Coercion from {acc:?} to {x:?} failed.")
            }
        },
    );

@erratic-pattern
Contributor Author

There are other places comparison_coercion is used so we may want to look for additional unintended side effects of this change.

array prepend/append:

let new_base_type = comparison_coercion(&array_base_type, &elem_base_type);

list type coercions:

pub fn get_coerce_type_for_list(
    expr_type: &DataType,
    list_types: &[DataType],
) -> Option<DataType> {
    list_types
        .iter()
        .try_fold(expr_type.clone(), |left_type, right_type| {
            comparison_coercion(&left_type, right_type)
        })
}

CASE WHEN ... :

pub fn get_coerce_type_for_case_expression(
    when_or_then_types: &[DataType],
    case_or_else_type: Option<&DataType>,
) -> Option<DataType> {
    let case_or_else_type = match case_or_else_type {
        None => when_or_then_types[0].clone(),
        Some(data_type) => data_type.clone(),
    };
    when_or_then_types
        .iter()
        .try_fold(case_or_else_type, |left_type, right_type| {
            // TODO: now just use the `equal` coercion rule for case when. If find the issue, and
            // refactor again.
            comparison_coercion(&left_type, right_type)
        })
}

IN subquery:

Expr::InSubquery(InSubquery {
    expr,
    subquery,
    negated,
}) => {
    let new_plan = analyze_internal(&self.schema, &subquery.subquery)?;
    let expr_type = expr.get_type(&self.schema)?;
    let subquery_type = new_plan.schema().field(0).data_type();
    let common_type = comparison_coercion(&expr_type, subquery_type).ok_or(plan_datafusion_err!(
        "expr type {expr_type:?} can't cast to {subquery_type:?} in InSubquery"
    ),
    )?;
    let new_subquery = Subquery {
        subquery: Arc::new(new_plan),
        outer_ref_columns: subquery.outer_ref_columns,
    };
    Ok(Transformed::yes(Expr::InSubquery(InSubquery::new(
        Box::new(expr.cast_to(&common_type, &self.schema)?),
        cast_subquery(new_subquery, &common_type)?,
        negated,
    ))))
}

BETWEEN:

Expr::Between(Between {
    expr,
    negated,
    low,
    high,
}) => {
    let expr_type = expr.get_type(&self.schema)?;
    let low_type = low.get_type(&self.schema)?;
    let low_coerced_type = comparison_coercion(&expr_type, &low_type)
        .ok_or_else(|| {
            DataFusionError::Internal(format!(
                "Failed to coerce types {expr_type} and {low_type} in BETWEEN expression"
            ))
        })?;
    let high_type = high.get_type(&self.schema)?;
    let high_coerced_type = comparison_coercion(&expr_type, &high_type)
        .ok_or_else(|| {
            DataFusionError::Internal(format!(
                "Failed to coerce types {expr_type} and {high_type} in BETWEEN expression"
            ))
        })?;
    let coercion_type =
        comparison_coercion(&low_coerced_type, &high_coerced_type)
            .ok_or_else(|| {
                DataFusionError::Internal(format!(
                    "Failed to coerce types {expr_type} and {high_type} in BETWEEN expression"
                ))
            })?;
    Ok(Transformed::yes(Expr::Between(Between::new(
        Box::new(expr.cast_to(&coercion_type, &self.schema)?),
        negated,
        Box::new(low.cast_to(&coercion_type, &self.schema)?),
        Box::new(high.cast_to(&coercion_type, &self.schema)?),
    ))))
}

| (other_type, d @ Dictionary(_, value_type))
if preserve_dictionaries && value_type.as_ref() == other_type =>
(d @ Dictionary(_, _), other_type) | (other_type, d @ Dictionary(_, _))
if preserve_dictionaries && can_cast_types(other_type, d) =>
Contributor

@jayzhan211 jayzhan211 Apr 25, 2024

If I understand correctly, we should check whether value_type and other_type are coercible. If yes return the new dict with new value type

Dict(k, i8) and i64 -> Dict(k, i64)
Dict(k, i64) and i8 -> Dict(k, i64)

Instead of Some(d.clone()) we should create a new dict with new value type.

And, I think we should use the coerce logic in comparison_coercion not can_cast_types, because the logic is not totally the same.

        | (other_type, d @ Dictionary(key_type, value_type))
            if preserve_dictionaries && comparison_coercion(value_type.as_ref(), other_type).is_some() =>
        {
            let new_value = comparison_coercion(&value_type, other_type).unwrap();
            Some(DataType::Dictionary(key_type.clone(), Box::new(new_value)))
        }
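
To make the suggestion concrete, here is a minimal, self-contained sketch of "coerce the value type, then rebuild the dictionary". The `Ty` enum and the `coerce_values` rules are hypothetical stand-ins for arrow's `DataType` and DataFusion's `comparison_coercion`; only the shape of the recursion is meant to carry over. Recursing on the inner value type (never on the dictionary itself) is also what avoids the infinite recursion.

```rust
/// Toy stand-in for arrow's `DataType` (hypothetical, for illustration only).
#[derive(Debug, Clone, PartialEq)]
enum Ty {
    Int8,
    Int32,
    Int64,
    Utf8,
    /// Dictionary(key type, value type)
    Dictionary(Box<Ty>, Box<Ty>),
}

/// Hypothetical, heavily simplified value coercion: identical types unify
/// and Int8 widens to Int64. The real rules are far richer.
fn coerce_values(lhs: &Ty, rhs: &Ty) -> Option<Ty> {
    use Ty::*;
    match (lhs, rhs) {
        (a, b) if a == b => Some(a.clone()),
        (Int8, Int64) | (Int64, Int8) => Some(Int64),
        _ => None,
    }
}

/// Coerce a dictionary against a non-dictionary by coercing only the value
/// type and wrapping the result in a new dictionary with the same key type.
fn coerce_preserving_dict(lhs: &Ty, rhs: &Ty) -> Option<Ty> {
    match (lhs, rhs) {
        (Ty::Dictionary(key, value), other) | (other, Ty::Dictionary(key, value))
            if !matches!(other, Ty::Dictionary(_, _)) =>
        {
            let new_value = coerce_values(value, other)?;
            Some(Ty::Dictionary(key.clone(), Box::new(new_value)))
        }
        // Everything else (including dictionary vs. dictionary) falls back
        // to the plain value rules in this sketch.
        _ => coerce_values(lhs, rhs),
    }
}
```

With these toy rules, `Dictionary(Int32, Int8)` against `Int64` yields `Dictionary(Int32, Int64)` regardless of argument order, and a pair with no common value type yields `None`.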

Contributor Author

@jayzhan211 Thanks for the tip. I initially avoided calling comparison_coercion because the way I was doing it would cause an infinite recursion, but calling comparison_coercion on the inner value type and then wrapping the result in a new Dictionary, as you suggested, solves that issue.

I was hoping this would fix the test failure, but unfortunately it does not so I will need to look into reworking the type coercion logic for VariadicEqual functions

@erratic-pattern erratic-pattern force-pushed the type-coercion-preserve-more-dictionaries branch 3 times, most recently from 3a53770 to 8901246 Compare April 25, 2024 03:38
        } else {
            value_type
        }
    };
    match (lhs_type, rhs_type) {
        (
            Dictionary(_lhs_index_type, lhs_value_type),
            Dictionary(_rhs_index_type, rhs_value_type),
        ) => comparison_coercion(lhs_value_type, rhs_value_type),
Contributor Author

Is there any particular reason why this case doesn't also preserve the dictionary type?

Contributor

Not that I know of

Contributor

Probably because value type is enough for comparison

Contributor

Maybe we should not preserve dict for comparison overall 🤔 ?

Contributor

@jayzhan211 jayzhan211 Apr 26, 2024

I think what we need in order to solve the issue is to avoid casting a column from its value type to a dictionary, because casting a column is costly compared with casting a constant.

Given the example below, if we preserve the dictionary we still end up casting the column (Utf8) to Dict(Int32, Utf8), whereas in this case casting the constant from Int64 to Utf8 is enough.

statement ok
create table test as values
  ('row1', arrow_cast(1, 'Utf8')),
  ('row2', arrow_cast(2, 'Utf8')),
  ('row3', arrow_cast(3, 'Utf8'))
;

# query comparing a Utf8 column with a Dictionary(Int32, Int64) constant
query TT
SELECT * from test where column2 =  arrow_cast(2, 'Dictionary(Int32, Int64)');
----
row2 2

query TT
explain SELECT * from test where column2 = arrow_cast(2, 'Dictionary(Int32, Int64)');
----
logical_plan
01)Filter: CAST(test.column2 AS Dictionary(Int32, Utf8)) = Dictionary(Int32, Utf8("2"))
02)--TableScan: test projection=[column1, column2]
physical_plan
01)CoalesceBatchesExec: target_batch_size=8192
02)--FilterExec: CAST(column2@1 AS Dictionary(Int32, Utf8)) = 2
03)----MemoryExec: partitions=1, partition_sizes=[1]

statement ok
drop table test;

Expected plan:

01)Filter: test.column2 = Utf8("2")
02)--TableScan: test projection=[column1, column2]
physical_plan
01)CoalesceBatchesExec: target_batch_size=8192
02)--FilterExec: column2@1 = 2
03)----MemoryExec: partitions=1, partition_sizes=[1]

I think we should not preserve the dictionary here, but we do need a specialization for comparing a dictionary type with a non-dictionary type.
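
A rough sketch of the "cast the constant, not the column" preference described above. The `Expr` enum, the string type names, and `plan_equality` are all hypothetical and only model the decision; they are not DataFusion's planner API.

```rust
/// Toy expression tree (hypothetical; not DataFusion's `Expr`).
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Column(String),
    Literal(String),
    /// Cast(inner expression, target type name)
    Cast(Box<Expr>, String),
}

fn cast(expr: Expr, to: &str) -> Expr {
    Expr::Cast(Box::new(expr), to.to_string())
}

/// Decide where the cast goes in `left = right` when the types differ.
/// A column cast touches every row, so the non-column side is preferred.
fn plan_equality(left: Expr, left_ty: &str, right: Expr, right_ty: &str) -> (Expr, Expr) {
    if left_ty == right_ty {
        return (left, right);
    }
    let left_is_column = matches!(left, Expr::Column(_));
    let right_is_column = matches!(right, Expr::Column(_));
    if left_is_column && !right_is_column {
        // Keep the column untouched and cast the other side to its type.
        (left, cast(right, left_ty))
    } else if right_is_column && !left_is_column {
        (cast(left, right_ty), right)
    } else {
        // No obviously cheap side: arbitrarily cast the right-hand side.
        (left, cast(right, left_ty))
    }
}
```

For the table above, `plan_equality(Column("column2"), "Utf8", Literal("2"), "Dictionary(Int32, Int64)")` leaves `column2` alone and wraps only the literal in a cast to `Utf8`, which is the shape of the expected plan.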

Contributor

@alamb alamb Apr 26, 2024

02)--FilterExec: CAST(column2@1 AS Dictionary(Int32, Utf8)) = 2

Yes I agree that looks bad. It should be unwrapped.

Thank you for that example 💯

Maybe we could extend

match (left.as_mut(), right.as_mut()) {
which handles CAST(col, ..) = const for other datatypes 🤔

I can try to do so later this weekend. Or would you like to try it @erratic-pattern ?

Contributor

FWIW I tried the example from @jayzhan211 on main and it doesn't put a cast on the filter -- thus I agree this PR would be a regression if merged as is. I will dismiss my review

DataFusion CLI v37.1.0
> create table test as values
  ('row1', arrow_cast(1, 'Utf8')),
  ('row2', arrow_cast(2, 'Utf8')),
  ('row3', arrow_cast(3, 'Utf8'))
;
0 row(s) fetched.
Elapsed 0.050 seconds.

> explain SELECT * from test where column2 = arrow_cast(2, 'Dictionary(Int32, Int64)');

+---------------+---------------------------------------------------+
| plan_type     | plan                                              |
+---------------+---------------------------------------------------+
| logical_plan  | Filter: test.column2 = Utf8("2")                  |
|               |   TableScan: test projection=[column1, column2]   |
| physical_plan | CoalesceBatchesExec: target_batch_size=8192       |
|               |   FilterExec: column2@1 = 2                       |
|               |     MemoryExec: partitions=1, partition_sizes=[1] |
|               |                                                   |
+---------------+---------------------------------------------------+
2 row(s) fetched.
Elapsed 0.053 seconds.

Contributor Author

I agree with @jayzhan211 that we probably don't want to cast everything to dictionaries in the way that we are currently doing it, and what we really want is a way for the optimizer to avoid expensive casts of dictionary columns, and more generally to just avoid column casts in favor of casting constants and scalar expressions.

I think what we have works fine for now and fixes performance issues we're seeing on dictionary columns, but should be improved for the general case in subsequent PRs that redesign the type coercion logic.

@erratic-pattern erratic-pattern force-pushed the type-coercion-preserve-more-dictionaries branch from 8901246 to f47e74d Compare April 25, 2024 03:54
@erratic-pattern
Contributor Author

It looks like the failing test case is no longer throwing an error after I added @jayzhan211 's suggestion:

DataFusion CLI v37.1.0
> select coalesce(34, arrow_cast(123, 'Dictionary(Int32, Int8)'));
+----------------------------------------------------------------------------+
| coalesce(Int64(34),arrow_cast(Int64(123),Utf8("Dictionary(Int32, Int8)"))) |
+----------------------------------------------------------------------------+
| 34                                                                         |
+----------------------------------------------------------------------------+
1 row(s) fetched.
Elapsed 0.071 seconds.

I think it's now complaining that the type is no longer an integer and is instead Dictionary<Int32, Int8>

External error: query columns mismatch:
[SQL] select coalesce(34, arrow_cast(123, 'Dictionary(Int32, Int8)'));
[Expected] [I]
[Actual  ] [?]
at test_files/scalar.slt:1794

@alamb alamb marked this pull request as draft April 25, 2024 10:05
@alamb
Contributor

alamb commented Apr 25, 2024

Marking as draft as I think this PR is still in progress and the CI is not yet passing

@alamb
Contributor

alamb commented Apr 25, 2024

I think it's now complaining that the type is no longer an integer and is instead Dictionary<Int32, Int8>

That seems like you could just update the test. Perhaps via https://github.com/apache/datafusion/tree/main/datafusion/sqllogictest#updating-tests-completion-mode

@erratic-pattern
Contributor Author

erratic-pattern commented Apr 25, 2024

I may try to approach a solution from a different angle here, to avoid needing to refactor the type coercion logic too much without any clear understanding of what the impacts could be.

I think a more specific optimization might look something like
ValueType(column) <binary op> CAST(<constant or expression> as ValueType)
to:
column <binary op> CAST(<constant or expression> as Dictionary<IndexType, ValueType>)

@alamb
Contributor

alamb commented Apr 25, 2024

I may try to approach a solution from a different angle here, to avoid needing to refactor the type coercion logic too much without any clear understanding of what the impacts could be.

FWIW I think the coercion is the right approach here. Let me take a look at this PR and see if I can find something

@erratic-pattern
Contributor Author

Or even more generally, you could ignore whether the operands are column references or constants and just have:

ValueType(<expr> :: Dictionary<IndexType, ValueType>) <binary op> ValueType(<expr> :: <any type>)
becomes:
(<expr> :: Dictionary<IndexType, ValueType>) <binary op> CAST(<expr> as Dictionary<IndexType, ValueType>)

@erratic-pattern
Contributor Author

I may try to approach a solution from a different angle here, to avoid needing to refactor the type coercion logic too much without any clear understanding of what the impacts could be.

FWIW I think the coercion is the right approach here. Let me take a look at this PR and see if I can find something

I think it is the right approach for this one particular case, but the coalesce example in the tests is a good example of where you probably do want to convert to a simple integer type rather than converting everything to dictionaries.

Maybe we need to decouple the non-comparison type coercions from using the comparison_coercion function and have their own coercion function(s). For example VariadicEqual functions can use a new operand_coercion function which does the same thing as the previous comparison_coercion function, but with preserve_dictionaries flag set to false.
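
As a sketch of that split: one shared routine parameterised by `preserve_dictionaries`, with two thin wrappers. Everything here is hypothetical (toy `Ty` enum, toy widening rule, and the `_sketch` function names); it only illustrates how the same inputs would give a dictionary for comparisons and a plain integer for `coalesce`-style operands.

```rust
/// Toy stand-in for arrow's `DataType` (hypothetical).
#[derive(Debug, Clone, PartialEq)]
enum Ty {
    Int8,
    Int32,
    Int64,
    /// Dictionary(key type, value type)
    Dictionary(Box<Ty>, Box<Ty>),
}

/// Hypothetical integer widening: the wider of two integer types wins.
fn widen(lhs: &Ty, rhs: &Ty) -> Option<Ty> {
    fn rank(t: &Ty) -> Option<u8> {
        match t {
            Ty::Int8 => Some(1),
            Ty::Int32 => Some(2),
            Ty::Int64 => Some(3),
            Ty::Dictionary(..) => None,
        }
    }
    let (l, r) = (rank(lhs)?, rank(rhs)?);
    Some(if l >= r { lhs.clone() } else { rhs.clone() })
}

fn coerce(lhs: &Ty, rhs: &Ty, preserve_dictionaries: bool) -> Option<Ty> {
    match (lhs, rhs) {
        // Two dictionaries: coerce on the value types only.
        (Ty::Dictionary(_, l), Ty::Dictionary(_, r)) => widen(l, r),
        // Dictionary vs. plain type: coerce the values, then optionally re-wrap.
        (Ty::Dictionary(key, value), other) | (other, Ty::Dictionary(key, value)) => {
            let inner = widen(value, other)?;
            if preserve_dictionaries {
                Some(Ty::Dictionary(key.clone(), Box::new(inner)))
            } else {
                Some(inner)
            }
        }
        _ => widen(lhs, rhs),
    }
}

/// Comparisons keep the dictionary encoding.
fn comparison_coercion_sketch(lhs: &Ty, rhs: &Ty) -> Option<Ty> {
    coerce(lhs, rhs, true)
}

/// Non-comparison operands (for example `coalesce` arguments) drop it.
fn operand_coercion_sketch(lhs: &Ty, rhs: &Ty) -> Option<Ty> {
    coerce(lhs, rhs, false)
}
```

For `Int64` and `Dictionary(Int32, Int8)`, the comparison variant returns `Dictionary(Int32, Int64)` while the operand variant returns `Int64`, which would keep the `coalesce` test producing an integer column.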

@erratic-pattern
Contributor Author

This comment also seems to hint at a possible refactor:

// The coerced types found by `comparison_coercion` are not guaranteed to be
// coercible for the arguments. `comparison_coercion` returns more loose
// types that can be coerced to both `acc` and `x` for comparison purpose.
// See `maybe_data_types` for the actual coercion.
let coerced_type = comparison_coercion(&acc, x);

We could potentially rewrite the coalesce coercion to use maybe_data_types, which allows restricting coercion to a list of output types:

/// Try to coerce the current argument types to match the given `valid_types`.
///
/// For example, if a function `func` accepts arguments of `(int64, int64)`,
/// but was called with `(int32, int64)`, this function could match the
/// valid_types by coercing the first argument to `int64`, and would return
/// `Some([int64, int64])`.
fn maybe_data_types(
    valid_types: &[DataType],
    current_types: &[DataType],
) -> Option<Vec<DataType>> {
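
For readers who have not looked at that function, the doc comment can be modelled roughly as follows. This is a sketch under assumptions, not the DataFusion implementation: `Ty`, `can_coerce_from`, and `maybe_data_types_sketch` are hypothetical, and the only coercion rule in the toy is Int32 to Int64.

```rust
/// Toy stand-in for arrow's `DataType` (hypothetical).
#[derive(Debug, Clone, Copy, PartialEq)]
enum Ty {
    Int32,
    Int64,
    Utf8,
}

/// Hypothetical rule set: only Int32 -> Int64 widening is allowed.
fn can_coerce_from(target: Ty, source: Ty) -> bool {
    matches!((target, source), (Ty::Int64, Ty::Int32))
}

/// Return the types the arguments would be coerced to, or `None` if any
/// argument can neither match nor be coerced to its expected type.
fn maybe_data_types_sketch(valid_types: &[Ty], current_types: &[Ty]) -> Option<Vec<Ty>> {
    if valid_types.len() != current_types.len() {
        return None;
    }
    valid_types
        .iter()
        .zip(current_types)
        .map(|(&valid, &current)| {
            if valid == current || can_coerce_from(valid, current) {
                Some(valid)
            } else {
                None
            }
        })
        // Collecting into Option<Vec<_>> stops at the first `None`.
        .collect()
}
```

Calling it with valid types `[Int64, Int64]` and current types `[Int32, Int64]` gives `Some([Int64, Int64])`, matching the example in the doc comment; an argument with no allowed coercion, such as `Utf8` where `Int64` is expected, gives `None`.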

@erratic-pattern
Contributor Author

This diff shows a list of types that coalesce can accept. There might be others but this would serve as a good starting point:
https://github.com/apache/datafusion/pull/9459/files#diff-9b644c479dfb999609b40e8da6d3b2a40c7adadf88296eeefd2f803201e7ab6dL25-L43

@github-actions github-actions bot added the sqllogictest SQL Logic Tests (.slt) label Apr 25, 2024
@erratic-pattern
Contributor Author

I think it's now complaining that the type is no longer an integer and is instead Dictionary<Int32, Int8>

That seems like you could just update the test. Perhaps via https://github.com/apache/datafusion/tree/main/datafusion/sqllogictest#updating-tests-completion-mode

I updated the tests, but I'm still unsure if it's okay to ignore this change in behavior. It seems like this could be a regression in other cases where we might be converting things to dictionaries for no reason.

It would be nice to have some issue tracking this, or maybe there is one already.

A possibly easy change to make in the short-term is to refactor away from using comparison_coercion for non-comparisons and instead use coercion logic specifically tailored to those expressions.

It just feels like we're playing Jenga with type coercion logic right now. The coercion logic should prioritize avoiding expensive casts, I think.

@alamb
Contributor

alamb commented Apr 25, 2024

I updated the tests, but I'm still unsure if it's okay to ignore this change in behavior. It seems like this could be a regression in other cases where we might be converting things to dictionaries for no reason.

I think the reason to convert to a dictionary is that the other side of the comparison is already a dictionary, which thus avoids converting both arguments (though I may be missing something)

Maybe @viirya has some thoughts given he worked on coercion as part of #9459 recently

It would be nice to have some issue tracking this, or maybe there is one already.

Perhaps #5928 ?

There appear to be a bunch of open tickets about coercion https://github.com/apache/datafusion/issues?q=is%3Aissue+is%3Aopen+coercion

A possibly easy change to make in the short-term is to refactor away from using comparison_coercion for non-comparisons and instead use coercion logic specifically tailored to those expressions.

This seems reasonable to me (aka have special rules for coerce). Other functions like BETWEEN I think can be thought of as syntactic sugar for comparisons (namely x >= low AND x <= high, since BETWEEN bounds are inclusive)
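
To illustrate why BETWEEN can share the comparison rules, here is a small sketch of the desugaring over a toy expression type. The `Expr` enum and `desugar_between` are hypothetical and are not how DataFusion represents or rewrites BETWEEN; the sketch assumes standard SQL semantics with inclusive bounds.

```rust
/// Toy expression tree (hypothetical; not DataFusion's `Expr`).
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Column(String),
    Literal(i64),
    GtEq(Box<Expr>, Box<Expr>),
    LtEq(Box<Expr>, Box<Expr>),
    And(Box<Expr>, Box<Expr>),
    Not(Box<Expr>),
}

/// Rewrite `expr [NOT] BETWEEN low AND high` into plain comparisons.
fn desugar_between(expr: Expr, negated: bool, low: Expr, high: Expr) -> Expr {
    let range = Expr::And(
        Box::new(Expr::GtEq(Box::new(expr.clone()), Box::new(low))),
        Box::new(Expr::LtEq(Box::new(expr), Box::new(high))),
    );
    if negated {
        Expr::Not(Box::new(range))
    } else {
        range
    }
}
```

Because each half is an ordinary comparison against the same `expr`, whatever rule picks a common type for `expr >= low` and `expr <= high` applies to BETWEEN as well.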

It just feels like we're playing Jenga with type coercion logic right now. The coercion logic should prioritize avoiding expensive casts, I think.

The key to ensuring we don't break existing code is to rely on the tests.

I agree the type coercion logic could be improved -- specifically I think it needs to have some encapsulation rather than being spread out through a bunch of free functions.

Is this something you are interested in helping out with? I think the first thing to do would be to try and explain how the current logic works (and in that process we will likely uncover ways to make it better). Then we would improve the code structure to encapsulate the logic (into structs / enums perhaps -- like I did recently in #10216). Once the logic is more encapsulated, I think we'll be able to reason about it and feel good about how it fits into the overall picture

I think the difference in casting a scalar integer to a scalar dictionary is negligible. The difference casting a column to a different type is likely substantial (though casting string --> dictionary doesn't require any data copying)

@alamb
Contributor

alamb commented Apr 25, 2024

For this PR I suggest:

  1. We add tests showing the desired behavior (I will push a commit or two)
  2. Update the coercion tests to show the resulting types

Then as a follow on, if you want to take on the refactoring of type coercion we can start doing that.

@alamb alamb marked this pull request as ready for review April 25, 2024 17:26
alamb
alamb previously approved these changes Apr 25, 2024
Contributor

@alamb alamb left a comment

Thank you @erratic-pattern and @jayzhan211

I pushed a few new tests and I think it is ready to go. I think we should wait a day or so to let other reviewers have a chance to look

In terms of "type coercion logic seems quite brittle", I agree. It would be really nice to try and make it easier to understand / change without worrying about unintended consequences

@@ -1768,52 +1768,61 @@ SELECT make_array(1, 2, 3);
[1, 2, 3]

# coalesce static empty value
query T
SELECT COALESCE('', 'test')
query TT
Contributor

I updated these tests to show what the coerced type was as well

explain SELECT * from test where column2 = 1;
----
logical_plan
01)Filter: test.column2 = Dictionary(Int32, Utf8("1"))
Contributor

the key is that there is no CAST on column2 here

query I
select coalesce(34, arrow_cast(123, 'Dictionary(Int32, Int8)'));
# test coercion Int and Dictionary
query ?T
Contributor

As @erratic-pattern noted, the difference here is that the output type of coalesce is now Dictionary rather than Int.

@jayzhan211
Contributor

jayzhan211 commented Apr 26, 2024

Maybe we need to decouple the non-comparison type coercions from using the comparison_coercion function and have their own coercion function(s). For example VariadicEqual functions can use a new operand_coercion function which does the same thing as the previous comparison_coercion function, but with preserve_dictionaries flag set to false.

I'm wondering why coalesce has the signature VariadicEqual. Since it takes the first non-null value, coercion ideally isn't necessary. I think VariadicAny is more suitable.

In this PR, coalesce does the coercion internally. I forget why we coerce rather than returning the first non-null value with the type it already has.

Output in this PR:
query IT
select coalesce(arrow_cast(1, 'Int32'), arrow_cast(1, 'Int64')), arrow_typeof(coalesce(arrow_cast(1, 'Int32'), arrow_cast(1, 'Int64')))
----
1 Int64

Expected:
1 Int32

After changing to VariadicAny, I got Int32.
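
The difference between the two signatures for this query can be sketched as below. Both functions are hypothetical models, not DataFusion code: the first folds all argument types to the widest one, as the `VariadicEqual` branch quoted earlier does via `comparison_coercion`; the second assumes that with `VariadicAny` no coercion happens and the result type is simply taken from the first argument.

```rust
/// Toy integer types, declared in widening order so the derived ordering
/// can stand in for "wider than" (hypothetical, for illustration only).
#[derive(Debug, Clone, Copy, PartialEq, PartialOrd)]
enum Ty {
    Int32,
    Int64,
}

/// Model of `VariadicEqual`: every argument is coerced to one common type,
/// here the widest type among the arguments.
fn result_type_variadic_equal(args: &[Ty]) -> Option<Ty> {
    args.iter()
        .copied()
        .reduce(|acc, x| if x > acc { x } else { acc })
}

/// Model of `VariadicAny` under the stated assumption: arguments are left
/// as they are and the first argument decides the result type.
fn result_type_variadic_any(args: &[Ty]) -> Option<Ty> {
    args.first().copied()
}
```

For the arguments `[Int32, Int64]` from the query above, the first model gives `Int64` and the second gives `Int32`, mirroring the two outputs reported here.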

@alamb
Contributor

alamb commented Apr 26, 2024

I'm wondering why coalesce has the signature VariadicEqual. Since it takes the first non-null value, coercion ideally isn't necessary. I think VariadicAny is more suitable.

In this PR, coalesce does the coercion internally. I forget why we coerce rather than returning the first non-null value with the type it already has.

I can't remember either but I remember it was tricky -- I suggest we open a new ticket / discussion about it rather than trying to change the behavior in this PR

@erratic-pattern erratic-pattern force-pushed the type-coercion-preserve-more-dictionaries branch from 7ed5503 to 7f64b4a Compare April 26, 2024 19:22
@erratic-pattern
Contributor Author

rebased onto main

@alamb
Contributor

alamb commented Apr 29, 2024

I had a chat with @erratic-pattern. The plan:

  1. We'll put this PR into draft mode.
  2. He is going to make a ticket and a PR that will improve the unwrap cast comparison to handle dictionaries.
  3. I will review "Fix Coalesce casting logic to follows what Postgres and DuckDB do. Introduce signature that do non-comparison coercion" (#10268) from @jayzhan211 and hopefully get that sorted out.
  4. Once that is complete, we'll revisit the approach in this PR and decide how to proceed.

@alamb
Contributor

alamb commented May 1, 2024

I believe this is superseded by #10323

@alamb alamb closed this May 1, 2024
Labels
logical-expr Logical plan and expressions sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Slow comparisions to dictionary columns with type coercion
3 participants