Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor: separate get_result_type from coerce_type #6221

Merged
merged 4 commits into from
May 5, 2023

Conversation

jackwener
Copy link
Member

@jackwener jackwener commented May 4, 2023

Which issue does this PR close?

related to #5927

Rationale for this change

coerce_type current mix many motivation in it.

  • get_result_type
  • get_common_type
  • check type convert

I prepare to split it into get_result_type and get_common_type

about check type convertion #6223

What changes are included in this PR?

separate get_result_type from coerce_type.

Are these changes tested?

Are there any user-facing changes?

@github-actions github-actions bot added core Core DataFusion crate logical-expr Logical plan and expressions physical-expr Physical Expressions labels May 4, 2023
Comment on lines -2056 to -2099
let message = format!(
"Expression {case:?} expected to error due to impossible coercion"
);
assert!(logical_plan.is_err(), "{}", message);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Type checking is done in the Type coercion analyzer rule rather than directly in the build

@jackwener jackwener changed the title feat: separate get_result_type and coerce_type feat: separate get_result_type from coerce_type May 4, 2023
@jackwener
Copy link
Member Author

Next, we should separate getCommonType from coerce_type

@github-actions github-actions bot added the sqllogictest SQL Logic Tests (.slt) label May 4, 2023
@jackwener jackwener marked this pull request as ready for review May 4, 2023 08:25
@jackwener jackwener changed the title feat: separate get_result_type from coerce_type refactor: separate get_result_type from coerce_type May 4, 2023
Comment on lines +118 to +124
| Operator::NotEq
| Operator::Lt
| Operator::Gt
| Operator::GtEq
| Operator::LtEq
| Operator::IsDistinctFrom
Copy link
Member Author

@jackwener jackwener May 4, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Original, when we call get_data_type, it will return common type instead of result type (Boolean) in coerce_type. So original it's wrong.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a great step forward -- thank you @jackwener

I think this also closes #3419

Perhaps we can update the comments here https://github.com/apache/arrow-datafusion/blob/4297547df6dc297d692ca82566cfdf135d4730b5/datafusion/expr/src/type_coercion/binary.rs#L111-L121 and update it to say that 'coerce_types' is returns the type the arguments should be coerced to

I am happy to do so in a follow on PR

Thank you for taking this first step to clean up the coercion logic ❤️

datafusion/expr/src/type_coercion/binary.rs Show resolved Hide resolved
|| is_interval(lhs_type)
|| is_interval(rhs_type) =>
{
temporal_add_sub_coercion(lhs_type, rhs_type, op)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Eventually I think it would make the code easier to understand and maintain if we can disentangle coerce_types from get_result_types. To start in this PR, calling into coerce_types is fine.

Eventually I think it would keep the logic much simpler if get_result_type is only called on expressions that have their inputs properly coerced.

datafusion/expr/src/type_coercion/binary.rs Outdated Show resolved Hide resolved
@@ -77,7 +77,7 @@ impl PhysicalExpr for DateTimeIntervalExpr {
}

fn data_type(&self, input_schema: &Schema) -> Result<DataType> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would also love to remove DateTimeIntervalExpr out of its own thing and into binary.rs with the other operators, but for now 👍

@@ -42,7 +42,7 @@ pub fn binary_operator_data_type(
let result_type = if !any_decimal(lhs_type, rhs_type) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I als find binary_oprator_data_type to be confusing and redundant with get_result_type but we can fix that in a follow on PR perhaps

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great suggestion, I will do in following PR

@jackwener
Copy link
Member Author

Thanks @alamb review carefully.

@jackwener
Copy link
Member Author

jackwener commented May 5, 2023

I prepare to merge this PR because there are some following PR.

Anyone who wants to review this PR can continue, and they they can also review these content in the following PR

@jackwener jackwener merged commit 439863f into apache:main May 5, 2023
@jackwener jackwener deleted the seperate branch May 5, 2023 03:11
@tustvold
Copy link
Contributor

One slightly unfortunate quirk of this separation is it isn't clear if the inputs to get_result_type are before or after type coercion. This has non-trivial implications for decimal arithmetic, in particular for division, as type coercion is not idempotent. I'm still thinking about how to handle this better, but thought I would mention it in case you've already given this some thought

@jackwener
Copy link
Member Author

One slightly unfortunate quirk of this separation is it isn't clear if the inputs to get_result_type are before or after type coercion.

Agree with it.

It may be related with #6261

@alamb
Copy link
Contributor

alamb commented Jun 27, 2023

One slightly unfortunate quirk of this separation is it isn't clear if the inputs to get_result_type are before or after type coercion.

In my opinion, get_result_type should always be called after type coercion and if the function doesn't support the current input types, it should error.

Basically I think once the "Analyzer" pass has run on the expr tree the rest of the code shouldn't have to guess / coerce types anymore -- the types would always line up or the query would be rejected.

I think this is the standard approach in other databases (and, eg. the type checking / semantic pass in compilers)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate logical-expr Logical plan and expressions physical-expr Physical Expressions sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants