-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add coercion rules for AggregateFunctions #1387
Add coercion rules for AggregateFunctions #1387
Conversation
@alamb PTAL |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @liukun4515 I really like where this is heading and it makes sense to me to have aggregate coercion rules that are separate from coercion rules for other functions.
I am curious how you see the overall code structure evolving. Is your idea that there will be several coercion modules, for each different Expr type, for example?
datafusion/src/physical_plan/coercion_rule/aggregate_rule.rs
datafusion/src/physical_plan/coercion_rule/function.rs
datafusion/src/physical_plan/coercion_rule/operators.rs
I had a few suggestions / comments, but overall I think this is looking almost ready to merge
Thank you for starting this work
datafusion/src/execution/context.rs
Outdated
@@ -2058,7 +2058,7 @@ mod tests { | |||
.await | |||
.unwrap_err(); | |||
|
|||
assert_eq!(results.to_string(), "Error during planning: Coercion from [Timestamp(Nanosecond, None)] to the signature Uniform(1, [Int8, Int16, Int32, Int64, UInt8, UInt16, UInt32, UInt64, Float32, Float64]) failed."); | |||
assert_eq!(results.to_string(), "Error during planning: The function Sum do not support the Timestamp(Nanosecond, None)."); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be possible to add the valid signatures into this error message? The new wording is more readable, but we did lose some information about what type signatures are valid
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with this 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In some SQL systems, if we input the incompatible datatype, they just throw the error or except and don't give the compatible data type.
We can refine this later if the supported data type is necessary for the user.
match signature.type_signature { | ||
TypeSignature::Uniform(agg_count, _) | TypeSignature::Any(agg_count) => { | ||
if input_types.len() != agg_count { | ||
return Err(DataFusionError::Plan(format!("The function {:?} expect argument number is {:?}, but the input argument number is {:?}", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
return Err(DataFusionError::Plan(format!("The function {:?} expect argument number is {:?}, but the input argument number is {:?}", | |
return Err(DataFusionError::Plan(format!("The function {:?} expects {:?} arguments, but {:?} were provided", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good suggestion.
Done
use std::ops::Deref; | ||
use std::sync::Arc; | ||
|
||
pub fn coerce_types( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think some docstrings would help here (perhaps a pointer to the module level documentation as is done in type_coercion.rs
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I have added the doc for these two functions.
// TODO add checker, if the value type is complex data type | ||
Ok(vec![dict_value_type.deref().clone()]) | ||
} | ||
// TODO add checker for datatype which min and max supported |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that is a good TODO.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we can add the checker in the follow-up pull request.
In this pull request, we just make consistent with before logic.
let input_types = vec![ | ||
vec![DataType::Int32], | ||
vec![DataType::Float32], | ||
// vec![DataType::Decimal(20, 3)], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
likewise, here I think should work now
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will support the decimal data type for SUM/AVG function in the feature #1408.
I have added the TODO to mark it.
// specific language governing permissions and limitations | ||
// under the License. | ||
|
||
//! define the coercion rule for different Expr type |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
//! define the coercion rule for different Expr type | |
//! define coercion rules for aggregate functions |
@@ -62,6 +62,24 @@ pub fn avg_return_type(arg_type: &DataType) -> Result<DataType> { | |||
} | |||
} | |||
|
|||
pub(crate) fn is_avg_support_arg_type(arg_type: &DataType) -> bool { | |||
// TODO support the interval | |||
// TODO: do we need to support the unsigned data type? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you mean by the unsigned data type
? If you mean "do we support AVG(UInt8)" I think the answer is yes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
got it.
@@ -60,6 +60,7 @@ pub mod helpers { | |||
|
|||
pub use approx_distinct::ApproxDistinct; | |||
pub use array_agg::ArrayAgg; | |||
pub(crate) use average::is_avg_support_arg_type; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we need to export this function (aka no need to declare it pub crate
in this function). Likewise below
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@alamb
But is_avg_support_arg_type
will be used in other model, for example coercion_rule
model.
https://github.com/apache/arrow-datafusion/blob/42b219215f1962bbf529c06392f5c7c978f70cc4/datafusion/src/physical_plan/coercion_rule/aggregate_rule.rs#L75
datafusion/tests/sql.rs
Outdated
assert_eq!(err.to_string(), "Error during planning: Invalid or wrong number of arguments passed to aggregate: 'COUNT(DISTINCT )'"); | ||
let logical_plan = ctx.create_logical_plan(sql); | ||
let err = logical_plan.unwrap_err(); | ||
assert_eq!(err.to_string(), DataFusionError::Plan("The function Count expect argument number is 1, but the input argument number is 0".to_string()).to_string()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is a nicer error message for sure 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It will be changed to Error during planning: The function Count expects 1 arguments, but 0 were provided
After this comments https://github.com/apache/arrow-datafusion/pull/1387/files#r760575015
@alamb I wish this refactor can involved many community developer or committer, and other developer can know this context. |
Thank you for your review and grammar suggestions. |
Yes, this is the future result and diff type exprs have diff coercion rule in diff file.
|
I think that is a good idea @liukun4515 -- though the danger of doing so is that we will end up with merge conflicts Perhaps @Dandandan @houqp @rdettai or @jimexist would like to weigh in on the structure and direction as well |
Thanks for your work! @liukun4515. If possible(the ticket doesn't be merged), I'll take a look at the weekend, because these days I am so busy with my job demands 🤦♂️, work 12 + hours per day ...... |
@xudong963 I will begin to address the comments from next week. |
@xudong963 and @liukun4515 what is the status of this PR? Shall we merge it? Is it waiting on more review / work? |
@alamb |
@alamb PTAL |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks again @liukun4515 I think this is good to go.
It would be nice to clean up the tests but we can always do that as a follow on PR
let mut expect_type = data_type.clone(); | ||
if matches!( | ||
data_type, | ||
DataType::UInt8 | ||
| DataType::UInt16 | ||
| DataType::UInt32 | ||
| DataType::UInt64 | ||
) { | ||
expect_type = DataType::UInt64; | ||
} else if matches!( | ||
data_type, | ||
DataType::Int8 | ||
| DataType::Int16 | ||
| DataType::Int32 | ||
| DataType::Int64 | ||
) { | ||
expect_type = DataType::Int64; | ||
} else if matches!( | ||
data_type, | ||
DataType::Float32 | DataType::Float64 | ||
) { | ||
expect_type = data_type.clone(); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just FYI you can write this kind of logic in a more concise way with something like (untested and abbreviated)
let expect_type = match (data_type) {
DataType::UInt8 | .... => DataType::UInt64,
DataType::Int8 | .... => DataType::Int64,
_ => data_type.clone()
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
#1416 <-- PR wth proposed cleanup
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good style and suggestion!
Thanks also to @xudong963 for the review |
Thank you @liukun4515 for your contribution on this :) |
Which issue does this PR close?
part of #1356
We can extract two common function for all coercion rules, which are
coerce_types
andcoerce_exprs
.The
coerce_types
is used to get the coerced result data type if the expr need to be added a tyr_cast.The
coerce_exprs
is used to get the coerced exprs.Rationale for this change
What changes are included in this PR?
Are there any user-facing changes?