
support sum/avg agg for decimal, change sum(float32) --> float64 #1408

Merged
merged 7 commits into from
Dec 17, 2021

Conversation

@liukun4515 (Contributor) commented Dec 7, 2021:

Which issue does this PR close?

part of #122

The result data type for decimal case (closes #1418)

Rationale for this change

What changes are included in this PR?

Are there any user-facing changes?

@liukun4515 liukun4515 changed the title support sum/avg agg for decimal [WIP]support sum/avg agg for decimal Dec 7, 2021
@github-actions github-actions bot added the datafusion Changes in the datafusion crate label Dec 7, 2021
@liukun4515 liukun4515 marked this pull request as ready for review December 14, 2021 05:29
@liukun4515 liukun4515 changed the title [WIP]support sum/avg agg for decimal support sum/avg agg for decimal Dec 14, 2021
@liukun4515 (Contributor, Author):
PTAL @alamb @houqp

@alamb alamb changed the title support sum/avg agg for decimal support sum/avg agg for decimal, change sum(float32) --> float64 Dec 14, 2021
@alamb (Contributor) left a comment:

Thank you for the contribution; very nicely tested, @liukun4515 🏅

I left some comments, but overall I think this is looking quite good

"+-----------------+",
];
assert_eq!(
&DataType::Decimal(20, 3),
Contributor:

👍

}

/// function return type of an average
pub fn avg_return_type(arg_type: &DataType) -> Result<DataType> {
match arg_type {
DataType::Decimal(precision, scale) => {
// the new precision and scale for return type of avg function
Contributor:

Can you please document the rationale for the 4 and 38 constants below (or, even better, pull them into named constants somewhere)?

I also don't understand where the additional 4 came from. I tried to see if it was what Postgres did, but when I checked, the output schema for avg(numeric(10,3)) appears to be numeric without any precision or scale specified 🤔

(arrow_dev) alamb@MacBook-Pro-2:~/Downloads$ psql
psql (14.1)
Type "help" for help.

alamb=# create table test(x decimal(10, 3));
CREATE TABLE
alamb=# insert into test values (1.02);
INSERT 0 1
alamb=# create table test2 as select avg(x) from test;
SELECT 1

alamb=# select table_name, column_name, numeric_precision, numeric_scale, data_type from information_schema.columns where table_name='test2';
 table_name | column_name | numeric_precision | numeric_scale | data_type 
------------+-------------+-------------------+---------------+-----------
 test2      | avg         |                   |               | numeric
(1 row)

@liukun4515 (Contributor, Author) commented Dec 15, 2021:

This is the intention of #1418.
In PG, decimals can be created with a larger precision than in DataFusion, and the decimal behavior in DataFusion may differ from PG's, so we should discuss those differences.
For now, for promoting the precision and scale, I just follow the Spark behavior.
@alamb

@alamb (Contributor) commented Dec 15, 2021:

I think following the Spark behavior is reasonable, but it should be documented (i.e. that the constant 4 came from Spark). Otherwise in 3 months we'll be 🤔 wondering why those particular constants were picked

Contributor Author:

Added comments to the code and filed issue #1461 to track this rule.
We can add the rule to our documentation later in a follow-up pull request.

Self {
name: name.into(),
expr,
match data_type {
Contributor:

The comment above looks out of date -- I think it should simply be removed.

And perhaps we can change this code so it doesn't use unreachable, as I think it would be fairly easy to reach this code by calling Avg::new(..) with incorrect parameters

How about something like

assert!(matches!(data_type, DataType::Float64 | DataType::Decimal(_, _)));

Which I think might be easier to diagnose if anyone hits it

Contributor Author:

Great comments!

fn create_accumulator(&self) -> Result<Box<dyn Accumulator>> {
Ok(Box::new(AvgAccumulator::try_new(
// avg is f64 or decimal
&self.data_type,
Contributor:

if a sum of decimal(10,2) can be decimal(20,2) shouldn't the accumulator state also be decimal(20,2) to avoid overflow?

I think handling overflow is probably fine to leave for a later PR, but it is strange to me that there is a discrepancy between the type for sum and the accumulator type for computing avg

@liukun4515 (Contributor, Author) commented Dec 17, 2021:

The result type of the physical expr (sum/avg) is the same as its Accumulator's, and it is decided by sum_return_type and avg_return_type.

If the column is decimal(8,2), the avg of this column must be less than 10^8-1, but we need more digits to represent the fractional part. For example, the avg of 3, 4, 6 is 4.3333..., so we should increase the scale part.

For the sum agg, we just need to increase the precision part; the rule of adding 10 to the precision is Spark's coercion rule for summing decimals. We can define our own rules for decimal if we want.
@alamb
We can just follow Spark for now and change the rules later if we want to define our own.
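The promotion rules described above can be sketched as plain functions over (precision, scale) pairs. This is a simplified, illustrative sketch: the real code operates on arrow's DataType::Decimal, and the function and constant names here are assumptions, not the merged API.

```rust
// A 128-bit decimal holds at most 38 significant digits.
const MAX_PRECISION: u8 = 38;

// sum: widen the precision by 10, keep the scale (Spark's coercion rule).
fn sum_decimal_return(precision: u8, scale: u8) -> (u8, u8) {
    (MAX_PRECISION.min(precision + 10), scale)
}

// avg: widen both precision and scale by 4, so the fractional part of
// the average (e.g. avg(3, 4, 6) = 4.3333...) can be represented.
fn avg_decimal_return(precision: u8, scale: u8) -> (u8, u8) {
    (MAX_PRECISION.min(precision + 4), MAX_PRECISION.min(scale + 4))
}

fn main() {
    assert_eq!(sum_decimal_return(10, 2), (20, 2));
    assert_eq!(avg_decimal_return(10, 3), (14, 7));
    // both are capped at the 128-bit maximum of 38 digits
    assert_eq!(sum_decimal_return(35, 2), (38, 2));
    assert_eq!(avg_decimal_return(36, 36), (38, 38));
}
```

Note that with the cap, a precision near 38 saturates rather than overflowing the type.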

Contributor Author:

Filed issue #1460 to track the overflow.

ScalarValue::Decimal128(value, precision, scale) => {
Ok(match value {
None => ScalarValue::Decimal128(None, precision, scale),
// TODO add the checker for overflow the precision
Contributor:

👍
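The precision check that the TODO above refers to could look roughly like the following. This is a hypothetical helper, not the merged implementation: a decimal with precision p can only hold unscaled values with at most p digits.

```rust
// Check whether a raw (unscaled) i128 decimal value fits in `precision` digits.
fn fits_precision(value: i128, precision: u32) -> bool {
    let max = 10_i128.pow(precision) - 1;
    (-max..=max).contains(&value)
}

fn main() {
    assert!(fits_precision(999, 3)); // e.g. 9.99 fits in decimal(3, 2)
    assert!(!fits_precision(1000, 3)); // 10.00 overflows decimal(3, 2)
    assert!(fits_precision(-999, 3));
}
```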

// the lhs_scale must be greater or equal rhs_scale.
match (lhs, rhs) {
(None, None) => ScalarValue::Decimal128(None, *precision, *lhs_scale),
(None, rhs) => {
Contributor:

Suggested change
(None, rhs) => {
(None, Some(rhs)) => {

I think you could avoid the unwrap below (which does a redundant check) by using a pattern match here

Contributor Author:

Thanks, Done.

ScalarValue::Decimal128(Some(new_value), *precision, *lhs_scale)
}
(lhs, None) => ScalarValue::Decimal128(*lhs, *precision, *lhs_scale),
(lhs, rhs) => {
Contributor:

Suggested change
(lhs, rhs) => {
(Some(lhs), rhs) => {

Contributor Author:

Done

sum_decimal(v1, v2, p1, s1)
} else if s1.gt(s2) && p1.ge(p2) {
// For avg aggravate function.
// In the avg function, the scale of result data type is different with the scale of the input data type.
Contributor:

I think this comment also applies to sum, not just avg

Contributor Author:

I have changed the logic for sum to handle values with different scales, so the comment has been removed.

if s1.eq(s2) {
sum_decimal(v1, v2, p1, s1)
} else if s1.gt(s2) && p1.ge(p2) {
// For avg aggravate function.
Contributor:

Suggested change
// For avg aggravate function.
// For avg aggregate function.

Contributor Author:

Done

(ScalarValue::Decimal128(v1, p1, s1), ScalarValue::Decimal128(v2, p2, s2)) => {
if s1.eq(s2) {
sum_decimal(v1, v2, p1, s1)
} else if s1.gt(s2) && p1.ge(p2) {
Contributor:

I don't understand the need for this clause. Among other things, it seems to make sum for decimals non-commutative, which is confusing.

I would expect that sum(lhs, rhs) == sum(rhs, lhs) for any specific lhs and rhs

Contributor Author:

This is also ugly and confusing to me.

I have refined this part. For the input args, the left decimal value and the right decimal value, there are two cases:

  1. the scales are the same: just use the max precision as the result precision
  2. the scales differ: use max(scale1, scale2) as the result scale and the max precision as the result precision

Contributor Author:

let max_precision = p1.max(p2);
if s1.eq(s2) {
    // s1 == s2
    sum_decimal(v1, v2, max_precision, s1)
} else if s1.gt(s2) {
    // s1 > s2
    sum_decimal_with_diff_scale(v1, v2, max_precision, s1, s2)
} else if s1.lt(s2) {
    // s1 < s2
    sum_decimal_with_diff_scale(v2, v1, max_precision, s2, s1)
} else {
    return Err(DataFusionError::Internal(format!(
        "Sum is not expected to receive lhs {:?}, rhs {:?}",
        lhs, rhs
    )));
}
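The rescaling idea behind the diff-scale branch can be sketched as follows. This is illustrative, not the merged code: the caller guarantees that the first value has the larger scale, so the second value is scaled up by a power of ten before the raw i128 values are added, and the result keeps the larger scale.

```rust
// Add two optional raw decimal values where lhs_scale >= rhs_scale.
// rhs is rescaled up to lhs_scale before the addition.
fn sum_with_diff_scale(
    lhs: Option<i128>,
    rhs: Option<i128>,
    lhs_scale: u32,
    rhs_scale: u32,
) -> Option<i128> {
    let factor = 10_i128.pow(lhs_scale - rhs_scale);
    match (lhs, rhs) {
        (None, None) => None,
        (None, Some(r)) => Some(r * factor),
        (Some(l), None) => Some(l),
        (Some(l), Some(r)) => Some(l + r * factor),
    }
}

fn main() {
    // 1.23 (scale 2) + 4.5 (scale 1) = 5.73 (scale 2)
    assert_eq!(sum_with_diff_scale(Some(123), Some(45), 2, 1), Some(573));
    assert_eq!(sum_with_diff_scale(None, Some(45), 2, 1), Some(450));
}
```

Because the dispatch above always passes the larger-scale operand first, the overall sum stays commutative.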

@@ -54,8 +56,15 @@ pub fn sum_return_type(arg_type: &DataType) -> Result<DataType> {
DataType::UInt8 | DataType::UInt16 | DataType::UInt32 | DataType::UInt64 => {
Ok(DataType::UInt64)
}
DataType::Float32 => Ok(DataType::Float32),
DataType::Float64 => Ok(DataType::Float64),
// In the https://www.postgresql.org/docs/8.2/functions-aggregate.html doc,
Member:

Suggested change
// In the https://www.postgresql.org/docs/8.2/functions-aggregate.html doc,
// In the https://www.postgresql.org/docs/current/functions-aggregate.html doc,

Contributor Author:

Done

Comment on lines 237 to 238
(lhs, rhs) => {
ScalarValue::Decimal128(Some(lhs.unwrap() + rhs.unwrap()), *precision, *scale)
Member:

Suggested change
(lhs, rhs) => {
ScalarValue::Decimal128(Some(lhs.unwrap() + rhs.unwrap()), *precision, *scale)
(Some(lhs), Some(rhs)) => {
ScalarValue::Decimal128(Some(lhs + rhs), *precision, *scale)

Contributor Author:

Thanks @jimexist, updated per @alamb's comments.

Comment on lines 233 to 240
match (lhs, rhs) {
(None, None) => ScalarValue::Decimal128(None, *precision, *scale),
(None, rhs) => ScalarValue::Decimal128(*rhs, *precision, *scale),
(lhs, None) => ScalarValue::Decimal128(*lhs, *precision, *scale),
(lhs, rhs) => {
ScalarValue::Decimal128(Some(lhs.unwrap() + rhs.unwrap()), *precision, *scale)
}
}
Member:

Suggested change
match (lhs, rhs) {
(None, None) => ScalarValue::Decimal128(None, *precision, *scale),
(None, rhs) => ScalarValue::Decimal128(*rhs, *precision, *scale),
(lhs, None) => ScalarValue::Decimal128(*lhs, *precision, *scale),
(lhs, rhs) => {
ScalarValue::Decimal128(Some(lhs.unwrap() + rhs.unwrap()), *precision, *scale)
}
}
ScalarValue::Decimal128(Some(lhs.unwrap_or(0) + rhs.unwrap_or(0)), *precision, *scale)

Contributor Author:

Can this change handle the cases where only the left or the right is None?

Contributor Author:

Change to

match (lhs, rhs) {
    (None, None) => ScalarValue::Decimal128(None, *precision, *scale),
    (None, rhs) => ScalarValue::Decimal128(*rhs, *precision, *scale),
    (lhs, None) => ScalarValue::Decimal128(*lhs, *precision, *scale),
    (Some(lhs_value), Some(rhs_value)) => {
        ScalarValue::Decimal128(Some(lhs_value + rhs_value), *precision, *scale)
    }
}

@liukun4515 (Contributor, Author):

In the comments, @alamb raised some questions about the precision and scale of the result data type.
I wanted to follow the guidance in https://www.postgresql.org/docs/8.2/functions-aggregate.html at first but couldn't find any useful suggestions, so I followed the behavior of Spark's sum and avg.

@alamb (Contributor) commented Dec 15, 2021:

I wanted to follow the guidance in https://www.postgresql.org/docs/8.2/functions-aggregate.html at first but couldn't find any useful suggestions, so I followed the behavior of Spark's sum and avg.

As I mentioned above, I think following the behavior of Spark is fine, but I think we should use symbolic constants (largely as a way to document why those particular constants were picked)

@liukun4515 (Contributor, Author):

As I mentioned above, I think following the behavior of Spark is fine, but I think we should use symbolic constants (largely as a way to document why those particular constants were picked)

Ok, I will refine this pull request.

@liukun4515 liukun4515 marked this pull request as draft December 17, 2021 01:54
@github-actions github-actions bot added the sql SQL Planner label Dec 17, 2021
@liukun4515 liukun4515 marked this pull request as ready for review December 17, 2021 09:08
@liukun4515 liukun4515 requested review from alamb and removed request for alamb December 17, 2021 09:29
@liukun4515 liukun4515 requested review from jimexist and alamb December 17, 2021 09:29
@alamb (Contributor) commented Dec 17, 2021:

👁️

// the new precision and scale for return type of avg function
let new_precision = 38.min(*precision + 4);
let new_scale = 38.min(*scale + 4);
// in the spark, the result type is DECIMAL(min(38,precision+4), min(38,scale+4)).
Contributor:

👍

@@ -33,6 +33,11 @@ use std::convert::{Infallible, TryInto};
use std::str::FromStr;
use std::{convert::TryFrom, fmt, iter::repeat, sync::Arc};

// TODO may need to be moved to arrow-rs
Contributor:

Moving to arrow-rs would be a good idea I think

@alamb (Contributor) left a comment:

I think it is looking good -- thank you @liukun4515 -- nice forward progress

// ref: https://github.com/apache/spark/blob/fcf636d9eb8d645c24be3db2d599aba2d7e2955a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Average.scala#L66
let new_precision = MAX_PRECISION_FOR_DECIMAL128.min(*precision + 4);
let new_scale = MAX_SCALE_FOR_DECIMAL128.min(*scale + 4);
Ok(DataType::Decimal(new_precision, new_scale))
Contributor:

@alamb alamb merged commit 9d31866 into apache:master Dec 17, 2021
@alamb alamb added enhancement New feature or request and removed enhancement New feature or request labels Feb 10, 2022
Labels
datafusion Changes in the datafusion crate sql SQL Planner
Development

Successfully merging this pull request may close these issues.

Talk about the result type for coerced type
3 participants