Sum refactor draft #2516

comphead · 2022-05-11T15:08:03Z

Which issue does this PR close?

Closes #2447 .

Rationale for this change

The current sum function looks a bit boilerplate, this is attempt to refactor it.

What changes are included in this PR?

Are there any user-facing changes?

andygrove · 2022-05-12T15:02:01Z

@comphead The AMD64 bulid issue can be fixed by merging latest from master

alamb · 2022-05-12T20:55:43Z

datafusion/physical-expr/src/aggregate/sum.rs

-            typed_sum!(lhs, rhs, UInt64, u64)
+        (DataType::Float32, _) | (_, DataType::Float32) => {
+            let data: ArrayRef =
+                union_arrays!(lhs, rhs, &DataType::Float32, Float32Array, "f32");


this is an interesting idea, but I suspect the performance will be fairly low (as it creates arrays for each value 🤔 )

I wonder if we could move the sum logic into scalar.rs and instead add some sort of coertion logic

Not sure.

do you mean add a sum function to scalar value struct?

like

sum(self, other: ScalarValue) -> ScalarValue

Something like that -- yes -- given that is effectively what this code is doing.

@realno added just such code here #1525 though I was trying to avoid it #1550

However, if it is truly required then having one copy of it seems better than several 🤔

Went through those PRs. Also currently we have negate function on scalar value level.

pub fn arithmetic_negate(&self) -> Self {

Probably we can extend this trend and add other common math functions to Scalar Value too like sum, minus, multiply, divide?

I am torn -- basically I worry that adding the common math functions to ScalarValue will result in them being used more (and they are very slow). See more discussion on #1525 (review)

However, if the alternative is a bunch of replicated code over the codebase, consolidating that all into ScalarValue seems like a much better outcome

I think we had a bunch of discussion on a related topic (how to handle constants and arrays similarly) on #1248

Maybe that will provide some insight

I came up to idea like that, it should put the boilerplate once. And then use external functions over values.

use std::ops::Sub; use std::ops::Add; use std::fmt::Debug; use std::any::Any; #[derive(Debug, Copy, Clone)] enum DataType { Int32(Option<i32>), Float64(Option<f64>) } // boilerplate comes here macro_rules! op { ($ARG1: expr, $ARG2: expr, $FUNC: block) => {{ let res = match ($ARG1, $ARG2) { (DataType::Int32(Some(v1)), DataType::Float64(Some(v2))) => DataType::Float64(Some($FUNC(v1 as f64, v2 as f64))), _ => panic!("123") }; res }}; } fn sum<T:Add<Output = T> + Copy> (num1: T, num2: T) -> T { return num1 + num2; } fn minus<T:Sub<Output = T> + Copy> (num1: T, num2: T) -> T { return num1 - num2; } fn main() { let i_32 = DataType::Int32(Some(1)); let f_64 = DataType::Float64(Some(2.0)); let s = op!(i_32, f_64, {sum}); dbg!(&s); let m = op!(i_32, f_64, {minus}); dbg!(&m); }

https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=a513ccd564230b3815d635c515236f5f

What about just cut all this coercions logic? I've investigate all the occurrences of sum(), it's only used to accumulate aggregator state in sum, sum_distinct and average where the operand's type of sum() is deterministic. And sum() is an internal function (pub(crate)), API change of this function is acceptable.

I try to remove all the match arms that with different operand types and only fail two cases (sum_distinct_i32_with_nulls and sum_distinct_u32_with_nulls). I think this is acceptable. And I find the min/max calculator already applied this.

About how to achieve calculate operator over different types, I think we can extract our coercion rule to something like

fn coercion(lhs: DataType, rhs: DataType) -> DataType {}

And cast both operands to the result type before calculation.

Thanks @waynexia for your response.

I think we already have similar coercion in type_coercion.rs

/// Returns the data types that each argument must be coerced to match /// `signature`. /// /// See the module level documentation for more detail on coercion. pub fn data_types( current_types: &[DataType], signature: &Signature, ) -> Result<Vec<DataType>> {

Imho AggregateFunction::Sum => sum_return_type(&coerced_data_types[0]), already does the proposed solution.

I'm still afraid the problem is not in the result type coercion but how to do operation with underlying values using correct datatypes without boilerplate.

alamb · 2023-01-14T11:25:10Z

This PR is more than 6 month old, so closing it down for now to clean up the PR list. Please reopen if this is a mistake and you plan to work on it more

comphead added 2 commits May 11, 2022 08:03

Sum refactor draft

07673f0

Formatting

c572310

github-actions bot added the datafusion Changes in the datafusion crate label May 11, 2022

comphead mentioned this pull request May 11, 2022

Investigate and reduce runtime type coercion in aggregates like sum #2447

Closed

comphead added 3 commits May 12, 2022 01:35

Fix tests

891284f

Remove unused test

8108681

fix clippy

d445e59

Merge branch 'apache:master' into sum_refactor

8d6d53b

alamb reviewed May 12, 2022

View reviewed changes

Merge branch 'apache:master' into sum_refactor

5593b45

andygrove removed the datafusion Changes in the datafusion crate label Jun 3, 2022

alamb closed this Jan 14, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Sum refactor draft #2516

Sum refactor draft #2516

comphead commented May 11, 2022 •

edited

Loading

andygrove commented May 12, 2022

alamb May 12, 2022

comphead May 13, 2022

alamb May 13, 2022

comphead May 15, 2022

alamb May 16, 2022

alamb May 19, 2022

comphead May 22, 2022

comphead May 22, 2022

waynexia May 23, 2022

comphead May 23, 2022 •

edited

Loading

alamb commented Jan 14, 2023

Sum refactor draft #2516

Sum refactor draft #2516

Conversation

comphead commented May 11, 2022 • edited Loading

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

Are there any user-facing changes?

andygrove commented May 12, 2022

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

comphead May 23, 2022 • edited Loading

Choose a reason for hiding this comment

alamb commented Jan 14, 2023

comphead commented May 11, 2022 •

edited

Loading

comphead May 23, 2022 •

edited

Loading