Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sum refactor draft #2516

Closed
wants to merge 7 commits into from
Closed

Sum refactor draft #2516

wants to merge 7 commits into from

Conversation

comphead
Copy link
Contributor

@comphead comphead commented May 11, 2022

Which issue does this PR close?

Closes #2447 .

Rationale for this change

The current sum function looks a bit boilerplate, this is attempt to refactor it.

What changes are included in this PR?

Are there any user-facing changes?

@andygrove
Copy link
Member

@comphead The AMD64 bulid issue can be fixed by merging latest from master

typed_sum!(lhs, rhs, UInt64, u64)
(DataType::Float32, _) | (_, DataType::Float32) => {
let data: ArrayRef =
union_arrays!(lhs, rhs, &DataType::Float32, Float32Array, "f32");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is an interesting idea, but I suspect the performance will be fairly low (as it creates arrays for each value 🤔 )

I wonder if we could move the sum logic into scalar.rs and instead add some sort of coertion logic

Not sure.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you mean add a sum function to scalar value struct?

like

 sum(self, other: ScalarValue) -> ScalarValue

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something like that -- yes -- given that is effectively what this code is doing.

@realno added just such code here #1525 though I was trying to avoid it #1550

However, if it is truly required then having one copy of it seems better than several 🤔

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Went through those PRs. Also currently we have negate function on scalar value level.

    pub fn arithmetic_negate(&self) -> Self {

Probably we can extend this trend and add other common math functions to Scalar Value too like sum, minus, multiply, divide?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am torn -- basically I worry that adding the common math functions to ScalarValue will result in them being used more (and they are very slow). See more discussion on #1525 (review)

However, if the alternative is a bunch of replicated code over the codebase, consolidating that all into ScalarValue seems like a much better outcome

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we had a bunch of discussion on a related topic (how to handle constants and arrays similarly) on #1248

Maybe that will provide some insight

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I came up to idea like that, it should put the boilerplate once. And then use external functions over values.

use std::ops::Sub;
use std::ops::Add;
use std::fmt::Debug;
use std::any::Any;

#[derive(Debug, Copy, Clone)]
enum DataType {
    Int32(Option<i32>),
    Float64(Option<f64>)
}

// boilerplate comes here
macro_rules! op {
    ($ARG1: expr, $ARG2: expr, $FUNC: block) => {{
        let res = match ($ARG1, $ARG2) {
            (DataType::Int32(Some(v1)), DataType::Float64(Some(v2))) => 
            DataType::Float64(Some($FUNC(v1 as f64, v2 as f64))),
            _ => panic!("123")
        };
        
        res
    }};
}

fn sum<T:Add<Output = T> + Copy> (num1: T, num2: T) -> T {
    return num1 + num2;
}

fn minus<T:Sub<Output = T> + Copy> (num1: T, num2: T) -> T {
    return num1 - num2;
}

fn main() {
    let i_32 = DataType::Int32(Some(1));
    let f_64 = DataType::Float64(Some(2.0));

    let s = op!(i_32, f_64, {sum});
    dbg!(&s);
    
    let m = op!(i_32, f_64, {minus});
    dbg!(&m);
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about just cut all this coercions logic? I've investigate all the occurrences of sum(), it's only used to accumulate aggregator state in sum, sum_distinct and average where the operand's type of sum() is deterministic. And sum() is an internal function (pub(crate)), API change of this function is acceptable.

I try to remove all the match arms that with different operand types and only fail two cases (sum_distinct_i32_with_nulls and sum_distinct_u32_with_nulls). I think this is acceptable. And I find the min/max calculator already applied this.

About how to achieve calculate operator over different types, I think we can extract our coercion rule to something like

fn coercion(lhs: DataType, rhs: DataType) -> DataType {}

And cast both operands to the result type before calculation.

Copy link
Contributor Author

@comphead comphead May 23, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @waynexia for your response.

I think we already have similar coercion in type_coercion.rs

/// Returns the data types that each argument must be coerced to match
/// `signature`.
///
/// See the module level documentation for more detail on coercion.
pub fn data_types(
    current_types: &[DataType],
    signature: &Signature,
) -> Result<Vec<DataType>> {

Imho AggregateFunction::Sum => sum_return_type(&coerced_data_types[0]), already does the proposed solution.

I'm still afraid the problem is not in the result type coercion but how to do operation with underlying values using correct datatypes without boilerplate.

@andygrove andygrove removed the datafusion Changes in the datafusion crate label Jun 3, 2022
@alamb
Copy link
Contributor

alamb commented Jan 14, 2023

This PR is more than 6 month old, so closing it down for now to clean up the PR list. Please reopen if this is a mistake and you plan to work on it more

@alamb alamb closed this Jan 14, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Investigate and reduce runtime type coercion in aggregates like sum
4 participants