-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Sum refactor draft #2516
Sum refactor draft #2516
Conversation
@comphead The AMD64 bulid issue can be fixed by merging latest from master |
typed_sum!(lhs, rhs, UInt64, u64) | ||
(DataType::Float32, _) | (_, DataType::Float32) => { | ||
let data: ArrayRef = | ||
union_arrays!(lhs, rhs, &DataType::Float32, Float32Array, "f32"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is an interesting idea, but I suspect the performance will be fairly low (as it creates arrays for each value 🤔 )
I wonder if we could move the sum
logic into scalar.rs
and instead add some sort of coertion logic
Not sure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do you mean add a sum function to scalar value struct?
like
sum(self, other: ScalarValue) -> ScalarValue
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Went through those PRs. Also currently we have negate function on scalar value level.
pub fn arithmetic_negate(&self) -> Self {
Probably we can extend this trend and add other common math functions to Scalar Value too like sum, minus, multiply, divide?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am torn -- basically I worry that adding the common math functions to ScalarValue
will result in them being used more (and they are very slow). See more discussion on #1525 (review)
However, if the alternative is a bunch of replicated code over the codebase, consolidating that all into ScalarValue seems like a much better outcome
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we had a bunch of discussion on a related topic (how to handle constants and arrays similarly) on #1248
Maybe that will provide some insight
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I came up to idea like that, it should put the boilerplate once. And then use external functions over values.
use std::ops::Sub;
use std::ops::Add;
use std::fmt::Debug;
use std::any::Any;
#[derive(Debug, Copy, Clone)]
enum DataType {
Int32(Option<i32>),
Float64(Option<f64>)
}
// boilerplate comes here
macro_rules! op {
($ARG1: expr, $ARG2: expr, $FUNC: block) => {{
let res = match ($ARG1, $ARG2) {
(DataType::Int32(Some(v1)), DataType::Float64(Some(v2))) =>
DataType::Float64(Some($FUNC(v1 as f64, v2 as f64))),
_ => panic!("123")
};
res
}};
}
fn sum<T:Add<Output = T> + Copy> (num1: T, num2: T) -> T {
return num1 + num2;
}
fn minus<T:Sub<Output = T> + Copy> (num1: T, num2: T) -> T {
return num1 - num2;
}
fn main() {
let i_32 = DataType::Int32(Some(1));
let f_64 = DataType::Float64(Some(2.0));
let s = op!(i_32, f_64, {sum});
dbg!(&s);
let m = op!(i_32, f_64, {minus});
dbg!(&m);
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about just cut all this coercions logic? I've investigate all the occurrences of sum()
, it's only used to accumulate aggregator state in sum
, sum_distinct
and average
where the operand's type of sum()
is deterministic. And sum()
is an internal function (pub(crate)
), API change of this function is acceptable.
I try to remove all the match arms that with different operand types and only fail two cases (sum_distinct_i32_with_nulls
and sum_distinct_u32_with_nulls
). I think this is acceptable. And I find the min/max calculator already applied this.
About how to achieve calculate operator over different types, I think we can extract our coercion rule to something like
fn coercion(lhs: DataType, rhs: DataType) -> DataType {}
And cast both operands to the result type before calculation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @waynexia for your response.
I think we already have similar coercion in type_coercion.rs
/// Returns the data types that each argument must be coerced to match
/// `signature`.
///
/// See the module level documentation for more detail on coercion.
pub fn data_types(
current_types: &[DataType],
signature: &Signature,
) -> Result<Vec<DataType>> {
Imho AggregateFunction::Sum => sum_return_type(&coerced_data_types[0]),
already does the proposed solution.
I'm still afraid the problem is not in the result type coercion but how to do operation with underlying values using correct datatypes without boilerplate.
This PR is more than 6 month old, so closing it down for now to clean up the PR list. Please reopen if this is a mistake and you plan to work on it more |
Which issue does this PR close?
Closes #2447 .
Rationale for this change
The current sum function looks a bit boilerplate, this is attempt to refactor it.
What changes are included in this PR?
Are there any user-facing changes?