-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Remove type coercions from ScalarValue and aggregation function code #3705
Conversation
@alamb, this is one of our promised follow-up PR's to analyze |
This is common when computing sums to avoid overflow, I think -- like the type of That being said, the fact there is no test is somewhat concerning |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you very much for this PR. Very nice 🏅 I find it especially impressive that you have returned to help improve the codebase.
The only thing I am slightly worried about is the change to casting in aggregates. I am going to write a quick test to make sure everything is fine
} | ||
_ => impl_common_symmetric_cases_op!( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
|| value_type == DataType::UInt32 | ||
|| value_type == DataType::UInt16 | ||
|| value_type == DataType::UInt8 | ||
matches!( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
}}; | ||
|
||
($VALUES:expr, $ARRAYTYPE:ident, $SCALAR:ident, $OP:ident, $TZ:expr) => {{ | ||
($VALUES:expr, $ARRAYTYPE:ident, $SCALAR:ident, $OP:ident $(, $EXTRA_ARGS:ident)*) => {{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Macro 🧙
}}; | ||
|
||
($VALUE:expr, $DELTA:expr, $SCALAR:ident, $OP:ident, $TZ:expr) => {{ | ||
($VALUE:expr, $DELTA:expr, $SCALAR:ident, $OP:ident $(, $EXTRA_ARGS:ident)*) => {{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
these are drive by cleanups to reduce duplication in the macros, right?
(BTW if you submit these as individual free standing PRs you might find the reviews are faster -- finding enough contiguous time to review large PR changes can be challenging at times)
sum_row!(index, accessor, rhs, f64) | ||
} | ||
(DataType::Float64, ScalarValue::UInt8(rhs)) => { | ||
match s { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is a good change -- to use the type of the input to the type of the accumulator.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use std::sync::Arc;
use datafusion::arrow::array::{Int8Array};
use datafusion::arrow::record_batch::RecordBatch;
use datafusion::error::Result;
use datafusion::prelude::SessionContext;
/// This example demonstrates what happens summing a small int type
#[tokio::main]
async fn main() -> Result<()> {
// define data that purposely will overflow an int8
let array: Int8Array = (1..100).collect();
let batch = RecordBatch::try_from_iter(vec![
("c", Arc::new(array) as _)
]).unwrap();
let ctx = SessionContext::new();
ctx.register_batch("t", batch).unwrap();
println!("Count is:");
ctx.sql("SELECT count(c) from t")
.await
.unwrap()
.show()
.await
.unwrap();
// Can't fit the sum in int8
println!("Sum is:");
ctx.sql("SELECT sum(c) from t")
.await
.unwrap()
.show()
.await
.unwrap();
Ok(())
}
It worked on both master and this branch 👍
+------------+
| COUNT(t.c) |
+------------+
| 99 |
+------------+
Sum is:
+----------+
| SUM(t.c) |
+----------+
| 4950 |
+----------+
The runtime coercion in |
Yes, I did a bunch of drive-by cleanups while removing coercions :) The example you posted works because coercions are already done before we ever call |
I think the type coercion should be done in the |
I agree - and I think this PR is a step towards that goal (removes some unneeded logic in the physical phase as it is now done in the logical phase) |
I will plan to merge this PR later today unless anyone else objects or would like more time to review |
I merged this PR with master locally and ran the tests to ensure there are no logical conflicts. Looks good. |
Thanks again @ozankabak |
Benchmark runs are scheduled for baseline = 88eadc4 and contender = 8dcef91. 8dcef91 is a master commit associated with this PR. Results will be available as each benchmark for each run completes. |
Which issue does this PR close?
Improves the situation on #2447.
Rationale for this change
There is an ongoing effort to reduce/eliminate type coercions in places like
ScalarValue
code, aggregation code and other downstream places (see #2355). The situation we want obtain is to move as many coercions as possible upstream to the extent that we can (see discussions in PR #3570).This PR improves the current situation by sanitizing
ScalarValue
code and implementation of aggregation functions to remove the coercions therein. For the former, there is only one coercion left (intry_from_value
function), the need for which arises due to an upstream hardcoding (which we will attempt to fix in the near future).In many cases, this PR makes the code raise a compile-time error when performing "unsafe" operations that require a coercion (and forces the programmer to add the coercion explicitly). For example, one will get a compile time error if one tries to subtract an unsigned value from an uninitialized
ScalarValue
, or divide such a value to a decimal value -- this wasn't the case before.What changes are included in this PR?
As we were removing unnecessary coercions from the codebase, we encountered a "silent upcasting" behavior in the
sum
function, which promotes narrow types to 64-bit types unconditionally:https://github.com/apache/arrow-datafusion/blob/57312284c082c914dd5e1edaa5c1fe3dbe4f222d/datafusion/physical-expr/src/aggregate/sum.rs#L217-L235
We are not 100% sure why this behavior was there, but removing it does not seem to result in any ill-effects. If anyone can tell us why such coercions are necessary, we can add them back in.
Are there any user-facing changes?
No.