Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove type coercions from ScalarValue and aggregation function code #3705

Merged
merged 2 commits into from
Oct 6, 2022
Merged

Remove type coercions from ScalarValue and aggregation function code #3705

merged 2 commits into from
Oct 6, 2022

Conversation

ozankabak
Copy link
Contributor

@ozankabak ozankabak commented Oct 4, 2022

Which issue does this PR close?

Improves the situation on #2447.

Rationale for this change

There is an ongoing effort to reduce/eliminate type coercions in places like ScalarValue code, aggregation code and other downstream places (see #2355). The situation we want obtain is to move as many coercions as possible upstream to the extent that we can (see discussions in PR #3570).

This PR improves the current situation by sanitizing ScalarValue code and implementation of aggregation functions to remove the coercions therein. For the former, there is only one coercion left (in try_from_value function), the need for which arises due to an upstream hardcoding (which we will attempt to fix in the near future).

In many cases, this PR makes the code raise a compile-time error when performing "unsafe" operations that require a coercion (and forces the programmer to add the coercion explicitly). For example, one will get a compile time error if one tries to subtract an unsigned value from an uninitialized ScalarValue, or divide such a value to a decimal value -- this wasn't the case before.

What changes are included in this PR?

As we were removing unnecessary coercions from the codebase, we encountered a "silent upcasting" behavior in the sum function, which promotes narrow types to 64-bit types unconditionally:

https://github.com/apache/arrow-datafusion/blob/57312284c082c914dd5e1edaa5c1fe3dbe4f222d/datafusion/physical-expr/src/aggregate/sum.rs#L217-L235

We are not 100% sure why this behavior was there, but removing it does not seem to result in any ill-effects. If anyone can tell us why such coercions are necessary, we can add them back in.

Are there any user-facing changes?

No.

@github-actions github-actions bot added the physical-expr Physical Expressions label Oct 4, 2022
@ozankabak
Copy link
Contributor Author

@alamb, this is one of our promised follow-up PR's to analyze ScalarValue-related coercions in the code and remove the ones we can.

@andygrove andygrove self-requested a review October 4, 2022 15:24
@alamb
Copy link
Contributor

alamb commented Oct 5, 2022

As we were removing unnecessary coercions from the codebase, we encountered a "silent upcasting" behavior in the sum function, which promotes narrow types to 64-bit types unconditionally:

We are not 100% sure why this behavior was there, but removing it does not seem to result in any ill-effects. If anyone can tell us why such coercions are necessary, we can add them back in.

This is common when computing sums to avoid overflow, I think -- like the type of SUMing Int8 types can quickly end up as a Int16 or larger.

That being said, the fact there is no test is somewhat concerning

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you very much for this PR. Very nice 🏅 I find it especially impressive that you have returned to help improve the codebase.

The only thing I am slightly worried about is the change to casting in aggregates. I am going to write a quick test to make sure everything is fine

}
_ => impl_common_symmetric_cases_op!(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

|| value_type == DataType::UInt32
|| value_type == DataType::UInt16
|| value_type == DataType::UInt8
matches!(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

}};

($VALUES:expr, $ARRAYTYPE:ident, $SCALAR:ident, $OP:ident, $TZ:expr) => {{
($VALUES:expr, $ARRAYTYPE:ident, $SCALAR:ident, $OP:ident $(, $EXTRA_ARGS:ident)*) => {{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Macro 🧙

}};

($VALUE:expr, $DELTA:expr, $SCALAR:ident, $OP:ident, $TZ:expr) => {{
($VALUE:expr, $DELTA:expr, $SCALAR:ident, $OP:ident $(, $EXTRA_ARGS:ident)*) => {{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these are drive by cleanups to reduce duplication in the macros, right?

(BTW if you submit these as individual free standing PRs you might find the reviews are faster -- finding enough contiguous time to review large PR changes can be challenging at times)

sum_row!(index, accessor, rhs, f64)
}
(DataType::Float64, ScalarValue::UInt8(rhs)) => {
match s {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a good change -- to use the type of the input to the type of the accumulator.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use std::sync::Arc;

use datafusion::arrow::array::{Int8Array};
use datafusion::arrow::record_batch::RecordBatch;
use datafusion::error::Result;
use datafusion::prelude::SessionContext;

/// This example demonstrates what happens summing a small int type
#[tokio::main]
async fn main() -> Result<()> {
    // define data that purposely will overflow an int8
    let array: Int8Array = (1..100).collect();
    let batch = RecordBatch::try_from_iter(vec![
        ("c", Arc::new(array) as _)
    ]).unwrap();

    let ctx = SessionContext::new();
    ctx.register_batch("t", batch).unwrap();

    println!("Count is:");
    ctx.sql("SELECT count(c) from t")
        .await
        .unwrap()
        .show()
        .await
        .unwrap();

    // Can't fit the sum in int8
    println!("Sum is:");
    ctx.sql("SELECT sum(c) from t")
        .await
        .unwrap()
        .show()
        .await
        .unwrap();

    Ok(())
}

It worked on both master and this branch 👍

+------------+
| COUNT(t.c) |
+------------+
| 99         |
+------------+
Sum is:
+----------+
| SUM(t.c) |
+----------+
| 4950     |
+----------+

@alamb
Copy link
Contributor

alamb commented Oct 5, 2022

The runtime coercion in ScalarValue is pretty old, it might predate when we had better type coercion support in the optimizer. Maybe @andygrove remembers

@ozankabak
Copy link
Contributor Author

Yes, I did a bunch of drive-by cleanups while removing coercions :) The example you posted works because coercions are already done before we ever call add_to_row. My first thoughts regarding overflow were similar to yours, but then I realized by the time we perform additions we already have everything casted to wider types -- so the coercions I removed really seemed unnecessary.

@liukun4515
Copy link
Contributor

I think the type coercion should be done in the TypeCoercion in the logical phase.
There are some TODO works for agg/buildin function/udf in the issue #3582 (comment)

@alamb
Copy link
Contributor

alamb commented Oct 6, 2022

I think the type coercion should be done in the TypeCoercion in the logical phase.

I agree - and I think this PR is a step towards that goal (removes some unneeded logic in the physical phase as it is now done in the logical phase)

@alamb
Copy link
Contributor

alamb commented Oct 6, 2022

I will plan to merge this PR later today unless anyone else objects or would like more time to review

@alamb
Copy link
Contributor

alamb commented Oct 6, 2022

I merged this PR with master locally and ran the tests to ensure there are no logical conflicts. Looks good.

@alamb alamb merged commit 8dcef91 into apache:master Oct 6, 2022
@alamb
Copy link
Contributor

alamb commented Oct 6, 2022

Thanks again @ozankabak

@ursabot
Copy link

ursabot commented Oct 6, 2022

Benchmark runs are scheduled for baseline = 88eadc4 and contender = 8dcef91. 8dcef91 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
physical-expr Physical Expressions
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants