Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Add fail_on_overflow option to BinaryExpr #11400

Merged
merged 6 commits into from
Jul 11, 2024

Conversation

andygrove
Copy link
Member

Which issue does this PR close?

Part of #3520

Rationale for this change

In Comet, we are emulating Spark behavior and we need the ability to choose between failing on overflow or not. This seems like it is also important for Postgres compatibility, according to #3520

What changes are included in this PR?

Add an option to BinaryExpr to control overflow behavior for some operators.

Are these changes tested?

Yes, new tests added.

Are there any user-facing changes?

No, the new flag defaults to false and is not wired into the planner yet.

@github-actions github-actions bot added physical-expr Physical Expressions core Core DataFusion crate labels Jul 10, 2024
@andygrove andygrove requested review from liukun4515 and viirya July 10, 2024 19:17
@@ -2312,7 +2312,7 @@ mod tests {
// verify that the plan correctly casts u8 to i64
// the cast from u8 to i64 for literal will be simplified, and get lit(int64(5))
// the cast here is implicit so has CastOptions with safe=true
let expected = "BinaryExpr { left: Column { name: \"c7\", index: 2 }, op: Lt, right: Literal { value: Int64(5) } }";
let expected = "BinaryExpr { left: Column { name: \"c7\", index: 2 }, op: Lt, right: Literal { value: Int64(5) }, fail_on_overflow: false }";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we want to display fail_on_overflow?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is coming from the derived Debug trait

Copy link
Member

@viirya viirya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes sense to me. I also think it is not specified to Spark but should be ansi compliant behavior so other engines might have the behavior too.

Comment on lines +56 to +57
/// Specifies whether an error is returned on overflow or not
fail_on_overflow: bool,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comet can use this when constructing these physical expressions. For DataFusion, it can choose to set this flag on for ansi compliant cases/configs if any.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me

I had an API suggestion but I don't think it is critical

ALso @Omega359 has discussed similar things in the past I think, so maybe he has some comments

}
}

/// Create new binary expression with explicit fail_on_overflow value
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another alternative API might be more of a builder style and easier to extend in the fuuure

        // create expression
        let expr = BinaryExpr::new(
            Arc::new(Column::new("l", 0)),
            Operator::Plus,
            Arc::new(Column::new("r", 1)),
         )
          // configure to fail on overflow
          .with_fail_on_overflow(true);

@Omega359
Copy link
Contributor

Omega359 commented Jul 10, 2024

Related: #10744

I have a personal branch where I maintain a 'safe' mode for to_timestamp and to_date functions. It would be nice to get that sort of functionality into the core however I would only really want to do it if it could be driven via config somehow. My attempts to make that work didn't pan out well and the best approach is likely one where invoke method is provided state (sessionState, SessionConfig, something) where we could query the config for flag(s) and adjust behaviour on the fly based on that.

@andygrove
Copy link
Member Author

Related: #10744

I have a personal branch where I maintain a 'safe' mode for to_timestamp and to_date functions. It would be nice to get that sort of functionality into the core however I would only really want to do it if it could be driven via config somehow. My attempts to make that work didn't pan out well and the best approach is likely one where invoke method is provided state (sessionState, SessionConfig, something) where we could query the config for flag(s) and adjust behaviour on the fly based on that.

Can you remind me why you need the physical expr to have access to configs at runtime rather than having the planner pass the relevant configs into the expressions when they are created?

@Omega359
Copy link
Contributor

Can you remind me why you need the physical expr to have access to configs at runtime rather than having the planner pass the relevant configs into the expressions when they are created?

It's the singleton nature of UDF's currently - there isn't a way to configure them at all. You can replace a udf in the FunctionFactory with one with a different config which works well for sql however if you want to use the dataframe api with the standard function names such a to_timestamp(args) vs manually calling a modified function via something like ctx.udf("to_timestamp").unwrap().call(args) you can't as you cannot replace or modify what that function calls as it's a singleton

static $GNAME: std::sync::OnceLock<std::sync::Arc<datafusion_expr::ScalarUDF>> =
            std::sync::OnceLock::new();

If there was a way to configure those (via argument to invoke, etc) or replace the singletons with a pre-configured instance (or better yet, referencing the one in the FunctionFactory) that would solve the issue.

@jayzhan211
Copy link
Contributor

jayzhan211 commented Jul 11, 2024

If there was a way to configure those (via argument to invoke, etc) or replace the singletons with a pre-configured instance (or better yet, referencing the one in the FunctionFactory) that would solve the issue.

How about overwrite existing function based on config?

set datafusion.function.to_timestamp = spark;

register "to_timestamp" with spark like one

set datafusion.function.to_timestamp = default;

register "to_timestamp" with default one

set datafusion.function.all = spark;

register with all_spark_functions()

We can also overwrite packaged functions base on config to overwrite them, like what we have now

/// Return all default functions
pub fn all_default_functions() -> Vec<Arc<ScalarUDF>> {
    core::functions()
        .into_iter()
        .chain(datetime::functions())
        .chain(encoding::functions())
        .chain(math::functions())
        .chain(regex::functions())
        .chain(crypto::functions())
        .chain(unicode::functions())
        .chain(string::functions())
        .collect::<Vec<_>>()
}

/// Return all spark functions to overwrite
pub fn all_spark_functions() -> Vec<Arc<ScalarUDF>> {
    // 
}

Register functions here

    async fn set_variable(&self, stmt: SetVariable) -> Result<DataFrame> {
        let SetVariable {
            variable, value, ..
        } = stmt;

        // register functions based on config
        self.state().register_udaf(udaf);

        let mut state = self.state.write();
        state.config_mut().options_mut().set(&variable, &value)?;
        drop(state);

        self.return_empty_dataframe()
    }

@andygrove andygrove merged commit 2413155 into apache:main Jul 11, 2024
23 checks passed
@andygrove andygrove deleted the bin-expr-fail-on-overflow branch July 11, 2024 14:56
Lordworms pushed a commit to Lordworms/arrow-datafusion that referenced this pull request Jul 12, 2024
* update tests

* update tests

* add rustdoc

* update PartialEq impl

* fix

* address feedback about improving api
findepi pushed a commit to findepi/datafusion that referenced this pull request Jul 16, 2024
* update tests

* update tests

* add rustdoc

* update PartialEq impl

* fix

* address feedback about improving api
xinlifoobar pushed a commit to xinlifoobar/datafusion that referenced this pull request Jul 17, 2024
* update tests

* update tests

* add rustdoc

* update PartialEq impl

* fix

* address feedback about improving api
xinlifoobar pushed a commit to xinlifoobar/datafusion that referenced this pull request Jul 18, 2024
* update tests

* update tests

* add rustdoc

* update PartialEq impl

* fix

* address feedback about improving api
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate physical-expr Physical Expressions
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants