Create built-in scalar functions programmatically #1734

HaoYang670 · 2022-02-03T02:02:43Z

Which issue does this PR close?

Closes #1718.
Re #1727.

Rationale for this change

We create a cleaner way to call built-in scalar functions by calling call_builtin_scalar_fn.

What changes are included in this PR?

Are there any user-facing changes?

A new function call_builtin_scalar_fn

Signed-off-by: remzi <[email protected]>

alamb

Thanks @HaoYang670 ! Sorry I had an unsubmitted review for this PR hanging out.

alamb · 2022-02-03T20:14:54Z

datafusion/src/logical_plan/expr.rs

+    match name.as_ref().parse::<functions::BuiltinScalarFunction>() {
+        Ok(fun) => Ok(Expr::ScalarFunction { fun, args }),
+        Err(e) => Err(e),
+    }


One way to write this more idiomatically is

Suggested change

match name.as_ref().parse::<functions::BuiltinScalarFunction>() {

Ok(fun) => Ok(Expr::ScalarFunction { fun, args }),

Err(e) => Err(e),

}

name.as_ref().parse::<functions::BuiltinScalarFunction>()

.map(|fun| Expr::ScalarFunction { fun, args }),

}

(not required, I am just pointing it out because it took me a while to get my head around working with Option and Results, and we are all learning together)

i wonder if it makes sense to make it a macro rather than a function call so that nonexisteng built-in functions will be caught during compile time not runtime

I agree, call_builtin_scalar_fn!(ToTimestamp, vec![lit("2020-09-08T12:00:00+00:00")]) is as readable.

Thank you @jimexist !

And alamb 's opinion of the trade-off between macro and function call is here:
#1718 (comment)

Is it needed to implement both?

No I don't think so. If we want a macro we can always add that as a follow on PR.

Thanks @HaoYang670 !

alamb · 2022-02-03T20:15:17Z

datafusion/src/optimizer/simplify_expressions.rs

-            args: vec![lit("foo"), lit("bar")],
-            fun: BuiltinScalarFunction::Concat,
-        };
+        let expr =


well that is certainly nicer looking 👍

alamb · 2022-02-03T20:18:50Z

datafusion/src/logical_plan/mod.rs

-    translate, trim, trunc, unalias, unnormalize_col, unnormalize_cols, upper, when,
-    Column, Expr, ExprRewriter, ExpressionVisitor, Literal, Recursion, RewriteRecursion,
-    SimplifyInfo,
+    avg, binary_expr, bit_length, btrim, call_builtin_scalar_fn, case, ceil,


Seeing all these other names, what would you think of changing the name from call_builtin_scalar_fn to call_fn to make it less verbose 🤔

Thank you, I will change it

Signed-off-by: remzi <[email protected]>

…ction

alamb · 2022-02-07T11:59:34Z

datafusion/src/logical_plan/expr.rs

+    match name.as_ref().parse::<functions::BuiltinScalarFunction>() {
+        Ok(fun) => Ok(Expr::ScalarFunction { fun, args }),
+        Err(e) => Err(e),
+    }


No I don't think so. If we want a macro we can always add that as a follow on PR.

Thanks @HaoYang670 !

* feat: add join type for logical plan display (#1674) * (minor) Reduce memory manager and disk manager logs from `info!` to `debug!` (#1689) * Move `information_schema` tests out of execution/context.rs to `sql_integration` tests (#1684) * Move tests from context.rs to information_schema.rs * Fix up tests to compile * Move timestamp related tests out of context.rs and into sql integration test (#1696) * Move some tests out of context.rs and into sql * Move support test out of context.rs and into sql tests * Fixup tests and make them compile * Add `MemTrackingMetrics` to ease memory tracking for non-limited memory consumers (#1691) * Memory manager no longer track consumers, update aggregatedMetricsSet * Easy memory tracking with metrics * use tracking metrics in SPMS * tests * fix * doc * Update datafusion/src/physical_plan/sorts/sort.rs Co-authored-by: Andrew Lamb <[email protected]> * make tracker AtomicUsize Co-authored-by: Andrew Lamb <[email protected]> * Implement TableProvider for DataFrameImpl (#1699) * Add TableProvider impl for DataFrameImpl * Add physical plan in * Clean up plan construction and names construction * Remove duplicate comments * Remove unused parameter * Add test * Remove duplicate limit comment * Use cloned instead of individual clone * Reduce the amount of code to get a schema Co-authored-by: Andrew Lamb <[email protected]> * Add comments to test * Fix plan comparison * Compare only the results of execution * Remove println * Refer to df_impl instead of table in test Co-authored-by: Andrew Lamb <[email protected]> * Fix the register_table test to use the correct result set for comparison * Consolidate group/agg exprs * Format * Remove outdated comment Co-authored-by: Andrew Lamb <[email protected]> * refine test in repartition.rs & coalesce_batches.rs (#1707) * Fuzz test for spillable sort (#1706) * Lazy TempDir creation in DiskManager (#1695) * Incorporate dyn scalar kernels (#1685) * Rebase * impl ToNumeric for ScalarValue * Update macro to be based on * Add floats * Cleanup * Newline * add annotation for select_to_plan (#1714) * Support `create_physical_expr` and `ExecutionContextState` or `DefaultPhysicalPlanner` for faster speed (#1700) * Change physical_expr creation API * Refactor API usage to avoid creating ExecutionContextState * Fixup ballista * clippy! * Fix can not load parquet table form spark in datafusion-cli. (#1665) * fix can not load parquet table form spark * add Invalid file in log. * fix fmt * add upper bound for pub fn (#1713) Signed-off-by: remzi <[email protected]> * Create SchemaAdapter trait to map table schema to file schemas (#1709) * Create SchemaAdapter trait to map table schema to file schemas * Linting fix * Remove commented code * approx_quantile() aggregation function (#1539) * feat: implement TDigest for approx quantile Adds a [TDigest] implementation providing approximate quantile estimations of large inputs using a small amount of (bounded) memory. A TDigest is most accurate near either "end" of the quantile range (that is, 0.1, 0.9, 0.95, etc) due to the use of a scalaing function that increases resolution at the tails. The paper claims single digit part per million errors for q ≤ 0.001 or q ≥ 0.999 using 100 centroids, and in practice I have found accuracy to be more than acceptable for an apprixmate function across the entire quantile range. The implementation is a modified copy of https://github.com/MnO2/t-digest, itself a Rust port of [Facebook's C++ implementation]. Both Facebook's implementation, and Mn02's Rust port are Apache 2.0 licensed. [TDigest]: https://arxiv.org/abs/1902.04023 [Facebook's C++ implementation]: https://github.com/facebook/folly/blob/main/folly/stats/TDigest.h * feat: approx_quantile aggregation Adds the ApproxQuantile physical expression, plumbing & test cases. The function signature is: approx_quantile(column, quantile) Where column can be any numeric type (that can be cast to a float64) and quantile is a float64 literal between 0 and 1. * feat: approx_quantile dataframe function Adds the approx_quantile() dataframe function, and exports it in the prelude. * refactor: bastilla approx_quantile support Adds bastilla wire encoding for approx_quantile. Adding support for this required modifying the AggregateExprNode proto message to support propigating multiple LogicalExprNode aggregate arguments - all the existing aggregations take a single argument, so this wasn't needed before. This commit adds "repeated" to the expr field, which I believe is backwards compatible as described here: https://developers.google.com/protocol-buffers/docs/proto3#updating Specifically, adding "repeated" to an existing message field: "For ... message fields, optional is compatible with repeated" No existing tests needed fixing, and a new roundtrip test is included that covers the change to allow multiple expr. * refactor: use input type as return type Casts the calculated quantile value to the same type as the input data. * fixup! refactor: bastilla approx_quantile support * refactor: rebase onto main * refactor: validate quantile value Ensures the quantile values is between 0 and 1, emitting a plan error if not. * refactor: rename to approx_percentile_cont * refactor: clippy lints * suppport bitwise and as an example (#1653) * suppport bitwise and as an example * Use $OP in macro rather than `&` * fix: change signature to &dyn Array * fmt Co-authored-by: Andrew Lamb <[email protected]> * fix: substr - correct behaivour with negative start pos (#1660) * minor: fix cargo run --release error (#1723) * Convert boolean case expressions to boolean logic (#1719) * Convert boolean case expressions to boolean logic * Review feedback * substitute `parking_lot::Mutex` for `std::sync::Mutex` (#1720) * Substitute parking_lot::Mutex for std::sync::Mutex * enable parking_lot feature in tokio * Add Expression Simplification API (#1717) * Add Expression Simplification API * fmt * Add tests and CI for optional pyarrow module (#1711) * Implement other side of conversion * Add test workflow * Add (failing) tests * Get unit tests passing * Use python -m pip * Debug LD_LIBRARY_PATH * Set LIBRARY_PATH * Update help with better info * Update parking_lot requirement from 0.11 to 0.12 (#1735) Updates the requirements on [parking_lot](https://github.com/Amanieu/parking_lot) to permit the latest version. - [Release notes](https://github.com/Amanieu/parking_lot/releases) - [Changelog](https://github.com/Amanieu/parking_lot/blob/master/CHANGELOG.md) - [Commits](Amanieu/parking_lot@0.11.0...0.12.0) --- updated-dependencies: - dependency-name: parking_lot dependency-type: direct:production ... Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * Prevent repartitioning of certain operator's direct children (#1731) (#1732) * Prevent repartitioning of certain operator's direct children (#1731) * Update ballista tests * Don't repartition children of RepartitionExec * Revert partition restriction on Repartition and Projection * Review feedback * Lint * API to get Expr's type and nullability without a `DFSchema` (#1726) * API to get Expr type and nullability without a `DFSchema` * Add test * publically export * Improve docs * Fix typos in crate documentation (#1739) * add `cargo check --release` to ci (#1737) * remote test * Update .github/workflows/rust.yml Co-authored-by: Andrew Lamb <[email protected]> Co-authored-by: Andrew Lamb <[email protected]> * Move optimize test out of context.rs (#1742) * Move optimize test out of context.rs * Update * use clap 3 style args parsing for datafusion cli (#1749) * use clap 3 style args parsing for datafusion cli * upgrade cli version * Add partitioned_csv setup code to sql_integration test (#1743) * use ordered-float 2.10 (#1756) Signed-off-by: Andy Grove <[email protected]> * #1768 Support TimeUnit::Second in hasher (#1769) * Support TimeUnit::Second in hasher * fix linter * format (#1745) * Create built-in scalar functions programmatically (#1734) * create build-in scalar functions programatically Signed-off-by: remzi <[email protected]> * solve conflict Signed-off-by: remzi <[email protected]> * fix spelling mistake Signed-off-by: remzi <[email protected]> * rename to call_fn Signed-off-by: remzi <[email protected]> * [split/1] split datafusion-common module (#1751) * split datafusion-common module * pyarrow * Update datafusion-common/README.md Co-authored-by: Andy Grove <[email protected]> * Update datafusion/Cargo.toml * include publishing Co-authored-by: Andy Grove <[email protected]> * fix: Case insensitive unquoted identifiers (#1747) * move dfschema and column (#1758) * add datafusion-expr module (#1759) * move column, dfschema, etc. to common module (#1760) * include window frames and operator into datafusion-expr (#1761) * move signature, type signature, and volatility to split module (#1763) * [split/10] split up expr for rewriting, visiting, and simplification traits (#1774) * split up expr for rewriting, visiting, and simplification * add docs * move built-in scalar functions (#1764) * split expr type and null info to be expr-schemable (#1784) * rewrite predicates before pushing to union inputs (#1781) * move accumulator and columnar value (#1765) * move accumulator and columnar value (#1762) * fix bad data type in test_try_cast_decimal_to_decimal * added projections for avro columns Co-authored-by: xudong.w <[email protected]> Co-authored-by: Andrew Lamb <[email protected]> Co-authored-by: Yijie Shen <[email protected]> Co-authored-by: Phillip Cloud <[email protected]> Co-authored-by: Matthew Turner <[email protected]> Co-authored-by: Yang <[email protected]> Co-authored-by: Remzi Yang <[email protected]> Co-authored-by: Dan Harris <[email protected]> Co-authored-by: Dom <[email protected]> Co-authored-by: Kun Liu <[email protected]> Co-authored-by: Dmitry Patsura <[email protected]> Co-authored-by: Raphael Taylor-Davies <[email protected]> Co-authored-by: Will Jones <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: r.4ntix <[email protected]> Co-authored-by: Jiayu Liu <[email protected]> Co-authored-by: Andy Grove <[email protected]> Co-authored-by: Rich <[email protected]> Co-authored-by: Marko Mikulicic <[email protected]> Co-authored-by: Eduard Karacharov <[email protected]>

HaoYang670 added 4 commits February 3, 2022 09:36

create build-in scalar functions programatically

4e1e9f3

Signed-off-by: remzi <[email protected]>

merge master branch

78db02b

Signed-off-by: remzi <[email protected]>

solve conflict

9cb1146

Signed-off-by: remzi <[email protected]>

fix spelling mistake

330dd8c

Signed-off-by: remzi <[email protected]>

github-actions bot added the datafusion Changes in the datafusion crate label Feb 3, 2022

houqp requested a review from alamb February 3, 2022 05:40

alamb approved these changes Feb 4, 2022

View reviewed changes

HaoYang670 added 2 commits February 7, 2022 15:37

rename to call_fn

2b1cc01

Signed-off-by: remzi <[email protected]>

Merge branch 'master' into issue1718_cleaner_API_to_create_scalar_fun…

75539c7

…ction

alamb approved these changes Feb 7, 2022

View reviewed changes

alamb merged commit 40c29e5 into apache:master Feb 7, 2022

HaoYang670 deleted the issue1718_cleaner_API_to_create_scalar_function branch February 8, 2022 01:24

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Create built-in scalar functions programmatically #1734

Create built-in scalar functions programmatically #1734

HaoYang670 commented Feb 3, 2022

alamb left a comment

alamb Feb 3, 2022

jimexist Feb 5, 2022

houqp Feb 5, 2022 •

edited

Loading

HaoYang670 Feb 5, 2022 •

edited

Loading

HaoYang670 Feb 7, 2022

alamb Feb 7, 2022

alamb Feb 3, 2022

alamb Feb 3, 2022

HaoYang670 Feb 7, 2022

alamb Feb 7, 2022

Create built-in scalar functions programmatically #1734

Create built-in scalar functions programmatically #1734

Conversation

HaoYang670 commented Feb 3, 2022

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

Are there any user-facing changes?

alamb left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

houqp Feb 5, 2022 • edited Loading

Choose a reason for hiding this comment

HaoYang670 Feb 5, 2022 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

houqp Feb 5, 2022 •

edited

Loading

HaoYang670 Feb 5, 2022 •

edited

Loading