Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Create datafusion-functions crate, extract encode and decode to #8705

Merged
merged 20 commits into from
Jan 30, 2024

Conversation

alamb
Copy link
Contributor

@alamb alamb commented Jan 1, 2024

Note: I view this PR as strategically very important because:

  1. It proposes a pattern to make the DataFusion core more modular
  2. One we set the pattern, I think the community will likely help do the work (as they have done in Port remaining sql integration tests from rust to sqllogictest #6195)

While this PR looks large, it is a lot of boiler plate and comments. The actual changes are quite small (move two functions)

Which issue does this PR close?

Part of #8045

Closes #8046
Closes #8157 (it is not needed anymore)

Rationale for this change

This is the start of removing the distinction between "Built in" functions and "User Defined Functions"

Removing the distinction between "Built in" functions and "User Defined Functions" has two benefits:

  1. It ensures that User Defined Functions have the exact same functionality as "built in" functions
  2. It allows better customization of DataFusion (e.g. for people who want "postgres like" functionality vs people who want "spark like" functionality) by making it easier to mix different packages.

What changes are included in this PR?

  1. Add a new datafusion-functions crate (eventually I imagine there may be several subcrates, to allow faster compilation, but let's start with one)
  2. Adds a new API convention for a function "package" and examples for how to write
  3. Add a register_udf function to the FunctionRegistry trait
  4. Extract the encode and decode functions into the new crate
  5. Tests showing that protobuf serialization works for these functions

Are these changes tested?

Yes, there are new unit tests

Are there any user-facing changes?

  1. The definition of the expr fn API for encode and decode are now defined in datafusion-functions rather than datafusion-expr. For anyone who just uses the expr_fn APIs this will mean a new dependency on datafusion-physical-expr.

@github-actions github-actions bot added logical-expr Logical plan and expressions physical-expr Physical Expressions core Core DataFusion crate labels Jan 1, 2024
@alamb alamb mentioned this pull request Jan 1, 2024
2 tasks
@@ -69,14 +69,10 @@ pub enum BuiltinScalarFunction {
Cos,
/// cos
Cosh,
/// Decode
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the key point of this PR: to begin removing items from this (giant) enum

@@ -758,30 +752,6 @@ impl BuiltinScalarFunction {
BuiltinScalarFunction::Digest => {
utf8_or_binary_to_binary_type(&input_expr_types[0], "digest")
}
BuiltinScalarFunction::Encode => Ok(match input_expr_types[0] {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This logic is moved into the ScalarUDFImpl for Encode/Decod

@@ -1310,34 +1306,4 @@ mod test {
unreachable!();
}
}

#[test]
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These tests are testing the expression construction which I don't think is hugely valuable -- what is far more important is that encode/decode can be called as SQL and the dataframe API

encode/decode is covered by in slt here: https://github.com/apache/arrow-datafusion/blob/d68c0fcb491d4d0a2159ff433b59fa77f78edca4/datafusion/sqllogictest/test_files/encoding.slt#L18-L17

And I have added an encode/decode dataframe test in this PR

@@ -524,6 +527,30 @@ async fn roundtrip_logical_plan_with_extension() -> Result<()> {
Ok(())
}

#[tokio::test]
async fn roundtrip_expr_api() -> Result<()> {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here are tests demonstrating that using ScalarUDF to implement functions are properly serialized

@@ -750,3 +756,48 @@ async fn test_fn_upper() -> Result<()> {

Ok(())
}

#[tokio::test]
async fn test_fn_encode() -> Result<()> {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these functions weren't previously tested. I added tests to show that it is still possible to use the fluent API even with the moving of the expr_fn implementation

not_impl_err!("Registering ScalarUDF")
}

// TODO add register_udaf and register_udwf
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will file follow on tickets for this

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.


/// A [`FunctionRegistry`] that uses in memory [`HashMap`]s
#[derive(Default, Debug)]
pub struct MemoryFunctionRegistry {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is currently only used for testing, but I plan to reduce the duplication between SessionContext, and TaskContext using it

under the License.
-->

# DataFusion Function Library
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is the basic idea -- make a new crate with all the function definitions

}
};
}

#[cfg(feature = "crypto_expressions")]
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The other existing feature flags such as crypto_expressions are obvious candidates to extract into datafusion-functions

@alamb alamb marked this pull request as ready for review January 15, 2024 14:15
@alamb alamb changed the title Extract encode and decode to datafusion-functions crate Create datafusion-functions crate, extract encode and decode to Jan 15, 2024
@alamb
Copy link
Contributor Author

alamb commented Jan 15, 2024

I think we should wait to merge this until after releasing #8863

@ozankabak
Copy link
Contributor

@metesynnada PTAL

Copy link
Contributor

@mustafasrepo mustafasrepo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!. I think this PR is great in accomplishing your aim:

This is the start of removing the distinction between "Built in" functions and "User Defined Functions"

This will enable force us to use, create good APIs for external users. Since now every scalar function is like User Defined function. Thanks @alamb for this PR.
I think, in the future your intent is to move all functions in the BuiltinScalarFunction enum to this package. If this is the intent, maybe we can add a recipe somewhere (can be a issue body) for explaining steps to convert existing BuiltinScalarFunctions to ScalarUDFImpls

Copy link
Contributor

@crepererum crepererum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alamb
Copy link
Contributor Author

alamb commented Jan 29, 2024

Thank you @mustafasrepo @metesynnada and @crepererum for your reviews. I plan to merge this PR tomorrow unless anyone else would like extra time to review, and file follow on issues to start organizing the bigger project

Copy link
Member

@sunchao sunchao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. This looks like a pretty nice feature!

}

fn invoke(&self, args: &[ColumnarValue]) -> Result<ColumnarValue> {
// Put a feature flag here to make sure this is only compiled when the feature is activated
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need a TODO here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good call -- I think the comment is out of date -- this whole module is only compiled when the feature is activated. I will remove it

Comment on lines +102 to +104
macro_rules! make_package {
($name:ident, $feature:literal, $DOC:expr) => {
#[cfg(feature = $feature)]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I wonder if all function packages have corresponding features? I guess for built-in scalar functions they don't have feature?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an excellent observation

I think eventually it would be nice to have as many functions as possible be optional and behind a feature flat. There are some functions that back built in Expr like make_array

What I am thinking is that we'll first break out the functions that do have their own feature flags already and establish a pattern and then we can design / lift out others that don't. I hope to write down more of the plan shortly

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is a discussion / proposal on how the functions could be organized: #9100

@Omega359
Copy link
Contributor

Looks nice, I think this should really help with the readability of the code

@alamb
Copy link
Contributor Author

alamb commented Jan 30, 2024

🚀

@alamb alamb merged commit d6d35f7 into apache:main Jan 30, 2024
22 checks passed
@alamb alamb deleted the alamb/function_packages branch January 31, 2024 14:18
@alamb
Copy link
Contributor Author

alamb commented Jan 31, 2024

Thank you everyone who helped review and provide commentary on this. @viirya @sunchao @2010YOUY01 @crepererum @metesynnada and @mustafasrepo -- this is only the first step but I am pretty excited about where this will go

@Omega359
Copy link
Contributor

Omega359 commented Feb 7, 2024

A couple of questions on this:

  • Where should public rustdocs for the function go?
  • Benchmarks? Tests && test data? Is that all going to live at functions/benches and functions/tests ?

@alamb
Copy link
Contributor Author

alamb commented Feb 8, 2024

I responded to these questions above at #9100 (comment)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate logical-expr Logical plan and expressions physical-expr Physical Expressions
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support Expr creation for ScalarUDF: Resolve function calls by name during planning
8 participants