Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support custom struct field names with new scalar function named_struct #9743

Merged
merged 7 commits into from
Mar 30, 2024

Conversation

gstvg
Copy link
Contributor

@gstvg gstvg commented Mar 22, 2024

Which issue does this PR close?

Closes #5861

What changes are included in this PR?

  1. Add a new scalar function named_struct which accepts pairs of name and values, made possible with Support compute return types from argument values (not just their DataTypes) #8985
  2. Transform sqlparser::Expr::Struct into a named_struct call, instead of the existing struct, and if any expression is a sqlparser::Expr::Named, use its name as the struct field name, otherwise fallback to the currently used cN convention.

Are these changes tested?

Yes, the new scalar function has a unit test similar to the existing struct function, and a sql logic test has been added too.

Are there any user-facing changes?

New scalar function.
Existing struct function is not changed.

Alternatives considered

  1. Separate new Expr::Struct expression
  2. Return the existing struct call wrapped in a Expr::Cast to a struct with custom field names, but this would only benefit SQL users.

@github-actions github-actions bot added sql SQL Planner sqllogictest SQL Logic Tests (.slt) labels Mar 22, 2024
Comment on lines +3333 to +3339
select struct(a as field_a, b) from t;
+--------------------------------------------------+
| named_struct(Utf8("field_a"),t.a,Utf8("c1"),t.b) |
+--------------------------------------------------+
| {field_a: 1, c1: 2} |
| {field_a: 3, c1: 4} |
+--------------------------------------------------+
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please tell me why would this test change?🤔

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently this syntax is not supported. This PR adds support for it by actually calling named_struct instead, which itself can't support this syntax because it would be ambiguos, e.g: named_struct('name1', 1 as name2)
Essentialy struct is treated as expression and is rewrited to a named_struct function call.
But perhaps it's a bit confusing? What do you think?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it, thanks! :)
I think it would be better to add a new test case to illustrate this new feature rather than changing the old one? Just my two cents, let's see what others would think!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed -- please do add a new test

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @gstvg -- this PR looks very good to me. A really nice initial contribution 🏆

Thank you @yyy1000 for your review.

I think this PR is very close - it just needs some additional tests and to handle mixed scalar/column values

@@ -50,6 +50,12 @@ select struct(1, 3.14, 'e');
----
{c0: 1, c1: 3.14, c2: e}

# struct scalar function #1 with alias
query ?
select struct(1 as "name0", 3.14 as name1, 'e', true as 'name3');
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is very cool -- I didn't realize we supported this syntax

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Haha, i thought it's a generic syntax supported basically anywhere, only after your comment i realized it's specific to struct(...)

datafusion/functions/src/core/named_struct.rs Outdated Show resolved Hide resolved
Comment on lines 90 to 92
Err(datafusion_common::DataFusionError::Internal(
"return_type called instead of return_type_from_exprs".into(),
))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we use internal_err! here and also include the name named_struct as part of the error message?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think maybe we can just delete this method return_type. I don't think in any case this will be called 🤔

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yyy1000 ideally yes, but it's a required method. If I remember correctly, it's because of backwards compatibility, right?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gstvg Ohhh, yes. Sorry for that, I forgot return_type didn't have a default implementation :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I came up with an idea

array_ref,
))
} else {
exec_err!("named_struct even arguments must be string literals")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similarly to below, I think it would be valuable to explictly describe here what was received instead so a user who encounters this error will more easily be able to find their mistake and correct it

--TableScan: values projection=[a, b, c]
physical_plan
ProjectionExec: expr=[struct(a@0, b@1, c@2) as struct(values.a,values.b,values.c)]
ProjectionExec: expr=[named_struct(c0, a@0, c1, b@1, c2, c@2) as named_struct(Utf8("c0"),values.a,Utf8("c1"),values.b,Utf8("c2"),values.c)]
--MemoryExec: partitions=1, partition_sizes=[1]

statement ok
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we please add a few more tests:

  1. Error case: named_struct with 0 arguments
  2. Error case: named_struct wth an odd number of arguments
  3. Error case: named_struct with an even number of arguments
  4. A test where the arguments are both columns and arrays (example below)

Here is an example (I expect it to panic in this PR at first -- see #9775 for how to fix it)

❯ create table t(x int) as values (1), (2), (3);
0 rows in set. Query took 0.014 seconds.

❯ select named_struct(x as 'col_x', 25 as scalar) from t;

Copy link
Contributor

@yyy1000 yyy1000 Mar 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, also a basic test case like the document you added.
select named_struct('field_a', a, 'field_b', b) from t;

Edit: or maybe you can add the test cases to a new test file?

if let ColumnarValue::Scalar(ScalarValue::Utf8(Some(name))) = name {
let array_ref = match value {
ColumnarValue::Array(array) => array.clone(),
ColumnarValue::Scalar(scalar) => scalar.to_array()?.clone(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you need to turn this into the correct length to match any array arguments (I mention a test below that will fix the problem) otherwise the sizes of the arguments will be mismatched

I suggest using ColumnarValues::values_to_array if possible

use std::any::Any;
use std::sync::Arc;

/// put values in a struct array.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than adding a new function, I think we could instead change the existing struct UDF to support explicit named fields.

Perhaps we can do that as a follow on PR?

--TableScan: values projection=[a, b, c]
physical_plan
ProjectionExec: expr=[struct(a@0, b@1, c@2) as struct(values.a,values.b,values.c)]
ProjectionExec: expr=[named_struct(c0, a@0, c1, b@1, c2, c@2) as named_struct(Utf8("c0"),values.a,Utf8("c1"),values.b,Utf8("c2"),values.c)]
Copy link
Contributor

@yyy1000 yyy1000 Mar 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I also have the same concern as @alamb said in #9743 (comment). It seems that from here, the current struct UDF will never be called because calling struct will rewrite into named_struct UDF 👀

We can also leave this and probably add it as a follow-up

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Filed #9839 for follow up

@yyy1000
Copy link
Contributor

yyy1000 commented Mar 24, 2024

It looks nice to me overall, and I also added some comment which may help you improve the PR :) @gstvg

@gstvg
Copy link
Contributor Author

gstvg commented Mar 25, 2024

Thank you guys for all the comments and the review.

Prior to applying the requested changes, I wanna say that I think I didn't elaborated enough what this PR does and most importantly, why it does that way, and now I'm afraid maybe this is not the best way.

How @alamb noted and @yyy1000 agreed, ideally we would simply extend the current struct function to support named arguments, but because of the scalar function signature fn invoke(&self, args: &[ColumnarValue]) -> Result<ColumnarValue>, I didn't figured out any way to associate a name or any kind of metadata with the args. Self in this case is the "global" function, not unique to any specific invocation. Changing it's signature it's a big change, and I personally like it's simplicity. We can add metadata to the ColumnarValue, but again, feels like a big change.

We could modify the existing struct func to accept pairs of name and value, but it would be breaking change, e.g: given an existing call struct("text", 1) should it create a struct with 2 fields named "c0" and "c1", or a single field named "text"?

And most importantly, It would not support the much nicer syntax struct("valueOrName?", "value" as another_name) because it would be ambiguos, is the first arg a value, or the name of the second arg? If it's a name, and it's corresponding value pair is a named arg, what name takes precedende?

So, we can create a new separate function, that accept pairs of name and value, but not named arguments, and it would give us a minimun level of support without a nice syntax, and also naming it named_struct it becomes compatible with Spark but that's not the priority now, of course.

Then, there's the improvement of supporting the struct(value as name) syntax.

Currently, this PR, while parsing sqlparser::Expr::Struct, inspect the args Expr and if it is Expr::Named, use its name, otherwise the cN convetion, then call named_struct, not struct.
But doing this, what exaclty struct(...) becomes ? A function? Expression? "Meta-function"? How should the docs call it? Users inspecting projections and plans would only see named_struct calls, it is acceptable? It is indeed a bit confusing.
Also, if Expr-based code calls struct with datafusion::Expr::Alias, this PR can't identify it and return field names with the cN convetion, which is probably unexpected. Should we delete the struct function, so Expr code can't use it?

Prior opening this PR, I though this:

Return the existing struct call wrapped in a Expr::Cast to a struct with custom field names, but this would only benefit SQL users.

I was wrapping the call with a cast while parsing sqlparser::Expr.

But if instead we parse sqlparser::Expr::Named to datafusion::Expr::Alias, then add FunctionRewrite that wraps struct calls with cast, using the name of Expr::Alias, if any, or the cN convetion, we support both SQL and Expr users via the same code in a more expected way, i think.
The problem with it is that, per current FunctionRewrite docs

Trait for rewriting [`Expr`]s into function calls.

This trait is used with `FunctionRegistry::register_function_rewrite` to
to evaluating `Expr`s using functions that may not be built in to DataFusion

For example, concatenating arrays `a || b` is represented as
`Operator::ArrowAt`, but can be implemented by calling a function
`array_concat` from the `functions-array` crate.

we are using it in the opposite way it was designed for. It is okay to use it that way, updating the docs, or it's misuse?
It also create very long projection plan strings, which may be undesired.

So, given the options:

  1. Add named_struct function
    1.1. Forward struct calls to named_struct
  2. Rewrite struct into a Expr::Cast(struct_fn_expr, struct_datatype_with_named_fields)

We can do:

  • Only 1
  • Only 2
  • 1 & 1.1 (This PR)
  • 1 & 2(for spark compatibility only, in a different PR)

What do you think?

@yyy1000
Copy link
Contributor

yyy1000 commented Mar 26, 2024

Thanks for your explain, @gstvg !
Yeah, in my previous thought, struct udf will not be called after this PR, so if we just remove struct udf, would it also work? ( I think so but not sure)

@jayzhan211
Copy link
Contributor

jayzhan211 commented Mar 26, 2024

@gstvg Is it possible to have a syntax like {"a": 1, "b": 2} for a named struct?

struct can remain for the cN convention when you don't care about the field name, so we don't have the ambiguous signature issue about struct() or named_struct()

Do not allow partial named struct like struct("valueOrName?", "value" as another_name), I think it is confusing and also not a helpful syntax.

After the named struct {"a": 1, "b": 2} is completed, we can easily rewrite struct to {"c0": 1, "c1": 2}, at the end we only need one struct function.

@gstvg
Copy link
Contributor Author

gstvg commented Mar 26, 2024

Thanks for your explain, @gstvg ! Yeah, in my previous thought, struct udf will not be called after this PR, so if we just remove struct udf, would it also work? ( I think so but not sure)

Yes, you're right, named_struct doesn't require the struct udf, but some expr-based code somewhere may depend on it, it is enough to justify it's existance? I'm don't know much about the retrocompatibility guidelines of datafusion.

Edit: link @yyy1000

@gstvg
Copy link
Contributor Author

gstvg commented Mar 26, 2024

@gstvg Is it possible to have a syntax like {"a": 1, "b": 2} for a named struct?

It would require a PR to sqlparser-rs
Someone already created an issue

struct can remain for the cN convention when you don't care about the field name, so we don't have the ambiguous
signature issue about struct() or named_struct()

Do not allow partial named struct like struct("valueOrName?", "value" as another_name), I think it is confusing and also > not a helpful syntax.

Fair enough

After the named struct {"a": 1, "b": 2} is completed, we can easily rewrite struct to {"c0": 1, "c1": 2}, at the end we only need one struct function.

Yes, after minimal support is merged, and support for the syntax is added on sqlparser-rs, it would be trivial to support it, leaving us with a very nice support for structs!

Edit: link @jayzhan211

@alamb
Copy link
Contributor

alamb commented Mar 27, 2024

Thank you @gstvg -- I think your solution 1.1 sounds like the best plan to me (and basically what this PR does). I think this PR is very important and will unlock several other great usecases for DataFusion so thank you again for working on it

Yes, you're right, named_struct doesn't require the struct udf, but some expr-based code somewhere may depend on it, it is enough to justify it's existance? I'm don't know much about the retrocompatibility guidelines of datafusion.

I agree that it would be nice to remove the old struct udf -- let's do that as a follow on PR (where we can discuss the merits / requirements for backwards compatibility)

It looks like @jayzhan211 and you are already busy working on the sqlparser support for supporting the duckdb style literal syntax 🙏

Thus here is what I think we should do:

  1. Polish up this PR (add the tests described in Support custom struct field names with new scalar function named_struct #9743 (comment))
  2. Work on removing the existing struct udf impl as a follow on PR
  3. Work on the named struct syntax support as a follow on PR (Support DuckDB style stuct syntax #9820)


let name = match name_column {
ColumnarValue::Scalar(ScalarValue::Utf8(Some(name_scalar))) => name_scalar,
_ => return exec_err!("named_struct even arguments must be string literals, got {name_column:?} instead at position {}", i * 2)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❤️

--TableScan: values projection=[a, b, c]
physical_plan
ProjectionExec: expr=[struct(a@0, b@1, c@2) as struct(values.a,values.b,values.c)]
ProjectionExec: expr=[named_struct(c0, a@0, c1, b@1, c2, c@2) as named_struct(Utf8("c0"),values.a,Utf8("c1"),values.b,Utf8("c2"),values.c)]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Filed #9839 for follow up

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great to me -- than you @gstvg

I filed #9839 for the follow on task

@alamb
Copy link
Contributor

alamb commented Mar 28, 2024

I think you can get the CI passing by merging this PR up from main. We are getting very close

@gstvg gstvg requested review from alamb and yyy1000 March 28, 2024 18:12
Copy link
Contributor

@yyy1000 yyy1000 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, thank you a lot! @gstvg

Comment on lines 90 to 92
Err(datafusion_common::DataFusionError::Internal(
"return_type called instead of return_type_from_exprs".into(),
))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Err(datafusion_common::DataFusionError::Internal(
"return_type called instead of return_type_from_exprs".into(),
))
unreachable!()

Would unreachable!() be suitable here? @alamb @gstvg

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't quite figure out what code this was in relation to (the github UI is being weird). But in general I am a fan of using internal errors if possible even if they really are unreachable. The rationale being that if there is some bug that gets introduced in some future refactoring, then the failure mode is nicer than panic'ing

@alamb alamb merged commit aa879bf into apache:main Mar 30, 2024
24 checks passed
@alamb
Copy link
Contributor

alamb commented Mar 30, 2024

Thanks again @gstvg 🙏

@alamb
Copy link
Contributor

alamb commented Mar 30, 2024

And thanks for the reviews and suggestions @yyy1000 and @jayzhan211

Lordworms pushed a commit to Lordworms/arrow-datafusion that referenced this pull request Apr 1, 2024
…ct (apache#9743)

* Support custom struct field names with new scalar function named_struct

* add tests and corretly handle mixed arrray and scalar values

* fix slt

* fmt

* port test to slt

---------

Co-authored-by: Andrew Lamb <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
sql SQL Planner sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Create a struct with the specified field names and values
4 participants