feat: support Utf8View type in starts_with function #11787

Merged
merged 13 commits into from
Aug 6, 2024

Contributor

tshauck commented Aug 2, 2024

Which issue does this PR close?

Closes #11786

Rationale for this change

Utf8View doesn't appear to be supported by starts_with yet.

What changes are included in this PR?

Update the type signature of the UDF and propagate the new data type support through the Arrow interactions.

Are these changes tested?

Yes, added some unit tests and did manual testing via the CLI.

Are there any user-facing changes?

No

let string_types = vec![Utf8, LargeUtf8, Utf8View];
let mut type_signatures = vec![];

for left in &string_types {

Contributor Author

Not sure if there's a better way to do this? Maybe I should just list them out?

Contributor

This is an interesting idea; I like it better than the earlier implementation.

Contributor

I don't think the underlying kernel supports different string types (you can't actually call it with (Utf8, LargeUtf8)), so the existing function actually has a bug.

In fact, here is a reproducer showing the problem on main:

DataFusion CLI v40.0.0
> create table foo as values (arrow_cast('foo', 'Utf8'), arrow_cast('bar', 'LargeUtf8'));
0 row(s) fetched.
Elapsed 0.046 seconds.

> select * from foo;
+---------+---------+
| column1 | column2 |
+---------+---------+
| foo     | bar     |
+---------+---------+
1 row(s) fetched.
Elapsed 0.010 seconds.

> select starts_with(column1, column2) from foo;
Internal error: could not cast value to arrow_array::array::byte_array::GenericByteArray<arrow_array::types::GenericStringType<i32>>.
This was likely caused by a bug in DataFusion's code and we would welcome that you file an bug report in our issue tracker

So in other words, I think the correct signature is

            signature: Signature::one_of(
                vec![
                    Exact(vec![Utf8, Utf8]),
                    Exact(vec![LargeUtf8, LargeUtf8]),
                    Exact(vec![Utf8View, Utf8View]),
                ],
                Volatility::Immutable,
            ),
        }
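
To make the difference concrete, here is a minimal, std-only sketch contrasting the nested-loop approach with same-type pairs. `StrType`, `STRING_TYPES`, `all_pairs`, and `same_type_pairs` are hypothetical stand-ins for illustration, not DataFusion's `DataType` or `TypeSignature` API.

```rust
/// Hypothetical stand-in for the three Arrow string types discussed here.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum StrType {
    Utf8,
    LargeUtf8,
    Utf8View,
}

pub const STRING_TYPES: [StrType; 3] =
    [StrType::Utf8, StrType::LargeUtf8, StrType::Utf8View];

/// Nested-loop approach: every (left, right) combination, including mixed
/// pairs such as (Utf8, LargeUtf8), which fail in the reproducer above.
pub fn all_pairs(types: &[StrType]) -> Vec<(StrType, StrType)> {
    let mut pairs = Vec::new();
    for &left in types {
        for &right in types {
            pairs.push((left, right));
        }
    }
    pairs
}

/// Same-type approach: exactly one signature per string type.
pub fn same_type_pairs(types: &[StrType]) -> Vec<(StrType, StrType)> {
    types.iter().map(|&t| (t, t)).collect()
}
```

With three string types the nested loop advertises nine signatures, six of them mixed, while the same-type version advertises only the three listed in the suggested signature.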

Contributor Author

Thanks for this and link to the tests -- very helpful!

I made this update along with simplifying the arrow starts_with call (#11787 (comment)), and I'm noticing that STARTS_WITH(column1_utf8view, 'foo') casts the first argument to Utf8 and leaves the second as Utf8. I think we'd want to keep the first argument as a Utf8View and coerce the second argument to Utf8View instead? Calling starts_with with a scalar as the second argument seems common.

For example, with the changes, this test passes:

query TT
EXPLAIN SELECT
  STARTS_WITH(column1_utf8view, 'foo') as c,
  STARTS_WITH(column1_utf8view, column2_utf8view) as c2
FROM test;
----
logical_plan
01)Projection: starts_with(CAST(test.column1_utf8view AS Utf8), Utf8("foo")) AS c, starts_with(test.column1_utf8view, test.column2_utf8view) AS c2
02)--TableScan: test projection=[column1_utf8view, column2_utf8view]

Contributor

Yes, I agree that figuring out how to coerce the constant to Utf8View is important.

I think the coercion logic just goes linearly through the signature and stops at the first type that works.

Thus I wonder if you can make Utf8View preferred by listing the types in a different order, with the Utf8View signature as the first option? Like:

            signature: Signature::one_of(
                vec![
                    // make `Utf8View` first
                    Exact(vec![Utf8View, Utf8View]),
                    Exact(vec![Utf8, Utf8]),
                    Exact(vec![LargeUtf8, LargeUtf8]),
                ],
                Volatility::Immutable,
            ),
        }
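
As a rough illustration of the "first signature that works" behavior described above, here is a toy model using only the standard library. Everything in it is an assumption made for the sketch: `StrType`, `can_coerce`, and `first_match` are hypothetical names, and the coercion rules are simplified guesses rather than DataFusion's actual logic. The `allow_into_view` flag models whether a rule for coercing Utf8 or LargeUtf8 into Utf8View exists at all.

```rust
/// Hypothetical stand-in for the three Arrow string types discussed here.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum StrType {
    Utf8,
    LargeUtf8,
    Utf8View,
}

/// Toy coercion rules (assumptions for illustration only):
/// - a type always coerces to itself;
/// - Utf8View may be cast to Utf8, as in the plan shown earlier;
/// - Utf8 / LargeUtf8 may become Utf8View only if `allow_into_view` is set.
pub fn can_coerce(from: StrType, into: StrType, allow_into_view: bool) -> bool {
    from == into
        || (from == StrType::Utf8View && into == StrType::Utf8)
        || (allow_into_view
            && into == StrType::Utf8View
            && matches!(from, StrType::Utf8 | StrType::LargeUtf8))
}

/// Walk the candidate signatures in order and return the first one in which
/// every argument can be coerced to the corresponding slot.
pub fn first_match(
    signatures: &[[StrType; 2]],
    args: [StrType; 2],
    allow_into_view: bool,
) -> Option<[StrType; 2]> {
    signatures.iter().copied().find(|sig| {
        args.iter()
            .zip(sig.iter())
            .all(|(&from, &into)| can_coerce(from, into, allow_into_view))
    })
}
```

In this model, calling `first_match` with the Utf8View-first ordering and arguments `(Utf8View, Utf8)` still selects `(Utf8, Utf8)` while `allow_into_view` is false, and selects `(Utf8View, Utf8View)` once it is true. Ordering only helps if the coercion into the preferred type is permitted in the first place.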

Contributor Author

Still seems to be happening with this latest push 🤔 (222263c).

I'll try to poke around and see if I can understand what's happening.

Contributor Author

I think it needed this update in the argument coercion logic... before this even with Utf8View being preferred in the signature it wasn't recognized in the type to/from statement: https://github.com/apache/datafusion/pull/11787/files#diff-6e1fb265597317a8256c60670ff3ea7be6896b2df1199a40ca79419ce29b4ce3

#[test]
fn test_functions() -> Result<()> {
// Generate test cases for starts_with
let test_cases = vec![

Contributor Author

Not sure if this is the best way to test this change. It tests the function but doesn't verify that a view wouldn't be coerced.

Contributor

I have tests for the coercion in #11753 -- you beat me to filing the ticket ❤️ -- I plan to make an epic / start filing items for the various other functions

Contributor

I recommend using an slt test (see the example SQL test above).

Contributor Author

Just to throw it out there, if you make the top-level epic, I'd be happy to work through that PR and make individual issues for the problematic functions.

Contributor

Deal! I filed #11790

What I recommend doing is starting with a single ticket and making it a complete template / description, with the idea being that someone who is not familiar with DataFusion or Utf8View can work on it.

Then when we file tickets for the rest of the functions we can reuse the same description over again. This

  1. Reduces the barrier to contribution
  2. I have found people love working on the code when there is a clear description

I took a shot at updating #11786 with background for use as a template. Let me know what you think.

Contributor Author

Sounds good to me. I'll have a go at replicating one of the other functions following #11786, and then if we're happy with it, I can grind through the others.

Contributor

Sounds good -- thank you. What I have found works well in the past is to do about 10 at a time (with more than that, I found the number of parallel threads of work overwhelmed our ability to review them).

@tshauck changed the title from "feat: support Utf8View for starts_with" to "feat: support Utf8View type in starts_with function" on Aug 2, 2024
}
_ => internal_err!("Unsupported data type"),
DataType::LargeUtf8 => make_scalar_function(starts_with::<i64>, vec![])(args),
DataType::Utf8View => make_scalar_function(starts_with::<i32>, vec![])(args),

Contributor

I think we can use the | operator to handle Utf8 and Utf8View together.

( DataType::Utf8View | DataType::Utf8 ) => make_scalar_function(starts_with::<i32>, vec![])(args)
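
For readers less familiar with or-patterns, here is a small std-only sketch of the same idea. `StrType` and `offset_param` are hypothetical names used only for illustration; the function returns the name of the offset type parameter instead of calling `make_scalar_function`.

```rust
/// Hypothetical stand-in for the three Arrow string types discussed here.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum StrType {
    Utf8,
    LargeUtf8,
    Utf8View,
}

/// Pick the offset type parameter for each string type, mirroring the
/// dispatch in the diff above.
pub fn offset_param(data_type: StrType) -> &'static str {
    match data_type {
        // One arm covers both variants thanks to the `|` pattern.
        StrType::Utf8View | StrType::Utf8 => "i32",
        StrType::LargeUtf8 => "i64",
    }
}
```

The parentheses around the pattern in the suggestion above are accepted by the compiler but are not required.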

Contributor

alamb left a comment

Thanks @tshauck and @dharanad -- this is quite cool. Thank you 🙏

Once we get this one sorted out, I think it will be a good example of how to add native StringViewArray support

let left = as_generic_string_array::<T>(&args[0])?;
let right = as_generic_string_array::<T>(&args[1])?;
let bool_result = match (args[0].data_type(), args[1].data_type()) {
(DataType::Utf8View, DataType::Utf8View) => {

Contributor

Since https://docs.rs/arrow/latest/arrow/compute/kernels/comparison/fn.starts_with.html already does this downcasting, I think you could simply call it with two &dyn Array rather than having to check the types.

In particular I think you could simply call

arrow::compute::kernels::comparison::starts_with(left, right)?

and remove the <T: OffsetSizeTrait> from this call

BTW I double checked that arrow has native StringView support (thanks to @XiangpengHao ):
https://docs.rs/arrow-string/52.2.0/src/arrow_string/like.rs.html#129-148

Contributor

This is super cool.

@github-actions github-actions bot added the sqllogictest SQL Logic Tests (.slt) label Aug 2, 2024
@github-actions github-actions bot added the logical-expr Logical plan and expressions label Aug 2, 2024
Contributor Author

tshauck commented Aug 2, 2024

Gonna mark this ready for review, though happy to make any additional updates.

@tshauck tshauck marked this pull request as ready for review August 2, 2024 23:21
@XiangpengHao
Contributor

This PR is really nicely done! Looks good to me!

Contributor

2010YOUY01 left a comment

Thank you for this nice work, it looks good to me.

Given this PR can become an example for supporting StringView for remaining string functions, I suggest also:

  1. Add more sqllogictests, since they're easier to understand and follow.
  2. In sqllogictest end-to-end tests, include more edge cases like empty strings and mismatched argument types like starts_with(utf8_col, utf8view_col). This way we can encourage other contributors to follow the pattern and write better tests for the remaining functions.

@@ -573,6 +573,8 @@ fn coerced_from<'a>(
(Interval(_), _) if matches!(type_from, Utf8 | LargeUtf8) => {
Some(type_into.clone())
}
// We can go into a Utf8View from a Utf8 or LargeUtf8
(Utf8View, _) if matches!(type_from, Utf8 | LargeUtf8) => Some(type_into.clone()),

Contributor

LargeUtf8 has a 64-bit length, while Utf8View only has a 32-bit length.
Looks like LargeUtf8 -> Utf8View is not possible?

Contributor

alamb, Aug 3, 2024

The arrow cast kernel supports this conversion

Though I am not sure if we should allow the automatic coercion as it could potentially fail if the strings were over 2GB 🤔

Contributor

I see. The semantics are that the LargeUtf8 conversion is supported when the lengths fit in 32 bits; otherwise, a runtime error will occur. This makes sense.
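
The "works unless a string is too long" semantics can be sketched with plain integers. This is only an illustration under stated assumptions: `TooLong` and `narrow_lengths` are hypothetical names, the limit is taken to be a signed 32-bit value to match the 2GB figure mentioned above, and none of this is the arrow cast kernel's actual code.

```rust
/// Hypothetical error for a string whose byte length does not fit in 32 bits.
#[derive(Debug, PartialEq, Eq)]
pub struct TooLong {
    pub index: usize,
    pub len: u64,
}

/// Narrow 64-bit byte lengths to 32-bit ones, failing on the first value
/// that exceeds `i32::MAX` (an assumed limit, see the note above).
pub fn narrow_lengths(lengths: &[u64]) -> Result<Vec<i32>, TooLong> {
    lengths
        .iter()
        .enumerate()
        .map(|(index, &len)| {
            if len <= i32::MAX as u64 {
                Ok(len as i32)
            } else {
                Err(TooLong { index, len })
            }
        })
        .collect()
}
```

The check is data dependent, which is why an automatic coercion could succeed on most inputs and still fail at runtime on a sufficiently large string.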

let result = arrow::compute::kernels::comparison::starts_with(left, right)?;

pub fn starts_with(args: &[ArrayRef]) -> Result<ArrayRef> {
let result = arrow::compute::kernels::comparison::starts_with(&args[0], &args[1])?;

Contributor

👍

Contributor

alamb commented Aug 3, 2024

Thank you for this nice work, it looks good to me.

Given this PR can become an example for supporting StringView for remaining string functions, I suggest also:

  1. Add more sqllogictests, since they're easier to understand and follow.
  2. In sqllogictest end-to-end tests, include more edge cases like empty strings and mismatched argument types like starts_with(utf8_col, utf8view_col). This way we can encourage other contributors to follow the pattern and write better tests for the remaining functions.

I think this is a great suggestion @2010YOUY01 -- I can help with this too

Here are the instructions: https://github.com/apache/datafusion/tree/main/datafusion/sqllogictest

Ideally you should be able to extend string_view.slt perhaps in https://github.com/apache/datafusion/tree/main/datafusion/sqllogictest/test_files

Contributor

alamb left a comment

From my perspective this PR could be merged as is (thank you @tshauck @dharanad @2010YOUY01 for the work)

I plan to leave it open until @tshauck has a chance to respond to the outstanding comments. While I think they could be done as follow-on work, @2010YOUY01 has a good point: if this PR will be used as an example / template for how to add native StringView support, getting all the changes (esp. the testing) in a single PR has a lot of value.

tshauck and others added 2 commits August 3, 2024 10:13
Contributor Author

tshauck commented Aug 3, 2024

Thanks @alamb @2010YOUY01 @dharanad for the feedback. Please give me a bit to add additional tests in this PR, then I'll re-request review.

Contributor

alamb commented Aug 6, 2024

🚀

@tshauck tshauck deleted the starts_with branch August 6, 2024 17:04
Labels
logical-expr Logical plan and expressions sqllogictest SQL Logic Tests (.slt)
Development

Successfully merging this pull request may close these issues.

Support starts_with for Utf8View
5 participants