Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support Utf8View for string function bit_length() #13221

Merged
merged 4 commits into from
Nov 10, 2024

Conversation

austin362667
Copy link
Contributor

Which issue does this PR close?

Closes #13195

Rationale for this change

Thanks to @jayzhan211 , he noticed following issue which string scalar function bit_length() doesn't support Utf8View type:

select bit_length(arrow_cast('a', 'Utf8View'));

What changes are included in this PR?

Update bit_length() scalar function to support Utf8View

Are these changes tested?

Yes for scalar function. No for array function as it's still PR in apache/arrow-rs#6671

Are there any user-facing changes?

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@github-actions github-actions bot added the sqllogictest SQL Logic Tests (.slt) label Nov 1, 2024
@austin362667
Copy link
Contributor Author

certainly, thanks for review!

@@ -1096,3 +1096,20 @@ Rapha Raphael NULL datafusionДатаФус datafusionДатаФусион NULL
und under_score NULL un iść core un iść core NULL
p percent NULL pan Tadeusz ma iść w kąt pan Tadeusz ma iść w kąt NULL
NULL NULL NULL NULL NULL NULL

# TODO: Support Utf8View for bit_length array string function
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand this comment -- it seems like bit_length is working on Utf8View in datafusion/sqllogictest/test_files/string/string_view.slt

If you just uncomment this code out that would mean we are covering bit_length for all the string types

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @alamb
As I understand it, applying bit_length() to a column will currently fail because arrow::compute::kernels::length::bit_length does not yet support UTF-8 views (depends on PR in arrow-rs). However, it's still used in the following code path:

ColumnarValue::Array(v) => Ok(ColumnarValue::Array(bit_length(v.as_ref())?)),

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To cover other string types, we can add this test to each string testing file before apache/arrow-rs#6671.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @austin362667

Comment on lines 94 to 102
query IIII
SELECT
BIT_LENGTH(arrow_cast('Andrew', 'Utf8View')),
BIT_LENGTH(arrow_cast('datafusion数据融合', 'Utf8View')),
BIT_LENGTH(arrow_cast('💖', 'Utf8View')),
BIT_LENGTH(arrow_cast('josé', 'Utf8View'))
;
----
48 176 32 40
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can move this test to string/string_literal.slt and then add similar tests for other string types as well (UTF8, LargeUTF8, and DictionaryString).

@@ -1096,3 +1096,20 @@ Rapha Raphael NULL datafusionДатаФус datafusionДатаФусион NULL
und under_score NULL un iść core un iść core NULL
p percent NULL pan Tadeusz ma iść w kąt pan Tadeusz ma iść w kąt NULL
NULL NULL NULL NULL NULL NULL

# TODO: Support Utf8View for bit_length array string function
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To cover other string types, we can add this test to each string testing file before apache/arrow-rs#6671.

;
----
48 176 32 40

query error DataFusion error: Arrow error: Compute error: bit_length not supported for Utf8View
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can add a TODO comment here and file an issue to track apache/arrow-rs#6671. After upgrading to the corresponding arrow-rs version, we should address the TODO comment in string/string_query.slt.part#L1100.

@alamb
Copy link
Contributor

alamb commented Nov 8, 2024

@austin362667 any chance you have some time to address @goldmedal 's suggestions?

@austin362667
Copy link
Contributor Author

Thanks @goldmedal @alamb for the review. 🙏 Just addressed these issues
So sorry for the late reply.

Signed-off-by: Austin Liu <[email protected]>
Copy link
Contributor

@jayzhan211 jayzhan211 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Copy link
Contributor

@goldmedal goldmedal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @austin362667 and @alamb @jayzhan211 for reviewing. Although I think the tests can be improved, I can help to do that in the follow-up PR. Let's merge this PR first.

@goldmedal goldmedal merged commit 1557fce into apache:main Nov 10, 2024
25 checks passed
jayzhan211 pushed a commit to jayzhan211/datafusion that referenced this pull request Nov 12, 2024
* Support `Utf8View` for string function `bit_length()`

Signed-off-by: Austin Liu <[email protected]>

* Add scalar test case

Signed-off-by: Austin Liu <[email protected]>

* Refine tests

Signed-off-by: Austin Liu <[email protected]>

* Fix wrong format

Signed-off-by: Austin Liu <[email protected]>

---------

Signed-off-by: Austin Liu <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
functions sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support Utf8View for string function bit_length
4 participants