Cast Utf8View to Utf8 to support || from StringViewArray #11796
Thanks @dharanad -- this is certainly better than what happens on main (where the query doesn't run at all).
I do think the plans could be improved if they cast to Utf8View instead, but we can do that as a follow-on PR as well.
(LargeUtf8, from_type) | (from_type, LargeUtf8) => {
    string_concat_internal_coercion(from_type, &LargeUtf8)
match (lhs_type, rhs_type) {
    // If Utf8View is in any side, we coerce to Utf8.
I think it would be better to coerce to Utf8View, as that coercion will often be faster (it is faster to cast Utf8 -> Utf8View than the other way around).
Is that possible?
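To illustrate why the direction of the cast can matter, here is a deliberately simplified model, not arrow-rs's actual implementation. `Utf8Like`, `Utf8ViewLike`, `utf8_to_view`, and `view_to_utf8` are hypothetical names invented for this sketch, and the real Utf8View layout (16-byte views, inlined short strings, multiple data buffers) is omitted. The assumption it demonstrates: going from a contiguous layout to views can share the existing bytes and only build views, while going back has to copy every string into a new contiguous buffer.

```rust
use std::sync::Arc;

/// Hypothetical stand-in for a Utf8 array: one contiguous byte buffer plus offsets.
struct Utf8Like {
    values: Arc<Vec<u8>>,
    offsets: Vec<usize>, // number of strings + 1 entries
}

/// Hypothetical stand-in for a Utf8View array: (start, len) views into a shared buffer.
struct Utf8ViewLike {
    buffer: Arc<Vec<u8>>,
    views: Vec<(usize, usize)>,
}

/// Contiguous -> view: shares the byte buffer and only computes the views.
fn utf8_to_view(a: &Utf8Like) -> Utf8ViewLike {
    let views = a
        .offsets
        .windows(2)
        .map(|w| (w[0], w[1] - w[0]))
        .collect();
    Utf8ViewLike {
        buffer: Arc::clone(&a.values), // no string bytes are copied
        views,
    }
}

/// View -> contiguous: every string's bytes are copied into a fresh buffer.
fn view_to_utf8(v: &Utf8ViewLike) -> Utf8Like {
    let mut values = Vec::new();
    let mut offsets = vec![0];
    for &(start, len) in &v.views {
        values.extend_from_slice(&v.buffer[start..start + len]);
        offsets.push(values.len());
    }
    Utf8Like {
        values: Arc::new(values),
        offsets,
    }
}
```

In this model, `utf8_to_view` does work proportional to the number of strings, while `view_to_utf8` does work proportional to the total number of bytes.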
Agree, similar to this policy:
datafusion/datafusion/expr/src/type_coercion/binary.rs
Lines 935 to 947 in b50887f
fn string_coercion(lhs_type: &DataType, rhs_type: &DataType) -> Option<DataType> {
    use arrow::datatypes::DataType::*;
    match (lhs_type, rhs_type) {
        // If Utf8View is in any side, we coerce to Utf8View.
        (Utf8View, Utf8View | Utf8 | LargeUtf8) | (Utf8 | LargeUtf8, Utf8View) => {
            Some(Utf8View)
        }
        // Then, if LargeUtf8 is in any side, we coerce to LargeUtf8.
        (LargeUtf8, Utf8 | LargeUtf8) | (Utf8, LargeUtf8) => Some(LargeUtf8),
        (Utf8, Utf8) => Some(Utf8),
        _ => None,
    }
}
> I think it would be better to coerce to Utf8View, as that coercion will often be faster (it is faster to cast Utf8 -> Utf8View than the other way around). Is that possible?
How about we do that in a separate PR? Previously we coerced to Utf8View, so concat was failing. As a temporary workaround to resolve the issue, I've coerced to Utf8 instead.
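For reference, a minimal self-contained sketch of the two policies being discussed for `||`. It uses a local stand-in enum (`StrType`) rather than arrow's `DataType`, and both function names are hypothetical. The first function follows the rule as the diff comment states it ("If Utf8View is in any side, we coerce to Utf8"); the second mirrors the `string_coercion` policy quoted earlier. The handling of mixed Utf8View/LargeUtf8 pairs in the first function is an assumption, not something the diff shown here confirms.

```rust
/// Local stand-in for the three Arrow string types (not arrow's DataType).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum StrType {
    Utf8,
    LargeUtf8,
    Utf8View,
}

/// Workaround policy: any Utf8View operand makes `||` coerce to Utf8.
/// (Assumed from the diff comment; hypothetical function name.)
fn concat_coercion_workaround(lhs: StrType, rhs: StrType) -> StrType {
    use StrType::*;
    match (lhs, rhs) {
        (Utf8View, _) | (_, Utf8View) => Utf8,
        (LargeUtf8, _) | (_, LargeUtf8) => LargeUtf8,
        (Utf8, Utf8) => Utf8,
    }
}

/// Follow-up policy suggested in review: any Utf8View operand coerces to Utf8View,
/// matching the comparison-style string coercion. (Hypothetical function name.)
fn concat_coercion_follow_up(lhs: StrType, rhs: StrType) -> StrType {
    use StrType::*;
    match (lhs, rhs) {
        (Utf8View, _) | (_, Utf8View) => Utf8View,
        (LargeUtf8, _) | (_, LargeUtf8) => LargeUtf8,
        (Utf8, Utf8) => Utf8,
    }
}
```

The two functions differ only in the first arm, which is why switching policies later should be a small change once concat itself accepts Utf8View inputs.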
Makes sense!
filed #11881 to track
select column2||' is fast' from temp;
----
rust is fast
datafusion is fast
❤️
explain select column2 || 'is' || column3 from temp;
----
logical_plan
01)Projection: CAST(temp.column2 AS Utf8) || Utf8("is") || CAST(temp.column3 AS Utf8)
This would be a better plan (likely faster) if the casting was to Utf8View rather than Utf8.
I agree
explain select column2||' is fast' from temp;
----
logical_plan
01)Projection: CAST(temp.column2 AS Utf8) || Utf8(" is fast")
likewise here it would be better to use Utf8View
Thanks again @dharanad and @XiangpengHao -- sorry for the delay
Which issue does this PR close?
Partially make #11766 work.
Rationale for this change
What changes are included in this PR?
Cast Utf8View to Utf8 to make concat work
Are these changes tested?
Existing test cases
Are there any user-facing changes?