[SPARK-8301][SQL] Improve UTF8String substring/startsWith/endsWith/contains performance #6804
…te function startsWith(prefix, offset) to implement the check for startsWith, endsWith and contains.
Jenkins, test this please. |
Test build #34848 has finished for PR 6804 at commit
cc @davies for review. |
@tarekauel Thanks for working on this, could you have some micro benchmark to show the performance difference? It's good to see the improvements by numbers. |
return true;
}
}
return false;
}

private boolean startsWith(final UTF8String prefix, int offset) {
Nit: If we use startsWith(final byte[] prefix, int offset), it may save some calls to s.getBytes() (not sure how big the difference is).
@davies Thanks for your feedback. I did a micro benchmark for startsWith. I created objects for the UTF8String and called the test 1,000,000,000 times. duration starts with old: 73,586 ms |
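For readers who want to reproduce this kind of measurement, here is a minimal timing sketch. It is not the benchmark used in this PR: the class `StartsWithBench`, its method names, and the assumption that the old code path copied the prefix-sized slice before comparing are all hypothetical, and plain `byte[]` stands in for UTF8String.

```java
import java.util.Arrays;

/** Hypothetical harness; plain byte arrays stand in for UTF8String. */
final class StartsWithBench {
    // Assumed shape of a copy-based check: slice the array, then compare.
    static boolean copying(byte[] s, byte[] prefix) {
        if (prefix.length > s.length) return false;
        return Arrays.equals(Arrays.copyOfRange(s, 0, prefix.length), prefix);
    }

    // In-place check: compare byte by byte without allocating.
    static boolean inPlace(byte[] s, byte[] prefix) {
        if (prefix.length > s.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if (s[i] != prefix[i]) return false;
        }
        return true;
    }

    // Runs the chosen variant `iterations` times and returns elapsed milliseconds.
    static long timeMillis(boolean useCopy, byte[] s, byte[] prefix, long iterations) {
        long hits = 0;
        long start = System.nanoTime();
        for (long i = 0; i < iterations; i++) {
            if (useCopy ? copying(s, prefix) : inPlace(s, prefix)) hits++;
        }
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        // Use the counter so the loop body cannot be trivially discarded.
        if (hits < 0) throw new IllegalStateException("unreachable");
        return elapsedMs;
    }
}
```

A hand-rolled loop like this is sensitive to JIT warm-up and dead-code elimination, so the numbers are only indicative; a dedicated harness such as JMH is usually more trustworthy.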
@tarekauel LGTM. It would be better if you could also add null checks for the others. Thanks for the numbers, it's good to see the improvements!
@davies I added the null checks. If someone compares a UTF8String and null, the null value will come first and the string afterwards. For set I decided to initialise an empty byte array, in order to avoid a crash later. |
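The ordering described in this comment (null first, then the string) can be sketched as follows. This is an illustration only: `NullFirstCompare` is a hypothetical helper, and the unsigned byte-wise ordering for non-null values is an assumption, not taken from the PR.

```java
/** Hypothetical helper showing a null-first ordering over raw UTF-8 bytes. */
final class NullFirstCompare {
    // null sorts before every non-null value; two nulls are equal.
    // Non-null values use unsigned lexicographic byte order (an assumption).
    static int compare(byte[] a, byte[] b) {
        if (a == b) return 0;
        if (a == null) return -1;
        if (b == null) return 1;
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int cmp = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (cmp != 0) return cmp;
        }
        // Equal up to the shorter length: the shorter array sorts first.
        return a.length - b.length;
    }
}
```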
@tarekauel I think we could mark |
…on of bytes from Nullable to Nonnull
@davies I have changed the access modifier and I changed the annotation of bytes to Nonnull from Nullable. I added all non null checks, didn't I? |
@tarekauel Thanks for the change. You need to update all usages of (new UTF8String().set) to UTF8String.fromString/fromBytes, otherwise the code will not compile or the tests will fail.
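The pattern being asked for here, a protected setter reachable only through static factories, might look roughly like this. `FactorySketch` is a hypothetical stand-in for UTF8String, and returning null from the factories for null input is an assumption made for the illustration.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical stand-in for UTF8String showing factory-only construction. */
class FactorySketch {
    private byte[] bytes = new byte[0];

    // Protected: outside code cannot call set directly, so the factories
    // below are the only place that has to guard against null input.
    protected FactorySketch set(byte[] b) {
        this.bytes = b;
        return this;
    }

    // Assumption: a null input yields a null result rather than an empty string.
    static FactorySketch fromBytes(byte[] b) {
        return b == null ? null : new FactorySketch().set(b);
    }

    static FactorySketch fromString(String s) {
        return s == null ? null : new FactorySketch().set(s.getBytes(StandardCharsets.UTF_8));
    }

    @Override
    public String toString() {
        return new String(bytes, StandardCharsets.UTF_8);
    }
}
```

Because the factories are static, callers write `FactorySketch.fromBytes(b)` on the type name, with no `new` and no instance.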
Test build #914 has finished for PR 6804 at commit
@davies Done. IntelliJ found only one reference. Can you trigger the build process again?
you can |
There's just one other file |
Sorry, it should be |
Thanks! |
Test build #915 has finished for PR 6804 at commit
Can someone trigger Jenkins? |
Jenkins, retest this please. |
Test build #34953 has finished for PR 6804 at commit
@davies I fixed it. The change was correct, wasn't it? Can someone start Jenkins again? |
@@ -437,17 +437,17 @@ case class Cast(child: Expression, dataType: DataType) extends UnaryExpression w

case (BinaryType, StringType) =>
defineCodeGen (ctx, ev, c =>
- s"new ${ctx.stringType}().set($c)")
+ s"new ${ctx.stringType}().fromBytes($c)")
This should be ${ctx.stringType}.fromBytes($c); the same applies to the others.
Test build #916 has finished for PR 6804 at commit
PlatformDependent.throwException(e);
protected UTF8String set(final String str) {
if (str == null) {
bytes = new byte[0];
This is probably not right, as a null string will be converted into the empty string "".
(new UTF8String().set(null).toString() == "")? Can we revert this change?
It's safe to remove this check: because set is protected now, we can make sure that str is not null.
Okay, I am going to remove this.
@tarekauel Glad to see the big improvement for the performance gain! I have some comments on the API change, e.g. the |
@chenghao-intel First of all thanks for all your comments and thoughts. I guess the What do you think? |
+1, UTF8String should follow SQL semantics not JVM semantics. |
I know the null checks will make our code safer, but we usually don't do that in the API implementation, as it saves a lot of redundant code; that's almost the general convention. Otherwise we need to make a note in the scaladoc/javadoc. Expression framework (including the |
Yeah, that's a good comment. I checked the String expressions and they already do the null check, see So the value won't be null, and a valid result is provided if a value is null. I think the result is even better, because it returns null rather than false.
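The point made here, that the null check lives in the expression layer and yields null rather than false, can be illustrated with a small sketch. `NullPropagation` is a hypothetical helper, and `java.lang.String` stands in for UTF8String; this is not the actual Spark expression code.

```java
/** Hypothetical expression-level wrapper; String stands in for UTF8String. */
final class NullPropagation {
    // SQL semantics: if either operand is null the result is null (unknown),
    // not false. The underlying string method is only reached with non-null
    // inputs, so it needs no null check of its own.
    static Boolean startsWith(String value, String prefix) {
        if (value == null || prefix == null) {
            return null;
        }
        return value.startsWith(prefix);
    }
}
```

With this split, `NullPropagation.startsWith(null, "a")` evaluates to null, while the string-level method stays free of defensive checks.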
Probably be |
@@ -437,17 +437,17 @@ case class Cast(child: Expression, dataType: DataType) extends UnaryExpression w

case (BinaryType, StringType) =>
defineCodeGen (ctx, ev, c =>
- s"new ${ctx.stringType}().set($c)")
+ s"${ctx.stringType}().fromBytes($c)")
No () here.
All should be fine now. Can someone trigger Jenkins? |
LGTM, waiting for tests. |
Test build #949 has finished for PR 6804 at commit
Jira: https://issues.apache.org/jira/browse/SPARK-8301
Added the private method startsWith(prefix, offset) to implement startsWith, endsWith and contains without copying the array
I hope that the component SQL is still correct. I copied it from the Jira ticket.
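As an illustration of the approach in the description, the sketch below routes startsWith, endsWith and contains through one private offset-based comparison that never copies the array. `Utf8Sketch` is a hypothetical stand-in for UTF8String, and the private overload takes a `byte[]` prefix as suggested in the review; the code merged in the PR may differ.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical stand-in for UTF8String; only the matching logic is sketched. */
final class Utf8Sketch {
    private final byte[] bytes;

    private Utf8Sketch(byte[] bytes) {
        this.bytes = bytes;
    }

    static Utf8Sketch fromString(String s) {
        return new Utf8Sketch(s.getBytes(StandardCharsets.UTF_8));
    }

    // Compares prefix against this.bytes starting at the given byte offset.
    // No slice of the array is allocated.
    private boolean startsWith(byte[] prefix, int offset) {
        if (offset < 0 || prefix.length + offset > bytes.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (bytes[offset + i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    boolean startsWith(Utf8Sketch prefix) {
        return startsWith(prefix.bytes, 0);
    }

    boolean endsWith(Utf8Sketch suffix) {
        // A suffix longer than this string gives a negative offset, which is rejected.
        return startsWith(suffix.bytes, bytes.length - suffix.bytes.length);
    }

    boolean contains(Utf8Sketch sub) {
        // Try every offset at which sub could still fit.
        for (int i = 0; i + sub.bytes.length <= bytes.length; i++) {
            if (startsWith(sub.bytes, i)) {
                return true;
            }
        }
        return false;
    }
}
```

Matching on raw bytes is sufficient for valid UTF-8, because an encoded character can never begin in the middle of another character's byte sequence.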