-
Notifications
You must be signed in to change notification settings - Fork 28.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-8301][SQL] Improve UTF8String substring/startsWith/endsWith/contains performance #6804
Changes from 1 commit
b17909e
79cb55b
d2fb05f
a5f853a
e4530d2
3a0040f
9f17cc8
1c327eb
9ca0473
6d7b068
f5d6b9a
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -131,24 +131,31 @@ public boolean contains(final UTF8String substring) { | |
} | ||
|
||
for (int i = 0; i <= bytes.length - b.length; i++) { | ||
// TODO: Avoid copying. | ||
if (bytes[i] == b[0] && Arrays.equals(Arrays.copyOfRange(bytes, i, i + b.length), b)) { | ||
if (bytes[i] == b[0] && startsWith(substring, i)) { | ||
return true; | ||
} | ||
} | ||
return false; | ||
} | ||
|
||
private boolean startsWith(final UTF8String prefix, int offset) { | ||
byte[] b = prefix.getBytes(); | ||
if (b.length + offset > bytes.length || offset < 0) { | ||
return false; | ||
} | ||
int i = 0; | ||
while (i < b.length && b[i] == bytes[i + offset]) { | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. As @rxin suggested, we could use unsafe API to compare 8 bytes at once. But I think in most cases the suffix will be short and this while loop will terminate very fast, it may not worth that complication. The while loop is good to me. |
||
i++; | ||
} | ||
return i == b.length; | ||
} | ||
|
||
public boolean startsWith(final UTF8String prefix) { | ||
final byte[] b = prefix.getBytes(); | ||
// TODO: Avoid copying. | ||
return b.length <= bytes.length && Arrays.equals(Arrays.copyOfRange(bytes, 0, b.length), b); | ||
return startsWith(prefix, 0); | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Right now, UTF8String is only used by catalyst, can it will check prefix is null or not. Since this is a public method, it's still good to check prefix is null or not. |
||
} | ||
|
||
public boolean endsWith(final UTF8String suffix) { | ||
final byte[] b = suffix.getBytes(); | ||
return b.length <= bytes.length && | ||
Arrays.equals(Arrays.copyOfRange(bytes, bytes.length - b.length, bytes.length), b); | ||
return startsWith(suffix, bytes.length - suffix.getBytes().length); | ||
} | ||
|
||
public UTF8String toUpperCase() { | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: If we use
startsWith(final byte[] prefix, int offset)
, it will may save some calls ofs.getBytes()
. (not sure how is the difference).