Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-8301][SQL] Improve UTF8String substring/startsWith/endsWith/contains performance #6804

Closed
wants to merge 11 commits into from
23 changes: 15 additions & 8 deletions unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java
Original file line number Diff line number Diff line change
Expand Up @@ -131,24 +131,31 @@ public boolean contains(final UTF8String substring) {
}

for (int i = 0; i <= bytes.length - b.length; i++) {
// TODO: Avoid copying.
if (bytes[i] == b[0] && Arrays.equals(Arrays.copyOfRange(bytes, i, i + b.length), b)) {
if (bytes[i] == b[0] && startsWith(substring, i)) {
return true;
}
}
return false;
}

private boolean startsWith(final UTF8String prefix, int offset) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: If we use startsWith(final byte[] prefix, int offset), it will may save some calls of s.getBytes(). (not sure how is the difference).

byte[] b = prefix.getBytes();
if (b.length + offset > bytes.length || offset < 0) {
return false;
}
int i = 0;
while (i < b.length && b[i] == bytes[i + offset]) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As @rxin suggested, we could use unsafe API to compare 8 bytes at once. But I think in most cases the suffix will be short and this while loop will terminate very fast, it may not worth that complication.

The while loop is good to me.

i++;
}
return i == b.length;
}

public boolean startsWith(final UTF8String prefix) {
final byte[] b = prefix.getBytes();
// TODO: Avoid copying.
return b.length <= bytes.length && Arrays.equals(Arrays.copyOfRange(bytes, 0, b.length), b);
return startsWith(prefix, 0);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right now, UTF8String is only used by catalyst, can it will check prefix is null or not. Since this is a public method, it's still good to check prefix is null or not.

}

public boolean endsWith(final UTF8String suffix) {
final byte[] b = suffix.getBytes();
return b.length <= bytes.length &&
Arrays.equals(Arrays.copyOfRange(bytes, bytes.length - b.length, bytes.length), b);
return startsWith(suffix, bytes.length - suffix.getBytes().length);
}

public UTF8String toUpperCase() {
Expand Down