Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-8301][SQL] Improve UTF8String substring/startsWith/endsWith/contains performance #6804

Closed
wants to merge 11 commits into from

Conversation

tarekbecker
Copy link
Contributor

Jira: https://issues.apache.org/jira/browse/SPARK-8301

Added the private method startsWith(prefix, offset) to implement startsWith, endsWith and contains without copying the array

I hope that the component SQL is still correct. I copied it from the Jira ticket.

…te function startsWith(prefix, offset) to implement the check for startsWith, endsWith and contains.
@rxin
Copy link
Contributor

rxin commented Jun 13, 2015

Jenkins, test this please.

@SparkQA
Copy link

SparkQA commented Jun 14, 2015

Test build #34848 has finished for PR 6804 at commit b17909e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin
Copy link
Contributor

rxin commented Jun 14, 2015

cc @davies for review.

@davies
Copy link
Contributor

davies commented Jun 14, 2015

@tarekauel Thanks for working on this, could you have some micro benchmark to show the performance difference? It's good to see the improvements by numbers.

return true;
}
}
return false;
}

private boolean startsWith(final UTF8String prefix, int offset) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: If we use startsWith(final byte[] prefix, int offset), it will may save some calls of s.getBytes(). (not sure how is the difference).

@tarekbecker
Copy link
Contributor Author

@davies thanks for your feedback.
I added a null check and test cases for null. There are other public methods without null checks. Should I add a null check to them as well?

I did a micro benchmark for starsWith. I created objects for the UTF8String and called the test 1,000,000,000 times.

duration starts with old: 73,586 ms
duration starts with new: 23,246 ms (just 31.6%)

@davies
Copy link
Contributor

davies commented Jun 14, 2015

@tarekauel LGTM. It will be better if you could also add null check for others, thanks for the numbers, good to know the improvements!

@tarekbecker
Copy link
Contributor Author

@davies I added the null checks. If someone compares a UTF8String and null, the null value will come first and the string afterwards. For set I decided to initialise an empty byte array, in order to avoid a crash later.

@davies
Copy link
Contributor

davies commented Jun 15, 2015

@tarekauel I think we could mark set as protected(only used in a few places, include codegen), make sure that the parameters is not null. Could you do it in this PR?

@tarekbecker
Copy link
Contributor Author

@davies I have changed the access modifier and I changed the annotation of bytes to Nonnull from Nullable. I added all non null checks, didn't I?

@davies
Copy link
Contributor

davies commented Jun 15, 2015

@tarekauel Thanks for the change, you need to update all the usage of (new UTF8String().set) to UTF8String.fromString/Bytes, or it can not compile or fail the tests.

@SparkQA
Copy link

SparkQA commented Jun 15, 2015

Test build #914 has finished for PR 6804 at commit a5f853a.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tarekbecker
Copy link
Contributor Author

@davies done. Intellij found only one reference. Can you trigger the build process again?

@davies
Copy link
Contributor

davies commented Jun 15, 2015

you can git grep UTF8String | grep set

@tarekbecker
Copy link
Contributor Author

There's just one other file sql/core/src/main/scala/org/apache/spark/sql/columnar/ColumnType.scalathis uses already UTF8.fromBytes(...)

@davies
Copy link
Contributor

davies commented Jun 15, 2015

Sorry, it should be git grep stringType | grep set

@tarekbecker
Copy link
Contributor Author

Thanks!

@SparkQA
Copy link

SparkQA commented Jun 15, 2015

Test build #915 has finished for PR 6804 at commit e4530d2.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tarekbecker
Copy link
Contributor Author

Can someone trigger Jenkins?

@rxin
Copy link
Contributor

rxin commented Jun 15, 2015

Jenkins, retest this please.

@SparkQA
Copy link

SparkQA commented Jun 15, 2015

Test build #34953 has finished for PR 6804 at commit 3a0040f.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tarekbecker
Copy link
Contributor Author

@davies I fixed it. The change was correct, wasn't it? Can someone start Jenkins again?

@@ -437,17 +437,17 @@ case class Cast(child: Expression, dataType: DataType) extends UnaryExpression w

case (BinaryType, StringType) =>
defineCodeGen (ctx, ev, c =>
s"new ${ctx.stringType}().set($c)")
s"new ${ctx.stringType}().fromBytes($c)")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be ${ctx.stringType}.fromBytes($c), also others

@SparkQA
Copy link

SparkQA commented Jun 16, 2015

Test build #916 has finished for PR 6804 at commit 9f17cc8.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

PlatformDependent.throwException(e);
protected UTF8String set(final String str) {
if (str == null) {
bytes = new byte[0];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This's probably not right, as a null string will be converted into empty string "".
(new UTF8String().set(null).toString() == "") ? Can we revert this change?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's safe to remove this check, because set is protected now, we can make sure that str is not null.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

okay I am going to remove this

@chenghao-intel
Copy link
Contributor

@tarekauel Glad to see the big improvement for the performance gain! I have some comments on the API change, e.g. the null value checking, comparison with null, probably we'd better not change its original behavior for UTF8String, seems behavior as java.lang.String does will be reasonable to me, what do you think?

@tarekbecker
Copy link
Contributor Author

@chenghao-intel First of all thanks for all your comments and thoughts. I guess the UTF8String is only used in SPARK SQL, isn't it? Because of that I think the behavior should be similar to a database and not to JAVA. But I don't know if this should be achieved by modifying the SQL layer or this type implementation.

What do you think?

@marmbrus
Copy link
Contributor

+1, UTF8String should follow SQL semantics not JVM semantics.

@chenghao-intel
Copy link
Contributor

I know the null checking stuff will make our code more safe, but we usually don't do that in the API implementation, as it saves lots of redundant of code, that's almost the general conversion, otherwise we need to make a note in the scaladoc/javadoc.

Expression framework (including the UTF8String) is a quite independent system to SQL, follow the general API design principles will be OK, and I believe it's the responsibility for SQLLayer to make the right calls.

@tarekbecker
Copy link
Contributor Author

Yeah, that's a good comment. I checked the String expressions and it does already the null check, see

https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringOperations.scala#L173-L182

So it won't be null and a valid result will be provided, if a value is null. I think the result is even better, because it returns null and not false which is way better.

@chenghao-intel
Copy link
Contributor

Probably be Nonnull, this will give more strict check by FindBug. It will be great if you can run FindBug locally after the change.

@@ -437,17 +437,17 @@ case class Cast(child: Expression, dataType: DataType) extends UnaryExpression w

case (BinaryType, StringType) =>
defineCodeGen (ctx, ev, c =>
s"new ${ctx.stringType}().set($c)")
s"${ctx.stringType}().fromBytes($c)")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No () here

@tarekbecker
Copy link
Contributor Author

All should be fine now. Can someone trigger Jenkins?

@davies
Copy link
Contributor

davies commented Jun 21, 2015

LGTM, waiting for tests.

@SparkQA
Copy link

SparkQA commented Jun 21, 2015

Test build #949 has finished for PR 6804 at commit f5d6b9a.

  • This patch passes all tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants