-
Notifications
You must be signed in to change notification settings - Fork 28.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-50216][SQL][TESTS] Update CollationBenchmark
to invoke collationNameToId
outside of cases
#48804
[SPARK-50216][SQL][TESTS] Update CollationBenchmark
to invoke collationNameToId
outside of cases
#48804
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@stevomitric Could you regenerate results of the benchmark CollationBenchmark
, please. We should expect better numbers after your changes, right?
They are running, I will post them here once complete. We should expect same or better results. |
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java
Outdated
Show resolved
Hide resolved
Gentle ping, @stevomitric . Is the regenerated result ready? Also, cc @panbingkun , FYI. |
@@ -113,13 +113,13 @@ abstract class CollationBenchmarkBase extends BenchmarkBase { | |||
warmupTime = 10.seconds, | |||
output = output) | |||
collationTypes.foreach { collationType => { | |||
val collation = CollationFactory.fetchCollation(collationType) | |||
val collationId = CollationFactory.collationNameToId(collationType) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we avoid touching the benchmark in the PR, @stevomitric ?
If we need this, we can proceed separately this CollationBenchmark
change before your regression fix.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The regression was partially due to benchmarks as well. The CollationFactory.collationNameToId
function should only be called fixed amount of times per query, which shouldn't measurably impact the performance of the query.
So i think this is the right PR for it, since it was the leading contributor to the numbers seen in the results.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, it's opposite. If that's true, we definitely should proceed it independently in order to measure how much they are affecting, @stevomitric . Please make a benchmark update PR first.
The regression was partially due to benchmarks as well. The CollationFactory.collationNameToId function should only be called fixed amount of times per query, which shouldn't measurably impact the performance of the query.
So i think this is the right PR for it, since it was the leading contributor to the numbers seen in the results.
Yes, I also agree, because in the previous PR (#48222) that may have generated performance regression, we did not see any modifications to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great job finding the issue! Changes look good, but I also think it would make sense to do benchmark changes in a separate PR
How about to update bechmark results after the revert 95b259e |
CollationBenchmark
to invoke collationNameToId
outside of cases
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for updating, @stefankandic . +1, LGTM.
Decided to make this PR update the benchmarks, and not touch the CollationFactory class. Modifying the function in the way you proposed #48804 (comment), resulted in a very small perf benefit.
Although I updated the PR title based on the AS-IS status, the PR description is still outdated. Could you make the PR description up-to-date with the AS-IS status, @stefankandic ? |
Updated the description and added non-ascii collation benchmarks, @dongjoon-hyun . |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The failed GA Run / Protobuf breaking change detection and Python CodeGen check is not related to the changes, I believe.
+1, LGTM. Merging to master.
Thank you, @stevomitric and @dongjoon-hyun @HyukjinKwon @stefankandic for review.
Thank you all! |
What changes were proposed in this pull request?
In this PR, UTF8_BINARY performance regression is addressed, that was first identified here #48721. The regression is traced back to this PR #48222 when it first occurred, however this isn't the actual source of performance degradation.
Why are the changes needed?
The PR #48222 caused the regression because it changed the
collationNameToId
function and made it slightly slower by removing a short-circuit for fetching the UTF8_BINARY collation. However this function should be called fixed amount of times for each query and from the benchmark framework at most once - this was not the case and it was the largest contributor to performance regression.This PR addresses the benchmarking framework to not call this function at each expression, but once per the test case.
Does this PR introduce any user-facing change?
No
How was this patch tested?
Existing testing surface, benchmarks.
Was this patch authored or co-authored using generative AI tooling?
No