Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-50216][SQL][TESTS] Update CollationBenchmark to invoke collationNameToId outside of cases #48804

Conversation

stevomitric
Copy link
Contributor

@stevomitric stevomitric commented Nov 8, 2024

What changes were proposed in this pull request?

In this PR, UTF8_BINARY performance regression is addressed, that was first identified here #48721. The regression is traced back to this PR #48222 when it first occurred, however this isn't the actual source of performance degradation.

Why are the changes needed?

The PR #48222 caused the regression because it changed the collationNameToId function and made it slightly slower by removing a short-circuit for fetching the UTF8_BINARY collation. However this function should be called fixed amount of times for each query and from the benchmark framework at most once - this was not the case and it was the largest contributor to performance regression.

This PR addresses the benchmarking framework to not call this function at each expression, but once per the test case.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Existing testing surface, benchmarks.

Was this patch authored or co-authored using generative AI tooling?

No

@github-actions github-actions bot added the SQL label Nov 8, 2024
Copy link
Member

@MaxGekk MaxGekk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@stevomitric Could you regenerate results of the benchmark CollationBenchmark, please. We should expect better numbers after your changes, right?

@stevomitric
Copy link
Contributor Author

@stevomitric Could you regenerate results of the benchmark CollationBenchmark, please. We should expect better numbers after your changes, right?

They are running, I will post them here once complete. We should expect same or better results.

@dongjoon-hyun
Copy link
Member

Gentle ping, @stevomitric . Is the regenerated result ready?

Also, cc @panbingkun , FYI.

@@ -113,13 +113,13 @@ abstract class CollationBenchmarkBase extends BenchmarkBase {
warmupTime = 10.seconds,
output = output)
collationTypes.foreach { collationType => {
val collation = CollationFactory.fetchCollation(collationType)
val collationId = CollationFactory.collationNameToId(collationType)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we avoid touching the benchmark in the PR, @stevomitric ?

If we need this, we can proceed separately this CollationBenchmark change before your regression fix.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The regression was partially due to benchmarks as well. The CollationFactory.collationNameToId function should only be called fixed amount of times per query, which shouldn't measurably impact the performance of the query.

So i think this is the right PR for it, since it was the leading contributor to the numbers seen in the results.

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, it's opposite. If that's true, we definitely should proceed it independently in order to measure how much they are affecting, @stevomitric . Please make a benchmark update PR first.

The regression was partially due to benchmarks as well. The CollationFactory.collationNameToId function should only be called fixed amount of times per query, which shouldn't measurably impact the performance of the query.
So i think this is the right PR for it, since it was the leading contributor to the numbers seen in the results.

@panbingkun
Copy link
Contributor

No, it's opposite. If that's true, we definitely should proceed it independently in order to measure how much they are affecting, @stevomitric . Please make a benchmark update PR first.

The regression was partially due to benchmarks as well. The CollationFactory.collationNameToId function should only be called fixed amount of times per query, which shouldn't measurably impact the performance of the query.
So i think this is the right PR for it, since it was the leading contributor to the numbers seen in the results.

Yes, I also agree, because in the previous PR (#48222) that may have generated performance regression, we did not see any modifications to CollationBenchmark. Therefore, in order to address this issue more clearly, can we do it separately? Thanks!

Copy link
Contributor

@stefankandic stefankandic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great job finding the issue! Changes look good, but I also think it would make sense to do benchmark changes in a separate PR

@MaxGekk
Copy link
Member

MaxGekk commented Nov 12, 2024

How about to update bechmark results after the revert 95b259e

@stevomitric
Copy link
Contributor Author

stevomitric commented Nov 13, 2024

How about to update bechmark results after the revert 95b259e

Decided to make this PR update the benchmarks, and not touch the CollationFactory class. Modifying the function in the way you proposed here, resulted in a very small perf benefit.

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-50216][SQL] Address UTF8_BINARY performance regression [SPARK-50216][SQL][TESTS] Update CollationBenchmark to invoke collationNameToId outside of cases Nov 13, 2024
Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for updating, @stefankandic . +1, LGTM.

Decided to make this PR update the benchmarks, and not touch the CollationFactory class. Modifying the function in the way you proposed #48804 (comment), resulted in a very small perf benefit.

@dongjoon-hyun
Copy link
Member

Although I updated the PR title based on the AS-IS status, the PR description is still outdated. Could you make the PR description up-to-date with the AS-IS status, @stefankandic ?

@stevomitric
Copy link
Contributor Author

Could you make the PR description up-to-date with the AS-IS status?

Updated the description and added non-ascii collation benchmarks, @dongjoon-hyun .

Copy link
Member

@MaxGekk MaxGekk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The failed GA Run / Protobuf breaking change detection and Python CodeGen check is not related to the changes, I believe.

+1, LGTM. Merging to master.
Thank you, @stevomitric and @dongjoon-hyun @HyukjinKwon @stefankandic for review.

@MaxGekk MaxGekk closed this in c1968a1 Nov 14, 2024
@dongjoon-hyun
Copy link
Member

Thank you all!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants