[SPARK-50216][SQL][TESTS] Update `CollationBenchmark` to invoke `collationNameToId` outside of cases #48804

stevomitric · 2024-11-08T14:23:51Z

What changes were proposed in this pull request?

This PR addresses the UTF8_BINARY performance regression that was first identified in #48721. The regression first appeared with #48222; however, that PR is not the actual source of the performance degradation.

Why are the changes needed?

PR #48222 caused the regression because it changed the collationNameToId function and made it slightly slower by removing a short-circuit for fetching the UTF8_BINARY collation. However, this function should be called a fixed number of times per query, and at most once from the benchmark framework. This was not the case, and the repeated calls were the largest contributor to the performance regression.
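To illustrate the kind of short-circuit mentioned above, here is a minimal sketch in which the most common collation name returns its ID before any general name resolution runs. The class `CollationLookupSketch`, the ID values, the `KNOWN` map, and the `slowResolve` helper are all hypothetical stand-ins; this is not the actual `CollationFactory` code.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

public class CollationLookupSketch {
    // Hypothetical IDs; the real CollationFactory assigns its own values.
    static final int UTF8_BINARY_ID = 0;
    static final Map<String, Integer> KNOWN = new HashMap<>();
    // Counts how often the general (slower) path is taken.
    static int slowResolveCalls = 0;

    static {
        KNOWN.put("UTF8_BINARY", UTF8_BINARY_ID);
        KNOWN.put("UTF8_LCASE", 1);
        KNOWN.put("UNICODE", 2);
    }

    // Hypothetical stand-in for general name resolution (normalization plus lookup).
    static int slowResolve(String name) {
        slowResolveCalls++;
        Integer id = KNOWN.get(name.toUpperCase(Locale.ROOT));
        if (id == null) {
            throw new IllegalArgumentException("Unknown collation: " + name);
        }
        return id;
    }

    // Early-out: an exact match on the default collation skips the general path.
    static int collationNameToId(String name) {
        if ("UTF8_BINARY".equals(name)) {
            return UTF8_BINARY_ID;
        }
        return slowResolve(name);
    }

    public static void main(String[] args) {
        System.out.println(collationNameToId("UTF8_BINARY"));
        System.out.println(collationNameToId("unicode"));
        System.out.println("slow path calls: " + slowResolveCalls);
    }
}
```

In this sketch only the second lookup reaches `slowResolve`. Whether such an early-out matters in practice depends on how often the function is called, which is the point of the discussion here.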

This PR changes the benchmarking framework so that it calls this function once per test case instead of once per expression.
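The difference between resolving the collation inside the measured loop and resolving it once per case can be sketched as follows. Everything here is an assumption made for illustration: `HoistedLookupSketch`, `nameToId`, and `compare` are hypothetical stand-ins and not Spark APIs, and the call counter exists only to make the effect visible.

```java
import java.util.HashMap;
import java.util.Map;

public class HoistedLookupSketch {
    static final Map<String, Integer> IDS = new HashMap<>();
    // Counts name-to-ID lookups so the two variants can be compared.
    static long lookupCalls = 0;

    static {
        IDS.put("UTF8_BINARY", 0);
        IDS.put("UTF8_LCASE", 1);
    }

    // Hypothetical stand-in for a name-to-ID lookup (known names only).
    static int nameToId(String name) {
        lookupCalls++;
        return IDS.get(name);
    }

    // Hypothetical stand-in for the expression under test.
    static int compare(String a, String b, int collationId) {
        return collationId == 0 ? a.compareTo(b) : a.compareToIgnoreCase(b);
    }

    // Anti-pattern: the lookup runs once per comparison, so its cost is measured too.
    static long lookupInsideLoop(String[] inputs, String collationName) {
        long acc = 0;
        for (String a : inputs) {
            for (String b : inputs) {
                acc += compare(a, b, nameToId(collationName));
            }
        }
        return acc;
    }

    // Hoisted: the lookup runs once per case; only the comparison stays in the hot loop.
    static long lookupOutsideLoop(String[] inputs, String collationName) {
        int collationId = nameToId(collationName);
        long acc = 0;
        for (String a : inputs) {
            for (String b : inputs) {
                acc += compare(a, b, collationId);
            }
        }
        return acc;
    }

    public static void main(String[] args) {
        String[] inputs = {"abc", "ABC", "def", "DEF"};
        lookupCalls = 0;
        long r1 = lookupInsideLoop(inputs, "UTF8_LCASE");
        long inside = lookupCalls;
        lookupCalls = 0;
        long r2 = lookupOutsideLoop(inputs, "UTF8_LCASE");
        long outside = lookupCalls;
        System.out.println("same result: " + (r1 == r2));
        System.out.println("lookups inside loop: " + inside + ", hoisted: " + outside);
    }
}
```

Both variants compute the same value, but the first performs one lookup per comparison while the second performs one per case, so a slower lookup inflates only the first variant's timings.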

Does this PR introduce any user-facing change?

No

How was this patch tested?

Existing testing surface, benchmarks.

Was this patch authored or co-authored using generative AI tooling?

No

MaxGekk

@stevomitric Could you regenerate the results of the CollationBenchmark benchmark, please? We should expect better numbers after your changes, right?

stevomitric · 2024-11-08T15:30:26Z

> @stevomitric Could you regenerate the results of the CollationBenchmark benchmark, please? We should expect better numbers after your changes, right?

They are running; I will post them here once they are complete. We should expect the same or better results.

common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java

dongjoon-hyun · 2024-11-11T16:39:33Z

Gentle ping, @stevomitric . Is the regenerated result ready?

Also, cc @panbingkun , FYI.

dongjoon-hyun · 2024-11-11T16:41:35Z

sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/CollationBenchmark.scala

@@ -113,13 +113,13 @@ abstract class CollationBenchmarkBase extends BenchmarkBase {
      warmupTime = 10.seconds,
      output = output)
    collationTypes.foreach { collationType => {
-      val collation = CollationFactory.fetchCollation(collationType)
+      val collationId = CollationFactory.collationNameToId(collationType)


Can we avoid touching the benchmark in the PR, @stevomitric ?

If we need this, we can land the CollationBenchmark change separately, before your regression fix.

stevomitric

The regression was partially due to the benchmarks as well. The CollationFactory.collationNameToId function should only be called a fixed number of times per query, which shouldn't measurably impact the performance of the query.

So I think this is the right PR for it, since it was the leading contributor to the numbers seen in the results.

dongjoon-hyun

No, it's the opposite. If that's true, we should definitely handle it independently in order to measure how much each change affects the results, @stevomitric . Please make a benchmark update PR first.

> The regression was partially due to the benchmarks as well. The CollationFactory.collationNameToId function should only be called a fixed number of times per query, which shouldn't measurably impact the performance of the query.
> So I think this is the right PR for it, since it was the leading contributor to the numbers seen in the results.

panbingkun · 2024-11-12T01:51:27Z

> No, it's the opposite. If that's true, we should definitely handle it independently in order to measure how much each change affects the results, @stevomitric . Please make a benchmark update PR first.
>
> > The regression was partially due to the benchmarks as well. The CollationFactory.collationNameToId function should only be called a fixed number of times per query, which shouldn't measurably impact the performance of the query.
> > So I think this is the right PR for it, since it was the leading contributor to the numbers seen in the results.

Yes, I also agree, because the previous PR (#48222), which may have introduced the performance regression, did not modify CollationBenchmark. Therefore, in order to address this issue more clearly, can we do it separately? Thanks!

stefankandic

Great job finding the issue! The changes look good, but I also think it would make sense to do the benchmark changes in a separate PR.

MaxGekk · 2024-11-12T16:08:55Z

How about updating the benchmark results after the revert 95b259e?

stevomitric · 2024-11-13T10:38:44Z

> How about updating the benchmark results after the revert 95b259e?

I decided to make this PR update the benchmarks and not touch the CollationFactory class. Modifying the function in the way you proposed here resulted in only a very small perf benefit.

dongjoon-hyun

Thank you for updating, @stefankandic . +1, LGTM.

> Decided to make this PR update the benchmarks, and not touch the CollationFactory class. Modifying the function in the way you proposed #48804 (comment), resulted in a very small perf benefit.

dongjoon-hyun · 2024-11-13T19:34:46Z

Although I updated the PR title based on the AS-IS status, the PR description is still outdated. Could you make the PR description up-to-date with the AS-IS status, @stefankandic ?

stevomitric · 2024-11-14T09:49:21Z

> Could you make the PR description up-to-date with the AS-IS status?

Updated the description and added non-ascii collation benchmarks, @dongjoon-hyun .

MaxGekk

The failed GA checks, "Run / Protobuf breaking change detection and Python CodeGen check", are not related to the changes, I believe.

+1, LGTM. Merging to master.
Thank you, @stevomitric, and @dongjoon-hyun, @HyukjinKwon, @stefankandic for the review.

dongjoon-hyun · 2024-11-14T17:32:59Z

Thank you all!

stevomitric added 4 commits November 8, 2024 13:35

early-out check for utf8_binary

f582fe9

fix benchmark

9cbad8a

revert changes

44d56fb

add early-out check

47693a9

github-actions bot added the SQL label Nov 8, 2024

MaxGekk reviewed Nov 8, 2024

common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java (outdated)

dongjoon-hyun requested changes Nov 11, 2024

add benchmarks

2951355

dongjoon-hyun previously requested changes Nov 11, 2024

stefankandic approved these changes Nov 12, 2024

revert collation benchmark

95b259e

MaxGekk mentioned this pull request Nov 13, 2024

[SPARK-49490][SQL] Add benchmarks for initCap #48501

Closed

stevomitric added 4 commits November 13, 2024 11:08

revert collation-factory changes, modify the benchmarks

882f980

Merge branch 'master' into stevomitric/fix-utf8_binary-regression

ef5b065

revert benchmarks

d67fd93

revert changes

daf0b8d

add benchmark

2139918

stevomitric requested review from MaxGekk and dongjoon-hyun November 13, 2024 18:27

dongjoon-hyun changed the title ~~[SPARK-50216][SQL] Address UTF8_BINARY performance regression~~ [SPARK-50216][SQL][TESTS] Update CollationBenchmark to invoke collationNameToId outside of cases Nov 13, 2024

dongjoon-hyun approved these changes Nov 13, 2024

HyukjinKwon approved these changes Nov 13, 2024

add non-ascii benchmarks

5680954

MaxGekk approved these changes Nov 14, 2024

MaxGekk reviewed Nov 14, 2024

MaxGekk closed this in c1968a1 Nov 14, 2024
