[SPARK-50189][BUILD][SQL] Upgrade ICU4J to `76.1` #48721

panbingkun · 2024-10-31T11:34:33Z

What changes were proposed in this pull request?

This PR upgrades ICU4J from 75.1 to 76.1.

Why are the changes needed?

The full release notes:
https://github.com/unicode-org/icu/releases/tag/release-76-1
https://unicode-org.github.io/icu/download/76.html
We need to keep the version up-to-date.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Pass GA.

Was this patch authored or co-authored using generative AI tooling?

No.

panbingkun · 2024-10-31T11:35:24Z

Let me update the benchmark result of CollationBenchmark and CollationNonASCIIBenchmark

CollationBenchmark
JDK 17: https://github.com/panbingkun/spark/actions/runs/11611219685
JDK 21: https://github.com/panbingkun/spark/actions/runs/11611222464
CollationNonASCIIBenchmark
JDK 17: https://github.com/panbingkun/spark/actions/runs/11611241052
JDK 21: https://github.com/panbingkun/spark/actions/runs/11611243469

MaxGekk

@stefankandic @uros-db Could you take a look at the PR, please.

uros-db · 2024-10-31T15:00:35Z

common/unsafe/src/test/scala/org/apache/spark/unsafe/types/CollationFactorySuite.scala

-  test("invalid collationId") {
+  ignore("invalid collationId") {


why are we making this change in this PR?

+1 for the above question.

uros-db · 2024-10-31T15:00:54Z

...lyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CollationExpressionSuite.scala

-  test("CollationKey generates correct collation key for collated string") {
+  ignore("CollationKey generates correct collation key for collated string") {


why are we making this change in this PR?

It is not the final 'answer' to this PR, I am still investigating. 😄

uros-db · 2024-10-31T15:02:31Z

sql/core/src/test/scala/org/apache/spark/sql/CollationSQLExpressionsSuite.scala

I wouldn't expect to see any changes here. Do we know why these hashes have changed?

The reason for this change is that the CollationKey returned by Collator.getCollationKey(...) is different between the two versions. I am still investigating why it differs.

CollationKeys#writeSortKeyUpToQuaternary

Let's use the following code to reproduce it (collation: UNICODE)

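import com.ibm.icu.text.Collator
import com.ibm.icu.util.ULocale
import org.apache.spark.unsafe.types.UTF8String

object CollationKeySuite {
  def main(args: Array[String]): Unit = {
    // Build the root locale with tertiary strength ("ks" = "level3").
    val builder = new ULocale.Builder
    builder.setLocale(ULocale.ROOT)
    builder.setUnicodeLocaleKeyword("ks", "level3")
    val resultLocale = builder.build
    val collator = Collator.getInstance(resultLocale)
    collator.freeze
    // Hash the collation key of a short ASCII string and print it.
    val s = UTF8String.fromString("aa")
    val hash = collator.getCollationKey(s.toValidString).hashCode()
    println(hash)
  }
}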

ICU4j 76.1, result:

10628395

ICU4j 75.1, result:

10381418
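To see which bytes of the sort key differ between the two versions, rather than only the resulting hash, the key can be dumped as hex. This is a minimal sketch, assuming ICU4J is on the classpath; the object name `CollationKeyDump` and the helper `sortKeyHex` are hypothetical names introduced for illustration, and no output is shown because it depends on the ICU4J version in use.

```scala
import com.ibm.icu.text.Collator
import com.ibm.icu.util.ULocale

object CollationKeyDump {
  // Hex dump of the sort key bytes ICU produces for `s` under the given collator.
  def sortKeyHex(collator: Collator, s: String): String =
    collator.getCollationKey(s).toByteArray.map(b => f"${b & 0xff}%02x").mkString(" ")

  def main(args: Array[String]): Unit = {
    // Same locale setup as the reproduction above: root locale, tertiary strength.
    val locale = new ULocale.Builder()
      .setLocale(ULocale.ROOT)
      .setUnicodeLocaleKeyword("ks", "level3")
      .build()
    val collator = Collator.getInstance(locale)
    println(sortKeyHex(collator, "aa"))
  }
}
```

Running this once against each ICU4J version and comparing the two lines should show where the keys diverge.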

Through debugging, I found that different versions of ICU4J load different versions of the underlying data resource files, such as nfc.nrm:
A. ICU4j 76.1 -> 16.0.0.0
B. ICU4j 75.1 -> 15.1.0.0

I guess it is related to the PRs below (Unicode 15.1 -> Unicode 16):
ICU-22707 Unicode 16 alpha unicode-org/icu#2930
ICU-22707 Unicode 16 beta jun04 unicode-org/icu#3028
ICU-22707 Unicode 16 aug16 unicode-org/icu#3110
ICU-22707 adjust UTS46 for Unicode 16 unicode-org/icu#3130
ICU-22769 Rename of the ICU4J data folder to not contain a version unicode-org/icu#3000
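The data versions mentioned above can also be read at runtime without a debugger. The following is a small sketch, assuming ICU4J is on the classpath; the object name `IcuVersionCheck` is hypothetical, and the printed values depend on which ICU4J jar is loaded.

```scala
import com.ibm.icu.lang.UCharacter
import com.ibm.icu.text.Collator
import com.ibm.icu.util.{ULocale, VersionInfo}

object IcuVersionCheck {
  def main(args: Array[String]): Unit = {
    // Version of the ICU4J library itself.
    println(s"ICU4J version:    ${VersionInfo.ICU_VERSION}")
    // Unicode version of the character data bundled with this ICU4J build.
    println(s"Unicode version:  ${UCharacter.getUnicodeVersion}")
    // Versions reported by the root collator: the UCA data it is based on
    // and the collator's own version.
    val collator = Collator.getInstance(ULocale.ROOT)
    println(s"UCA version:      ${collator.getUCAVersion}")
    println(s"Collator version: ${collator.getVersion}")
  }
}
```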

ref docs
https://unicode-org.github.io/icu/download/76.html#release-overview

https://unicode-org.github.io/icu/download/76.html#common-changes

...lyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CollationExpressionSuite.scala

panbingkun · 2024-11-01T11:32:48Z

sql/core/src/test/scala/org/apache/spark/sql/CollationSQLExpressionsSuite.scala

-      Murmur3HashTestCase("SQL ", "UNICODE_RTRIM", -1923567940),
-      Murmur3HashTestCase("SQL", "UNICODE_CI", 1029527950),
-      Murmur3HashTestCase("SQL ", "UNICODE_CI_RTRIM", 1029527950)
+      Murmur3HashTestCase("SQL", "UNICODE", 1483684981),


This is also related to CollationKey.

panbingkun · 2024-11-01T11:41:43Z

@uros-db @dongjoon-hyun @stefankandic @MaxGekk
The detailed explanation has been updated, and this PR is ready for review.
Thank you very much for reviewing, if you have free time. ❤️

dongjoon-hyun · 2024-11-03T04:58:23Z

sql/core/benchmarks/CollationBenchmark-results.txt

-UTF8_LCASE                                        22007          22009           3          0.0      220067.8       3.2X
-UNICODE                                          376402         377858        2060          0.0     3764015.4      54.5X
-UNICODE_CI                                       444485         444809         458          0.0     4444850.8      64.4X
+UTF8_BINARY                                       12000          12018          26          0.0      120000.9       1.0X


Relatively, UTF8_BINARY becomes slower than before. Do you happen to know any reason?

Let me run the benchmark a few more times.

dongjoon-hyun · 2024-11-03T04:59:40Z

sql/core/src/test/scala/org/apache/spark/sql/CollationSuite.scala

@@ -1979,62 +1979,63 @@ class CollationSuite extends DatasourceV2SQLBase with AdaptiveSparkPlanHelper {

    // verify that the output ordering is as expected (UTF8_BINARY, UTF8_LCASE, etc.)
    val df = sql("SELECT * FROM collations() limit 10")
+    val icvVersion = "76.1.0.0"


Thank you for adding this variable.

dongjoon-hyun

The change looks technically correct to me. Thank you, @panbingkun .

Let's wait for more community reviews to make sure.

panbingkun · 2024-11-04T01:46:00Z

branch-50189
- org.apache.spark.sql.execution.benchmark.CollationBenchmark
  JDK 17: https://github.com/panbingkun/spark/actions/runs/11656707625
  JDK 21: https://github.com/panbingkun/spark/actions/runs/11656709727
- org.apache.spark.sql.execution.benchmark.CollationNonASCIIBenchmark
  JDK 17: https://github.com/panbingkun/spark/actions/runs/11656712196
  JDK 21: https://github.com/panbingkun/spark/actions/runs/11656714145
master
- org.apache.spark.sql.execution.benchmark.CollationBenchmark
  JDK 17: https://github.com/panbingkun/spark/actions/runs/11660016606
  JDK 21: https://github.com/panbingkun/spark/actions/runs/11660017905
- org.apache.spark.sql.execution.benchmark.CollationNonASCIIBenchmark
  JDK 17: https://github.com/panbingkun/spark/actions/runs/11660023396
  JDK 21: https://github.com/panbingkun/spark/actions/runs/11660024905

MaxGekk · 2024-11-04T08:57:43Z

sql/core/benchmarks/CollationNonASCIIBenchmark-results.txt

-UTF8_LCASE                                         6052           6052           0          0.0      151298.7       6.0X
-UNICODE                                           74506          74551          64          0.0     1862644.2      74.3X
-UNICODE_CI                                        83607          83756         211          0.0     2090164.5      83.4X
+UTF8_BINARY                                        1778           1779           2          0.0       44450.1       1.0X


The regression of ~70% seems significant. Could you figure out whether this is caused by the ICU4J upgrade or whether the benchmark results are just outdated?

Yes, I have already run the benchmark on the master. Let me take a look. Wait a moment. Thanks!

From the testing, it seems that the benchmark is outdated, the data is as follows:

@panbingkun Thank you for checking this.

It would be nice to identify which commit caused the regression. @panbingkun Could you open a JIRA for the investigation, please? Also, if you have time, please do the investigation. It seems the regression was introduced recently.

BTW, that new JIRA issue will not be a blocker for this dependency change PR, right, @MaxGekk ?

Thank you for the confirmation.

panbingkun · 2024-11-04T11:34:29Z

dongjoon-hyun · 2024-11-04T17:21:56Z

To @MaxGekk and all.

SPARK-50216 (Investigate UTF8_BINARY regression) is filed as a blocker issue for Apache Spark 4.0.0.

dongjoon-hyun · 2024-11-04T17:24:48Z

Merged to master for Apache Spark 4.0.0 on February 2025.

Thank you, @panbingkun , @MaxGekk , @uros-db .

panbingkun · 2024-11-05T00:27:41Z

To @MaxGekk and all.

SPARK-50216 (Investigate UTF8_BINARY regression) is filed as a blocker issue for Apache Spark 4.0.0.

Thanks!

…ationNameToId` outside of cases

### What changes were proposed in this pull request?

This PR addresses the UTF8_BINARY performance regression that was first identified in #48721. The regression is traced back to PR #48222, where it first occurred; however, that PR is not the actual source of the performance degradation.

### Why are the changes needed?

PR #48222 caused the regression because it changed the `collationNameToId` function and made it slightly slower by removing a short-circuit for fetching the UTF8_BINARY collation. However, this function should be called a fixed number of times for each query, and at most once from the benchmark framework. This was not the case, and it was the largest contributor to the performance regression. This PR changes the benchmarking framework to call this function once per test case rather than at each expression.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing testing surface, benchmarks.

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #48804 from stevomitric/stevomitric/fix-utf8_binary-regression.

Authored-by: Stevo Mitric <[email protected]>
Signed-off-by: Max Gekk <[email protected]>
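The effect described in that commit message, a lookup being repeated inside the measured loop instead of being resolved once per case, can be sketched as follows. This is an illustrative model only, not Spark's benchmark code: `HoistLookupSketch`, its `collationNameToId` stand-in, the id table, and `compare` are all hypothetical.

```scala
import java.util.Locale

object HoistLookupSketch {
  // Counts how often the (hypothetical) name-to-id lookup is invoked.
  var lookups = 0
  private val ids = Map("UTF8_BINARY" -> 0, "UTF8_LCASE" -> 1)

  // Hypothetical stand-in for a collation name lookup.
  def collationNameToId(name: String): Int = {
    lookups += 1
    ids(name)
  }

  // Stand-in for the per-comparison work the benchmark wants to measure.
  def compare(a: String, b: String, collationId: Int): Int =
    if (collationId == 0) a.compareTo(b)
    else a.toLowerCase(Locale.ROOT).compareTo(b.toLowerCase(Locale.ROOT))

  // Before: the lookup runs inside the measured loop, once per comparison.
  def lookupPerExpression(inputs: Seq[String], name: String): Int = {
    var acc = 0
    for (a <- inputs; b <- inputs) acc += compare(a, b, collationNameToId(name))
    acc
  }

  // After: the lookup is resolved once per case and the id is reused.
  def lookupOncePerCase(inputs: Seq[String], name: String): Int = {
    val id = collationNameToId(name)
    var acc = 0
    for (a <- inputs; b <- inputs) acc += compare(a, b, id)
    acc
  }
}
```

With n inputs, the first variant performs n * n lookups and the second performs one, so any slowdown in the lookup is amplified in the measured time of the first variant even though the comparison work is identical.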

[SPARK-50189][SQL] Upgrade ICU4J to 76.1

07ca0fa

github-actions bot added SQL BUILD labels Oct 31, 2024

MaxGekk reviewed Oct 31, 2024

update

5b5ab76

uros-db reviewed Oct 31, 2024

panbingkun added 3 commits November 1, 2024 10:00

Merge branch 'master' into SPARK-50189

ef346ed

fix

7c8b90c

Merge branch 'master' into SPARK-50189

96eeb97

panbingkun commented Nov 1, 2024

...lyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CollationExpressionSuite.scala

panbingkun commented Nov 1, 2024

panbingkun marked this pull request as ready for review November 1, 2024 11:36

panbingkun requested a review from uros-db November 1, 2024 11:36

dongjoon-hyun reviewed Nov 3, 2024

dongjoon-hyun approved these changes Nov 3, 2024

Merge branch 'master' into SPARK-50189

08e578a

panbingkun added 2 commits November 4, 2024 09:53

add comment

38651fb

update benchmark of Collation*

29bb062

MaxGekk reviewed Nov 4, 2024

dongjoon-hyun changed the title ~~[SPARK-50189][SQL] Upgrade ICU4J to 76.1~~ [SPARK-50189][BUILD][SQL] Upgrade ICU4J to 76.1 Nov 4, 2024

dongjoon-hyun closed this in 3985a76 Nov 4, 2024

stevomitric mentioned this pull request Nov 8, 2024

[SPARK-50216][SQL][TESTS] Update CollationBenchmark to invoke collationNameToId outside of cases #48804

Closed
