[SPARK-50189][BUILD][SQL] Upgrade ICU4J to 76.1 #48721

Closed
@@ -33,7 +33,7 @@ import org.apache.spark.unsafe.types.UTF8String.{fromString => toUTF8}

class CollationFactorySuite extends AnyFunSuite with Matchers { // scalastyle:ignore funsuite

- val currentIcuVersion: String = "75.1"
+ val currentIcuVersion: String = "76.1"

test("collationId stability") {
assert(INDETERMINATE_COLLATION_ID == -1)
2 changes: 1 addition & 1 deletion dev/deps/spark-deps-hadoop-3-hive-2.3
@@ -100,7 +100,7 @@ hk2-locator/3.0.6//hk2-locator-3.0.6.jar
hk2-utils/3.0.6//hk2-utils-3.0.6.jar
httpclient/4.5.14//httpclient-4.5.14.jar
httpcore/4.4.16//httpcore-4.4.16.jar
- icu4j/75.1//icu4j-75.1.jar
+ icu4j/76.1//icu4j-76.1.jar
ini4j/0.5.4//ini4j-0.5.4.jar
istack-commons-runtime/3.0.8//istack-commons-runtime-3.0.8.jar
ivy/2.5.2//ivy-2.5.2.jar
2 changes: 1 addition & 1 deletion pom.xml
@@ -217,7 +217,7 @@
<datasketches.version>6.1.1</datasketches.version>
<netty.version>4.1.110.Final</netty.version>
<netty-tcnative.version>2.0.66.Final</netty-tcnative.version>
- <icu4j.version>75.1</icu4j.version>
+ <icu4j.version>76.1</icu4j.version>
<junit-jupiter.version>5.11.0</junit-jupiter.version>
<junit-platform.version>1.11.0</junit-platform.version>
<sbt-jupiter-interface.version>0.13.0</sbt-jupiter-interface.version>
@@ -168,6 +168,7 @@ class CollationExpressionSuite extends SparkFunSuite with ExpressionEvalHelper {
}

test("CollationKey generates correct collation key for collated string") {
+ val b: Byte = 0x2B
val testCases = Seq(
("", "UTF8_BINARY", UTF8String.fromString("").getBytes),
("aa", "UTF8_BINARY", UTF8String.fromString("aa").getBytes),
@@ -180,15 +181,15 @@ class CollationExpressionSuite extends SparkFunSuite with ExpressionEvalHelper {
(" AA ", "UTF8_LCASE_RTRIM", UTF8String.fromString(" aa").getBytes),
("aA", "UTF8_LCASE", UTF8String.fromString("aa").getBytes),
("", "UNICODE", Array[Byte](1, 1, 0)),
- ("aa", "UNICODE", Array[Byte](42, 42, 1, 6, 1, 6, 0)),
- ("AA", "UNICODE", Array[Byte](42, 42, 1, 6, 1, -36, -36, 0)),
- ("aA", "UNICODE", Array[Byte](42, 42, 1, 6, 1, -59, -36, 0)),
- ("aa ", "UNICODE_RTRIM", Array[Byte](42, 42, 1, 6, 1, 6, 0)),
+ ("aa", "UNICODE", Array[Byte](b, b, 1, 6, 1, 6, 0)),
+ ("AA", "UNICODE", Array[Byte](b, b, 1, 6, 1, -36, -36, 0)),
+ ("aA", "UNICODE", Array[Byte](b, b, 1, 6, 1, -59, -36, 0)),
+ ("aa ", "UNICODE_RTRIM", Array[Byte](b, b, 1, 6, 1, 6, 0)),
("", "UNICODE_CI", Array[Byte](1, 0)),
- ("aa", "UNICODE_CI", Array[Byte](42, 42, 1, 6, 0)),
- ("aa ", "UNICODE_CI_RTRIM", Array[Byte](42, 42, 1, 6, 0)),
- ("AA", "UNICODE_CI", Array[Byte](42, 42, 1, 6, 0)),
- ("aA", "UNICODE_CI", Array[Byte](42, 42, 1, 6, 0))
+ ("aa", "UNICODE_CI", Array[Byte](b, b, 1, 6, 0)),
+ ("aa ", "UNICODE_CI_RTRIM", Array[Byte](b, b, 1, 6, 0)),
+ ("AA", "UNICODE_CI", Array[Byte](b, b, 1, 6, 0)),
+ ("aA", "UNICODE_CI", Array[Byte](b, b, 1, 6, 0))
)
for ((input, collation, expected) <- testCases) {
val str = Literal.create(input, StringType(collation))
60 changes: 30 additions & 30 deletions sql/core/benchmarks/CollationBenchmark-jdk21-results.txt
@@ -1,54 +1,54 @@
- OpenJDK 64-Bit Server VM 21.0.4+7-LTS on Linux 6.5.0-1025-azure
+ OpenJDK 64-Bit Server VM 21.0.5+11-LTS on Linux 6.5.0-1025-azure
AMD EPYC 7763 64-Core Processor
collation unit benchmarks - equalsFunction: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative time
--------------------------------------------------------------------------------------------------------------------------
- UTF8_BINARY 1353 1353 1 0.1 13526.6 1.0X
- UTF8_LCASE 2703 2705 3 0.0 27032.4 2.0X
- UNICODE 16848 16894 65 0.0 168482.9 12.5X
- UNICODE_CI 16362 16367 8 0.0 163615.6 12.1X
+ UTF8_BINARY 1351 1352 2 0.1 13509.2 1.0X
+ UTF8_LCASE 2481 2485 6 0.0 24807.0 1.8X
+ UNICODE 16534 16542 11 0.0 165342.4 12.2X
+ UNICODE_CI 16540 16567 39 0.0 165395.4 12.2X

- OpenJDK 64-Bit Server VM 21.0.4+7-LTS on Linux 6.5.0-1025-azure
+ OpenJDK 64-Bit Server VM 21.0.5+11-LTS on Linux 6.5.0-1025-azure
AMD EPYC 7763 64-Core Processor
collation unit benchmarks - compareFunction: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative time
---------------------------------------------------------------------------------------------------------------------------
- UTF8_BINARY 2640 2642 3 0.0 26401.5 1.0X
- UTF8_LCASE 3616 3618 2 0.0 36164.8 1.4X
- UNICODE 17465 17470 7 0.0 174650.9 6.6X
- UNICODE_CI 17251 17264 18 0.0 172510.9 6.5X
+ UTF8_BINARY 1760 1764 6 0.1 17601.2 1.0X
+ UTF8_LCASE 2638 2640 3 0.0 26379.1 1.5X
+ UNICODE 16741 16747 7 0.0 167414.4 9.5X
+ UNICODE_CI 16516 16517 0 0.0 165164.6 9.4X

- OpenJDK 64-Bit Server VM 21.0.4+7-LTS on Linux 6.5.0-1025-azure
+ OpenJDK 64-Bit Server VM 21.0.5+11-LTS on Linux 6.5.0-1025-azure
AMD EPYC 7763 64-Core Processor
collation unit benchmarks - hashFunction: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative time
------------------------------------------------------------------------------------------------------------------------
- UTF8_BINARY 2843 2844 1 0.0 28427.2 1.0X
- UTF8_LCASE 5417 5437 28 0.0 54170.7 1.9X
- UNICODE 68601 68619 26 0.0 686010.8 24.1X
- UNICODE_CI 56342 56361 26 0.0 563422.2 19.8X
+ UTF8_BINARY 2816 2817 1 0.0 28163.2 1.0X
+ UTF8_LCASE 6427 6428 2 0.0 64270.2 2.3X
+ UNICODE 70197 70203 9 0.0 701969.6 24.9X
+ UNICODE_CI 57917 58002 119 0.0 579174.8 20.6X

- OpenJDK 64-Bit Server VM 21.0.4+7-LTS on Linux 6.5.0-1025-azure
+ OpenJDK 64-Bit Server VM 21.0.5+11-LTS on Linux 6.5.0-1025-azure
AMD EPYC 7763 64-Core Processor
collation unit benchmarks - contains: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative time
------------------------------------------------------------------------------------------------------------------------
- UTF8_BINARY 7674 7674 1 0.0 76735.3 1.0X
- UTF8_LCASE 20367 20376 14 0.0 203665.1 2.7X
- UNICODE 377133 377909 1098 0.0 3771328.8 49.1X
- UNICODE_CI 434710 435099 551 0.0 4347097.2 56.7X
+ UTF8_BINARY 10408 10413 7 0.0 104077.4 1.0X
+ UTF8_LCASE 24861 24866 7 0.0 248605.2 2.4X
+ UNICODE 367328 367350 31 0.0 3673280.8 35.3X
+ UNICODE_CI 427079 427249 240 0.0 4270791.5 41.0X

- OpenJDK 64-Bit Server VM 21.0.4+7-LTS on Linux 6.5.0-1025-azure
+ OpenJDK 64-Bit Server VM 21.0.5+11-LTS on Linux 6.5.0-1025-azure
AMD EPYC 7763 64-Core Processor
collation unit benchmarks - startsWith: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative time
------------------------------------------------------------------------------------------------------------------------
- UTF8_BINARY 6956 6959 4 0.0 69561.7 1.0X
- UTF8_LCASE 14246 14262 23 0.0 142459.6 2.0X
- UNICODE 369940 370072 186 0.0 3699400.9 53.2X
- UNICODE_CI 442072 442365 414 0.0 4420718.1 63.6X
+ UTF8_BINARY 11019 11024 7 0.0 110190.6 1.0X
+ UTF8_LCASE 19670 19672 3 0.0 196696.8 1.8X
+ UNICODE 373181 376402 4555 0.0 3731809.7 33.9X
+ UNICODE_CI 430947 431520 811 0.0 4309470.3 39.1X

- OpenJDK 64-Bit Server VM 21.0.4+7-LTS on Linux 6.5.0-1025-azure
+ OpenJDK 64-Bit Server VM 21.0.5+11-LTS on Linux 6.5.0-1025-azure
AMD EPYC 7763 64-Core Processor
collation unit benchmarks - endsWith: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative time
------------------------------------------------------------------------------------------------------------------------
- UTF8_BINARY 6927 6927 0 0.0 69265.2 1.0X
- UTF8_LCASE 15505 15514 12 0.0 155054.5 2.2X
- UNICODE 382361 382426 93 0.0 3823606.6 55.2X
- UNICODE_CI 449956 450063 151 0.0 4499562.9 65.0X
+ UTF8_BINARY 11105 11106 1 0.0 111048.4 1.0X
+ UTF8_LCASE 19400 19410 15 0.0 193995.4 1.7X
+ UNICODE 384919 384997 111 0.0 3849187.1 34.7X
+ UNICODE_CI 440881 441466 828 0.0 4408806.2 39.7X

60 changes: 30 additions & 30 deletions sql/core/benchmarks/CollationBenchmark-results.txt
@@ -1,54 +1,54 @@
- OpenJDK 64-Bit Server VM 17.0.12+7-LTS on Linux 6.5.0-1025-azure
+ OpenJDK 64-Bit Server VM 17.0.13+11-LTS on Linux 6.5.0-1025-azure
AMD EPYC 7763 64-Core Processor
collation unit benchmarks - equalsFunction: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative time
--------------------------------------------------------------------------------------------------------------------------
- UTF8_BINARY 1372 1372 1 0.1 13718.5 1.0X
- UTF8_LCASE 3115 3116 1 0.0 31154.4 2.3X
- UNICODE 19813 19820 9 0.0 198132.2 14.4X
- UNICODE_CI 19669 19686 24 0.0 196694.2 14.3X
+ UTF8_BINARY 1374 1375 1 0.1 13741.7 1.0X
+ UTF8_LCASE 3163 3166 5 0.0 31630.7 2.3X
+ UNICODE 19230 19242 16 0.0 192304.5 14.0X
+ UNICODE_CI 18920 18930 15 0.0 189197.8 13.8X

- OpenJDK 64-Bit Server VM 17.0.12+7-LTS on Linux 6.5.0-1025-azure
+ OpenJDK 64-Bit Server VM 17.0.13+11-LTS on Linux 6.5.0-1025-azure
AMD EPYC 7763 64-Core Processor
collation unit benchmarks - compareFunction: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative time
---------------------------------------------------------------------------------------------------------------------------
- UTF8_BINARY 1727 1728 1 0.1 17271.3 1.0X
- UTF8_LCASE 3034 3035 1 0.0 30337.2 1.8X
- UNICODE 19230 19243 18 0.0 192301.2 11.1X
- UNICODE_CI 19080 19082 3 0.0 190802.0 11.0X
+ UTF8_BINARY 2674 2675 1 0.0 26744.0 1.0X
+ UTF8_LCASE 4595 4598 5 0.0 45951.3 1.7X
+ UNICODE 20327 20334 11 0.0 203265.8 7.6X
+ UNICODE_CI 20219 20229 14 0.0 202188.0 7.6X

- OpenJDK 64-Bit Server VM 17.0.12+7-LTS on Linux 6.5.0-1025-azure
+ OpenJDK 64-Bit Server VM 17.0.13+11-LTS on Linux 6.5.0-1025-azure
AMD EPYC 7763 64-Core Processor
collation unit benchmarks - hashFunction: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative time
------------------------------------------------------------------------------------------------------------------------
- UTF8_BINARY 3080 3080 0 0.0 30796.4 1.0X
- UTF8_LCASE 6436 6454 25 0.0 64360.0 2.1X
- UNICODE 68095 68167 101 0.0 680951.3 22.1X
- UNICODE_CI 62122 62123 2 0.0 621215.8 20.2X
+ UTF8_BINARY 3106 3109 4 0.0 31059.9 1.0X
+ UTF8_LCASE 6451 6464 19 0.0 64505.1 2.1X
+ UNICODE 67033 67078 63 0.0 670329.8 21.6X
+ UNICODE_CI 52301 52314 18 0.0 523012.5 16.8X

- OpenJDK 64-Bit Server VM 17.0.12+7-LTS on Linux 6.5.0-1025-azure
+ OpenJDK 64-Bit Server VM 17.0.13+11-LTS on Linux 6.5.0-1025-azure
AMD EPYC 7763 64-Core Processor
collation unit benchmarks - contains: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative time
------------------------------------------------------------------------------------------------------------------------
- UTF8_BINARY 8260 8261 1 0.0 82604.0 1.0X
- UTF8_LCASE 23629 23629 0 0.0 236286.4 2.9X
- UNICODE 364843 366078 1747 0.0 3648427.9 44.2X
- UNICODE_CI 425728 426449 1020 0.0 4257275.1 51.5X
+ UTF8_BINARY 12370 12383 19 0.0 123697.8 1.0X
+ UTF8_LCASE 28164 28166 3 0.0 281639.6 2.3X
+ UNICODE 375367 375482 162 0.0 3753668.4 30.3X
+ UNICODE_CI 418811 419177 517 0.0 4188111.1 33.9X

- OpenJDK 64-Bit Server VM 17.0.12+7-LTS on Linux 6.5.0-1025-azure
+ OpenJDK 64-Bit Server VM 17.0.13+11-LTS on Linux 6.5.0-1025-azure
AMD EPYC 7763 64-Core Processor
collation unit benchmarks - startsWith: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative time
------------------------------------------------------------------------------------------------------------------------
- UTF8_BINARY 6844 6848 5 0.0 68440.4 1.0X
- UTF8_LCASE 21849 21870 30 0.0 218486.3 3.2X
- UNICODE 363474 363811 476 0.0 3634738.4 53.1X
- UNICODE_CI 427563 428029 659 0.0 4275629.8 62.5X
+ UTF8_BINARY 11261 11271 15 0.0 112607.7 1.0X
+ UTF8_LCASE 26357 26366 12 0.0 263570.5 2.3X
+ UNICODE 363306 364001 983 0.0 3633059.7 32.3X
+ UNICODE_CI 420866 421292 603 0.0 4208656.7 37.4X

- OpenJDK 64-Bit Server VM 17.0.12+7-LTS on Linux 6.5.0-1025-azure
+ OpenJDK 64-Bit Server VM 17.0.13+11-LTS on Linux 6.5.0-1025-azure
AMD EPYC 7763 64-Core Processor
collation unit benchmarks - endsWith: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative time
------------------------------------------------------------------------------------------------------------------------
- UTF8_BINARY 6904 6907 4 0.0 69039.3 1.0X
- UTF8_LCASE 22007 22009 3 0.0 220067.8 3.2X
- UNICODE 376402 377858 2060 0.0 3764015.4 54.5X
- UNICODE_CI 444485 444809 458 0.0 4444850.8 64.4X
+ UTF8_BINARY 12000 12018 26 0.0 120000.9 1.0X
Member

UTF8_BINARY has become relatively slower than before. Do you happen to know why?

Contributor Author

Let me run the benchmark a few more times.

+ UTF8_LCASE 26257 26274 24 0.0 262571.1 2.2X
+ UNICODE 383239 384294 1492 0.0 3832385.4 31.9X
+ UNICODE_CI 437565 438521 1352 0.0 4375654.8 36.5X
