-
Notifications
You must be signed in to change notification settings - Fork 28.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-50189][BUILD][SQL] Upgrade ICU4J to 76.1
#48721
Conversation
Let me update the benchmark result of
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@stefankandic @uros-db Could you take a look at the PR, please.
test("invalid collationId") { | ||
ignore("invalid collationId") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why are we making this change in this PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 for the above question.
test("CollationKey generates correct collation key for collated string") { | ||
ignore("CollationKey generates correct collation key for collated string") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why are we making this change in this PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is not the final 'answer' to this PR, I am still investigating. 😄
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wouldn't expect to see any changes here. Do we know why these hashes have changed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reason for this change is that the CollationKey
returned by Collator.getCollationKey(...)
are different. As for why they are different, I am investigating it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CollationKeys#writeSortKeyUpToQuaternary
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Let's use the following code to reproduce it (collation:
UNICODE
)
object CollationKeySuite {
def main(args: Array[String]): Unit = {
val builder = new ULocale.Builder
builder.setLocale(ULocale.ROOT)
builder.setUnicodeLocaleKeyword("ks", "level3")
val resultLocale = builder.build
val collator = Collator.getInstance(resultLocale)
collator.freeze
val s = UTF8String.fromString("aa")
val hash = collator.getCollationKey(s.toValidString).hashCode()
println(hash)
}
}
- ICU4j 76.1, result:
10628395
- ICU4j 75.1, result:
10381418
-
Through debugging, it was found that different versions of
icu4j
load different versions of underlying data resource files, such asnfc.nrm
A.ICU4j 76.1
->15.1.0.0
B.ICU4j 75.1
->16.0.0.0
-
I
guess
it should be related to the PR below (Unicode 15.1
->Unicode 16
)
ICU-22707 Unicode 16 alpha unicode-org/icu#2930
ICU-22707 Unicode 16 beta jun04 unicode-org/icu#3028
ICU-22707 Unicode 16 aug16 unicode-org/icu#3110
ICU-22707 adjust UTS46 for Unicode 16 unicode-org/icu#3130
ICU-22769 Rename of the ICU4J data folder to not contain a version unicode-org/icu#3000 -
ref docs
https://unicode-org.github.io/icu/download/76.html#release-overview
https://unicode-org.github.io/icu/download/76.html#common-changes
...lyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CollationExpressionSuite.scala
Show resolved
Hide resolved
Murmur3HashTestCase("SQL ", "UNICODE_RTRIM", -1923567940), | ||
Murmur3HashTestCase("SQL", "UNICODE_CI", 1029527950), | ||
Murmur3HashTestCase("SQL ", "UNICODE_CI_RTRIM", 1029527950) | ||
Murmur3HashTestCase("SQL", "UNICODE", 1483684981), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is also related to CollationKey
.
@uros-db @dongjoon-hyun @stefankandic @MaxGekk |
UTF8_LCASE 22007 22009 3 0.0 220067.8 3.2X | ||
UNICODE 376402 377858 2060 0.0 3764015.4 54.5X | ||
UNICODE_CI 444485 444809 458 0.0 4444850.8 64.4X | ||
UTF8_BINARY 12000 12018 26 0.0 120000.9 1.0X |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Relatively, UTF8_BINARY
becomes slower than before. Do you happen to know any reason?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let me run the benchmark a few more times.
@@ -1979,62 +1979,63 @@ class CollationSuite extends DatasourceV2SQLBase with AdaptiveSparkPlanHelper { | |||
|
|||
// verify that the output ordering is as expected (UTF8_BINARY, UTF8_LCASE, etc.) | |||
val df = sql("SELECT * FROM collations() limit 10") | |||
val icvVersion = "76.1.0.0" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for adding this variable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The change looks technically correct to me. Thank you, @panbingkun .
Let's wait for more community reviews to make it sure.
|
UTF8_LCASE 6052 6052 0 0.0 151298.7 6.0X | ||
UNICODE 74506 74551 64 0.0 1862644.2 74.3X | ||
UNICODE_CI 83607 83756 211 0.0 2090164.5 83.4X | ||
UTF8_BINARY 1778 1779 2 0.0 44450.1 1.0X |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The regression ±70% seems significant. Could you figure out this is because of upgrade of ICU4J or just the bechmark is outdated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I have already run the benchmark on the master
. Let me take a look. Wait a moment. Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@panbingkun Thank you for checking this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be nice to identify which commit causes the regression. @panbingkun Could you open an JIRA for investigation, please. Also if you have time, please, do the investigation. Seems the regression was introduced recently.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BTW, that new JIRA issue will not be a blocker for this dependency change PR, right, @MaxGekk ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
right
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the confirmation.
To @MaxGekk and all.
|
76.1
76.1
Merged to master for Apache Spark 4.0.0 on February 2025. Thank you, @panbingkun , @MaxGekk , @uros-db . |
Thanks! |
…ationNameToId` outside of cases ### What changes were proposed in this pull request? In this PR, UTF8_BINARY performance regression is addressed, that was first identified here #48721. The regression is traced back to this PR #48222 when it first occurred, however this isn't the actual source of performance degradation. ### Why are the changes needed? The PR #48222 caused the regression because it changed the `collationNameToId` function and made it slightly slower by removing a short-circuit for fetching the UTF8_BINARY collation. However this function should be called fixed amount of times for each query and from the benchmark framework at most once - this was not the case and it was the largest contributor to performance regression. This PR addresses the benchmarking framework to not call this function at each expression, but once per the test case. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing testing surface, benchmarks. ### Was this patch authored or co-authored using generative AI tooling? No Closes #48804 from stevomitric/stevomitric/fix-utf8_binary-regression. Authored-by: Stevo Mitric <[email protected]> Signed-off-by: Max Gekk <[email protected]>
What changes were proposed in this pull request?
The pr aims to upgrade
ICU4J
from75.1
to76.1
.Why are the changes needed?
The full release notes:
https://github.com/unicode-org/icu/releases/tag/release-76-1
https://unicode-org.github.io/icu/download/76.html
We need to keep the version up-to-date.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Pass GA.
Was this patch authored or co-authored using generative AI tooling?
No.