[SPARK-50189][BUILD][SQL] Upgrade ICU4J to 76.1 #48721

Closed
wants to merge 8 commits into from

Conversation

Contributor

@panbingkun panbingkun commented Oct 31, 2024

What changes were proposed in this pull request?

This PR upgrades ICU4J from 75.1 to 76.1.

Why are the changes needed?

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Pass GA.

Was this patch authored or co-authored using generative AI tooling?

No.

@panbingkun
Contributor Author

panbingkun commented Oct 31, 2024

Let me update the benchmark results of CollationBenchmark and CollationNonASCIIBenchmark.

Member

@MaxGekk MaxGekk left a comment

@stefankandic @uros-db Could you take a look at the PR, please?

- test("invalid collationId") {
+ ignore("invalid collationId") {
Contributor

why are we making this change in this PR?

Member

+1 for the above question.

- test("CollationKey generates correct collation key for collated string") {
+ ignore("CollationKey generates correct collation key for collated string") {
Contributor

why are we making this change in this PR?

Contributor Author

This is not the final 'answer' for this PR; I am still investigating. 😄

Contributor

I wouldn't expect to see any changes here. Do we know why these hashes have changed?

Contributor Author

The reason for this change is that the CollationKey returned by Collator.getCollationKey(...) is different. As for why it is different, I am still investigating.

Contributor Author

CollationKeys#writeSortKeyUpToQuaternary

Contributor Author

@panbingkun panbingkun Nov 1, 2024

  • Let's use the following code to reproduce it (collation: UNICODE)
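import com.ibm.icu.text.Collator
import com.ibm.icu.util.ULocale

import org.apache.spark.unsafe.types.UTF8String

object CollationKeySuite {

  def main(args: Array[String]): Unit = {
    // Root locale with tertiary strength ("ks" = "level3"), matching the UNICODE collation.
    val builder = new ULocale.Builder
    builder.setLocale(ULocale.ROOT)
    builder.setUnicodeLocaleKeyword("ks", "level3")
    val resultLocale = builder.build()
    val collator = Collator.getInstance(resultLocale)
    collator.freeze()
    val s = UTF8String.fromString("aa")
    // The hash is derived from the collation key, so it changes whenever the key changes.
    val hash = collator.getCollationKey(s.toValidString).hashCode()
    println(hash)
  }
}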
  • ICU4J 76.1, result:
10628395
  • ICU4J 75.1, result:
10381418
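
To see what actually differs between two versions, it can help to print the raw sort-key bytes next to the hash. The sketch below is only an illustration: it uses the JDK's `java.text.Collator` as a stand-in so that it runs without extra dependencies (ICU4J's `CollationKey` also exposes `toByteArray()`, `compareTo` and `hashCode()`), and the object and method names are hypothetical. The JDK collator does not produce the same bytes or hashes as ICU4J.

```scala
import java.text.Collator
import java.util.Locale

object CollationKeyInspection {
  // JDK collator used as a stand-in; the reproduction above uses com.ibm.icu.text.Collator.
  private val collator: Collator = {
    val c = Collator.getInstance(Locale.ROOT)
    c.setStrength(Collator.TERTIARY)
    c
  }

  /** Raw sort-key bytes rendered as hex, so the output of two runs can be diffed. */
  def sortKeyHex(s: String): String =
    collator.getCollationKey(s).toByteArray().map(b => f"${b & 0xff}%02x").mkString(" ")

  /** Sign of the ordering between two strings, derived only from comparing their keys. */
  def compareByKey(a: String, b: String): Int =
    Integer.signum(collator.getCollationKey(a).compareTo(collator.getCollationKey(b)))

  def main(args: Array[String]): Unit = {
    val key = collator.getCollationKey("aa")
    println(s"hash     = ${key.hashCode()}")
    println(s"bytes    = ${sortKeyHex("aa")}")
    println(s"aa vs ab = ${compareByKey("aa", "ab")}")
  }
}
```

Running the same program against two library versions and diffing the printed bytes shows which part of the key changed. Tests that assert on the ordering of keys rather than on a hard-coded hash are less sensitive to this kind of change.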

Murmur3HashTestCase("SQL ", "UNICODE_RTRIM", -1923567940),
Murmur3HashTestCase("SQL", "UNICODE_CI", 1029527950),
Murmur3HashTestCase("SQL ", "UNICODE_CI_RTRIM", 1029527950)
Murmur3HashTestCase("SQL", "UNICODE", 1483684981),
Contributor Author

This is also related to CollationKey.

@panbingkun panbingkun marked this pull request as ready for review November 1, 2024 11:36
@panbingkun panbingkun requested a review from uros-db November 1, 2024 11:36
@panbingkun
Contributor Author

@uros-db @dongjoon-hyun @stefankandic @MaxGekk
The detailed explanation has been updated; this PR is ready for review.
Thank you very much for the review, if you have free time. ❤️

UTF8_LCASE 22007 22009 3 0.0 220067.8 3.2X
UNICODE 376402 377858 2060 0.0 3764015.4 54.5X
UNICODE_CI 444485 444809 458 0.0 4444850.8 64.4X
UTF8_BINARY 12000 12018 26 0.0 120000.9 1.0X
Member

Relatively, UTF8_BINARY becomes slower than before. Do you happen to know any reason?

Contributor Author

Let me run the benchmark a few more times.

@@ -1979,62 +1979,63 @@ class CollationSuite extends DatasourceV2SQLBase with AdaptiveSparkPlanHelper {

// verify that the output ordering is as expected (UTF8_BINARY, UTF8_LCASE, etc.)
val df = sql("SELECT * FROM collations() limit 10")
val icvVersion = "76.1.0.0"
Member

Thank you for adding this variable.

Member

@dongjoon-hyun dongjoon-hyun left a comment

The change looks technically correct to me. Thank you, @panbingkun .

Let's wait for more community reviews to make sure.

@panbingkun
Contributor Author

panbingkun commented Nov 4, 2024

UTF8_LCASE 6052 6052 0 0.0 151298.7 6.0X
UNICODE 74506 74551 64 0.0 1862644.2 74.3X
UNICODE_CI 83607 83756 211 0.0 2090164.5 83.4X
UTF8_BINARY 1778 1779 2 0.0 44450.1 1.0X
Member

The regression of roughly 70% seems significant. Could you figure out whether this is caused by the ICU4J upgrade or whether the benchmark results are simply outdated?
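
A rough way to check that figure from the four rows quoted above is sketched below. It rests on assumptions that are not stated in the thread: that the first numeric column is the best time in milliseconds, that the last column is the time relative to the UTF8_BINARY baseline, and that the UTF8_LCASE, UNICODE and UNICODE_CI rows come from the previous results file while the UTF8_BINARY row is the newly measured one. The object and method names are hypothetical.

```scala
object RegressionEstimate {
  /** One benchmark row: best time in ms and the reported "Relative" factor. */
  final case class Row(name: String, bestMs: Double, relative: Double)

  /** Baseline best time implied by a row, assuming relative = bestMs / baselineMs. */
  def impliedBaselineMs(row: Row): Double = row.bestMs / row.relative

  /** New baseline time divided by the mean baseline implied by the old rows. */
  def slowdown(oldRows: Seq[Row], newBaselineMs: Double): Double = {
    val implied = oldRows.map(impliedBaselineMs)
    newBaselineMs / (implied.sum / implied.size)
  }

  def main(args: Array[String]): Unit = {
    // Rows copied from the diff context above (assumed to be the old results).
    val oldRows = Seq(
      Row("UTF8_LCASE", 6052, 6.0),
      Row("UNICODE", 74506, 74.3),
      Row("UNICODE_CI", 83607, 83.4))
    oldRows.foreach { r =>
      println(f"${r.name}%-12s implied baseline ${impliedBaselineMs(r)}%.1f ms")
    }
    // 1778 ms is the UTF8_BINARY best time in the new row.
    println(f"slowdown factor: ${slowdown(oldRows, 1778)}%.2f")
  }
}
```

Under those assumptions the old rows imply a UTF8_BINARY baseline of about 1000 ms, so the new 1778 ms measurement is in the same range as the regression mentioned in the comment.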

Contributor Author

Yes, I have already run the benchmark on the master. Let me take a look. Wait a moment. Thanks!

Contributor Author

@panbingkun panbingkun Nov 4, 2024

From the testing, it seems that the benchmark results are outdated. The data is as follows:
[two images, not reproduced here]

Member

@panbingkun Thank you for checking this.

Member

It would be nice to identify which commit caused the regression. @panbingkun Could you open a JIRA for the investigation, please? Also, if you have time, please do the investigation. It seems the regression was introduced recently.

Member

BTW, that new JIRA issue will not be a blocker for this dependency change PR, right, @MaxGekk ?

Member

right

Member

Thank you for the confirmation.

@panbingkun
Contributor Author

[image, not reproduced here]

@dongjoon-hyun
Member

To @MaxGekk and all.

  • SPARK-50216 (Investigate UTF8_BINARY regression) is filed as a blocker issue for Apache Spark 4.0.0.

@dongjoon-hyun dongjoon-hyun changed the title from "[SPARK-50189][SQL] Upgrade ICU4J to 76.1" to "[SPARK-50189][BUILD][SQL] Upgrade ICU4J to 76.1" on Nov 4, 2024
@dongjoon-hyun
Member

Merged to master for Apache Spark 4.0.0 in February 2025.

Thank you, @panbingkun , @MaxGekk , @uros-db .

@panbingkun
Contributor Author

To @MaxGekk and all.

  • SPARK-50216 (Investigate UTF8_BINARY regression) is filed as a blocker issue for Apache Spark 4.0.0.

Thanks!

MaxGekk pushed a commit that referenced this pull request Nov 14, 2024
…ationNameToId` outside of cases

### What changes were proposed in this pull request?
This PR addresses the UTF8_BINARY performance regression that was first identified in #48721. The regression first appeared with #48222; however, that PR is not the actual source of the performance degradation.

### Why are the changes needed?
PR #48222 caused the regression because it changed the `collationNameToId` function and made it slightly slower by removing a short-circuit for fetching the UTF8_BINARY collation. However, this function should be called a fixed number of times per query, and at most once from the benchmark framework. That was not the case, and the repeated calls were the largest contributor to the performance regression.

This PR changes the benchmarking framework so that it calls this function once per test case instead of once per expression.
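
The difference between the two call patterns can be sketched as follows. This is a simplified illustration and not the benchmark code itself: the lookup table, the ids, the `compare` helper and the call counter are all hypothetical stand-ins, and only the name `collationNameToId` is taken from the description above.

```scala
object CollationLookupHoisting {
  // Counts lookups so the two patterns can be compared (illustration only).
  var lookupCalls: Int = 0

  // Hypothetical name-to-id table; the ids are made up for this sketch.
  private val ids = Map("UTF8_BINARY" -> 0, "UTF8_LCASE" -> 1)

  /** Stand-in for a name-to-id lookup that is assumed to be relatively costly. */
  def collationNameToId(name: String): Int = {
    lookupCalls += 1
    ids(name.toUpperCase)
  }

  /** Hypothetical comparison keyed by collation id. */
  def compare(a: String, b: String, collationId: Int): Int =
    if (collationId == 1) a.toLowerCase.compareTo(b.toLowerCase) else a.compareTo(b)

  /** Resolves the collation name again for every comparison. */
  def runPerExpression(inputs: Seq[String], name: String): Long = {
    var acc = 0L
    for (a <- inputs; b <- inputs) acc += compare(a, b, collationNameToId(name))
    acc
  }

  /** Resolves the collation name once per test case and reuses the id. */
  def runHoisted(inputs: Seq[String], name: String): Long = {
    val id = collationNameToId(name)
    var acc = 0L
    for (a <- inputs; b <- inputs) acc += compare(a, b, id)
    acc
  }
}
```

With n inputs, the first variant performs n² lookups and the second performs one, while both compute the same result. That makes the measured time depend on the comparison itself rather than on the lookup.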

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing testing surface, benchmarks.

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #48804 from stevomitric/stevomitric/fix-utf8_binary-regression.

Authored-by: Stevo Mitric <[email protected]>
Signed-off-by: Max Gekk <[email protected]>