[C++] "utf8_upper" kernel produces different result than Python's str.upper for "ẞ" #34599

Closed
rohanjain101 opened this issue Mar 17, 2023 · 2 comments

Comments

@rohanjain101

Describe the bug, including details regarding any error messages, version, and platform.

In [1]: import pyarrow as pa

In [2]: pa.compute.utf8_upper("ß")
Out[2]: <pyarrow.StringScalar: 'ẞ'>

In [3]: pa.__version__
Out[3]: '11.0.0'

Python str.upper:

>>> char = "ß"
>>> char.upper()
'SS'

Component(s)

Python

@jorisvandenbossche changed the title on Mar 17, 2023
from: compute.utf8_upper produces different result than str.upper for "ẞ"
to: [C++] "utf8_upper" kernel produces different result than Python's str.upper for "ẞ"
@jorisvandenbossche
Member

Arrow uses the utf8proc C library for UTF-8 operations (https://juliastrings.github.io/utf8proc/).

This library changed the uppercase mapping of "ß" from "SS" to "ẞ" a few years ago: JuliaStrings/utf8proc#130

There is some discussion about what the correct uppercase form should be. For example, see also https://bugs.openjdk.org/browse/JDK-8186073. The Unicode standard (http://unicode.org/charts/PDF/U1E00.pdf) mentions:

The capital letter sharp s is part of the official German
orthography since 2017. Along with "SS" it is an allowed
variant spelling of 00DF in "all caps" style

https://www.fileformat.info/info/unicode/char/00df/index.htm says that the uppercase of U+00DF is "SS" (standard case mapping), alternatively U+1E9E.
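
To make the two mappings concrete, here is a small sketch using only the Python standard library (no Arrow involved). It shows that `str.upper` expands U+00DF to "SS", while U+1E9E exists as a separate capital letter that lowercases back to "ß". The variable names are illustrative only.

```python
import unicodedata

lower_sharp_s = "\u00df"    # ß
capital_sharp_s = "\u1e9e"  # ẞ

# Official Unicode names of the two code points.
lower_name = unicodedata.name(lower_sharp_s)
capital_name = unicodedata.name(capital_sharp_s)

# str.upper expands ß to two characters.
upper_of_lower = lower_sharp_s.upper()

# The capital sharp s is already uppercase and lowercases to ß.
upper_of_capital = capital_sharp_s.upper()
lower_of_capital = capital_sharp_s.lower()

print(lower_name, "->", upper_of_lower)
print(capital_name, "->", lower_of_capital)
```

Because "ß" maps to "SS" but "SS" lowercases to "ss", upper-then-lower does not round-trip for "ß" in Python.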

@jorisvandenbossche
Member

So in the end, this is not something we can change in Arrow itself. If you want this to change, you will need to bring it up at https://github.com/JuliaStrings/utf8proc/ (but given that they changed this a few years back, they are unlikely to change it again).
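
If matching `str.upper` matters on the user side, one possible workaround is to expand "ß" to "SS" before uppercasing. The sketch below is plain Python: `expand_sharp_s` is a hypothetical helper, and `simple_upper` is only a stand-in that imitates the ß → ẞ behaviour reported above, not Arrow's actual implementation. With pyarrow, the analogous pre-processing step would presumably be `pyarrow.compute.replace_substring` applied before `utf8_upper`, but that is untested here.

```python
def expand_sharp_s(text: str) -> str:
    """Hypothetical helper: replace each lowercase sharp s with 'SS'."""
    return text.replace("\u00df", "SS")


def simple_upper(text: str) -> str:
    """Stand-in for an uppercase routine that maps ß to ẞ.

    Assumption: this only imitates the behaviour reported in this issue.
    """
    return text.replace("\u00df", "\u1e9e").upper()


word = "stra\u00dfe"  # "straße"

without_fix = simple_upper(word)
with_fix = simple_upper(expand_sharp_s(word))

print(without_fix)  # contains U+1E9E
print(with_fix)     # matches word.upper()
```

This only handles "ß". Other characters whose uppercase form in Python is more than one character are not covered by this sketch.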
