Enhancement Request: Official Runes Library for Unicode Substring, Length, and String Manipulation #502
Comments
Not quite: Starlark strings are sequences of UTF-k codes (where k=8 in the Go implementation). In this respect Starlark-go behaves like Go, and C and C++ (and Rust, except that Rust disallows splitting a single rune's encoding as in your s[:2] example). What specific operations do you need? I would expect it is possible to implement many of them in pure Starlark using the various iterator methods on string (e.g. codepoints and codepoint_ords).
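A minimal sketch of what such pure-Starlark helpers might look like, written here in Python so it can be run directly. The helper names cp_len and cp_substr are hypothetical and not part of any library. Python's list(s) already yields code points; in Starlark-go the same role would presumably be played by list(s.codepoints()), as noted in the comments.

```python
def cp_len(s):
    """Return the number of Unicode code points in s."""
    # Assumption: in Starlark-go this would be len(list(s.codepoints())).
    return len(list(s))


def cp_substr(s, start, end=None):
    """Return the substring of s selected by code point indices [start:end]."""
    # Assumption: in Starlark-go this would be list(s.codepoints()).
    cps = list(s)
    return "".join(cps[start:end])


s = "你老公技术不错"
print(cp_len(s))           # counts code points, not bytes
print(cp_substr(s, 0, 3))  # first three code points
```

Both helpers materialize the full list of code points, so each call costs time and memory linear in the length of the string.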
Thank you for your clarification regarding Starlark strings. We apologize for the misconception in our initial statement. In terms of specific operations needed, our platform's editors utilize Starlark scripts to edit content in batch. String manipulation operations such as substring extraction, replacement, and indexing are crucial for their tasks. However, dealing with byte indexes presents challenges and is not straightforward in Starlark. While we appreciate the availability of iterator methods like codepoints and codepoint_ords, we believe that introducing an official runes library or a similar feature for Unicode substrings, string lengths, and broader string manipulation would greatly benefit global, multilingual users. This enhancement would address the limitations in handling non-ASCII characters and improve the accessibility and usability of Starlark.
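To illustrate why byte indexes are awkward to work with: converting a code point index into a byte offset in a UTF-8 string requires a scan, because each code point occupies between one and four bytes. The sketch below is in Python and assumes the input is valid UTF-8; byte_offset is a hypothetical helper name, not an existing API.

```python
def byte_offset(raw, cp_index):
    """Return the byte offset where code point number cp_index starts.

    raw is assumed to be valid UTF-8 bytes. cp_index may equal the total
    number of code points, in which case len(raw) is returned.
    """
    seen = 0
    for i, b in enumerate(raw):
        # UTF-8 continuation bytes have the bit pattern 10xxxxxx;
        # any other byte begins a new code point.
        if (b & 0xC0) != 0x80:
            if seen == cp_index:
                return i
            seen += 1
    if seen == cp_index:
        return len(raw)
    raise IndexError("code point index out of range")


raw = "你老公技术不错".encode("utf-8")
end = byte_offset(raw, 3)          # byte offset of the fourth code point
print(raw[:end].decode("utf-8"))   # the first three code points
```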
I'm not averse to the idea of a package for operations on UTF-k strings, but what operations do you need that cannot be expressed (or expressed efficiently) today in terms of codepoints?
Ops like |
Currently you have to express that as:
but we could specify that the value returned by (BTW, I suggest you use the term "code point" not "char", since code point is defined by Unicode, and "char" seems to mean whatever the speaker wants it to mean.) |
Hi there,
I'm addressing an issue we’ve stumbled upon while using Google's Go Starlark, related to how string processing is handled. It happens due to the difference in treating strings between Starlark and Python.
In Python, all strings are Unicode, and operations like slicing or indexing take Unicode code points into account, rather than byte indices. On the flip side, Starlark treats all strings as ASCII, which can cause unexpected results when handling non-ASCII characters, especially those from non-Latin alphabets.
For instance, consider the Chinese string "你老公技术不错". In Python, a slice operation like s[:3] would return the first three characters '你老公'. However, in Starlark, this operation would yield "你" instead. More worryingly, for cases like s[:2] in Starlark, the string slicing completely breaks and returns an unexpected byte combination "\xe4\xbd".
These scenarios point towards a significant limitation when it comes to handling Unicode in Starlark, which could affect a wide range of applications and users worldwide, thereby dampening the reach and potential of the language.
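The difference described above can be reproduced in plain Python by slicing the UTF-8 encoding of the string instead of the string itself. This is only a sketch that emulates byte-indexed slicing with a Python bytes object; it does not run Starlark.

```python
s = "你老公技术不错"
raw = s.encode("utf-8")  # each of these characters takes 3 bytes in UTF-8

code_point_slice = s[:3]                # Python str slices by code point
byte_slice_3 = raw[:3].decode("utf-8")  # 3 bytes happen to form one whole character
byte_slice_2 = raw[:2]                  # 2 bytes: an incomplete encoding, not decodable

print(code_point_slice)  # 你老公
print(byte_slice_3)      # 你
print(byte_slice_2)      # b'\xe4\xbd'
```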
To address these issues, it would be worthwhile to consider introducing an official runes library (or similar feature) for Unicode substrings, string lengths, and broader string manipulation capabilities that accommodate non-ASCII character sets properly.
Such an enhancement would greatly improve Starlark’s accessibility and usability for global, multilingual users, and serve to reduce unexpected errors and inconsistencies in string processing in various languages.
Your attention to this matter and your help on improving Starlark would be highly appreciated.
BR,