Enhancement Request: Official Runes Library for Unicode Substring, Length, and String Manipulation #502

Libresse · 2023-08-31T11:01:48Z

Hi there,

I'm raising an issue we’ve run into while using starlark-go (Google's Go implementation of Starlark), concerning how strings are processed. It stems from a difference in how Starlark and Python treat strings.

In Python, all strings are Unicode, and operations like slicing or indexing take Unicode code points into account, rather than byte indices. On the flip side, Starlark treats all strings as ASCII, which can cause unexpected results when handling non-ASCII characters, especially those from non-Latin alphabets.

For instance, consider the Chinese string "你老公技术不错". In Python, a slice operation like s[:3] would return the first three characters '你老公'. However, in Starlark, this operation would yield "你" instead. More worryingly, for cases like s[:2] in Starlark, the string slicing completely breaks and returns an unexpected byte combination "\xe4\xbd".
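The difference can be reproduced in plain Python by slicing the UTF-8 encoding of the string instead of the string itself. This is only a sketch of the byte-indexed behaviour described above, not Starlark code; the variable names are illustrative.

```python
# Python sketch: code-point slicing vs. byte slicing of the UTF-8 encoding.
s = "你老公技术不错"

# A Python str is indexed by code point.
by_codepoint = s[:3]

# Slicing the UTF-8 bytes mimics a byte-indexed string.
utf8 = s.encode("utf-8")   # each of these characters occupies 3 bytes
three_bytes = utf8[:3]     # exactly the encoding of the first character
two_bytes = utf8[:2]       # cuts the first character's encoding short

print(by_codepoint)                  # 你老公
print(three_bytes.decode("utf-8"))   # 你
print(two_bytes)                     # b'\xe4\xbd'
print(len(s), len(utf8))             # 7 21
```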

These scenarios point towards a significant limitation when it comes to handling Unicode in Starlark, which could affect a wide range of applications and users worldwide, thereby dampening the reach and potential of the language.

To address these issues, it would be worthwhile to consider introducing an official runes library (or similar feature) for Unicode substrings, string lengths, and broader string manipulation capabilities that accommodate non-ASCII character sets properly.

Such an enhancement would greatly improve Starlark’s accessibility and usability for global, multilingual users, and serve to reduce unexpected errors and inconsistencies in string processing in various languages.

Your attention to this matter and your help in improving Starlark would be highly appreciated.

BR,


Libresse · 2023-08-31T11:02:27Z

#482

adonovan · 2023-08-31T14:01:08Z

> Starlark treats all strings as ASCII

Not quite: Starlark strings are sequences of UTF-k codes (where k=8 in the Go implementation). In this respect Starlark-go behaves like Go, and C and C++ (and Rust, except that Rust disallows splitting a single rune's encoding as in your s[:2] example).

What specific operations do you need? I would expect it is possible to implement many of them in pure Starlark using the various iterator methods on string (e.g. codepoints and codepoint_ords).

Libresse · 2023-09-27T10:30:34Z

> > Starlark treats all strings as ASCII
>
> Not quite: Starlark strings are sequences of UTF-k codes (where k=8 in the Go implementation). In this respect Starlark-go behaves like Go, and C and C++ (and Rust, except that Rust disallows splitting a single rune's encoding as in your s[:2] example).
>
> What specific operations do you need? I would expect it is possible to implement many of them in pure Starlark using the various iterator methods on string (e.g. codepoints and codepoint_ords).

Thank you for your clarification regarding Starlark strings. We apologize for the misconception in our initial statement.

In terms of specific operations needed, our platform's editors utilize Starlark scripts to edit content in batch. String manipulation operations such as substring extraction, replacement, and indexing are crucial for their tasks. However, dealing with byte indexes presents challenges and is not straightforward in Starlark.

While we appreciate the availability of iterator methods like codepoints and codepoint_ords, we believe that introducing an official runes library or a similar feature for Unicode substrings, string lengths, and broader string manipulation would greatly benefit global, multilingual users. This enhancement would address the limitations in handling non-ASCII characters and improve the accessibility and usability of Starlark.

adonovan · 2023-09-27T14:04:49Z

I'm not averse to the idea of a package for operations on UTF-k strings, but what operations do you need that cannot be expressed (or expressed efficiently) today in terms of codepoints?

Libresse · 2023-10-05T17:14:37Z

> I'm not averse to the idea of a package for operations on UTF-k strings, but what operations do you need that cannot be expressed (or expressed efficiently) today in terms of codepoints?

Operations like s[:2] to get the first 2 chars. For text made of 3-byte characters, it takes s[:6] to return the first 2 chars. But if the content is plain ASCII, it has to be s[:2], otherwise you get more than 2 chars.
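To make the problem concrete: the byte index needed to keep the first n characters depends on the content, so it has to be computed by scanning the encoding. Below is a minimal Python sketch over UTF-8 bytes. The names `byte_offset` and `first_codepoints` are hypothetical helpers for illustration, not part of Starlark or starlark-go, and the input is assumed to be valid UTF-8.

```python
def byte_offset(utf8: bytes, n: int) -> int:
    """Return the byte offset just past the first n code points.

    Assumes utf8 is valid UTF-8. If it holds fewer than n code points,
    the full length is returned.
    """
    count = 0
    for i, b in enumerate(utf8):
        # Bytes of the form 0b10xxxxxx are continuation bytes; any other
        # byte starts a new code point.
        if (b & 0xC0) != 0x80:
            if count == n:
                return i
            count += 1
    return len(utf8)


def first_codepoints(utf8: bytes, n: int) -> bytes:
    """Slice off the first n code points without splitting an encoding."""
    return utf8[:byte_offset(utf8, n)]


print(byte_offset("你好世界".encode("utf-8"), 2))   # 6
print(byte_offset(b"hello", 2))                     # 2
print(byte_offset("a你b".encode("utf-8"), 2))       # 4
```

The same n gives byte offsets 6, 2 and 4 for the three inputs, which is why a fixed s[:k] cannot work across mixed content.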

adonovan · 2023-10-05T18:44:36Z

> Operations like s[:2] to get the first 2 chars. For text made of 3-byte characters, it takes s[:6] to return the first 2 chars. But if the content is plain ASCII, it has to be s[:2], otherwise you get more than 2 chars.

Currently you have to express that as:

def codepoints(s):
    return list(s.codepoints())

codepoints("<世界>")[1:3] # ["世", "界"]

but we could specify that the value returned by s.codepoints() is indexable, so that s.codepoints()[1:3] would do what you want. Is there anything else?

(BTW, I suggest you use the term "code point" not "char", since code point is defined by Unicode, and "char" seems to mean whatever the speaker wants it to mean.)
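For illustration, the idea of an indexable codepoints value can be modelled in Python roughly as follows. This is a sketch under assumptions: `Codepoints` is a hypothetical class, not the starlark-go API, and it decodes the whole string eagerly, which a real implementation might well avoid.

```python
class Codepoints:
    """Hypothetical indexable view over the code points of a UTF-8 string."""

    def __init__(self, utf8: bytes):
        # Eager decode: simple, but O(n) time and space up front.
        self._cps = list(utf8.decode("utf-8"))

    def __len__(self):
        return len(self._cps)

    def __iter__(self):
        return iter(self._cps)

    def __getitem__(self, index):
        # An int index yields one code point; a slice yields a list of them.
        return self._cps[index]


cps = Codepoints("<世界>".encode("utf-8"))
print(cps[1:3])   # ['世', '界']
print(len(cps))   # 4
```

With such a value, slicing and length are expressed in code points, while the underlying string remains a byte sequence.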
