String encodings (UTF-8/UTF-16) #14
That's a tricky topic, because of the indexing and performance
considerations. For (logically) immutable strings, perhaps the most general
scheme is something like pre-Swift-5 Swift: have a String object that
provides "views" of the string as either UTF-8, UTF-16, or UTF-32 (code
points). Internally it could use something like
https://swift.org/blog/utf8-string/#breadcrumbs for performance. But this
just scratches the surface of the topic.
Mark
…On Tue, Mar 24, 2020 at 4:42 PM Shane F. Carr ***@***.***> wrote:
We want to make OmnICU able to be ported to other environments where
strings are not necessarily UTF-8. For example, if called via FFI from Dart
or WebAssembly, we may want to process strings that are UTF-16. How should
we deal with strings in OmnICU that makes it easy to target UTF-16
environments while not hurting performance in UTF-8 environments?
I was thinking one option here could be that we have a file like:

```rust
// typedefs.rs

// Rust UTF-8 environment
#[cfg(...)]
pub use std::string::String;

// UTF-16 environment
#[cfg(...)]
pub use omnicu::types::U16String as String;
```

Then, all public OmnICU APIs would import the `String` symbol from `typedefs.rs`, and the definition would vary by configuration option. As long as `omnicu::types::U16String` is duck-type-compatible with `std::string::String`, we should be OK, and we don't cause any performance regressions in Rust environments.
IMO it would be better to use a trait than to use cfgs.
I expect that in Gecko, whether we want to perform a given operation on UTF-8, on UTF-16, or, depending on the caller, both, will vary over time. I also expect there will be some operations where the Gecko caller wants UTF-16 for the time being, but performing a conversion at the boundary and running the operation on UTF-8 inside OmnICU would be acceptable. When writing our conversions between UTF-16 and UTF-8, it's been a goal of mine to make them so fast that people wouldn't shy away from using UTF-8 for new code just because the new code needs to be called by old code that has UTF-16 at hand.
I think I should expand on my previous comment. There aren't just two cases, UTF-16 and UTF-8; there are three: potentially-invalid UTF-16, potentially-invalid UTF-8 (invalidity isn't UB), and guaranteed-valid UTF-8 (invalidity is UB). (Guaranteed-valid UTF-16 doesn't exist in practice.) Additionally, there'd be Latin1, which both SpiderMonkey and V8 use internally when possible, but it's unclear if letting that abstraction leak into OmnICU is a good idea.

In the caller-to-OmnICU direction, it's always okay to pass guaranteed-valid UTF-8 to an entry point that takes potentially-invalid UTF-8, if one is willing to leave some performance on the table. In the OmnICU-to-caller direction it isn't quite that simple, because the simplest loop that restores UTF-8 validity after having written to a caller-allocated buffer of guaranteed-valid UTF-8 (i.e. write zeros after your meaningful write until finding a UTF-8 lead byte) doesn't necessarily terminate if the buffer wasn't valid to begin with.

Supporting all three for output results in just two copies of the business logic, plus a wrapper that does the zeroing until the next UTF-8 lead for the third case. Supporting all three on the input side without leaving any performance on the table results in three monomorphizations. (encoding_rs doesn't support the case of encoding potentially-invalid UTF-8 into a legacy encoding.)

What travels best over the FFI is a pointer to a caller-owned buffer and a length. So I would suggest that the FFI layer not try to expose either iterators or growable buffers, but work with pointer-and-length, caller-allocated buffers instead. When the internals use iterators, I'm fine with exposing the iterators to Rust callers. I think it's also nice to provide wrappers that append into growable buffers.
Reassigning this issue to Henri as the primary owner. Can you write up the recommended approach to ICU4X string encodings in a PR that we can review?
@anba, my recollection is that you've worked on optimizing SpiderMonkey cases where the ECMA Intl API is called with input that SpiderMonkey represents as Latin1. Are there operations for which it would be worth the binary size increase to optimize performance by making the back-end library work directly with the Latin1 representation?
There's also prior work on Gecko's FFI lib for strings. I asked @emilio for early feedback since he's one of the authors of that lib. His response:
Created a PR with a proposal. It still lacks a sample C++ wrapper.
I updated the proposal to use an allocation strategy similar to
Filed a Rust issue.