String encodings (UTF-8/UTF-16) #14
That's a tricky topic, because of the indexing and performance
considerations. For (logically) immutable strings, perhaps the most general
scheme is something like pre-Swift-5 Swift: have a String object that
provides "views" of the string as either UTF-8, UTF-16, or UTF-32 (code
points). Internally it could use something like
https://swift.org/blog/utf8-string/#breadcrumbs for performance. But this
just scratches the surface of the topic.
Mark
…On Tue, Mar 24, 2020 at 4:42 PM Shane F. Carr ***@***.***> wrote:
We want to make OmnICU able to be ported to other environments where
strings are not necessarily UTF-8. For example, if called via FFI from Dart
or WebAssembly, we may want to process strings that are UTF-16. How should
we deal with strings in OmnICU that makes it easy to target UTF-16
environments while not hurting performance in UTF-8 environments?
I was thinking one option here could be that we have a file like:

```rust
// typedefs.rs

// Rust UTF-8 environment
#[cfg(...)]
pub use std::string::String;

// UTF-16 environment
#[cfg(...)]
pub use omnicu::types::U16String as String;
```

Then, all public OmnICU APIs would import the `String` symbol from `typedefs.rs`, and the definition would vary by configuration option. As long as `omnicu::types::U16String` is duck-type-compatible with `std::string::String`, we should be OK, and we don't cause any performance regressions in Rust environments.
IMO it would be better to use a trait than to use cfgs.
I expect that in Gecko, whether we want to perform a given operation on UTF-8, on UTF-16, or, depending on the caller, both, will vary over time. I also expect there will be some operations where the Gecko caller wants UTF-16 for the time being, but performing a conversion at the boundary and running the operation on UTF-8 inside OmnICU would be acceptable. When writing our conversions between UTF-16 and UTF-8, it's been a goal of mine to make them so fast that people wouldn't shy away from using UTF-8 for new code just because the new code needs to be called by old code that has UTF-16 at hand.
I think I should expand on my previous comment. There aren't just two cases, UTF-16 and UTF-8; there are three: potentially-invalid UTF-16, potentially-invalid UTF-8 (invalidity isn't UB), and guaranteed-valid UTF-8 (invalidity is UB). (Guaranteed-valid UTF-16 doesn't exist in practice.) Additionally, there'd be Latin1, which both SpiderMonkey and V8 use internally when possible, but it's unclear if letting that abstraction leak into OmnICU is a good idea.

In the caller-to-OmnICU direction, it's always okay to pass guaranteed-valid UTF-8 to an entry point that takes potentially-invalid UTF-8, if one is willing to leave some performance on the table. In the OmnICU-to-caller direction it isn't quite that simple, because the simplest loop that restores UTF-8 validity after having written to a caller-allocated buffer of guaranteed-valid UTF-8 (i.e. write zeros after your meaningful write until finding a UTF-8 lead byte) doesn't necessarily terminate if the buffer wasn't valid to begin with.

Supporting all three for output results in just two copies of the business logic, plus a wrapper that does the zeroing until the next UTF-8 lead for the third case. Supporting all three on the input side without leaving any performance on the table results in three monomorphizations. (encoding_rs doesn't support the case of encoding potentially-invalid UTF-8 into a legacy encoding.)

What travels best over the FFI is a pointer to a caller-owned buffer and a length. So I would suggest that the FFI layer not try to expose either iterators or growable buffers, but work with pointer-and-length, caller-allocated buffers instead. When the internals use iterators, I'm fine with exposing the iterators to Rust callers. I think it's also nice to provide wrappers that append into growable buffers.
Reassigning this issue to Henri as the primary owner. Can you write up the recommended approach to ICU4X string encodings in a PR that we can review?
@anba, my recollection is that you've worked on optimizing SpiderMonkey cases where the ECMA Intl API is called with input that SpiderMonkey represents as Latin1. Are there operations for which it would be worth the binary size increase to optimize performance by making the back-end library work directly with the Latin1 representation?
There's also prior work on Gecko's FFI lib for strings. I asked @emilio for early feedback since he's one of the authors of that lib. His response:
Created a PR with a proposal. It still lacks a sample C++ wrapper.
I updated the proposal to use an allocation strategy similar to
Filed a Rust issue.