Update proposal #14

eqrion · 2023-12-29T20:24:06Z

Several big changes here, hoping to resolve the remaining outstanding issues and TODOs. Apologies for the single PR.

The major changes are:

Many editorial changes to tighten up ambiguities in the definition of builtins
Reworked logic in builtins to handle edge cases discovered through implementation
Reworked some builtin names to better match JS naming
Removed WTF-8 support and replaced it with UTF-8 support through TextEncoder/TextDecoder (probably the controversial one, reasoning is in the commit message).

Fixes #12, touches on #11, #10, and #9.

@jakobkummerow PTAL. @wingo, curious your thoughts on the encodings part. No rush anyone as I know it's the holidays.

- Eliminate usage of 'builtin module' in description. This is not essential to the proposal and causes confusion around a similarly named JS proposal, which had different goals. - Clarify some minor points. - Make JS-API changes to WebIDL comprehensive. - Reword feature detection section to actually propose change to WebAssembly.validate method

- Function builtin behaviors is defined using 'create a host function' - Clarify behavior around monkey patching using standard language - Clarify edge cases around nullability - Clarify edge cases around unsigned/signed integers - Restrict 'substring' behavior to normal cases - Use wasm helpers for when wasm instructions are needed

The existing WTF-8 operation in this proposal violated one of the goals of the proposal: "don't create substantial new functionality" by introducing WTF-8 transcoding support to the web platform without prior precedent. The WTF-8 operation is removed because of this. The naming for WTF-16 operations is reworked to refer to 'charCodes' instead as that is what the JS String interface uses. We could support UTF-8 transcoding by referring to the TextEncoder/TextDecoder interfaces, so this commit adds support for that.

proposals/js-string-builtins/Overview.md

jakobkummerow · 2024-01-09T10:26:07Z

proposals/js-string-builtins/Overview.md


-TODO: formalize these better.
+There is however, the Encoding API for `TextEncoder`/`TextDecoder` which can be used for UTF-8 support. However, this is technically a separate spec from JS and may not be available on all JS engines (in practice it's available widely). This proposal exposes UTF-8 data conversions using this API under separate `wasm:text-encoder` `wasm:text-decoder` interfaces which are available when the host implements these interfaces.


Limiting this proposal to UTF-8 (as opposed to WTF-8) sounds fine for newly written code, but I'm worried about legacy Java/Kotlin/Dart/etc code that contains unpaired surrogates in string constants. Being able to compile such string constants to i8 data segments and creatings strings from them via array-from-data + string-from-array was the original motivation for supporting WTF-8. (Using 16-bit code units in data segments is wasteful for latin-based languages and non-localized/application-internal ASCII strings, and also a bit of a headache to implement because data segments have no alignment requirements.)

Hmm, good point. The tricky part about that use-case for WTF-8 is:

There's no prior API's that operate on WTF-8 in the Web (AFAIK), it's a new encoding that would need to be adopted. It's also not really a web standard yet either, it's just a repo on github (http://simonsapin.github.io/wtf-8/). Adding it to this proposal would be a major new addition to the web platform, and would need to get consensus from other stakeholders that it's okay. If we got that, I'd feel better about it.

The WTF-8 spec says 'WTF-8 must not be used to represent text in a file format or for transmission over the Internet' and so supporting it for embedding string constants in a wasm binary seems counter to the intent of the proposal.

Is there some other way we could support this use-case? The TextDecoder interface does support ASCII/Latin-1 and so we could add support for that encoding on JS string creation path.

Re 1, I'm not so worried about that part. WTF-8 is a small and straightforward relaxation of UTF-8. It's just a way to represent the JS strings that are already prevalent on the web in a 1-byte-codeunit format. Also, it's not like we're adding WTF-8 strings as a general-purpose primitive, nor are we encouraging anyone to use it as a data interchange format. I see this closer to e.g. TextDecoder supporting a (huge!) bunch of legacy encodings that aren't otherwise used on the web (note that we didn't have a string-to-wtf8 function, only string-from-wtf8).

Re 2, yes, that's an unresolved wrinkle that I'm also unhappy about. Wasm satisfies the "used internally in ~~self-contained~~ systems with components that need to support potentially ill-formed UTF-16 for legacy reasons" part, it just unfortunately also has a "transmission over the Internet" step between the module producer and the browser executing the module.

All that said, it's probably acceptable, at least for a first version, to have only UTF-8 and WTF-16 (aka "charCode") support. Since the relevant use case is string constants, toolchains can check the string for UTF-8 validity, and use the UTF-8 encoding most of the time, and fall back to WTF-16 in the rare cases where UTF-8 is too strict. We can reopen the WTF-8 discussion if and when evidence arises that the UTF-8+WTF-16 combo is insufficient.

I think having a way to create strings from ASCII data would be nice in general (for performance: one use case where it came up is number-to-string conversion). I'd like to see it added right away, but I admittedly don't have a strong reason to argue that it is necessary for an MVP.

I think having a way to create strings from ASCII data would be nice in general (for performance: one use case where it came up is number-to-string conversion). I'd like to see it added right away, but I admittedly don't have a strong reason to argue that it is necessary for an MVP.

It's probably worth adding, I won't do it in this PR but in a future one.

jakobkummerow · 2024-01-09T10:29:42Z

proposals/js-string-builtins/Overview.md

-"start" is the index in the array where the first codeunit of the string will be written.
-
-Returns the number of codeunits written. Traps if the string doesn't fit into the array.
+### "wasm:js-string" "copyToCharCodeArray"


Why does this have "copy" as part of the name? Wouldn't fromCharCodeArray/toCharCodeArray be a more consistent pair of names?

The reason is that it takes a pre-allocated array to copy into, instead of creating a new array. I figured that if we had support for a string -> (array i8) operation that creates an immutable array from a JS string, then we'd want to call that fromCharCodeArray and need some way to distinguish.

How would intoCharChodeArray work for you? It matches the TextDecoder interface which has to and into versions for allocating/re-using existing buffers.

Sure, that's fine. (It's also just a name, not very important either way.)

If we do want to unify names, fromCharCodeArray and decodeStringFromUTF8Array could also be made more similar to each other, but again it's just a name. (Some might argue "yeah, the names are different because one is a JS String operation and the other a TextDecoder thing", some might reply "what do I care? They're utf8 and wtf16 versions of the same operation, so they should have similar names".)

Yeah, indeed my reasoning was 'the names are different because one is a JS String operation and the other a TextDecoder thing'. I think it's more compelling when you consider the full import specifier:

"wasm:js-string" "fromCharCodeArray"

"wasm:text-decoder" "fromUTF8Array" vs

"wasm:text-decoder" "decodeStringFromUTF8Array"

(1) appears to me to be creating a text decoder from a UTF-8 array.

proposals/js-string-builtins/Overview.md

jakobkummerow · 2024-01-10T15:21:40Z

proposals/js-string-builtins/Overview.md

-      end > string.length)
+  // Ensure the range is within bounds to avoid the complex behavior that
+  // `substring` performs when that is not the case.
+  if (start > string.length)


I think checking for start <= end (i.e. taking the return "" path when start > end) was useful, let's keep that.
(I also didn't mean to express strong opposition to what you had here before. If you'd like to return "" for end > string.length, I'm fine with that. I was just pointing out that handling the start < string.length < end case by capping end is rather simple, if we want to allow it.)

I reverted the change.

jakobkummerow · 2024-01-10T19:25:13Z

proposals/js-string-builtins/Overview.md


-TODO: formalize these better.
+There is however, the Encoding API for `TextEncoder`/`TextDecoder` which can be used for UTF-8 support. However, this is technically a separate spec from JS and may not be available on all JS engines (in practice it's available widely). This proposal exposes UTF-8 data conversions using this API under separate `wasm:text-encoder` `wasm:text-decoder` interfaces which are available when the host implements these interfaces.


Re 1, I'm not so worried about that part. WTF-8 is a small and straightforward relaxation of UTF-8. It's just a way to represent the JS strings that are already prevalent on the web in a 1-byte-codeunit format. Also, it's not like we're adding WTF-8 strings as a general-purpose primitive, nor are we encouraging anyone to use it as a data interchange format. I see this closer to e.g. TextDecoder supporting a (huge!) bunch of legacy encodings that aren't otherwise used on the web (note that we didn't have a string-to-wtf8 function, only string-from-wtf8).

Re 2, yes, that's an unresolved wrinkle that I'm also unhappy about. Wasm satisfies the "used internally in ~~self-contained~~ systems with components that need to support potentially ill-formed UTF-16 for legacy reasons" part, it just unfortunately also has a "transmission over the Internet" step between the module producer and the browser executing the module.

All that said, it's probably acceptable, at least for a first version, to have only UTF-8 and WTF-16 (aka "charCode") support. Since the relevant use case is string constants, toolchains can check the string for UTF-8 validity, and use the UTF-8 encoding most of the time, and fall back to WTF-16 in the rare cases where UTF-8 is too strict. We can reopen the WTF-8 discussion if and when evidence arises that the UTF-8+WTF-16 combo is insufficient.

I think having a way to create strings from ASCII data would be nice in general (for performance: one use case where it came up is number-to-string conversion). I'd like to see it added right away, but I admittedly don't have a strong reason to argue that it is necessary for an MVP.

jakobkummerow · 2024-01-10T19:31:03Z

proposals/js-string-builtins/Overview.md

-"start" is the index in the array where the first codeunit of the string will be written.
-
-Returns the number of codeunits written. Traps if the string doesn't fit into the array.
+### "wasm:js-string" "copyToCharCodeArray"


Sure, that's fine. (It's also just a name, not very important either way.)

If we do want to unify names, fromCharCodeArray and decodeStringFromUTF8Array could also be made more similar to each other, but again it's just a name. (Some might argue "yeah, the names are different because one is a JS String operation and the other a TextDecoder thing", some might reply "what do I care? They're utf8 and wtf16 versions of the same operation, so they should have similar names".)

eqrion · 2024-01-16T14:43:00Z

I think at this point all comments have been addressed, minus some naming stuff that we can continue in issues if important. The WTF-8 stuff has some open issues we can continue discussing there too.

It seems like a wrong copy-and-paste from Firefox.

eqrion added 4 commits December 7, 2023 12:16

Rename concatenate to concat

e1a47ee

jakobkummerow reviewed Jan 9, 2024

View reviewed changes

Review comments

440a0f6

jakobkummerow reviewed Jan 10, 2024

View reviewed changes

Update substring range checking

121bd8c

eqrion merged commit 206a3a0 into main Jan 16, 2024

eqrion deleted the refactor-2 branch January 16, 2024 14:43

eqrion pushed a commit that referenced this pull request Aug 6, 2024

Remove wabt "Done" link (#14)

fb4683b

It seems like a wrong copy-and-paste from Firefox.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Update proposal #14

Update proposal #14

eqrion commented Dec 29, 2023

jakobkummerow Jan 9, 2024

eqrion Jan 9, 2024

jakobkummerow Jan 10, 2024

eqrion Jan 11, 2024

jakobkummerow Jan 9, 2024

eqrion Jan 9, 2024

eqrion Jan 9, 2024

jakobkummerow Jan 10, 2024

eqrion Jan 16, 2024

jakobkummerow Jan 10, 2024

eqrion Jan 16, 2024

jakobkummerow Jan 10, 2024

jakobkummerow Jan 10, 2024

eqrion commented Jan 16, 2024


		TODO: formalize these better.
		There is however, the Encoding API for `TextEncoder`/`TextDecoder` which can be used for UTF-8 support. However, this is technically a separate spec from JS and may not be available on all JS engines (in practice it's available widely). This proposal exposes UTF-8 data conversions using this API under separate `wasm:text-encoder` `wasm:text-decoder` interfaces which are available when the host implements these interfaces.

Update proposal #14

Update proposal #14

Conversation

eqrion commented Dec 29, 2023

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

eqrion commented Jan 16, 2024