-
Notifications
You must be signed in to change notification settings - Fork 7
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Update proposal #14
Update proposal #14
Conversation
- Eliminate usage of 'builtin module' in description. This is not essential to the proposal and causes confusion around a similarly named JS proposal, which had different goals. - Clarify some minor points. - Make JS-API changes to WebIDL comprehensive. - Reword feature detection section to actually propose change to WebAssembly.validate method
- Function builtin behaviors is defined using 'create a host function' - Clarify behavior around monkey patching using standard language - Clarify edge cases around nullability - Clarify edge cases around unsigned/signed integers - Restrict 'substring' behavior to normal cases - Use wasm helpers for when wasm instructions are needed
The existing WTF-8 operation in this proposal violated one of the goals of the proposal: "don't create substantial new functionality" by introducing WTF-8 transcoding support to the web platform without prior precedent. The WTF-8 operation is removed because of this. The naming for WTF-16 operations is reworked to refer to 'charCodes' instead as that is what the JS String interface uses. We could support UTF-8 transcoding by referring to the TextEncoder/TextDecoder interfaces, so this commit adds support for that.
|
||
TODO: formalize these better. | ||
There is however, the Encoding API for `TextEncoder`/`TextDecoder` which can be used for UTF-8 support. However, this is technically a separate spec from JS and may not be available on all JS engines (in practice it's available widely). This proposal exposes UTF-8 data conversions using this API under separate `wasm:text-encoder` `wasm:text-decoder` interfaces which are available when the host implements these interfaces. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Limiting this proposal to UTF-8 (as opposed to WTF-8) sounds fine for newly written code, but I'm worried about legacy Java/Kotlin/Dart/etc code that contains unpaired surrogates in string constants. Being able to compile such string constants to i8 data segments and creatings strings from them via array-from-data + string-from-array was the original motivation for supporting WTF-8. (Using 16-bit code units in data segments is wasteful for latin-based languages and non-localized/application-internal ASCII strings, and also a bit of a headache to implement because data segments have no alignment requirements.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, good point. The tricky part about that use-case for WTF-8 is:
- There's no prior API's that operate on WTF-8 in the Web (AFAIK), it's a new encoding that would need to be adopted. It's also not really a web standard yet either, it's just a repo on github (http://simonsapin.github.io/wtf-8/). Adding it to this proposal would be a major new addition to the web platform, and would need to get consensus from other stakeholders that it's okay. If we got that, I'd feel better about it.
- The WTF-8 spec says 'WTF-8 must not be used to represent text in a file format or for transmission over the Internet' and so supporting it for embedding string constants in a wasm binary seems counter to the intent of the proposal.
Is there some other way we could support this use-case? The TextDecoder interface does support ASCII/Latin-1 and so we could add support for that encoding on JS string creation path.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Re 1, I'm not so worried about that part. WTF-8 is a small and straightforward relaxation of UTF-8. It's just a way to represent the JS strings that are already prevalent on the web in a 1-byte-codeunit format. Also, it's not like we're adding WTF-8 strings as a general-purpose primitive, nor are we encouraging anyone to use it as a data interchange format. I see this closer to e.g. TextDecoder
supporting a (huge!) bunch of legacy encodings that aren't otherwise used on the web (note that we didn't have a string-to-wtf8 function, only string-from-wtf8).
Re 2, yes, that's an unresolved wrinkle that I'm also unhappy about. Wasm satisfies the "used internally in self-contained systems with components that need to support potentially ill-formed UTF-16 for legacy reasons" part, it just unfortunately also has a "transmission over the Internet" step between the module producer and the browser executing the module.
All that said, it's probably acceptable, at least for a first version, to have only UTF-8 and WTF-16 (aka "charCode") support. Since the relevant use case is string constants, toolchains can check the string for UTF-8 validity, and use the UTF-8 encoding most of the time, and fall back to WTF-16 in the rare cases where UTF-8 is too strict. We can reopen the WTF-8 discussion if and when evidence arises that the UTF-8+WTF-16 combo is insufficient.
I think having a way to create strings from ASCII data would be nice in general (for performance: one use case where it came up is number-to-string conversion). I'd like to see it added right away, but I admittedly don't have a strong reason to argue that it is necessary for an MVP.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think having a way to create strings from ASCII data would be nice in general (for performance: one use case where it came up is number-to-string conversion). I'd like to see it added right away, but I admittedly don't have a strong reason to argue that it is necessary for an MVP.
It's probably worth adding, I won't do it in this PR but in a future one.
"start" is the index in the array where the first codeunit of the string will be written. | ||
|
||
Returns the number of codeunits written. Traps if the string doesn't fit into the array. | ||
### "wasm:js-string" "copyToCharCodeArray" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why does this have "copy" as part of the name? Wouldn't fromCharCodeArray
/toCharCodeArray
be a more consistent pair of names?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reason is that it takes a pre-allocated array to copy into, instead of creating a new array. I figured that if we had support for a string -> (array i8)
operation that creates an immutable array from a JS string, then we'd want to call that fromCharCodeArray
and need some way to distinguish.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How would intoCharChodeArray
work for you? It matches the TextDecoder interface which has to
and into
versions for allocating/re-using existing buffers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, that's fine. (It's also just a name, not very important either way.)
If we do want to unify names, fromCharCodeArray
and decodeStringFromUTF8Array
could also be made more similar to each other, but again it's just a name. (Some might argue "yeah, the names are different because one is a JS String operation and the other a TextDecoder thing", some might reply "what do I care? They're utf8 and wtf16 versions of the same operation, so they should have similar names".)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, indeed my reasoning was 'the names are different because one is a JS String operation and the other a TextDecoder thing'. I think it's more compelling when you consider the full import specifier:
"wasm:js-string" "fromCharCodeArray"
"wasm:text-decoder" "fromUTF8Array"
vs"wasm:text-decoder" "decodeStringFromUTF8Array"
(1) appears to me to be creating a text decoder from a UTF-8 array.
end > string.length) | ||
// Ensure the range is within bounds to avoid the complex behavior that | ||
// `substring` performs when that is not the case. | ||
if (start > string.length) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think checking for start <= end
(i.e. taking the return ""
path when start > end
) was useful, let's keep that.
(I also didn't mean to express strong opposition to what you had here before. If you'd like to return ""
for end > string.length
, I'm fine with that. I was just pointing out that handling the start < string.length < end
case by capping end
is rather simple, if we want to allow it.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I reverted the change.
|
||
TODO: formalize these better. | ||
There is however, the Encoding API for `TextEncoder`/`TextDecoder` which can be used for UTF-8 support. However, this is technically a separate spec from JS and may not be available on all JS engines (in practice it's available widely). This proposal exposes UTF-8 data conversions using this API under separate `wasm:text-encoder` `wasm:text-decoder` interfaces which are available when the host implements these interfaces. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Re 1, I'm not so worried about that part. WTF-8 is a small and straightforward relaxation of UTF-8. It's just a way to represent the JS strings that are already prevalent on the web in a 1-byte-codeunit format. Also, it's not like we're adding WTF-8 strings as a general-purpose primitive, nor are we encouraging anyone to use it as a data interchange format. I see this closer to e.g. TextDecoder
supporting a (huge!) bunch of legacy encodings that aren't otherwise used on the web (note that we didn't have a string-to-wtf8 function, only string-from-wtf8).
Re 2, yes, that's an unresolved wrinkle that I'm also unhappy about. Wasm satisfies the "used internally in self-contained systems with components that need to support potentially ill-formed UTF-16 for legacy reasons" part, it just unfortunately also has a "transmission over the Internet" step between the module producer and the browser executing the module.
All that said, it's probably acceptable, at least for a first version, to have only UTF-8 and WTF-16 (aka "charCode") support. Since the relevant use case is string constants, toolchains can check the string for UTF-8 validity, and use the UTF-8 encoding most of the time, and fall back to WTF-16 in the rare cases where UTF-8 is too strict. We can reopen the WTF-8 discussion if and when evidence arises that the UTF-8+WTF-16 combo is insufficient.
I think having a way to create strings from ASCII data would be nice in general (for performance: one use case where it came up is number-to-string conversion). I'd like to see it added right away, but I admittedly don't have a strong reason to argue that it is necessary for an MVP.
"start" is the index in the array where the first codeunit of the string will be written. | ||
|
||
Returns the number of codeunits written. Traps if the string doesn't fit into the array. | ||
### "wasm:js-string" "copyToCharCodeArray" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, that's fine. (It's also just a name, not very important either way.)
If we do want to unify names, fromCharCodeArray
and decodeStringFromUTF8Array
could also be made more similar to each other, but again it's just a name. (Some might argue "yeah, the names are different because one is a JS String operation and the other a TextDecoder thing", some might reply "what do I care? They're utf8 and wtf16 versions of the same operation, so they should have similar names".)
I think at this point all comments have been addressed, minus some naming stuff that we can continue in issues if important. The WTF-8 stuff has some open issues we can continue discussing there too. |
It seems like a wrong copy-and-paste from Firefox.
Several big changes here, hoping to resolve the remaining outstanding issues and TODOs. Apologies for the single PR.
The major changes are:
Fixes #12, touches on #11, #10, and #9.
@jakobkummerow PTAL. @wingo, curious your thoughts on the encodings part. No rush anyone as I know it's the holidays.