[jsapi] inconsistent utf-8 decoding #915
I brought up something similar when I was reviewing the JS API tests here: #883 (comment)
You can see that here:
"Converting a string..." links to here, which replaces U+DC01:
It's worth mentioning that ECMAScript doesn't have this restriction: JS strings may contain lone surrogates. So I think there actually might be two issues here:
@littledan @titzer thoughts? Maybe we're OK with issue 1, but I think we should probably address issue 2. The simplest solution would be to skip any custom section whose name can't be decoded using the WHATWG decoder. That would mean there would be sections you couldn't access via JavaScript (without decoding the module on your own). Another solution would be to restrict the core spec to make surrogate pairs invalid. A third solution would be to rewrite the JS API spec to allow an unmatched trailing surrogate.
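To make the first option concrete, here is a small sketch (my own illustration, not text from the thread) of how the WHATWG decoder handles the bytes of a lone surrogate, and how a fatal-mode decoder could be used to detect names that can't be decoded:

```javascript
// The WTF-8-style byte sequence encoding the lone surrogate U+DC01.
// A Wasm custom-section name could contain these bytes.
const surrogateBytes = new Uint8Array([0xed, 0xb0, 0x81]);

// WHATWG default mode: each offending byte becomes a U+FFFD replacement.
const lenient = new TextDecoder("utf-8").decode(surrogateBytes);
console.log(lenient === "\uFFFD\uFFFD\uFFFD"); // true

// Fatal mode throws instead, which an implementation could use to
// skip the section entirely.
let decodable = true;
try {
  new TextDecoder("utf-8", { fatal: true }).decode(surrogateBytes);
} catch (e) {
  decodable = false;
}
console.log(decodable); // false
```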
This seems like a very poor idea in the first place. Can we disallow that in the core spec instead?
I think it was an oversight on my part that the core spec does not currently rule out the surrogate space. It should, and the reference interpreter indeed flags any such code point as an error. I think we also have various tests for that. (Note that not just unmatched surrogates are illegal; any occurrence of a code point in the surrogate space is illegal as the result of UTF-8 decoding.) I prepared a fix for the core spec.

As for replacements, it is very much intentional that no decoding or interpretation is needed to compare Wasm names. That was the basis of the agreement to require UTF-8 in the first place. So for the core spec, the choice not to do any replacements is working as intended. Note that JS is not the only customer of Wasm names.

It doesn't sound wise, or necessary, for the JS API to specify something different regarding replacements. I don't think that was the case originally, and maybe that should be fixed?
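A quick sketch (mine, not part of the spec) of the property being relied on here: because UTF-8 encodes each scalar value uniquely, name equality in the core spec can be plain byte equality, with no decoding step at all:

```javascript
// Compare two Wasm names as raw byte sequences. Since valid UTF-8 is a
// canonical encoding, byte equality coincides with scalar-value equality.
function namesEqual(a, b) {
  if (a.length !== b.length) return false;
  for (let i = 0; i < a.length; i++) {
    if (a[i] !== b[i]) return false;
  }
  return true;
}

const enc = new TextEncoder();
console.log(namesEqual(enc.encode("état"), enc.encode("état"))); // true
console.log(namesEqual(enc.encode("a"), enc.encode("b"))); // false
```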
If I understand correctly, when matching, we've got the name (which is conceptually a sequence of Unicode scalar values, and could be stored as UTF-8 or UTF-16), and the argument, which is a JS string (so UTF-16 plus unpaired surrogates). I can think of three reasonable ways to compare right now:
I suppose the first one is cheapest if we assume JS engines will decode the names to UTF-16 ahead of time anyway, but I'm not entirely sure it's the most helpful. (I have no strong personal preference, though.)
So the case in question is the one where the user-provided argument is invalid UTF-16? In that case, it seems like a bad idea to make that magically succeed on some accidental occasions (option 2). Invalid UTF-16 is a reality of JavaScript strings; of all string-consuming functions out there, it's not the responsibility of the Wasm API to suddenly start worrying about that as a malformed argument (option 3). So option 1 is the natural choice IMO.
I agree with option one. FYI, SpiderMonkey has the same issue, and I would expect all the other engines to use the UTF-8 decoder "for the web", which is the one specified by the WHATWG.
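For illustration, a rough sketch of option one as I read it (the function name is mine, not from any spec): decode the section name with the WHATWG decoder up front, then compare the resulting JS string code-unit by code-unit with the argument:

```javascript
// Option 1 sketch: decode the name bytes with U+FFFD replacement, then
// do a plain UTF-16 code-unit comparison against the JS argument.
function sectionNameMatches(nameBytes, jsArgument) {
  const decoded = new TextDecoder("utf-8").decode(nameBytes);
  return decoded === jsArgument;
}

const utf8Name = new TextEncoder().encode("memory");
console.log(sectionNameMatches(utf8Name, "memory")); // true

// A lone surrogate in the argument can never equal a decoded name,
// because WHATWG decoding never produces surrogate code units.
console.log(sectionNameMatches(utf8Name, "\uDC01")); // false
```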
I'm happy with the solution this thread ended up on. With #923 merged, is this issue ready to close?
@littledan The WHATWG algorithm still uses the lower/upper boundary rules, which means that it will truncate some code points compared to the WebAssembly decoder.
@xtuc Which kinds of strings do you think will be truncated in one algorithm and valid in another? |
To be clear, the surrogates aren't truncated, they are replaced with U+FFFD (see #915 (comment)) |
I don't think lower/upper boundaries result in truncation either. |
If I understand correctly, the current Wasm decoder would allow much higher code points than the WHATWG one does, and these would be truncated (or clamped to the upper boundary) when going to JS. @rossberg, is the goal to comply with the WHATWG decoder? Since the WebAssembly UTF-8 decoder is referenced elsewhere, that seems like a restriction for Wasm usage. I agree with @Ms2ger's first option, which doesn't require such a spec change; it only concerns the custom sections in the js-api.
The Wasm spec clearly allows all legal code points up to U+110000, as specified by Unicode. I think the same is true for the algorithm given in the WHATWG spec, though it's much more difficult to deduce (you have to look at the table in Unicode section 3.9 to understand why this algorithm makes sense). |
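A small check (assuming a WHATWG-conformant `TextDecoder`, as in modern engines) that the upper boundary of the two decoders lines up at U+10FFFF rather than anything being truncated:

```javascript
// F4 8F BF BF is the UTF-8 encoding of U+10FFFF, the maximum scalar value.
const max = new TextDecoder().decode(new Uint8Array([0xf4, 0x8f, 0xbf, 0xbf]));
console.log(max === "\u{10FFFF}"); // true

// F4 90 80 80 would encode U+110000, which is ill-formed; the WHATWG
// decoder substitutes U+FFFD rather than producing a truncated code point.
const beyond = new TextDecoder().decode(new Uint8Array([0xf4, 0x90, 0x80, 0x80]));
console.log(beyond.includes("\uFFFD")); // true
```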
It took me some time to understand; after a chat with @Ms2ger, here are the changes needed:
The first check with the explicit U+FFFD will still match, but the second call with the surrogate should indeed start returning an empty array. |
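The expected behaviour can be sketched against a hand-assembled module (the bytes below are my own minimal example, assuming an engine that implements the updated js-api):

```javascript
// Minimal module: magic, version, and one custom section whose name is
// the three UTF-8 bytes of U+FFFD and whose payload is empty.
const bytes = new Uint8Array([
  0x00, 0x61, 0x73, 0x6d, // "\0asm" magic
  0x01, 0x00, 0x00, 0x00, // binary version 1
  0x00,                   // section id 0: custom section
  0x04,                   // section size: 1 length byte + 3 name bytes
  0x03,                   // name length
  0xef, 0xbf, 0xbd,       // UTF-8 encoding of U+FFFD
]);
const mod = new WebAssembly.Module(bytes);

// The explicit U+FFFD matches the section name...
console.log(WebAssembly.Module.customSections(mod, "\uFFFD").length); // 1
// ...while a lone surrogate matches nothing and yields an empty array.
console.log(WebAssembly.Module.customSections(mod, "\uDC01").length); // 0
```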
Yes, what I mean by "normalizing surrogates" is replacing them with U+FFFD.
To clarify, does this apply only to the parameter of WebAssembly.Module.customSections?
Enables WebAssembly's js-api module/customSection. The specification has been updated; see WebAssembly/spec#915. V8 was already using DOMString.

Bug: v8:8633
Change-Id: I4c3e93c21594dbba84b3697e7e85069c3ff8b441
Cq-Include-Trybots: luci.chromium.try:linux-blink-rel
Reviewed-on: https://chromium-review.googlesource.com/c/1415554
Commit-Queue: Sven Sauleau <[email protected]>
Reviewed-by: Adam Klein <[email protected]>
Cr-Commit-Position: refs/heads/master@{#59182}
Two UTF-8 decoders are specified:

- the core WebAssembly decoder, used for names in the binary format;
- WebAssembly.Module.customSections's name decoding, defined by the WHATWG Encoding Standard: https://encoding.spec.whatwg.org/#utf-8

Some of the rules in the WHATWG algorithm are not used in Wasm.

My understanding is that U+DC01 and U+FFFD should be equal in the JS API, as tested in spec/test/js-api/module/customSections.any.js (lines 156 to 160 in 5aaea96).

Note that this is the only occurrence of UTF-8 decoding in the JS spec.