
[jsapi] inconsistent utf-8 decoding #915

Closed
xtuc opened this issue Nov 20, 2018 · 20 comments

@xtuc
Contributor

xtuc commented Nov 20, 2018

Two UTF-8 decoders are specified:

From the WHATWG algorithm, the rules

  1. Set UTF-8 lower boundary to 0x80 and UTF-8 upper boundary to 0xBF.
  2. Set UTF-8 code point to (UTF-8 code point << 6) | (byte & 0x3F)

are not used in Wasm.

My understanding is that U+DC01 and U+FFFD should be treated as equal in the JS API, as tested here:

assert_sections(WebAssembly.Module.customSections(module, "na\uFFFDme"), [
  bytes,
]);
assert_sections(WebAssembly.Module.customSections(module, "na\uDC01me"), [
  bytes,
]);

while in Wasm they would be considered two different sections, which could cause subtle mismatches.

Note that this is the only occurrence of UTF-8 decoding in the JS spec.
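A minimal sketch of the WebIDL USVString conversion that produces this equivalence (toUSVString is a hypothetical helper name for illustration, not part of any spec):

function toUSVString(s) {
  let out = "";
  for (let i = 0; i < s.length; i++) {
    const c = s.charCodeAt(i);
    const next = s.charCodeAt(i + 1); // NaN at the end of the string
    if (c >= 0xd800 && c <= 0xdbff && next >= 0xdc00 && next <= 0xdfff) {
      out += s[i] + s[i + 1]; // well-formed surrogate pair, kept as-is
      i++;
    } else if (c >= 0xd800 && c <= 0xdfff) {
      out += "\ufffd"; // lone surrogate is replaced
    } else {
      out += s[i];
    }
  }
  return out;
}

toUSVString("na\uDC01me") === "na\uFFFDme"; // true, hence the identical results above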

@binji
Member

binji commented Nov 27, 2018

I brought up something similar when I was reviewing the JS API tests here: #883 (comment)

@binji: This is a bit surprising to me, since the customSections algorithm says it uses UTF-8 decode without BOM or fail, which doesn't seem to convert the 0xdc01 to a 0xfffd (though I see that elsewhere in the spec). Does the replacement happen somewhere else?

@Ms2ger: customSections takes a USVString, which introduces the replacement character; the "UTF-8 decode without BOM or fail" algorithm is only applied to the bytes in the module, which contain an explicit U+FFFD from the emit_string call.

You can see that here:

An ECMAScript value V is converted to an IDL USVString value by running the following algorithm:

  1. Let string be the result of converting V to a DOMString.
  2. Return an IDL USVString value that is the result of converting string to a sequence of Unicode scalar values.

"Converting a string..." links to here, which replaces U+DC01:

0xDC00 ≤ c ≤ 0xDFFF
Append to U a U+FFFD REPLACEMENT CHARACTER.

It's worth mentioning that ECMAScript doesn't have this restriction -- you can use '\udc01' and it is not converted to '\ufffd'.
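This is easy to observe directly (a sketch assuming a JS console and the standard TextEncoder, which applies the same scalar-value conversion):

const s = "na\uDC01me";
s.charCodeAt(2).toString(16); // "dc01", the lone surrogate survives in the plain JS string
new TextEncoder().encode(s);  // Uint8Array [110, 97, 239, 191, 189, 109, 101], i.e. U+FFFD's bytes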

So I think there actually might be two issues here:

  1. The WebAssembly JS API spec is defined using WebIDL, so it inherits restrictions from WebIDL that perhaps should not apply to JavaScript. In particular, '\udc01' is converted to '\ufffd' at the API surface of WebAssembly.Module.customSections because it uses USVString.
  2. customSections uses the WHATWG UTF-8 decoder, which is more restrictive than the one defined in the WebAssembly core spec. If the wasm module has a custom section with an unmatched trailing surrogate (U+DC00 through U+DFFF), the module will be valid according to the core spec, but will fail in the "UTF-8 decode without BOM or fail" algorithm, and thus the assertion "Assert: name is not failure (moduleObject.[[Module]] is valid)." will be false (see the sketch after this list).
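To make issue 2 concrete, a sketch assuming the standard TextDecoder, whose { fatal: true } mode corresponds to "UTF-8 decode without BOM or fail":

// ED B0 81 is the surrogate-style byte sequence for U+DC01; the core
// spec as written at the time accepted it, but a fatal WHATWG decode throws.
try {
  new TextDecoder("utf-8", { fatal: true }).decode(new Uint8Array([0xED, 0xB0, 0x81]));
} catch (e) {
  e.name; // "TypeError"
}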

@binji
Member

binji commented Nov 27, 2018

@littledan @titzer thoughts? Maybe we're OK with issue 1, but I think we should probably address issue 2.

The simplest solution would be to skip any custom section whose name can't be decoded using the WHATWG decoder. That would mean there would be sections you couldn't access via JavaScript (without decoding the module on your own). Another solution would be to restrict the core spec to make surrogate code points invalid. A third solution would be to rewrite the JS API spec to allow an unmatched trailing surrogate.

@Ms2ger
Collaborator

Ms2ger commented Nov 28, 2018

If the wasm module has a custom section with an unmatched trailing surrogate (U+DC00 through U+DFFF), the module will be valid according to the core spec

This seems like a very poor idea in the first place. Can we disallow that in the core spec instead?

@rossberg
Member

I think it was an oversight on my part that the core spec does not currently rule out the surrogate space. It should, and the reference interpreter indeed flags any such code point as an error. I think we also have various tests for that. (Note that not just unmatched surrogates are illegal; any occurrence of a code point in the surrogate space is illegal as the result of UTF-8 decoding.)

I prepared a fix for the core spec.

As for replacements, it is very much intentional that no decoding or interpretation is needed to compare Wasm names. That was the basis of the agreement to require UTF-8 in the first place. So for the core spec the choice to not do any replacements is working as intended. Note that JS is not the only customer of Wasm names.
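The byte-level comparison this relies on is trivial (a sketch; namesEqual is a hypothetical helper, not spec text):

// Names in the module are raw UTF-8 byte sequences, so equality is plain
// byte equality and no replacement or normalization can ever kick in.
function namesEqual(a /* Uint8Array */, b /* Uint8Array */) {
  return a.length === b.length && a.every((byte, i) => byte === b[i]);
}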

It sounds neither wise nor necessary for the JS API to specify something different regarding replacements. I don't think that was the case originally, and maybe that should be fixed?

@Ms2ger
Collaborator

Ms2ger commented Nov 28, 2018

If I understand correctly, when matching, we've got the name (which is conceptually a sequence of Unicode scalar values, and could be stored as UTF-8 or UTF-16), and the argument, which is a JS string (so UTF-16 plus unpaired surrogates). I can think of three reasonable ways to compare right now:

  1. convertToJSString(name) == argument (so an argument with unpaired surrogates will never match)
  2. name == convertToUnicodeScalarValues(argument, ReplaceUnpairedSurrogates) (this is what the spec currently does)
  3. name == convertToUnicodeScalarValues(argument, ThrowOnUnpairedSurrogates)

I suppose the first one is cheapest if we assume JS engines will decode the names to UTF-16 ahead of time anyway, but I'm not entirely sure it's the most helpful.

(I have no strong personal preference, though.)
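A sketch contrasting options 1 and 2, reusing the hypothetical toUSVString helper sketched earlier (and assuming the name has already been decoded to a JS string):

const name = "na\uFFFDme"; // decoded section name containing a literal U+FFFD
// Option 1: plain JS string comparison; lone surrogates never match.
name === "na\uDC01me";              // false
// Option 2 (what the spec currently does): replace lone surrogates first.
name === toUSVString("na\uDC01me"); // true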

@rossberg
Member

So the case in question is the one where the user-provided argument is invalid UTF-16?

In that case, it seems like a bad idea to make that magically succeed on some accidental occasions (option 2).

Invalid UTF-16 is a reality of JavaScript strings; of all the string-consuming functions out there, it shouldn't suddenly become the Wasm API's job to treat it as a malformed argument (option 3).

So option 1 is the natural choice IMO.

@xtuc
Contributor Author

xtuc commented Dec 3, 2018

I agree with option one.

FYI SpiderMonkey has the same issue, and I would expect all the other engines to use the UTF-8 decoder "for the web", which is the one specified by WHATWG.

@littledan
Collaborator

I'm happy with the solution this thread ended up on. With #923 merged, is this issue ready to close?

@xtuc
Contributor Author

xtuc commented Dec 5, 2018

@littledan The WHATWG algorithm still uses

Set UTF-8 lower boundary to 0x80 and UTF-8 upper boundary to 0xBF.

which means that it will truncate some code points compared to the WebAssembly decoder.
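For reference, the effect of those boundary rules can be seen with the standard TextDecoder (a sketch): after a 0xED lead byte the upper boundary is lowered to 0x9F, which is exactly what rejects surrogate-range sequences.

// The would-be three-byte encoding of U+D800 fails the boundary check.
new TextDecoder("utf-8", { fatal: true })
  .decode(new Uint8Array([0xED, 0xA0, 0x80])); // throws TypeError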

@littledan
Collaborator

@xtuc Which kinds of strings do you think will be truncated in one algorithm and valid in another?

@binji
Member

binji commented Dec 5, 2018

To be clear, the surrogates aren't truncated, they are replaced with U+FFFD (see #915 (comment))

@littledan
Collaborator

I don't think lower/upper boundaries result in truncation either.

@rossberg
Member

rossberg commented Dec 6, 2018

Doh, @xtuc is right, there was another side condition missing in the Wasm spec, see #928. Without it, the utf8 function would not actually be a function.

@xtuc
Contributor Author

xtuc commented Dec 6, 2018

If I understand correctly, the current Wasm decoder would allow much higher code points than the WHATWG one does, and these would be truncated (or clamped to the upper boundary) when going to JS.

@rossberg is the goal to comply with the WHATWG decoder? Since the WebAssembly UTF-8 decoder is referenced elsewhere, this seems like a restriction on Wasm usage.

I agree with @Ms2ger's first option, which doesn't require such a spec change, only changes around the custom sections in the js-api.

@rossberg
Member

rossberg commented Dec 6, 2018

The Wasm spec clearly allows all legal code points up to U+110000, as specified by Unicode. I think the same is true for the algorithm given in the WHATWG spec, though it's much more difficult to deduce (you have to look at the table in Unicode section 3.9 to understand why this algorithm makes sense).
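The agreement at the top of the range is easy to check (a sketch assuming the standard TextDecoder):

const dec = new TextDecoder("utf-8", { fatal: true });
dec.decode(new Uint8Array([0xF4, 0x8F, 0xBF, 0xBF])); // "\u{10FFFF}", the highest scalar value
dec.decode(new Uint8Array([0xF4, 0x90, 0x80, 0x80])); // throws, this would be U+110000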

@xtuc
Contributor Author

xtuc commented Jan 17, 2019

It took me some time to understand, and after a chat with @Ms2ger, here are the changes needed:

  • Switching from USVString to DOMString in the WebIDL specification.
    • The result of using the WHATWG UTF-8 decoder is a USVString (which corresponds to a WHATWG scalar value string and does not contain surrogates). When doing the comparison between the WebAssembly name (USVString) and the JavaScript argument (DOMString), the name needs to be upcast into a DOMString.
  • The current test

    assert_sections(WebAssembly.Module.customSections(module, "na\uFFFDme"), [
      bytes,
    ]);
    assert_sections(WebAssembly.Module.customSections(module, "na\uDC01me"), [
      bytes,
    ]);

    won't match anymore, since the surrogates in the JavaScript argument (DOMString) are no longer normalized.

@Ms2ger
Collaborator

Ms2ger commented Jan 17, 2019

The first check with the explicit U+FFFD will still match, but the second call with the surrogate should indeed start returning an empty array.
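Concretely (a sketch assuming module contains a single custom section whose name bytes decode to "na\uFFFDme", as in the test above):

// The U+FFFD in the name comes from the module bytes themselves, so a
// literal U+FFFD argument still matches, while a lone surrogate does not.
WebAssembly.Module.customSections(module, "na\uFFFDme"); // [bytes]
WebAssembly.Module.customSections(module, "na\uDC01me"); // []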

@xtuc
Contributor Author

xtuc commented Jan 17, 2019

Yes, what I mean by "normalizing surrogates" is replacing them with U+FFFD.

@rossberg
Member

  • Switching from USVString to DOMString in the WebIDL specification.

To clarify, does this apply only to the parameter of customSections, or are you also talking about the descriptor records?

@xtuc
Contributor Author

xtuc commented Jan 17, 2019

@rossberg I would like to switch every USVString to a DOMString, otherwise it theoretically won't match when the argument contains non-normalized surrogates. I expect all the JavaScript engines to use a DOMString anyway.

Edit:
See @Ms2ger's review on #947.

brn pushed a commit to brn/v8 that referenced this issue Jan 30, 2019
Enables WebAssembly's js-api module/customSection. The specification has
been updated; see WebAssembly/spec#915. V8 was
already using DOMString.

Bug: v8:8633
Change-Id: I4c3e93c21594dbba84b3697e7e85069c3ff8b441
Cq-Include-Trybots: luci.chromium.try:linux-blink-rel
Reviewed-on: https://chromium-review.googlesource.com/c/1415554
Commit-Queue: Sven Sauleau <[email protected]>
Reviewed-by: Adam Klein <[email protected]>
Cr-Commit-Position: refs/heads/master@{#59182}