
[jsapi] inconsistent utf-8 decoding #915

Closed
xtuc opened this issue Nov 20, 2018 · 20 comments

@xtuc
Contributor

xtuc commented Nov 20, 2018

Two UTF-8 decoders are specified:

From the WHATWG algorithm, the rules

  1. Set UTF-8 lower boundary to 0x80 and UTF-8 upper boundary to 0xBF.
  2. Set UTF-8 code point to (UTF-8 code point << 6) | (byte & 0x3F)

are not used in Wasm.

My understanding is that U+DC01 and U+FFFD should be treated as equal in the JS API, as tested here:

assert_sections(WebAssembly.Module.customSections(module, "na\uFFFDme"), [
  bytes,
]);
assert_sections(WebAssembly.Module.customSections(module, "na\uDC01me"), [
  bytes,
]);

while in Wasm they would be considered two different sections, which could cause subtle mismatches.

Note that this is the only occurrence of UTF-8 decoding in the JS spec.
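A minimal sketch of the WebIDL USVString conversion that produces this equivalence (toUSVString is a hypothetical helper name for illustration, not part of any spec):

function toUSVString(s) {
  let out = "";
  for (let i = 0; i < s.length; i++) {
    const c = s.charCodeAt(i);
    const next = s.charCodeAt(i + 1); // NaN at the end of the string
    if (c >= 0xd800 && c <= 0xdbff && next >= 0xdc00 && next <= 0xdfff) {
      out += s[i] + s[i + 1]; // well-formed surrogate pair, kept as-is
      i++;
    } else if (c >= 0xd800 && c <= 0xdfff) {
      out += "\ufffd"; // lone surrogate is replaced
    } else {
      out += s[i];
    }
  }
  return out;
}

toUSVString("na\uDC01me") === "na\uFFFDme"; // true, hence the identical results above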

@binji
Member

binji commented Nov 27, 2018

I brought up something similar when I was reviewing the JS API tests here: #883 (comment)

@binji: This is a bit surprising to me, since the customSections algorithm says it uses UTF-8 decode without BOM or fail, which doesn't seem to convert the 0xdc01 to a 0xfffd (though I see that elsewhere in the spec). Does the replacement happen somewhere else?

@Ms2ger: customSections takes a USVString, which introduces the replacement character; the "UTF-8 decode without BOM or fail" algorithm is only applied to the bytes in the module, which contain an explicit U+FFFD from the emit_string call.

You can see that here:

An ECMAScript value V is converted to an IDL USVString value by running the following algorithm:

  1. Let string be the result of converting V to a DOMString.
  2. Return an IDL USVString value that is the result of converting string to a sequence of Unicode scalar values.

"Converting a string..." links to here, which replaces U+DC01:

0xDC00 ≤ c ≤ 0xDFFF
Append to U a U+FFFD REPLACEMENT CHARACTER.

It's worth mentioning that ECMAScript doesn't have this restriction -- you can use '\udc01' and it is not converted to '\ufffd'.
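This is easy to observe directly (a sketch assuming a JS console and the standard TextEncoder, which applies the same scalar-value conversion):

const s = "na\uDC01me";
s.charCodeAt(2).toString(16); // "dc01", the lone surrogate survives in the plain JS string
new TextEncoder().encode(s);  // Uint8Array [110, 97, 239, 191, 189, 109, 101], i.e. U+FFFD's bytes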

So I think there actually might be two issues here:

  1. The WebAssembly JS API spec is defined using WebIDL, so it inherits restrictions from WebIDL that perhaps should not apply to JavaScript. In particular, '\udc01' is converted to '\ufffd' at the API surface of WebAssembly.Module.customSections because it uses USVString.
  2. customSections uses the WHATWG UTF-8 decoder, which is more restrictive than the one defined in the WebAssembly core spec. If the wasm module has a custom section with an unmatched trailing surrogate (U+DC00 through U+DFFF), the module will be valid according to the core spec, but will fail in the "UTF-8 decode without BOM or fail" algorithm, and thus the assertion "Assert: name is not failure (moduleObject.[[Module]] is valid)." will be false (see the sketch after this list).
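To make issue 2 concrete, a sketch assuming the standard TextDecoder, whose { fatal: true } mode corresponds to "UTF-8 decode without BOM or fail":

// ED B0 81 is the surrogate-style byte sequence for U+DC01; the core
// spec as written at the time accepted it, but a fatal WHATWG decode throws.
try {
  new TextDecoder("utf-8", { fatal: true }).decode(new Uint8Array([0xED, 0xB0, 0x81]));
} catch (e) {
  e.name; // "TypeError"
}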

@binji
Member

binji commented Nov 27, 2018

@littledan @titzer thoughts? Maybe we're OK with issue 1, but I think we should probably address issue 2.

The simplest solution would be to skip any custom section whose name can't be decoded using the WHATWG decoder. That would mean there would be sections you couldn't access via JavaScript (without decoding the module on your own). Another solution would be to restrict the core spec to make surrogate code points invalid. A third solution would be to rewrite the JS API spec to allow an unmatched trailing surrogate.

@Ms2ger
Collaborator

Ms2ger commented Nov 28, 2018

If the wasm module has a custom section with an unmatched trailing surrogate (U+DC00 through U+DFFF), the module will be valid according to the core spec

This seems like a very poor idea in the first place. Can we disallow that in the core spec instead?

@rossberg
Member

I think it was an oversight on my part that the core spec does not currently rule out the surrogate space. It should, and the reference interpreter indeed flags any such code point as an error. I think we also have various tests for that. (Note that not just unmatched surrogates are illegal; any occurrence of a code point in the surrogate space is illegal as the result of UTF-8 decoding.)

I prepared a fix for the core spec.

As for replacements, it is very much intentional that no decoding or interpretation is needed to compare Wasm names. That was the basis of the agreement to require UTF-8 in the first place. So for the core spec the choice to not do any replacements is working as intended. Note that JS is not the only customer of Wasm names.
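The byte-level comparison this relies on is trivial (a sketch; namesEqual is a hypothetical helper, not spec text):

// Names in the module are raw UTF-8 byte sequences, so equality is plain
// byte equality and no replacement or normalization can ever kick in.
function namesEqual(a /* Uint8Array */, b /* Uint8Array */) {
  return a.length === b.length && a.every((byte, i) => byte === b[i]);
}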

It sounds neither wise nor necessary for the JS API to specify something different regarding replacements. I don't think that was the case originally, and maybe that should be fixed?

@Ms2ger
Collaborator

Ms2ger commented Nov 28, 2018

If I understand correctly, when matching, we've got the name (which is conceptually a sequence of Unicode scalar values, and could be stored as UTF-8 or UTF-16), and the argument, which is a JS string (so UTF-16 plus unpaired surrogates). I can think of three reasonable ways to compare right now:

  1. convertToJSString(name) == argument (so an argument with unpaired surrogates will never match)
  2. name == convertToUnicodeScalarValues(argument, ReplaceUnpairedSurrogates) (this is what the spec currently does)
  3. name == convertToUnicodeScalarValues(argument, ThrowOnUnpairedSurrogates)

I suppose the first one is cheapest if we assume JS engines will decode the names to UTF-16 ahead of time anyway, but I'm not entirely sure it's the most helpful.

(I have no strong personal preference, though.)
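A sketch contrasting options 1 and 2, reusing the hypothetical toUSVString helper sketched earlier (and assuming the name has already been decoded to a JS string):

const name = "na\uFFFDme"; // decoded section name containing a literal U+FFFD
// Option 1: plain JS string comparison; lone surrogates never match.
name === "na\uDC01me";              // false
// Option 2 (what the spec currently does): replace lone surrogates first.
name === toUSVString("na\uDC01me"); // true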

@rossberg
Member

So the case in question is the one where the user-provided argument is invalid UTF-16?

In that case, it seems like a bad idea to make that magically succeed on some accidental occasions (option 2).

Invalid UTF-16 is a reality of JavaScript strings; of all the string-consuming functions out there, it shouldn't suddenly become the Wasm API's job to treat it as a malformed argument (option 3).

So option 1 is the natural choice IMO.

@xtuc
Contributor Author

xtuc commented Dec 3, 2018

I agree with option one.

FYI SpiderMonkey has the same issue, and I would expect all the other engines to use the UTF-8 decoder "for the web", which is the one specified by WHATWG.

@littledan
Collaborator

I'm happy with the solution this thread ended up on. With #923 merged, is this issue ready to close?

@xtuc
Contributor Author

xtuc commented Dec 5, 2018

@littledan The WHATWG algorithm still uses

Set UTF-8 lower boundary to 0x80 and UTF-8 upper boundary to 0xBF.

which means that it will truncate some code points compared to the WebAssembly decoder.
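For reference, the effect of those boundary rules can be seen with the standard TextDecoder (a sketch): after a 0xED lead byte the upper boundary is lowered to 0x9F, which is exactly what rejects surrogate-range sequences.

// The would-be three-byte encoding of U+D800 fails the boundary check.
new TextDecoder("utf-8", { fatal: true })
  .decode(new Uint8Array([0xED, 0xA0, 0x80])); // throws TypeError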

@littledan
Collaborator

@xtuc Which kinds of strings do you think will be truncated in one algorithm and valid in another?

@binji
Member

binji commented Dec 5, 2018

To be clear, the surrogates aren't truncated, they are replaced with U+FFFD (see #915 (comment))

@littledan
Collaborator

I don't think lower/upper boundaries result in truncation either.

@rossberg
Member

rossberg commented Dec 6, 2018

Doh, @xtuc is right, there was another side condition missing in the Wasm spec, see #928. Without it, the utf8 function would not actually be a function.

@xtuc
Contributor Author

xtuc commented Dec 6, 2018

If I understand correctly, the current Wasm decoder would allow much higher code points than the WHATWG one does, and these would be truncated (or clamped to the upper boundary) when going to JS.

@rossberg is the goal to comply with the WHATWG decoder? Since the WebAssembly UTF-8 decoder is referenced elsewhere, this seems like a restriction on Wasm usage.

I agree with @Ms2ger's first option, which doesn't require such a spec change, only changes around the custom sections in the js-api.

@rossberg
Member

rossberg commented Dec 6, 2018

The Wasm spec clearly allows all legal code points up to U+110000, as specified by Unicode. I think the same is true for the algorithm given in the WHATWG spec, though it's much more difficult to deduce (you have to look at the table in Unicode section 3.9 to understand why this algorithm makes sense).
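The agreement at the top of the range is easy to check (a sketch assuming the standard TextDecoder):

const dec = new TextDecoder("utf-8", { fatal: true });
dec.decode(new Uint8Array([0xF4, 0x8F, 0xBF, 0xBF])); // "\u{10FFFF}", the highest scalar value
dec.decode(new Uint8Array([0xF4, 0x90, 0x80, 0x80])); // throws, this would be U+110000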

@xtuc
Contributor Author

xtuc commented Jan 17, 2019

It took me some time to understand, and after a chat with @Ms2ger, here are the changes needed:

  • Switching from USVString to DOMString in the WebIDL specification.
    • The result of using the WHATWG UTF-8 decoder is a USVString (which corresponds to a WHATWG scalar value string and does not contain surrogates). When doing the comparison between the WebAssembly name (USVString) and the JavaScript argument (DOMString), the name needs to be upcast into a DOMString.
  • The current test

    assert_sections(WebAssembly.Module.customSections(module, "na\uFFFDme"), [
      bytes,
    ]);
    assert_sections(WebAssembly.Module.customSections(module, "na\uDC01me"), [
      bytes,
    ]);

    won't match anymore, since the surrogates in the JavaScript argument (DOMString) are no longer normalized.

@Ms2ger
Collaborator

Ms2ger commented Jan 17, 2019

The first check with the explicit U+FFFD will still match, but the second call with the surrogate should indeed start returning an empty array.
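Concretely (a sketch assuming module contains a single custom section whose name bytes decode to "na\uFFFDme", as in the test above):

// The U+FFFD in the name comes from the module bytes themselves, so a
// literal U+FFFD argument still matches, while a lone surrogate does not.
WebAssembly.Module.customSections(module, "na\uFFFDme"); // [bytes]
WebAssembly.Module.customSections(module, "na\uDC01me"); // []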

@xtuc
Contributor Author

xtuc commented Jan 17, 2019

Yes, what I mean by "normalizing surrogates" is replacing them with U+FFFD.

@rossberg
Member

  • Switching from USVString to DOMString in the WebIDL specification.

To clarify, does this apply only to the parameter of customSections, or are you also talking about the descriptor records?

@xtuc
Contributor Author

xtuc commented Jan 17, 2019

@rossberg I would like to switch every USVString to a DOMString, otherwise it theoretically won't match when the argument contains non-normalized surrogates. I expect all the JavaScript engines to use a DOMString anyway.

Edit:
See @Ms2ger's review on #947.

brn pushed a commit to brn/v8 that referenced this issue Jan 30, 2019
Enables WebAssembly's js-api module/customSection. The specification has
been updated; see WebAssembly/spec#915. V8 was
already using DOMString.

Bug: v8:8633
Change-Id: I4c3e93c21594dbba84b3697e7e85069c3ff8b441
Cq-Include-Trybots: luci.chromium.try:linux-blink-rel
Reviewed-on: https://chromium-review.googlesource.com/c/1415554
Commit-Queue: Sven Sauleau <[email protected]>
Reviewed-by: Adam Klein <[email protected]>
Cr-Commit-Position: refs/heads/master@{#59182}