[encoding] Tests for BOM detection. #22276

andreubotella · 2020-03-16T13:54:36Z

This change tests for whatwg/html#5359, but since I didn't see any tests for whether the BOM influences the decoding at all, I added one as well.

inexorabletash

LGTM, thanks!

inexorabletash

Maybe, to be completely pedantic, have the comment specify that the file starts with a UTF-8 BOM.

~~(Also.... is a comment before the doctype declaration valid? Not sure offhand...)~~

annevk

Not sure if this works, but I'd prefer newlines at the end of files.

encoding/bom-handling.html

encoding/bom-handling.html.headers

Applying suggestions from code review. Co-Authored-By: Anne van Kesteren <[email protected]>

annevk · 2020-03-18T13:59:33Z

Thanks! I pinged @hsivonen since it seems somewhat suspect there would not be coverage for this. So I would like to wait a bit for that, but do let me know if it's taking too long.

hsivonen · 2020-03-18T14:42:49Z

I believe we do have a separate test for testing that UTF-8 isn't heuristically detected. Still, I think this test would be more reliable if the only non-ASCII bytes were the BOM. I suggest removing the Japanese text from the test.

If you want to actually test decoding in addition to what document.characterSet reports, I suggest testing, in a separate test file, one (and only one) of the German sequences that ced deliberately doesn't score towards UTF-8.

hsivonen · 2020-03-18T14:44:10Z

(Of course, if the BOM was ignored, the meta should take precedence over heuristic sniffing, but let's still avoid making the test too easy to pass for the wrong reason.)
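
For illustration, the suggestion above (a file whose only non-ASCII bytes are the BOM, with a conflicting `meta` declaration) could be sketched as follows. This is a hypothetical fixture generator, not the contents of the files in this PR; the `build_fixture` and `non_ascii_offsets` names, the markup, and the `windows-1252` declaration are all assumptions.

```python
import codecs


def build_fixture(declared="windows-1252"):
    """Return bytes for a hypothetical test page: UTF-8 BOM + ASCII-only HTML.

    The meta declaration deliberately disagrees with the BOM, so a browser
    that ignored the BOM would report the declared encoding instead.
    """
    body = (
        "<!doctype html>\n"
        f'<meta charset="{declared}">\n'
        "<title>BOM handling</title>\n"
    )
    # Encoding as ASCII raises UnicodeEncodeError if a non-ASCII
    # character ever slips into the markup.
    return codecs.BOM_UTF8 + body.encode("ascii")


def non_ascii_offsets(data):
    """Return the offsets of all bytes outside the ASCII range."""
    return [i for i, b in enumerate(data) if b >= 0x80]


fixture = build_fixture()
```

With this layout, `non_ascii_offsets(fixture)` covers only the three BOM bytes, so heuristic detection has nothing else to score towards UTF-8.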

hsivonen · 2020-03-18T14:53:57Z

> it seems somewhat suspect there would not be coverage for this

It indeed seems the encoding/ directory does not have a test for this. The existing BOM tests are for checking that the right number of BOMs is removed.

andreubotella · 2020-03-18T16:11:31Z

Since this test is for whatwg/html#5359, which fixes the long-standing spec bug where the document's character encoding doesn't take the BOM into account, I thought it made sense to test both the decoding and document.characterSet. But I guess the chances that some implementation will get the document's character encoding right but the actual decoding wrong are so low that there's no need to test for detection here.

hsivonen · 2020-03-19T10:55:28Z

Looks good to me. Thanks!

This change fixes a bug where document's character encoding was set to the return value of the encoding sniffing algorithm rather than to the actual encoding used, which differed when the stream started with a byte order mark. This change incorporates BOM sniffing into the encoding sniffing algorithm, ensuring both encodings are identical. Tests: web-platform-tests/wpt#22276. Closes #1077.
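
The behaviour described in that commit message can be sketched roughly as below. This is a simplified model under stated assumptions, not the HTML Standard's algorithm: the real encoding sniffing algorithm has further steps (transport-layer charset, `meta` prescan, heuristics) that are collapsed here into a single `declared` argument, and the function names and the `windows-1252` default are hypothetical.

```python
def bom_sniff(data):
    """Return the encoding indicated by a leading byte order mark, or None."""
    if data.startswith(b"\xef\xbb\xbf"):
        return "UTF-8"
    if data.startswith(b"\xfe\xff"):
        return "UTF-16BE"
    if data.startswith(b"\xff\xfe"):
        return "UTF-16LE"
    return None


def sniff_encoding(data, declared=None, default="windows-1252"):
    """Pick the document's encoding, giving a BOM the highest precedence.

    `declared` stands in for every non-BOM source (headers, meta, and so on).
    Because the BOM is consulted inside the sniffing step, the value returned
    here is also the encoding that would be used for decoding.
    """
    return bom_sniff(data) or declared or default
```

For example, `sniff_encoding(b"\xef\xbb\xbf<!doctype html>", declared="windows-1252")` returns `"UTF-8"`, which is the value the test in this PR expects `document.characterSet` to report for a BOM-prefixed file.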

[encoding] Tests for BOM detection.

172d338

wpt-pr-bot added the encoding label Mar 16, 2020

wpt-pr-bot assigned inexorabletash Mar 16, 2020

wpt-pr-bot requested review from annevk and inexorabletash March 16, 2020 13:54

andreubotella mentioned this pull request Mar 16, 2020

Make document's character encoding reflect byte-order marks whatwg/html#5359

Merged

inexorabletash approved these changes Mar 17, 2020

inexorabletash reviewed Mar 17, 2020

Clarifying that the BOM is a UTF-8 BOM.

01cf673

annevk reviewed Mar 18, 2020

Adding newlines at the end of files.

e1c054e

Applying suggestions from code review. Co-Authored-By: Anne van Kesteren <[email protected]>

Don't test for decoding.

d1debad

annevk merged commit 7d9b5a5 into web-platform-tests:master Mar 19, 2020

andreubotella deleted the bom-handling branch March 19, 2020 11:24
