Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[encoding] Tests for BOM detection. #22276

Merged
merged 4 commits into from
Mar 19, 2020
Merged

[encoding] Tests for BOM detection. #22276

merged 4 commits into from
Mar 19, 2020

Conversation

andreubotella
Copy link
Member

This change tests for whatwg/html#5359, but since I didn't see any tests for whether the BOM influences the decoding at all, I added one as well.

Copy link
Member

@inexorabletash inexorabletash left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks!

Copy link
Member

@inexorabletash inexorabletash left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe to be completely pedantic have the comment specify that it starts with a UTF-8 BOM.

(Also.... is a comment before the doctype declaration valid? Not sure offhand...)

Copy link
Member

@annevk annevk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if this works, but I'd prefer newlines at the end of files.

encoding/bom-handling.html Outdated Show resolved Hide resolved
encoding/bom-handling.html.headers Outdated Show resolved Hide resolved
Applying suggestions from code review.

Co-Authored-By: Anne van Kesteren <[email protected]>
@annevk
Copy link
Member

annevk commented Mar 18, 2020

Thanks! I pinged @hsivonen since it seems somewhat suspect there would not be coverage for this. So would like to wait for that a bit, but do let me know if it's taking too long.

@hsivonen
Copy link
Member

I believe we do have a separate test for testing that UTF-8 isn't heuristically detected. Still, I think this test would be more reliable if the only non-ASCII bytes were the BOM. I suggest removing the Japanese text from the test.

If you want to actually test decoding in addition to what document.characterSet reports, I suggest testing, in a separate test file, one (and only one) of the German sequences that ced deliberately doesn't score towards UTF-8.

@hsivonen
Copy link
Member

(Of course, if the BOM was ignored, the meta should take precedence over heuristic sniffing, but let's still avoid making the test too easy to pass for the wrong reason.)

@hsivonen
Copy link
Member

it seems somewhat suspect there would not be coverage for this

It indeed seem the encoding/ directory does not have a test for this. The existing BOM tests are for checking that the right number of BOMs is removed.

@andreubotella
Copy link
Member Author

Since this test is for whatwg/html#5359, which fixes the long-standing spec bug where the document's character encoding doesn't take the BOM into account, I thought it made sense to test both the decoding and document.characterSet. But I guess the chances that some implementation will get the document's character encoding right but the actual decoding wrong are so low that there's no need to test for detection here.

@hsivonen
Copy link
Member

Looks good to me. Thanks!

@annevk annevk merged commit 7d9b5a5 into web-platform-tests:master Mar 19, 2020
@andreubotella andreubotella deleted the bom-handling branch March 19, 2020 11:24
annevk pushed a commit to whatwg/html that referenced this pull request Mar 24, 2020
This change fixes a bug where document's character encoding was set to the return value of the encoding sniffing algorithm rather than to the actual encoding used, which differed when the stream started with a byte order mark. This change incorporates BOM sniffing into the encoding sniffing algorithm, ensuring both encodings are identical.

Tests: web-platform-tests/wpt#22276.

Closes #1077.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants