[encoding] Tests for BOM detection. #22276
LGTM, thanks!
Maybe to be completely pedantic have the comment specify that it starts with a UTF-8 BOM.
(Also... is a comment before the doctype declaration valid? Not sure offhand...)
Not sure if this works, but I'd prefer newlines at the end of files.
Applying suggestions from code review. Co-Authored-By: Anne van Kesteren <[email protected]>
Thanks! I pinged @hsivonen since it seems somewhat suspect that there would not be coverage for this. So I would like to wait for that a bit, but do let me know if it's taking too long.
I believe we do have a separate test for testing that UTF-8 isn't heuristically detected. Still, I think this test would be more reliable if the only non-ASCII bytes were the BOM. I suggest removing the Japanese text from the test. If you want to actually test decoding in addition to what |
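The suggestion above (keep the BOM as the only non-ASCII bytes in the test file) can be sketched roughly as follows. This is a minimal illustration, not the file from the PR: `build_bom_test_page` and `non_ascii_offsets` are hypothetical helper names, and the markup is placeholder content.

```python
from typing import List

# The three bytes of the UTF-8 byte order mark.
UTF8_BOM = b"\xef\xbb\xbf"


def build_bom_test_page(title: str = "BOM detection") -> bytes:
    """Return a tiny HTML page whose only non-ASCII bytes are the UTF-8 BOM.

    Hypothetical helper for illustration; the markup is a placeholder.
    """
    html = "<!DOCTYPE html>\n<title>{}</title>\n".format(title)
    # Encoding as ASCII raises UnicodeEncodeError if any non-ASCII
    # character sneaks into the markup, which enforces the constraint.
    return UTF8_BOM + html.encode("ascii")


def non_ascii_offsets(data: bytes) -> List[int]:
    """Return the offsets of all bytes outside the ASCII range."""
    return [i for i, b in enumerate(data) if b >= 0x80]


page = build_bom_test_page()
print(non_ascii_offsets(page))
```

With only the BOM outside the ASCII range, heuristic UTF-8 detection has nothing else to key on, so a correct result can only come from the BOM itself.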
(Of course, if the BOM was ignored, the |
It indeed seems the |
Since this test is for whatwg/html#5359, which fixes the long-standing spec bug where the document's character encoding doesn't take the BOM into account, I thought it made sense to test both the decoding and |
Looks good to me. Thanks!
This change fixes a bug where the document's character encoding was set to the return value of the encoding sniffing algorithm rather than to the encoding actually used, which differed when the stream started with a byte order mark. This change incorporates BOM sniffing into the encoding sniffing algorithm, ensuring that both encodings are identical. Tests: web-platform-tests/wpt#22276. Closes #1077.
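As a rough illustration of what running BOM sniffing as the first step of encoding sniffing means, here is a minimal sketch. The three byte patterns are the standard UTF-8 and UTF-16 byte order marks; the function names are hypothetical, and the `fallback` parameter is an assumption that stands in for every later step of the real algorithm (transport-layer information, `<meta>` prescan, and so on).

```python
from typing import Optional


def bom_sniff(data: bytes) -> Optional[str]:
    """Return the encoding indicated by a leading BOM, or None if absent."""
    if data.startswith(b"\xef\xbb\xbf"):
        return "UTF-8"
    if data.startswith(b"\xfe\xff"):
        return "UTF-16BE"
    if data.startswith(b"\xff\xfe"):
        return "UTF-16LE"
    return None


def sniff_encoding(data: bytes, fallback: str = "windows-1252") -> str:
    """Simplified sketch: a BOM, if present, wins over everything else.

    `fallback` is a placeholder for the result of the remaining sniffing
    steps, which are not modelled here.
    """
    return bom_sniff(data) or fallback


# A BOM overrides whatever the later steps would have chosen.
print(sniff_encoding(b"\xef\xbb\xbf<!DOCTYPE html>", fallback="Shift_JIS"))
```

Because the BOM check happens inside the sniffing function, the value it returns is also the encoding used for decoding, which is the consistency the change describes.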
This change tests for whatwg/html#5359, but since I didn't see any tests for whether the BOM influences the decoding at all, I added one as well.