The document's character encoding must immediately be set to the value returned from this algorithm, at the same time as the user agent uses the returned value to select the decoder to use for the input byte stream.
Usually, the encoding sniffing algorithm defined below is used to determine the character encoding.
Given a character encoding, the bytes in the input byte stream must be converted to characters for the tokenizer's input stream, by passing the input byte stream and character encoding to decode.
So per spec, if you have a page with, say, Content-Type: text/html;charset=windows-1252 whose first few bytes are a UTF-8 BOM:
The encoding sniffing algorithm detects windows-1252 as "the character encoding".
It passes the input byte stream and windows-1252 to the Encoding Standard's decode algorithm, which finds the BOM and decodes as UTF-8 instead.
However, per the first quote, the document's character encoding is immediately set to windows-1252, not UTF-8, since the BOM override is hidden inside the decode algorithm and invisible to the HTML spec.
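To make the mismatch concrete, here is a minimal Python sketch of the sequence described above. It is not spec text: `bom_sniff`, `decode`, and `document_encoding` are hypothetical names, Python's codec names stand in for the Encoding Standard's labels, and Python's windows-1252 codec is only an approximation of the one the Encoding Standard defines.

```python
# Hypothetical model of the flow described above; not the spec's algorithms.
BOMS = (
    (b"\xef\xbb\xbf", "utf-8"),
    (b"\xfe\xff", "utf-16-be"),
    (b"\xff\xfe", "utf-16-le"),
)


def bom_sniff(data: bytes):
    """Return (encoding, bom_length) if data starts with a known BOM."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


def decode(data: bytes, fallback: str) -> str:
    """Stand-in for the Encoding Standard's decode: a BOM overrides the
    fallback encoding, but only the decoded text is returned."""
    sniffed, bom_len = bom_sniff(data)
    return data[bom_len:].decode(sniffed or fallback, errors="replace")


# Response body: UTF-8 BOM followed by UTF-8 text, served with
# Content-Type: text/html;charset=windows-1252
body = b"\xef\xbb\xbf" + "café".encode("utf-8")

sniffed_encoding = "windows-1252"      # result of encoding sniffing
document_encoding = sniffed_encoding   # set "immediately", per the first quote
text = decode(body, sniffed_encoding)  # the BOM silently switches to UTF-8

print(document_encoding, repr(text))
```

In this model the text is decoded as UTF-8, yet `document_encoding` still records windows-1252, because `decode` has no way to report the override to its caller.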
One way of fixing this is to make the decode algorithm return both an output stream and an encoding, but I guess that will involve updating a lot of call sites, and is fairly inelegant. Another is to have a special operation that HTML uses instead of decode, which returns those two things. Maybe there is something better.
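As a rough illustration of the second option, the sketch below shows a decode-like operation that returns both the text and the encoding actually used. The name `decode_reporting_encoding` and the `DecodeResult` type are assumptions made for this example, not anything defined by the HTML or Encoding standards.

```python
from typing import NamedTuple

# Hypothetical BOM table; Python codec names stand in for spec labels.
BOMS = (
    (b"\xef\xbb\xbf", "utf-8"),
    (b"\xfe\xff", "utf-16-be"),
    (b"\xff\xfe", "utf-16-le"),
)


class DecodeResult(NamedTuple):
    text: str
    encoding: str  # the encoding that was actually used


def decode_reporting_encoding(data: bytes, fallback: str) -> DecodeResult:
    """Like decode, but also reports which encoding won (BOM or fallback)."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return DecodeResult(data[len(bom):].decode(name, errors="replace"), name)
    return DecodeResult(data.decode(fallback, errors="replace"), fallback)


body = b"\xef\xbb\xbf" + "café".encode("utf-8")
result = decode_reporting_encoding(body, "windows-1252")

# The caller can now record the encoding that was really used.
document_encoding = result.encoding
print(document_encoding, repr(result.text))
```

With this shape, only HTML's call site changes, and existing callers of decode are left alone, which is the trade-off the second option is aiming for.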
(This is a spec factoring issue, mostly.)
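The only place in the spec that I can find that sets the document's character encoding is https://html.spec.whatwg.org/#determining-the-character-encoding:document's-character-encoding-3, which is the first passage quoted above ("The document's character encoding must immediately be set to the value returned from this algorithm...").
Here, "this algorithm" is the "encoding sniffing algorithm". However, this algorithm doesn't deal with BOMs. BOMs are dealt with later, in https://html.spec.whatwg.org/#the-input-byte-stream:encoding-sniffing-algorithm, which is where the other two quoted passages come from ("Usually, the encoding sniffing algorithm..." and "Given a character encoding...").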