Cleanup around the HTML parser and how it sniffs encodings #1077
Comments
I created a fix for the regression. The other problem is that the Encoding Standard ultimately decides what encoding is used for decoding, but has no means to communicate that back to the caller (the caller just gets a stream of bytes). Not sure how this would be described ideally. @hsivonen @inexorabletash, opinions?
In HTML, decode is used only by the HTML parser and when fetching scripts. In the first case the BOM should correct the document's encoding, but for scripts I don't know whether the BOM affects any properties or slots. Decode is also used in XHR (text response), and again I don't know whether the BOM affects any properties or slots. But more interesting is the document response; I wonder whether the BOM has any effect there, I must find some test.
Re: Encoding - can the "decode" hook just return a tuple, i.e. "return output and encoding"? I see more work in HTML to refer to e.g. the "sniffed encoding" vs. the "used encoding" and to actually assign the latter to the document's character encoding, as @ArkadiuszMichalski mentions. And then update XHR to ignore/use the returned encoding (depending on the results of @ArkadiuszMichalski's test). Or, if we don't want to touch other dependent specs, do the same thing we'd do in a real language: rename "decode" to "detect encoding and decode" and add a "decode" with the old signature. (But that seems obvious, so I feel like I'm missing something.)
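For a concrete sense of what a tuple-returning "decode" could look like: encoding-rs (mentioned further down in the thread) already ships a non-streaming convenience method with roughly this shape. A minimal sketch in Rust, assuming the encoding_rs crate; the windows-1252 label and the input bytes are invented for illustration:

```rust
// A minimal sketch, assuming the encoding_rs crate: its convenience
// `decode` returns the encoding actually used after BOM sniffing,
// which is the tuple shape proposed above.
use encoding_rs::WINDOWS_1252;

fn main() {
    // Bytes labeled windows-1252, but starting with a UTF-8 BOM.
    let bytes = b"\xEF\xBB\xBFcaf\xC3\xA9";
    let (text, actual, had_errors) = WINDOWS_1252.decode(bytes);
    // The BOM overrode the fallback label, and the caller can see that.
    assert_eq!(actual.name(), "UTF-8");
    assert_eq!(&*text, "café");
    assert!(!had_errors);
}
```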
@inexorabletash I guess it is rather obvious; I was just curious whether you had a preference.
Coming back to XHR: the text response is correct; browsers recognize the BOM at the beginning, remove it, and change the encoding. Firefox has a bug when using a UTF-16 BOM (but a UTF-8 BOM works), and a test was created. For the document response (step 5), Firefox and Chrome work the same as for the HTML parser, so the BOM is respected.
FWIW, "encoding in editor" has no effect here. It's not an input to the algorithm since it cannot be determined. |
I think the main problem with the integration of HTML and the Encoding Standard is that the Encoding Standard doesn't accommodate 1) incremental decoding and 2) internal encoding declarations. The Encoding Standard makes BOM sniffing part of "decode" and the HTML Standard does not sniff for the BOM on its own, so spec-wise the UA is supposed to buffer up to 1024 bytes and sniff it for an internal encoding declaration before letting the "decode" step sniff for the BOM. This does not match a reasonable implementation, such as Gecko. Gecko sniffs for the BOM incrementally as bytes arrive (it doesn't even wait for 3 bytes, as the Encoding Standard says, if fewer bytes make a decision possible) and skips sniffing if a BOM was found.

I think the Encoding Standard should cater to the existence of languages like HTML and CSS that support non-BOM internal encoding declarations, and allow the language-specific sniffing algorithms to be plugged in between BOM sniffing and the actual decode phase.

Spec-wise, if you want the Encoding Standard to continue to own BOM sniffing, I suggest extending the current "decode" algorithm so that it takes three inputs: a byte stream, a fallback encoding, and a (potentially null) sniffing state machine for an internal encoding declaration. Then make it output a stream of characters and the encoding actually used. Require that the sniffing state machine be able to operate incrementally, so that it yields an encoding as soon as it has seen enough, and make the "decode" algorithm feed it up to 1024 bytes (as soon as the bytes arrive, not waiting for 1024 bytes first; it would be nice for the BOM sniffing not to buffer 3 bytes first if the first byte already can't be the first byte of a UTF-8 BOM). Then have HTML and CSS specify their calls to decode as passing in a sniffing state machine.

Implementation-wise, I wouldn't make users of a streaming encoding conversion API pass in the state machine, but would make them perform BOM sniffing first, search for an internal encoding declaration second, and then invoke an incremental decoder. FWIW, encoding-rs provides a low-level streaming API for encoders and decoders as well as high-level non-streaming convenience methods. The convenience methods perform BOM sniffing and return the encoding that was actually used.
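A minimal sketch (plain Rust, not spec text; the type names are invented) of the incremental BOM sniffing described above: the sniffer yields a verdict as soon as the bytes seen so far decide the question, rather than always buffering three bytes first.

```rust
// A minimal sketch of incremental BOM sniffing: the verdict is produced
// as soon as the bytes seen so far decide it, instead of always waiting
// for three bytes first.

#[derive(Debug, PartialEq)]
enum BomVerdict {
    Undecided, // need more bytes
    NoBom,     // definitely no BOM
    Utf8,      // EF BB BF
    Utf16Le,   // FF FE
    Utf16Be,   // FE FF
}

struct BomSniffer {
    seen: Vec<u8>, // at most three bytes are ever buffered
}

impl BomSniffer {
    fn new() -> Self {
        BomSniffer { seen: Vec::new() }
    }

    /// Feed one byte; returns the verdict reachable so far.
    fn push(&mut self, byte: u8) -> BomVerdict {
        self.seen.push(byte);
        match self.seen.as_slice() {
            [0xEF] | [0xFF] | [0xFE] => BomVerdict::Undecided,
            [_] => BomVerdict::NoBom, // first byte cannot start any BOM
            [0xFF, 0xFE] => BomVerdict::Utf16Le,
            [0xFE, 0xFF] => BomVerdict::Utf16Be,
            [0xEF, 0xBB] => BomVerdict::Undecided,
            [_, _] => BomVerdict::NoBom, // second byte ruled a BOM out
            [0xEF, 0xBB, 0xBF] => BomVerdict::Utf8,
            _ => BomVerdict::NoBom, // third byte ruled the UTF-8 BOM out
        }
    }
}

fn main() {
    // '<' cannot start a BOM, so a single byte already decides.
    let mut sniffer = BomSniffer::new();
    assert_eq!(sniffer.push(b'<'), BomVerdict::NoBom);

    // A UTF-8 BOM needs all three bytes before the verdict is reached.
    let mut sniffer = BomSniffer::new();
    assert_eq!(sniffer.push(0xEF), BomVerdict::Undecided);
    assert_eq!(sniffer.push(0xBB), BomVerdict::Undecided);
    assert_eq!(sniffer.push(0xBF), BomVerdict::Utf8);
}
```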
I checked recently, and the UTF-8 BOM didn't work. It's one of the Gecko bugs I hope encoding-rs will fix by providing BOM sniffing right in the decoder objects for callers like XHR.
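For reference, a sketch of what BOM sniffing inside a decoder object looks like with encoding_rs's streaming API; the windows-1252 fallback and the input bytes are invented for illustration:

```rust
use encoding_rs::{CoderResult, WINDOWS_1252};

fn main() {
    // A decoder instantiated for windows-1252 with BOM sniffing enabled:
    // a UTF-8 BOM in the stream morphs it into a UTF-8 decoder.
    let mut decoder = WINDOWS_1252.new_decoder();
    let input = b"\xEF\xBB\xBFcaf\xC3\xA9";
    let mut output =
        String::with_capacity(decoder.max_utf8_buffer_length(input.len()).unwrap());
    let (result, bytes_read, had_errors) =
        decoder.decode_to_string(input, &mut output, true);
    assert_eq!(result, CoderResult::InputEmpty);
    assert_eq!(bytes_read, input.len());
    assert!(!had_errors);
    assert_eq!(output, "café"); // BOM consumed, stream decoded as UTF-8
}
```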
Oh, and the above doesn't account for the HTML spec currently allowing heuristic sniffing. If you still want to allow that, it's probably easier to make the HTML spec use a distinct BOM sniffing hook than to let BOM sniffing in the HTML context be baked into "decode".
Ah, you're right, none of the BOMs worked. But when I fetch a UTF-8 file with a UTF-8 BOM (the first 3 bytes), the BOM is somehow removed, because the first character when reading responseText is not the BOM (this confused me). So I set the header to a different encoding (like ISO-8859-2), and when reading responseText the BOM exists, but the encoding was not changed to UTF-8 (it is still ISO-8859-2).
I was recently pointed to this issue, and I find it surprising that this problem hasn't yet been fixed in the specifications, even though it's not an observable issue in any of the three major browser engines, where the document's character encoding follows the BOM. To recap, the several fixes proposed, from easiest to hardest to implement, are:
While it's worth matching the internal behavior of implementations, any of these options would agree with the observable behavior. Furthermore, option 3 fails to take into account MIME Sniffing's algorithm to sniff a mislabeled feed, which should be implemented in parallel with HTML's encoding sniffing (though I haven't checked what actual browsers do). In any case, if a BOM is found, the encoding's confidence should be certain. While there is no real need to prevent user agents from changing the encoding while parsing, since the effective encoding would be identical, this helps make clear that there's no need to reparse.
I like option 1. It adds some redundancy, but that's okay, as these are legacy callers. New places would exclusively use UTF-8 anyway.
I'm working on PRs for fix #1, which I'll be opening in a few days if no one disagrees. @inexorabletash? @hsivonen?
This change fixes a bug where document's character encoding was set to the return value of the encoding sniffing algorithm rather than to the actual encoding used, which differed when the stream started with a byte-order mark. This change incorporates BOM sniffing into the encoding sniffing algorithm, ensuring both encodings are identical. Closes whatwg#1077.
This change moves the BOM splitting part of the decode hook into a separate hook which does not consume any bytes of the token stream. This will allow fixing a long-standing issue in the HTML encoding sniffing algorithm with the document's character encoding being set to the wrong result when there is a BOM: whatwg/html#1077. Closes #128.
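For illustration, encoding-rs already exposes a BOM sniffing hook of this non-consuming kind: Encoding::for_bom peeks at the buffer and reports the detected encoding plus the BOM's length, leaving the stream untouched (the input bytes below are invented):

```rust
use encoding_rs::Encoding;

fn main() {
    let stream: &[u8] = b"\xFF\xFEh\x00i\x00"; // UTF-16LE BOM followed by "hi"
    // Peek at the BOM without consuming any bytes of the stream.
    if let Some((enc, bom_len)) = Encoding::for_bom(stream) {
        assert_eq!(enc.name(), "UTF-16LE");
        // The caller decides whether to skip the BOM before decoding.
        let (text, _had_errors) = enc.decode_without_bom_handling(&stream[bom_len..]);
        assert_eq!(&*text, "hi");
    }
}
```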
This change fixes a bug where document's character encoding was set to the return value of the encoding sniffing algorithm rather than to the actual encoding used, which differed when the stream started with a byte order mark. This change incorporates BOM sniffing into the encoding sniffing algorithm, ensuring both encodings are identical. Tests: web-platform-tests/wpt#22276. Closes #1077.
Context
From the actual description I can't determine the behavior of Document.characterSet|charset|inputEncoding when loading a document. In addition, @annevk indicates that getting an output encoding should be used only when encoding, not when decoding, but algorithms that cooperate with the parser use it. It would be clearer to write "getting an output encoding to encode", because "encoding" means both the thing and the action. It looks like an unnecessary change was made here: 6a31c26. You can revert it, or define "getting an output encoding to decode" (mapping UTF-16 to UTF-8 and x-user-defined to windows-1252) and use it where necessary.

But I'm still wondering where exactly the document's encoding is set when the input has a BOM. Where is the prose for that action? The sniffing algorithm doesn't mention anything about the BOM, and decode doesn't change the document's encoding.
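A sketch of the "getting an output encoding to decode" mapping proposed in the description above; the function name is hypothetical, and the x-user-defined case follows the comment rather than a published algorithm:

```rust
use encoding_rs::{Encoding, UTF_16BE, UTF_16LE, UTF_8, WINDOWS_1252, X_USER_DEFINED};

// Hypothetical "get an output encoding to decode", per the description above:
// the UTF-16 variants map to UTF-8, and x-user-defined maps to windows-1252.
fn get_output_encoding_to_decode(enc: &'static Encoding) -> &'static Encoding {
    if enc == UTF_16LE || enc == UTF_16BE {
        UTF_8
    } else if enc == X_USER_DEFINED {
        WINDOWS_1252
    } else {
        enc
    }
}

fn main() {
    assert_eq!(get_output_encoding_to_decode(UTF_16LE).name(), "UTF-8");
    assert_eq!(get_output_encoding_to_decode(X_USER_DEFINED).name(), "windows-1252");
}
```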