-
Notifications
You must be signed in to change notification settings - Fork 26
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conform encoding handling to Encoding spec #48
base: master
Are you sure you want to change the base?
Conversation
593fab4
to
bf69e45
Compare
I’m seeing some possible regressions in the HTML checker as a result of this —
That is, this change seems to have caused the parser to fail to parse And then for some reason, the parser seems to randomly picking some other encoding to use — the encoding name in the last error message above is non-deterministic. When I re-run the tests, different encoding names show up:
or:
… etc. |
d’oh — never mind my previous comment; I see now that I just neglected to negate the condition. Will push an update |
818e72f
to
92275d9
Compare
OK, I found that the existing code had effectively been using, as the internal canonical name, names with all dashes and underscores removed. That results in a conformance violation, because it means that, for example, we would treat the string So with 92275d9, I’ve updated the checking to conform to exactly what the Encoding spec requires — which is to do an exact comparison of the lower-cased input string to the set of labels in https://encoding.spec.whatwg.org/#names-and-labels |
92275d9
to
39f050a
Compare
61325c3
to
5ae26d8
Compare
39f050a
to
622c90b
Compare
OK, I’ve now pushed a large number of commits to this branch that together make the parser’s handling of encodings conformant with current requirements in the Encoding spec and HTML spec. The key changes are the following: 81c40ef Conform MetaSniffer "x-user-defined" handling The branch also includes some related minor changes to get things working smoothly. I broke the changes out into separate commits in order to facilitate easier review. So in reviewing this, you probably want to be starting from the https://github.com/validator/htmlparser/pull/48/commits view rather than the https://github.com/validator/htmlparser/pull/48/files view. |
OK, I’m now a bit embarrassed to admit: it hadn’t dawned on me until now that the changes to the So I’m now wondering whether we should just not spend any more time at all on this patch but instead try to get the code on the I’m certainly willing to put time into helping with that. |
My understanding was that |
This change renames the private encodingByCookedName map in the Encoding class to encodingByLabel (for consistency with Encoding spec terminology).
This change drops or replaces all code for checking whether a particular encoding is an ASCII superset — because the code no longer corresponds to actual requirements in the Encoding spec (which instead now requires checking only whether an encoding is utf-16be or utf-16le).
This change adds some error-message strings to the Encoding class — for shared used by the htmlparser.io.Driver and htmlparser.io.MetaSniffer classes, which need to emit the same error messages in a number of cases.
This change removes all code related to checking and using the “actual HTML encoding”, which no longer corresponds to current spec requirements.
This change drops some code which performs various encoding checks that no longer correspond to any current requirements in the Encoding spec.
This change conforms the set of supported encodings for the parser to the requirements in the Encoding spec. Specifically, this restricts the set of encoding names and labels to those listed in the table at https://encoding.spec.whatwg.org/#names-and-labels, and the statement: > The table below lists all encodings and their labels user agents must > support. User agents must not support any other encodings or labels.
This change drops the Encoding.toAsciiLowerCase() method and replaces calls to it with calls to string.toLowerCase() — because none of the conforming encoding names have any special characters for which the behavior of Encoding.toAsciiLowerCase() would produce any different results than string.toLowerCase() does.
This change renames the whineAboutEncodingAndReturnActual() method to whineAboutEncodingAndReturnCanonical() — in order to make more clear it’s reporting that the canonical name of the encoding is preferred over any particular label for that encoding.
This change drops all handling for UTF-32 (which is a completely invalid/ unsupported encoding per the Encoding spec), as well as replacing handling for “UTF-16” (which also isn’t a valid/supported encoding) with, instead, handling for the valid/supported encodings UTF-16BE and UTF-16LE.
This change brings the parser into conformance with current requirements in the “Changing the encoding while parsing” algorithm in the HTML spec.
This change just updates HtmlInputStreamReader to use shared message strings from the Encoding class for some of its error messages.
This change just updates MetaSniffer to use shared message strings from the Encoding class for some of its error messages.
This change makes the parser’s meta-prescan code conform to the requirements in the spec for handling of the "x-user-defined" encoding.
This change corrects the code to set the right encoding in the case when the external encoding has been determined to be non-UTF8.
622c90b
to
c5a88d6
Compare
This makes the parser’s handling of encodings conformant with current requirements in the Encoding spec and HTML spec.
To that end, this PR branch includes the following important changes:
81c40ef Conform MetaSniffer "x-user-defined" handling
f9b3796 Sync w/ “Changing the encoding while parsing” algo
ceb9fe3 Drop UTF-32 & “UTF-16”; use UTF-16BE and UTF-16LE
9d2d39d Drop Encoding.toAsciiLowerCase; use toLowerCase()
beb060b Conform encoding-label matching to Encoding spec
8c62d2d Conform supported encodings to Encoding spec
5cfabbe Drop superseded encoding-checking code
343d86f Drop “actual HTML encoding”-related code
03954fb Drop/replace “is ASCII superset”-checking code
It also includes some related minor changes to get things working smoothly.
The beb060b change is particularly notable in that it fixes the parser’s encoding-name matching to conform to the current Encoding spec at https://encoding.spec.whatwg.org/#concept-encoding-get — which requires that only leading and trailing whitespace be removed from a string before checking if it matches any valid encoding names.
Otherwise, without that change, the parser instead implements https://www.unicode.org/reports/tr22/tr22-8.html#Charset_Alias_Matching — which requires deleting “all characters except a-z, A-Z, and 0-9” from a string before checking if it matches any valid encoding names. That difference makes us fail two html5lib-tests cases.
The html5lib-tests cases that we fail without that beb060b change are the following: