Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conform encoding handling to Encoding spec #48

Open
wants to merge 15 commits into
base: master
Choose a base branch
from

Conversation

sideshowbarker
Copy link
Contributor

@sideshowbarker sideshowbarker commented Aug 20, 2020

This makes the parser’s handling of encodings conformant with current requirements in the Encoding spec and HTML spec.

To that end, this PR branch includes the following important changes:

81c40ef Conform MetaSniffer "x-user-defined" handling
f9b3796 Sync w/ “Changing the encoding while parsing” algo
ceb9fe3 Drop UTF-32 & “UTF-16”; use UTF-16BE and UTF-16LE
9d2d39d Drop Encoding.toAsciiLowerCase; use toLowerCase()
beb060b Conform encoding-label matching to Encoding spec
8c62d2d Conform supported encodings to Encoding spec
5cfabbe Drop superseded encoding-checking code
343d86f Drop “actual HTML encoding”-related code
03954fb Drop/replace “is ASCII superset”-checking code

It also includes some related minor changes to get things working smoothly.

The beb060b change is particularly notable in that it fixes the parser’s encoding-name matching to conform to the current Encoding spec at https://encoding.spec.whatwg.org/#concept-encoding-get — which requires that only leading and trailing whitespace be removed from a string before checking if it matches any valid encoding names.

Otherwise, without that change, the parser instead implements https://www.unicode.org/reports/tr22/tr22-8.html#Charset_Alias_Matching — which requires deleting “all characters except a-z, A-Z, and 0-9” from a string before checking if it matches any valid encoding names. That difference makes us fail two html5lib-tests cases.

The html5lib-tests cases that we fail without that beb060b change are the following:

@sideshowbarker sideshowbarker force-pushed the sideshowbarker/encoding-label-matching branch from 593fab4 to bf69e45 Compare August 20, 2020 15:46
@sideshowbarker
Copy link
Contributor Author

I’m seeing some possible regressions in the HTML checker as a result of this —

error: The encoding “utf-8” is not an IANA-registered encoding and did not use the “x-” prefix. (Charmod C023)
info warning: The character encoding “utf-8” is not widely supported. Better interoperability may be achieved by using “UTF-8”.
info warning: Using “x-iso-2022-cn-gb” instead of the declared encoding “utf-8”.

That is, this change seems to have caused the parser to fail to parse utf-8 case-insensitively.

And then for some reason, the parser seems to randomly picking some other encoding to use — the encoding name in the last error message above is non-deterministic. When I re-run the tests, different encoding names show up:

info warning: Using “x-machebrew” instead of the declared encoding “utf-8”.

or:

info warning: Using “x-maccroatian” instead of the declared encoding “utf-8”.

… etc.

@sideshowbarker
Copy link
Contributor Author

d’oh — never mind my previous comment; I see now that I just neglected to negate the condition. Will push an update

@sideshowbarker sideshowbarker force-pushed the sideshowbarker/encoding-label-matching branch 2 times, most recently from 818e72f to 92275d9 Compare August 20, 2020 17:51
@sideshowbarker
Copy link
Contributor Author

sideshowbarker commented Aug 20, 2020

OK, I found that the existing code had effectively been using, as the internal canonical name, names with all dashes and underscores removed. That results in a conformance violation, because it means that, for example, we would treat the string ut-f8 (invalid as an encoding name) the same as utf-8 or utf8.

So with 92275d9, I’ve updated the checking to conform to exactly what the Encoding spec requires — which is to do an exact comparison of the lower-cased input string to the set of labels in https://encoding.spec.whatwg.org/#names-and-labels

@sideshowbarker sideshowbarker force-pushed the sideshowbarker/encoding-label-matching branch from 39f050a to 622c90b Compare September 13, 2020 01:04
@sideshowbarker sideshowbarker changed the title Conform encoding-label matching to Encoding spec Conform encoding handling to Encoding spec Sep 13, 2020
@sideshowbarker
Copy link
Contributor Author

OK, I’ve now pushed a large number of commits to this branch that together make the parser’s handling of encodings conformant with current requirements in the Encoding spec and HTML spec. The key changes are the following:

81c40ef Conform MetaSniffer "x-user-defined" handling
f9b3796 Sync w/ “Changing the encoding while parsing” algo
ceb9fe3 Drop UTF-32 & “UTF-16”; use UTF-16BE and UTF-16LE
9d2d39d Drop Encoding.toAsciiLowerCase; use toLowerCase()
beb060b Conform encoding-label matching to Encoding spec
8c62d2d Conform supported encodings to Encoding spec
5cfabbe Drop superseded encoding-checking code
343d86f Drop “actual HTML encoding”-related code
03954fb Drop/replace “is ASCII superset”-checking code

The branch also includes some related minor changes to get things working smoothly.

I broke the changes out into separate commits in order to facilitate easier review.

So in reviewing this, you probably want to be starting from the https://github.com/validator/htmlparser/pull/48/commits view rather than the https://github.com/validator/htmlparser/pull/48/files view.

@sideshowbarker
Copy link
Contributor Author

OK, I’m now a bit embarrassed to admit: it hadn’t dawned on me until now that the changes to the Encoding class in this patch are completely superseded by the code in the encodings branch of the repo.

So I’m now wondering whether we should just not spend any more time at all on this patch but instead try to get the code on the encodings branch finished and ready to merge to the main branch and to release.

I’m certainly willing to put time into helping with that.

@carlosame
Copy link

get the code on the encodings branch finished and ready to merge

My understanding was that encodings is an archive branch kept for reference, not initially intended for merging back. Perhaps I'm missing something.

This change renames the private encodingByCookedName map in the Encoding
class to encodingByLabel (for consistency with Encoding spec terminology).
This change drops or replaces all code for checking whether a particular
encoding is an ASCII superset — because the code no longer corresponds
to actual requirements in the Encoding spec (which instead now requires
checking only whether an encoding is utf-16be or utf-16le).
This change adds some error-message strings to the Encoding class — for
shared used by the htmlparser.io.Driver and htmlparser.io.MetaSniffer
classes, which need to emit the same error messages in a number of cases.
This change removes all code related to checking and using the “actual
HTML encoding”, which no longer corresponds to current spec requirements.
This change drops some code which performs various encoding checks that
no longer correspond to any current requirements in the Encoding spec.
This change conforms the set of supported encodings for the parser to
the requirements in the Encoding spec. Specifically, this restricts the
set of encoding names and labels to those listed in the table at
https://encoding.spec.whatwg.org/#names-and-labels, and the statement:

> The table below lists all encodings and their labels user agents must
> support. User agents must not support any other encodings or labels.
This change drops the Encoding.toAsciiLowerCase() method and replaces
calls to it with calls to string.toLowerCase() — because none of the
conforming encoding names have any special characters for which the
behavior of Encoding.toAsciiLowerCase() would produce any different
results than string.toLowerCase() does.
This change renames the whineAboutEncodingAndReturnActual() method to
whineAboutEncodingAndReturnCanonical() — in order to make more clear
it’s reporting that the canonical name of the encoding is preferred over
any particular label for that encoding.
This change drops all handling for UTF-32 (which is a completely invalid/
unsupported encoding per the Encoding spec), as well as replacing handling
for “UTF-16” (which also isn’t a valid/supported encoding) with, instead,
handling for the valid/supported encodings UTF-16BE and UTF-16LE.
This change brings the parser into conformance with current requirements
in the “Changing the encoding while parsing” algorithm in the HTML spec.
This change just updates HtmlInputStreamReader to use shared message
strings from the Encoding class for some of its error messages.
This change just updates MetaSniffer to use shared message strings from
the Encoding class for some of its error messages.
This change makes the parser’s meta-prescan code conform to the
requirements in the spec for handling of the "x-user-defined" encoding.
This change corrects the code to set the right encoding in the case when
the external encoding has been determined to be non-UTF8.
@sideshowbarker sideshowbarker force-pushed the sideshowbarker/encoding-label-matching branch from 622c90b to c5a88d6 Compare September 2, 2021 12:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants