-
Notifications
You must be signed in to change notification settings - Fork 2.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make &aaa; a parse error #1257
Comments
Option 2 sounds good to me - it allows for continuous streaming, yet reports parse errors if implementation chooses to do so. |
I have no strong preference here but am willing in principle to write patches for the v.nu HTML parser to line it up with whatever we can agree on.
https://checker.html5.org/ is currently running from an off-master branch of the v.nu HTML parser that has a patch I wrote to align it as well as I could with the current HTML parsing algorithm requirements for dealing with ampersands and character references. There are some parts of the code that I found it pretty difficult to align fully to the spec without getting more review. But I wasn’t able to get the patch reviewed more thoroughly, so ended up landing it on a branch and building v.nu from that. |
@zcorpan A related thought: Actually, if an implementation strictly follows the spec as a guide for the real algorithm (we don't in our implementation exactly to prevent infinite buffering, but some might do), this is already a case in other places. Just as an example - Proper implementation would be comparing letters one by one against the last start tag name for the early bailout instead. Anyway, my point is that we already have this form of a problem around the spec, and we need to either fix it in all those places in the spec or just add a top-level warning/note and use your proposal number 1. |
Hmm, I think we should go with option (2) here and also fix the other problem you mention. Filed #1306 - thanks! |
See #1220 (comment)
In particular,
&aaa;
or<p title="&ersand;">
should be parse errors but are not right now because characters are only consumed so long as they match something in the named character references table, and then the temporary buffer is checked for a trailing;
(for the above examples it would contain&aa
or&e
, respectively).In https://checker.html5.org/ it reports an error for these cases. It does not report an error for really long entities (like 1000 chars).
Possible fixes:
A new state that continues to consume (and append to the temporary buffer) alphanumerics when no match could be made, and that report a parse error on ';' and reconsume in character reference end state, and reconsume without parse error for anything else.
This means that the temporary buffer would need to be able to hold an arbitrary amount of characters, which seems bad...
Use new states that insert stuff in the right places (emit a char token or append to attribute value), and parse error on
;
.Like (1) but have a fixed length of characters to consume in the temporary buffer (i.e. like v.nu HTML parser).
Also need to update the note in the spec about
¬it;
in an attribute value.@RReverser @sideshowbarker
The text was updated successfully, but these errors were encountered: