Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make &aaa; a parse error #1257

Closed
zcorpan opened this issue May 16, 2016 · 4 comments
Closed

Make &aaa; a parse error #1257

zcorpan opened this issue May 16, 2016 · 4 comments

Comments

@zcorpan
Copy link
Member

zcorpan commented May 16, 2016

See #1220 (comment)

In particular, &aaa; or <p title="&ampersand;"> should be parse errors but are not right now because characters are only consumed so long as they match something in the named character references table, and then the temporary buffer is checked for a trailing ; (for the above examples it would contain &aa or &ampe, respectively).

In https://checker.html5.org/ it reports an error for these cases. It does not report an error for really long entities (like 1000 chars).

Possible fixes:

  1. A new state that continues to consume (and append to the temporary buffer) alphanumerics when no match could be made, and that report a parse error on ';' and reconsume in character reference end state, and reconsume without parse error for anything else.

    This means that the temporary buffer would need to be able to hold an arbitrary amount of characters, which seems bad...

  2. Use new states that insert stuff in the right places (emit a char token or append to attribute value), and parse error on ;.

  3. Like (1) but have a fixed length of characters to consume in the temporary buffer (i.e. like v.nu HTML parser).

Also need to update the note in the spec about &notit; in an attribute value.

@RReverser @sideshowbarker

@RReverser
Copy link
Member

Option 2 sounds good to me - it allows for continuous streaming, yet reports parse errors if implementation chooses to do so.

@sideshowbarker
Copy link
Member

I have no strong preference here but am willing in principle to write patches for the v.nu HTML parser to line it up with whatever we can agree on.

In https://checker.html5.org/ it reports an error for these cases. It does not report an error for really long entities (like 1000 chars).

https://checker.html5.org/ is currently running from an off-master branch of the v.nu HTML parser that has a patch I wrote to align it as well as I could with the current HTML parsing algorithm requirements for dealing with ampersands and character references. There are some parts of the code that I found it pretty difficult to align fully to the spec without getting more review. But I wasn’t able to get the patch reviewed more thoroughly, so ended up landing it on a branch and building v.nu from that.

@RReverser
Copy link
Member

This means that the temporary buffer would need to be able to hold an arbitrary amount of characters, which seems bad...

@zcorpan A related thought:

Actually, if an implementation strictly follows the spec as a guide for the real algorithm (we don't in our implementation exactly to prevent infinite buffering, but some might do), this is already a case in other places. Just as an example - RCDATA end tag name state collects all the letters and only when it finds a non-letter, it either emits the temporary buffer or checks it against the last start tag name and makes further decisions. So if such a straightforward algorithm gets an input of endless HTTP stream starting with e.g. <textarea></aaaaaaaaaa..., in the best case it might be buffering all those letters until it runs out of memory.

Proper implementation would be comparing letters one by one against the last start tag name for the early bailout instead.

Anyway, my point is that we already have this form of a problem around the spec, and we need to either fix it in all those places in the spec or just add a top-level warning/note and use your proposal number 1.

@zcorpan
Copy link
Member Author

zcorpan commented May 23, 2016

Hmm, I think we should go with option (2) here and also fix the other problem you mention. Filed #1306 - thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

No branches or pull requests

3 participants