Make &aaa; a parse error #1257

zcorpan · 2016-05-16T11:10:10Z

In particular, &aaa; or <p title="&ampersand;"> should be parse errors but are not right now because characters are only consumed so long as they match something in the named character references table, and then the temporary buffer is checked for a trailing ; (for the above examples it would contain &aa or &ampe, respectively).

In https://checker.html5.org/ it reports an error for these cases. It does not report an error for really long entities (like 1000 chars).

Possible fixes:

A new state that continues to consume (and append to the temporary buffer) alphanumerics when no match could be made, and that report a parse error on ';' and reconsume in character reference end state, and reconsume without parse error for anything else.

This means that the temporary buffer would need to be able to hold an arbitrary amount of characters, which seems bad...
Use new states that insert stuff in the right places (emit a char token or append to attribute value), and parse error on ;.
Like (1) but have a fixed length of characters to consume in the temporary buffer (i.e. like v.nu HTML parser).

Also need to update the note in the spec about &notit; in an attribute value.

@RReverser @sideshowbarker

The text was updated successfully, but these errors were encountered:

RReverser · 2016-05-16T11:14:13Z

Option 2 sounds good to me - it allows for continuous streaming, yet reports parse errors if implementation chooses to do so.

sideshowbarker · 2016-05-16T14:43:15Z

I have no strong preference here but am willing in principle to write patches for the v.nu HTML parser to line it up with whatever we can agree on.

In https://checker.html5.org/ it reports an error for these cases. It does not report an error for really long entities (like 1000 chars).

https://checker.html5.org/ is currently running from an off-master branch of the v.nu HTML parser that has a patch I wrote to align it as well as I could with the current HTML parsing algorithm requirements for dealing with ampersands and character references. There are some parts of the code that I found it pretty difficult to align fully to the spec without getting more review. But I wasn’t able to get the patch reviewed more thoroughly, so ended up landing it on a branch and building v.nu from that.

RReverser · 2016-05-20T17:13:39Z

This means that the temporary buffer would need to be able to hold an arbitrary amount of characters, which seems bad...

@zcorpan A related thought:

Actually, if an implementation strictly follows the spec as a guide for the real algorithm (we don't in our implementation exactly to prevent infinite buffering, but some might do), this is already a case in other places. Just as an example - RCDATA end tag name state collects all the letters and only when it finds a non-letter, it either emits the temporary buffer or checks it against the last start tag name and makes further decisions. So if such a straightforward algorithm gets an input of endless HTTP stream starting with e.g. <textarea></aaaaaaaaaa..., in the best case it might be buffering all those letters until it runs out of memory.

Proper implementation would be comparing letters one by one against the last start tag name for the early bailout instead.

Anyway, my point is that we already have this form of a problem around the spec, and we need to either fix it in all those places in the spec or just add a top-level warning/note and use your proposal number 1.

zcorpan · 2016-05-23T07:38:00Z

Hmm, I think we should go with option (2) here and also fix the other problem you mention. Filed #1306 - thanks!

Closes whatwg#1257.

zcorpan added the document conformance label May 16, 2016

zcorpan mentioned this issue May 16, 2016

Formalize character reference states. #1220

Closed

zcorpan added the topic: parser label May 16, 2016

mathiasbynens mentioned this issue May 16, 2016

Make &aaa; a parse error mathiasbynens/he#46

Open

zcorpan mentioned this issue May 23, 2016

<textarea></aaaaa... should insert a text node? #1306

Open

zcorpan mentioned this issue Dec 11, 2016

Character reference definition #2099

Open

zcorpan mentioned this issue May 24, 2017

[Parser] Ambiguous ampersand is incorrectly handled in tokenizer #2704

Closed

inikulin added a commit to inikulin/html that referenced this issue Jun 1, 2017

Handle ambiguous ampersands of arbitary length (closes whatwg#1257)

b6d082b

inikulin added a commit to inikulin/html that referenced this issue Jun 1, 2017

Handle ambiguous ampersands of arbitrary length (closes whatwg#1257)

ceb2659

inikulin mentioned this issue Jun 1, 2017

[Parser] Handle ambiguous ampersands of arbitrary length (closes #1257) #2731

Merged

inikulin added a commit to inikulin/html that referenced this issue Jun 1, 2017

Handle ambiguous ampersands of arbitrary length (closes whatwg#1257)

b0e2b27

domenic closed this as completed in ee19894 Jun 23, 2017

alice pushed a commit to alice/html that referenced this issue Jan 8, 2019

Handle ambiguous ampersands of arbitrary length

6671c79

Closes whatwg#1257.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Make &aaa; a parse error #1257

Make &aaa; a parse error #1257

zcorpan commented May 16, 2016

RReverser commented May 16, 2016

sideshowbarker commented May 16, 2016

RReverser commented May 20, 2016

zcorpan commented May 23, 2016

Make &aaa; a parse error #1257

Make &aaa; a parse error #1257

Comments

zcorpan commented May 16, 2016

RReverser commented May 16, 2016

sideshowbarker commented May 16, 2016

RReverser commented May 20, 2016

zcorpan commented May 23, 2016