[Parser] Handle ambiguous ampersands of arbitrary length (closes #1257) #2731

Merged
merged 5 commits into from
Jun 23, 2017

Conversation

inikulin
Member

@inikulin inikulin commented Jun 1, 2017

This implementation avoids infinite buffering of a possible ambiguous ampersand as discussed in #1257.

Preview: https://inikulin.github.io/html-build/output/multipage/syntax.html
POC: HTMLParseErrorWG/parse5#25
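A minimal sketch of how a tokenizer state could handle an ambiguous ampersand without unbounded buffering, assuming the leading `&` has already been passed on and that each alphanumeric is forwarded as soon as it is read. The function name, its parameters, and the error label are hypothetical and are not taken from the spec text or from parse5:

```python
import string

ASCII_ALPHANUMERIC = set(string.ascii_letters + string.digits)


def ambiguous_ampersand_state(text, pos, in_attribute, attr_value, tokens, errors):
    """Consume alphanumerics after an unmatched '&', starting at text[pos].

    Each character is forwarded immediately, so no buffer grows with the
    length of the input. Returns the position where the return state resumes.
    """
    while pos < len(text) and text[pos] in ASCII_ALPHANUMERIC:
        if in_attribute:
            # Inside an attribute: extend the current attribute's value.
            attr_value.append(text[pos])
        else:
            # In text content: emit a character token.
            tokens.append(("character", text[pos]))
        pos += 1
    if pos < len(text) and text[pos] == ";":
        # Hypothetical error label; the ';' itself is left for the return state.
        errors.append("ambiguous-ampersand")
    return pos
```

The point of the sketch is that the only state kept between characters is the current position, regardless of how long the run of alphanumerics is.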

source Outdated
that for each <span>code point</span> in the <var data-x="temporary buffer">temporary
buffer</var> (in the order they were added to the buffer) user agent must append the code point
from the buffer to the current attribute's value if the character reference was <span
data-x="charref-in-attribute">consumed as part of an attribute</span>, or emit a character token
Member

"or emit the code point as a character token otherwise."
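The flush step quoted above can be sketched as follows. This is an illustration only: `flush_code_points`, the list-based attribute value, and the tuple token shape are assumptions made for the example, not names from the spec or from any implementation.

```python
def flush_code_points(temporary_buffer, in_attribute, attr_value, tokens):
    """Hand every buffered code point on, in the order it was added."""
    for code_point in temporary_buffer:
        if in_attribute:
            # Character reference consumed as part of an attribute:
            # append to the current attribute's value.
            attr_value.append(code_point)
        else:
            # Otherwise emit the code point as a character token.
            tokens.append(("character", code_point))


# Usage: the same buffer flushed in both contexts.
attr_value, tokens = [], []
flush_code_points(list("&am"), True, attr_value, tokens)
flush_code_points(list("&am"), False, attr_value, tokens)
```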

Member

@zcorpan zcorpan left a comment

LGTM % nit

@inikulin
Member Author

inikulin commented Jun 2, 2017

Fixed

@inikulin
Member Author

inikulin commented Jun 2, 2017

Also added a note that the ambiguous ampersand state is optional for implementations that don't report errors.

source Outdated
@@ -104257,6 +104257,11 @@ dictionary <dfn>StorageEventInit</dfn> : <span>EventInit</span> {

</dl>

<p class="note">For performance reasons, an implementation that does not report errors and that
Member

We don't have this for any other states that are optional for no-parse-error implementations. I'm not sure it's a good idea to have them; it could add confusion and in itself be a source of bugs (e.g. if the note says to switch to the wrong state). Better to make sure behavior is correct by writing tests in my opinion.

Member Author

As far as I remember, we have such a note in the tree construction stage about switching to RCDATA.

Member Author

My bad, it's about RAWTEXT and PLAINTEXT in https://html.spec.whatwg.org/multipage/syntax.html#html-fragment-parsing-algorithm
However, I agree with your argument, so I guess I'd better remove it.

zcorpan added a commit to html5lib/html5lib-tests that referenced this pull request Jun 5, 2017
@zcorpan
Member

zcorpan commented Jun 5, 2017

Tests at html5lib/html5lib-tests#94

@inikulin
Member Author

inikulin commented Jun 5, 2017

@zcorpan I've removed the note.

@zcorpan
Member

zcorpan commented Jun 13, 2017

As noted in html5lib/html5lib-tests#94 (comment), I think there's also a spec bug in attribute values. We could either fix it in this PR or do that separately in a follow-up.

@zcorpan
Member

zcorpan commented Jun 13, 2017

Testing <p title="&ammp &amp= &ammp;"></p> in https://checker.html5.org/#textarea gives an error (FWIW).

@inikulin
Member Author

@zcorpan Not quite sure about it to be honest, because by definition:

An ambiguous ampersand is a U+0026 AMPERSAND character (&) that is followed by one or more ASCII alphanumerics, followed by a U+003B SEMICOLON character (;), where these characters do not match any of the names given in the named character references section.

and we have a match here, but it's omitted due to the attribute quirks.
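To make the quoted definition concrete, here is a small sketch that applies it literally to a string. It assumes Python's standard `html.entities.html5` table is an acceptable stand-in for the named character references section, and `find_ambiguous_ampersands` is a hypothetical helper. It checks only the textual definition and does not model the attribute-specific parsing behavior discussed in this thread.

```python
import re
from html.entities import html5

# '&', one or more ASCII alphanumerics, then ';'
CANDIDATE = re.compile(r"&([0-9A-Za-z]+);")


def find_ambiguous_ampersands(text):
    """Return every '&name;' run whose 'name;' is not a known reference name."""
    return [
        match.group(0)
        for match in CANDIDATE.finditer(text)
        if match.group(1) + ";" not in html5
    ]
```

Applied to the attribute value tested above, `find_ambiguous_ampersands("&ammp &amp= &ammp;")` returns `["&ammp;"]`: the first `&ammp` and the `&amp=` are not followed by a semicolon, so under this definition only the last run qualifies.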

@zcorpan
Member

zcorpan commented Jun 21, 2017

Let's move the discussion about ambiguous ampersands in attribute values to #2776. This PR is good as-is; we just need to fix the test to match.

zcorpan added a commit to html5lib/html5lib-tests that referenced this pull request Jun 21, 2017
@inikulin
Member Author

@zcorpan Tests were merged, so I guess we're good to go.

@domenic
Member

domenic commented Jun 23, 2017

Simon has gone on vacation, but it does look like everything is in order, so let me help merge this.

@domenic domenic merged commit ee19894 into whatwg:master Jun 23, 2017
@inikulin inikulin deleted the amb-amp branch June 23, 2017 19:11
@RReverser
Member

Shame on me, how come I didn't see this PR. Looks good though!

4 participants