Attribute DOM representation and parsing is inconsistent #4275

gijsk · 2019-01-07T11:14:58Z

(Filed as a result of mozilla/readability#392; I'm not 100% sure whether this should be considered a DOM issue or an HTML parser issue; feel free to move as appropriate.)

STR:

Open https://opinion.udn.com/opinion/story/10124/3561413 in a recent version of Chrome or Firefox.
In the browser's devtools console, run something like this:

console.log(Array.from(document.querySelectorAll("table")).map(t => t.outerHTML))

At the moment, the DOM includes 2 or 3 tables with an attribute whose name is "0", as the console output shows.

The original markup of the page as inspected via "View Source", at time of writing, looked something like:

<table width="90% border="0>

Note the opening quote before 90% and 'closing' quote after border=.

Obviously the markup's intent is to have a table with 2 attributes, width="90%" and border="0". But both browsers parse this as a width attribute with the value "90% border=", followed by an attribute with the name "0" and the empty string as its value. I assume this parsing is prescribed by the spec, but I haven't tried to look for the specifics there.
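To make the parsing result concrete, here is a minimal sketch of an attribute tokenizer that loosely follows the HTML spec's attribute states. The function name parseStartTagAttributes is hypothetical, and the sketch deliberately ignores character references, self-closing tags, and most parse-error handling, so treat it as an illustration rather than the spec's algorithm.

```javascript
// Simplified sketch of HTML start-tag attribute tokenization.
// Assumption: input is a single start tag such as '<table a="b">'.
function parseStartTagAttributes(tag) {
  const attrs = [];
  let i = 1; // skip "<"
  // Skip the tag name.
  while (i < tag.length && !/[\s>]/.test(tag[i])) i++;

  while (i < tag.length) {
    // Before attribute name: skip whitespace, stop at ">".
    while (i < tag.length && /\s/.test(tag[i])) i++;
    if (i >= tag.length || tag[i] === ">") break;

    // Attribute name: runs until whitespace, "=", or ">".
    let name = "";
    while (i < tag.length && !/[\s=>]/.test(tag[i])) {
      name += tag[i++].toLowerCase();
    }
    while (i < tag.length && /\s/.test(tag[i])) i++;

    let value = "";
    if (tag[i] === "=") {
      i++;
      while (i < tag.length && /\s/.test(tag[i])) i++;
      const quote = tag[i];
      if (quote === '"' || quote === "'") {
        i++;
        // A quoted value ends at the next matching quote, whatever is inside.
        while (i < tag.length && tag[i] !== quote) value += tag[i++];
        i++; // consume the closing quote; a new attribute may start right after
      } else {
        while (i < tag.length && !/[\s>]/.test(tag[i])) value += tag[i++];
      }
    }
    // The first occurrence of a name wins; later duplicates are dropped.
    if (!attrs.some((a) => a.name === name)) attrs.push({ name, value });
  }
  return attrs;
}

console.log(parseStartTagAttributes('<table width="90% border="0>'));
// → [ { name: 'width', value: '90% border=' }, { name: '0', value: '' } ]
```

The second double quote closes the width value, so the trailing 0 is read as the start of a new attribute name.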

The problem arises when straightforward DOM manipulation code iterates over element.attributes and tries to set the same attributes on a new element. Element.setAttribute throws an InvalidCharacterError because, as noted in https://dom.spec.whatwg.org/#dom-element-setattribute, 0 "does not match the Name production in XML", viz. https://www.w3.org/TR/xml/#NT-Name.
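For reference, the check that rejects "0" can be approximated outside the browser. The sketch below transcribes the Name production from XML 1.0 into a regular expression; the helper name matchesXmlName is hypothetical, and this is only an illustration of the validation rule as the DOM spec described it when this issue was filed, not the code any browser runs.

```javascript
// Character ranges transcribed from the XML 1.0 NameStartChar production.
const NAME_START_CHAR =
  ":A-Z_a-z\\u{C0}-\\u{D6}\\u{D8}-\\u{F6}\\u{F8}-\\u{2FF}\\u{370}-\\u{37D}" +
  "\\u{37F}-\\u{1FFF}\\u{200C}-\\u{200D}\\u{2070}-\\u{218F}\\u{2C00}-\\u{2FEF}" +
  "\\u{3001}-\\u{D7FF}\\u{F900}-\\u{FDCF}\\u{FDF0}-\\u{FFFD}\\u{10000}-\\u{EFFFF}";

// NameChar additionally allows "-", ".", digits, and a few combining ranges.
const NAME_CHAR =
  NAME_START_CHAR + "\\-.0-9\\u{B7}\\u{300}-\\u{36F}\\u{203F}-\\u{2040}";

// Name ::= NameStartChar (NameChar)*
const XML_NAME = new RegExp(`^[${NAME_START_CHAR}][${NAME_CHAR}]*$`, "u");

// Hypothetical helper: true if `name` matches the XML Name production.
function matchesXmlName(name) {
  return XML_NAME.test(name);
}

console.log(matchesXmlName("width")); // true
console.log(matchesXmlName("0"));     // false: a digit cannot start a Name
console.log(matchesXmlName("a0"));    // true: digits are fine after the first character
```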

Scripts can currently work around this issue (in reasonably complete DOM implementations) by using element.attributes.setNamedItem(otherElement.attributes[i].cloneNode()), though this isn't very elegant.
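A minimal sketch of that workaround follows, assuming the DOM implementation supports Attr.cloneNode() and NamedNodeMap.setNamedItem() without name validation (which, as noted, is not guaranteed everywhere). The function name copyAttributes is hypothetical.

```javascript
// Copy every attribute from `source` to `target`, falling back to cloning
// the Attr node when setAttribute rejects the attribute name.
function copyAttributes(source, target) {
  for (const attr of Array.from(source.attributes)) {
    try {
      target.setAttribute(attr.name, attr.value);
    } catch (e) {
      // Only handle the name-validation failure; rethrow anything else.
      if (e.name !== "InvalidCharacterError") throw e;
      // Assumption: setNamedItem accepts a cloned Attr without re-validating its name.
      target.attributes.setNamedItem(attr.cloneNode());
    }
  }
}
```

In a browser this could be called as copyAttributes(oldTable, document.createElement("table")). Trying setAttribute first keeps the common path simple and reserves the clone for names the parser accepted but the API rejects.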

I think the inconsistency here is unfortunate. I would argue for one of the following improvements:

Parsing an HTML document should validate attribute names the same way the DOM spec says to validate them (cf. https://dom.spec.whatwg.org/#validate and https://dom.spec.whatwg.org/#dom-element-setattribute). If that is too problematic for backwards compatibility reasons (i.e. where document authors apparently intend the element to have an attribute named e.g. "1" or "."), the parser should apply that validation only when it is handling questionable markup such as the above.
The setAttribute DOM API validation should be relaxed to the same standard that HTML parsing uses; if that is not possible for backwards compatibility reasons, it should be relaxed for documents with text/html content types and/or HTML (rather than XHTML/XML-based) parsing models.

annevk · 2019-01-07T12:59:51Z

See also whatwg/dom#449. To summarize, this problem is known, but resolving it requires a lot of careful compatibility testing that nobody seems to be willing to invest in.

domenic · 2019-01-07T19:24:38Z

Yep. Let's discuss over there.

gijsk changed the title ~~Attribute DOM representation and parsing is confusing~~ Attribute DOM representation and parsing is inconsistent Jan 7, 2019

gijsk mentioned this issue Jan 7, 2019

Uncaught DOMException on https://losst.ru/obnovlenie-debian-9 mozilla/readability#392

Closed

domenic closed this as completed Jan 7, 2019

domenic mentioned this issue Jun 8, 2022

Allow more characters in element/attribute names and prefixes whatwg/dom#1079

Open

gabriele-tomassetti mentioned this issue Feb 4, 2023

AngleSharp.Dom.DomException: Invalid character detected. Strumenta/SmartReader#54

Closed
