Parser failure on unclosed <head> inside unclosed <html> #75

clarfonthey · 2024-01-16T23:36:15Z

Hello, I ran into this issue with difftastic and managed to trace it back to this parser. After narrowing down the HTML that was failing to parse, I managed this:

<!doctype html><html><head>

Essentially, instead of being labelled as an implicitly closed element inside an implicitly closed element, it's labelled as an error with two start tags.

This feels undesirable, considering how a missing </head>, </body>, or </html> will cause the entire document (or half of it) to be enclosed in an error node, which breaks the parsing of the individual parts.


milahu · 2024-02-21T09:35:41Z

"implicitly closed element"

i guess you're confusing this with <img> and <hr> and ...

in other words, you're looking for a fault-tolerant html parser as used by web browsers
imo this is out of scope for tree-sitter-html

see also How do browsers deal with malformed HTML?

clarfonthey · 2024-02-21T15:16:56Z

This isn't malformed HTML; it's listed explicitly in the spec:

The end tags for html, head, and body can all be omitted in valid HTML.

amaanq · 2024-02-21T15:34:40Z

wow, never knew that. thanks for the links

clarfonthey · 2024-02-21T19:41:49Z

Small note, but this isn't 100% fixed; you can also implicitly close a <head> or <body> element by adding the other (not just by EOF), and this isn't accounted for in #84. In other words, this doesn't work:

<!DOCTYPE><html><head><body>

The specific wording of "ASCII whitespace" and "comment" is used to detail the way that content is inferred to be in either the head or body if the end (or start!) tags are missing. Basically, there are two rules:

Comments are just inferred to be in the tag that was most recently opened, so, if you want to explicitly include them in either the head or body, you need an explicit tag.
ASCII whitespace has a different interpretation in the head and body, where it's ignored in the head, but treated as a text node in the body. The default interpretation is that any leading ASCII whitespace is treated as part of the <head> unless you explicitly close it and make it part of the body.
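
To make the whitespace/comment wording concrete, here is a rough sketch of the spec's condition for leaving out </head>: the end tag may be omitted when the head element is not immediately followed by ASCII whitespace or a comment. The helper name `head_end_tag_omissible` and its string-based input are assumptions for illustration only; they are not part of tree-sitter-html or any real API.

```python
# ASCII whitespace as defined by the HTML spec: TAB, LF, FF, CR, SPACE.
ASCII_WHITESPACE = ("\t", "\n", "\f", "\r", " ")


def head_end_tag_omissible(following: str) -> bool:
    """Return True if </head> could be left out.

    `following` is a hypothetical input: the markup that comes right
    after the point where </head> would otherwise be written.
    """
    # Whitespace directly after the head element forbids omitting the end tag.
    if following.startswith(ASCII_WHITESPACE):
        return False
    # So does a comment directly after the head element.
    if following.startswith("<!--"):
        return False
    return True
```

For instance, `head_end_tag_omissible("<body>")` allows the omission, while a leading space or a leading comment before `<body>` does not.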

By these rules, you can omit the head and body tags altogether and the parser will infer which content belongs where based upon where the tags are usually located, but you can also choose to simply omit the </head> and use <body> to end the head, since it's well-understood that the <html> tag just contains a head and body, in that order.
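
As an illustration of a <body> start tag ending an open <head>, here is a minimal sketch built on Python's standard `html.parser.HTMLParser`. It is not how tree-sitter-html works; the `Node`, `SketchBuilder`, and `build` names are hypothetical, the void-element set is deliberately partial, and only the single "body closes head" rule is modelled. Elements still open at end of input are simply left as they are in the tree, which plays the role of implicit closing at EOF.

```python
from html.parser import HTMLParser

# Partial list of void elements (assumption: enough for these examples).
VOID = {"meta", "link", "br", "img", "hr", "input"}


class Node:
    def __init__(self, name):
        self.name = name
        self.children = []

    def sexp(self):
        # Render the subtree as a compact S-expression.
        inner = "".join(" " + child.sexp() for child in self.children)
        return f"({self.name}{inner})"


class SketchBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = Node("document")
        self.stack = [self.root]  # currently open elements

    def handle_starttag(self, tag, attrs):
        # A <body> start tag implicitly closes an open <head>.
        if tag == "body" and self.stack[-1].name == "head":
            self.stack.pop()
        node = Node(tag)
        self.stack[-1].children.append(node)
        if tag not in VOID:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Close the nearest matching open element, if there is one.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].name == tag:
                del self.stack[i:]
                break


def build(source):
    builder = SketchBuilder()
    builder.feed(source)
    builder.close()
    return builder.root.sexp()


print(build("<!doctype html><html><head><meta><body>"))
```

With this sketch the <meta> ends up inside the head, and the body becomes a sibling of the head rather than a child of it.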

I included the <html><head> example since that was the simplest one, but technically there should also be a <html><head><body> test as well to indicate the body too. And perhaps a few other simple examples, like:

<!DOCTYPE><html><meta>:

(document (doctype)
  (element
    (start_tag (tag_name))
  (element
    (start_tag (tag_name)))

<!DOCTYPE><html><head><meta>:

(document (doctype)
  (element
    (element
      (start_tag (tag_name)))
  (element
    (start_tag (tag_name)))

<!DOCTYPE><html><meta><body>:

(document (doctype)
  (element
    (start_tag (tag_name))
  (element
    (element
      (start_tag (tag_name))))

<!DOCTYPE><html><head><meta><body>:

(document (doctype)
  (element
    (element
      (start_tag (tag_name)))
  (element
    (element
      (start_tag (tag_name))))

I don't think that tree-sitter needs to explicitly sort the tags into a head and body (it's fine with other elements inside <html> directly), but I think that it should be able to implicitly close a <head> element based upon the presence of body elements. Right now, it still complains about the main case, which is a <head> being implicitly closed by a <body>.

Also to add a bit more context: the spec has a more specific explanation of the algorithm for parsing documents that goes over the way these two elements are parsed: https://html.spec.whatwg.org/multipage/parsing.html#parsing-main-inhtml

clarfonthey changed the title ~~Parser failure on unclosed <head>~~ Parser failure on unclosed <head> inside unclosed <html> Jan 16, 2024

amaanq mentioned this issue Feb 21, 2024

feat: support omitted html, head, and body end tags #84

Merged

amaanq closed this as completed in #84 Feb 21, 2024

clarfonthey mentioned this issue Oct 6, 2024

Investigate tree-sitter to replace syntect getzola/zola#1787

Closed
