Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Parser failure on unclosed <head> inside unclosed <html> #75

Closed
clarfonthey opened this issue Jan 16, 2024 · 4 comments · Fixed by #84
Closed

Parser failure on unclosed <head> inside unclosed <html> #75

clarfonthey opened this issue Jan 16, 2024 · 4 comments · Fixed by #84

Comments

@clarfonthey
Copy link

Hello, I ran into this issue with difftastic and managed to trace it back to this parser. After narrowing down the HTML that was failing to parse, I managed this:

<!doctype html><html><head>

Essentially, instead of being labelled as an implicitly closed element inside an implicitly closed element, it's labelled as an error with two start tags.

This feels undesirable, considering how a missing </head>, </body>, or </html> will cause the entire document (or half of it) to be enclosed in an error node, which breaks the parsing of the individual parts.

@clarfonthey clarfonthey changed the title Parser failure on unclosed <head> Parser failure on unclosed <head> inside unclosed <html> Jan 16, 2024
@milahu
Copy link

milahu commented Feb 21, 2024

implicitly closed element

i guess youre confusing this with <img> and <br> and <hr> and ...

in other words, youre looking for a fault-tolerant html parser as used by web browsers
imo this is out of scope for tree-sitter-html

see also How do browsers deal with malformed HTML?

@clarfonthey
Copy link
Author

This isn't malformed HTML; it's listed explicitly in the spec:

The end tags for html, head, and body can all be omitted in valid HTML.

@amaanq
Copy link
Member

amaanq commented Feb 21, 2024

wow, never knew that. thanks for the links

@clarfonthey
Copy link
Author

clarfonthey commented Feb 21, 2024

Small note, but this isn't 100% fixed; you can also implicitly close a <head> or <body> element by adding the other (not just by EOF), and this isn't accounted for in #84. In other words, this doesn't work:

<!DOCTYPE><html><head><body>

The specific wording of "ASCII whitespace" and "comment" is used to detail the way that content is inferred to be in either the head or body if the end (or start!) tags are missing. Basically, there are two:

  1. Comments are just inferred to be in the tag that was most recently opened, so, if you want to explicitly include them in either the head or body, you need an explicit tag.
  2. ASCII whitespace has a different interpretation in the head and body, where it's ignored in the head, but treated as a text node in the body. The default interpretation is that any leading ASCII whitespace is treated as part of the <head> unless you explicitly close it and make it part of the body.

By these rules, you can explicitly omit the head and body altogether and it'll interpret what is what based upon where the tags are usually located, but you can also choose to simply omit the </head> and use <body> to end the head, since it's well-understood that the <html> tag just contains a head and body, in that order.

I included the <html><head> example since that was the simplest one, but technically there should also be a <html><head><body> test as well to indicate the body too. And perhaps a few other simple examples, like:

  • <!DOCTYPE><html><meta><p>:
(document (doctype)
  (element
    (start_tag (tag_name))
  (element
    (start_tag (tag_name)))
  • <!DOCTYPE><html><head><meta><p>:
(document (doctype)
  (element
    (element
      (start_tag (tag_name)))
  (element
    (start_tag (tag_name)))
  • <!DOCTYPE><html><meta><body><p>
(document (doctype)
  (element
    (start_tag (tag_name))
  (element
    (element
      (start_tag (tag_name))))
  • <!DOCTYPE><html><head><meta><body><p>
(document (doctype)
  (element
    (element
      (start_tag (tag_name)))
  (element
    (element
      (start_tag (tag_name))))

I don't think that tree-sitter needs to explicitly sort the tags into a head and body (it's fine with other elements inside <html> directly) but I think that it should be able to implicitly close a </head> tag based upon the presence of body elements. Right now, it just complains still about the main case, which is a <head> being implicitly closed by a <body>.

Also to add a bit more context: the spec has a more specific explanation of the algorithm for parsing documents that goes over the way these two elements are parsed: https://html.spec.whatwg.org/multipage/parsing.html#parsing-main-inhtml

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants