[bug] parsing a fragment in-context may yield a different DOM than parsing the same document due to quirks mode handling #2646

dan42 · 2022-09-13T21:05:42Z

I'm not sure if the title above is correct, but please consider these two ways of parsing a <table> within a <p>:

require "nokogiri"
tags = "<p>A <table><tr><td>B</table> C"
doc = Nokogiri::HTML5.parse("<div>"+tags)
puts doc
doc.at("div").inner_html = tags
puts doc

output:

<html><head></head><body><div><p>A <table><tbody><tr><td>B</td></tr></tbody></table> C</p></div></body></html>
<html><head></head><body><div><p>A </p><table><tbody><tr><td>B</td></tr></tbody></table> C</div></body></html>

I'm fairly certain the output should not be different. In both cases we're parsing the same tags, and the parent element is a <div>. Note that Nokogiri::HTML5.fragment(tags) produces the same structure as Nokogiri::HTML5.parse(tags). So this bug/behavior looks specific to coerce somehow. But in the code sample above if we replace HTML5.parse by HTML5.fragment, the two puts doc produce the same output. So honestly I'm not sure what's going on.

The text was updated successfully, but these errors were encountered:

flavorjones · 2022-09-14T14:09:51Z

Thanks for opening this issue! Great question.

I dug in a bit and here's a code snippet that explains a bit about what's happening, which is that inner_html= parses a HTML5::DocumentFragment from the tags string:

#! /usr/bin/env ruby

require "nokogiri"

tags = "<p>A <table><tr><td>B</table> C"
doc = Nokogiri::HTML5.parse("<div>"+tags)
doc.at_css("div").to_s # => "<div><p>A <table><tbody><tr><td>B</td></tr></tbody></table> C</p></div>"

doc = Nokogiri::HTML5.parse("<div>")
doc.at_css("div").inner_html = tags
doc.at_css("div").to_s # => "<div><p>A </p><table><tbody><tr><td>B</td></tr></tbody></table> C</div>"

# making explicit the calls which implement Node#inner_html=
doc = Nokogiri::HTML5.parse("<div>")
frag = Nokogiri::HTML5::DocumentFragment.new(doc, tags, doc.at_css("div"))
frag.to_s # => "<p>A </p><table><tbody><tr><td>B</td></tr></tbody></table> C"
doc.at_css("div").children = frag
doc.at_css("div").to_s # => "<div><p>A </p><table><tbody><tr><td>B</td></tr></tbody></table> C</div>"

I expect that the DOM structure should be identical, whether parsed as a document or as a fragment with a <div> context node.

I'll schedule some time to dig into the gumbo fragment parsing. Pinging @stevecheckoway for awareness and advice.

dan42 · 2022-09-14T16:44:58Z

But the strange thing is that this isn't just about fragment parsing.

require "nokogiri"

tags = "<p>A <table><tr><td>B</table> C"

base = Nokogiri::HTML5.fragment("<div>")
frag = Nokogiri::HTML5::DocumentFragment.new(base.document, tags, base.at_css("div"))
frag.to_s #=> "<p>A <table><tbody><tr><td>B</td></tr></tbody></table> C</p>"

In this case the result is correct even though we're using fragment parsing. The bug seems to occur only when parsing a fragment WHILE using a document that was previously created via document parsing.

flavorjones · 2022-11-27T05:15:33Z

OK, I've narrowed this down to whether the parser is in "quirks mode" or not.

Minimal repro

#! /usr/bin/env ruby

require "bundler/inline"

gemfile do
  source "https://rubygems.org"
  gem "nokogiri", path: "."
end

require "nokogiri"

tags = "<p><table></table>"

doc = Nokogiri::HTML5.parse("<div>"+tags)
doc.at_css("div").to_s # => "<div><p><table></table></p></div>"

doc = Nokogiri::HTML5.parse("<div>")
doc.at_css("div").inner_html = tags
doc.at_css("div").to_s # => "<div><p></p><table></table></div>"

Diagnosis

In the C code for Gumbo.fragment there's this clause which in this case sets NO_QUIRKS mode:

  // Quirks mode.
  VALUE doc = rb_funcall(doc_fragment, rb_intern_const("document"), 0);
  VALUE dtd = rb_funcall(doc, internal_subset, 0);
  if (NIL_P(dtd)) {
    quirks_mode = GUMBO_DOCTYPE_NO_QUIRKS;
  } else {
    VALUE dtd_name = rb_funcall(dtd, name, 0);
    VALUE pubid = rb_funcall(dtd, rb_intern_const("external_id"), 0);
    VALUE sysid = rb_funcall(dtd, rb_intern_const("system_id"), 0);
    quirks_mode = gumbo_compute_quirks_mode(
                    NIL_P(dtd_name) ? NULL : StringValueCStr(dtd_name),
                    NIL_P(pubid) ? NULL : StringValueCStr(pubid),
                    NIL_P(sysid) ? NULL : StringValueCStr(sysid)
                  );
  }

What's happening in this case is that if we parse <div><p><table></table> as a document, then gumbo puts the document into GUMBO_DOCTYPE_QUIRKS and so when handle_in_body sees the table tag this clause doesn't close the p tag:

  if (tag_is(token, kStartTag, GUMBO_TAG_TABLE)) {
    if (
      get_document_node(parser)->v.document.doc_type_quirks_mode
        != GUMBO_DOCTYPE_QUIRKS
    ) {
      maybe_implicitly_close_p_tag(parser, token);
    }

However, when we parse <p><table></table> as a fragment in the context of the div tag, then quirks mode is reset to NO_QUIRKS and so that clause does close the p tag.

Possible solutions

@stevecheckoway I wanted to ask you about this behavior. Rather than try to re-compute quirks mode during fragment parsing initialization, would it be preferable to persist the quirks mode as an attribute on the Nokogiri::HTML5::Document object, and then have any fragment parsing within that document use that value?

This is what I'm suggesting:

diff --git a/ext/nokogiri/gumbo.c b/ext/nokogiri/gumbo.c
index c4211d3..5150145 100644
--- a/ext/nokogiri/gumbo.c
+++ b/ext/nokogiri/gumbo.c
@@ -361,6 +361,7 @@ parse_continue(VALUE parse_args)
   build_tree(doc, (xmlNodePtr)doc, output->document);
   VALUE rdoc = noko_xml_document_wrap(args->klass, doc);
   rb_iv_set(rdoc, "@url", args->url_or_frag);
+  rb_iv_set(rdoc, "@quirks_mode", INT2NUM(output->document->v.document.doc_type_quirks_mode));
   args->doc = NULL; // The Ruby runtime now owns doc so don't delete it.
   add_errors(output, rdoc, args->input, args->url_or_frag);
   return rdoc;
@@ -517,7 +518,10 @@ fragment(
   // Quirks mode.
   VALUE doc = rb_funcall(doc_fragment, rb_intern_const("document"), 0);
   VALUE dtd = rb_funcall(doc, internal_subset, 0);
-  if (NIL_P(dtd)) {
+  VALUE document_quirks_mode = rb_iv_get(doc, "@quirks_mode");
+  if (!NIL_P(document_quirks_mode)) {
+    quirks_mode = NUM2INT(document_quirks_mode);
+  } else if (NIL_P(dtd)) {
     quirks_mode = GUMBO_DOCTYPE_NO_QUIRKS;
   } else {
     VALUE dtd_name = rb_funcall(dtd, name, 0);

The result of this patch is that the fragment "inherits" the quirks mode from the document, and applies it, and so the result is consistent:

#! /usr/bin/env ruby

require "bundler/inline"

gemfile do
  source "https://rubygems.org"
  gem "nokogiri", path: "."
end

require "nokogiri"

tags = "<p><table></table>"

doc = Nokogiri::HTML5.parse("<div>"+tags)
doc.at_css("div").to_s # => "<div><p><table></table></p></div>"

doc = Nokogiri::HTML5.parse("<div>")
doc.at_css("div").inner_html = tags
doc.at_css("div").to_s # => "<div><p><table></table></p></div>"

If that seems like the right thing to do, I'm happy to write up some tests and submit a PR. Let me know if you have other ideas?

Update: I guess one more idea is simply to have the fragment default to QUIRKS_MODE? I haven't done the work to understand when/why the document is put into quirks mode, so I'm not sure if this is appropriate ... but handle_initial does seem to default to QUIRKS.

stevecheckoway · 2022-11-27T06:27:37Z

I'm sorry I didn't respond earlier. I missed this issue entirely.

The underlying bug is the fragment() incorrectly detects the quirks mode of the document. If there is no DOCTYPE, then the document should be in quirks mode.

See "Anything else" in The "initial" insertion mode.

@stevecheckoway I wanted to ask you about this behavior. Rather than try to re-compute quirks mode during fragment parsing initialization, would it be preferable to persist the quirks mode as an attribute on the Nokogiri::HTML5::Document object, and then have any fragment parsing within that document use that value?

I think this is exactly the right approach (modulo fixing my bug when there's no DOCTYPE). The rules for parsing HTML fragments explicitly say that the document created for fragment parsing has the same quirks as the context node's document.

Update: I guess one more idea is simply to have the fragment default to QUIRKS_MODE? I haven't done the work to understand when/why the document is put into quirks mode, so I'm not sure if this is appropriate ... but handle_initial does seem to default to QUIRKS.

If the context node comes from a real document, then we should do what the standard says and copy the quirks. If there is no real context node, then this is a judgment call but I think we should go with no quirks mode for two reasons.

The document will be in quirks mode when a DOCTYPE is omitted but that's a parse error.
p elements contain phrasing content which doesn't include table elements and the quirks mode puts the table inside the p.

This is the only place in the parser that I'm aware of that the quirks mode actually matters. (There's a limited quirks mode that never seems to matter.)

irb(main):001:0> Nokogiri::HTML5.parse('<!DOCTYPE html><p><table></table>').to_s
=> "<!DOCTYPE html><html><head></head><body><p></p><table></table></body></html>"
irb(main):002:0> Nokogiri::HTML5.parse('<p><table></table>').to_s
=> "<html><head></head><body><p><table></table></p></body></html>"

flavorjones · 2022-11-28T02:38:58Z

@stevecheckoway Thanks for the complete and thoughtful response. I'll open a PR in the next day or so and ask for your review to make sure the implementation matches this description.

and make sure the behavior of quirks mode matches the discussion at issue #2646.

flavorjones · 2022-12-20T17:30:15Z

see PR at #2736

and make sure the behavior of quirks mode matches the discussion at issue #2646.

flavorjones · 2022-12-21T19:15:36Z

Will be fixed in the v1.14.0 release!

This more completely implements the guidance originally agreed upon in issue #2646, as there is no "context node com[ing] from a real document" when a tag name is provided.

…ext (#3246) **What problem is this PR intended to solve?** Coming from the discussion at #3203, I wanted to improve the fragment parsing API - as discussed in #2646, parse in no-quirks mode if a context element name is provided (not a context `Node`, just the name) - allow passing `:context` kwarg to `DocumentFragment.new` and `.parse` - deprecate the positional options hash to `.parse` per notes at #3200 **Have you included adequate test coverage?** Included additional coverage for the API changes **Does this change affect the behavior of either the C or the Java implementations?** HTML5 is only in CRuby.

dan42 added the state/needs-triage Inbox for non-installation-related bug reports or help requests label Sep 13, 2022

flavorjones added topic/gumbo Gumbo HTML5 parser and removed state/needs-triage Inbox for non-installation-related bug reports or help requests labels Sep 14, 2022

flavorjones added this to the v1.15.0 milestone Nov 22, 2022

flavorjones changed the title ~~[bug] parsing of <p> is different when a context is specified via coerce~~ [bug] parsing of <p> is different when a context is specified via coerce Nov 27, 2022

flavorjones changed the title ~~[bug] parsing of <p> is different when a context is specified via coerce~~ [bug] parsing a fragment in-context may yield a different DOM than parsing the same document due to quirks mode handling Nov 27, 2022

flavorjones modified the milestones: v1.15.0, v1.14.0 Nov 27, 2022

flavorjones added a commit that referenced this issue Dec 20, 2022

feat: HTML5::DocumentFragment#quirks_mode

39c2d5f

and make sure the behavior of quirks mode matches the discussion at issue #2646.

flavorjones mentioned this issue Dec 20, 2022

fix: fragment context parsing #2736

Merged

flavorjones added a commit that referenced this issue Dec 20, 2022

feat: HTML5::DocumentFragment#quirks_mode

3edd6c8

and make sure the behavior of quirks mode matches the discussion at issue #2646.

flavorjones closed this as completed in #2736 Dec 21, 2022

flavorjones mentioned this issue Jun 21, 2024

Fix fragment quirks mode, and make it easier to specify fragment context #3246

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[bug] parsing a fragment in-context may yield a different DOM than parsing the same document due to quirks mode handling #2646

[bug] parsing a fragment in-context may yield a different DOM than parsing the same document due to quirks mode handling #2646

dan42 commented Sep 13, 2022

flavorjones commented Sep 14, 2022

dan42 commented Sep 14, 2022

flavorjones commented Nov 27, 2022 •

edited

Loading

stevecheckoway commented Nov 27, 2022

flavorjones commented Nov 28, 2022 •

edited

Loading

flavorjones commented Dec 20, 2022

flavorjones commented Dec 21, 2022

[bug] parsing a fragment in-context may yield a different DOM than parsing the same document due to quirks mode handling #2646

[bug] parsing a fragment in-context may yield a different DOM than parsing the same document due to quirks mode handling #2646

Comments

dan42 commented Sep 13, 2022

flavorjones commented Sep 14, 2022

dan42 commented Sep 14, 2022

flavorjones commented Nov 27, 2022 • edited Loading

Minimal repro

Diagnosis

Possible solutions

stevecheckoway commented Nov 27, 2022

flavorjones commented Nov 28, 2022 • edited Loading

flavorjones commented Dec 20, 2022

flavorjones commented Dec 21, 2022

flavorjones commented Nov 27, 2022 •

edited

Loading

flavorjones commented Nov 28, 2022 •

edited

Loading