Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HTML5 parser closes <math> before <span> #2335

Closed
syakovyn opened this issue Oct 1, 2021 · 13 comments
Closed

HTML5 parser closes <math> before <span> #2335

syakovyn opened this issue Oct 1, 2021 · 13 comments

Comments

@syakovyn
Copy link

syakovyn commented Oct 1, 2021

HTML5 parser differs in output from HTML4 parser. We consider the HTML4 parser output is the right one.

A script to reproduce the bug:

#! /usr/bin/env ruby

require 'nokogiri'
require 'minitest/autorun'

class Test < MiniTest::Spec
  describe "HTML5 parser" do
    it "produces the same output as HTML4" do
      html = <<-HTML
<mjx-container>
<mjx-assistive-mml>
<math xmlns=\"http://www.w3.org/1998/Math/MathML\">
            <msqrt><span>\n\n  </span>
                <mn>3</mn>
            </msqrt>
        </math>
        <script>window.destroyAllHumans()</script>
        \n\n
        <msqrt>
            <mn>4</mn>
        </msqrt>
    </mjx-assistive-mml>
</mjx-container>
      HTML
      html5 = Nokogiri::HTML5.fragment(html).to_s
      html4 = Nokogiri::HTML4.fragment(html).to_s

      assert_equal html4, html5
    end
  end
end

Environment

# Nokogiri (1.12.5)
    ---
    warnings: []
    nokogiri:
      version: 1.12.5
      cppflags:
      - "-I/Users/syakovyn/.rvm/gems/ruby-2.7.4/gems/nokogiri-1.12.5-x86_64-darwin/ext/nokogiri"
      - "-I/Users/syakovyn/.rvm/gems/ruby-2.7.4/gems/nokogiri-1.12.5-x86_64-darwin/ext/nokogiri/include"
      - "-I/Users/syakovyn/.rvm/gems/ruby-2.7.4/gems/nokogiri-1.12.5-x86_64-darwin/ext/nokogiri/include/libxml2"
      ldflags: []
    ruby:
      version: 2.7.4
      platform: x86_64-darwin20
      gem_platform: x86_64-darwin-20
      description: ruby 2.7.4p191 (2021-07-07 revision a21a3b7d23) [x86_64-darwin20]
      engine: ruby
    libxml:
      source: packaged
      precompiled: true
      patches:
      - 0001-Remove-script-macro-support.patch
      - 0002-Update-entities-to-remove-handling-of-ssi.patch
      - 0003-libxml2.la-is-in-top_builddir.patch
      - 0004-use-glibc-strlen.patch
      - 0005-avoid-isnan-isinf.patch
      - 0006-update-automake-files-for-arm64.patch
      - 0007-Fix-XPath-recursion-limit.patch
      libxml2_path: "/Users/syakovyn/.rvm/gems/ruby-2.7.4/gems/nokogiri-1.12.5-x86_64-darwin/ext/nokogiri"
      memory_management: ruby
      iconv_enabled: true
      compiled: 2.9.12
      loaded: 2.9.12
    libxslt:
      source: packaged
      precompiled: true
      patches:
      - 0001-update-automake-files-for-arm64.patch
      - 0002-Fix-xml2-config-check-in-configure-script.patch
      datetime_enabled: true
      compiled: 1.1.34
      loaded: 1.1.34
    other_libraries:
      zlib: 1.2.11
      libiconv: '1.15'
      libgumbo: 1.0.0-nokogiri
@syakovyn syakovyn added the state/needs-triage Inbox for non-installation-related bug reports or help requests label Oct 1, 2021
@flavorjones
Copy link
Member

@syakovyn Thanks for submitting this, and sorry you're having problems.

I've reproduced the problem, but have a slightly different interpretation of what's happening. Here's the script I'm using:

#! /usr/bin/env ruby

require "nokogiri"

html = <<~EOF
  <math xmlns=\"http://www.w3.org/1998/Math/MathML\">
    <msqrt><mn>4</mn></msqrt>
  </math>
EOF

puts Nokogiri::HTML5.fragment(html).to_html
# => <math xmlns="http://www.w3.org/1998/Math/MathML">
#      <msqrt><mn>4</mn></msqrt>
#    </math>

html = <<~EOF
  <math xmlns=\"http://www.w3.org/1998/Math/MathML\">
    <msqrt><span></span><mn>3</mn></msqrt>
  </math>
EOF

puts Nokogiri::HTML5.fragment(html).to_html
# => <math xmlns="http://www.w3.org/1998/Math/MathML">
#      <msqrt></msqrt></math><span></span><mn>3</mn>

My guess at the root cause is that the <span> tag is forcing early closure of the <math> tag.

@stevecheckoway Can you validate my interpretation of what Gumbo is doing here? If you're able, I'd love to pair with you on debugging and fixing the behavior.

@flavorjones flavorjones changed the title HTML5 parser moves <mn> tag out of <msqrt> HTML5 parser closes <math> before <span> Oct 1, 2021
@flavorjones
Copy link
Member

More information:

html = <<~EOF
  <math xmlns=\"http://www.w3.org/1998/Math/MathML\">
    <msqrt><span></span><mn>3</mn></msqrt>
  </math>
EOF

pp Nokogiri::HTML5.fragment(html, max_errors: 100).errors

outputs

[#<Nokogiri::XML::SyntaxError: 2:10: ERROR: That tag isn't allowed here  Currently open tags: html, math, .
  <msqrt><span></span><mn>3</mn></msqrt>
         ^>,
 #<Nokogiri::XML::SyntaxError: 2:33: ERROR: That tag isn't allowed here  Currently open tags: html.
  <msqrt><span></span><mn>3</mn></msqrt>
                                ^>,
 #<Nokogiri::XML::SyntaxError: 3:1: ERROR: That tag isn't allowed here  Currently open tags: html.
</math>
^>]

@flavorjones
Copy link
Member

This looks to me like Gumbo considers a <span> element within a <math> block to be invalid. I'm not familiar with the MathML spec and so will defer to @syakovyn and @stevecheckoway and others to discuss.

@flavorjones flavorjones added topic/gumbo Gumbo HTML5 parser topic/HTML5 and removed state/needs-triage Inbox for non-installation-related bug reports or help requests labels Oct 1, 2021
@stevecheckoway
Copy link
Contributor

stevecheckoway commented Oct 1, 2021

I can track down the exact portion of the spec that mandates this behavior, but I think the easiest thing to do is to open a document in a browser and use the browser's web tools to inspect the DOM. Alternatively, you can use JavaScript to print the body element's innerHTML. Let's do that:

<!DOCTYPE html>
<math xmlns="http://www.w3.org/1998/Math/MathML">
  <msqrt><span></span><mn>4</mn></msqrt>
</math>
<script>
alert(document.body.innerHTML)
</script>

If you save that as a file and open it, it'll pop up an alert containing (and I've reformatted the HTML slightly)

<math xmlns="http://www.w3.org/1998/Math/MathML">
  <msqrt></msqrt>
</math>
<span></span>
<mn>4</mn>

In other words, I believe the HTML5 parser is correctly parsing this document.

@stevecheckoway
Copy link
Contributor

stevecheckoway commented Oct 1, 2021

And for what it's worth, this is the section of the HTML standard that mandates this behavior. When the parser encounters the <span>, it is in a foreign context (because the parent element is a math element in the MathML namespace). The rule

A start tag whose tag name is one of: … "span" …

applies and that mandates a parse error and, in this case, closing the msqrt and math elements.

The other two errors @flavorjones's showed come from the browser encountering </msqrt> and then </math> without there being open msqrt and math elements.

@flavorjones
Copy link
Member

Thanks for that additional context, @stevecheckoway. I'm going to close this issue, but @syakovyn please feel free to comment or ask questions if we haven't helped you!

@syakovyn
Copy link
Author

As far as I understood, we were relying on the bug Nokogumbo HTML5 parser has by producing the same output as Nokogiri::HTML4 parser. In turn, we were able to cleanup the resulting HTML with https://github.com/rgrove/sanitize transformers afterwards.

Now, in order to upgrade to Nokogiri 1.12 we need to come up with a way to do that before we have the parsed HTML. I'd really appreciate If someone can give me an idea how to achieve that.

Thanks.

@stevecheckoway
Copy link
Contributor

I'm not sure I understand. Which bug are you referring to?

The HTML5 parser (which is the same as Nokogumbo's parser) parses according to HTML living standard (or if it doesn't, that's a bug we should fix). The idea being that the DOM you get from parsing is exactly the same as the DOM a modern browser is going to construct.

The HTML4 parser relies on libxml2 which does not parse HTML the same way that modern browsers do.

It sounds like you want the HTML4 parser behavior. Is that correct?

@syakovyn
Copy link
Author

Sorry for not being clear.

We use Sanitize gem to cleanup user's input. In this particular case, https://github.com/rgrove/sanitize/blob/main/lib/sanitize.rb#L135-L141 works well with Nokogiri 1.11.7 but is broken with Nokogiri 1.12.5.

I assume there is a bug in Nokogumbo 2.0.5 that allows us to get parsed HTML and remove illegal spans from MathML expressions afterwards.
However, that does not work with Nokogiri 1.12.5.

Is there a way to plug-into the parsing process and remove the illegal spans before they break the output?

Hope, I described our issue clearer this time.

Thanks.

@flavorjones
Copy link
Member

@syakovyn I think it's easier to communicate with code whenever possible. I think I understand what you are saying:

#! /usr/bin/env ruby

require "nokogumbo"

puts "nokogiri: #{Nokogiri::VERSION}"
puts "nokogumbo: #{Nokogumbo::VERSION}"

html = <<-HTML
<math xmlns=\"http://www.w3.org/1998/Math/MathML\">
  <msqrt>
    <span>hello</span>
    <mn>3</mn>
  </msqrt>
</math>
HTML

puts Nokogiri::HTML5.fragment(html).to_s

When running with Nokogumbo 2.0.5 and Nokogiri 1.11 the output is:

nokogiri: 1.11.7
nokogumbo: 2.0.5
<math xmlns="http://www.w3.org/1998/Math/MathML">
  <msqrt>
    <span>hello</span>
    <mn>3</mn>
  </msqrt>
</math>

When running with Nokogiri 1.12 the output is:

NOTE: nokogumbo: Using Nokogiri::HTML5 provided by Nokogiri. See https://github.com/sparklemotion/nokogiri/issues/2205 for more information.
nokogiri: 1.12.5
nokogumbo: 2.0.5
<math xmlns="http://www.w3.org/1998/Math/MathML">
  <msqrt>
    </msqrt></math><span>hello</span>
    <mn>3</mn>

@stevecheckoway It looks like this behavior changed post-nokogumbo-2.0.5 in b317bb8 (after a git bisect), and I believe that behavior change is what @syakovyn is asking about.

My interpretation of the current behavior (introduced in b317bb8) is that gumbo is correctly reflecting the MathML and HTML5 specs by treating <span> like this in a foreign context. Correct?

@syakovyn
Copy link
Author

@flavorjones, exactly! You perfectly described the issue we are facing!

@stevecheckoway
Copy link
Contributor

stevecheckoway commented Oct 12, 2021

Ah ha, nice investigation @flavorjones!

It looks like that commit changed two things. It changed the handling of </p> and </br> in foreign contexts and it changed the handling of the fragment case to be the same as the non-fragment case.

The first of those was required by whatwg/html@5333b04 which is what the commit was trying to address.

Based on the comment about fragments in the code and in the commit message, I appear to have noticed that the standard changed to mandate the same behavior in the fragment case as in the non-fragment case but failed to remove the outdated comment. Here's the change for that behavior whatwg/html@1d271f2. (That change caused a regression that was fixed in whatwg/html#6455.)

So I now understand the change in behavior @syakovyn described. But I'm not really sure what to do about it. Whatever is producing the span in the math element is incorrect. Can that be changed?

I don't think we want to have a parsing mode that respects some of the semantics of HTML but not others. And I'm not even sure how we could do that in a consistent way. E.g.,

<!DOCTYPE html>
<p>A<p>B

should produce the same thing as <p>A</p><p>B</p> as that's valid HTML, but what about this?

<!DOCTYPE html>
<p>A<p>B</p>C</p>D

It seems like what @syakovyn would like is akin to the second p element to be a child of the first in this example. In @syakovyn's case, ignore what the standard explicitly says about <span> inside an open math element. In this case, ignore what the standard explicitly says about <p> inside an open p element.

@flavorjones
Copy link
Member

@stevecheckoway Thanks for confirming that gumbo is following the spec here.

I think as a guiding policy, Nokogiri's gumbo implementation should follow the spec and we should decline requests for customizing behavior.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants