HTML5 parser closes <math> before <span> #2335
Comments
@syakovyn Thanks for submitting this, and sorry you're having problems. I've reproduced the problem, but have a slightly different interpretation of what's happening. Here's the script I'm using:
#! /usr/bin/env ruby
require "nokogiri"
html = <<~EOF
<math xmlns=\"http://www.w3.org/1998/Math/MathML\">
<msqrt><mn>4</mn></msqrt>
</math>
EOF
puts Nokogiri::HTML5.fragment(html).to_html
# => <math xmlns="http://www.w3.org/1998/Math/MathML">
# <msqrt><mn>4</mn></msqrt>
# </math>
html = <<~EOF
<math xmlns=\"http://www.w3.org/1998/Math/MathML\">
<msqrt><span></span><mn>3</mn></msqrt>
</math>
EOF
puts Nokogiri::HTML5.fragment(html).to_html
# => <math xmlns="http://www.w3.org/1998/Math/MathML">
# <msqrt></msqrt></math><span></span><mn>3</mn>
My guess at the root cause is that the
@stevecheckoway Can you validate my interpretation of what Gumbo is doing here? If you're able, I'd love to pair with you on debugging and fixing the behavior.
More information:
html = <<~EOF
<math xmlns=\"http://www.w3.org/1998/Math/MathML\">
<msqrt><span></span><mn>3</mn></msqrt>
</math>
EOF
pp Nokogiri::HTML5.fragment(html, max_errors: 100).errors
outputs
This looks to me like Gumbo considers a
I can track down the exact portion of the spec that mandates this behavior, but I think the easiest thing to do is to open a document in a browser and use the browser's web tools to inspect the DOM. Alternatively, you can use JavaScript to print the
<!DOCTYPE html>
<math xmlns="http://www.w3.org/1998/Math/MathML">
<msqrt><span></span><mn>4</mn></msqrt>
</math>
<script>
alert(document.body.innerHTML)
</script>
If you save that as a file and open it, it'll pop up an alert containing (and I've reformatted the HTML slightly)
In other words, I believe the HTML5 parser is correctly parsing this document.
And for what it's worth, this is the section of the HTML standard that mandates this behavior. When the parser encounters the
applies and that mandates a parse error and, in this case, closing the
The other two errors @flavorjones showed come from the browser encountering
Thanks for that additional context, @stevecheckoway. I'm going to close this issue, but @syakovyn please feel free to comment or ask questions if we haven't helped you!
As far as I understand, we were relying on a bug in the Nokogumbo HTML5 parser that made it produce the same output as the Nokogiri::HTML4 parser. That let us clean up the resulting HTML with https://github.com/rgrove/sanitize transformers afterwards. Now, in order to upgrade to Nokogiri 1.12, we need to find a way to do that before we have the parsed HTML. I'd really appreciate it if someone could give me an idea of how to achieve that. Thanks.
I'm not sure I understand. Which bug are you referring to? The HTML5 parser (which is the same as Nokogumbo's parser) parses according to the HTML Living Standard (or if it doesn't, that's a bug we should fix). The idea is that the DOM you get from parsing is exactly the same as the DOM a modern browser is going to construct. The HTML4 parser relies on libxml2, which does not parse HTML the same way that modern browsers do. It sounds like you want the HTML4 parser behavior. Is that correct?
Sorry for not being clear. We use the Sanitize gem to clean up users' input. In this particular case, https://github.com/rgrove/sanitize/blob/main/lib/sanitize.rb#L135-L141 works well with Nokogiri 1.11.7 but is broken with Nokogiri 1.12.5. I assume there is a bug in Nokogumbo 2.0.5 that allows us to get the parsed HTML and remove illegal spans from MathML expressions afterwards. Is there a way to plug into the parsing process and remove the illegal spans before they break the output? I hope I have described our issue more clearly this time. Thanks.
@syakovyn I think it's easier to communicate with code whenever possible. I think I understand what you are saying:
#! /usr/bin/env ruby
require "nokogumbo"
puts "nokogiri: #{Nokogiri::VERSION}"
puts "nokogumbo: #{Nokogumbo::VERSION}"
html = <<-HTML
<math xmlns=\"http://www.w3.org/1998/Math/MathML\">
<msqrt>
<span>hello</span>
<mn>3</mn>
</msqrt>
</math>
HTML
puts Nokogiri::HTML5.fragment(html).to_s
When running with Nokogumbo 2.0.5 and Nokogiri 1.11 the output is:
When running with Nokogiri 1.12 the output is:
@stevecheckoway After a git bisect, it looks like this behavior changed post-nokogumbo-2.0.5 in b317bb8, and I believe that behavior change is what @syakovyn is asking about. My interpretation of the current behavior (introduced in b317bb8) is that gumbo is correctly reflecting the MathML and HTML5 specs by treating
@flavorjones, exactly! You perfectly described the issue we are facing!
Ah ha, nice investigation @flavorjones! It looks like that commit changed two things. It changed the handling of
The first of those was required by whatwg/html@5333b04 which is what the commit was trying to address. Based on the comment about fragments in the code and in the commit message, I appear to have noticed that the standard changed to mandate the same behavior in the fragment case as in the non-fragment case but failed to remove the outdated comment. Here's the change for that behavior whatwg/html@1d271f2. (That change caused a regression that was fixed in whatwg/html#6455.)
So I now understand the change in behavior @syakovyn described. But I'm not really sure what to do about it. Whatever is producing the span in the
I don't think we want to have a parsing mode that respects some of the semantics of HTML but not others. And I'm not even sure how we could do that in a consistent way. E.g.,
<!DOCTYPE html>
<p>A<p>B
should produce the same thing as
<!DOCTYPE html>
<p>A<p>B</p>C</p>D
It seems like what @syakovyn would like is akin to the second
@stevecheckoway Thanks for confirming that gumbo is following the spec here. I think as a guiding policy, Nokogiri's gumbo implementation should follow the spec and we should decline requests for customizing behavior.
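Since the parser follows the spec and will not be customized, anyone in @syakovyn's position would have to remove the offending spans before the string reaches the HTML5 parser. The following is only a minimal sketch of that idea, not something proposed in this thread: `strip_spans_in_math` is a hypothetical helper (it is not part of Nokogiri, Nokogumbo, or Sanitize), it assumes well-formed, non-nested `<math>...</math>` regions, and it chooses to keep the text inside each span. Regex-based rewriting of HTML is fragile (attribute values containing `>`, comments, and unclosed `<math>` tags are not handled), so the result should still be passed through the usual sanitizer afterwards.

```ruby
# Hypothetical pre-processing helper; not part of Nokogiri or Sanitize.
# Assumption: <math> regions are well formed and not nested.

# Matches one <math ...>...</math> region, non-greedily, across newlines.
MATH_BLOCK = %r{<math\b[^>]*>.*?</math\s*>}mi
# Matches an opening or closing <span> tag, with or without attributes.
SPAN_TAG = %r{</?span\b[^>]*>}i

# Removes <span> start and end tags found inside <math> regions,
# keeping any text that the spans wrapped. Spans outside <math>
# are left untouched.
def strip_spans_in_math(html)
  html.gsub(MATH_BLOCK) { |math| math.gsub(SPAN_TAG, "") }
end

input = <<~HTML
  <math xmlns="http://www.w3.org/1998/Math/MathML">
    <msqrt><span>hello</span><mn>3</mn></msqrt>
  </math>
  <span>outside</span>
HTML

cleaned = strip_spans_in_math(input)
puts cleaned
```

The cleaned string could then be handed to `Nokogiri::HTML5.fragment` as in the scripts above. With no `<span>` left inside `<math>`, the rule that closes the MathML element on encountering a span should no longer be triggered for that input.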
The HTML5 parser differs in output from the HTML4 parser. We consider the HTML4 parser output to be the right one.
A script to reproduce the bug:
Environment