Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[bug] HTML5 parser loses reference to url parameter value #2583

Closed
jgarber623 opened this issue Jun 23, 2022 · 3 comments · Fixed by #2584
Closed

[bug] HTML5 parser loses reference to url parameter value #2583

jgarber623 opened this issue Jun 23, 2022 · 3 comments · Fixed by #2584

Comments

@jgarber623
Copy link

Please describe the bug

Invocations of Nokogiri::HTML4() and Nokogiri::HTML5() both support optional second parameters representing the parsed document's URL:

The resulting Nokogiri::HTMLx::Document instance responds to a url method that should return the string value passed to Nokogiri::HTML4/Nokogiri::HTML5. This works with the HTML4 parser (alternatively invoked as Nokogiri::HTML()) but the HTML5 parser consistently returns nil.

I did my best to trace the code path from Nokogiri::HTML5() downward, but I got lost in Nokogiri::HTML5::Document's private do_parse method (source):

def do_parse(string_or_io, url, encoding, options)
  string = HTML5.read_and_encode(string_or_io, encoding)
  max_attributes = options[:max_attributes] || Nokogiri::Gumbo::DEFAULT_MAX_ATTRIBUTES
  max_errors = options[:max_errors] || options[:max_parse_errors] || Nokogiri::Gumbo::DEFAULT_MAX_ERRORS
  max_depth = options[:max_tree_depth] || Nokogiri::Gumbo::DEFAULT_MAX_TREE_DEPTH
  doc = Nokogiri::Gumbo.parse(string, url, max_attributes, max_errors, max_depth, self)
  doc.encoding = "UTF-8"
  doc
end

Specifically, I couldn't locate source for the Nokogiri::Gumbo.parse call on line 66. I suspect this code may be generated from C code which is beyond my level of understanding. Is that the case? Did I miss something?

Help us reproduce what you're seeing

Save the following script as test.rb (or similar) and run via ruby test.rb.

require 'bundler/inline'

gemfile do
  source 'https://rubygems.org'

  gem 'nokogiri', '~> 1.13'
end

url = 'https://hello.example'
markup = '<body>Hello, world!</body>'

puts 'Testing Nokogiri::HTML...'
puts Nokogiri::HTML(markup, url).url
# => https://hello.example

puts 'Testing Nokogiri::HTML4...'
puts Nokogiri::HTML4(markup, url).url
# => https://hello.example

puts 'Testing Nokogiri::HTML5...'
puts Nokogiri::HTML5(markup, url).url
# => nil

Expected behavior

I expected that a call to Nokogiri::HTML5(markup, url).url would return the value passed as the second parameter, consistent with the HTML4 parser.

Environment

# Nokogiri (1.13.6)
    ---
    warnings: []
    nokogiri:
      version: 1.13.6
      cppflags:
      - "-I/Users/jason/.rbenv/versions/2.7.6/lib/ruby/gems/2.7.0/gems/nokogiri-1.13.6-arm64-darwin/ext/nokogiri"
      - "-I/Users/jason/.rbenv/versions/2.7.6/lib/ruby/gems/2.7.0/gems/nokogiri-1.13.6-arm64-darwin/ext/nokogiri/include"
      - "-I/Users/jason/.rbenv/versions/2.7.6/lib/ruby/gems/2.7.0/gems/nokogiri-1.13.6-arm64-darwin/ext/nokogiri/include/libxml2"
      ldflags: []
    ruby:
      version: 2.7.6
      platform: arm64-darwin21
      gem_platform: arm64-darwin-21
      description: ruby 2.7.6p219 (2022-04-12 revision c9c2245c0a) [arm64-darwin21]
      engine: ruby
    libxml:
      source: packaged
      precompiled: true
      patches:
      - 0001-Remove-script-macro-support.patch
      - 0002-Update-entities-to-remove-handling-of-ssi.patch
      - 0003-libxml2.la-is-in-top_builddir.patch
      - 0004-use-glibc-strlen.patch
      - 0005-avoid-isnan-isinf.patch
      - 0006-update-automake-files-for-arm64.patch
      - '0008-htmlParseComment-handle-abruptly-closed-comments.patch'
      - '0009-allow-wildcard-namespaces.patch'
      libxml2_path: "/Users/jason/.rbenv/versions/2.7.6/lib/ruby/gems/2.7.0/gems/nokogiri-1.13.6-arm64-darwin/ext/nokogiri"
      memory_management: ruby
      iconv_enabled: true
      compiled: 2.9.14
      loaded: 2.9.14
    libxslt:
      source: packaged
      precompiled: true
      patches:
      - 0001-update-automake-files-for-arm64.patch
      datetime_enabled: true
      compiled: 1.1.35
      loaded: 1.1.35
    other_libraries:
      zlib: 1.2.12
      libiconv: '1.16'
      libgumbo: 1.0.0-nokogiri

Additional context

I'm currently using the HTML4 parser in some code and relying on the url parameter to resolve relative URLs in parsed markup. I discovered the above documented inconsistency while updating that code to use the newer HTML5 parser.

Thanks for reading and I appreciate your help!

@jgarber623 jgarber623 added the state/needs-triage Inbox for non-installation-related bug reports or help requests label Jun 23, 2022
@flavorjones
Copy link
Member

@jgarber623 Thanks for reporting this! I'll take a look as soon as I can, but after a brief glance it seems like something we can easily fix.

@flavorjones flavorjones removed the state/needs-triage Inbox for non-installation-related bug reports or help requests label Jun 23, 2022
flavorjones added a commit that referenced this issue Jun 23, 2022
This matches the behavior of HTML4::Document and
XML::Document. Previously this method always returned `nil`.

Fixes #2583
@flavorjones
Copy link
Member

PR is at #2584

@jgarber623
Copy link
Author

@flavorjones Thanks for the quick fix! Tested and reviewed your changes in #2584 and everything looks good.

jgarber623 added a commit to jgarber623/micromicro that referenced this issue Mar 14, 2023
MicroMicro can use this now that Nokogiri 1.14 is out, which includes:

sparklemotion/nokogiri#2583
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants