A Ruby gem extending Nokogiri with several useful HTML-centric features.
- Resolves all relative URLs in a Nokogiri-parsed HTML document.
- Adds helpers for getting and setting a document's <base> element's href attribute.
- Supports Ruby 2.7 and newer.
Before installing and using nokogiri-html-ext, you'll want to have Ruby 2.7 (or newer) installed. Using a Ruby version management tool like rbenv, chruby, or rvm is recommended.
nokogiri-html-ext is developed using Ruby 2.7.8 and is tested against additional Ruby versions using GitHub Actions.
Add nokogiri-html-ext to your project's Gemfile and run bundle install:
source "https://rubygems.org"
gem "nokogiri-html-ext"
nokogiri-html-ext provides two helper methods for getting and setting a document's <base> element's href attribute. The first, base_href, retrieves the element's href attribute value if it exists.
require "nokogiri/html-ext"
doc = Nokogiri::HTML(%(<html><body>Hello, world!</body></html>))
doc.base_href
#=> nil
doc = Nokogiri::HTML(%(<html><head><base target="_top"><body>Hello, world!</body></html>))
doc.base_href
#=> nil
doc = Nokogiri::HTML(%(<html><head><base href="/foo"><body>Hello, world!</body></html>))
doc.base_href
#=> "/foo"
The base_href= method sets the href attribute of the document's <base> element, adding the element to the document if it doesn't already exist.
require "nokogiri/html-ext"
doc = Nokogiri::HTML(%(<html><body>Hello, world!</body></html>))
doc.base_href = "/foo"
#=> "/foo"
doc.at_css("base").to_s
#=> "<base href=\"/foo\">"
doc = Nokogiri::HTML(%(<html><head><base href="/foo"><body>Hello, world!</body></html>))
doc.base_href = "/bar"
#=> "/bar"
doc.at_css("base").to_s
#=> "<base href=\"/bar\">"
nokogiri-html-ext will resolve a document's relative URLs against a provided source URL. The source URL should be an absolute URL (e.g. https://jgarber.example) representing the location of the document being parsed. The source URL may be any String (or any Ruby object that responds to #to_s).
nokogiri-html-ext takes advantage of the Nokogiri::XML::Document.parse method's second positional argument to set the parsed document's URL. Nokogiri's source code is very complex, but in short: the Nokogiri::HTML method is an alias to the Nokogiri::HTML4 method, which eventually winds its way to the aforementioned Nokogiri::XML::Document.parse method. Phew. 🥵
URL resolution uses Ruby's built-in URL parsing and normalizing capabilities. Absolute URLs will remain unmodified.
Note: If the document's markup includes a <base> element whose href attribute is an absolute URL, that URL will take precedence when performing URL resolution.
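The resolution rules described above can be illustrated with Ruby's standard URI library alone. This is a minimal sketch, not nokogiri-html-ext's implementation: the resolve_against helper and the cdn.example and other.example hosts are hypothetical names used only for illustration.

```ruby
require "uri"

# Hypothetical helper (not part of nokogiri-html-ext's API).
# Resolves +url+ against +base+ using Ruby's standard URI library.
# Both arguments are converted with #to_s, so any object responding
# to #to_s is accepted.
def resolve_against(base, url)
  URI.join(base.to_s, url.to_s).to_s
end

document_url = "https://jgarber.example/articles/"

# Relative URLs are resolved against the base.
resolve_against(document_url, "images/foo.png")
#=> "https://jgarber.example/articles/images/foo.png"

resolve_against(document_url, "/home")
#=> "https://jgarber.example/home"

# Absolute URLs pass through unmodified.
resolve_against(document_url, "https://other.example/page")
#=> "https://other.example/page"

# An absolute <base href> value stands in for the document URL as the base.
resolve_against("https://cdn.example/assets/", "images/foo.png")
#=> "https://cdn.example/assets/images/foo.png"
```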
An abbreviated example:
require "nokogiri/html-ext"
markup = <<-HTML
<html>
<body>
<a href="/home">Home</a>
<img src="/foo.png" srcset="../bar.png 720w">
</body>
</html>
HTML
doc = Nokogiri::HTML(markup, "https://jgarber.example")
doc.url
#=> "https://jgarber.example"
doc.base_href
#=> nil
doc.base_href = "/foo/bar/biz"
#=> "/foo/bar/biz"
doc.resolve_relative_urls!
doc.at_css("base")["href"]
#=> "https://jgarber.example/foo/bar/biz"
doc.at_css("a")["href"]
#=> "https://jgarber.example/home"
doc.at_css("img").to_s
#=> "<img src=\"https://jgarber.example/foo.png\" srcset=\"https://jgarber.example/foo/bar.png 720w\">"
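The srcset value in the example above holds a URL followed by a width descriptor, so each candidate's URL has to be resolved on its own. A rough sketch of that idea, using only Ruby's standard library, might look like the following. The resolve_srcset helper is hypothetical and is not nokogiri-html-ext's implementation. It assumes candidates are separated by a comma followed by whitespace, which is simpler than the full HTML srcset parsing rules.

```ruby
require "uri"

# Hypothetical helper (not part of nokogiri-html-ext's API).
# Simplifying assumption: candidates are separated by a comma followed by
# whitespace, and each candidate is a URL optionally followed by a
# descriptor such as "720w" or "2x".
def resolve_srcset(base, srcset)
  resolved = srcset.split(/,\s+/).map do |candidate|
    url, descriptor = candidate.strip.split(/\s+/, 2)

    # Resolve only the URL and keep the descriptor (if any) untouched.
    [URI.join(base.to_s, url).to_s, descriptor].compact.join(" ")
  end

  resolved.join(", ")
end

resolve_srcset("https://jgarber.example/foo/bar/biz", "../bar.png 720w, baz.png 2x")
#=> "https://jgarber.example/foo/bar.png 720w, https://jgarber.example/foo/bar/baz.png 2x"
```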
You may also resolve an arbitrary String representing a relative URL against the document's URL (or <base> element's href attribute value):
doc = Nokogiri::HTML(%(<html><base href="/foo/bar"></html>), "https://jgarber.example")
doc.resolve_relative_url("biz/baz")
#=> "https://jgarber.example/foo/biz/baz"
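The result above is consistent with a two-step resolution: the relative <base> href is first resolved against the document's URL, and the String is then resolved against that result. The sketch below reproduces this with Ruby's standard URI library. It is one plausible reading of the behavior shown, not the gem's actual code, and the resolution_base helper is a hypothetical name.

```ruby
require "uri"

# Hypothetical helper (not part of nokogiri-html-ext's API).
# Returns the URL that relative URLs would be resolved against: the
# document URL when there is no <base> href, otherwise the <base> href
# resolved against the document URL.
def resolution_base(document_url, base_href)
  return document_url.to_s if base_href.nil? || base_href.to_s.empty?

  URI.join(document_url.to_s, base_href.to_s).to_s
end

# Step 1: resolve the relative <base> href against the document URL.
base = resolution_base("https://jgarber.example", "/foo/bar")
#=> "https://jgarber.example/foo/bar"

# Step 2: resolve the relative URL against that base. The last path
# segment of the base ("bar") is replaced.
URI.join(base, "biz/baz").to_s
#=> "https://jgarber.example/foo/biz/baz"
```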
nokogiri-html-ext wouldn't exist without the Nokogiri project and its community.
nokogiri-html-ext is written and maintained by Jason Garber.
nokogiri-html-ext is freely available under the MIT License. Use it, learn from it, fork it, improve it, change it, tailor it to your needs.