100% Cpu usage on parse html with a lot of inline css #2020

Closed
rusikf opened this issue Apr 6, 2020 · 7 comments
rusikf commented Apr 6, 2020

Describe the bug
Hi, when I use nokogiri to parse a large HTML document in which about 90% of the content is inline CSS, the parse causes 100% CPU usage.

To Reproduce

#! /usr/bin/env ruby

require 'nokogiri'
require 'net/http'

url = URI('https://baliaquaponics.com')
html = Net::HTTP.get_response(url).body

puts "html size", html.size
doc = Nokogiri::HTML.parse(html)

puts 'OK'

Expected behavior

Parsing should not drive CPU usage to 100%.

Environment
# Nokogiri (1.10.9)
    ---
    warnings: []
    nokogiri: 1.10.9
    ruby:
      version: 2.6.4
      platform: x86_64-linux
      description: ruby 2.6.4p104 (2019-08-28 revision 67798) [x86_64-linux]
      engine: ruby
    libxml:
      binding: extension
      source: packaged
      libxml2_path: "/home/rusikf/.rvm/gems/ruby-2.6.4/gems/nokogiri-1.10.9/ports/x86_64-pc-linux-gnu/libxml2/2.9.10"
      libxslt_path: "/home/rusikf/.rvm/gems/ruby-2.6.4/gems/nokogiri-1.10.9/ports/x86_64-pc-linux-gnu/libxslt/1.1.34"
      libxml2_patches:
      - 0001-Revert-Do-not-URI-escape-in-server-side-includes.patch
      - 0002-Remove-script-macro-support.patch
      - 0003-Update-entities-to-remove-handling-of-ssi.patch
      - 0004-libxml2.la-is-in-top_builddir.patch
      - 0005-Fix-infinite-loop-in-xmlStringLenDecodeEntities.patch
      libxslt_patches: []
      compiled: 2.9.10
      loaded: 2.9.10

This output will tell us what version of Ruby you're using, how you installed nokogiri, what versions of the underlying libraries you're using, and what operating system you're using.

Additional context
As a workaround, removing the inline CSS from the HTML before parsing avoids the problem:
html.gsub!(/<style.*?<\/style>/m, '')
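To check whether a document is in the situation described here (mostly inline CSS), a small helper can estimate how much of the HTML sits inside style blocks. This is only a sketch: the names `STYLE_BLOCK`, `style_fraction`, and `strip_style_blocks` are hypothetical, and a regex is a rough approximation that can be fooled by the text `</style>` appearing inside comments or scripts.

```ruby
# Hypothetical helpers, not part of Nokogiri.
# Matches <style ...>...</style> across newlines, case-insensitively.
STYLE_BLOCK = /<style\b.*?<\/style>/mi

# Fraction (0.0..1.0) of the document's characters inside style blocks.
def style_fraction(html)
  return 0.0 if html.empty?

  style_chars = html.scan(STYLE_BLOCK).sum(&:length)
  style_chars.to_f / html.length
end

# Non-mutating variant of the gsub! workaround above.
def strip_style_blocks(html)
  html.gsub(STYLE_BLOCK, '')
end
```

A caller could, for example, strip the style blocks only when `style_fraction(html)` exceeds some threshold that suits the application.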

@flavorjones
Member

Hi @rusikf, thanks for reporting, and sorry you're having trouble. I'll try to take a look shortly.

@flavorjones
Member

OK, I got some time this morning to look into this.

The summary is that you're describing the performance characteristics of libxml2 (the underlying parsing library used by Nokogiri), and there's nothing we can easily do to change this behavior.

I've posted a gist with all the code/scripts/profiling so these results can be reproduced: https://gist.github.com/flavorjones/fd27b0f62dd08812d830b82fbe5477f0

First, the baseline: running a simple ruby script using Nokogiri to parse the example document:

$ ruby ./foo.rb
       user     system      total        real
  3.725381   0.003732   3.729113 (  3.729237)
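The actual foo.rb is in the gist linked above and is not reproduced here. The column layout of the output appears to match what Ruby's standard Benchmark module prints, so a timing of this kind could be sketched as follows. The helper name `time_parse` and the `runs` parameter are assumptions for illustration, not taken from the gist.

```ruby
require 'benchmark'

# Hypothetical helper (not from the gist): run the given parser block over
# `html` several times and return the Benchmark::Tms of the fastest run.
def time_parse(html, runs: 3)
  Array.new(runs) { Benchmark.measure { yield html } }.min_by(&:real)
end

# Usage with Nokogiri, assuming the gem is installed and `html` holds the page:
#   require 'nokogiri'
#   tms = time_parse(html) { |h| Nokogiri::HTML.parse(h) }
#   puts Benchmark::CAPTION, tms.format
```

Taking the fastest of a few runs reduces noise from other processes; a single `Benchmark.measure` call is enough for a rough comparison.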

Next, reproducing this result in C calling libxml2 directly (that is, no Ruby or Nokogiri involved):

$ time ./foo
3808 ms

real	0m3.811s
user	0m3.802s
sys	0m0.008s

Great! This shows that Ruby/Nokogiri isn't significantly slower than calling libxml2 from C directly. Let's see what it's doing by using gperftools against the C executable:

[image: gperftools call graph, libxml2 v2.9.10]

@flavorjones
Member

What's interesting, however, is that the above is with the vendored libxml2 v2.9.10. Running the same code against libxml2 v2.9.4 (my local system's distro version), it runs in about 1/3 of this time:

$ time ./foo
1010 ms

real	0m1.015s
user	0m1.010s
sys	0m0.004s

And the call graph is different:

[image: gperftools call graph, libxml2 v2.9.4]

@flavorjones
Member

OK, placeholder for further investigation: the ~3x slowdown appears to be correlated with the vendored libraries, not with the version of libxml2.

@rusikf
Author

rusikf commented Apr 17, 2020 via email

@flavorjones
Member

OK, so this problem is exacerbated by the one described in new issue #2022: compiler optimization is not turned on when building the vendored libraries.

Closing this for now, since you have a workaround. Another workaround would be to use your distro's system libraries (see nokogiri.org installation docs at https://nokogiri.org/tutorials/installing_nokogiri.html).

Please watch #2022 for the permanent fix.

@ilyazub
Contributor

ilyazub commented Oct 12, 2020

Another workaround is to set the CFLAGS="-O2" environment variable while installing nokogiri. This has the same effect as the fix tracked in #2022, and can be used until that fix is done.

gem uninstall nokogiri
CFLAGS="-O2" bundle install

It works because ext/nokogiri/extconf.rb passes CFLAGS through to the build.
