100% Cpu usage on parse html with a lot of inline css #2020
Comments
Hi @rusikf, thanks for reporting, and sorry you're having trouble. I'll try to take a look shortly.
OK, I got some time this morning to look into this. The summary: you're describing performance characteristics of libxml2 (the underlying parsing library used by Nokogiri), and there's nothing we can easily do to change this behavior.
I've posted a gist with all the code, scripts, and profiling output so these results can be reproduced: https://gist.github.com/flavorjones/fd27b0f62dd08812d830b82fbe5477f0
First, the baseline: running a simple Ruby script that uses Nokogiri to parse the example document:
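A baseline of this kind can be sketched as follows. This is not the script from the gist: the helper `build_css_heavy_html`, the rule count, and the generated markup are hypothetical stand-ins for the reporter's document, and the timing run is skipped if the nokogiri gem is not installed.

```ruby
require "benchmark"

# Load Nokogiri if it is installed; the rest of the sketch still runs without it.
begin
  require "nokogiri"
  NOKOGIRI_AVAILABLE = true
rescue LoadError
  NOKOGIRI_AVAILABLE = false
end

# Hypothetical stand-in for the reporter's document: one large <style> block
# followed by a very small body, so that almost all of the input is CSS.
def build_css_heavy_html(rule_count)
  css = Array.new(rule_count) { |i| ".rule-#{i} { color: #000; margin: #{i}px; }" }.join("\n")
  "<html><head><style>\n#{css}\n</style></head><body><p>hello</p></body></html>"
end

html = build_css_heavy_html(2_000)

if NOKOGIRI_AVAILABLE
  # Wall-clock time for a single parse of the whole document.
  seconds = Benchmark.realtime { Nokogiri::HTML(html) }
  puts format("parsed %d bytes in %.3f s", html.bytesize, seconds)
else
  warn "nokogiri gem not installed; skipping the timing run"
end
```

Raising the rule count, or reading the real document from disk instead of generating one, gives a rough way to compare timings between a vendored and a system libxml2 build.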
Next, reproducing this result in C calling libxml2 directly (that is, no Ruby or Nokogiri involved):
Great! This shows that Ruby/Nokogiri isn't significantly slower than calling libxml2 from C directly. Let's see what it's doing by profiling the C executable with gperftools:
What's interesting, however, is that the above is with the vendored libxml2 v2.9.10; running the same code against libxml2 v2.9.4 (my local system's distro version) takes about 1/3 of the time:
And the call graph is different:
OK, placeholder for further investigation: the ~3x slowdown appears to be correlated with the vendored libraries, not with the version of libxml2.
OK, cool! I deleted the style tags with a regexp as a quick fix, and it works without high CPU usage.
OK, so this problem is exacerbated by the one described in new issue #2022: compiler optimization is not turned on when building the vendored libraries. Closing this for now, since you have a workaround. Another workaround would be to use your distro's system libraries (see the nokogiri.org installation docs at https://nokogiri.org/tutorials/installing_nokogiri.html). Please watch #2022 for the permanent fix.
Describe the bug
Hi, if I use Nokogiri to parse a big HTML document where 90% of the content is inline CSS, it causes 100% CPU usage.
To Reproduce
Expected behavior
CPU usage should not reach 100%.
Environment
# Nokogiri (1.10.9)
---
warnings: []
nokogiri: 1.10.9
ruby:
  version: 2.6.4
  platform: x86_64-linux
  description: ruby 2.6.4p104 (2019-08-28 revision 67798) [x86_64-linux]
  engine: ruby
libxml:
  binding: extension
  source: packaged
  libxml2_path: "/home/rusikf/.rvm/gems/ruby-2.6.4/gems/nokogiri-1.10.9/ports/x86_64-pc-linux-gnu/libxml2/2.9.10"
  libxslt_path: "/home/rusikf/.rvm/gems/ruby-2.6.4/gems/nokogiri-1.10.9/ports/x86_64-pc-linux-gnu/libxslt/1.1.34"
  libxml2_patches:
  - 0001-Revert-Do-not-URI-escape-in-server-side-includes.patch
  - 0002-Remove-script-macro-support.patch
  - 0003-Update-entities-to-remove-handling-of-ssi.patch
  - 0004-libxml2.la-is-in-top_builddir.patch
  - 0005-Fix-infinite-loop-in-xmlStringLenDecodeEntities.patch
  libxslt_patches: []
  compiled: 2.9.10
  loaded: 2.9.10
This output will tell us what version of Ruby you're using, how you installed nokogiri, what versions of the underlying libraries you're using, and what operating system you're using.
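Since the thread turns on whether the vendored libxml2 is in use, the relevant fields of such a report can be pulled out programmatically. The following is a minimal sketch that assumes the report is valid YAML, as the output above appears to be; `libxml_summary` is a hypothetical helper, and the embedded report is an abbreviated copy containing only the keys it reads.

```ruby
require "yaml"

# Abbreviated copy of the report above; only the keys inspected below are kept.
REPORT = <<~REPORT_TEXT
  # Nokogiri (1.10.9)
  ---
  nokogiri: 1.10.9
  libxml:
    binding: extension
    source: packaged
    compiled: 2.9.10
    loaded: 2.9.10
REPORT_TEXT

# Hypothetical helper: summarize which libxml2 build a report describes.
# "packaged" is the source value shown in the report for the vendored library.
def libxml_summary(report_text)
  libxml = YAML.safe_load(report_text).fetch("libxml")
  {
    vendored: libxml["source"] == "packaged",
    compiled: libxml["compiled"],
    loaded: libxml["loaded"]
  }
end

p libxml_summary(REPORT)
```

A report from an installation built against the distro's libraries would be expected to show a different `source` value, which is a quick way to confirm that the system-library workaround took effect.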
Additional context
As a workaround, I remove the inline CSS from the HTML before parsing:
html.gsub!(/<style((.|\n|\r)*?)<\/style>/, '')
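The same workaround can be written a little more defensively. The sketch below is one possible variant, not the reporter's code: `STYLE_BLOCK` and `strip_style_blocks` are hypothetical names. The `m` flag lets `.` match newlines, so the `(.|\n|\r)` alternation is not needed, and the `i` flag also catches `<STYLE>`. Like any regexp over HTML, it is a stopgap and will misfire on unusual input, such as a `</style>` string inside a comment.

```ruby
# Case-insensitive, multiline match of a whole <style>...</style> element.
STYLE_BLOCK = /<style\b[^>]*>.*?<\/style\s*>/mi

# Returns a copy of +html+ with every <style> element removed.
# Non-destructive gsub is used so the return value is always a String
# (gsub! returns nil when nothing was replaced).
def strip_style_blocks(html)
  html.gsub(STYLE_BLOCK, "")
end

html = "<html><head><STYLE type=\"text/css\">\np { color: red; }\n</style></head><body><p>hi</p></body></html>"
puts strip_style_blocks(html)
```

The stripped string can then be handed to the parser in place of the original document.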