Huge input lookup when parsing (SAX) #2028
@juskoljo Thanks for reporting, and sorry you're having a problem. I'll try to take a look later today. |
Hi, any updates on this? 👍 |
@juskoljo Hi! Apologies for the slow response, it's been hard for me to find time to work on OSS recently. I'm having difficulty reproducing what you're seeing, here's what I get using the script you provided:
This is because the generated XML has multiple XML declarations. I took a few minutes to rewrite the script to avoid multiple decls (as well as multiple roots) and still can't reproduce. Here's what I did:
#! /usr/bin/env ruby
# encoding: utf-8
require "nokogiri"
require "stringio"
start_xml = <<-XML
<?xml version="1.0" encoding="UTF-8"?>
<root>
XML
end_xml = <<-XML
</root>
XML
template = <<-XML
<node>
<something>value</something>
<more>value</more>
<boom>%s</boom>
</node>
XML
class Handler < Nokogiri::XML::SAX::Document
def error(message)
raise message
end
end
parser = Nokogiri::XML::SAX::Parser.new(Handler.new)
# OK
xml = start_xml + (template % "#{"X" * 77}\n" * 300_000) + end_xml
parser.parse(xml)
puts "ok"
# OK
xml = start_xml + (template % "#{"X" * 77}\r\n" * 300_000) + end_xml
parser.parse(xml)
puts "ok"
# OK
xml = StringIO.new(start_xml + (template % "#{"X" * 77}\n" * 400_000) + end_xml)
parser.parse(xml)
puts "ok"
# internal error: Huge input lookup (RuntimeError)
xml = StringIO.new(start_xml + (template % "#{"X" * 77}\r\n" * 300_000) + end_xml)
parser.parse(xml)
puts "ok"
which prints out four "ok"s. Can you help me reproduce this? Or help me discover what I'm doing differently from you (or what's different about my system)? My nokogiri is:
|
Hi @flavorjones No worries! :) You are right, the first script generated something that was not my intention... Note to self: never edit a script online while posting ;). The following script should reproduce the issue. It simulates a scenario where there is a big XML node (a base64-encoded file) with CR and/or LF after every 77 chars.
Output:
My nokogiri seems to be the same as yours except:
|
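For readers following along: the scenario described in the comment above (one large base64 node with a line break after every 77 characters) could be generated roughly as follows. This is a minimal sketch and not the script that was originally posted; `build_wrapped_xml` and its parameters are hypothetical names, and the element names are borrowed from the template earlier in the thread.

```ruby
require "stringio"

# Hypothetical helper (not from the thread): build an XML document whose
# <boom> node holds base64 text with `separator` after every `width` chars.
def build_wrapped_xml(payload_bytes, width: 77, separator: "\r\n")
  # pack("m0") is core Ruby's base64 encoding without embedded newlines.
  encoded = [("\0" * payload_bytes)].pack("m0")
  # Cut the encoded text into runs of at most `width` characters.
  wrapped = encoded.scan(/.{1,#{width}}/).join(separator)
  %(<?xml version="1.0" encoding="UTF-8"?>\n<root>\n<node>\n<boom>#{wrapped}</boom>\n</node>\n</root>\n)
end

# A small payload keeps the sketch fast; the report involved roughly 20 MB.
xml = build_wrapped_xml(1_000)
io = StringIO.new(xml)
```

The resulting `io` can then be handed to `parser.parse(io)` with the `Nokogiri::XML::SAX::Parser` set up as in the script above. Whether the error fires will depend on the payload size and on where the line breaks fall.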
Hi Mike, any updates on this? The snippet in my previous post should fire the exception 👍 |
@juskoljo Thanks for your patience, and apologies for not replying sooner - your reply on May 31 fell through the cracks of my inbox (and I'm still struggling to spend time on OSS). I'll take a look today, I have some time blocked out. |
OK, I've explored this a bit and found something interesting. This script:
emits:
🤔 |
Here's the call stack when the error is raised:
|
I've narrowed this down to what I think is a libxml2 edge case in parsing elements in xmlParseCharDataComplex. Will spend some more time on it this weekend. |
OK, I've found the problem and I think it's a bug in libxml2. I'll write a brief description here, but will submit a bug report upstream and will think about patching Nokogiri's vendored library in the meantime. In brief: two things are happening simultaneously within
If only one or the other of these happens, nobody notices:
The bug is related to SAX parsing: by default libxml2 will read the doc in chunks of size
You can actually see this in action by making sure the first line of the node's text is:
node = ("x" * 3894) + "\n" # to always trigger the bug
where the importance of 3894 is 4000-106, where 106 is the number of bytes occurring in the document before the node's text begins. Any boom node in this document that is longer than 10,000,000 characters will fail if its 3895th character is a newline. CRAZYTOWN.
Here's the patch that fixes this problem (note that it looks like the same bug exists in
diff --git a/parser.c b/parser.c
index f779eb6..ed78d94 100644
--- a/parser.c
+++ b/parser.c
@@ -4506,7 +4506,7 @@ get_more:
if (ctxt->instate == XML_PARSER_EOF)
return;
in = ctxt->input->cur;
- } while (((*in >= 0x20) && (*in <= 0x7F)) || (*in == 0x09));
+ } while (((*in >= 0x20) && (*in <= 0x7F)) || (*in == 0xD) || (*in == 0xA) || (*in == 0x9));
nbchar = 0;
}
ctxt->input->line = line;
@@ -4987,7 +4987,7 @@ get_more:
ctxt->input->col++;
goto get_more;
}
- } while (((*in >= 0x20) && (*in <= 0x7F)) || (*in == 0x09));
+ } while (((*in >= 0x20) && (*in <= 0x7F)) || (*in == 0xD) || (*in == 0xA) || (*in == 0x9));
xmlParseCommentComplex(ctxt, buf, len, size);
ctxt->instate = state;
return; |
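The offset arithmetic from the comment above (3894 = 4000 - 106) can be sketched like this. It is only an illustration of the condition as described, not libxml2's actual logic. The chunk size of 4000 is an assumption inferred from the "4000-106" figure, and both helper names are hypothetical.

```ruby
# Assumption: 4000 is the chunk size implied by "4000-106" in the comment.
CHUNK_SIZE = 4000
# From the comment: node text longer than 10,000,000 characters fails.
HUGE_TEXT_THRESHOLD = 10_000_000

# Hypothetical helper: a first line of node text whose newline is the first
# byte after the initial chunk, given `prefix_bytes` bytes of document
# before the node's text begins.
def triggering_first_line(prefix_bytes, chunk_size: CHUNK_SIZE)
  ("x" * (chunk_size - prefix_bytes)) + "\n"
end

# Hypothetical check mirroring the description: the text is "huge" and the
# byte at offset (chunk_size - prefix_bytes) is a newline.
def matches_described_condition?(text, prefix_bytes, chunk_size: CHUNK_SIZE)
  newline_index = chunk_size - prefix_bytes
  text.bytesize > HUGE_TEXT_THRESHOLD && text.getbyte(newline_index) == 0x0A
end

first_line = triggering_first_line(106)
puts first_line.bytesize # 3895: 3894 "x" bytes plus the newline
```

With 106 bytes of prefix, `triggering_first_line(106)` equals `("x" * 3894) + "\n"`, so the newline is the 3895th character of the node's text, matching the figure quoted above.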
I've opened an issue upstream (https://gitlab.gnome.org/GNOME/libxml2/-/issues/192) and have proposed a change (https://gitlab.gnome.org/GNOME/libxml2/-/merge_requests/86). |
Also, apparently we're fixing two bugs with this one! Good work. https://gitlab.gnome.org/nwellnhof/libxml2/-/commit/99bda1e1ee77783e43c9059af00cd326deee3372 |
@juskoljo I'm going to close this issue, since upstream has acknowledged the problem and seems likely to fix it. Please open a new issue if you feel this is urgent enough to warrant applying a patch within Nokogiri's vendored libxml2. |
Hi @flavorjones Super! Thank you very much for the update and patch! While waiting for the libxml2 update, I think I'll manage with the patch you provided! Thanks again :) |
Hi
Apologies if this has been submitted already. I have a problem parsing a big XML file that has a ~20 MB base64-encoded file attached. It seems that parsing from an IO whose content uses "\r\n" as line separators causes a "Huge input lookup" error.
Reproduced this "successfully" with 1.10.9 and older versions.
Thanks,
Jussi
Example:
Environment