memory corruption from meta tags claiming a charset of ISO-8859-1 #55

Closed
jmhodges opened this issue May 16, 2009 · 13 comments

Comments

@jmhodges
Contributor

Here's an example. google-try contains nothing but the meta tag as seen in the read.

irb(main):001:0> j = Nokogiri::HTML.parse('<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">')
=> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
irb(main):002:0> l=nil; j.traverse{|e| l = e.to_s if e.name == "meta" }
=> nil
irb(main):003:0> l
=> "<\003 ction_view/template_handler.rb=\"\" ction_view/template_handler.rb=\"\"></\003>"

Sometimes this segfaults. Sometimes it works fine. Hunting.

Oh, and if you remove the 'charset=ISO-8859-1', it does not happen ever.

And this is OS X, libxml2 2.7.3 (2.7.3_0 in MacPorts).
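
Because the failure is intermittent, a small string check can help tell a corrupted serialization from a healthy one. The following is only a sketch and not code from this thread: `corrupted_meta?` is a hypothetical helper name, the two heuristics (stray control bytes, wrong tag name) are assumptions based on the garbled output above, and it cannot detect the segfault case at all.

```ruby
# Hypothetical helper (not part of Nokogiri or this thread): flags a
# serialized <meta> tag that looks like the corrupted output shown above.
def corrupted_meta?(serialized)
  bytes = serialized.b # binary copy, so invalid encodings cannot raise

  # Control bytes other than tab, newline, and CR should not appear in markup.
  has_control_bytes = bytes.match?(/[\x00-\x08\x0B\x0C\x0E-\x1F]/)

  # A healthy serialization starts with the tag name that was asked for.
  wrong_tag = !bytes.lstrip.downcase.start_with?("<meta")

  has_control_bytes || wrong_tag
end

good = '<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">'
bad  = "<\003 ction_view/template_handler.rb=\"\" ction_view/template_handler.rb=\"\"></\003>"

puts corrupted_meta?(good)
puts corrupted_meta?(bad)
```

In an irb session with Nokogiri loaded, the same check could be applied to the string returned by `to_s` on the meta node.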

@flavorjones
Member

jmhodges - what version of Nokogiri are you using?

With 1.2.3 I cannot reproduce. On 1.2.4 I get an encoding error:

encoding error : output conversion failed due to conv error, bytes 0xE7 0xB7 0x10 0x3D
I/O error : encoder error

@jmhodges
Contributor Author

Yeah, 1.2.4 and 1.2.3 for me. Here's a gist of an irb session that didn't segfault immediately with nokogiri 1.2.3 and here's one that did with nokogiri 1.2.4.

As you can see, I only get the error you did when the code does not segfault in 1.2.3. What is the output of meta_tag.to_s (l.to_s in those posts) on your machine?

I did see similar errors in a larger file that had some entities seemingly translated to utf-8 while the file was being parsed (I believe! Not sure!) as ISO-8859-1.
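
Since the same snippet segfaults in some sessions and not in others, one way to gauge how often it fails is to run it repeatedly, each time in a fresh process, and tally how the runs ended. This is a rough sketch using only Ruby's standard library; `tally_runs` is a hypothetical helper, and the command below is a harmless stand-in rather than the actual reproduction script.

```ruby
require "open3"
require "rbconfig"

# Hypothetical helper: run a command several times, each in its own process,
# and count how each run ended (clean exit, non-zero exit, or killed by signal).
def tally_runs(cmd, runs: 10)
  counts = Hash.new(0)
  runs.times do
    _output, status = Open3.capture2e(*cmd)
    outcome =
      if status.signaled?
        "signal #{status.termsig}" # e.g. a segfault shows up as SIGSEGV
      elsif status.success?
        "ok"
      else
        "exit #{status.exitstatus}"
      end
    counts[outcome] += 1
  end
  counts
end

# Stand-in command. To exercise the bug, this would instead run a script
# containing the Nokogiri snippet from the first comment.
p tally_runs([RbConfig.ruby, "-e", "exit 0"], runs: 3)
```

Running each attempt in a child process means a crash in one run does not take down the loop that is counting them.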

@jmhodges
Contributor Author

Sorry, added a simpler bit of code that expresses it. Using #at instead of #traverse.

@flavorjones
Member

whoop, I've managed to get valgrind to complain. consider it reproduced, and I'm on the case.

@jmhodges
Contributor Author

Cool, I wasn't able to get the OS X beta valgrind to talk to me much about it, but I haven't used it before. If you want, I can run it again and post something up (assuming you're not already on OS X).

@flavorjones
Member

This is a libxml2 bug. I just wrote a C program that reproduces it, and will be submitting it to libxml2's tracker tonight.

@flavorjones
Member

C program to reproduce is at http://gist.github.com/112897

@flavorjones
Member

Bug has been filed at http://bugzilla.gnome.org/show_bug.cgi?id=582913

@jmhodges
Contributor Author

mdalessio, I love you a little bit. Never change. <3

@jmhodges
Contributor Author

Also, sorry for not hunting this down myself. I got lost in a thicket of build problems last night.

@flavorjones
Member

awwww, now you made me blush.

@flavorjones
Member

closing this ticket, since it's now in the good hands of the libxml2 team.

@flavorjones
Member

F yer I: Daniel V (maintainer of libxml2) has updated the libxml2 bugzilla ticket. Here's his comment

Okay, I have fixed htmlSetMetaEncoding() to be nicer,
not destroy existing meta encoding elements just update the
property and only if needed, which is never the case if you
just output part of the current document without asking for encoding
changes.
Fixed in git (8d7c1b7ab296ea2e8c8d18d7b8f3d24e0963f8ff)

 thanks for the report,

Daniel

You might want to check out that version and verify that it addresses your particular pain. Have a nice day.

flavorjones pushed a commit that referenced this issue Apr 7, 2021
Neither libxml2 nor Nokogiri contain an API for setting the line numbers
for a node. When the libxml2 headers are available, the line numbers can
be set directly in the node structure.

Closes: #53
This issue was closed.