lxml trees parsed by html5lib can not be used with lxml.clean #102

Closed
tahajahangir opened this issue Jul 30, 2013 · 8 comments


@tahajahangir
Contributor

This simple script fails with html5lib.

import html5lib
import lxml.html.clean

parser = html5lib.HTMLParser(tree=html5lib.getTreeBuilder("lxml"), namespaceHTMLElements=False)
# tree = lxml.html.document_fromstring(html)
tree = parser.parse("<html><body><!-- a comment --></body></html>")

cleaner = lxml.html.clean.Cleaner()
cleaner(tree)

The problem is that lxml.html.document_fromstring returns an element of type lxml.html.HtmlElement, whereas HTMLParser.parse returns an object of type lxml.etree._ElementTree.

@gsnedders
Member

We unfortunately cannot return an lxml HTML tree unless we entirely break namespace support, which seems undesirable in the extreme, so I'm closing this as wontfix.

@tahajahangir
Contributor Author

So, what is namespaceHTMLElements=False for?

We could set namespaceHTMLElements=False and use an lxml HTML tree.

@gsnedders
Member

namespaceHTMLElements=False only changes which namespace HTML elements are put in: e.g., the html element is put in the void namespace (that is, no namespace) instead of the XHTML namespace. However, HTML also supports SVG and MathML elements, which are put in their respective namespaces regardless of namespaceHTMLElements. (If they weren't, the option would introduce ambiguity: how else would you distinguish the script element in the HTML namespace from the script element in the SVG namespace?)
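To illustrate the ambiguity described above, here is a minimal sketch that uses only the standard library's xml.etree.ElementTree rather than html5lib or lxml. The tree is built by hand, and the helper name `local_name` is made up for this example; the point is only that two `script` elements stay distinguishable while they carry namespaces and become identical once the namespaces are dropped.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
SVG_NS = "http://www.w3.org/2000/svg"

# Hand-built tree for illustration: an HTML <script> next to an SVG <script>.
html = ET.Element(f"{{{XHTML_NS}}}html")
body = ET.SubElement(html, f"{{{XHTML_NS}}}body")
ET.SubElement(body, f"{{{XHTML_NS}}}script")
svg = ET.SubElement(body, f"{{{SVG_NS}}}svg")
ET.SubElement(svg, f"{{{SVG_NS}}}script")


def local_name(tag):
    """Strip the '{namespace}' prefix from a qualified tag name (hypothetical helper)."""
    return tag.rsplit("}", 1)[-1]


# Qualified names differ, so the two script elements can be told apart.
qualified = [el.tag for el in html.iter() if local_name(el.tag) == "script"]

# Without namespaces both collapse to the same name.
stripped = [local_name(tag) for tag in qualified]

print(qualified)
print(stripped)
```

With the namespaces in place, code that walks the tree can treat the two elements differently; with only local names, it cannot.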

What do you actually want to use lxml.html.clean.Cleaner for? Sanitizing/purifying/whatever-you-want-to-call-it? If so, you may want to try either https://github.com/jsocol/bleach or html5lib's own sanitizer (though note that its API will probably change prior to 1.0 due to #72).

Regardless, on the face of it, lxml.html.clean.Cleaner should support XML trees (as it is, it doesn't work for XHTML parsed by lxml either!). In general terms, I don't think it's worthwhile to support.

(Looking at the implementation of lxml.html.clean.Cleaner, it looks like it is vaguely meant to work with XML/XHTML trees; possibly worth asking on the lxml mailing list?)

@davirtavares

Hi, I'm very interested in html5lib being able to use lxml's HtmlMixin. @gsnedders, can you please explain why it would break namespace support (at the code level)?

@requiredfield

> Hi, I'm very interested in html5lib being able to use lxml's HtmlMixin

Same here. The functionality it provides like make_links_absolute and rewrite_links is essential for what I'm doing.

@gsnedders, do you have any suggestions? @davirtavares, did you ever figure out anything?

@davirtavares

Hey @requiredfield, in the end I wrote my own versions of these methods as standalone functions, based on lxml's code: https://github.com/lxml/lxml/blob/master/src/lxml/html/__init__.py#L455. Unfortunately, I had no time to fix it in a more elegant way :/
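For readers wondering what such a standalone function might look like, here is a rough sketch. It is not the code referred to above and not lxml's implementation: it uses only the standard library, the attribute list `LINK_ATTRS` is an assumption that is much shorter than what lxml handles, and it ignores links inside CSS and namespaced attributes. It only relies on the `iter()`, `get()` and `set()` element methods, so it may carry over to other ElementTree-style trees, but that is untested here.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

# Attributes treated as links in this sketch (an assumption; lxml's list is longer).
LINK_ATTRS = ("href", "src", "action")


def make_links_absolute(root, base_url):
    """Rewrite link attributes in place so that they hold absolute URLs."""
    for el in root.iter():
        for attr in LINK_ATTRS:
            value = el.get(attr)
            if value is not None:
                el.set(attr, urljoin(base_url, value.strip()))
    return root


doc = ET.fromstring(
    '<html><body><a href="/docs/">Docs</a><img src="logo.png"/></body></html>'
)
make_links_absolute(doc, "https://example.com/site/index.html")

links = [
    el.get("href") or el.get("src")
    for el in doc.iter()
    if el.get("href") or el.get("src")
]
print(links)
```

Keeping this as a free function rather than a method avoids any dependency on the element class, which is the constraint discussed in this thread.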

@cjerdonek

> We unfortunately cannot return an lxml HTML tree unless we entirely break namespace support

Responding to this (and independent of this issue as it was originally filed), would it make sense for lxml.html to add support for namespaces?

@gsnedders
Member

> Responding to this (and independent of this issue as it was originally filed), would it make sense for lxml.html to add support for namespaces?

Yes, given HTML has been defined, and implemented in browsers, to put things in namespaces for almost ten years now. That might need changes in libxml2 too, though.
