
Sanitizing filter broken in 0.90 #72

Closed
gsnedders opened this issue Jun 21, 2013 · 9 comments · Fixed by #110

Comments

@gsnedders
Member

http://code.google.com/p/html5lib/issues/detail?id=162

Reported by [email protected], Oct 10, 2010

DESCRIPTION

Consider the following interaction with html5lib 0.90:

    >>> from html5lib import html5parser, serializer, treebuilders, treewalkers
    >>> p = html5parser.HTMLParser(tree = treebuilders.getTreeBuilder('dom'))
    >>> dom = p.parse("""<body onload="sucker()">""") 
    >>> s = serializer.htmlserializer.HTMLSerializer(sanitize = True)
    >>> ''.join(s.serialize(treewalkers.getTreeWalker('dom')(dom)))
    u'<body onload=sucker()>'

This is clearly incorrect: the onload attribute should have been removed by the sanitizer during the serialization.

ANALYSIS

The problem is that there are two sanitizers: a tokenizing sanitizer in html5lib.sanitizer, and a sanitizing filter in html5lib.filters.sanitizer. To avoid duplicating code, both sanitizers inherit from the class HTMLSanitizerMixin and both call that class's sanitize_token method.

Unfortunately, the format of tokens differs between tokenization and filtering. During tokenization, a token looks like this:

    >>> from html5lib import tokenizer
    >>> next(iter(tokenizer.HTMLTokenizer("""<body onload="sucker()">""")))
    {'selfClosing': False, 'data': [[u'onload', u'sucker()']], 'type': 3, 'name': u'body', 'selfClosingAcknowledged': False}

But during filtering, tokens look like this:

    >>> list(iter(treewalkers.getTreeWalker('dom')(dom)))[3]
    {'namespace': u'http://www.w3.org/1999/xhtml', 'type': 'StartTag', 'name': u'body', 'data': [(u'onload', u'sucker()')]}

When the sanitizing filter passes its token to the sanitize_token method of HTMLSanitizerMixin, nothing happens, because sanitize_token is expecting 'type' to be an integer.
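The failure mode can be illustrated without html5lib at all. This is a minimal sketch of the mechanism, not html5lib's actual code; the integer and string token types are taken from the transcripts above:

```python
# Sketch of the bug mechanism: a sanitizer keyed on integer token
# types silently ignores the string-typed tokens that tree walkers
# produce, so nothing gets sanitized.

START_TAG = 3  # integer type emitted by the tokenizer (see transcript above)

def sanitize_token(token):
    """Strip on* event-handler attributes, but only from tokens it recognises."""
    if token.get("type") == START_TAG:  # integer comparison
        token["data"] = [(k, v) for k, v in token["data"]
                         if not k.startswith("on")]
    return token  # walker tokens with type == 'StartTag' fall straight through

tokenizer_token = {"type": 3, "name": "body",
                   "data": [("onload", "sucker()")]}
walker_token = {"type": "StartTag", "name": "body",
                "data": [("onload", "sucker()")]}

sanitize_token(tokenizer_token)  # attribute removed
sanitize_token(walker_token)     # nothing happens: the bug
```

The walker-style token passes through untouched because the type check never matches, which is exactly why the onload handler survives serialization.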

OBSERVATION

Having two very similar but subtly different data formats for the same data type is dangerous: how many other incompatibilities are there?

WORKAROUND

I am working around this problem as follows: when I need to apply a sanitizing filter to a DOM tree, instead I do the following:

  1. Serialize the DOM to HTML without sanitization.
  2. Re-parse the HTML from step 1, using the sanitizing tokenizer.
@gsnedders
Member Author

So the status is:

  • We only support sanitizing at the tokenizer level.
  • Sanitizing at the filter level is nominally supported but totally broken (it sometimes raises an exception and sometimes silently does nothing).

My gut says the filter level is the right level for the sanitizer to operate at: the middle of the parser doesn't make much sense, as what you really want to do is post-process the tree to remove what you dislike. I think we should probably go as far as removing the ability to change the tokenizer. The big downside, obviously, is that we go from the sanitizer only working as a tokenizer in 1.0b1 to that being unsupported, with the sanitizer only working as a filter, in 1.0b2…

Thoughts, @jgraham, @garethrees, @jsocol?

@gsnedders
Member Author

#24 has some relevance, but given the duck typing involved it doesn't help much.

@garethrees

Thoughts, @jgraham, @garethrees, @jsocol?

I'm garethrees.co.uk. GL!

@gsnedders
Member Author

Oops! Sorry! Attempt two: Thoughts, @gareth-rees, on the above?

@jsocol

jsocol commented Jun 27, 2013

Ostensibly I agree that the filters are the more "correct" place to do sanitization, even if it means huge changes in bleach for 1.0. I haven't really done it that way, though, so my one question is: do filters enable both dropping the tag completely (with or without its content) and replacing it with an escaped version (e.g. &lt;script&gt;)?

@gsnedders
Member Author

It's much easier to do with filters, as you're guaranteed a matching start tag and an end tag for each node, so you can maintain a simple stack to drop content. Obviously with tokenizers you either have to reimplement half the parser or accept you'll never get it quite right.
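The stack-based approach described above can be sketched in a few lines. This is a self-contained illustration over walker-style token dicts, not html5lib's actual filter; the DISALLOWED set and token stream are made up for the example:

```python
# Sketch of a stack-based sanitizing filter over walker-style tokens.
# Disallowed elements are dropped along with everything they contain.
# This is straightforward here because the tree walker guarantees a
# matching EndTag for every StartTag, unlike the raw tokenizer.

DISALLOWED = {"script"}

def drop_disallowed(tokens):
    depth = 0  # how deep we are inside a disallowed element
    for token in tokens:
        if token["type"] == "StartTag" and token["name"] in DISALLOWED:
            depth += 1          # entering a disallowed subtree
        elif token["type"] == "EndTag" and token["name"] in DISALLOWED:
            depth -= 1          # leaving it
        elif depth == 0:
            yield token         # only emit tokens outside disallowed subtrees

stream = [
    {"type": "StartTag", "name": "p", "data": []},
    {"type": "StartTag", "name": "script", "data": []},
    {"type": "Characters", "data": "sucker()"},
    {"type": "EndTag", "name": "script"},
    {"type": "Characters", "data": "hello"},
    {"type": "EndTag", "name": "p"},
]
kept = list(drop_disallowed(stream))
```

The script element and the text inside it are dropped, while the surrounding paragraph and its own text survive; a tokenizer-level sanitizer cannot do this reliably without reimplementing the parser's tag matching.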

gsnedders added a commit to gsnedders/html5lib-python that referenced this issue Aug 27, 2013
This drops support for the tokenizing side of things, which is sadly
the only side that works in previous releases.
olberger added a commit to olberger/venus that referenced this issue Jan 22, 2014
gsnedders added a commit to gsnedders/html5lib-python that referenced this issue May 19, 2014
As we no longer need the sanitizer to be shared between a filter and
a tokenizer, move the entire sanitizer to the filter module.

Also, replace the existing, tiny sanitizer testsuite with the one
in html5lib-tests.
jaromil pushed a commit to dyne/venus that referenced this issue Oct 15, 2014
@kurtmckee

Howdy! I'm working to migrate the HTML sanitizer in feedparser to rely on html5lib. However, some of the feedparser unit tests are triggering the TypeError bug referenced in #68. If @gsnedders has written viable code to resolve this, would it be possible to coordinate feedparser's migration with the integration of the fix for this?

@gsnedders
Member Author

@kurtmckee the fix for that currently is to use the sanitizer as a filter when tokenizing and not when serializing; this will change once #72 gets fixed (which is PR #110), as then you'll use the sanitizer as a filter when serializing and not when tokenizing.

gsnedders added a commit to gsnedders/html5lib-python that referenced this issue Nov 25, 2015
@gsnedders gsnedders modified the milestones: 1.0, 0.99999999 May 8, 2016
gsnedders added a commit that referenced this issue May 9, 2016
gsnedders added a commit that referenced this issue May 9, 2016
gsnedders added a commit to gsnedders/html5lib-python that referenced this issue May 17, 2016
gsnedders added a commit to gsnedders/html5lib-python that referenced this issue May 17, 2016
gsnedders added a commit to gsnedders/html5lib-python that referenced this issue May 18, 2016
@rando305

Any chance someone could write replacement code for the following?

    from html5lib import sanitizer

    sanitizer.HTMLSanitizer.acceptable_elements.extend(settings.TEXT_ADDITIONAL_TAGS)

This would really help me out. Thanks.
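A hedged sketch of the filter-era equivalent, assuming the post-#110 API, where the allow-list is a frozenset of (namespace, tag-name) pairs rather than a mutable class attribute. The stand-in allow-list and the TEXT_ADDITIONAL_TAGS values below are invented for illustration; check the names against the html5lib release you actually install:

```python
# Sketch of extending the allow-list in the filter-based sanitizer.
# Post-#110, the allow-list is an immutable frozenset of
# (namespace, local-name) pairs, so you build an extended set and
# pass it to the filter instead of mutating a class attribute.

HTML_NS = "http://www.w3.org/1999/xhtml"

# Stand-in for the library's default allow-list (illustrative values only):
allowed_elements = frozenset({(HTML_NS, "a"), (HTML_NS, "p")})

TEXT_ADDITIONAL_TAGS = ["marquee", "blink"]  # hypothetical settings value

extended = allowed_elements | {(HTML_NS, tag) for tag in TEXT_ADDITIONAL_TAGS}

# Then, rather than sanitize=True on the serializer, you would wrap the
# tree walker in the sanitizer filter, roughly:
#     walker = treewalkers.getTreeWalker("dom")(dom)
#     clean = sanitizer.Filter(walker, allowed_elements=extended)
```

The key behavioural difference from the old `acceptable_elements.extend(...)` pattern is that tag names are now namespaced, so plain tag strings must be paired with the HTML namespace.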
