
Sanitizing filter broken in 0.90 #72

Closed
gsnedders opened this issue Jun 21, 2013 · 9 comments · Fixed by #110

Comments

@gsnedders
Member

http://code.google.com/p/html5lib/issues/detail?id=162

Reported by [email protected], Oct 10, 2010

DESCRIPTION

Consider the following interaction with html5lib 0.90:

    >>> from html5lib import html5parser, serializer, treebuilders, treewalkers
    >>> p = html5parser.HTMLParser(tree = treebuilders.getTreeBuilder('dom'))
    >>> dom = p.parse("""<body onload="sucker()">""") 
    >>> s = serializer.htmlserializer.HTMLSerializer(sanitize = True)
    >>> ''.join(s.serialize(treewalkers.getTreeWalker('dom')(dom)))
    u'<body onload=sucker()>'

This is clearly incorrect: the onload attribute should have been removed by the sanitizer during the serialization.

ANALYSIS

The problem is that there are two sanitizers: a tokenizing sanitizer in html5lib.sanitizer, and a sanitizing filter in html5lib.filters.sanitizer. To avoid duplicating code, both sanitizers inherit from the class HTMLSanitizerMixin and both call that class's sanitize_token method.

Unfortunately, the format of tokens differs between tokenization and filtering. During tokenization, a token looks like this:

    >>> from html5lib import tokenizer
    >>> next(iter(tokenizer.HTMLTokenizer("""<body onload="sucker()">""")))
    {'selfClosing': False, 'data': [[u'onload', u'sucker()']], 'type': 3, 'name': u'body', 'selfClosingAcknowledged': False}

But during filtering, tokens look like this:

    >>> list(iter(treewalkers.getTreeWalker('dom')(dom)))[3]
    {'namespace': u'http://www.w3.org/1999/xhtml', 'type': 'StartTag', 'name': u'body', 'data': [(u'onload', u'sucker()')]}

When the sanitizing filter passes its token to the sanitize_token method of HTMLSanitizerMixin, nothing happens, because sanitize_token is expecting 'type' to be an integer.
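The failure mode can be illustrated without html5lib at all. This is a minimal sketch of the mechanism, not html5lib's actual code; the integer and string token types are taken from the transcripts above:

```python
# Sketch of the bug mechanism: a sanitizer keyed on integer token
# types silently ignores the string-typed tokens that tree walkers
# produce, so nothing gets sanitized.

START_TAG = 3  # integer type emitted by the tokenizer (see transcript above)

def sanitize_token(token):
    """Strip on* event-handler attributes, but only from tokens it recognises."""
    if token.get("type") == START_TAG:  # integer comparison
        token["data"] = [(k, v) for k, v in token["data"]
                         if not k.startswith("on")]
    return token  # walker tokens with type == 'StartTag' fall straight through

tokenizer_token = {"type": 3, "name": "body",
                   "data": [("onload", "sucker()")]}
walker_token = {"type": "StartTag", "name": "body",
                "data": [("onload", "sucker()")]}

sanitize_token(tokenizer_token)  # attribute removed
sanitize_token(walker_token)     # nothing happens: the bug
```

The walker-style token passes through untouched because the type check never matches, which is exactly why the onload handler survives serialization.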

OBSERVATION

Having two very similar but subtly different data formats for the same data type is dangerous: how many other incompatibilities are there?

WORKAROUND

I am working around this problem as follows: when I need to apply a sanitizing filter to a DOM tree, instead I do the following:

  1. Serialize the DOM to HTML without sanitization.
  2. Re-parse the HTML from step 1, using the sanitizing tokenizer.
@gsnedders
Member Author

So the status is:

  • We only support sanitizing at the tokenizer level.
  • Sanitizing at the filter level is nominally supported but totally broken (it sometimes raises an exception and sometimes silently does nothing).

My gut says the filter level is the right level for the sanitizer to operate at: the middle of the parser doesn't make much sense, as what you really want to do is post-process the tree to remove what you dislike. I think we should probably go as far as removing the ability to change the tokenizer. The big downside, obviously, is that we go from the sanitizer only working as a tokenizer in 1.0b1 to that being unsupported, with the sanitizer only working as a filter, in 1.0b2…

Thoughts, @jgraham, @garethrees, @jsocol?

@gsnedders
Member Author

#24 has some relevance, but given the duck typing involved it doesn't help much.

@garethrees

Thoughts, @jgraham, @garethrees, @jsocol?

I'm garethrees.co.uk. GL!

@gsnedders
Member Author

Oops! Sorry! Attempt two: Thoughts, @gareth-rees, on the above?

@jsocol

jsocol commented Jun 27, 2013

Ostensibly I agree that the filters are the more "correct" place to do sanitization, even if it means huge changes in bleach for 1.0. I haven't really done it that way, though, so my one question is: do filters enable both dropping the tag completely (with or without its content) and replacing it with an escaped version (e.g. &lt;script&gt;)?

@gsnedders
Member Author

It's much easier to do with filters, as you're guaranteed a matching start tag and an end tag for each node, so you can maintain a simple stack to drop content. Obviously with tokenizers you either have to reimplement half the parser or accept you'll never get it quite right.
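The stack-based approach described above can be sketched in a few lines. This is a self-contained illustration over walker-style token dicts, not html5lib's actual filter; the DISALLOWED set and token stream are made up for the example:

```python
# Sketch of a stack-based sanitizing filter over walker-style tokens.
# Disallowed elements are dropped along with everything they contain.
# This is straightforward here because the tree walker guarantees a
# matching EndTag for every StartTag, unlike the raw tokenizer.

DISALLOWED = {"script"}

def drop_disallowed(tokens):
    depth = 0  # how deep we are inside a disallowed element
    for token in tokens:
        if token["type"] == "StartTag" and token["name"] in DISALLOWED:
            depth += 1          # entering a disallowed subtree
        elif token["type"] == "EndTag" and token["name"] in DISALLOWED:
            depth -= 1          # leaving it
        elif depth == 0:
            yield token         # only emit tokens outside disallowed subtrees

stream = [
    {"type": "StartTag", "name": "p", "data": []},
    {"type": "StartTag", "name": "script", "data": []},
    {"type": "Characters", "data": "sucker()"},
    {"type": "EndTag", "name": "script"},
    {"type": "Characters", "data": "hello"},
    {"type": "EndTag", "name": "p"},
]
kept = list(drop_disallowed(stream))
```

The script element and the text inside it are dropped, while the surrounding paragraph and its own text survive; a tokenizer-level sanitizer cannot do this reliably without reimplementing the parser's tag matching.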

gsnedders added a commit to gsnedders/html5lib-python that referenced this issue Aug 27, 2013
This drops support for the tokenizing side of things, which is sadly
the only side that works in previous releases.
olberger added a commit to olberger/venus that referenced this issue Jan 22, 2014
gsnedders added a commit to gsnedders/html5lib-python that referenced this issue May 19, 2014
As we no longer need the sanitizer to be shared between a filter and
a tokenizer, move the entire sanitizer to the filter module.

Also, replace the existing, tiny sanitizer testsuite with the one
in html5lib-tests.
jaromil pushed a commit to dyne/venus that referenced this issue Oct 15, 2014
@kurtmckee

Howdy! I'm working to migrate the HTML sanitizer in feedparser to rely on html5lib. However, some of the feedparser unit tests are triggering the TypeError bug referenced in #68. If @gsnedders has written viable code to resolve this, would it be possible to coordinate feedparser's migration with the integration of the fix for this?

@gsnedders
Member Author

@kurtmckee the fix for that currently is to use the sanitizer as a filter when tokenizing and not when serializing; this will change once #72 gets fixed (which is PR #110), as then you'll use the sanitizer as a filter when serializing and not when tokenizing.

gsnedders added a commit to gsnedders/html5lib-python that referenced this issue Nov 25, 2015
@gsnedders gsnedders modified the milestones: 1.0, 0.99999999 May 8, 2016
gsnedders added a commit that referenced this issue May 9, 2016
gsnedders added a commit that referenced this issue May 9, 2016
gsnedders added a commit to gsnedders/html5lib-python that referenced this issue May 17, 2016
gsnedders added a commit to gsnedders/html5lib-python that referenced this issue May 17, 2016
gsnedders added a commit to gsnedders/html5lib-python that referenced this issue May 18, 2016
@rando305

Any chance someone could write replacement code for the following?

    from html5lib import sanitizer

    sanitizer.HTMLSanitizer.acceptable_elements.extend(settings.TEXT_ADDITIONAL_TAGS)

This would really help me out. Thanks.
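A hedged sketch of the filter-era equivalent, assuming the post-#110 API, where the allow-list is a frozenset of (namespace, tag-name) pairs rather than a mutable class attribute. The stand-in allow-list and the TEXT_ADDITIONAL_TAGS values below are invented for illustration; check the names against the html5lib release you actually install:

```python
# Sketch of extending the allow-list in the filter-based sanitizer.
# Post-#110, the allow-list is an immutable frozenset of
# (namespace, local-name) pairs, so you build an extended set and
# pass it to the filter instead of mutating a class attribute.

HTML_NS = "http://www.w3.org/1999/xhtml"

# Stand-in for the library's default allow-list (illustrative values only):
allowed_elements = frozenset({(HTML_NS, "a"), (HTML_NS, "p")})

TEXT_ADDITIONAL_TAGS = ["marquee", "blink"]  # hypothetical settings value

extended = allowed_elements | {(HTML_NS, tag) for tag in TEXT_ADDITIONAL_TAGS}

# Then, rather than sanitize=True on the serializer, you would wrap the
# tree walker in the sanitizer filter, roughly:
#     walker = treewalkers.getTreeWalker("dom")(dom)
#     clean = sanitizer.Filter(walker, allowed_elements=extended)
```

The key behavioural difference from the old `acceptable_elements.extend(...)` pattern is that tag names are now namespaced, so plain tag strings must be paired with the HTML namespace.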
