Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Move the sanitizer to purely be a filter. #110

Merged
merged 4 commits into from
May 18, 2016

Conversation

gsnedders
Copy link
Member

@gsnedders gsnedders commented Aug 27, 2013

This drops support for the tokenizing side of thing, which is sadly the only side that works in previous releases. Fixes #72. See all discussion there.

@hoppipolla-critic-bot
Copy link

Critic review: https://critic.hoppipolla.co.uk/r/287

This is an external review system which you may optionally use for the code review of your pull request.

@gsnedders gsnedders force-pushed the sanitizer-fixes-72 branch from 28bf43b to 28fb733 Compare May 17, 2016 22:10
@landscape-bot
Copy link

Code Health
Repository health increased by 0.45% when pulling 28fb733 on gsnedders:sanitizer-fixes-72 into f6741ea on html5lib:master.

@gsnedders
Copy link
Member Author

I think when squashed this is now ready to land.

@landscape-bot
Copy link

Code Health
Repository health increased by 0.46% when pulling f582c58 on gsnedders:sanitizer-fixes-72 into f6741ea on html5lib:master.

gsnedders added 3 commits May 18, 2016 17:48
As we no longer need the sanitizer to be shared between a filter and
a tokenizer, move the entire sanitizer to the filter module.
This is imported into this repo as its expectations are very much
implementation dependent, with expectations amended to match our
actual behaviour.
@gsnedders gsnedders force-pushed the sanitizer-fixes-72 branch from f582c58 to 57dfcae Compare May 18, 2016 16:49
@landscape-bot
Copy link

Code Health
Repository health increased by 0.46% when pulling 57dfcae on gsnedders:sanitizer-fixes-72 into f6741ea on html5lib:master.

# Remove attributes with disallowed URL values
for attr in (attr_names & self.attr_val_is_uri):
assert attr in attrs
# I don't have a clue where this regexp comes from or why it matches those
Copy link

@mgilson mgilson May 20, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like this regex is trying to remove:

`                Literal backtic...
\x00-\x20        Ascii table characters starting at NULL up to and including Space -- Mostly non-printable characters...
\x7f-\xa0        Bytes with values 127 (DELETE) -> 160.  More non-printable characters...
\s               Whitespace

I don't know why it would be doing these things ...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My assumption is it's something to do with legacy attribute parsing in old UAs, for whom such characters can alter parsing.

@gsnedders gsnedders deleted the sanitizer-fixes-72 branch May 20, 2016 17:18
gsnedders added a commit to gsnedders/html5lib-python that referenced this pull request May 22, 2016
This should be unneeded since the sanitizer changes (html5lib#110)
gsnedders added a commit to gsnedders/html5lib-python that referenced this pull request Jul 6, 2016
This should be unneeded since the sanitizer changes (html5lib#110)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Sanitizing filter broken in 0.90
4 participants