Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

How to be compatible with html5lib's sanitization? #624

Closed
sferencik opened this issue Dec 20, 2021 · 9 comments · Fixed by #625
Closed

How to be compatible with html5lib's sanitization? #624

sferencik opened this issue Dec 20, 2021 · 9 comments · Fixed by #625

Comments

@sferencik
Copy link
Contributor

I've been using html5lib's native sanitization, but I'm migrating to Bleach, as recommended in https://github.com/html5lib/html5lib-python/blob/master/CHANGES.rst#11. I'm replacing my previous code:

html5lib.serialize(html5parser.HTMLParser().parseFragment("<u>hi</u>"), sanitize=True)
# -> <u>hi</u>

with this:

bleach.clean("<u>hi</u>")
# -> &lt;u&gt;hi&lt;/u&gt;

Notice how Bleach escapes my <u>hi</u> because the u-tag isn't on its very conservative tokenizer allow-list. I am aware that I could specify my own list of allowed tags, like so:

bleach.clean("<u>hi</u>", tags=MY_LIST_THAT_INCLUDES_U)
# -> <u>hi</u>

but how would I construct MY_LIST_THAT_INCLUDES_U such that it's backward-compatible with what html5lib's sanitizer allows (allowed me until now)? I think I'd need to get at the larger tag set in the vendored html5lib sanitizer. Is that what I'm supposed to do? Is there a neater way to achieve that? Has this come up as a use case previously?

@willkg
Copy link
Member

willkg commented Dec 20, 2021

If you're migrating from one system to another and want to maintain the current behavior of the system you're migrating from, you could copy the list of HTML tags that you believe are safe from html5lib into your code.

You could use the vendored code, but that's not part of Bleach's exposed API and we're not tracking changes for that.

I'm not aware of how many other people are currently migrating from html5lib sanitizer to Bleach. Maybe it's worth putting together a migration guide. Would you want to tackle something like that?

@sferencik
Copy link
Contributor Author

Yes, I could help with the migration guide. Would you want to host it in Bleach, or would you rather see it in html5lib? Would you be open to exposing a compatibility mode in Bleach?

Also, is there a way to not turn special characters into HTML entities? Like when "' becomes &#34;&#39;. That's another incompatibility; I haven't quite got to the bottom of it but I don't suppose there's a way to keep these as "'?

@willkg
Copy link
Member

willkg commented Dec 20, 2021

I don't work on html5lib anymore, so I think this would be a migration guide that lived in the Bleach docs.

I'm not interested in implemented or maintaining a compatibility mode.

I don't know what you mean by your last question. Can you provide some example code?

@sferencik
Copy link
Contributor Author

By my last question, I meant this:

bleach.clean("\"'")
# -> &#34;&#39; instead of "'

Can the caller opt out and keep the quotes?

@willkg
Copy link
Member

willkg commented Dec 20, 2021

What's the use case for not switching to character entities?

@sferencik
Copy link
Contributor Author

sferencik commented Dec 20, 2021

Admittedly, not a very strong one. Some regression tests have triggered, but I've already proposed to the test owners to rebase these. I agree this shouldn't be an issue in most rendering contexts - but I don't own these tests and I just wanted to check if perhaps there was a way to not escape the quotes. (This is not a feature request! Just wanted to know in case there was an easy way out.)

... and based on your response, I'm assuming there isn't, correct?

@willkg
Copy link
Member

willkg commented Dec 20, 2021

Bleach sanitizes text for use in an HTML context. Because of that, I think it really should be encoding characters like <, >, ', ", etc. I think it should be encoding those characters. I forget what the sanitizer did. If it sanitized entire documents, that'd be different because Bleach would know what was before and after and could be less strict.

@sferencik
Copy link
Contributor Author

Thanks, makes sense. (In your last sentence, did you mean to say "the sanitizer would know what was before and after"?)

The one thing left then is the migration guide we discussed above. I've created #625.

@willkg
Copy link
Member

willkg commented Jan 3, 2022

The sanitizer is the primary component of Bleach, so I use them interchangeably. Sorry about that.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants