-
Notifications
You must be signed in to change notification settings - Fork 251
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
How to be compatible with html5lib's sanitization? #624
Comments
If you're migrating from one system to another and want to maintain the current behavior of the system you're migrating from, you could copy the list of HTML tags that you believe are safe from html5lib into your code. You could use the vendored code, but that's not part of Bleach's exposed API and we're not tracking changes for that. I'm not aware of how many other people are currently migrating from html5lib sanitizer to Bleach. Maybe it's worth putting together a migration guide. Would you want to tackle something like that? |
Yes, I could help with the migration guide. Would you want to host it in Bleach, or would you rather see it in html5lib? Would you be open to exposing a compatibility mode in Bleach? Also, is there a way to not turn special characters into HTML entities? Like when |
I don't work on html5lib anymore, so I think this would be a migration guide that lived in the Bleach docs. I'm not interested in implemented or maintaining a compatibility mode. I don't know what you mean by your last question. Can you provide some example code? |
By my last question, I meant this: bleach.clean("\"'")
# -> "' instead of "' Can the caller opt out and keep the quotes? |
What's the use case for not switching to character entities? |
Admittedly, not a very strong one. Some regression tests have triggered, but I've already proposed to the test owners to rebase these. I agree this shouldn't be an issue in most rendering contexts - but I don't own these tests and I just wanted to check if perhaps there was a way to not escape the quotes. (This is not a feature request! Just wanted to know in case there was an easy way out.) ... and based on your response, I'm assuming there isn't, correct? |
Bleach sanitizes text for use in an HTML context. Because of that, I think it really should be encoding characters like |
Thanks, makes sense. (In your last sentence, did you mean to say "the sanitizer would know what was before and after"?) The one thing left then is the migration guide we discussed above. I've created #625. |
The sanitizer is the primary component of Bleach, so I use them interchangeably. Sorry about that. |
I've been using html5lib's native sanitization, but I'm migrating to Bleach, as recommended in https://github.com/html5lib/html5lib-python/blob/master/CHANGES.rst#11. I'm replacing my previous code:
with this:
Notice how Bleach escapes my
<u>hi</u>
because theu
-tag isn't on its very conservative tokenizer allow-list. I am aware that I could specify my own list of allowed tags, like so:but how would I construct
MY_LIST_THAT_INCLUDES_U
such that it's backward-compatible with what html5lib's sanitizer allows (allowed me until now)? I think I'd need to get at the larger tag set in the vendored html5lib sanitizer. Is that what I'm supposed to do? Is there a neater way to achieve that? Has this come up as a use case previously?The text was updated successfully, but these errors were encountered: