Drop chardet #1269
One little edge case to consider here is what behaviour we expect for streaming text on responses that do not include any explicit charset. There are three different things we could choose to do here...
Currently this PR is using the first of those approaches.
return self.decoder.decode(data)
self.decoder = codecs.getincrementaldecoder("cp1252")(errors="replace")
else:
self.decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")
Why `("utf-8")(errors="replace")` here if it passed with `("utf-8")(errors="strict")` to get here?
We need `strict` to raise an error if it doesn't appear to decode as UTF-8, but once we've made the decision we use `errors="replace"` for the most robust behaviour possible. So e.g. if we've got a streaming response that initially appears to be UTF-8, but later has some non-UTF-8 bytes, then we're not raising a hard error on accessing `.text`.
(We'd like it to have a failure mode that is as graceful as possible.)
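A minimal sketch of the strict-then-replace idea described above, for anyone skimming the thread. The class name `SketchTextDecoder` and its `flush()` method are illustrative assumptions, not necessarily what this PR ships; only the `codecs.getincrementaldecoder(...)` calls are taken from the diff.

```python
import codecs
from typing import Optional


class SketchTextDecoder:
    """Hypothetical decoder: probe the first chunk strictly, then decode leniently."""

    def __init__(self, encoding: Optional[str] = None) -> None:
        self.decoder: Optional[codecs.IncrementalDecoder] = None
        if encoding is not None:
            # An explicitly specified charset always wins.
            self.decoder = codecs.getincrementaldecoder(encoding)(errors="replace")

    def decode(self, data: bytes) -> str:
        if self.decoder is None:
            # First chunk only: a throwaway strict decoder makes the decision.
            probe = codecs.getincrementaldecoder("utf-8")(errors="strict")
            try:
                probe.decode(data)
            except UnicodeDecodeError:
                # Does not look like UTF-8, so fall back to Windows 1252.
                self.decoder = codecs.getincrementaldecoder("cp1252")(errors="replace")
            else:
                # Looks like UTF-8; later bad bytes become U+FFFD, not errors.
                self.decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")
        return self.decoder.decode(data)

    def flush(self) -> str:
        # Emit anything still buffered (e.g. a truncated multi-byte sequence).
        if self.decoder is None:
            return ""
        return self.decoder.decode(b"", True)


utf8_stream = SketchTextDecoder()
first = utf8_stream.decode("café".encode("utf-8"))  # chosen as UTF-8
later = utf8_stream.decode(b"\xff")                 # invalid byte, replaced

legacy_stream = SketchTextDecoder()
legacy = legacy_stream.decode(b"caf\xe9 au lait")   # not UTF-8, so cp1252
```

One caveat with a sketch like this: an incremental decoder buffers an incomplete trailing sequence rather than raising, so a first chunk that ends in a lone lead byte still passes the strict probe.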
Co-authored-by: Florimond Manca <[email protected]>
Alrighty then, let's press on with this! 👍
Closes #1018
- Drops `chardet` for charset auto-detection.
- Drops `response.apparent_encoding`.
- `response.iter_text()` no longer buffers.
- Responses with `Content-Type: text/...` and no explicit `charset` no longer default to `iso-8859-1`, since the RFC behaviour there is considered obsolete.

Instead, this simplifies our auto-detection approach, so that...
If an encoding is explicitly specified, then we use that. Otherwise our strategy is to attempt UTF-8, and fall back to Windows 1252.
Note that UTF-8 is a strict superset of ASCII, and Windows 1252 is a superset of the non-control characters in iso-8859-1, so we essentially end up supporting any of ASCII, utf-8, iso-8859-1, cp1252.
Given that UTF-8 is now by far the most widely used encoding, this should be a pretty robust strategy for cases where a charset has not been explicitly included.
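To make the "explicit charset, else UTF-8, else Windows 1252" strategy concrete, here is a rough non-streaming sketch. `decode_body` is a hypothetical helper written for illustration and is not part of the library's API; it only uses the standard `bytes.decode`.

```python
from typing import Optional


def decode_body(content: bytes, charset: Optional[str] = None) -> str:
    """Hypothetical helper: explicit charset, else UTF-8, else Windows 1252."""
    if charset is not None:
        # An explicitly specified encoding is always used as given.
        return content.decode(charset, errors="replace")
    try:
        # Strict attempt, so that non-UTF-8 input is detected.
        return content.decode("utf-8")
    except UnicodeDecodeError:
        # Lenient fallback; undefined cp1252 bytes become U+FFFD.
        return content.decode("cp1252", errors="replace")


# The same text-like payloads in each of the four encodings mentioned above.
samples = {
    "ascii": ("plain text", "ascii"),
    "utf-8": ("naïve €5", "utf-8"),
    "iso-8859-1": ("naïve", "iso-8859-1"),
    "cp1252": ("“quoted” €5", "cp1252"),
}

# Each payload is decoded without being told which encoding produced it.
decoded = {
    name: decode_body(text.encode(encoding))
    for name, (text, encoding) in samples.items()
}
```

Under these assumptions every sample round-trips to its original text, which is the sense in which the strategy covers ASCII, utf-8, iso-8859-1 and cp1252.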
Useful stats on the prevalence of different charsets in the wild...
The HTML5 spec also has some useful guidelines, suggesting defaults of either UTF-8 or Windows 1252 in most cases...
Users can override this behaviour if required with an explicit `response.encoding = ...`.
I do also have some thoughts about exposing a `text_decoder=...` interface to allow overriding this behaviour, but I think we should probably treat that as a follow-up PR.