Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Drop chardet #1269

Merged
merged 12 commits into from
Sep 15, 2020
Merged

Drop chardet #1269

merged 12 commits into from
Sep 15, 2020

Conversation

tomchristie
Copy link
Member

@tomchristie tomchristie commented Sep 8, 2020

Closes #1018

  • Drops chardet for charset auto-detection.
  • Drops response.apparent_encoding.
  • response.iter_text() no longer buffers.
  • Responses with Content-Type: text/... and no explicit charset no longer default to iso-8859-1, since the RFC there is considered obsoleted behaviour.

Instead simplifies our auto-detection approach, so that...

If an encoding is explicitly specified, then we use that. Otherwise our strategy is to attempt UTF-8, and fallback to Windows 1252.

Note that UTF-8 is a strict superset of ascii, and Windows 1252 is a superset of the non-control characters in iso-8859-1, so we essentially end up supporting any of ascii, utf-8, iso-8859-1, cp1252.

Given that UTF-8 is now by far the most widely used encoding, this should be a pretty robust strategy for cases where a charset has not been explicitly included.

Useful stats on the prevalence of different charsets in the wild...

The HTML5 spec also has some useful guidelines, suggesting defaults of either UTF-8 or Windows 1252 in most cases...

Users can override this behaviour if required with an explicit response.encoding = ....

I do also have some thoughts about exposing a text_decoder=... interface to allow overriding this behaviour, but I think we should probably treat that as a follow-up PR.

@tomchristie tomchristie added the enhancement New feature or request label Sep 9, 2020
@tomchristie tomchristie marked this pull request as ready for review September 9, 2020 12:53
@tomchristie
Copy link
Member Author

One little edge case to consider here is what behaviour we expect for streaming text on responses that do not include any explicit charset. There are three different things we could choose to do here...

  • Decide which of utf-8 / cp1252 to use on the first decoder pass, and stick with that. (Potentially more robust, but also in edge cases could potentially result in inconsistent behaviour depending on how much data is received in the first pass.)
  • Always use utf-8 in streaming cases. (More consistent, potentially less robust.)
  • Buffer up some minimum amount of data before selecting a decoder. (Most robust, but introduces buffering artefacts.)

Currently this PR is using the first of those approaches.

return self.decoder.decode(data)
self.decoder = codecs.getincrementaldecoder("cp1252")(errors="replace")
else:
self.decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")
Copy link
Contributor

@StephenBrown2 StephenBrown2 Sep 9, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why ("utf-8")(errors="replace") here if it passed with ("utf-8")(errors="strict") to get here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need strict to raise an error if it doesn't appear to decode as UTF-8, but once we've made the decision we use errors="replace" for the most robust behaviour possible. So eg. if we've got a streaming response that initially appears to be UTF-8, but later has some non-UTF-8 bytes, then we're not raising a hard error on accessing .text.

(We'd like it to have a failure mode that is as graceful as possible.)

docs/quickstart.md Outdated Show resolved Hide resolved
@tomchristie
Copy link
Member Author

Alrighty then, let's press on with this! 👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Replace chardet with plain UTF-8 detection
3 participants