Drop chardet #1269

tomchristie · 2020-09-08T16:13:39Z

Drops chardet for charset auto-detection.
Drops response.apparent_encoding.
response.iter_text() no longer buffers.
Responses with Content-Type: text/... and no explicit charset no longer default to iso-8859-1, since the RFC behaviour there is now considered obsolete.

Instead simplifies our auto-detection approach, so that...

If an encoding is explicitly specified, then we use that. Otherwise our strategy is to attempt UTF-8, and fall back to Windows 1252.

Note that UTF-8 is a strict superset of ascii, and Windows 1252 is a superset of the non-control characters in iso-8859-1, so we essentially end up supporting any of ascii, utf-8, iso-8859-1, cp1252.
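As a rough illustration of the strategy described above, here is a minimal sketch for a fully buffered body. The helper name `decode_body` and its `declared_encoding` parameter are hypothetical and are not part of httpx; the fallback uses errors="replace" because a handful of byte values are undefined in cp1252.

```python
from typing import Optional


def decode_body(content: bytes, declared_encoding: Optional[str] = None) -> str:
    """Hypothetical helper (not httpx's API) sketching the detection strategy."""
    if declared_encoding is not None:
        # An explicitly specified charset always wins.
        return content.decode(declared_encoding, errors="replace")
    try:
        # Attempt UTF-8 first. Pure ASCII input always succeeds here.
        return content.decode("utf-8")
    except UnicodeDecodeError:
        # Fall back to Windows 1252, which also covers the printable
        # characters of iso-8859-1.
        return content.decode("cp1252", errors="replace")


utf8_text = decode_body("café".encode("utf-8"))   # decoded as UTF-8
latin1_text = decode_body(b"caf\xe9")             # not valid UTF-8, so cp1252
```

Both calls produce the same string, even though the two inputs are encoded differently.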

Given that UTF-8 is now by far the most widely used encoding, this should be a pretty robust strategy for cases where a charset has not been explicitly included.

Useful stats on the prevalence of different charsets in the wild...

The HTML5 spec also has some useful guidelines, suggesting defaults of either UTF-8 or Windows 1252 in most cases...

https://dev.w3.org/html5/spec-LC/Overview.html

Users can override this behaviour if required with an explicit response.encoding = ....

I do also have some thoughts about exposing a text_decoder=... interface to allow overriding this behaviour, but I think we should probably treat that as a follow-up PR.

tomchristie · 2020-09-09T13:00:20Z

One little edge case to consider here is what behaviour we expect for streaming text on responses that do not include any explicit charset. There are three different things we could choose to do here...

- Decide which of utf-8 / cp1252 to use on the first decoder pass, and stick with that. (Potentially more robust, but also in edge cases could potentially result in inconsistent behaviour depending on how much data is received in the first pass.)
- Always use utf-8 in streaming cases. (More consistent, potentially less robust.)
- Buffer up some minimum amount of data before selecting a decoder. (Most robust, but introduces buffering artefacts.)

Currently this PR is using the first of those approaches.
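To make the trade-off of the first approach concrete, here is a minimal sketch of a decoder that commits to an encoding on its first pass. The names `FirstPassTextDecoder` and `decode_stream` are hypothetical, and this is an illustration built on the standard `codecs` module rather than the exact code in this PR.

```python
import codecs
from typing import Iterable


class FirstPassTextDecoder:
    """Hypothetical sketch: pick utf-8 or cp1252 on the first chunk, then keep it."""

    def __init__(self) -> None:
        self.decoder = None

    def decode(self, data: bytes) -> str:
        if self.decoder is None:
            if not data:
                return ""  # Nothing to inspect yet, so defer the decision.
            # Trial decode with a strict incremental decoder. A multi-byte
            # character cut off at the end of the chunk is not an error.
            trial = codecs.getincrementaldecoder("utf-8")(errors="strict")
            try:
                trial.decode(data)
            except UnicodeDecodeError:
                self.decoder = codecs.getincrementaldecoder("cp1252")(errors="replace")
            else:
                self.decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")
        return self.decoder.decode(data)

    def flush(self) -> str:
        if self.decoder is None:
            return ""
        return self.decoder.decode(b"", final=True)


def decode_stream(chunks: Iterable[bytes]) -> str:
    decoder = FirstPassTextDecoder()
    return "".join(decoder.decode(chunk) for chunk in chunks) + decoder.flush()


body = b"caf\xe9 au lait"  # cp1252 / iso-8859-1 bytes, not valid UTF-8
one_chunk = decode_stream([body])
two_chunks = decode_stream([body[:3], body[3:]])
```

With this sketch, `one_chunk` is decoded as cp1252, while `two_chunks` commits to UTF-8 because the first three bytes are plain ASCII, so the later 0xE9 byte becomes a U+FFFD replacement character. That is the chunk-size-dependent inconsistency mentioned above.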

StephenBrown2 · 2020-09-09T15:18:15Z

httpx/_decoders.py

-                return self.decoder.decode(data)
+                self.decoder = codecs.getincrementaldecoder("cp1252")(errors="replace")
+            else:
+                self.decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")


Why ("utf-8")(errors="replace") here if it passed with ("utf-8")(errors="strict") to get here?

tomchristie · 2020-09-09

We need errors="strict" so that an error is raised if the data doesn't appear to decode as UTF-8, but once we've made the decision we use errors="replace" for the most robust behaviour possible. So e.g. if we've got a streaming response that initially appears to be UTF-8 but later contains some non-UTF-8 bytes, we don't raise a hard error on accessing .text.

(We'd like it to have a failure mode that is as graceful as possible.)
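The difference between the two error modes can be seen with the standard library alone. This is a small sketch under the assumption of a stream whose first chunk is valid UTF-8 and whose second chunk is not; the helper name `decode_chunks` is hypothetical.

```python
import codecs
from typing import List


def decode_chunks(chunks: List[bytes], errors: str) -> str:
    """Hypothetical helper: decode chunks incrementally as UTF-8."""
    decoder = codecs.getincrementaldecoder("utf-8")(errors=errors)
    parts = [decoder.decode(chunk) for chunk in chunks]
    parts.append(decoder.decode(b"", final=True))  # Flush any buffered bytes.
    return "".join(parts)


# The first chunk is valid UTF-8; the second contains 0xFF, which never is.
chunks = ["héllo ".encode("utf-8"), b"w\xffrld"]

# errors="replace" substitutes U+FFFD for the bad byte and carries on.
lenient = decode_chunks(chunks, errors="replace")

# errors="strict" raises as soon as the bad byte is reached.
try:
    decode_chunks(chunks, errors="strict")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True
```

Using the strict mode only for the initial decision, and the lenient mode afterwards, keeps detection accurate without turning late surprises into exceptions.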

tomchristie · 2020-09-15T10:20:15Z

Alrighty then, let's press on with this! 👍

tomchristie added 8 commits September 7, 2020 14:20

Internal refactoring to swap auth/redirects ordering

2ff1573

Drop chardet for charset detection

9fe9521

Drop chardet in favour of simpler charset autodetection

d038272

Revert unintentionally included changes

dd8a8a4

Update test case

adcce7e

Merge branch 'master' into drop-chardet

7bce9f3

Refactor to prefer different decoding style

7e1568c

Update text decoding docs/docstrings

12f2664

tomchristie added the enhancement label Sep 9, 2020

tomchristie marked this pull request as ready for review September 9, 2020 12:53

Resolve typo

b1d91c9

StephenBrown2 reviewed Sep 9, 2020

florimondmanca reviewed Sep 9, 2020

tomchristie and others added 2 commits September 10, 2020 13:56

Merge branch 'master' into drop-chardet

5353a9a

Update docs/quickstart.md

de579ac

Co-authored-by: Florimond Manca <[email protected]>

konstin mentioned this pull request Sep 11, 2020

Text encoding is changed when passing a utf-8 string to content lundberg/respx#73

Closed

Merge branch 'master' into drop-chardet

2c69ff5

tomchristie merged commit d0fe113 into master Sep 15, 2020

tomchristie deleted the drop-chardet branch September 15, 2020 10:20

tomchristie mentioned this pull request Sep 21, 2020

Version 0.15.0 #1301

Merged

4 tasks

villebro mentioned this pull request Jan 9, 2021

build: try to speed up Github workflows apache/superset#12090

Merged

6 tasks

johnthagen mentioned this pull request Jul 15, 2021

Consider non-LGPL character encoding library #1013

Closed

benoit74 mentioned this pull request Jun 14, 2024

Automated encoding detection is still not working properly openzim/warc2zim#312

Closed
