Switch default fallback encoding detection lib to charset-normalizer
This change improves the performance of the encoding
detection by substituting the backend lib with the
new `Charset-Normalizer` (used to be `Chardet`).

The patch is backward-compatible API-wise, except
that the dependency is different.

PR aio-libs#5930

Co-authored-by: Sviatoslav Sydorenko <[email protected]>
(cherry picked from commit 2d5597e)
Ousret authored and webknjaz committed Oct 20, 2021
1 parent 04d9ac4 commit cfecafa
Showing 11 changed files with 26 additions and 21 deletions.
1 change: 1 addition & 0 deletions CHANGES/5930.feature
@@ -0,0 +1 @@
+Switched ``chardet`` to ``charset-normalizer`` for guessing the HTTP payload body encoding -- :user:`Ousret`.
1 change: 1 addition & 0 deletions CONTRIBUTORS.txt
@@ -7,6 +7,7 @@ Adam Horacek
 Adam Mills
 Adrian Krupa
 Adrián Chaves
+Ahmed Tahri
 Alan Tse
 Alec Hanefeld
 Alejandro Gómez
4 changes: 2 additions & 2 deletions README.rst
@@ -161,14 +161,14 @@ Requirements
 - Python >= 3.6
 - async-timeout_
 - attrs_
-- chardet_
+- charset-normalizer_
 - multidict_
 - yarl_

 Optionally you may install the cChardet_ and aiodns_ libraries (highly
 recommended for sake of speed).

-.. _chardet: https://pypi.python.org/pypi/chardet
+.. _charset-normalizer: https://pypi.org/project/charset-normalizer
 .. _aiodns: https://pypi.python.org/pypi/aiodns
 .. _attrs: https://github.com/python-attrs/attrs
 .. _multidict: https://pypi.python.org/pypi/multidict
2 changes: 1 addition & 1 deletion aiohttp/client_reqrep.py
@@ -69,7 +69,7 @@
 try:
     import cchardet as chardet
 except ImportError:  # pragma: no cover
-    import chardet  # type: ignore[no-redef]
+    import charset_normalizer as chardet  # type: ignore[no-redef]


 __all__ = ("ClientRequest", "ClientResponse", "RequestInfo", "Fingerprint")
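The aliasing in this hunk works because both libraries expose a chardet-style ``detect()`` function that takes bytes and returns a dict with an ``"encoding"`` key. A minimal sketch of how calling code can stay unchanged is shown below. The helper name ``guess_encoding`` and the UTF-8 default are assumptions made for illustration, not part of this commit, and the sketch also tolerates neither library being installed, which aiohttp itself does not need to do.

```python
try:
    import cchardet as chardet  # optional C-accelerated detector
except ImportError:
    try:
        import charset_normalizer as chardet  # pure-Python fallback
    except ImportError:
        chardet = None  # neither library installed; detection is skipped


def guess_encoding(body: bytes, default: str = "utf-8") -> str:
    """Hypothetical helper: return a detected encoding name.

    Falls back to `default` when no detector is available or when the
    detector cannot name an encoding (it reports None in that case).
    """
    if chardet is None:
        return default
    detected = chardet.detect(body)["encoding"]
    return detected or default
```

Since the alias keeps the name ``chardet``, every existing ``chardet.detect(...)`` call site continues to work regardless of which backend was imported.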
16 changes: 8 additions & 8 deletions docs/client_reference.rst
@@ -1374,10 +1374,10 @@ Response object
 specified *encoding* parameter.

 If *encoding* is ``None`` content encoding is autocalculated
-using ``Content-Type`` HTTP header and *chardet* tool if the
+using ``Content-Type`` HTTP header and *charset-normalizer* tool if the
 header is not provided by server.

-:term:`cchardet` is used with fallback to :term:`chardet` if
+:term:`cchardet` is used with fallback to :term:`charset-normalizer` if
 *cchardet* is not available.

 Close underlying connection if data reading gets an error,
@@ -1389,14 +1389,14 @@ Response object

 :return str: decoded *BODY*

-:raise LookupError: if the encoding detected by chardet or cchardet is
+:raise LookupError: if the encoding detected by cchardet is
 unknown by Python (e.g. VISCII).

 .. note::

 If response has no ``charset`` info in ``Content-Type`` HTTP
-header :term:`cchardet` / :term:`chardet` is used for content
-encoding autodetection.
+header :term:`cchardet` / :term:`charset-normalizer` is used for
+content encoding autodetection.

 It may hurt performance. If page encoding is known passing
 explicit *encoding* parameter might help::
@@ -1411,7 +1411,7 @@ Response object
 a ``read`` call will be done,

 If *encoding* is ``None`` content encoding is autocalculated
-using :term:`cchardet` or :term:`chardet` as fallback if
+using :term:`cchardet` or :term:`charset-normalizer` as fallback if
 *cchardet* is not available.

 if response's `content-type` does not match `content_type` parameter
@@ -1449,11 +1449,11 @@ Response object
 Automatically detect content encoding using ``charset`` info in
 ``Content-Type`` HTTP header. If this info is not exists or there
 are no appropriate codecs for encoding then :term:`cchardet` /
-:term:`chardet` is used.
+:term:`charset-normalizer` is used.

 Beware that it is not always safe to use the result of this function to
 decode a response. Some encodings detected by cchardet are not known by
-Python (e.g. VISCII).
+Python (e.g. VISCII). *charset-normalizer* is not concerned by that issue.

 :raise RuntimeError: if called before the body has been read,
 for :term:`cchardet` usage
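The documentation text in these hunks describes a two-step decision: use the ``charset`` parameter from ``Content-Type`` when it names a codec Python knows, otherwise fall back to detection. The sketch below illustrates only the first step with the standard library. The function names ``declared_charset`` and ``usable_charset`` are hypothetical, and this is not aiohttp's actual implementation.

```python
import codecs
from email.message import Message
from typing import Optional


def declared_charset(content_type: str) -> Optional[str]:
    """Extract the charset parameter from a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = content_type
    # Returns the lowercased charset, or None when the parameter is absent.
    return msg.get_content_charset()


def usable_charset(content_type: str) -> Optional[str]:
    """Return the declared charset only if Python has a codec for it.

    A None result means the caller would need to fall back to a
    detector such as cchardet or charset-normalizer.
    """
    charset = declared_charset(content_type)
    if charset is None:
        return None
    try:
        codecs.lookup(charset)
    except LookupError:
        # Same failure mode the docs mention for encodings unknown to Python.
        return None
    return charset
```

Checking the codec up front avoids the ``LookupError`` that decoding with an unknown encoding name would otherwise raise.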
7 changes: 4 additions & 3 deletions docs/glossary.rst
@@ -32,11 +32,12 @@
 Any object that can be called. Use :func:`callable` to check
 that.

-chardet
+charset-normalizer

-The Universal Character Encoding Detector
+The Real First Universal Charset Detector.
+Open, modern and actively maintained alternative to Chardet.

-https://pypi.python.org/pypi/chardet/
+https://pypi.org/project/charset-normalizer/

 cchardet

8 changes: 4 additions & 4 deletions docs/index.rst
@@ -34,7 +34,7 @@ Library Installation
 $ pip install aiohttp
 You may want to install *optional* :term:`cchardet` library as faster
-replacement for :term:`chardet`:
+replacement for :term:`charset-normalizer`:

 .. code-block:: bash
@@ -51,7 +51,7 @@ This option is highly recommended:
 Installing speedups altogether
 ------------------------------

-The following will get you ``aiohttp`` along with :term:`chardet`,
+The following will get you ``aiohttp`` along with :term:`charset-normalizer`,
 :term:`aiodns` and ``Brotli`` in one bundle. No need to type
 separate commands anymore!

@@ -149,11 +149,11 @@ Dependencies
 - Python 3.6+
 - *async_timeout*
 - *attrs*
-- *chardet*
+- *charset-normalizer*
 - *multidict*
 - *yarl*
 - *Optional* :term:`cchardet` as faster replacement for
-:term:`chardet`.
+:term:`charset-normalizer`.

 Install it explicitly via:

2 changes: 2 additions & 0 deletions docs/spelling_wordlist.txt
@@ -122,6 +122,7 @@ canonicalization
 canonicalize
 cchardet
 ceil
+Chardet
 charset
 charsetdetect
 chunked
@@ -226,6 +227,7 @@ namespace
 netrc
 nginx
 noop
+normalizer
 nowait
 optimizations
 os
2 changes: 1 addition & 1 deletion requirements/base.txt
@@ -8,7 +8,7 @@ asynctest==0.13.0; python_version<"3.8"
 attrs==21.2.0
 Brotli==1.0.9
 cchardet==2.1.7
-chardet==4.0.0
+charset-normalizer==2.0.4
 frozenlist==1.2.0
 gunicorn==20.1.0
 idna-ssl==1.1.0; python_version<"3.7"
2 changes: 1 addition & 1 deletion requirements/dev.txt
@@ -49,7 +49,7 @@ cfgv==3.2.0
 # via
 # -r requirements/lint.txt
 # pre-commit
-chardet==4.0.0
+charset-normalizer==2.0.4
 # via
 # -r requirements/base.txt
 # requests
2 changes: 1 addition & 1 deletion setup.py
@@ -64,7 +64,7 @@ def build_extension(self, ext):

 install_requires = [
     "attrs>=17.3.0",
-    "chardet>=2.0,<5.0",
+    "charset-normalizer>=2.0,<3.0",
     "multidict>=4.5,<7.0",
     "async_timeout>=4.0.0a3,<5.0",
     'asynctest==0.13.0; python_version<"3.8"',
