Release 3.1 (#270)
Ousret authored Mar 6, 2023
1 parent 86617ac commit db9af43
Showing 7 changed files with 99 additions and 69 deletions.
7 changes: 5 additions & 2 deletions CHANGELOG.md
@@ -2,14 +2,17 @@
All notable changes to charset-normalizer will be documented in this file. This project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/).

## [3.1.0-dev0](https://github.com/Ousret/charset_normalizer/compare/3.0.1...master) (unreleased)
## [3.1.0](https://github.com/Ousret/charset_normalizer/compare/3.0.1...3.1.0) (2023-03-06)

### Added
- Argument `should_rename_legacy` for legacy function `detect` and disregard any new arguments without errors (PR #261)
- Argument `should_rename_legacy` for legacy function `detect` and disregard any new arguments without errors (PR #262)

### Removed
- Support for Python 3.6 (PR #260)

### Changed
- Optional speedup provided by mypy/c 1.0.1

## [3.0.1](https://github.com/Ousret/charset_normalizer/compare/3.0.0...3.0.1) (2022-11-18)

### Fixed
50 changes: 25 additions & 25 deletions README.md
@@ -23,18 +23,18 @@
This project offers you an alternative to **Universal Charset Encoding Detector**, also known as **Chardet**.

| Feature | [Chardet](https://github.com/chardet/chardet) | Charset Normalizer | [cChardet](https://github.com/PyYoshi/cChardet) |
| ------------- | :-------------: | :------------------: | :------------------: |
| `Fast` | ❌<br> | ✅<br> | ✅ <br> |
| `Universal**` | | | |
| `Reliable` **without** distinguishable standards | || |
| `Reliable` **with** distinguishable standards ||| |
| `License` | LGPL-2.1<br>_restrictive_ | MIT | MPL-1.1<br>_restrictive_ |
| `Native Python` ||| |
| `Detect spoken language` ||| N/A |
| `UnicodeDecodeError Safety` ||| |
| `Whl Size` | 193.6 kB | 39.5 kB | ~200 kB |
| `Supported Encoding` | 33 | :tada: [90](https://charset-normalizer.readthedocs.io/en/latest/user/support.html#supported-encodings) | 40
| Feature | [Chardet](https://github.com/chardet/chardet) | Charset Normalizer | [cChardet](https://github.com/PyYoshi/cChardet) |
|--------------------------------------------------|:---------------------------------------------:|:------------------------------------------------------------------------------------------------------:|:-----------------------------------------------:|
| `Fast` | ❌<br> | ✅<br> | ✅ <br> |
| `Universal**` | | | |
| `Reliable` **without** distinguishable standards | || |
| `Reliable` **with** distinguishable standards ||| |
| `License` | LGPL-2.1<br>_restrictive_ | MIT | MPL-1.1<br>_restrictive_ |
| `Native Python` ||| |
| `Detect spoken language` ||| N/A |
| `UnicodeDecodeError Safety` ||| |
| `Whl Size` | 193.6 kB | 39.5 kB | ~200 kB |
| `Supported Encoding` | 33 | :tada: [90](https://charset-normalizer.readthedocs.io/en/latest/user/support.html#supported-encodings) | 40 |

<p align="center">
<img src="https://i.imgflip.com/373iay.gif" alt="Reading Normalized Text" width="226"/><img src="https://media.tenor.com/images/c0180f70732a18b4965448d33adba3d0/tenor.gif" alt="Cat Reading Text" width="200"/>
@@ -50,15 +50,15 @@ Did you got there because of the logs? See [https://charset-normalizer.readthedo

This package offers better performance than its counterpart Chardet. Here are some numbers.

| Package | Accuracy | Mean per file (ms) | File per sec (est) |
| ------------- | :-------------: | :------------------: | :------------------: |
| [chardet](https://github.com/chardet/chardet) | 86 % | 200 ms | 5 file/sec |
| charset-normalizer | **98 %** | **10 ms** | 100 file/sec |
| Package | Accuracy | Mean per file (ms) | File per sec (est) |
|-----------------------------------------------|:--------:|:------------------:|:------------------:|
| [chardet](https://github.com/chardet/chardet) | 86 % | 200 ms | 5 file/sec |
| charset-normalizer | **98 %** | **10 ms** | 100 file/sec |

| Package | 99th percentile | 95th percentile | 50th percentile |
| ------------- | :-------------: | :------------------: | :------------------: |
| [chardet](https://github.com/chardet/chardet) | 1200 ms | 287 ms | 23 ms |
| charset-normalizer | 100 ms | 50 ms | 5 ms |
| Package | 99th percentile | 95th percentile | 50th percentile |
|-----------------------------------------------|:---------------:|:---------------:|:---------------:|
| [chardet](https://github.com/chardet/chardet) | 1200 ms | 287 ms | 23 ms |
| charset-normalizer | 100 ms | 50 ms | 5 ms |

Chardet's performance on larger files (1MB+) is very poor. Expect a huge difference on large payloads.

@@ -185,15 +185,15 @@ Don't confuse package **ftfy** with charset-normalizer or chardet. ftfy goal is
## 🍰 How

- Discard all charset encoding table that could not fit the binary content.
- Measure chaos, or the mess once opened (by chunks) with a corresponding charset encoding.
- Measure noise, or the mess once opened (by chunks) with a corresponding charset encoding.
- Extract matches with the lowest mess detected.
- Additionally, we measure coherence / probe for a language.
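
The steps above can be sketched with the standard library alone. This is a toy illustration, not the project's actual detector: the candidate list, the `noise_ratio` heuristic (share of control, unassigned, or private-use characters) and the function names are all assumptions made for the example.

```python
import unicodedata

# Hypothetical, deliberately short candidate list.
CANDIDATES = ["ascii", "utf_8", "cp1252", "latin_1", "cp1251"]


def noise_ratio(text: str) -> float:
    """Toy 'mess' score: fraction of characters that rarely appear in prose."""
    if not text:
        return 0.0
    suspicious = sum(
        1
        for ch in text
        if unicodedata.category(ch) in {"Cc", "Cn", "Co"} and ch not in "\r\n\t"
    )
    return suspicious / len(text)


def rank_candidates(payload: bytes):
    """Drop encodings that cannot decode the payload, then sort by noise."""
    scored = []
    for name in CANDIDATES:
        try:
            text = payload.decode(name)
        except UnicodeDecodeError:
            continue  # step 1: this table does not fit the binary content
        scored.append((noise_ratio(text), name))  # step 2: measure the mess
    return sorted(scored)  # step 3: lowest mess first


if __name__ == "__main__":
    for score, name in rank_candidates("€uro – test".encode("cp1252")):
        print(f"{name}: {score:.3f}")
```

With this sample, `ascii` and `utf_8` are discarded and `latin_1` ranks last because it yields control characters, but `cp1251` and `cp1252` tie on noise. Breaking that kind of tie is what the additional coherence / language probe is for.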

**Wait a minute**, what is chaos/mess and coherence according to **YOU ?**
**Wait a minute**, what is noise/mess and coherence according to **YOU ?**

*Chaos :* I opened hundred of text files, **written by humans**, with the wrong encoding table. **I observed**, then
*Noise :* I opened hundred of text files, **written by humans**, with the wrong encoding table. **I observed**, then
**I established** some ground rules about **what is obvious** when **it seems like** a mess.
I know that my interpretation of what is chaotic is very subjective, feel free to contribute in order to
I know that my interpretation of what is noise is probably incomplete, feel free to contribute in order to
improve or rewrite it.

*Coherence :* For each language there is on earth, we have computed ranked letter appearance occurrences (the best we can). So I thought
@@ -226,7 +226,7 @@ This project is [MIT](https://github.com/Ousret/charset_normalizer/blob/master/L

Characters frequencies used in this project © 2012 [Denny Vrandečić](http://simia.net/letters/)

## For Enterprise
## 💼 For Enterprise

Professional support for charset-normalizer is available as part of the [Tidelift
Subscription][1]. Tidelift gives software development teams a single source for
2 changes: 1 addition & 1 deletion bin/run_autofix.sh
@@ -7,5 +7,5 @@ fi

set -x

${PREFIX}black --target-version=py36 charset_normalizer
${PREFIX}black --target-version=py37 charset_normalizer
${PREFIX}isort charset_normalizer
2 changes: 1 addition & 1 deletion bin/run_checks.sh
@@ -8,7 +8,7 @@ fi
set -x

${PREFIX}pytest
${PREFIX}black --check --diff --target-version=py36 charset_normalizer
${PREFIX}black --check --diff --target-version=py37 charset_normalizer
${PREFIX}flake8 charset_normalizer
${PREFIX}mypy charset_normalizer
${PREFIX}isort --check --diff charset_normalizer
2 changes: 1 addition & 1 deletion charset_normalizer/version.py
@@ -2,5 +2,5 @@
Expose version
"""

__version__ = "3.1.0-dev0"
__version__ = "3.1.0"
VERSION = __version__.split(".")
19 changes: 18 additions & 1 deletion docs/community/faq.rst
@@ -40,7 +40,7 @@ If you use the legacy `detect` function,
Then this change is mostly backward-compatible, with one exception:

- This new library supports far more code pages (x3) than its counterpart Chardet.
- Based on the 30-ich charsets that Chardet support, expect roughly 85% BC results https://github.com/Ousret/charset_normalizer/pull/77/checks?check_run_id=3244585065
- Based on the 30-ich charsets that Chardet support, expect roughly 80% BC results

We do not guarantee this exact BC percentage over time. It may vary, but not by much.

@@ -56,3 +56,20 @@ detection.

Any code page supported by your CPython is supported by charset-normalizer! It is that simple, no need to update the
library. It is as generic as we could make it.
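
To see which codecs your own interpreter ships, a minimal sketch using only the standard library is shown below. The helper name `available_codecs` is hypothetical, and the result also contains non-text transforms (for example base64), so it is a superset of what a charset detector would consider; how charset-normalizer filters its own list is not shown here.

```python
import codecs
from encodings.aliases import aliases


def available_codecs():
    """Return the codec names from the alias table that this interpreter can load."""
    names = set()
    for name in set(aliases.values()):
        try:
            codecs.lookup(name)
        except LookupError:
            continue  # e.g. platform-specific codecs missing on this OS
        names.add(name)
    return sorted(names)


if __name__ == "__main__":
    found = available_codecs()
    print(len(found), "codecs available, for example:", found[:5])
```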

I can't build standalone executable
-----------------------------------

If you are using ``pyinstaller``, ``py2exe`` or a similar tool, you may encounter this error or something close to it:

ModuleNotFoundError: No module named 'charset_normalizer.md__mypyc'

Why?

- Your package manager picked up an optimized (for speed purposes) wheel that matches your architecture and operating system.
- Finally, the module ``charset_normalizer.md__mypyc`` is imported from compiled binaries, so your tool cannot detect it.

How to remedy?

If your bundler program supports it, set up a hook that imports the hidden module.
Otherwise, follow the guide on how to install the vanilla version of this package. (Section: *Optional speedup extension*)
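
For PyInstaller, one way to write such a hook is sketched below. The file name follows PyInstaller's ``hook-<module>.py`` convention; the ``hooks/`` directory it would live in is an assumption, and your PyInstaller installation may already provide an equivalent hook.

```python
# hooks/hook-charset_normalizer.py  (hypothetical location)
#
# PyInstaller reads the module-level ``hiddenimports`` list from hook files
# and bundles the listed modules even though static analysis cannot see them.
hiddenimports = ["charset_normalizer.md__mypyc"]
```

The hook directory is then passed at build time, for example ``pyinstaller --additional-hooks-dir=hooks your_app.py``. Alternatively, ``--hidden-import charset_normalizer.md__mypyc`` on the command line has the same effect without a hook file.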
86 changes: 48 additions & 38 deletions docs/user/support.rst
@@ -124,41 +124,51 @@ Supported Languages
These languages can be detected in your content. All of them are specified in ./charset_normalizer/assets/__init__.py.


English,
German,
French,
Dutch,
Italian,
Polish,
Spanish,
Russian,
Japanese,
Portuguese,
Swedish,
Chinese,
Ukrainian,
Norwegian,
Finnish,
Vietnamese,
Czech,
Hungarian,
Korean,
Indonesian,
Turkish,
Romanian,
Farsi,
Arabic,
Danish,
Serbian,
Lithuanian,
Slovene,
Slovak,
Malay,
Hebrew,
Bulgarian,
Croatian,
Hindi,
Estonian,
Thai,
Greek,
Tamil.
| English,
| German,
| French,
| Dutch,
| Italian,
| Polish,
| Spanish,
| Russian,
| Japanese,
| Portuguese,
| Swedish,
| Chinese,
| Ukrainian,
| Norwegian,
| Finnish,
| Vietnamese,
| Czech,
| Hungarian,
| Korean,
| Indonesian,
| Turkish,
| Romanian,
| Farsi,
| Arabic,
| Danish,
| Serbian,
| Lithuanian,
| Slovene,
| Slovak,
| Malay,
| Hebrew,
| Bulgarian,
| Croatian,
| Hindi,
| Estonian,
| Thai,
| Greek,
| Tamil.
----------------------------
Incomplete Sequence / Stream
----------------------------

It is not (yet) officially supported. If you feed an incomplete byte sequence (e.g. a truncated multi-byte sequence), the detector will
most likely fail to return a proper result.
If you are purposely feeding only part of your payload for performance reasons, you may stop doing so, as this package is fairly optimized.

We are working on a dedicated way to handle streams.
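
To illustrate what a truncated multi-byte sequence looks like, here is a minimal standard-library sketch. It is not how this package handles streams; the helper names are hypothetical, and the incremental decoder is shown only as one generic way to keep the complete prefix when the encoding is already known.

```python
import codecs

full = "naïve café".encode("utf-8")
truncated = full[:-1]  # cuts the final two-byte "é" in half


def strict_decode_ok(data: bytes) -> bool:
    """Return True if the bytes form a complete, valid UTF-8 sequence."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def decode_complete_prefix(data: bytes) -> str:
    """Decode what is complete and keep the dangling bytes buffered."""
    decoder = codecs.getincrementaldecoder("utf-8")()
    return decoder.decode(data, final=False)


if __name__ == "__main__":
    print(strict_decode_ok(full))
    print(strict_decode_ok(truncated))
    print(decode_complete_prefix(truncated))
```

A strict decode of the truncated payload raises ``UnicodeDecodeError``, which is why a detector that relies on decoding can reject the correct encoding when given a cut-off payload.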
