Replace cChardet with something compatible with current Python versions #165

Open
Mr0grog opened this issue Jan 2, 2024 · 0 comments

Mr0grog commented Jan 2, 2024

cChardet is no longer maintained and is not readily compatible with the last two major releases of Python (3.11 and 3.12), so we probably need to replace it: PyYoshi/cChardet#81

I did a bunch of research and testing a few weeks ago on alternatives that I still need to write up, but the bottom line is that there aren’t really any good options. What’s on the table:

  1. Go back to chardet. It’s pure Python and still works, but is not as accurate as other options, is really slow, and is blocking, which is not great.

  2. Switch to charset-normalizer. It is also pure Python and claims drastically improved accuracy and performance over chardet, but this isn’t actually consistent or broadly true in my testing. It’s highly dependent on having encoding declarations in the content being sniffed as a shortcut, and in all other cases is much slower and has similar accuracy to chardet. Since we already check for declarations, we’ll only see the slowest cases here.

    (OTOH, it is sometimes more accurate if the declaration is wrong, since it only treats the declaration as a hint. But there’s some reasonable debate over whether that’s the right thing to do, since it differs from how browsers behave. During testing I also learned a lot about how browsers treat declarations, which is much more complicated and nuanced than I’d realized, and charset-normalizer doesn’t leverage the hints as well as I now understand it could — I should probably file some issues.)

  3. Switch to faust-cchardet, which is a fork of cChardet with patches to make it work in modern Pythons. Unfortunately, it takes over the cchardet import name rather than using its own, which could break things in an environment with other packages that rely on cChardet. The author has expressed some vague interest in taking over cchardet, which would solve the issue, but doesn’t seem to actually be moving forward on it (“Take over the original PyPI project?” faust-streaming/cChardet#32). Absent that, I worry this creates complex dependency issues in any situation where someone installs web-monitoring-diff as a library, or where it is installed in a Python environment alongside other CLI tools.

    I’m also a little concerned that there’s not any strong energy for long-term maintenance on this one, and switching to it could just land us in the same situation as we are currently in.

  4. Switch to chardetng-py, a Python wrapper around chardetng, which is written in Rust and used in Firefox. It is much more accurate than chardet or charset-normalizer, and also much faster (between half as fast and just as fast as cChardet). It supports a much more limited set of encodings, though. These days, browsers generally have a more constrained set of supported encodings and a dedicated spec all about it, so to the extent that we want to act like a browser does, that’s fine.

    One complex downside here is that this requires much more careful handling of the encodings it finds, because the names it returns don’t always indicate the same decoders that Python uses for those names: “Encodings reported by chardetng-py don't always match up to python's decoding” (john-parton/chardetng-py#11).

    I’m also a little concerned that there’s not strong energy for long-term maintenance on this one, and switching to it could just land us in the same situation as we are currently in. It seems there’s not much intent to make updates except for really serious bugs in the underlying chardetng library: “guess_assess() can’t return false for second return value” (hsivonen/chardetng#13), https://github.com/hsivonen/chardetng/pulls
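
To make the name-handling concern in (4) concrete, here is a rough sketch of the kind of translation layer we would probably need between a detected encoding name and a Python codec. The function name `python_codec_for` and the override table are hypothetical, and the specific overrides are my assumptions based on how I understand the WHATWG Encoding Standard labels to differ from Python's codecs of the same name. They are not taken from chardetng-py and would need checking against john-parton/chardetng-py#11 before use.

```python
import codecs

# ASSUMPTION: illustrative overrides only. Each entry maps a WHATWG-style label
# to the Python codec that (as far as I understand) behaves most like the
# browser's decoder for that label. Verify every entry before relying on it.
WHATWG_TO_PYTHON = {
    "gbk": "gb18030",        # browsers decode "gbk" with the gb18030 decoder
    "shift_jis": "cp932",    # browser Shift_JIS includes the Windows extensions
    "euc-kr": "cp949",       # browser EUC-KR covers the Windows-949 superset
    "iso-8859-8-i": "iso8859_8",  # same bytes; only the bidi handling differs
}


def python_codec_for(detected_name: str, fallback: str = "utf-8") -> str:
    """Return a Python codec name for a detected encoding label.

    Falls back to `fallback` when Python has no codec for the label.
    """
    key = detected_name.strip().lower()
    candidate = WHATWG_TO_PYTHON.get(key, key)
    try:
        # codecs.lookup() resolves aliases and raises LookupError if unknown.
        return codecs.lookup(candidate).name
    except LookupError:
        return fallback


if __name__ == "__main__":
    for label in ("GBK", "windows-1252", "not-a-real-encoding"):
        print(label, "->", python_codec_for(label))
```

The design choice here is to keep the override table small and explicit, so any label it does not cover goes straight to Python's own alias lookup and anything Python does not recognize gets a predictable fallback.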

Of the available options, I think (4) is probably best, followed by (2). The biggest problems with (4) are the maintenance concerns and the special handling needed for the encoding names it detects. I’m not super happy with any of these, though. 😞

This is a blocking issue for #128.
