Prefer defaulting to utf-8, rather than using chardet autodetect #1078
Conversation
What would your thoughts be on something like this?

```python
import charset_normalizer

normalize_encoding = lambda content: charset_normalizer.detect(content)["encoding"]

with Client(encoding=normalize_encoding) as client:
    response = client.get(...)
    print(response.text)
```

This could short-circuit the logic above, or just be added as a fallback if the rest fail, depending on how much trust/speed trade-off there should be. Admittedly, this could be done as you stated above, like:

```python
import charset_normalizer

response = client.get(...)
if response.charset_encoding is None:
    response.encoding = charset_normalizer.detect(response.content)["encoding"]
print(response.text)
```

but having it on the Client would be a nice-to-have, to avoid repeating the same if block everywhere.

Apologies if I missed anything or am jumping the gun. Seeing the last bit of your post makes me think that it would be an abstract class to implement, but I think it could also just be done with a simple function.
Need to ponder on this, but I have an initial question/comment. :-)
content = "おはようございます。".encode("utf-8") | ||
response = httpx.Response(200, content=content, headers=headers, request=REQUEST) | ||
assert response.text == "おはようございます。" | ||
assert response.encoding == "EUC-JP" | ||
assert response.encoding == "utf-8" |
So, what if we do get a `"おはようございます。".encode("EUC-JP")`, and the server fails to include `; charset=EUC-JP`?

I assume we'll try to `.decode("utf-8")`, and that it will fail?

```python
>>> text = "おはようございます。".encode("EUC-JP")
>>> text.decode("utf-8")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa4 in position 0: invalid start byte
```

In that case I assume we want to allow the user to provide some kind of manual encoding (e.g. if they know it's some exotic encoding). Most likely it would just be a recommendation to use something like this...

```python
text = response.content.decode("EUC-JP")
```

This is mostly fine, I suppose. But then, we probably also want to provide more helpful error messaging than the `UnicodeDecodeError` above, correct? For example, a `DecodingError` with an error message saying "consider decoding `.content` manually" (we may have this already, but not sure?).
Rather than...

```python
text = response.content.decode("EUC-JP")
print(text)
```

You'd use...

```python
response.encoding = "EUC-JP"
print(response.text)
```
So after having a dig into this, I think we clearly want to update our strategy here.

For example, HTML5 now recommends cp1252 ("Windows 1252") as the fallback encoding. It's a good choice because, like iso-8859-1, it also encompasses the full byte range, but its printable characters are a superset of the printable characters in iso-8859-1.

I asked for a bit of dev feedback on Twitter, mostly to try to get an idea about which non-utf-8 charsets we might expect. There's also some useful statistical info here, and here. Note that Windows 1251 is more prevalent than Windows 1252, but it's not a good fallback because the latter is aligned with ISO-8859-1. (The printable characters in cp1252 are a superset of the printable characters in iso-8859-1.)

Useful info from the Twitter question...
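As a quick standalone illustration of the "superset of printable characters" point (not anything from this PR): bytes in the 0x80-0x9F block decode to invisible C1 control characters under iso-8859-1, but to useful printable characters under cp1252.

```python
# Bytes in the 0x80-0x9F block: control characters in iso-8859-1,
# printable punctuation/symbols in cp1252.
sample = bytes([0x80, 0x93, 0x94, 0x99])

print(sample.decode("iso-8859-1"))  # four invisible C1 control characters
print(sample.decode("cp1252"))      # €, “, ”, ™
```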
So, based on this... I can think of 3 possible strategies that we might want to adopt here:

1. Default to utf-8, falling back to cp1252 when no charset is given. Which would be nice and simple. Users can in any case use chardet themselves if they want autodetection.
2. Keep chardet autodetection when no charset is given. Bit fancy. Potentially awkward in streaming text cases, since chardet will buffer up before returning a charset.
3. Also factor the machine locale into the fallback. Could be a bit more resilient than the first case, since machine locale may sometimes be a decent proxy onto possible text encodings of the URL being accessed by the client. But might be a bit of a non-obvious behaviour difference between devs testing stuff locally vs. deployed to a server, which may well not have the same locale.

Any thoughts?
Here's a bit of provisional documentation for how things might look if we adopted the first of those three options, but also allowed the text encodings to be configurable...

Text Encodings

When accessing `response.text`, the character set encoding is determined as follows. UTF-8 is now very widely adopted, so if the character set has not been specified then we always attempt that as our first choice. Windows 1252 is a superset of ISO-8859-1, and includes encodings for the full byte range (0-255), so it will always successfully return a unicode text string, for any input. This makes it a good fallback option. The HTML5 spec recommends decoding as Windows 1252 when the encoding is not known.

In some cases you might need to override the default behaviour, to deal with resources that do not include character set information and are not utf-8 encoded. Here are some scenarios for how you might override this behaviour:

Explicitly override the encoding...

```python
response = client.get(...)
response.encoding = ...
print(response.text)
```

Explicitly override the encoding, if no charset info is included...

```python
response = client.get(...)
if response.charset_encoding is None:
    response.encoding = ...
print(response.text)
```

Auto-determine the encoding using chardet...

```python
import chardet

response = client.get(...)
if response.charset_encoding is None:
    response.encoding = chardet.detect(response.content)["encoding"]
print(response.text)
```

Use a different fallback encoding...

```python
# Configure client to expect Russian language resources.
# Try UTF-8 first, then fall back to Windows 1251.
client = Client(encodings=['utf-8', 'cp1251'])
response = client.get(...)
print(response.text)
```

Use a fallback encoding based on the user locale...

```python
import locale

# Configure client to try the user locale encoding,
# if it is not already one of the two default options.
locale_lang, locale_encoding = locale.getdefaultlocale()
encodings = ['utf-8', 'cp1252']
if locale_encoding.lower() not in encodings:
    encodings = ['utf-8', locale_encoding, 'cp1252']
client = Client(encodings=encodings)
response = client.get(...)
print(response.text)
```
I think the first option + detailed docs for the more exotic cases (exactly like the above) should get us covered, right? In particular I like how the default behavior would be really simple while accommodating most situations, and then how the "manual mode" docs seem to allow dealing with any special cases.
Also useful context... psf/requests#2086
I'm in favor of Option 2, but in the name of speed + simplicity, Option 1 + detailed docs is the best.

Unfortunately the docs you wrote aren't quite copy-and-pastable, as there's a `self.content` that should be `response.content`:

```diff
- response.encoding = chardet.detect(self.content)["encoding"]
+ response.encoding = chardet.detect(response.content)["encoding"]
```

You may also want to list a couple of the charset-detection modules, to give users more choice (those especially interested in speed may not prefer `chardet`).
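For reference, the alternative detectors expose a chardet-compatible `detect()` function, so any of them could be shown as a drop-in swap in the docs snippet above. A rough sketch (the URL and package choice here are purely illustrative):

```python
import charset_normalizer  # or `import cchardet`, a faster alternative
import httpx

with httpx.Client() as client:
    # Any response that omits charset information would do here.
    response = client.get("https://www.example.com/")
    if response.charset_encoding is None:
        # charset_normalizer.detect() returns a chardet-style result dict.
        response.encoding = charset_normalizer.detect(response.content)["encoding"]
    print(response.text)
```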
Switching this to a draft since I don't think it's quite the right implementation.
Closes #1018
Not marking this as a "draft" since it is ready for review, but we shouldn't merge it without prior discussion.
Drops `chardet` autodetection in favour of `utf-8` as the default.

This changes our behaviour from:

- Use `.encoding`, if it has been explicitly set.
- Use the `charset` included on the Content-Type.
- Use chardet autodetection for a `text/*` Content-Type without a charset.

To...

- Use `.encoding`, if it has been explicitly set.
- Use the `charset` included on the Content-Type.
- Use utf-8 for a `text/*` Content-Type without a charset.

We would probably want to document how to use chardet for autodetection, at least for the simpler non-streaming case.

Users can replicate the previous behaviour with...
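Presumably something along the lines of the chardet snippet from the provisional docs above (a reconstruction, echoing that example rather than quoting this PR's original text):

```python
import chardet

response = client.get(...)
if response.charset_encoding is None:
    response.encoding = chardet.detect(response.content)["encoding"]
print(response.text)
```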
We might also consider a TextDecoder API at some point, in order to allow users to customize text decoding, including the streaming case, which would allow us to support e.g. chardet or HTML meta tag detection options.

What are our thoughts on switching to utf-8 as the fallback case, rather than using chardet autodetection?
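For the TextDecoder idea above, here's a purely illustrative sketch of what a pluggable, streaming-friendly hook might look like. None of these names exist in httpx; it's just one possible shape for such an API.

```python
import codecs


class TextDecoder:
    """Interface for incrementally turning response bytes into text."""

    def decode(self, data: bytes) -> str:
        raise NotImplementedError()

    def flush(self) -> str:
        raise NotImplementedError()


class UTF8FallbackDecoder(TextDecoder):
    """Decode as utf-8, replacing undecodable bytes rather than raising."""

    def __init__(self) -> None:
        self._decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")

    def decode(self, data: bytes) -> str:
        return self._decoder.decode(data)

    def flush(self) -> str:
        return self._decoder.decode(b"", final=True)


# A chardet-based implementation would instead buffer chunks in decode()
# and only emit text once enough bytes have been seen to guess the charset.
```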