Should `utils.iana_name()` return the actual IANA name? #572

wosc · 2024-12-05T13:26:40Z

`charset_normalizer.utils.iana_name('utf-8')` returns `'utf_8'`, which does not appear at all on https://www.iana.org/assignments/character-sets/character-sets.xhtml -- it's called UTF-8 there, or possibly utf-8 (as the table notes, "no distinction is made between use of upper and lower case letters").

(The concrete use case that brought this up was serving arbitrary files over HTTP and generating an appropriate `content-type: text/plain; charset=UTF-8` header for them. I was quite surprised to get `charset=utf_8` instead, which browsers don't understand and therefore interpret wrongly.)
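To make the header use case concrete, here is a minimal sketch of one possible workaround using only the standard library. The helper names `header_charset` and `content_type` and the `_PREFERRED` table are hypothetical and are not part of charset_normalizer; the table is hand-picked and far from exhaustive, and the fallback to Python's own codec name is an assumption that will not always match the IANA registry.

```python
import codecs

# Hand-picked, illustrative overrides keyed by Python's canonical codec name.
# This table is an assumption for the sketch, not an official mapping.
_PREFERRED = {
    "utf-8": "UTF-8",
    "ascii": "US-ASCII",
}


def header_charset(name: str) -> str:
    """Return a charset label for a Content-Type header (hypothetical helper)."""
    # codecs.lookup() accepts spellings such as "utf_8", "utf-8" or "UTF8"
    # and raises LookupError for encodings Python does not know.
    info = codecs.lookup(name)
    # info.name is Python's canonical codec name, e.g. "utf-8" for UTF-8.
    # Fall back to it when no preferred label is listed above.
    return _PREFERRED.get(info.name, info.name)


def content_type(mime: str, encoding: str) -> str:
    """Build a Content-Type header value (hypothetical helper)."""
    return f"{mime}; charset={header_charset(encoding)}"


print(content_type("text/plain", "utf_8"))
```

For encodings missing from the table, the fallback label should be checked against the IANA registry before it is sent to a browser.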

I've looked at the current implementation, which is based on `encodings.aliases` from the stdlib -- but that module explicitly talks about normalizing the names beforehand, because, AFAIU, it is meant to look up Python modules, whose syntax rules are quite different from those of IANA encoding names. So I'm not sure that's actually an appropriate data source for this use case, or am I completely misunderstanding something here? I'd be grateful for any light that someone could shed on this.
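For reference, the stdlib behaviour described above can be reproduced in a few lines. This is only a sketch of that kind of lookup, not charset_normalizer's actual code; `python_codec_name` is a hypothetical name introduced for illustration.

```python
import encodings
from encodings.aliases import aliases


def python_codec_name(label: str) -> str:
    """Map an encoding label to Python's module-style name (hypothetical helper)."""
    # normalize_encoding() collapses runs of non-alphanumeric characters
    # (other than ".") into a single underscore, so "utf-8" becomes "utf_8".
    normalized = encodings.normalize_encoding(label).lower()
    # The aliases table maps already-normalized aliases to codec module names;
    # labels that are not aliases are returned unchanged.
    return aliases.get(normalized, normalized)


print(python_codec_name("utf-8"))
print(python_codec_name("UTF8"))
```

Both calls give a name with an underscore, because the result identifies a module under `encodings`, where a hyphen would not be a valid identifier.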

Ousret · 2024-12-24T10:12:35Z

Your analysis is correct. The function name is misleading.
Returning the IANA name was the original intent when the project started, but the function was later adapted to return the Python-normalized name instead, since it mostly serves Python users. We've updated the docstring nonetheless.

In general, we recommend not using or relying on our internal functions. Our documentation clearly states that only "top-level" imported functions are covered by our BC policy. But I suspect that you encountered this in a specific case[...]

> I was quite surprised to get charset=utf_8 instead

We've also fixed the part that converts to Unicode bytes, where you would get `utf_8` instead of the preferred `utf-8`.
The fix will be available in the next version.

Regards,

wosc added the enhancement label Dec 5, 2024

Ousret added a commit that referenced this issue Dec 24, 2024

📝 update utils.iana_name function to reflect his purpose

3582e62

address #572

Ousret added a commit that referenced this issue Dec 24, 2024

🐛 output(...) replace declarative mark using non iana compliant encod…

14b4649

…ing name close #572

Ousret closed this as completed Dec 24, 2024
