Should utils.iana_name() return the actual IANA name? #572

Closed
wosc opened this issue Dec 5, 2024 · 1 comment
Labels
enhancement New feature or request

Comments

@wosc

wosc commented Dec 5, 2024

charset_normalizer.utils.iana_name('utf-8') returns 'utf_8', which does not appear at all on https://www.iana.org/assignments/character-sets/character-sets.xhtml -- it's called UTF-8 there, or possibly utf-8 (as the table notes, "no distinction is made between use of upper and lower case letters").

(The concrete use case that brought this up was serving arbitrary files over HTTP and generating an appropriate content-type: text/plain; charset=UTF-8 header for them. I was quite surprised to get charset=utf_8 instead, which browsers don't understand and then interpret wrongly.)

I've looked at the current implementation, which is based on encodings.aliases from the stdlib -- but that explicitly talks about normalizing the names beforehand, because it is meant to look up Python modules AFAIU, whose syntax rules are quite different from those of the IANA encoding names. So I'm not sure whether that's actually an appropriate data source for this use case, or am I completely misunderstanding something here? I'll be grateful for any light that someone could shed on this.
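
To make the mismatch concrete, here is a minimal sketch using only the standard library. The helper names (http_charset_label, content_type) and the PREFERRED_LABELS table are hypothetical and hand-written for illustration; they are not part of charset_normalizer, and the table covers only three codecs.

```python
import codecs
from encodings.aliases import aliases

# The stdlib table maps normalized aliases to Python codec *module* names,
# which is why underscores show up instead of hyphens.
print(aliases["utf8"])    # utf_8
print(aliases["latin1"])  # latin_1

# Hypothetical, hand-maintained overrides: Python codec name -> label as
# written in the IANA registry. Extend as needed; this is not exhaustive.
PREFERRED_LABELS = {
    "utf_8": "UTF-8",
    "latin_1": "ISO-8859-1",
    "ascii": "US-ASCII",
}


def http_charset_label(python_name: str) -> str:
    """Return a charset label suitable for a Content-Type header (sketch)."""
    key = python_name.lower().replace("-", "_")
    if key in PREFERRED_LABELS:
        return PREFERRED_LABELS[key]
    # Fallback assumption: CodecInfo.name is Python's own canonical spelling.
    # It is NOT guaranteed to be a registered IANA name.
    # codecs.lookup() raises LookupError for unknown encodings.
    return codecs.lookup(python_name).name


def content_type(mime: str, python_name: str) -> str:
    """Build a Content-Type header value with a charset parameter."""
    return f"{mime}; charset={http_charset_label(python_name)}"


print(content_type("text/plain", "utf_8"))  # text/plain; charset=UTF-8
```

An explicit table is used here because neither encodings.aliases nor the codec registry promises IANA spellings; anything outside the table falls back to whatever name Python reports.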

@wosc wosc added the enhancement New feature or request label Dec 5, 2024
@Ousret
Member

Ousret commented Dec 24, 2024

Your analysis is correct. The function name is misleading.
This "was" the original intent when the project started, but the function was later "adapted" to return the Python-normalized name instead, as it mostly serves Python users. We've updated the docstring nonetheless.

In general, we recommend not relying on our internal functions. Our documentation clearly states that only "top-level" imported functions are covered by our BC policy. But I suspect that you encountered this in a specific case[...]

I was quite surprised to get charset=utf_8 instead

We've addressed the "converting" to Unicode bytes part, where you'd get utf_8 instead of the preferred utf-8.
The fix will be available in the next version.

Regards,

@Ousret Ousret closed this as completed Dec 24, 2024