A file with a UTF-8 BOM is detected as 'utf-8' when it should ideally be 'utf-8-sig'. This is really important because lots of tools re-open the file using the detected encoding: 'utf-8-sig' will strip the BOM but 'utf-8' will not, and the leftover BOM will cause breakages.
I realise that utf-8-sig is a Python-ism, but libchardet could provide some extra flags in its results which python-chardet could check to know when to append the -sig.
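To make the breakage concrete, here is a minimal sketch using only Python's built-in codecs. The sample bytes are made up for illustration; no chardet call is involved.

```python
import codecs

# Made-up file contents: a UTF-8 BOM followed by the text "hello".
raw = codecs.BOM_UTF8 + "hello".encode("utf-8")

# Decoding with plain 'utf-8' keeps the BOM as a leading U+FEFF character.
as_utf8 = raw.decode("utf-8")

# Decoding with 'utf-8-sig' strips the BOM.
as_utf8_sig = raw.decode("utf-8-sig")

print(repr(as_utf8))
print(repr(as_utf8_sig))
```

Any tool that compares or parses the first line of `as_utf8` sees the extra U+FEFF character at the start, which is the kind of breakage described above.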
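Until something like that exists, a caller can work around it on their side. The sketch below is only an assumption about how such a check might look: `refine_encoding` is a hypothetical helper, not part of python-chardet or libchardet, and it takes the detected name as a plain string rather than calling the detector.

```python
import codecs


def refine_encoding(raw: bytes, detected: str) -> str:
    """Hypothetical helper: return 'utf-8-sig' when the detector said
    UTF-8 and the data starts with a UTF-8 BOM; otherwise return the
    detected name unchanged."""
    is_utf8 = detected.lower() in ("utf-8", "utf8")
    if is_utf8 and raw.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    return detected


# Usage with made-up data and a made-up detection result.
data = codecs.BOM_UTF8 + b"abc"
print(refine_encoding(data, "utf-8"))
print(refine_encoding(b"abc", "utf-8"))
```

This only covers the UTF-8 case; it would be nicer for the detector to expose the BOM information itself.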
Other differences when compared with other libraries:
UTF-16 variants are detected as 'utf-16le' and 'utf-16be', which is great because many detection libraries just report 'utf-16'. It is odd that this library is reporting the endianness of utf-16, but is not reporting the -sig when the BOM appears in utf-8.
Curly quotes in otherwise-ASCII text are detected as 'windows-1250', which decodes correctly. \o/ Libraries which detect this often detect it as 'windows-1252', but that is an internal/arbitrary choice not based on the input text.
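For comparison, this is how Python's own UTF-16 codecs treat a BOM. It is a sketch with made-up bytes and standard-library codecs only; it says nothing about what this library does internally.

```python
import codecs

# Made-up data: a little-endian UTF-16 BOM followed by "hi".
raw16 = codecs.BOM_UTF16_LE + "hi".encode("utf-16-le")

# The generic 'utf-16' codec reads the BOM to pick the byte order
# and strips it from the result.
generic = raw16.decode("utf-16")

# The endian-specific codec does not look for a BOM, so it is kept
# as a leading U+FEFF character.
explicit = raw16.decode("utf-16-le")

print(repr(generic))
print(repr(explicit))
```

So in Python the endian-specific names behave like plain 'utf-8' with respect to the BOM, while 'utf-16' behaves like 'utf-8-sig'.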
Joungkyun/python-chardet#3
Reported by @jayvdb
UTF-7 problems like those reported at https://github.com/PyYoshi/uchardet/issues/4, and the other BOMs reported there, also occur in this library.