Report utf-8-sig #13

Joungkyun · 2019-07-29T15:05:24Z

Reported by @jayvdb

A file with a UTF-8 BOM is detected as 'utf-8' when it should ideally be 'utf-8-sig'. This matters because many tools re-open the file using the detected encoding: 'utf-8-sig' strips the BOM, but 'utf-8' does not, and the leftover BOM causes breakage.
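
To illustrate the difference described above, here is a minimal sketch using only Python's standard codecs. The sample payload is made up for the example; it is not taken from any real file.

```python
import codecs

# A small UTF-8 payload preceded by the UTF-8 BOM (EF BB BF).
data = codecs.BOM_UTF8 + "id,name\n".encode("utf-8")

# 'utf-8' keeps the BOM as a leading U+FEFF character.
plain = data.decode("utf-8")

# 'utf-8-sig' strips the BOM if one is present.
clean = data.decode("utf-8-sig")

# The first CSV field differs only because of the leftover BOM.
print(repr(plain.split(",")[0]))
print(repr(clean.split(",")[0]))
```

A tool that compares the first field against "id" would fail on the 'utf-8' result, which is the kind of breakage meant here.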

I realise that utf-8-sig is a Python-ism, but libchardet could provide some extra flags in its results which python-chardet could check to know when to append the -sig.
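
A rough sketch of what that check could look like on the Python side. The helper name `python_codec_name` and its `bom` argument are hypothetical; this is not python-chardet's actual API, only an illustration of the idea.

```python
def python_codec_name(encoding: str, bom: bool) -> str:
    """Map a detector result to a Python codec name (hypothetical helper)."""
    normalized = encoding.lower().replace("_", "-")
    # Python only has a dedicated BOM-stripping "-sig" codec for UTF-8.
    if bom and normalized in ("utf-8", "utf8"):
        return "utf-8-sig"
    # Every other result is passed through unchanged.
    return encoding


print(python_codec_name("UTF-8", True))
print(python_codec_name("UTF-8", False))
```

Keeping the detected name unchanged and deriving the "-sig" variant in the binding would leave the C library's return values free of the Python-specific name.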

Other differences when compared with other libraries:

utf16* are detected as 'utf-16le' and 'utf-16be', which is great because many detection libraries just report 'utf-16'. It is odd that this library is reporting the endianness of utf-16, but is not reporting the -sig when the BOM appears in utf-8.
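
For comparison, a short sketch of how Python's own UTF-16 codecs treat a BOM. It uses only standard codecs, and the sample text is invented for the example.

```python
import codecs

# "hi" encoded as little-endian UTF-16, preceded by the matching BOM (FF FE).
data = codecs.BOM_UTF16_LE + "hi".encode("utf-16-le")

# The generic codec uses the BOM to pick the endianness and then drops it.
generic = data.decode("utf-16")

# The endian-specific codec keeps the BOM as a leading U+FEFF character.
explicit = data.decode("utf-16-le")

print(repr(generic))
print(repr(explicit))
```

So a caller that re-opens a file as 'utf-16-le' faces the same leftover BOM as with 'utf-8', which is why knowing whether a BOM was present is useful beyond UTF-8.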

Curly quotes in ascii are detected as 'windows-1250', which decodes correctly. \o/ Libraries which detect this often detect it as 'windows-1252', but that is an internal/arbitrary choice not based on the input text.
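
As a quick check of why either answer decodes correctly, the sketch below decodes the curly-quote bytes with both code pages using Python's standard codecs. The sample bytes are made up for the example.

```python
# Bytes 0x91-0x94 are the curly single and double quotes in both code pages.
data = b"\x93quoted\x94 and \x91single\x92"

as_1250 = data.decode("cp1250")
as_1252 = data.decode("cp1252")

# Both decodings give the same text for this input.
print(as_1250 == as_1252)
```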

UTF-7 problems like those reported at https://github.com/PyYoshi/uchardet/issues/4, and the other BOMs reported there, also occur in this library.

Joungkyun · 2019-07-29T15:13:38Z

I am familiar with the UTF-8 BOM, and this seems easy to handle.

However, there are two problems.

- Legacy problems caused by changing the return value (utf8 vs utf8-sig).
- I do not know which character sets other than UTF-8 need BOM processing. Please let me know which character sets need it.

jayvdb · 2019-07-29T16:25:21Z

I do not know of any real standard that requires a UTF-8 BOM. utf-8-sig is a Python-specific codec.

Joungkyun · 2019-07-31T18:09:13Z

A bom member has been added to the DetectObj structure. It is set to 1 if the input has a BOM.

diff --git a/src/chardet.h b/src/chardet.h
index 84975a3..f603a37 100644
--- a/src/chardet.h
+++ b/src/chardet.h
@@ -89,6 +89,7 @@ extern "C" {
    typedef struct DetectObject {
        char * encoding;
        float confidence;
+       short bom;
    } DetectObj;

    CHARDET_API char * detect_version (void);

The following character sets are detected by BOM check:
- BOCU-1, GB-18030, SCSU, UTF-1, UTF-7, UTF-EBCDIC
- https://en.wikipedia.org/wiki/Byte_order_mark#Byte_order_marks_by_encoding
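
A minimal sketch of a signature-based BOM check, using the byte sequences from the Wikipedia table linked above. This is not libchardet's implementation; the `BOMS` table and `sniff_bom` function are hypothetical names for illustration. The UTF-7 entry is simplified to its three-byte prefix, ignoring the fourth byte that follows it.

```python
import codecs

# Longer signatures come first, so UTF-32LE (FF FE 00 00) is tried
# before UTF-16LE (FF FE), which is a prefix of it.
BOMS = [
    (codecs.BOM_UTF32_BE, "UTF-32BE"),
    (codecs.BOM_UTF32_LE, "UTF-32LE"),
    (b"\x84\x31\x95\x33", "GB-18030"),
    (b"\xdd\x73\x66\x73", "UTF-EBCDIC"),
    (codecs.BOM_UTF8, "UTF-8"),
    (b"\x2b\x2f\x76", "UTF-7"),  # simplified: the fourth byte is not checked
    (b"\xf7\x64\x4c", "UTF-1"),
    (b"\x0e\xfe\xff", "SCSU"),
    (b"\xfb\xee\x28", "BOCU-1"),
    (codecs.BOM_UTF16_BE, "UTF-16BE"),
    (codecs.BOM_UTF16_LE, "UTF-16LE"),
]


def sniff_bom(data: bytes):
    """Return the name of the encoding whose BOM starts `data`, or None."""
    for signature, name in BOMS:
        if data.startswith(signature):
            return name
    return None


print(sniff_bom(codecs.BOM_UTF8 + b"abc"))
print(sniff_bom(b"plain ascii"))
```

One known ambiguity of this approach: UTF-16LE text whose first character after the BOM is U+0000 starts with the same four bytes as the UTF-32LE BOM.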

Joungkyun added enhancement suggest labels Jul 29, 2019

Joungkyun self-assigned this Jul 29, 2019

Joungkyun added this to the 1.0.6 milestone Jul 29, 2019

Joungkyun closed this as completed in 2738494 Jul 31, 2019

Joungkyun added a commit that referenced this issue Jul 31, 2019

fixed #13 Report utf-8-sig

da0a1a0
