Report utf-8-sig #13

Closed
Joungkyun opened this issue Jul 29, 2019 · 3 comments

@Joungkyun
Owner

Joungkyun/python-chardet#3

Reported by @jayvdb

A file with a UTF-8 BOM is detected as 'utf-8' when it should ideally be 'utf-8-sig'. This matters because many tools re-open the file using the detected encoding: 'utf-8-sig' strips the BOM, but 'utf-8' does not, and the leftover BOM causes breakage.
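To illustrate the difference, here is a minimal sketch using only Python's standard `codecs` module. The sample bytes are made up for the example; they stand in for the contents of a file that starts with a UTF-8 BOM.

```python
import codecs

# Bytes as they would sit on disk: a UTF-8 BOM followed by ASCII text.
raw = codecs.BOM_UTF8 + b"hello"

# Decoding with 'utf-8' keeps the BOM as a leading U+FEFF character.
as_utf8 = raw.decode("utf-8")

# Decoding with 'utf-8-sig' strips the BOM.
as_utf8_sig = raw.decode("utf-8-sig")

print(repr(as_utf8))      # leading '\ufeff' is still present
print(repr(as_utf8_sig))  # plain 'hello'
```

Note that 'utf-8-sig' also decodes input that has no BOM, so a tool that receives 'utf-8-sig' from a detector does not break on BOM-less files.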

I realise that utf-8-sig is a Python-ism, but libchardet could provide some extra flags in its results which python-chardet could check to know when to append the -sig.
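A rough sketch of what such a check could look like on the Python side. The helper name `python_codec_name` and its `bom` parameter are hypothetical and not part of python-chardet; the sketch only assumes the detector reports an encoding name plus a flag saying whether a BOM was seen.

```python
def python_codec_name(encoding: str, bom: bool) -> str:
    """Hypothetical helper: map a detector result to a Python codec name.

    `encoding` is the name reported by the detector and `bom` says whether
    a byte order mark was seen. Both are assumptions for this sketch.
    """
    normalized = encoding.lower().replace("_", "-")
    if bom and normalized in ("utf-8", "utf8"):
        # 'utf-8-sig' is the Python codec that strips a leading BOM.
        return "utf-8-sig"
    # Anything else is passed through unchanged.
    return encoding


print(python_codec_name("UTF-8", True))
print(python_codec_name("UTF-8", False))
```

Keeping the flag separate from the encoding name means callers that do not care about Python codec names still see plain 'UTF-8'.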

Other differences when compared with other libraries:

utf16* are detected as 'utf-16le' and 'utf-16be', which is great because many detection libraries just report 'utf-16'. It is odd that this library is reporting the endianness of utf-16, but is not reporting the -sig when the BOM appears in utf-8.
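For comparison, a small sketch of how Python's own UTF-16 codecs treat a BOM. The sample bytes are made up; the point is that an endianness-specific codec leaves the BOM in the decoded text, much like 'utf-8' does.

```python
import codecs

# A UTF-16 little-endian BOM followed by the text "hi".
raw = codecs.BOM_UTF16_LE + "hi".encode("utf-16-le")

# The generic 'utf-16' codec reads the BOM to pick the byte order
# and removes it from the result.
generic = raw.decode("utf-16")

# The endianness-specific codec treats the BOM as ordinary data,
# so U+FEFF stays at the start of the string.
explicit = raw.decode("utf-16-le")

print(repr(generic))
print(repr(explicit))
```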

Curly quotes in ascii are detected as 'windows-1250', which decodes correctly. \o/ Libraries which detect this often detect it as 'windows-1252', but that is an internal/arbitrary choice not based on the input text.
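A quick check of why either answer decodes correctly: the curly-quote bytes 0x91 to 0x94 map to the same characters in both code pages. The sample bytes below are made up for the illustration.

```python
# Curly double and single quotes as single bytes, mixed with ASCII text.
raw = b"\x93quoted\x94 and \x91single\x92"

# cp1250 and cp1252 are Python's names for windows-1250 and windows-1252.
as_1250 = raw.decode("cp1250")
as_1252 = raw.decode("cp1252")

# Both code pages agree on these bytes, so the text alone
# cannot tell the two encodings apart.
print(as_1250 == as_1252)
```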

UTF-7 problems like those reported at https://github.com/PyYoshi/uchardet/issues/4 , and the other BOMs reported there, also occur in this library.

@Joungkyun Joungkyun self-assigned this Jul 29, 2019
@Joungkyun Joungkyun added this to the 1.0.6 milestone Jul 29, 2019
@Joungkyun
Owner Author

I am familiar with the UTF-8 BOM, and handling it seems easy.

However, there are two problems.

  1. Backward-compatibility problems caused by changing the return value. (utf8 vs utf8-sig)
  2. I do not know which character sets other than UTF8 require BOM processing. Could you tell me which character sets need it?
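As a starting point for the second question, here is a sketch that checks for the BOM constants shipped in Python's `codecs` module. The function name `sniff_bom` and the returned labels are made up for this example, and the table is not exhaustive (the UTF-7 signatures mentioned above are not covered).

```python
import codecs
from typing import Optional

# Longer BOMs come first: the UTF-32 LE BOM (FF FE 00 00) begins with
# the UTF-16 LE BOM (FF FE), so order matters.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32le"),
    (codecs.BOM_UTF32_BE, "utf-32be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16le"),
    (codecs.BOM_UTF16_BE, "utf-16be"),
]


def sniff_bom(data: bytes) -> Optional[str]:
    """Return a label for the BOM at the start of `data`, or None."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None


print(sniff_bom(codecs.BOM_UTF8 + b"hello"))
print(sniff_bom(b"plain ascii"))
```

A check like this only reports that a BOM is present; whether that changes the reported encoding name is the compatibility question raised in the first point.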

@jayvdb

jayvdb commented Jul 29, 2019

I do not know of a real standard that requires a UTF-8 BOM. 'utf-8-sig' is a Python-specific codec.

@Joungkyun
Owner Author

Joungkyun commented Jul 31, 2019

  1. A bom member has been added to the DetectObj structure. If the input has a BOM, it is set to 1.
    diff --git a/src/chardet.h b/src/chardet.h
    index 84975a3..f603a37 100644
    --- a/src/chardet.h
    +++ b/src/chardet.h
    @@ -89,6 +89,7 @@ extern "C" {
        typedef struct DetectObject {
            char * encoding;
            float confidence;
    +       short bom;
        } DetectObj;
    
        CHARDET_API char * detect_version (void);
  2. The following character set is detected by BOM check.

Joungkyun added a commit that referenced this issue Jul 31, 2019