Input truncated at first NUL byte #27

Closed
jeffrey-easyesi opened this issue Apr 11, 2017 · 3 comments

Comments

@jeffrey-easyesi

When the byte string passed to detect or feed contains a NUL (\x00) byte, none of the bytes after it are fed to the detector. As a result, binary files are often detected as ASCII with high confidence:

>>> cchardet.detect(open('/bin/sh', 'rb').read())
{'confidence': 1.0, 'encoding': 'ASCII'}

Possibly relevant: https://github.com/cython/cython/wiki/FAQ#how-to-pass-string-buffers-that-may-contain-0-bytes-to-cython
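
To illustrate the suspected mechanism, here is a minimal sketch using only the standard library. It assumes (this is not confirmed anywhere in this thread) that the binding hands the buffer to C as a NUL-terminated string; ctypes.c_char_p stands in for that conversion and is not cChardet's actual code path.

```python
import ctypes

data = b"ABC\x00\x80\x81"

# c_char_p treats the buffer as a NUL-terminated C string,
# so .value stops at the first \x00 byte.
seen_by_c = ctypes.c_char_p(data).value

# The full buffer contains bytes outside the ASCII range;
# the truncated view does not.
full_is_ascii = all(b < 0x80 for b in data)
prefix_is_ascii = all(b < 0x80 for b in seen_by_c)

print(seen_by_c, full_is_ascii, prefix_is_ascii)
```

If the detector only ever sees the truncated prefix, reporting ASCII with confidence 1.0 is the expected outcome.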

@PyYoshi
Owner

PyYoshi commented Apr 12, 2017

Hi @jeffrey-easyesi.
Thank you for reporting!

I'll check this issue.

Thanks.

@PyYoshi
Owner

PyYoshi commented Apr 14, 2017

Hi @jeffrey-easyesi.
Please tell me the version of Python and cChardet you are using.

I tried it with cChardet v2.0.0:

>>> import sys, cchardet
>>> sys.version_info
sys.version_info(major=3, minor=6, micro=0, releaselevel='final', serial=0)
>>> cchardet.detect(open('/bin/sh', 'rb').read())
{'encoding': None, 'confidence': None}
>>> cchardet.detect(open('/usr/bin/vim', 'rb').read())
{'encoding': None, 'confidence': None}

@jeffrey-easyesi
Author

jeffrey-easyesi commented Apr 14, 2017

Python 3.5.2, cChardet v2.0.0

I shouldn't have used /bin/sh as the only example, since its contents differ between OSes. Try an explicit byte string such as b'ABC\x00\x80\x81':

>>> import sys, cchardet
>>> sys.version_info
sys.version_info(major=3, minor=5, micro=2, releaselevel='final', serial=0)
>>> cchardet.__version__
'2.0.0'
>>> cchardet.detect(b'ABC\x00\x80\x81')
{'confidence': 1.0, 'encoding': 'ASCII'}
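
Until the truncation is fixed, a caller-side guard is one possible workaround. The sketch below is an assumption, not part of cChardet: detect_with_nul_guard is a hypothetical helper, and the detector is passed in as a callable (for example cchardet.detect) so the snippet runs without the library installed. It returns the same "unknown" shape shown earlier in this thread.

```python
def detect_with_nul_guard(data, detect):
    """Hypothetical wrapper: refuse to guess when the input contains NUL bytes.

    `detect` is any callable taking bytes and returning a dict,
    e.g. cchardet.detect.
    """
    if b"\x00" in data:
        # Bytes after the first NUL may never reach the detector,
        # so report "unknown" instead of a misleading result.
        return {"encoding": None, "confidence": None}
    return detect(data)


# Stand-in detector used only for demonstration; it always answers ASCII.
def fake_detect(data):
    return {"encoding": "ASCII", "confidence": 1.0}


print(detect_with_nul_guard(b"ABC\x00\x80\x81", fake_detect))
print(detect_with_nul_guard(b"ABC", fake_detect))
```

Note the trade-off: UTF-16 and UTF-32 text legitimately contains NUL bytes, so this guard would report such input as unknown as well.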
