Don't error out on compressed files or on non-Unicode files #7

Open: joshkel wants to merge 3 commits into landscapeio:master from joshkel:master


@joshkel joshkel commented Aug 23, 2015

Dodgy ignored the second element of mimetypes.guess_type's result, so it would try to process files that were gzipped, bzipped, etc.
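
For readers unfamiliar with the return value: `mimetypes.guess_type` returns a `(type, encoding)` pair, and the second element is set for names such as `site.css.gz`. A minimal sketch of checking it follows; `should_skip` is a hypothetical helper used for illustration, not a function from Dodgy.

```python
import mimetypes


def should_skip(path):
    """Return True when the file name suggests compressed content.

    Hypothetical helper for illustration; not part of Dodgy's code.
    """
    # guess_type returns (type, encoding). For "site.css.gz" the type is
    # still "text/css", so only the encoding reveals the compression.
    _mime_type, encoding = mimetypes.guess_type(path)
    return encoding is not None


if __name__ == "__main__":
    for name in ("static/site.css", "static/site.css.gz"):
        print(name, mimetypes.guess_type(name), should_skip(name))
```

The check is based on the file name alone, so a misnamed file can still slip through.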

In Python 3, trying to read from a text file may throw a UnicodeDecodeError if any encoding errors are encountered.

These two issues together meant that Dodgy would throw an exception and abort when it ran on my Django project, which had some gzipped CSS. This PR offers a (fairly minimal) fix for these issues.

In Python 3, opening a file in text mode and reading it may throw
UnicodeDecodeErrors.  This adds handling for this and reports any such
errors as `unicode_decode_error` messages in Dodgy's results.
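
A rough sketch of the handling described above, not the code in this PR: `read_text_or_message` and the shape of the message dictionary are assumptions made for illustration, and UTF-8 is passed explicitly here so that the behavior does not depend on the locale.

```python
def read_text_or_message(path):
    """Return (text, None) on success or (None, message) on a decode failure.

    Hypothetical helper; the message layout is an assumption, not Dodgy's format.
    """
    try:
        # In Python 3, text-mode reads decode the bytes and may raise.
        with open(path, "r", encoding="utf-8") as handle:
            return handle.read(), None
    except UnicodeDecodeError as exc:
        message = {
            "path": path,
            "code": "unicode_decode_error",
            "message": str(exc),
        }
        return None, message
```

Returning a message instead of raising lets the caller keep scanning the remaining files in the tree.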

This means that running Dodgy under Python 3 will be pickier than
running it under Python 2, since Python 2 doesn't really care about
encodings.  This probably isn't ideal, but it at least keeps Dodgy from
crashing on an entire project tree if one file has a bad encoding or has
its file type mis-detected.

This prevents spurious UnicodeDecodeErrors in Python 3.

Adding handling for compressed files would not be hard (using gzip, bz2,
and optionally lzma libraries), but there's probably little benefit,
since compressed files in a project tree are likely either from an
upstream source or have an uncompressed version available for testing.
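
For reference, a sketch of what such handling might look like with the standard library. This is not part of the PR; `open_text` and the `_OPENERS` table are hypothetical names, and the encoding labels are the ones `mimetypes.guess_type` reports for `.gz`, `.bz2`, and `.xz` names.

```python
import bz2
import gzip
import lzma
import mimetypes

# Maps the encoding reported by mimetypes.guess_type to a module-level open().
_OPENERS = {"gzip": gzip.open, "bzip2": bz2.open, "xz": lzma.open}


def open_text(path):
    """Open `path` for reading text, decompressing when the name calls for it.

    Hypothetical helper; raises ValueError for encodings it does not know.
    """
    _mime_type, encoding = mimetypes.guess_type(path)
    if encoding is None:
        return open(path, "rt", encoding="utf-8")
    opener = _OPENERS.get(encoding)
    if opener is None:
        raise ValueError("unsupported content encoding: %s" % encoding)
    # gzip.open, bz2.open and lzma.open all accept a text mode and an encoding.
    return opener(path, "rt", encoding="utf-8")
```

Decoding errors can still occur after decompression, so a caller would need the same `UnicodeDecodeError` handling as for plain files.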

@landscape-bot

Code Health
Repository health decreased by 0.07% when pulling 6282992 on joshkel:master into 589b272 on landscapeio:master.

@jamadden

I am also running into this issue... in a bundled .c file of all places! The `file` utility identifies the encoding of the .c file as ISO-8859, and it can't be successfully decoded as UTF-8.
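
To illustrate the failure mode: a byte such as `0xE9` is a complete character in ISO-8859-1 but is not valid UTF-8 when followed by an ASCII byte. The sample bytes below are made up for the example and are not taken from the actual .c file; `decodes_as` is a hypothetical helper.

```python
def decodes_as(data, encoding):
    """Return True if the bytes decode cleanly with the given encoding."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True


# Made-up sample: a C string constant containing a Latin-1 e-acute (0xE9).
SAMPLE = b'const char *name = "Andr\xe9";'

if __name__ == "__main__":
    print("utf-8:", decodes_as(SAMPLE, "utf-8"))
    print("iso-8859-1:", decodes_as(SAMPLE, "iso-8859-1"))
```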

jamadden added a commit to gevent/gevent that referenced this pull request Sep 13, 2016
Not in iso-8859 which is what it was identified as. There were
characters in a string constant that couldn't be decoded using utf-8. I
believe this will be a basically compatible change due to the nature of
the two encoding systems.

Workaround for prospector-dev/dodgy#7