Fix binary detection for text files containing emoji #79

Merged
merged 1 commit into from
Oct 23, 2024

Conversation

Contributor

@zimagen zimagen commented Oct 23, 2024

Problem

The isBinaryFileSync function incorrectly identifies UTF-8 text files with emoji as binary.

Cause

Emoji characters are encoded as multiple bytes in UTF-8, each 0x80 or greater. The function counts these bytes as suspicious, so files containing them are misclassified as binary.
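
To make the cause concrete, here is a minimal sketch of how one emoji ends up as four high bytes. `encodeUtf8` is a hypothetical helper written for illustration only. It is not part of the library and it skips surrogate and range validation.

```typescript
// Hypothetical helper (illustration only): encode one Unicode code point
// as UTF-8 bytes using the standard UTF-8 bit layout.
function encodeUtf8(cp: number): number[] {
  if (cp < 0x80) return [cp]; // 1 byte: plain ASCII
  if (cp < 0x800) return [0xc0 | (cp >> 6), 0x80 | (cp & 0x3f)]; // 2 bytes
  if (cp < 0x10000) {
    // 3 bytes
    return [0xe0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3f), 0x80 | (cp & 0x3f)];
  }
  // 4 bytes: where most emoji live
  return [
    0xf0 | (cp >> 18),
    0x80 | ((cp >> 12) & 0x3f),
    0x80 | ((cp >> 6) & 0x3f),
    0x80 | (cp & 0x3f),
  ];
}

// U+1F600 (grinning face) encodes to f0 9f 98 80.
const grinningFace = encodeUtf8(0x1f600);
console.log(grinningFace.map((b) => b.toString(16)).join(" "));
```

None of the four bytes is below 0x80, so a check that treats every high byte as suspicious counts four suspicious bytes per emoji.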

Fix

The UTF-8 detection logic has been updated:

  • 2-byte UTF-8 characters (0xc0 to 0xdf) are valid if the next byte is 0x80 to 0xbf.
  • 3-byte UTF-8 characters (0xe0 to 0xef) are valid if the next 2 bytes are 0x80 to 0xbf.
  • 4-byte UTF-8 characters (0xf0 to 0xf7) are valid if the next 3 bytes are 0x80 to 0xbf.

Valid UTF-8 characters are no longer counted as suspicious.
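
The rules above can be sketched roughly as follows. This is an illustrative approximation, not the code merged in this PR: the names `utf8SequenceLength` and `countSuspiciousBytes` are hypothetical, and the sketch only looks at high bytes, whereas the real function may apply other heuristics as well.

```typescript
// Hypothetical helper: returns the total length of a well-formed multi-byte
// sequence starting at index i (per the lead-byte ranges listed above),
// or 0 if the bytes at i do not form one.
function utf8SequenceLength(bytes: Uint8Array, i: number): number {
  const lead = bytes[i];
  let extra = 0; // continuation bytes expected after the lead byte
  if (lead >= 0xc0 && lead <= 0xdf) extra = 1;
  else if (lead >= 0xe0 && lead <= 0xef) extra = 2;
  else if (lead >= 0xf0 && lead <= 0xf7) extra = 3;
  else return 0; // not a lead byte (e.g. a stray continuation byte)

  if (i + extra >= bytes.length) return 0; // sequence is cut off by end of data
  for (let k = 1; k <= extra; k++) {
    const b = bytes[i + k];
    if (b < 0x80 || b > 0xbf) return 0; // not a continuation byte
  }
  return extra + 1;
}

// Hypothetical counter: high bytes that belong to a valid sequence are
// skipped; any other high byte is counted as suspicious.
function countSuspiciousBytes(bytes: Uint8Array): number {
  let suspicious = 0;
  let i = 0;
  while (i < bytes.length) {
    if (bytes[i] < 0x80) {
      i++; // ASCII: ignored in this sketch
      continue;
    }
    const len = utf8SequenceLength(bytes, i);
    if (len > 0) {
      i += len; // valid multi-byte character: skip it whole
    } else {
      suspicious++;
      i++;
    }
  }
  return suspicious;
}

// "hi " followed by U+1F600 (f0 9f 98 80): no suspicious bytes.
const textWithEmoji = new Uint8Array([0x68, 0x69, 0x20, 0xf0, 0x9f, 0x98, 0x80]);
console.log(countSuspiciousBytes(textWithEmoji)); // 0
```

Skipping the whole sequence at once matters: if the loop advanced one byte at a time, the continuation bytes would be examined on their own and counted as suspicious again.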

Result

Text files with emoji are now correctly identified as text, not binary.

Let me know if you have any further questions!

@gjtorikian
Owner

Makes sense to me--thanks! Will ship a patch release with this.

@gjtorikian gjtorikian merged commit 856dfd9 into gjtorikian:main Oct 23, 2024
1 check passed