Fix binary detection for text files containing emoji #79

zimagen · 2024-10-23T08:43:50Z

Problem

The isBinaryFileSync function incorrectly identifies UTF-8 text files with emoji as binary.

Cause

Emoji characters are represented using multiple bytes in UTF-8, each greater than 0x80. The function counts these as suspicious, leading to misclassification.

Fix

The UTF-8 detection logic has been updated:

2-byte UTF-8 characters (0xc0 to 0xdf) are valid if the next byte is 0x80 to 0xbf.
3-byte UTF-8 characters (0xe0 to 0xef) are valid if the next 2 bytes are 0x80 to 0xbf.
4-byte UTF-8 characters (0xf0 to 0xf7) are valid if the next 3 bytes are 0x80 to 0xbf.

Valid UTF-8 characters are no longer counted as suspicious.

Result

Text files with emoji are now correctly identified as text, not binary.

Let me know if you have any further questions!

gjtorikian · 2024-10-23T16:07:50Z

Makes sense to me--thanks! Will ship a patch release with this.

Fix binary detection for text files containing emoji

e2bb5ec

gjtorikian merged commit 856dfd9 into gjtorikian:main Oct 23, 2024
1 check passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix binary detection for text files containing emoji #79

Fix binary detection for text files containing emoji #79

zimagen commented Oct 23, 2024

gjtorikian commented Oct 23, 2024

Fix binary detection for text files containing emoji #79

Fix binary detection for text files containing emoji #79

Conversation

zimagen commented Oct 23, 2024

Problem

Cause

Fix

Result

gjtorikian commented Oct 23, 2024