Tartufo does not scan files with alternative encodings, such as UTF-16 LE. #353

Open
jwilbur-godaddy opened this issue Apr 28, 2022 · 0 comments

🐛 Bug Report

Tartufo does not scan files with alternative encodings, such as UTF-16 LE.

This matters because (and I discovered this by accident) Windows PowerShell's > redirection writes output as UTF-16 LE by default, so a command like openssl genrsa > private_key.pem saves private_key.pem as a UTF-16 LE-encoded text file. Tartufo will NOT scan such files at all.

If you create a UTF-16 LE-encoded file (you can do this by re-encoding a file in VS Code) and look at it in a hex editor, you'll see two extra bytes at the beginning of the file (the byte-order mark / BOM, 0xFF 0xFE), and for ASCII text every other byte will be null (0x00). Because of those null bytes, Git treats the file as a "binary" file. Tartufo ignores binary files, so this file is ignored.
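
To make the byte layout concrete, here is a small Python sketch. `looks_binary` is a hypothetical helper that only approximates the NUL-byte heuristic Git uses to classify content as binary; it is not code from Git or tartufo.

```python
import codecs


def looks_binary(data: bytes, probe_size: int = 8000) -> bool:
    """Rough stand-in for Git's check: a NUL byte near the start means binary."""
    return b"\x00" in data[:probe_size]


text = "-----BEGIN RSA PRIVATE KEY-----\n"
utf8_bytes = text.encode("utf-8")
# PowerShell-style output: BOM followed by little-endian UTF-16 code units.
utf16_bytes = codecs.BOM_UTF16_LE + text.encode("utf-16-le")

print(utf16_bytes[:8].hex(" "))   # ff fe 2d 00 2d 00 2d 00
print(looks_binary(utf8_bytes))   # False
print(looks_binary(utf16_bytes))  # True
```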

Non-Solution

Do not use the GIT_DIFF_FORCE_TEXT flag to fix this. It would cause problems when a file really is binary, and files that are actually text would still be converted to Python strings containing aberrant characters, so they would not match regular expressions and might not trip the entropy checks either.
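
A minimal sketch of why forcing text does not help, assuming the forced content ends up decoded as UTF-8. The pattern below is an illustrative regex, not one of tartufo's actual rules.

```python
import codecs
import re

# Illustrative pattern only; tartufo's real rule set may differ.
pattern = re.compile(r"-----BEGIN RSA PRIVATE KEY-----")

raw = codecs.BOM_UTF16_LE + "-----BEGIN RSA PRIVATE KEY-----\n".encode("utf-16-le")

# Treating the UTF-16 LE bytes as UTF-8 leaves a NUL between every character.
forced = raw.decode("utf-8", errors="replace")
# Decoding with the right codec (the BOM selects the byte order) recovers the text.
proper = raw.decode("utf-16")

print("\x00" in forced)                    # True
print(pattern.search(forced))              # None
print(pattern.search(proper) is not None)  # True
```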

Possible solution

Attempt to detect the text encoding of a file. If it is text, but it is not UTF-8, re-encode the chunk to be scanned (not the original file itself!) into UTF-8 prior to scanning it. I think you will want to change this here:

tartufo/tartufo/scanner.py

Lines 571 to 580 in 3f075ab

for chunk in self.chunks:  # pylint: disable=too-many-nested-blocks
    # Run regex scans first to trigger a potential fast fail for bad config
    if self.global_options.regex and self.rules_regexes:
        for issue in self.scan_regex(chunk):
            self.store_issue(issue)
            yield issue
    if self.global_options.entropy:
        for issue in self.scan_entropy(chunk):
            self.store_issue(issue)
            yield issue
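
The re-encoding idea could be sketched as follows. `decode_for_scanning` is a hypothetical helper, not part of tartufo, and it assumes the scanner can get at the raw bytes of a chunk before they become a Python string. It only detects encodings that announce themselves with a BOM; text without a BOM falls back to UTF-8.

```python
import codecs

# UTF-32 BOMs are checked before UTF-16 because the UTF-32 LE BOM
# (FF FE 00 00) begins with the UTF-16 LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def decode_for_scanning(raw: bytes) -> str:
    """Return text for scanning; the original file is never modified."""
    for bom, encoding in _BOMS:
        if raw.startswith(bom):
            # Drop the BOM, then decode the rest with the matching codec.
            return raw[len(bom):].decode(encoding, errors="replace")
    # No BOM found: assume UTF-8, replacing any undecodable bytes.
    return raw.decode("utf-8", errors="replace")


sample = codecs.BOM_UTF16_LE + "secret = 'abc'\n".encode("utf-16-le")
print(decode_for_scanning(sample))  # secret = 'abc'
```

A BOM check is cheap but misses UTF-16 files written without a BOM; a fuller solution would need a real encoding detector, which is outside this sketch.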

To Reproduce

  1. In PowerShell, run openssl genrsa > private_key.pem. This will generate a UTF-16 LE-encoded private key, which should get flagged by tartufo, but does not.
  2. Run tartufo in the repo. It will not catch this.
  3. Open private_key.pem in VS Code. On the bottom right, you'll see "UTF-16 LE." Click on this and save it with "UTF-8" encoding instead.
  4. Run tartufo again and observe that it scans this file.

Expected Behavior

Tartufo should be able to scan files that are text, but not encoded with UTF-8.

Environment

A Windows machine.

@jwilbur-godaddy jwilbur-godaddy added the bug Something isn't working label Apr 28, 2022