
utils: more robust text file detection on checksum #3264

Conversation


@kevin-hanselman kevin-hanselman commented Jan 30, 2020

In the case where binary files have a large text header, the current
checksum routine will treat said files as text files and normalize
line-endings before performing the checksum. Not only is it dangerous to
manipulate binary files like this, it also doubles the runtime of the
checksum routine, as every block of data must be read twice. This patch
makes text file detection more robust by increasing the number of
bytes DVC inspects when classifying the file.

Note: In my team's case, this will invalidate our DVC cache and remote, as the checksums for most of our corpus were incorrectly calculated. We can, of course, fix this by re-dvc adding all of our files, but I want to make sure you (the maintainers) are aware of the ramifications for others.

This doesn't outright fix #3261, but it was a result of that issue.

Fixes #3364
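For context, the kind of text/binary classifier under discussion can be sketched as follows. This is an illustrative heuristic (a null-byte check plus a printable-character ratio), not DVC's exact implementation; the function name and the 0.30 threshold are assumptions:

```python
def istextblock(block: bytes, threshold: float = 0.30) -> bool:
    """Illustrative sketch of a text/binary heuristic: a block is treated
    as text if it contains no NUL bytes and only a small proportion of
    non-printable characters. Not DVC's exact code."""
    if not block:
        return True  # an empty block is trivially "text"
    if b"\x00" in block:
        return False  # NUL bytes essentially never appear in text files
    # Bytes commonly found in text: BEL..CR, ESC, and the printable range
    # (excluding DEL).
    text_chars = bytearray(
        {7, 8, 9, 10, 12, 13, 27} | set(range(0x20, 0x100)) - {0x7F}
    )
    nontext = block.translate(None, text_chars)  # delete all text bytes
    return len(nontext) / len(block) <= threshold
```

The PR's point is not about the heuristic itself but about how many bytes are fed into it: a small sample can be fooled by a binary file with a large text header.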

  • ❗ Have you followed the guidelines in the Contributing to DVC list?

  • 📖 Check this box if this PR does not require documentation updates, or if it does and you have created a separate PR in dvc.org with such updates (or at least opened an issue about it in that repo). Please link below to your PR (or issue) in the dvc.org repo.

  • ❌ Have you checked DeepSource, CodeClimate, and other sanity checks below? We consider their findings advisory and don't expect everything to be addressed. Please review them carefully and fix those that actually improve the code or fix bugs.

Thank you for the contribution - we'll try to review it as soon as possible. 🙏

@kevin-hanselman force-pushed the improve_textfile_detection branch from 7fcecfb to e35a254 on January 30, 2020

@ghost ghost left a comment


looks good to me, @kevlar1818, and thanks for taking the time to submit a patch for it 🙂

I would prefer to remove the file modification entirely and delegate the responsibility of handling different line endings to the user. It might be worth discussing this again; in the meantime, thanks for your contribution!

@kevin-hanselman
Author

@MrOutis I agree about removing the line-ending handling. That seems like something that the user should be responsible for. Heck, you could make a dos2unix.dvc DVC stage 😄

@kevin-hanselman
Author

kevin-hanselman commented Jan 30, 2020

Just to reiterate: Merging this PR and cutting a build may result in users seeing a lot of WARNING: corrupted cache file in dvc, as a subset of files that were once being treated as text files will now be treated as binary files (changing their dvc-computed checksums). Please consider these ramifications before merging. I'm happy to discuss more 👍

FWIW: If we were to remove the line-ending handling, even more people may be affected.

@ghost

ghost commented Jan 30, 2020

@kevlar1818, would making dos2unix optional work for you and your team? That would be less disruptive to the rest of the users, since it keeps compatibility with previous versions.

Contributor

@efiop efiop left a comment


Hi @kevlar1818 ! Thanks for the PR!

The reason we do this is that git automatically adds CRLF on Windows, and we need to keep dvc repos compatible between platforms.

Please check my comment in #3261; I think we simply have a bug in the "trust the remote" feature, and solving it will mitigate this. If that turns out to be the case, I think we will be able to simply close this PR.

@efiop
Contributor

efiop commented Jan 31, 2020

> @kevlar1818, would making dos2unix optional work for you and your team? That would be less disruptive to the rest of the users, since it keeps compatibility with previous versions.

@MrOutis It is a bad idea, as your remotes don't know anything about your config, and people use one remote for multiple projects, which might result in some terrible situations with cache corruption.

@@ -62,7 +62,10 @@ def file_md5(fname):
             if not data:
                 break

-            if binary:
+            if is_file_binary is None:
+                is_file_binary = not istext(data)
Contributor

This basically kills the heuristic and starts checking the whole file contents. I suppose we could extend the heuristic to "check the first N bytes + check the last N bytes" to cover your particular case, but there is no real need for it, as the main intention of this heuristic is providing md5 compatibility for git-tracked files between *nix and Windows, where git uses CRLF by default even if you didn't have them in your file before. So we actually need a heuristic that exactly matches the one git uses. 🙂

That being said, we sure could look into optimizing this by not opening the file again or optimizing dos2unix() etc.

Author

@kevin-hanselman kevin-hanselman Jan 31, 2020
FWIW: This code still only checks the first N bytes. Where N was 512 bytes before, now it's the first MB. (Note how this if statement only gets entered once: when is_file_binary still has its initial non-bool value of None.)

@kevin-hanselman
Author

What's the consensus on this PR? I think it makes sense to raise the number of bytes interrogated to determine if a file is text or binary, and I think looking over the first chunk read prior to checksum calculation (in this case 1MB) is reasonable. I would also support simply changing the text/binary detector to match Git (8K, only looking for null bytes -- no printable char checking).
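The Git-style detector described above amounts to a null-byte scan over roughly the first 8 KB, with no printable-character ratio check; a rough sketch (the function name is hypothetical, and the 8000-byte cutoff is taken from the "8K" figure in the comment):

```python
FIRSTFEW = 8000  # roughly the "8K" window mentioned above

def git_style_is_binary(block: bytes) -> bool:
    """Sketch of a git-like binary check: a file is considered binary
    iff a NUL byte appears within the first few thousand bytes.
    No printable-character counting is done."""
    return b"\x00" in block[:FIRSTFEW]
```

Matching git's behavior exactly matters here because, as efiop notes above, the normalization exists to keep md5s compatible for files git itself treats as text.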

@efiop
Contributor

efiop commented Feb 3, 2020

@kevlar1818 Waiting for the profiling results from you in #3261, as you promised 🙂 We need to figure out the causes first; this PR is not mergeable as-is right now, and we'd need really strong reasons for it.

@efiop
Contributor

efiop commented Mar 10, 2020

Closing since we've found the source in the issue.

@efiop efiop closed this Mar 10, 2020
Successfully merging this pull request may close these issues:

  • Misclassification of binary files as text files
  • Pull extremely slow on ~400GB of data with hot DVC cache