
utils: more robust text file detection on checksum #3264

Conversation


@kevin-hanselman kevin-hanselman commented Jan 30, 2020

In the case where binary files have a large text header, the current
checksum routine will treat said files as text files and normalize
line-endings before performing the checksum. Not only is it dangerous to
manipulate binary files like this, it also doubles the runtime of the
checksum routine, as every block of data must be read twice. This patch
makes text file detection more robust by increasing the number of
bytes DVC inspects when classifying the file.

Note: In my team's case, this will invalidate our DVC cache and remote, as the checksums for most of our corpus were incorrectly calculated. We can, of course, fix this by re-dvc adding all of our files, but I want to make sure you (the maintainers) are aware of the ramifications for others.

This doesn't outright fix #3261, but it was a result of that issue.

Fixes #3364
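For context, the kind of text/binary classifier under discussion can be sketched as follows. This is an illustrative heuristic (a null-byte check plus a printable-character ratio), not DVC's exact implementation; the function name and the 0.30 threshold are assumptions:

```python
def istextblock(block: bytes, threshold: float = 0.30) -> bool:
    """Illustrative sketch of a text/binary heuristic: a block is treated
    as text if it contains no NUL bytes and only a small proportion of
    non-printable characters. Not DVC's exact code."""
    if not block:
        return True  # an empty block is trivially "text"
    if b"\x00" in block:
        return False  # NUL bytes essentially never appear in text files
    # Bytes commonly found in text: BEL..CR, ESC, and the printable range
    # (excluding DEL).
    text_chars = bytearray(
        {7, 8, 9, 10, 12, 13, 27} | set(range(0x20, 0x100)) - {0x7F}
    )
    nontext = block.translate(None, text_chars)  # delete all text bytes
    return len(nontext) / len(block) <= threshold
```

The PR's point is not about the heuristic itself but about how many bytes are fed into it: a small sample can be fooled by a binary file with a large text header.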

  • ❗ Have you followed the guidelines in the Contributing to DVC list?

  • 📖 Check this box if this PR does not require documentation updates, or if it does and you have created a separate PR in dvc.org with such updates (or at least opened an issue about it in that repo). Please link below to your PR (or issue) in the dvc.org repo.

  • ❌ Have you checked DeepSource, CodeClimate, and other sanity checks below? We consider their findings advisory and don't expect everything to be addressed. Please review them carefully and fix those that actually improve the code or fix bugs.

Thank you for the contribution - we'll try to review it as soon as possible. 🙏

@kevin-hanselman force-pushed the improve_textfile_detection branch from 7fcecfb to e35a254 on January 30, 2020

@ghost ghost left a comment


looks good to me, @kevlar1818, and thanks for taking the time to submit a patch for it 🙂

I would prefer to remove the file modification entirely and delegate the responsibility of handling different line endings to the user. It might be worth discussing this again; in the meantime, thanks for your contribution!

@kevin-hanselman
Author

@MrOutis I agree about removing the line-ending handling. That seems like something that the user should be responsible for. Heck, you could make a dos2unix.dvc DVC stage 😄

@kevin-hanselman
Author

kevin-hanselman commented Jan 30, 2020

Just to reiterate: Merging this PR and cutting a build may result in users seeing a lot of WARNING: corrupted cache file in dvc, as a subset of files that were once being treated as text files will now be treated as binary files (changing their dvc-computed checksums). Please consider these ramifications before merging. I'm happy to discuss more 👍

FWIW: If we were to remove the line-ending handling, even more people may be affected.

@ghost

ghost commented Jan 30, 2020

@kevlar1818, would making dos2unix optional work for you and your team? That would be less disruptive to the rest of the users, since it keeps compatibility with previous versions.

Contributor

@efiop efiop left a comment


Hi @kevlar1818 ! Thanks for the PR!

The reason we do this is that git automatically adds CRLF on Windows, and we need to keep dvc repos compatible between platforms.

Please check my comment in #3261; I think we simply have a bug in the "trust the remote" feature, and solving it will mitigate this. If that turns out to be the case, I think we will be able to simply close this PR.

@efiop
Contributor

efiop commented Jan 31, 2020

> @kevlar1818, would making dos2unix optional work for you and your team? That would be less disruptive to the rest of the users, since it keeps compatibility with previous versions.

@MrOutis It is a bad idea, as your remotes don't know anything about your config, and people use one remote for multiple projects, which might result in some terrible situations with cache corruption.

@@ -62,7 +62,10 @@ def file_md5(fname):
             if not data:
                 break

-            if binary:
+            if is_file_binary is None:
+                is_file_binary = not istext(data)
Contributor

This basically kills the heuristic and starts checking the whole file contents. I suppose we could extend the heuristic to "check the first N bytes + check the last N bytes" to cover your particular case, but there is no real need for it, as the main intention of this heuristic is providing md5 compatibility for git-tracked files between *nix and Windows, where git uses CRLF by default even if you didn't have them in your file before. So we actually need a heuristic that exactly matches the one git uses. 🙂

That being said, we sure could look into optimizing this by not opening the file again or optimizing dos2unix() etc.

Author

@kevin-hanselman kevin-hanselman Jan 31, 2020
FWIW: This code still only checks the first N bytes. Where N was 512 bytes before, now it's the first MB. (Note how this if statement only gets entered once: when is_file_binary still has its initial non-bool value of None.)

@kevin-hanselman
Author

What's the consensus on this PR? I think it makes sense to raise the number of bytes interrogated to determine if a file is text or binary, and I think looking over the first chunk read prior to checksum calculation (in this case 1MB) is reasonable. I would also support simply changing the text/binary detector to match Git (8K, only looking for null bytes -- no printable char checking).
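The Git-style detector described above amounts to a null-byte scan over roughly the first 8 KB, with no printable-character ratio check; a rough sketch (the function name is hypothetical, and the 8000-byte cutoff is taken from the "8K" figure in the comment):

```python
FIRSTFEW = 8000  # roughly the "8K" window mentioned above

def git_style_is_binary(block: bytes) -> bool:
    """Sketch of a git-like binary check: a file is considered binary
    iff a NUL byte appears within the first few thousand bytes.
    No printable-character counting is done."""
    return b"\x00" in block[:FIRSTFEW]
```

Matching git's behavior exactly matters here because, as efiop notes above, the normalization exists to keep md5s compatible for files git itself treats as text.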

@efiop
Contributor

efiop commented Feb 3, 2020

@kevlar1818 Waiting for the profiling results from you in #3261, as you promised 🙂 We need to figure out the causes first; this PR is not mergeable as-is right now, and we'd need really strong reasons for it.

@efiop
Contributor

efiop commented Mar 10, 2020

Closing since we've found the source in the issue.

@efiop efiop closed this Mar 10, 2020
Successfully merging this pull request may close these issues:

  • Misclassification of binary files as text files
  • Pull extremely slow on ~400GB of data with hot DVC cache