-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
utils: more robust text file detection on checksum #3264
utils: more robust text file detection on checksum #3264
Conversation
In the case where binary files have a large text header, the current checksum routine will treat said files as text files and normalize line-endings before performing the checksum. Not only is it dangerous to manipulate binary files like this, it also doubles the runtime of the checksum routine, as every block of data must be read twice. This patch makes the process for detecting text files more robust by increasing the number of bytes interrogated by DVC used to classify the file.
7fcecfb
to
e35a254
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks good to me, @kevlar1818 , and thanks for taking the time to submit a patch for it 🙂
I would prefer to remove the file modification and delegate the responsibility of dealing with different line endings to the user. I wonder if it would be nice to discuss this again, in the meanwhile, thanks for your contrib!
@MrOutis I agree about removing the line-ending handling. That seems like something that the user should be responsible for. Heck, you could make a |
Just to reiterate: Merging this PR and cutting a build may result in users seeing a lot of FWIW: If we were to remove the line-ending handling, even more people may be affected. |
@kevlar1818 , would making dos2unix optional works for you and your team? This one would be less aggro to the rest of the users by keeping compatibility with previous versions) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @kevlar1818 ! Thanks for the PR!
The reason why we are doing that is git automatically adding CRLF on windows and we need to make dvc repos compatible between platforms.
Please check my comment in #3261 , I think we simply have a bug in "trust the remote" feature, solving which will mitigate this. And if that will be the case, I think we will be able to simply close this PR.
@MrOutis It is a bad idea, as your remotes don't know anything about your config, and people use 1 remote for multiple projects, which might result in some terrible situations with cache corruption. |
@@ -62,7 +62,10 @@ def file_md5(fname): | |||
if not data: | |||
break | |||
|
|||
if binary: | |||
if is_file_binary is None: | |||
is_file_binary = not istext(data) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This basically kiils the heuristics and starts checking the whole file contents. I suppose we could extend the heuristics to "check first N bytes + check last N bytes" to cover your particular case, but there is no real need for it, as the main intention for this heuristics is providing md5 compatibility for git-tracked files between *nix and windows, where git uses CRLF by default even if you didn't have them before in your file. So we actually need to have a heuristic that is an exact match to the one that git uses. 🙂
That being said, we sure could look into optimizing this by not opening the file again or optimizing dos2unix() etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW: This code still only checks the first N bytes. Where N was 512 bytes before, now it's the first MB. (Note how this if statement only gets entered once -- if is_file_binary
has it's initial non-bool value of None
.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like Git may use 8KB
See also: https://stackoverflow.com/questions/6119956/how-to-determine-if-git-handles-a-file-as-binary-or-as-text
What's the consensus on this PR? I think it makes sense to raise the number of bytes interrogated to determine if a file is text or binary, and I think looking over the first chunk read prior to checksum calculation (in this case 1MB) is reasonable. I would also support simply changing the text/binary detector to match Git (8K, only looking for null bytes -- no printable char checking). |
@kevlar1818 Waiting for prof results from you in #3261 , as you've promised 🙂 Need to figure out the causes first, this PR is not mergeable as is right now, we need really strong reasons for it. |
Closing since we've found the source in the issue. |
In the case where binary files have a large text header, the current
checksum routine will treat said files as text files and normalize
line-endings before performing the checksum. Not only is it dangerous to
manipulate binary files like this, it also doubles the runtime of the
checksum routine, as every block of data must be read twice. This patch
makes the process for detecting text files more robust by increasing the
number of bytes interrogated by DVC used to classify the file.
Note: In my team's case, this will invalidate our DVC cache and remote, as the checksums for most of our corpus were incorrectly calculated. We can, of course, fix this by re-
dvc add
ing all of our files, but I want to make sure you (the maintainers) are aware of the ramifications for others.This doesn't outright fix #3261, but it was a result of that issue.
Fixes #3364
❗ Have you followed the guidelines in the Contributing to DVC list?
📖 Check this box if this PR does not require documentation updates, or if it does and you have created a separate PR in dvc.org with such updates (or at least opened an issue about it in that repo). Please link below to your PR (or issue) in the dvc.org repo.
❌ Have you checked DeepSource, CodeClimate, and other sanity checks below? We consider their findings recommendatory and don't expect everything to be addressed. Please review them carefully and fix those that actually improve code or fix bugs.
Thank you for the contribution - we'll try to review it as soon as possible. 🙏