drop dos2unix behaviour #4658
Hi @MetalBlueberry! This is a known behaviour of dvc (#992) that we anticipated might bite us in the rear at some point. Initially it was implemented that way to make sure that Windows and *nix hashes of git files match, since git will automatically set CRLF on Windows unless you set autocrlf to false. In hindsight, we could probably check for that option automatically, or at least document it and just stick with md5sum instead of dos2unix + md5sum. That being said, dvc itself doesn't change the contents of the files it stores, so the collision happened when one file was pushed to the remote. Could you elaborate on your scenario with CRLF, btw? Just want to have a better idea of your use case.
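For reference, the git option mentioned above is the standard core.autocrlf setting, which can be disabled like this:

```sh
# Tell git to stop converting LF to CRLF on checkout (the Windows default behaviour)
git config --global core.autocrlf false
```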
In my scenario, we were trying to upload an XML file to a dvc remote using a custom uploader. The problem is that the md5 checksum calculated by dvc doesn't match the standard md5 when the file contains CRLF line endings. Our custom uploader performs validations on the md5 checksum to ensure that you cannot upload data with different sums. From my point of view, dvc should not change the file content when calculating the md5, so that the file hash matches the standard one. In the scenario that you describe, either the linux user or the windows user will be forced to work with CRLF/LF line endings. The worst thing is that two files that are identical except for swapped CRLF/LF line endings cannot both be uploaded to the same remote, because they will have the same hash. As I describe in the example, this makes it impossible to change the line endings once you push to the remote. If windows users have problems with git changing to CRLF line endings, they could just change the git settings instead of having inconsistent file hashes in DVC. Another idea could be to enable/disable this behaviour with a parameter: we could have it enabled by default for backward compatibility but allow users to disable it for consistency.
@MetalBlueberry That sounds good. And in 2.0 we could consider switching to that behaviour by default (we actually had some plans to do that in 1.0, but didn't get to it). Here is the piece of code that does it: line 38 in d0e1c88.
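For readers who don't want to follow the link: the dos2unix step boils down to normalizing CRLF to LF before hashing. A simplified Python sketch of the idea (not the verbatim dvc source; names here are illustrative):

```python
import hashlib

def dos2unix(data: bytes) -> bytes:
    # Normalize Windows line endings to Unix line endings.
    return data.replace(b"\r\n", b"\n")

def dvc_style_md5(data: bytes) -> str:
    # dvc hashes the normalized bytes for files it detects as text.
    return hashlib.md5(dos2unix(data)).hexdigest()

crlf = b"line one\r\nline two\r\n"
lf = b"line one\nline two\n"

# Both variants collapse to the same hash under dos2unix...
assert dvc_style_md5(crlf) == dvc_style_md5(lf)
# ...while a plain md5 of the raw bytes tells them apart.
assert hashlib.md5(crlf).hexdigest() != hashlib.md5(lf).hexdigest()
```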
Yep, agreed! Providing an option is a great idea. Should we repurpose this ticket for this?
This bit me too. I have a bunch of binary files that dvc thinks are text files even though they aren't, so the md5 hash calculated by dvc doesn't match the one calculated by azure or any other tool that just hashes the bytes without line-ending conversion. It seems like this kind of approach makes your md5 hashes really brittle: any bug fix to istextfile or file_md5 might break users' caches. I would strongly suggest that the stored md5 hashes come with a version number indicating which version of the algorithm produced them. That way, if the algorithm is updated and improved, the old hashes aren't suddenly useless. E.g., currently a .dvc file looks something like this:
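Something along these lines, with placeholder values:

```yaml
outs:
- md5: ec1b47d4aff08b106b0d3c442dda2af0
  size: 2048
  path: data.bin
```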
If dvc is fixed in 2.0 so that this is properly recognized as binary, the md5 hash above will be wrong. Instead, consider something like the following:
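For instance (the md5_version field name is hypothetical):

```yaml
outs:
- md5: ec1b47d4aff08b106b0d3c442dda2af0
  md5_version: "1.0"   # which revision of the hashing algorithm produced the value
  size: 2048
  path: data.bin
```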
If the hash algo changes, just create a "1.1" or "2.0" version, but keep the ability to interpret a 1.0 hash code for older .dvc files. Wow, this kind of bug makes me nervous.
@gamis Thanks for the feedback! We plan on changing the cache format in 2.0 (#4841) and on dropping dos2unix at the same time, so there is no cache mixup. Once we turn off dos2unix by default, there is not much reason to introduce md5 versioning, since it will be true md5 anyway. That said, we are also considering switching to a different default hash type, because people tend to be scared of md5, even though we are not doing any cryptography with it. Regarding the azure etag, I don't think even azure guarantees it to be an md5; it just happens to match it sometimes. Same with s3's etag. So you probably shouldn't rely on it too much. Or am I missing something?
Thanks for the quick response:
No, 2.0 will likely be backward compatible with old dvc files and cache, at the very least in read-only mode. If something needs to be broken, we'll either migrate automatically or provide migration instructions/a script. Keep in mind that we still need to be able to read old metrics/plots/params and do some other stuff with the project's history, so we kinda have to have at least read-only backward compatibility. 🙂
Ah, makes sense 👍
Could you please clarify what the purpose of this field is (if it's a user-provided one) and how people use it?
In my case, I'm using gocloud/blob, which is a cross-provider blob storage library. It also has a Content-MD5 field that is used to check the integrity of the uploaded file: if the calculated hash doesn't match the user-provided hash, the upload fails (https://godoc.org/gocloud.dev/blob#WriterOptions). Content-MD5 is also a standard header used to check the integrity of data transmission (https://tools.ietf.org/html/rfc1864). For my specific use case, I'm using the value to upload files to a dvc repository through an API. The hash is used not only for integrity validation but also to open a connection with the storage at the beginning of the upload to stream the file contents. The md5 is required because it determines the file structure on the remote.
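For context, RFC 1864 defines Content-MD5 as the base64 of the MD5 digest of the exact transmitted bytes, which is why a line-ending-normalized hash can never satisfy the check for a CRLF file. A small Python sketch:

```python
import base64
import hashlib

def content_md5(data: bytes) -> str:
    # RFC 1864: base64 of the raw 128-bit MD5 digest of the exact payload bytes.
    return base64.b64encode(hashlib.md5(data).digest()).decode("ascii")

crlf = b"<doc>\r\n</doc>\r\n"
normalized = crlf.replace(b"\r\n", b"\n")

# The server validates against the bytes it actually received (CRLF),
# so a checksum derived from the normalized content will never match.
assert content_md5(crlf) != content_md5(normalized)
```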
I think that "people tend to be scared" is not a good reason to leave md5 behind. It won't be easy to replace it with another algorithm while keeping backward compatibility.
Yes, what @MetalBlueberry said! Azure's azcopy will declare an upload failed if the md5 doesn't match. I use it to ensure that my local versions of the files are identical to those in Azure. +1 re people being "scared" of md5. Maybe add a paragraph in the docs explaining why it's ok. There's not really any good reason to switch, though I do like blake2s!
@pmrowla Knowing this decision, can you reconsider merging feature #5246? Our team is facing issues when tracking .tiff images and other text-like files. Right now we are maintaining our own fork internally, but this doesn't scale well.
@MetalBlueberry Really sorry for the inconvenience, but we've discussed it multiple times in the team, and it is clear that if we merge it, it will become an official feature and more people will start using it, so we'll need to handle the migration and compatibility (or lack thereof) somehow, and we'll need to break the cache once again in 3.0. That's quite a lot of effort to handle properly, and we would be much better off spending it on the upcoming 3.0 cache changes, which don't use dos2unix by design. To be clear, the 3.0 cache will be available as an opt-in in 2.x, so you will be able to switch to it once it is released as an experimental feature (likely in March) and be compatible with all the exciting enhancements that will become standard in 3.0.
Sounds like the best course of action. Thanks all!
@efiop Thoughts on the priority of this? |
@dberenbaum Fixing this is a major backwards-compatibility-breaking change, so the original thought was that we would address it at the same time as all the other potential 3.0 cache/object changes (which will also break backwards compatibility).
DVC CRLF issue
I've prepared a sample repository with the steps to reproduce this issue.
https://github.com/MetalBlueberry/dvc-crlf
Reproduce
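In outline, the steps are along these lines (a sketch of the idea; the linked repo has the exact scripts):

```sh
printf 'echo hello\r\n' > script.sh          # create a script with CRLF line endings
dvc add script.sh                            # track it; dvc hashes the LF-normalized bytes
dos2unix script.sh                           # convert the working copy to LF
dvc status                                   # reports nothing to do: both variants hash the same
rm script.sh && dvc checkout script.sh.dvc   # the restored file has CRLF endings again
```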
Obtained
script.sh is stored with CRLF in cache, so when you run dvc checkout, it always comes back to CRLF line endings even if you modify it locally.
Expected
script.sh should keep the original CRLF line endings, or at least the cache content should always be stored with LF endings.
Consequences
Thank you
First reported in iterative/dvc.org#1469 (comment)