Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

drop dos2unix behaviour #4658

Closed
Tracked by #7093
MetalBlueberry opened this issue Oct 2, 2020 · 18 comments · Fixed by #9538
Closed
Tracked by #7093

drop dos2unix behaviour #4658

MetalBlueberry opened this issue Oct 2, 2020 · 18 comments · Fixed by #9538
Assignees
Labels
p1-important Important, aka current backlog of things to do

Comments

@MetalBlueberry
Copy link
Contributor

MetalBlueberry commented Oct 2, 2020

DVC CRLF issue

I've prepared a sample repository with the steps to reproduce this issue.
https://github.com/MetalBlueberry/dvc-crlf

DVC version: 1.7.2 (pip)
---------------------------------
Platform: Python 3.6.9 on Linux-5.3.0-28-generic-x86_64-with-Ubuntu-18.04-bionic
Supports: All remotes
Cache types: hardlink, symlink
Cache directory: ext4 on /dev/nvme0n1p5
Workspace directory: ext4 on /dev/nvme0n1p5
Repo: dvc, git

Cache is committed to the repo so it can be viewed

Reproduce

  • init git and dvc
  • create a script.sh with CRLF line endings
  • dvc add script.sh
  • change CRLF ending to LF
  • try to add it again, (doesn't detect changes)

Obtained

script.sh is stored with CRLF in cache, so when you run dvc checkout it always comes back to CRLF line endings even if you modify it locally.

Expected

script.sh should keep the original CRLF line endings or at least store the cache content always with LF endings.

Consequences

  • CRLF and LF cannot be modified after first commit
  • md5 file name doesn't match the file content hash

Thank you

First reported in iterative/dvc.org#1469 (comment)

@efiop
Copy link
Contributor

efiop commented Oct 2, 2020

Hi @MetalBlueberry !

This is a known behaviour of dvc #992 that we have anticipated might bite us in the rear at some point. Initially it was implemented that way to make sure that windows and *nix hashes of git files match, as git will automatically set CRLF on windows, unless you set autocrlf to false. In hindsight, we could probably check for that option automatically or at least document it and just stick with md5sum instead of dos2unix + md5sum.

That being said, dvc itself doesnt change the contents of the files it stores, so the collision happened when one file was pushed to the remote.

Could you elaborate on your scenario with CRLF, btw? Just want to have a better idea on your use case.

@efiop efiop added the awaiting response we are waiting for your reply, please respond! :) label Oct 2, 2020
@MetalBlueberry
Copy link
Contributor Author

In my scenario we were trying to upload a XML file to dvc remote using a custom uploader. The problem is that the md5 checksum calculated by dvc doesn't match the standard md5 when the file contains CRLF line endings. Our custom uploader performs validations on md5 checksum to ensure that you cannot upload data with different sums.

From my point of view, dvc should not change the file content to calculate the md5 to ensure that the file hash matches. In the scenario that you describes, either the linux user or the windows user will be forced to work with CRLF / LF line endings. The worst thing is that two identical files with CRLF/LF swapped could not be uploaded to the same remote because they will have the same hash. As I describe in the example, this makes impossible to change the line endings once you push to the remote.

If windows users have problems with git changing to CRLF line endings, they could just change the git settings instead of having inconsistent file hashes in DVC.

Other idea could be to enable/disable this behaviour with a parameters, We could have this behaviour enabled by default for backward compatibility but allow the users to disable it for consistency.

@efiop
Copy link
Contributor

efiop commented Oct 5, 2020

@MetalBlueberry That sounds good. And in 2.0 we could consider switching to that behaviour by default (we actually had some plans to do that in 1.0, but didn't get to that). Here is the piece in the code that does it

chunk = dos2unix(data)
.

@shcheklein
Copy link
Member

Yep, agreed! Providing an option is a great idea. Should we repurpose this ticket for this?

@gamis
Copy link

gamis commented Nov 16, 2020

This bit me too. I have a bunch of binary files that dvc thinks are text files even though they aren't. So the md5 hash calculated by dvc doesn't match the one calculated by azure or any other tool that just hashes the bytes without line-ending conversion.

It seems like this kind of approach makes your md5 hashes really brittle. Any bug fix to istextfile or file_md5 might break users' caches.

I would strongly suggest that the stored md5 hashes come with a version number that indicate what version of the algorithm they used. That way, if the algorithm is updated and improved, the old hashes aren't suddenly useless.

E.g., currently a .dvc file looks like:

outs:
- md5: cbc3c863de715aaa1c83b4c73a4baa20
  path: 'my_amazing_binary_file_dvc_thinks_is_txt.bin'

If dvc is fixed in 2.0 so that this is properly recognized as binary, the md5 hash above will be wrong.

Instead, consider:

outs:
- md5: cbc3c863de715aaa1c83b4c73a4baa20
  md5_hash_version: 1.0
  path: 'my_amazing_binary_file_dvc_thinks_is_txt.bin'

If the hash algo changes, just create a "1.1" or "2.0" version, but keep the ability to interpret a 1.0 hash code for older .dvc files.

Wow this kind of bug makes me nervous.

@efiop
Copy link
Contributor

efiop commented Nov 16, 2020

@gamis Thanks for the feedback! We plan on changing the cache format in 2.0 #4841 and plan on dropping dos2unix at the same time, so there is no cache mixup. Once we turn off dos2unix by default, there is not much reason to introduce md5 versioning, since it will be true md5 anyways. That said, we are also considering switching to a different default hash type, because people tend to be scared of md5, even though we are not doing any cryptography with it.

Regarding the azure etag, I don't think even azure guarantees it to be an md5, it just happens to match it sometimes. Same with s3's etag. So you probably shouldn't rely on it too much. Or am I missing something?

@gamis
Copy link

gamis commented Nov 16, 2020

Thanks for the quick response:

  1. Regarding versioning, if i upgrade to dvc 2.0, what happens to my .dvc files and my cache? Are they going to suddenly be completely invalid?
  2. Regarding azure, I'm not referring to the etag but rather the content_md5 property. It is only filled if the uploading client includes it, which is true of both azcopy and the python client.

@efiop
Copy link
Contributor

efiop commented Nov 17, 2020

Regarding versioning, if i upgrade to dvc 2.0, what happens to my .dvc files and my cache? Are they going to suddenly be completely invalid?

No, 2.0 will likely be backward compatible with old dvc files and cache, at the very least in read-only mode. If something needs to be broken, we'll either migrate automatically or will provide a migration instructions/script. Keep in mind that we still need to be able to read old metrics/plots/params and do some other stuff with project's history, so we kinda have to have at least a read-only backward compatibility. 🙂

Regarding azure, I'm not referring to the etag but rather the content_md5 property. It is only filled if the uploading client includes it, which is true of both azcopy and the python client.

Ah, makes sense 👍

@shcheklein
Copy link
Member

@gamis

Regarding azure, I'm not referring to the etag but rather the content_md5 property. It is only filled if the uploading client includes it, which is true of both azcopy and the python client.

Could you clarify please what is the purpose of this field (if it's a user provided one) and how do people use it?

@MetalBlueberry
Copy link
Contributor Author

In my case, I'm using gocloud/blob, which is a cross provider blob storage.

I also have the Content-MD5 field and it is used to check the integrity of the uploaded file. If the calculated hash doesn't match the user hash, the upload fails. https://godoc.org/gocloud.dev/blob#WriterOptions

it is also a standard header used to check the integrity of the data transmission. https://tools.ietf.org/html/rfc1864

For my specific use case, I'm using the value to upload files to a dvc repository through an API. The hash is used not only for integrity validation but to open a connection with the storage site at the beginning of the upload to stream the file contents. The md5 is required because it is the file structure on the remote.

@efiop

we are also considering switching to a different default hash type, because people tend to be scared of md5, even though we are not doing any cryptography with it.

I think that people tend to be scared is not a good reason to leave behind md5. It won't be easy to replace it with another algorithm while keeping backward compatibility.

@gamis
Copy link

gamis commented Nov 17, 2020

Yes, what @MetalBlueberry said! Azure's azcopy will declare an upload failed if the md5 doesn't match. I use it to ensure that my local versions of the files are identical to those in Azure.

1+ re people being "scared" of md5. Maybe add a paragraph in the docs explaining why it's ok. There's not really any good reason to switch, though I do like blake2s!

@efiop efiop added p1-important Important, aka current backlog of things to do and removed awaiting response we are waiting for your reply, please respond! :) labels Jan 18, 2021
@pmrowla pmrowla removed their assignment Feb 15, 2021
@MetalBlueberry
Copy link
Contributor Author

@pmrowla knowing this decision, can you reconsider merging the feature #5246? Our team is facing issues when tracking .tiff images and other text-like files. Right now we are maintaining our own fork internally but this doesn't scale well.
The toggle feature flag won't break existing compatibility and most users won't realize that the flag exists.

@efiop
Copy link
Contributor

efiop commented Feb 15, 2021

@MetalBlueberry Really sorry for the inconvenience, but we've discussed it multiple times in the team and it is clear that if we merge it, it means that it will become an official feature and more people will start using it, so we'll need to handle the migration and compatibility (or lack thereof) somehow, and we'll need to break the cache once again in 3.0. That's quite a lot of effort to handle it properly, which we will be much better off just spending on the upcoming 3.0 cache changes, that don't use dos2unix by design. To be clear, 3.0 cache will be available as an opt-in in 2.x, so you will be able to switch to it once it is released as an experimental feature (likely in march) and be compatible with all the exciting enhancements that will become standard in 3.0.

@gamis
Copy link

gamis commented Feb 15, 2021

Sounds like the best course of action. Thanks all!

@dberenbaum
Copy link
Collaborator

@efiop Thoughts on the priority of this?

@pmrowla
Copy link
Contributor

pmrowla commented Mar 7, 2022

@efiop Thoughts on the priority of this?

@dberenbaum fixing this is a major backwards compatibility breaking change, so the original thought was we would address this at the same time as all the other potential 3.0 cache/object changes (that will also break backwards compatibility)

@dberenbaum
Copy link
Collaborator

@pmrowla Thanks! Can we add to #7093 and lower the priority for now since it's not likely something we can get to now?

@pmrowla pmrowla mentioned this issue Mar 9, 2022
4 tasks
@pmrowla pmrowla added p2-medium Medium priority, should be done, but less important and removed p1-important Important, aka current backlog of things to do labels Mar 9, 2022
@skshetry skshetry changed the title CRLF and LF line endings cannot be modified after dvc add drop dos2unix behaviour May 4, 2023
@dberenbaum dberenbaum added p1-important Important, aka current backlog of things to do and removed p2-medium Medium priority, should be done, but less important labels May 5, 2023
@pmrowla pmrowla self-assigned this May 25, 2023
@pmrowla pmrowla added this to DVC May 25, 2023
@github-project-automation github-project-automation bot moved this to Backlog in DVC May 25, 2023
@pmrowla pmrowla moved this from Backlog to In Progress in DVC May 25, 2023
@github-project-automation github-project-automation bot moved this from In Progress to Done in DVC Jun 9, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
p1-important Important, aka current backlog of things to do
Projects
No open projects
Archived in project
6 participants