Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

output: deprecate dos2unix md5 hashing #9517

Closed
wants to merge 1 commit into from

Conversation

pmrowla
Copy link
Contributor

@pmrowla pmrowla commented May 29, 2023

Thank you for the contribution - we'll try to review it as soon as possible. πŸ™

Will close #4658

Requires iterative/dvc-data#359

  • Adds config option core.legacy_md5. When True, legacy istextfile+dos2unix behavior will be used when computing MD5 hashes
  • core.legacy_md5 defaults to true (behavior is unchanged for existing 2.x repos where core.legacy_md5 is unset)
  • core.legacy_md5 is explicitly set to false in all new DVC repos upon Repo.init/dvc init (legacy dos2unix behavior is disabled by default in new 3.x repos).

@pmrowla pmrowla added breaking-change A: data-management Related to dvc add/checkout/commit/move/remove labels May 29, 2023
@pmrowla pmrowla requested a review from a team May 29, 2023 07:22
@pmrowla pmrowla self-assigned this May 29, 2023
@efiop
Copy link
Contributor

efiop commented May 29, 2023

@pmrowla Do you think we need that backward compatibility option at all? Seems like it could potentially corrupt cache if someone turns it on for a moment.

@pmrowla
Copy link
Contributor Author

pmrowla commented May 29, 2023

My main concern was with how to handle existing DVC repos. If we don't have the compatibility option anyone with an existing repo with files we hashed using the dos2unix behavior also has a corrupted cache. There will be a lot of users with existing DVC repos (and existing local cache + remotes) that were affected by the dos2unix behavior without the user realizing it, and so I was leaning towards making the default leave the behavior as-is for those existing repos.

But I'd also be fine with reversing the default in this PR and then documenting that users with existing 2.x repos upgrading to 3.x need to explicitly enable this option only if they are sure they need the legacy behavior.

@pmrowla
Copy link
Contributor Author

pmrowla commented May 29, 2023

But I'd also be fine with reversing the default in this PR and then documenting that users with existing 2.x repos upgrading to 3.x need to explicitly enable this option only if they are sure they need the legacy behavior.

For this case we could also add a utility (or a flag for dvc doctor) in 3.x that would re-hash everything (dvc-tracked) in the user's workspace and warn them if there's a mismatch between the 3.x md5 and existing md5 and then suggest that the user needs to enable the legacy option

(and document that users should run this check when upgrading from 2.x to 3.x)

@efiop
Copy link
Contributor

efiop commented May 29, 2023

@pmrowla How about moving new objects somewhere different? E.g existing ones are .dvc/cache/12 and new in .dvc/cache/file/12 ? I think we've talked about it before, but don't think we agreed on anything in particular.

Separate locations will allow us to fallback to old objects. And we could call legacy hash algo something like md5dos2unix to distinguish it from actual md5 instead of introducing text knowledge in dvc-data. WDYT?

Or maybe we could just drop support for old ones along the way and provide a script or something to convert to new objects?

I guess we'll be meeting tomorrow, so will have a chance to chat.

@pmrowla
Copy link
Contributor Author

pmrowla commented May 30, 2023

@pmrowla How about moving new objects somewhere different? E.g existing ones are .dvc/cache/12 and new in .dvc/cache/file/12 ? I think we've talked about it before, but don't think we agreed on anything in particular.

Separate locations will allow us to fallback to old objects. And we could call legacy hash algo something like md5dos2unix to distinguish it from actual md5 instead of introducing text knowledge in dvc-data. WDYT?

From an engineering standpoint I think this is the most correct option, but I was concerned it's less user-friendly for handling existing repos mainly since we will lose de-duplication for files which have already been tracked in existing repos. This is just the standard problem you have when switching to any new hash algo, where you either lose de-dupe or have to keep track of a mapping of old hashes to the new ones.

We could handle this by computing both hashes (and writing both hashes in dvc files) so that we know when we can fall back to a 2.x cache or remote version, but this will get expensive in terms of performance. But this could also be something users explicitly enable via config.

Alternatively we can just accept that we will lose de-dupe, existing objects will be left alone and all newly added ones will be non-dos2unix.

For reference, the git sha256 change handles this by generating a lookup table file that maps old sha -> new sha for existing objects, and in storage existing objects are left alone and all new objects are stored according to the new hash.

(Also, I think it might be better to use a new name for the 3.x hashing (something like md5raw) since we have existing dvc files where md5 means dos2unix.)

@pmrowla
Copy link
Contributor Author

pmrowla commented May 30, 2023

For reference, the git sha256 change handles this by generating a lookup table file that maps old sha -> new sha for existing objects, and in storage existing objects are left alone and all new objects are stored according to the new hash.

Keep in mind that this lookup table is probably not a viable option for us since we need to account for the case where a majority of our objects may only be available in a remote. Recomputing hashes to generate this table is straightforward when all of your objects are available locally (like with git), but doing it for a dvc remote would be extremely expensive

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A: data-management Related to dvc add/checkout/commit/move/remove breaking-change
Projects
No open projects
Archived in project
Development

Successfully merging this pull request may close these issues.

drop dos2unix behaviour
2 participants