remote/cache: consider de-duplication for .dir files #3791

Closed
pmrowla opened this issue May 13, 2020 · 1 comment
Labels
enhancement (Enhances DVC), p2-medium (Medium priority, should be done, but less important), performance (improvement over resource / time consuming tasks), research

Comments

@pmrowla
Contributor

pmrowla commented May 13, 2020

Currently, when handling directories, we need to generate the JSON .dir file containing the MD5 + relpath for every file in the directory. When we add or modify a file in that directory, we have to create a new .dir file containing the full directory contents for the new directory revision. For very large directories, this adds up to a significant amount of storage. For a dir with 1M files, even if only a single file has changed between two revisions, we currently need to generate and store two separate (nearly identical) JSON files, each containing 1M entries.
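
To make the cost concrete, here is a minimal sketch, assuming a .dir file is a JSON list of MD5 + relpath entries as described above. The helper `build_dir_manifest` and the in-memory `files` mapping are hypothetical and for illustration only; this is not DVC's actual code.

```python
import hashlib
import json


def build_dir_manifest(files):
    """Return a JSON manifest listing the MD5 and relpath of every file.

    `files` maps relpath -> file contents (bytes). Hypothetical helper,
    not DVC's actual implementation.
    """
    entries = [
        {"md5": hashlib.md5(data).hexdigest(), "relpath": relpath}
        for relpath, data in sorted(files.items())
    ]
    return json.dumps(entries, sort_keys=True)


# Two revisions of a directory that differ in a single file.
rev1 = {f"data/{i:04d}.txt": b"contents %d" % i for i in range(1000)}
rev2 = dict(rev1)
rev2["data/0042.txt"] = b"modified"

manifest1 = build_dir_manifest(rev1)
manifest2 = build_dir_manifest(rev2)

# Compare the two manifests entry by entry (both are sorted by relpath).
entries1 = json.loads(manifest1)
entries2 = json.loads(manifest2)
changed = [a["relpath"] for a, b in zip(entries1, entries2) if a != b]
```

Both manifests hold the full entry list, yet `changed` contains only the one modified path. Every other entry is stored twice.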

Ideally we should not be duplicating data between these directory versions. For a new directory version, we should only be storing data for the files which have changed between revisions.
Essentially we want to store the diff between two directory trees, rather than two full directory trees, but exactly how we should store that needs to be researched. One suggestion was that we look into how git versions directory trees (discord context).
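
One possible shape for this, sketched under assumptions: each directory revision is represented as a `{relpath: md5}` mapping, and a new revision is stored as a delta against its parent. The names `diff_manifests` and `apply_delta` and the delta layout are hypothetical, not an existing DVC API or an agreed design.

```python
def diff_manifests(old, new):
    """Compute a delta that turns the `old` mapping into `new`.

    Both arguments map relpath -> md5. Hypothetical format.
    """
    return {
        # Added or modified files, keyed by relpath.
        "changed": {p: h for p, h in new.items() if old.get(p) != h},
        # Files present in the old revision but gone from the new one.
        "removed": sorted(p for p in old if p not in new),
    }


def apply_delta(old, delta):
    """Rebuild the full new manifest from the parent manifest and a delta."""
    removed = set(delta["removed"])
    result = {p: h for p, h in old.items() if p not in removed}
    result.update(delta["changed"])
    return result


old = {"a.txt": "md5-a", "b.txt": "md5-b", "c.txt": "md5-c"}
new = {"a.txt": "md5-a", "b.txt": "md5-b2", "d.txt": "md5-d"}

delta = diff_manifests(old, new)
rebuilt = apply_delta(old, delta)
```

The delta grows with the number of changed files rather than with the size of the directory. The trade-off is that reading a revision means resolving its chain of parent deltas. Git takes a different route: it stores snapshot tree objects that reference blobs and subtrees by hash, so unchanged subtrees are shared between revisions instead of being diffed.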

This should especially be considered now that we are discussing other potential changes to our cache structure.

@pmrowla added the enhancement, research, and performance labels May 13, 2020
@efiop added the p2-medium label May 14, 2020
@efiop efiop mentioned this issue Dec 4, 2020
11 tasks
@efiop
Contributor

efiop commented Dec 4, 2020

Closing in favor of #829 (comment)

@efiop efiop closed this as completed Dec 4, 2020