remote/cache: consider de-duplication for .dir files #3791
Labels:
- enhancement (Enhances DVC)
- p2-medium (Medium priority, should be done, but less important)
- performance (improvement over resource / time consuming tasks)
- research
Currently, when handling directories, we need to generate the JSON .dir file containing the MD5 + relpath for every file in the directory. When we add or modify a file in that directory, we have to create a new .dir file containing the full directory contents for the new directory revision. For very large directories this amounts to a significant amount of storage: for a directory with 1M files, even if only a single file has changed between two revisions, we currently generate and store two separate (nearly identical) JSON files, each containing 1M entries.
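To make the duplication concrete, here is a minimal sketch of the situation described above. The entry layout mirrors the MD5 + relpath structure of a .dir file, but the hashes and paths are made up for illustration:

```python
# Illustrative sketch of a .dir listing: a JSON array of
# {"md5": ..., "relpath": ...} entries, one per file in the directory.
# Hashes and paths are invented for the example.
v1 = [
    {"md5": "aaa111", "relpath": "data/part-0001.csv"},
    {"md5": "bbb222", "relpath": "data/part-0002.csv"},
    {"md5": "ccc333", "relpath": "data/part-0003.csv"},
]

# Revision 2: only part-0002.csv changed, yet the entire listing
# must be serialized and stored again as a new .dir file.
v2 = [
    {"md5": "aaa111", "relpath": "data/part-0001.csv"},
    {"md5": "ddd444", "relpath": "data/part-0002.csv"},
    {"md5": "ccc333", "relpath": "data/part-0003.csv"},
]

shared = {(e["md5"], e["relpath"]) for e in v1} & \
         {(e["md5"], e["relpath"]) for e in v2}
print(f"{len(shared)} of {len(v2)} entries are identical between revisions")
```

Scaled up to 1M entries, almost the whole second file is redundant storage.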
Ideally we should not be duplicating data between these directory versions. For a new directory version, we should only be storing data for the files which have changed between revisions.
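One way to realize "only store what changed" is a simple delta format: keep the full base listing once, then store each subsequent revision as a set of upserted and removed entries. A rough sketch, with structure and function names assumed purely for illustration (not an actual DVC format):

```python
def make_delta(base, new):
    """Compute a delta between two .dir-style listings.

    base, new: lists of {"md5": ..., "relpath": ...} entries.
    Returns only the entries that differ, not the full listing.
    """
    base_map = {e["relpath"]: e["md5"] for e in base}
    new_map = {e["relpath"]: e["md5"] for e in new}
    return {
        "upsert": [{"relpath": p, "md5": h}
                   for p, h in new_map.items() if base_map.get(p) != h],
        "remove": [p for p in base_map if p not in new_map],
    }

def apply_delta(base, delta):
    """Reconstruct the full listing for the new revision from base + delta."""
    result = {e["relpath"]: e["md5"] for e in base}
    for p in delta["remove"]:
        del result[p]
    for e in delta["upsert"]:
        result[e["relpath"]] = e["md5"]
    return [{"md5": h, "relpath": p} for p, h in sorted(result.items())]

base = [{"md5": "aaa111", "relpath": "data/part-0001.csv"},
        {"md5": "bbb222", "relpath": "data/part-0002.csv"}]
new = [{"md5": "aaa111", "relpath": "data/part-0001.csv"},
       {"md5": "ddd444", "relpath": "data/part-0002.csv"}]

delta = make_delta(base, new)
restored = apply_delta(base, delta)
```

For a 1M-file directory with one changed file, the delta would hold a single entry instead of 1M. The trade-off is that reconstructing an old revision may require replaying a chain of deltas, which is part of what would need researching.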
Essentially, we want to store the diff between two directory trees rather than two full directory trees, but exactly how we should store that needs to be researched. One suggestion was to look into how git versions directory trees (discord context).
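For reference, git's approach is content-addressed tree objects rather than deltas: each directory is hashed over its entries, so an unchanged subtree hashes identically in both revisions and is stored exactly once. A minimal sketch of that idea, with all names and the JSON serialization assumed for illustration only:

```python
import hashlib
import json

# Hedged sketch of git-style tree dedup. Each directory becomes a
# content-addressed "tree object"; identical subtrees collapse to
# one stored object. Not DVC's (or git's) actual on-disk format.
store = {}  # object hash -> serialized tree object

def put_tree(entries):
    """Store a tree object; entries is a sorted list of
    (name, kind, hash) tuples. Identical subtrees dedupe here."""
    blob = json.dumps(entries, sort_keys=True).encode()
    h = hashlib.sha256(blob).hexdigest()
    store[h] = blob
    return h

def build(tree):
    """tree: {name: file_md5 or nested dict}; returns the tree hash."""
    entries = []
    for name in sorted(tree):
        v = tree[name]
        if isinstance(v, dict):
            entries.append((name, "tree", build(v)))
        else:
            entries.append((name, "file", v))
    return put_tree(entries)

# Two revisions differing only inside subdirectory "b".
rev1 = {"a": {"x": "md5_1", "y": "md5_2"}, "b": {"z": "md5_3"}}
rev2 = {"a": {"x": "md5_1", "y": "md5_2"}, "b": {"z": "md5_4"}}

build(rev1)  # stores trees for "a", "b", and the root: 3 objects
build(rev2)  # "a" is unchanged and dedupes; only "b" + root are new
```

Unlike the delta approach, any revision can be read directly from its root hash without replaying a chain of diffs; the cost is that a change near the root still rewrites the trees on the path to it.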
This should especially be considered now that we are discussing other potential changes to our cache structure.