Hi @aparpara, have you experienced a hash collision while using DVC in the real world?
Because the DVC cache relies entirely on MD5 hash sums, it is vulnerable to MD5 collisions: if two different files have the same MD5, they are treated as identical. I understand the probability of this event is about 2^-64, but according to Murphy's law, sooner or later it will happen. IMHO this should be mentioned explicitly in the DVC documentation so that users are aware of the possibility. Unfortunately, I found no such mention.
To reproduce the collision, one can use the example from Wikipedia:
dvc init --no-scm
dvc config cache.type reflink,hardlink,symlink
dvc add md5coll-1.bin md5coll-2.bin
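To check independently whether two files really collide (same MD5, different content), a small script can compare the MD5 digests against a second, stronger hash. This is only a sketch and not part of DVC: `find_collisions` and its `weak`/`strong` parameters are hypothetical names, and using SHA-256 as the second hash is my assumption.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def md5_hex(data: bytes) -> str:
    """Digest assumed to match what a content-addressed cache keys on."""
    return hashlib.md5(data).hexdigest()


def sha256_hex(data: bytes) -> str:
    """Stronger digest used only to tell truly different contents apart."""
    return hashlib.sha256(data).hexdigest()


def find_collisions(paths, weak=md5_hex, strong=sha256_hex):
    """Return groups of paths whose weak digests match but whose contents differ.

    Hypothetical helper: `weak` and `strong` are pluggable so the logic can be
    exercised without real MD5-colliding inputs.
    """
    # weak digest -> {strong digest: [paths with that content]}
    by_weak = defaultdict(dict)
    for path in paths:
        data = Path(path).read_bytes()
        by_weak[weak(data)].setdefault(strong(data), []).append(str(path))

    collisions = []
    for groups in by_weak.values():
        # More than one strong digest under one weak digest means a collision.
        if len(groups) > 1:
            collisions.append(sorted(p for ps in groups.values() for p in ps))
    return collisions
```

For the files above, `find_collisions(["md5coll-1.bin", "md5coll-2.bin"])` would return one group containing both paths if they are indeed a colliding pair, and an empty list otherwise.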