Ensure that Rabin fingerprinting works with large datasets #134
Comments
Rabin sharding, from what I understand, isn't that cheap, and that is why it chokes on large inputs. Before we invest time in it, I would recommend checking whether it actually gives any benefit at several levels: within a single file, across files in a directory, and across datasets, compared with normal chunking. From what I understand, it might also prevent some other deduplication from happening.
I appreciate your caution. In theory, Rabin fingerprinting should be beneficial for exactly this case, where many people have downloaded the same datasets from the same sources but may have slight variations in their copies. Our default chunking algorithm (fixed-size 256 KiB chunks) prevents them from even attempting to deduplicate those files; a toy sketch below illustrates the difference. People like @20zinnm are motivated to test how the code performs for this use case, and I want to make sure the code is ready for them to proceed. Keep in mind:
See: #136
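For illustration, here is a minimal, self-contained sketch (not IPFS code) of why content-defined chunking can deduplicate slightly shifted copies while fixed-size chunking cannot. A buzhash-style rolling hash stands in for a real Rabin polynomial fingerprint, and the window size, 13-bit boundary mask, and 2 MiB test size are arbitrary illustrative values; only the 256 KiB figure matches the default fixed-size chunker mentioned above.

```python
import hashlib
import os
import random

FIXED_SIZE = 256 * 1024            # matches the default fixed-size chunker (256 KiB)
WINDOW = 48                        # rolling-hash window (illustrative)
MASK = (1 << 13) - 1               # cut where the low 13 hash bits are zero (~8 KiB avg chunks)

random.seed(0)
TABLE = [random.getrandbits(32) for _ in range(256)]   # byte -> random 32-bit value

def rotl32(x: int, n: int) -> int:
    n %= 32
    return ((x << n) | (x >> (32 - n))) & 0xFFFFFFFF

def fixed_chunks(data: bytes):
    """Fixed-size chunking: every boundary shifts if bytes are inserted upstream."""
    return [data[i:i + FIXED_SIZE] for i in range(0, len(data), FIXED_SIZE)]

def cdc_chunks(data: bytes):
    """Toy content-defined chunking: cut wherever a rolling hash over the last
    WINDOW bytes matches MASK, so boundaries depend only on local content.
    Real chunkers (including Rabin) also enforce min/max chunk sizes."""
    chunks, start, h = [], 0, 0
    for i, b in enumerate(data):
        h = rotl32(h, 1) ^ TABLE[b]
        if i >= WINDOW:
            h ^= rotl32(TABLE[data[i - WINDOW]], WINDOW)   # roll the oldest byte out
        if i >= WINDOW and (h & MASK) == 0:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks

def shared(a, b) -> float:
    """Fraction of a's distinct chunks that also appear in b."""
    ha = {hashlib.sha256(c).digest() for c in a}
    hb = {hashlib.sha256(c).digest() for c in b}
    return len(ha & hb) / len(ha)

original = os.urandom(2 * 1024 * 1024)
modified = os.urandom(1000) + original     # "slight variation": 1000 bytes inserted up front

print("fixed-size chunks shared:      %3.0f%%" % (100 * shared(fixed_chunks(original), fixed_chunks(modified))))
print("content-defined chunks shared: %3.0f%%" % (100 * shared(cdc_chunks(original), cdc_chunks(modified))))
```

With these parameters, the fixed-size run should report roughly 0% shared chunks (every boundary shifts by the length of the insertion), while the content-defined run should report close to 100%, because its boundaries re-synchronise shortly after the edited region.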
@flyingzumwalt It's still very heavy in terms of performance and needs high-spec hardware for anything larger than a few gigabytes. But yes, in principle it should work.
From https://botbot.me/freenode/ipfs/2017-01-29/?msg=80105342&page=1
@whyrusleeping Could you re-add the test dataset from #126 using Rabin fingerprinting to make sure it doesn't choke?
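If someone does re-run the #126 dataset, a rough harness along these lines could compare the default chunker with Rabin. This assumes a locally installed go-ipfs `ipfs` binary whose `add` command accepts a `--chunker` option (values such as `size-262144`, `rabin`, or `rabin-<min>-<avg>-<max>`); the dataset path is a placeholder, and memory usage would still need to be watched separately with OS tools, since that is where the choking tends to show up.

```python
# Rough timing harness (sketch): add the same dataset with the default
# fixed-size chunker and with the Rabin chunker, and compare wall-clock time.
# Assumes `ipfs add` supports the `--chunker` option; DATASET is a placeholder.
import subprocess
import time

DATASET = "/path/to/large-dataset"   # placeholder (e.g. the dataset from #126)

def timed_add(chunker: str) -> float:
    start = time.perf_counter()
    subprocess.run(
        ["ipfs", "add", "-r", "-q", f"--chunker={chunker}", DATASET],
        check=True,
        stdout=subprocess.DEVNULL,
    )
    return time.perf_counter() - start

for chunker in ("size-262144", "rabin"):   # default 256 KiB fixed chunks vs. Rabin
    print(f"{chunker}: {timed_add(chunker):.1f}s")
```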