Ensure that Rabin fingerprinting works with large datasets #134
Comments
Rabin sharding, from what I understand, isn't that cheap, and that is why it chokes on large inputs. Before we invest time in it, I would recommend checking whether it actually gives any benefit at several levels: within a single file, across files in a directory, and across datasets, compared with normal chunking. From what I understand, it might also prevent some other deduplication from happening.
I appreciate your caution. In theory, Rabin fingerprinting should be beneficial for exactly this case, where many people have downloaded the same datasets from the same sources but may have slight variations in their copies. Our default chunking algorithm (fixed-size 256 KiB chunks) prevents them from even attempting to deduplicate those files; a toy sketch below illustrates the difference. People like @20zinnm are motivated to test how the code performs for this use case, and I want to make sure the code is ready for them to proceed. Keep in mind:
See: #136
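For illustration, here is a minimal, self-contained sketch (not IPFS code) of why content-defined chunking can deduplicate slightly shifted copies while fixed-size chunking cannot. A buzhash-style rolling hash stands in for a real Rabin polynomial fingerprint, and the window size, 13-bit boundary mask, and 2 MiB test size are arbitrary illustrative values; only the 256 KiB figure matches the default fixed-size chunker mentioned above.

```python
import hashlib
import os
import random

FIXED_SIZE = 256 * 1024            # matches the default fixed-size chunker (256 KiB)
WINDOW = 48                        # rolling-hash window (illustrative)
MASK = (1 << 13) - 1               # cut where the low 13 hash bits are zero (~8 KiB avg chunks)

random.seed(0)
TABLE = [random.getrandbits(32) for _ in range(256)]   # byte -> random 32-bit value

def rotl32(x: int, n: int) -> int:
    n %= 32
    return ((x << n) | (x >> (32 - n))) & 0xFFFFFFFF

def fixed_chunks(data: bytes):
    """Fixed-size chunking: every boundary shifts if bytes are inserted upstream."""
    return [data[i:i + FIXED_SIZE] for i in range(0, len(data), FIXED_SIZE)]

def cdc_chunks(data: bytes):
    """Toy content-defined chunking: cut wherever a rolling hash over the last
    WINDOW bytes matches MASK, so boundaries depend only on local content.
    Real chunkers (including Rabin) also enforce min/max chunk sizes."""
    chunks, start, h = [], 0, 0
    for i, b in enumerate(data):
        h = rotl32(h, 1) ^ TABLE[b]
        if i >= WINDOW:
            h ^= rotl32(TABLE[data[i - WINDOW]], WINDOW)   # roll the oldest byte out
        if i >= WINDOW and (h & MASK) == 0:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks

def shared(a, b) -> float:
    """Fraction of a's distinct chunks that also appear in b."""
    ha = {hashlib.sha256(c).digest() for c in a}
    hb = {hashlib.sha256(c).digest() for c in b}
    return len(ha & hb) / len(ha)

original = os.urandom(2 * 1024 * 1024)
modified = os.urandom(1000) + original     # "slight variation": 1000 bytes inserted up front

print("fixed-size chunks shared:      %3.0f%%" % (100 * shared(fixed_chunks(original), fixed_chunks(modified))))
print("content-defined chunks shared: %3.0f%%" % (100 * shared(cdc_chunks(original), cdc_chunks(modified))))
```

With these parameters, the fixed-size run should report roughly 0% shared chunks (every boundary shifts by the length of the insertion), while the content-defined run should report close to 100%, because its boundaries re-synchronise shortly after the edited region.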
@flyingzumwalt It's still very heavy in terms of performance and needs high-spec hardware for anything larger than a few gigabytes. But yes, in principle it should work.
From https://botbot.me/freenode/ipfs/2017-01-29/?msg=80105342&page=1
@whyrusleeping Could you re-add the test dataset from #126 using Rabin fingerprinting to make sure it doesn't choke?
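If someone does re-run the #126 dataset, a rough harness along these lines could compare the default chunker with Rabin. This assumes a locally installed go-ipfs `ipfs` binary whose `add` command accepts a `--chunker` option (values such as `size-262144`, `rabin`, or `rabin-<min>-<avg>-<max>`); the dataset path is a placeholder, and memory usage would still need to be watched separately with OS tools, since that is where the choking tends to show up.

```python
# Rough timing harness (sketch): add the same dataset with the default
# fixed-size chunker and with the Rabin chunker, and compare wall-clock time.
# Assumes `ipfs add` supports the `--chunker` option; DATASET is a placeholder.
import subprocess
import time

DATASET = "/path/to/large-dataset"   # placeholder (e.g. the dataset from #126)

def timed_add(chunker: str) -> float:
    start = time.perf_counter()
    subprocess.run(
        ["ipfs", "add", "-r", "-q", f"--chunker={chunker}", DATASET],
        check=True,
        stdout=subprocess.DEVNULL,
    )
    return time.perf_counter() - start

for chunker in ("size-262144", "rabin"):   # default 256 KiB fixed chunks vs. Rabin
    print(f"{chunker}: {timed_add(chunker):.1f}s")
```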