Other interesting related analyses are available:
- dbaarda/rollsum-tests - analysis of different rollsum algorithms' hashing performance for different applications.
- dbaarda/rollsum-chunking - analysis of different chunking algorithms' (CDC) deduplication performance.
The following rollsum algorithms and implementations are tested:
- 'bupsplit' - the rolling checksum implementation in bup, apparently taken from rsync.
Wikipedia has more historical information on rsync's rolling checksum (it's apparently based on adler-32, itself derived from
Fletcher's checksum - https://en.wikipedia.org/wiki/Rsync). This rollsum targets an average 8k chunk size (anything else is a
deviation from 'true' bupsplit in my eyes, unlike other algorithms which are not prescriptive about chunk size).
- bup rollsum (C + Python)
- https://github.com/aidanhs/rsroll (Rust)
- perkeep rollsum (Go)
- bita rollsum (Rust)
- 'rabin' - the venerable 'Rabin fingerprint', outlined on Wikipedia and somewhat popular in filesystem chunking applications. It's not clear to me whether Rabin actually holds any advantage over a rolling CRC.
- 'split' - not content-defined chunking at all! Fixed-size blocks, included as a baseline for comparison.
- 'gear' - the predecessor to FastCDC (outlined in https://www.usenix.org/system/files/conference/atc16/atc16-paper-xia.pdf)
- 'buzhash' - outlined on Wikipedia, buzhash is a simple algorithm
that uses a random number generator to build its lookup table - for that reason you're unlikely to find two implementations
with matching results (Rabin, by contrast, is parameterised by a single polynomial). A minimal sketch of the gear and buzhash
rolling updates appears after this list.
- bita rollsum (Rust)
- IPFS chunking library (Go)
- libasuran rollsum (Rust)
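
To make the mechanics concrete, here is a minimal, illustrative Rust sketch of the gear and buzhash rolling updates. It is not taken from any of the implementations above: the table seeds, window size and boundary masks are arbitrary choices for the example.

```rust
const WINDOW: usize = 48;              // buzhash rolling-window size (arbitrary)
const BUZ_MASK: u32 = (1 << 13) - 1;   // 13 bits set => ~8k average chunks
const GEAR_MASK: u32 = 0xFFF8_0000;    // top 13 bits set => ~8k average chunks

// Build a lookup table from a seeded PRNG (xorshift32 here). This is why two
// buzhash implementations rarely agree: different tables, different splits.
fn table(seed: u32) -> [u32; 256] {
    let (mut t, mut x) = ([0u32; 256], seed);
    for v in t.iter_mut() {
        x ^= x << 13;
        x ^= x >> 17;
        x ^= x << 5;
        *v = x;
    }
    t
}

// Gear: one shift-and-add per byte; old bytes age out of the hash implicitly.
// Real gear/FastCDC implementations choose the boundary mask more carefully
// (see the FastCDC paper) - checking the top bits is enough for illustration.
fn gear_split_points(data: &[u8]) -> Vec<usize> {
    let t = table(0x1234_5678);
    let mut h = 0u32;
    let mut splits = Vec::new();
    for (i, &b) in data.iter().enumerate() {
        h = (h << 1).wrapping_add(t[b as usize]);
        if (h & GEAR_MASK) == 0 {
            splits.push(i + 1);
        }
    }
    splits
}

// Buzhash (cyclic polynomial): rotate, XOR in the new byte, XOR out the byte
// that just left the window. Boundaries are only considered once the window
// has actually been filled.
fn buzhash_split_points(data: &[u8]) -> Vec<usize> {
    let t = table(0x9E37_79B9);
    let mut h = 0u32;
    let mut splits = Vec::new();
    for (i, &b) in data.iter().enumerate() {
        h = if i >= WINDOW {
            h.rotate_left(1)
                ^ t[data[i - WINDOW] as usize].rotate_left((WINDOW % 32) as u32)
                ^ t[b as usize]
        } else {
            h.rotate_left(1) ^ t[b as usize]
        };
        if i + 1 >= WINDOW && (h & BUZ_MASK) == BUZ_MASK {
            splits.push(i + 1);
        }
    }
    splits
}

fn main() {
    // toy input: 1 MiB of pseudorandom bytes from the same xorshift generator
    let mut x = 0xDEAD_BEEFu32;
    let data: Vec<u8> = (0..(1 << 20))
        .map(|_| {
            x ^= x << 13;
            x ^= x >> 17;
            x ^= x << 5;
            x as u8
        })
        .collect();
    println!("gear: {} splits", gear_split_points(&data).len());
    println!("buzhash: {} splits", buzhash_split_points(&data).len());
}
```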
Comments:
- I partitioned out the 8k and 256k competitors in case the number of splits made a material difference to performance - in practice it makes some difference, but not enough to change the rankings (as you can see from the results)
- currently these benchmark times include the initial load of the file - they shouldn't, but this is currently
mitigated by doing
cat file > /dev/null
to get the file into the page cache first. The benchmark aims to be semi-realistic: given a file in cache, how long does it take to load it and determine the split points? In time I'll probably change this to exclude actually loading the file
You need Podman or Docker (if the latter, just change the command in the Makefile and it should work).
Clone this repo, get the rollsum implementations and build a Docker image with the dependencies:
$ git clone https://github.com/aidanhs/rollsum-tests.git && cd rollsum-tests && git submodule update --init --recursive
$ make container # create an image for the container
Jump into the container and set up the test environment.
NOTE: all the implementations load the test file for a run fully into RAM before starting work. The test files are all below 1.5GB.
$ podman run -it -v $(pwd):/work --tmpfs /tmp --rm rollsum-tests bash
# make testfiles # generate deterministic test files
[...]
# make rebuilddep # wipe and build all the dependencies
# make preptest # build the assorted test programs
[...]
Finally:
# make test # actually run tests
[...]
HOW TO READ THE RESULTS
- Each cell is `time[err,count,pstdev,mem]`. `time` is in seconds, `err` indicates whether the split result failed
to match bup exactly, `count` indicates how many splits there were, `pstdev` indicates the population standard deviation of the
split sizes and `mem` indicates max memory in MB.
- Standard deviation is only calculated for tests with >500 splits, to give a rough judgement of hash
algorithm quality (intuitively I'd expect random data to be a happy case for a hash). A minimal sketch of the pstdev
calculation follows this list.
- To save screen space, the common results from non-erroring lines are collapsed.
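
For clarity, `pstdev` here is the population standard deviation over the chunk sizes derived from consecutive split offsets. A minimal sketch (illustrative only, not the code used by the test harness):

```rust
// Population standard deviation of chunk sizes (illustrative only).
fn pstdev(sizes: &[f64]) -> f64 {
    let n = sizes.len() as f64;
    let mean = sizes.iter().sum::<f64>() / n;
    (sizes.iter().map(|s| (s - mean).powi(2)).sum::<f64>() / n).sqrt()
}

fn main() {
    // hypothetical chunk sizes in bytes
    let sizes = [8192.0, 4100.0, 12000.0, 7800.0];
    println!("pstdev = {:.0}", pstdev(&sizes));
}
```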
FILE SIZE BUP RSROBU BITABU PERKEEP RSROGE IPFSRA IPFSSPL RSROBU256 RSROGE256 BITABZ ASURANBZ IPFSRA256 IPFSSPL256 IPFSBZ
01.test 4.0K 0.0[/,0c,-,6M] 0.0[",",",2M] 0.0[",",",2M] 0.0[",",",2M] 0.0[",",",3M] 0.0[X,1c,-,6M] 0.0[X,1c,-,6M] 0.0[",",",2M] 0.0[",",",2M] 0.0[",",",2M] 0.0[X,1c,-,2M] 0.0[X,1c,-,6M] 0.0[X,1c,-,6M] 0.0[X,1c,-,6M]
02.test 4.0K 0.0[/,0c,-,7M] 0.0[",",",2M] 0.0[",",",2M] 0.0[",",",2M] 0.0[",",",2M] 0.0[X,1c,-,6M] 0.0[X,1c,-,6M] 0.0[",",",2M] 0.0[",",",2M] 0.0[",",",2M] 0.0[X,1c,-,2M] 0.0[X,1c,-,6M] 0.0[X,1c,-,6M] 0.0[X,1c,-,6M]
03.test 8.0K 0.0[/,1c,-,7M] 0.0[",",",2M] 0.0[",",",2M] 0.0[",",",2M] 0.0[X,0c,-,2M] 0.0[X,1c,-,6M] 0.0[X,1c,-,6M] 0.0[X,0c,-,2M] 0.0[X,0c,-,2M] 0.0[X,0c,-,2M] 0.0[X,1c,-,2M] 0.0[X,1c,-,6M] 0.0[X,1c,-,6M] 0.0[X,1c,-,6M]
04.test 64K 0.01[/,11c,-,7M] 0.0[",",",2M] 0.0[",",",2M] 0.0[",",",2M] 0.0[X,6c,-,2M] 0.0[X,10c,-,6M] 0.0[X,8c,-,6M] 0.0[X,0c,-,2M] 0.0[X,0c,-,2M] 0.0[X,0c,-,2M] 0.0[X,1c,-,2M] 0.0[X,1c,-,6M] 0.0[X,1c,-,6M] 0.0[X,1c,-,6M]
05.test 384K 0.0[/,45c,-,7M] 0.0[",",",3M] 0.0[",",",3M] 0.0[",",",2M] 0.0[X,52c,-,3M] 0.0[X,48c,-,8M] 0.0[X,48c,-,7M] 0.0[X,1c,-,3M] 0.0[X,1c,-,3M] 0.0[X,1c,-,3M] 0.0[X,1c,-,3M] 0.0[X,1c,-,8M] 0.0[X,2c,-,7M] 0.0[X,2c,-,7M]
06.test 1.7M 0.01[/,188c,-,8M] 0.0[",",",4M] 0.0[X,188c,-,5M] 0.0[",",",4M] 0.0[X,211c,-,4M] 0.01[X,203c,-,11M] 0.0[X,206c,-,9M] 0.0[X,8c,-,4M] 0.0[X,7c,-,4M] 0.01[X,7c,-,6M] 0.01[X,6c,-,5M] 0.0[X,6c,-,12M] 0.0[X,7c,-,9M] 0.0[X,7c,-,9M]
07.test 5.5M 0.01[/,711c,7478,12M] 0.0[",",",8M] 0.03[X,711c,7478,9M] 0.02[",",",8M] 0.0[X,729c,8435,8M] 0.04[X,680c,3405,19M] 0.0[X,704c,89,18M] 0.01[X,27c,-,8M] 0.0[X,32c,-,8M] 0.02[X,24c,-,10M] 0.03[X,21c,-,9M] 0.03[X,21c,-,18M] 0.0[X,22c,-,18M] 0.0[X,26c,-,18M]
08.test 16M 0.03[/,2010c,8554,22M] 0.02[",",",18M] 0.1[X,2012c,8554,19M] 0.06[",",",19M] 0.01[X,2178c,8358,19M] 0.11[X,2019c,3438,39M] 0.0[X,2048c,0,39M] 0.03[X,85c,-,19M] 0.02[X,57c,-,18M] 0.08[X,73c,-,21M] 0.09[X,49c,-,20M] 0.1[X,64c,-,39M] 0.0[X,64c,-,39M] 0.02[X,65c,-,39M]
09.test 42M 0.07[/,5094c,8506,47M] 0.07[",",",44M] 0.25[X,5104c,8506,45M] 0.17[",",",44M] 0.02[X,5549c,8232,43M] 0.26[X,5154c,3467,89M] 0.01[X,5255c,31,90M] 0.07[X,213c,-,43M] 0.04[X,160c,-,43M] 0.19[X,163c,-,48M] 0.24[X,135c,-,44M] 0.23[X,155c,-,90M] 0.0[X,165c,-,90M] 0.04[X,172c,-,90M]
10.test 96M 0.18[/,12260c,8176,102M] 0.16[",",",98M] 0.59[X,12265c,8176,99M] 0.39[",",",100M] 0.06[X,12566c,8644,98M] 0.6[X,11970c,3435,199M] 0.03[X,12208c,72,201M] 0.16[X,471c,-,98M] 0.11[X,410c,-,98M] 0.45[X,366c,-,103M] 0.57[X,297c,-,99M] 0.54[X,370c,-,199M] 0.02[X,382c,-,199M] 0.11[X,400c,-,199M]
11.test 205M 0.34[/,26205c,8239,211M] 0.32[",",",207M] 1.3[X,26211c,8239,208M] 0.82[",",",213M] 0.14[X,27406c,8446,207M] 1.3[X,25630c,3425,420M] 0.08[X,26167c,7,422M] 0.35[X,1011c,223003,207M] 0.23[X,803c,263320,207M] 0.96[X,836c,261952,211M] 1.19[X,639c,249044,208M] 1.15[X,799c,111378,418M] 0.03[X,818c,2618,419M] 0.19[X,818c,111677,417M]
12.test 411M 0.72[/,52107c,8292,416M] 0.67[",",",412M] 2.57[X,52088c,8292,414M] 1.69[",",",426M] 0.26[X,54428c,8554,412M] 2.57[X,51216c,3439,835M] 0.14[X,52488c,0,840M] 0.72[X,1981c,212794,412M] 0.47[X,1787c,242932,412M] 1.93[X,1637c,272110,418M] 2.38[X,1365c,231810,414M] 2.27[X,1611c,110133,831M] 0.07[X,1641c,4852,833M] 0.38[X,1672c,110754,831M]
13.test 778M 1.3[/,99447c,8185,784M] 1.26[",",",780M] 4.97[X,99454c,8185,782M] 3.21[",",",806M] 0.5[X,103607c,8485,780M] 5.15[X,97600c,3443,1578M] 0.2[X,99577c,13,1587M] 1.4[X,3680c,228807,780M] 0.92[X,3166c,264622,780M] 3.96[X,3111c,260256,786M] 4.94[X,2623c,233203,781M] 4.58[X,3017c,110748,1570M] 0.12[X,3112c,1101,1574M] 0.77[X,3254c,107475,1568M]
14.test 1.4G 2.42[/,179861c,8153,1414M] 2.26[",",",1410M] 9.28[X,179978c,8153,1412M] 5.87[",",",1456M] 1.01[X,188641c,8448,1410M] 9.66[X,176327c,3444,2850M] 0.46[X,180151c,19,2866M] 2.46[X,6726c,216579,1410M] 1.66[X,5821c,265289,1410M] 6.82[X,5679c,254793,1416M] 8.56[X,4498c,243249,1411M] 8.11[X,5506c,111157,2835M] 0.24[X,5630c,1088,2841M] 1.34[X,5768c,110212,2833M]
You can limit the test files with e.g. `make TEST_FILES="test/01.test test/02.test" test`.
- Of the three identical implementations ('bupsplit'):
- rsroll edges out bup by small but consistent amounts, potentially due to python overhead
- perkeep and bita are slow by comparison
- All of the IPFS implementations seem to be leaking memory pretty badly (~1x the amount being split)
- IPFS split (plain fixed-size splitting) is a nice baseline against which to consider the other implementations
- Gear in rsroll and buzhash seem comparable; which one 'wins' fluctuates from run to run -
I will be improving the benchmarks over time to make them more consistent
- given FastCDC is claimed to be significantly faster than gear, this is probably one to look at
- some of the standard deviations are artificially reduced by min and max chunk size limits - IPFS in particular is very aggressive with this (disabling those limits puts its buzhash and rabin on a par with the bita buzhash) - and it remains to be seen in future analyses how much these limits cripple the chunking. The sketch below shows how such limits are typically applied.
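
The following is a generic, hypothetical sketch of how chunkers commonly enforce min/max chunk sizes on top of a rolling-hash boundary test (`is_boundary` stands in for any of the hashes above, it is not an API from these libraries); the forced cut at `max` is exactly what compresses the reported pstdev:

```rust
// Generic sketch of min/max chunk-size enforcement layered over a
// rolling-hash boundary test.
fn chunk_sizes(data: &[u8], min: usize, max: usize,
               mut is_boundary: impl FnMut(u8) -> bool) -> Vec<usize> {
    let mut sizes = Vec::new();
    let mut len = 0usize;
    for &b in data {
        let hit = is_boundary(b);
        len += 1;
        // boundaries inside the first `min` bytes are ignored,
        // and a cut is forced once `max` bytes have accumulated
        if (hit && len >= min) || len >= max {
            sizes.push(len);
            len = 0;
        }
    }
    if len > 0 {
        sizes.push(len);
    }
    sizes
}

fn main() {
    // toy data; treat a zero byte as a "hash hit" purely for demonstration
    let data: Vec<u8> = (0u32..200_000).map(|i| (i * 31 % 251) as u8).collect();
    let sizes = chunk_sizes(&data, 2048, 16384, |b| b == 0);
    println!("{} chunks", sizes.len());
}
```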
To regenerate sums:
for f in test/*.test; do ./test_rust rsroll-bup $f | sha1sum > $f.sum; done
Bugs generally:
- ipfs/notes#1 (comment) - I strongly agree with this comment; I also found most of them were wrong when I was writing my own
Bugs found/comments inspired by this repo:
TODO:
- it's not useful to have splits before the window has been filled - fix it
- partition out the set of tests into
- 'correctness' (via patches to unify+compare implementations with each other)
- 'quality' (via patches to remove chunk limits for standard deviation analysis, then using that to see how much the limits cause non-chunked splitting)
- 'performance' (i.e. just running the implementations as-is)
- have implementations load the data, mlock it and then do the timing themselves (sketched, together with the repeat-measurements item below, after this list)
- add the buzhash implementation from attic - https://github.com/jborg/attic/blob/master/attic/_chunker.c
- add https://github.com/nlfiedler/fastcdc-rs/blob/master/src/lib.rs
- add librsync
- have the benchmarks repeat the measurements 10 times (within the test programs themselves)
- even better, actually do benchmarking properly - take some inspiration from statistics, e.g. the Go benchmarking used by
go-ipfs-chunker, cargo bench or https://jsperf.com/
- add SSE2-optimised rabin https://github.com/kevinko/rabin
- add restic
- add tarsnap https://github.com/Tarsnap/tarsnap/blob/master/tar/multitape/chunkify.c
- read some of the papers on ipfs/go-ipfs-chunker#7 (comment) to see if there are any other interesting things
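
For the mlock/timing and repeat-measurements items above, a hypothetical harness skeleton might look like the following. It assumes the `libc` crate as a dependency, and `find_splits` is a placeholder rather than a function from any implementation tested here:

```rust
use std::time::{Duration, Instant};

// Placeholder for whichever split-point implementation is under test.
fn find_splits(_data: &[u8]) -> Vec<usize> {
    Vec::new()
}

fn main() -> std::io::Result<()> {
    let path = std::env::args().nth(1).expect("usage: bench <file>");
    let data = std::fs::read(&path)?;

    // Pin the buffer in RAM so page faults don't land inside the timed region
    // (assumes the `libc` crate in Cargo.toml).
    unsafe {
        libc::mlock(data.as_ptr() as *const libc::c_void, data.len());
    }

    // Repeat the measurement and keep the fastest run.
    let mut best = Duration::MAX;
    let mut count = 0;
    for _ in 0..10 {
        let start = Instant::now();
        count = find_splits(&data).len();
        best = best.min(start.elapsed());
    }
    println!("{} splits, best of 10: {:?}", count, best);
    Ok(())
}
```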