Research/quantify performance envelopes of multiple CDC algorithms #227
Comments
My ideas for IPFS challenge:
|
Hi! I've been reading up on IPFS recently and stumbled across a mention of chunking, which got me interested because it's sort of a side hobby (manifesting in a repo I occasionally maintain to compare rolling checksum implementations - https://github.com/aidanhs/rollsum-tests). Searches led me to ipfs-inactive/archives#137, then ipfs/go-ipfs-chunker#18 and now here! I don't know how interesting/useful my input may be, so I'll just make a series of comments - happy to elaborate on any, or feel free to ignore 😄

1 - speed of Rabin

There's a lot of discussion of the current Rabin implementation being slow. I've not seen quantification of this, other than ipfs-inactive/archives#137. Unfortunately the discussion is a bit muddled, but there's mention of a 16 hour run for 61GB (modarchive). This number isn't credible to me as being caused by Rabin - the benchmarking I've done (results available on my repo) indicates that the IPFS Rabin with a 256KB average chunk (

Don't get me wrong, there's improvement to be had! With my SSD and a clean cache I can tar a 9GB directory of ~90k files and send to /dev/null in 10s (or /dev/zero in 1min10s) - it'd be nice to have something on a par. But I'd be interested to see measurements indicating it's a limiting factor.

2 - purpose of CDC

Being a newcomer to IPFS I'm probably missing context (having just read a bunch of issues), but there seem to be quite high expectations of what chunking can do for deduplication and the like. I personally would expect it to deduplicate A) files that are identical and B) 'items' within other files (.tar, .a, disk images, maybe object sections within executables)... and fixed size chunking already does A! So I would expect essentially no space saving for everything in the comment above, apart from the Linux repos. A different way to look at it is that CDC "avoids the pathologically poor cases of fixed size chunking and handles archives well", rather than making it an order of magnitude better in the general case.
I've seen a number of issues about format-specific chunking, which sounds awesome, but is orthogonal to (and possibly partially in conflict with) CDC. To be clear, I'm a huge fan of CDC (you can probably tell), but I do think it's important to set expectations.

3 - chunking configuration and algorithm choice

Based on the above, we want chunks on average a fair amount smaller than the typical file size... otherwise, in case B the chunks will span multiple embedded items and be dependent on their ordering in the container! IPFS currently uses 512KB - this is pretty large for CDC (attic uses 64KB, anything using bupsplit (perkeep, bup) uses 8KB). Based on the above, it would then be best suited for files approx >2MB (small files would still work, they just wouldn't benefit). I realise there may be other influencing factors, e.g. IPFS performance with small files.

For the algorithm itself, you'd like something fast! But there's more to it than that - it'd also be nice to have something that has a small standard deviation of chunk sizes. If there's a large standard deviation you're more regularly hitting the min or max limits and losing the benefits of CDC. The current IPFS buzhash relies heavily on the chunk limits - it seems to actually be configured for 128KB splits, but the aggressive limits raise the average to 256KB. My current results don't show this, but it's basically on a par with Gear from rsroll (

4 - final misc notes
(cc ipfs/go-ipfs-chunker#13 which mentions bup) |
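A back-of-the-envelope sketch of the min/max effect described in point 3 above. It assumes an idealised chunker in which, once the minimum size is passed, every byte is a cut point independently with probability 1/target, and the maximum size forces a cut if none was found. The function name and the 128 KiB / 128 KiB / 512 KiB values are assumptions picked to mirror the description in the comment, not values read from the go-ipfs source.

```python
KIB = 1024


def expected_chunk_size(target, min_size, max_size):
    """Mean chunk size and forced-cut probability for an idealised CDC chunker.

    Model (an assumption, not a measurement): after skipping min_size bytes,
    each byte is a cut point with probability p = 1/target, independently.
    If no cut point is found before max_size, the chunk is cut at max_size.
    """
    p = 1.0 / target
    window = max_size - min_size
    # Probability that no content-defined cut is found inside the window,
    # so the max limit has to force the cut.
    forced = (1.0 - p) ** window
    # Mean of a geometric variable truncated at `window`, plus the skipped min.
    mean = min_size + (1.0 - forced) / p
    return mean, forced


# Hypothetical parameters: 128 KiB target, 128 KiB min, 512 KiB max.
mean, forced = expected_chunk_size(128 * KIB, 128 * KIB, 512 * KIB)
print(f"mean chunk size: {mean / KIB:.1f} KiB, forced cuts: {forced:.1%}")
```

Under these assumptions the mean comes out at roughly min + target (about 250 KiB here), with around 5% of chunks cut by the max limit rather than by content.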
@aidanhs if possible, please read the lists in gilbertchen/benchmarking#13 and gilbertchen/benchmarking#14 to get a consensus on chunking, because I would like to know which chunk size is the most common. |
For those following along: this weekend I finally managed to arrange incoming streams in a way that allows for building trees in parallel, while still being able to spit out a DAG 100% converging with The tool itself should show up in its final version very early in April 🤞 Some very rough / unscientific numbers taken from a
|
If one moves to proper OS/hardware (
with zstandard out of the way ( it tops out at about 3GiB/s )
And with the pipe out of the way ( buffers max out at 16MiB on Linux )
Not too bad 😈 |
As one of the maintainers of librsync who wrote one of the early "bupsplit" implementations many people have copied, I've recently introduced a RabinKarp polyhash as the new default instead of the "bupsplit" rollsum, because my analysis showed "bupsplit" had terrible hash distribution and collisions, particularly for small block/window sizes (<16K) and ASCII data, and particularly in the low "s1" end of the digest. Using RabinKarp made calculating signatures about 10% slower because the hash itself is more expensive to calculate, but it makes calculating deltas about 10% faster because of significantly fewer rollsum hash collisions, and calculating deltas is the expensive operation (for 1GB files, it adds about 1sec to calculating sigs and saves more than 10secs calculating deltas). See https://github.com/dbaarda/librsync-tests/blob/master/RESULTS.rst for details.

The poor distribution probably doesn't matter quite as much for chunking, but with the tiny windows used for chunking (32 bytes?) and checking the LSB bits of the digest, you may be hitting the degenerate worst-case behavior for bupsplit. This would probably manifest as a chunk-size distribution that varies considerably with different file types (ASCII vs random) and doesn't reflect the target average block size.

I also found cyclic-poly AKA buzhash had terrible collision rates for ASCII data... again, probably not something that matters for chunking. See https://github.com/dbaarda/rollsum-tests/blob/master/RESULTS.rst for details. |
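For readers unfamiliar with the kind of rolling hash discussed above, here is a minimal sketch of a polynomial (Rabin-Karp style) rolling hash over a small sliding window. It is not librsync's implementation: the multiplier is just an odd constant chosen for illustration, and the class and function names are hypothetical. The loop at the end checks that rolling the window one byte at a time gives the same value as rehashing the window from scratch.

```python
import random

MOD_MASK = 0xFFFFFFFF  # keep the hash in 32 bits
MULT = 0x08104225      # illustrative odd multiplier (an assumption)


def direct_hash(window):
    """Polynomial hash of a whole window, computed from scratch."""
    h = 0
    for b in window:
        h = (h * MULT + b) & MOD_MASK
    return h


class RollingPolyHash:
    """Same hash as direct_hash, but updated in O(1) per byte."""

    def __init__(self, window_size):
        # MULT ** window_size (mod 2**32) is the weight of the byte that
        # is about to leave the window once the hash has been multiplied.
        self.out_mult = pow(MULT, window_size, 1 << 32)
        self.h = 0

    def update(self, in_byte):
        """Add a byte while the window is still filling up."""
        self.h = (self.h * MULT + in_byte) & MOD_MASK

    def rotate(self, out_byte, in_byte):
        """Slide the full window by one byte."""
        self.h = (self.h * MULT + in_byte - out_byte * self.out_mult) & MOD_MASK


WINDOW = 32
data = bytes(random.Random(1).getrandbits(8) for _ in range(4096))

roller = RollingPolyHash(WINDOW)
for b in data[:WINDOW]:
    roller.update(b)

mismatches = 0
for i in range(WINDOW, len(data)):
    roller.rotate(data[i - WINDOW], data[i])
    if roller.h != direct_hash(data[i - WINDOW + 1:i + 1]):
        mismatches += 1
print("mismatches:", mismatches)
```

A chunker built on this would declare a cut point whenever some fixed set of bits of `roller.h` is zero. Which bits are checked matters for exactly the distribution reasons raised in the comment.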
FTR I've done some more testing/analysis of chunker algorithms after reading the FastCDC paper and got some surprising results here; https://github.com/dbaarda/rollsum-chunking/blob/master/RESULTS.rst Also, for a definitive summary of everything related to IPFS deduplication, read this; https://discuss.ipfs.io/t/draft-common-bytes-standard-for-data-deduplication/6813/10?u=dbaarda |
Another update; I've added "Regression Chunking", from a Microsoft paper, to my chunking analysis. It has an interesting technique for reducing the effects of truncation at the max_len. I also added tests/analysis of Gear and of the small sliding windows used for chunking to my rollsum analysis. |
I wrote a deduplicator which is 4x faster than other solutions when storing and 2x faster when extracting. But my deduplicator doesn't do any CDC, it simply splits data into fixed-size chunks. This speed was achieved thanks to blake3 and rayon ( https://crates.io/crates/rayon ). I also benchmarked various deduplicators. See this thread: borgbackup/borg#7674 . See especially the last comment, with the newest benchmark and my newest deduplicator: borgbackup/borg#7674 (comment) . Yes, I know that you are interested in CDC, but I still think you can take some ideas |
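The general shape of such a fixed-size deduplicating store can be sketched as follows. This is only an illustration of the idea, not the tool from the comment: SHA-256 from the standard library stands in for blake3, a thread pool stands in for rayon, and all function names are hypothetical.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor


def _digest(chunk):
    # SHA-256 is used here only because it ships with Python;
    # the tool described above uses blake3.
    return hashlib.sha256(chunk).digest()


def store_fixed(data, chunk_size, store):
    """Split data into fixed-size chunks and add unseen ones to `store`.

    Returns the list of chunk digests (the "recipe") needed to rebuild data.
    """
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    # Hash chunks in parallel; chunks are independent of each other.
    with ThreadPoolExecutor() as pool:
        digests = list(pool.map(_digest, chunks))
    for digest, chunk in zip(digests, chunks):
        store.setdefault(digest, chunk)  # identical chunks are kept once
    return digests


def restore(digests, store):
    """Rebuild the original data from its recipe."""
    return b"".join(store[d] for d in digests)


# Three identical 4 KiB blocks followed by one different block.
data = bytes(4096) * 3 + bytes([1]) * 4096
store = {}
digests = store_fixed(data, 4096, store)
restored = restore(digests, store)
print(len(digests), "chunks referenced,", len(store), "chunks stored")
```

Because every chunk boundary is known up front, hashing parallelises trivially, which is part of why fixed-size splitting is so fast.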
Of course a fixed-size chunker is going to be fast, and on VM images it will probably even give you OK de-duplication. This is because VM images are filesystems that are arranged into fixed-size blocks containing files that are mostly binaries, libraries, and compressed data. These kinds of files are either identical or completely different, so they either have all or none of the same blocks on disk, and a fixed-size chunker can easily find them. However, a fixed-size chunker fails very badly on many other simple cases. Get a big file, add one byte at the start, and a fixed-size chunker will get 0% de-duplication. Or tar all the files on those VM images so they are no longer nicely aligned into fixed-size disk blocks, and get nearly 0% de-duplication, etc. |
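The "add one byte at the start" case is easy to reproduce. The sketch below compares a fixed-size chunker with a very simplified Gear-style content-defined chunker on random data. It makes several assumptions for brevity: there are no min/max chunk limits, the Gear table is random, cut points are taken where the top 10 bits of the hash are zero (about 1 KiB average), and all names are hypothetical.

```python
import hashlib
import random

MASK64 = (1 << 64) - 1


def fixed_chunks(data, size):
    return [data[i:i + size] for i in range(0, len(data), size)]


def gear_chunks(data, table, bits):
    """Simplified Gear-style CDC: cut where the top `bits` hash bits are zero.

    The hash is never reset, so after the first 64 bytes it depends only on
    the last 64 bytes of content, not on the position in the file.
    """
    chunks, start, h = [], 0, 0
    shift = 64 - bits
    for i, b in enumerate(data):
        h = ((h << 1) + table[b]) & MASK64
        if h >> shift == 0:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks


def shared_fraction(old_chunks, new_chunks):
    """Fraction of new chunks whose digest already exists among old chunks."""
    seen = {hashlib.sha256(c).digest() for c in old_chunks}
    hits = sum(hashlib.sha256(c).digest() in seen for c in new_chunks)
    return hits / len(new_chunks)


rng = random.Random(227)
TABLE = [rng.getrandbits(64) for _ in range(256)]
original = bytes(rng.getrandbits(8) for _ in range(256 * 1024))
shifted = b"\x00" + original  # same file with one byte added at the start

fixed_shared = shared_fraction(fixed_chunks(original, 1024),
                               fixed_chunks(shifted, 1024))
cdc_shared = shared_fraction(gear_chunks(original, TABLE, 10),
                             gear_chunks(shifted, TABLE, 10))
print(f"fixed-size: {fixed_shared:.0%} shared, CDC: {cdc_shared:.0%} shared")
```

With fixed-size chunks every boundary moves by one byte, so nothing is shared. With content-defined cut points only the chunks near the insertion change.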
I added a comparison between CDC-based tools: borgbackup/borg#7674 (comment) . Note that casync, desync and borg all use buzhash, so they use exactly the same algorithm (buzhash + same chunk size + zstd level 3). And yet they give very different performance. This means that some of these tools greatly under-perform. Make sure not to repeat these problems in IPFS. Also note that borg is single-threaded and yet it beats parallel desync. This means that desync does something horribly wrong. Make sure not to repeat this in IPFS! The whole thread may contain some important info |
Okay, so here is list of Github issues I |
oI 95% Assemble corpuses of data from various prior performance research initiatives ( both within and outside of PL )
💯 Enumerate/obtain test datasets
90% Document rationales for the test datasets
95% Publish all of the above as plain HTTP + IPFS pinned download
oI 85% Document prior art, motivation and precise scope and types of sought metrics
💯 Solicit/assemble feedback from various stakeholders
💯 Collect/determine relevance of existing academic research into chunking ( 14 distinct papers selected for evaluation )
💯 Convert the pre-PL chunk-tester to proper multi-streaming, to dramatically lower the cost of experiments ( aiming at about 500 megabyte/s stream processing ) with the correct implementation and hardware about 3.5GiB/s standard ingestion 🎉
80% Generate a few preliminary datapoints to aid understanding the goal/scope
90% In depth study/evaluation/application of findings from above works
💯 Understand and reuse existing go-ipfs implementations of CDCs ( Rabin + Buzhash ) in a simpler go-ipfs independent utility, allowing rapid retries of different parameters
💯 Same as above but pertaining to linking strategies ( trickle-dag etc ), as ignoring the link-layer of streams skews the results disproportionately
98% ( subsumes a large portion of points below v0.1 ETA: DEMO AT TEAM-WEEK ) Fully implement a standalone CLI utility re-implementing/converging with go-ipfs on all above algorithms. The distinguishing feature of said tool is the exposure of each chunker/linker as an atomic, composable primitive. The UX is similar to that of ffmpeg whereby an input stream is processed via multiple "filters", with the result being a stream of blocks with a statistic on their counts/sizes plus a valid IPFS CID. Current remaining tasks:
💯 Profile/optimize baseline stream ingestion, ensure there is no penalty from applying a "null-filter", which allows one to benchmark a particular hardware setup's theoretical maximum throughput
💯 Finalize the "stackable chunkers" UI/UX, allowing effortless demonstration of impact of such chunker chains on the
💯 Adjust statistics compilation/output for the above ( it currently looks like this, ignoring various "filter-levels" )
💯 Make final pass on memory allocation profile and fixup obvious low hanging fruit before v0.1
80% README / godoc / stuffz
80% Rewrite previously utilized plotly.js-based visualiser to aid with the above point
oI Open document to a short discussion soliciting feedback from workgroups
oII Perform a number of "brute force" tests aiming at reproducible results ( utilizing https://github.com/ipfs/testground ) for the purposes of what we are trying to quantify iptb will be sufficient
oII ( half-covered by initial writeup ) Convert raw results into multi-dimensional scatter-plot visualizations ( plotly.js )
oIII Combine all available results into a "compromise chunking settings" RFC document
oIV Publish the results for discussion and decision of the level of incorporation into IPFS implementations ( default parameters, use of selected algorithm by default, etc )