This repository has been archived by the owner on Dec 29, 2021. It is now read-only.

Current chunking: content-aware? de-duped? #77

Open
bnewbold opened this issue Oct 30, 2017 · 9 comments

Comments

@bnewbold

Three questions about the current/stable dat client and ecosystem library behavior:

  1. are imported files chunked in a "content-aware" way, eg using Rabin fingerprinting? I've seen mention of this in the past (eg, https://blog.datproject.org/2016/02/01/dat-1-0-is-ready/), and I see https://github.com/datproject/rabin, but quick searches of the hyperdrive code base don't turn anything up. (See the chunking sketch after this comment.)
  2. does hyperdrive handle full-file de-duplication? Eg, if the same file is added under different names, or a file is added and removed, will the metadata feed point to the same blocks in the data feed?
  3. does hyperdrive handle partial-file de-duplication? Eg, if a long .csv file has text changed in the middle (not changing the chunking or overall length), will only the mutated chunk get appended to the data feed? The current metadata implementation seems to be "chunk offset and length" based, so I'm not sure how a sparse set of chunks would be found.

Questions 1+2 are just curiosity about current behavior; I don't see anything in the spec that would prevent clients from implementing these optimizations in the future. Question 3 comes after working on an implementation; maybe I need to go back and re-read the spec.
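To make question 1 concrete, here is a rough sketch of how content-defined chunking picks boundaries from the bytes themselves rather than at fixed offsets. This is not dat's or datproject/rabin's actual implementation; the gear table, mask, and size bounds are invented parameters for illustration only.

```js
// Illustrative content-defined chunking with a rolling "gear" hash.
// NOT dat's implementation; GEAR, MASK, MIN, and MAX are made-up parameters.
const crypto = require('crypto')

// 256 pseudo-random 32-bit values, derived deterministically for brevity.
const GEAR = new Uint32Array(256).map((_, i) =>
  crypto.createHash('sha256').update(String(i)).digest().readUInt32BE(0)
)

const MIN = 16 * 1024       // never cut before 16 KiB
const MAX = 256 * 1024      // always cut by 256 KiB
const MASK = (1 << 13) - 1  // a boundary is expected roughly every 8 KiB past MIN

function chunkBoundaries (buf) {
  const cuts = []
  let start = 0
  let hash = 0
  for (let i = 0; i < buf.length; i++) {
    hash = ((hash << 1) + GEAR[buf[i]]) >>> 0 // roll the hash over each byte
    const len = i - start + 1
    if ((len >= MIN && (hash & MASK) === 0) || len >= MAX) {
      cuts.push(i + 1) // cut point depends on content, not on byte position
      start = i + 1
      hash = 0
    }
  }
  if (start < buf.length) cuts.push(buf.length)
  return cuts
}
```

Because boundaries are chosen from the content, inserting or deleting a few bytes only changes the chunks around the edit; downstream boundaries re-synchronize, which is what makes partial-file de-dupe possible.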

@max-mapper
Contributor

  1. we decided to disable Rabin fingerprinting by default earlier this year, around when we switched to BLAKE hashes. The reason was that it limited import speed by a factor of around 4x. Now we use fixed-size 64 KB chunks by default.

  2. yes

  3. that's the scenario for Rabin, right? With fixed-size chunks we only dedupe if the chunk content is exactly the same, nothing smarter than that.
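For contrast with the sketch above, a minimal sketch of the fixed-size scheme described in this answer: cut every 64 KB and hash each chunk. SHA-256 is only a stand-in here; it is not a claim about the hash the real code uses.

```js
// Illustrative fixed-size chunking: slice every 64 KiB and hash each chunk.
// Two chunks can only dedupe if their bytes (and therefore hashes) match.
const crypto = require('crypto')

const CHUNK_SIZE = 64 * 1024

function fixedChunks (buf) {
  const chunks = []
  for (let off = 0; off < buf.length; off += CHUNK_SIZE) {
    const data = buf.slice(off, off + CHUNK_SIZE)
    chunks.push({
      hash: crypto.createHash('sha256').update(data).digest('hex'),
      data
    })
  }
  return chunks
}
```

A one-byte insertion near the start of a file shifts every later 64 KB boundary, so none of the subsequent chunks hash the same as before, which is exactly the case content-defined chunking is meant to handle.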

@bnewbold
Author

Thanks for the reply!

  1. This seems to be other projects' experience as well (eg, IPFS). Hopefully somebody (a researcher?) will come up with a faster robust chunking scheme some day.

  2. I'm confused how this works with hyperdrive, which specifies only a single (blocks, offset) pointer for a given version of a file, implying that all file content must be stored as a single contiguous run of chunks in the data register. If two ~10 MB files each shared the first ~8 MB of data (with ~2 MB of unique data), how would the second file be deduplicated against the first (for the fixed chunks that are the same)? This also applies to file changes: I can see how appending to the most recent file just adds chunks, and the Stat entry can be extended easily for each version, but if an unrelated file were added in between (with new chunks), the appended file's chunks would now be fragmented, and I don't see how hyperdrive would handle the deduplication.
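A simplified model of the contiguous pointer being described here. Field names loosely follow hyperdrive's Stat entry, but this is a sketch with made-up values, not the actual metadata encoding.

```js
// Sketch of a file entry that points at one contiguous run of the data feed.
// Not the real hyperdrive encoding; values are invented.
const entry = {
  name: '/data.csv',
  offset: 120,            // index of the file's first block in the data feed
  blocks: 160,            // number of consecutive blocks belonging to the file
  size: 160 * 64 * 1024   // total bytes (~10 MB at 64 KiB per block)
}

// Reading the file means reading blocks offset .. offset + blocks - 1.
// Because there is no per-block pointer, a second file cannot reuse, say,
// the first 128 of these blocks while appending only its 32 unique blocks.
function blockRange (entry) {
  return { first: entry.offset, last: entry.offset + entry.blocks - 1 }
}
```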

@bnewbold
Author

bnewbold commented Jan 3, 2018

For now, de-duplication (adding locally or downloading) doesn't happen in the official dat client or hyper* libraries.

Three thoughts about de-dupe:

If we had complete hashes of entire files as part of Stat metadata, we could index that and check if the whole file already exists before adding/downloading. This isn't the current spec.

As-is, it would probably work well enough to look up the file size and first-chunk hash in the complete data history (tree). If those matched, it would be worth spending the computational resources to hash the new file and the (contiguous chunks of the) old file; if they are an exact match, just point backwards in the data register (a rough sketch of this check follows below).

Finally, many of these de-dupe optimizations are interesting both within a single (large) dat archive and when used against a huge repository of thousands of archives: doing download or disk-storage de-dupe between archives, potentially taking advantage of a novel storage backend. One simple back-end design would be a content-addressed key/value (hash/data) lookup over, eg, many TB of data.
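A sketch of the second idea above: key an index by (file size, hash of first chunk) over everything already in the data history, and only pay for a full-file hash when that cheap key matches. The index structure and helper names here are hypothetical, not part of any dat library.

```js
// Hypothetical pre-check for whole-file de-dupe. Names and structure are
// invented for this sketch.
const crypto = require('crypto')

const candidates = new Map() // `${size}:${firstChunkHash}` -> { offset, blocks }

function firstChunkHash (buf) {
  return crypto.createHash('sha256').update(buf.slice(0, 64 * 1024)).digest('hex')
}

function findDuplicate (fileBuf, readExisting /* (entry) -> Buffer */) {
  const key = `${fileBuf.length}:${firstChunkHash(fileBuf)}`
  const entry = candidates.get(key)
  if (!entry) return null // nothing plausible in the history, append as usual

  // The cheap key matched, so spend the CPU on full hashes to confirm.
  const a = crypto.createHash('sha256').update(fileBuf).digest('hex')
  const b = crypto.createHash('sha256').update(readExisting(entry)).digest('hex')
  return a === b ? entry : null // exact match: point back at the old blocks
}
```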

@bcomnes

bcomnes commented Jan 5, 2018

Seems like it should be an option. It sounds like you can still do it, though, by using the modules.

@okdistribute
Contributor

Here's a code comment on a way to implement this in hyperdrive: https://github.com/mafintosh/hyperdrive/blob/master/index.js#L553

@jmatsushita

jmatsushita commented Nov 29, 2019

I'm quite interested in deduplication. I couldn't find software which does both deduplication of data in transit (incremental transfers, with variable size chunks) and deduplication of data at rest (with a chosen scope). It seems like a sweet design spot, especially with a p2p approach.

Are you aware of https://github.com/ronomon/deduplication?

Fast multi-threaded content-dependent chunking deduplication for Buffers in C++

It achieves 1 GB+/sec throughput. If I'm not mistaken, Rabin-Karp hashes at throughputs on the order of 100 MB/s. It seems like that could help with the import-speed issues that led to removing content fingerprinting?

If I understand correctly, there are a number of aspects to consider when implementing deduplication in dat:

> Here's a code comment on a way to implement this in hyperdrive: https://github.com/mafintosh/hyperdrive/blob/master/index.js#L553

Here's a permalink to that spot on master today: https://github.com/mafintosh/hyperdrive/blob/b7121f5ecc97596722af4b65ecea74b8f2158774/index.js#L404
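One of those aspects is where a chunker would plug in at all. A hypothetical shape for that is sketched below; neither this interface nor the option name exists in hyperdrive, it is only meant to make the design question concrete.

```js
// Hypothetical pluggable chunking in front of an append-only feed.
// `feed` is anything with an append(buffer) method; the default chunker
// reproduces the fixed 64 KiB behaviour, and opts.chunker could be swapped
// for a content-defined one. None of this is hyperdrive's actual API.
function importFile (feed, fileBuf, opts = {}) {
  const fixed64k = buf => {
    const out = []
    for (let off = 0; off < buf.length; off += 64 * 1024) {
      out.push(buf.slice(off, off + 64 * 1024))
    }
    return out
  }
  const chunker = opts.chunker || fixed64k
  const chunks = chunker(fileBuf)
  // A dedupe-aware importer could hash each chunk here and, for chunks the
  // feed already has, record a pointer instead of appending a copy.
  for (const chunk of chunks) feed.append(chunk)
  return chunks.length
}
```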

@jmatsushita

@mafintosh Could hypertrie's customisable rolling hash function be used for deduplication and/or incremental transfers?

@andrewosh Could the corestore approach of having the content feed be a dependent of the metadata feed be used to add content-aware chunking? It seems like namespaces could be a good way to define deduplication scopes too...

@serapath
Member

People are very busy, I guess; I just saw your comment.
You could try asking again here: https://gitter.im/datproject/discussions

@martinheidegger

Yes, we are active, but busy with other things. @jmatsushita, to get the ball rolling on something like this, it would be a good idea to join a comm-comm meeting to find allies and gain perspective.
