This repository has been archived by the owner on Dec 29, 2021. It is now read-only.

Current chunking: content-aware? de-duped? #77

Open
bnewbold opened this issue Oct 30, 2017 · 9 comments

Comments

@bnewbold

Three questions about the current/stable dat client and ecosystem library behavior:

  1. are imported files chunked in a "content-aware" way, eg using Rabin fingerprinting? I've seen mention of this in the past (eg, https://blog.datproject.org/2016/02/01/dat-1-0-is-ready/), and I see https://github.com/datproject/rabin, but quick searches of the hyperdrive code base don't turn anything up. (See the chunking sketch after this comment.)
  2. does hyperdrive handle full-file de-duplication? Eg, if the same file is added under different names, or a file is added and removed, will the metadata feed point to the same blocks in the data feed?
  3. does hyperdrive handle partial-file de-duplication? Eg, if a long .csv file has text changed in the middle (not changing the chunking or overall length), will only the mutated chunk get appended to the data feed? The current metadata implementation seems to be "chunk offset and length" based, so I'm not sure how a sparse set of chunks would be found.

Questions 1+2 are just curiosity about current behavior; I don't see anything in the spec that would prevent clients from implementing these optimizations in the future. Question 3 comes after working on an implementation; maybe I need to go back and re-read the spec.
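To make question 1 concrete, here is a rough sketch of how content-defined chunking picks boundaries from the bytes themselves rather than at fixed offsets. This is not dat's or datproject/rabin's actual implementation; the gear table, mask, and size bounds are invented parameters for illustration only.

```js
// Illustrative content-defined chunking with a rolling "gear" hash.
// NOT dat's implementation; GEAR, MASK, MIN, and MAX are made-up parameters.
const crypto = require('crypto')

// 256 pseudo-random 32-bit values, derived deterministically for brevity.
const GEAR = new Uint32Array(256).map((_, i) =>
  crypto.createHash('sha256').update(String(i)).digest().readUInt32BE(0)
)

const MIN = 16 * 1024       // never cut before 16 KiB
const MAX = 256 * 1024      // always cut by 256 KiB
const MASK = (1 << 13) - 1  // a boundary is expected roughly every 8 KiB past MIN

function chunkBoundaries (buf) {
  const cuts = []
  let start = 0
  let hash = 0
  for (let i = 0; i < buf.length; i++) {
    hash = ((hash << 1) + GEAR[buf[i]]) >>> 0 // roll the hash over each byte
    const len = i - start + 1
    if ((len >= MIN && (hash & MASK) === 0) || len >= MAX) {
      cuts.push(i + 1) // cut point depends on content, not on byte position
      start = i + 1
      hash = 0
    }
  }
  if (start < buf.length) cuts.push(buf.length)
  return cuts
}
```

Because boundaries are chosen from the content, inserting or deleting a few bytes only changes the chunks around the edit; downstream boundaries re-synchronize, which is what makes partial-file de-dupe possible.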

@max-mapper
Contributor

  1. we decided to disable Rabin fingerprinting by default earlier this year, around when we switched to BLAKE hashes. The reason was that it limited import speed by a factor of around 4x. Now we use fixed-size 64 KB chunks by default.

  2. yes

  3. that's the scenario for Rabin, right? With fixed-size chunks we only dedupe if the chunk content is exactly the same, nothing smarter than that.
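For contrast with the sketch above, a minimal sketch of the fixed-size scheme described in this answer: cut every 64 KB and hash each chunk. SHA-256 is only a stand-in here; it is not a claim about the hash the real code uses.

```js
// Illustrative fixed-size chunking: slice every 64 KiB and hash each chunk.
// Two chunks can only dedupe if their bytes (and therefore hashes) match.
const crypto = require('crypto')

const CHUNK_SIZE = 64 * 1024

function fixedChunks (buf) {
  const chunks = []
  for (let off = 0; off < buf.length; off += CHUNK_SIZE) {
    const data = buf.slice(off, off + CHUNK_SIZE)
    chunks.push({
      hash: crypto.createHash('sha256').update(data).digest('hex'),
      data
    })
  }
  return chunks
}
```

A one-byte insertion near the start of a file shifts every later 64 KB boundary, so none of the subsequent chunks hash the same as before, which is exactly the case content-defined chunking is meant to handle.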

@bnewbold
Author

Thanks for the reply!

  1. This seems to be other projects' experience as well (eg, IPFS). Hopefully somebody (a researcher?) will come up with a faster robust chunking scheme some day.

  2. I'm confused how this works with hyperdrive, which specifies only a single (blocks, offset) pointer for a given version of a file, implying that all file content must be stored as a single contiguous run of chunks in the data register. If two ~10 MB files each shared the first ~8 MB of data (with ~2 MB of unique data), how would the second file be deduplicated against the first (for the fixed chunks that are the same)? This also applies to file changes: I can see how appending to the most recent file just adds chunks, and the Stat entry can be extended easily for each version, but if an unrelated file were added in between (with new chunks), the appended file's chunks would now be fragmented, and I don't see how hyperdrive would handle the deduplication.
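A simplified model of the contiguous pointer being described here. Field names loosely follow hyperdrive's Stat entry, but this is a sketch with made-up values, not the actual metadata encoding.

```js
// Sketch of a file entry that points at one contiguous run of the data feed.
// Not the real hyperdrive encoding; values are invented.
const entry = {
  name: '/data.csv',
  offset: 120,            // index of the file's first block in the data feed
  blocks: 160,            // number of consecutive blocks belonging to the file
  size: 160 * 64 * 1024   // total bytes (~10 MB at 64 KiB per block)
}

// Reading the file means reading blocks offset .. offset + blocks - 1.
// Because there is no per-block pointer, a second file cannot reuse, say,
// the first 128 of these blocks while appending only its 32 unique blocks.
function blockRange (entry) {
  return { first: entry.offset, last: entry.offset + entry.blocks - 1 }
}
```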

@bnewbold
Author

bnewbold commented Jan 3, 2018

For now, de-duplication (adding locally or downloading) doesn't happen in the official dat client or hyper* libraries.

Three thoughts about de-dupe:

If we had complete hashes of entire files as part of Stat metadata, we could index that and check if the whole file already exists before adding/downloading. This isn't the current spec.

As-is, it would probably work well enough to look up the file size and first-chunk hash in the complete data history (tree). If those matched, it would be worth spending the computational resources to hash the new file and the (contiguous chunks of the) old file; if they are an exact match, just point backwards in the data register (a rough sketch of this check follows below).

Finally, many of these de-dupe optimizations are interesting both within a single (large) dat archive and when used against a huge repository of thousands of archives: doing download or disk-storage de-dupe between archives, potentially taking advantage of a novel storage backend. One simple back-end design would be a content-addressed key/value (hash/data) lookup over, eg, many TB of data.
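A sketch of the second idea above: key an index by (file size, hash of first chunk) over everything already in the data history, and only pay for a full-file hash when that cheap key matches. The index structure and helper names here are hypothetical, not part of any dat library.

```js
// Hypothetical pre-check for whole-file de-dupe. Names and structure are
// invented for this sketch.
const crypto = require('crypto')

const candidates = new Map() // `${size}:${firstChunkHash}` -> { offset, blocks }

function firstChunkHash (buf) {
  return crypto.createHash('sha256').update(buf.slice(0, 64 * 1024)).digest('hex')
}

function findDuplicate (fileBuf, readExisting /* (entry) -> Buffer */) {
  const key = `${fileBuf.length}:${firstChunkHash(fileBuf)}`
  const entry = candidates.get(key)
  if (!entry) return null // nothing plausible in the history, append as usual

  // The cheap key matched, so spend the CPU on full hashes to confirm.
  const a = crypto.createHash('sha256').update(fileBuf).digest('hex')
  const b = crypto.createHash('sha256').update(readExisting(entry)).digest('hex')
  return a === b ? entry : null // exact match: point back at the old blocks
}
```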

@bcomnes

bcomnes commented Jan 5, 2018

Seems like it should be an option. It sounds like you can still do it, though, by using the modules.

@okdistribute
Contributor

Here's a code comment on a way to implement this in hyperdrive: https://github.com/mafintosh/hyperdrive/blob/master/index.js#L553

@jmatsushita

jmatsushita commented Nov 29, 2019

I'm quite interested in deduplication. I couldn't find software which does both deduplication of data in transit (incremental transfers, with variable size chunks) and deduplication of data at rest (with a chosen scope). It seems like a sweet design spot, especially with a p2p approach.

Are you aware of https://github.com/ronomon/deduplication?

Fast multi-threaded content-dependent chunking deduplication for Buffers in C++

It achieves 1 GB+/sec throughput. If I'm not mistaken, Rabin-Karp hashes at throughputs on the order of 100 MB/s. It seems like that could help with the import-speed issues that led to removing content fingerprinting?

If I understand correctly, there are a number of aspects to consider when implementing deduplication in dat:

> Here's a code comment on a way to implement this in hyperdrive: https://github.com/mafintosh/hyperdrive/blob/master/index.js#L553

Here's a permalink to that spot on master today: https://github.com/mafintosh/hyperdrive/blob/b7121f5ecc97596722af4b65ecea74b8f2158774/index.js#L404
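One of those aspects is where a chunker would plug in at all. A hypothetical shape for that is sketched below; neither this interface nor the option name exists in hyperdrive, it is only meant to make the design question concrete.

```js
// Hypothetical pluggable chunking in front of an append-only feed.
// `feed` is anything with an append(buffer) method; the default chunker
// reproduces the fixed 64 KiB behaviour, and opts.chunker could be swapped
// for a content-defined one. None of this is hyperdrive's actual API.
function importFile (feed, fileBuf, opts = {}) {
  const fixed64k = buf => {
    const out = []
    for (let off = 0; off < buf.length; off += 64 * 1024) {
      out.push(buf.slice(off, off + 64 * 1024))
    }
    return out
  }
  const chunker = opts.chunker || fixed64k
  const chunks = chunker(fileBuf)
  // A dedupe-aware importer could hash each chunk here and, for chunks the
  // feed already has, record a pointer instead of appending a copy.
  for (const chunk of chunks) feed.append(chunk)
  return chunks.length
}
```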

@jmatsushita

@mafintosh Could hypertrie's customisable rolling hash function be used for deduplication and/or incremental transfers?

@andrewosh Could the corestore approach of having the content feed be a dependent of the metadata feed be used to add content-aware chunking? It seems like namespaces could be a good way to define deduplication scopes too...

@serapath
Member

People are very busy, I guess; I just saw your comment.
You could try asking again here: https://gitter.im/datproject/discussions

@martinheidegger

Yes, we are active, but busy with other things. @jmatsushita, to get the ball rolling on something like this, it would be a good idea to join a comm-comm meeting to find allies and gain perspective.
