Current chunking: content-aware? de-duped? #77
Thanks for the reply!
For now, de-duplication (whether adding locally or downloading) doesn't happen in the official dat client or the hyper* libraries. Three thoughts about de-dupe:

- If we had complete hashes of entire files as part of Stat metadata, we could index that and check whether the whole file already exists before adding/downloading. This isn't in the current spec.
- As-is, it would probably work well enough to look up the file size and first chunk hash in the complete data history (tree). If those matched, it would be worth spending the computational resources to hash the new file and the (contiguous chunks of the) old file; if they are an exact match, just point backwards in the data register (sketched below).
- Finally, many of these de-dupe optimizations are interesting both within a single (large) dat archive and when used against a huge repository of thousands of archives: doing download or disk-storage de-dupe between archives, potentially taking advantage of a novel storage backend. One simple backend design would be a content-addressed key/value (hash → data) lookup over, e.g., many TB of data.
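A minimal TypeScript sketch of that second idea (the size + first-chunk pre-check). The `HistoryIndex` interface and `readOldChunk` reader are hypothetical stand-ins, since nothing like this exists in the current dat/hyper* code:

```ts
// Hedged sketch only: HistoryIndex and readOldChunk stand in for "some index
// over the complete data history (tree)"; they are not real dat/hyper* APIs.
import { createHash } from "crypto";

interface HistoryIndex {
  // maps (file size, hash of first chunk) -> offset of a previously stored file
  lookup(size: number, firstChunkHash: string): number | undefined;
}

function sha256(buf: Buffer): string {
  return createHash("sha256").update(buf).digest("hex");
}

// Returns an offset to point backwards to, or undefined if the file is new.
// Assumes the old and new files were chunked the same way.
async function findDuplicate(
  index: HistoryIndex,
  newChunks: Buffer[],
  readOldChunk: (offset: number, i: number) => Promise<Buffer>
): Promise<number | undefined> {
  const size = newChunks.reduce((n, c) => n + c.length, 0);
  const offset = index.lookup(size, sha256(newChunks[0]));
  if (offset === undefined) return undefined;

  // Cheap pre-check matched; only now spend the work of a full comparison.
  for (let i = 0; i < newChunks.length; i++) {
    const old = await readOldChunk(offset, i);
    if (!old.equals(newChunks[i])) return undefined;
  }
  return offset; // exact match: reuse the existing chunks in the data register
}
```

The point of the cheap (size, first-chunk hash) lookup is that the expensive full hash/compare only runs when a duplicate is actually plausible.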
Seems like it should be an option. It sounds like you can still do it, though, by using the modules.
Here's a code comment on a way to implement this in hyperdrive: https://github.com/mafintosh/hyperdrive/blob/master/index.js#L553
I'm quite interested in deduplication. I couldn't find software which does both deduplication of data in transit (incremental transfers, with variable-size chunks) and deduplication of data at rest (with a chosen scope). It seems like a sweet design spot, especially with a p2p approach. Are you aware of https://github.com/ronomon/deduplication?
It achieves 1GB+/sec throughput. If I'm not mistaken, Rabin-Karp hashes at throughputs on the order of 100MB/s. It seems like that could help deal with the import-speed issues which led to removing content fingerprinting? If I understand correctly, there are a number of aspects to consider when implementing deduplication in dat:
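As a rough illustration of the "variable-size chunks" piece, here is a toy content-defined chunker using a gear-style rolling hash. This is not the ronomon/deduplication API, just a sketch; a real implementation would use a fixed gear table and a much faster inner loop:

```ts
// Toy content-defined chunker with a gear-style rolling hash. In a real
// system the GEAR table must be a fixed constant so every peer cuts chunks
// at the same boundaries; random values are used here only to keep the
// sketch short.
import { randomBytes } from "crypto";

const GEAR = new Uint32Array(256).map(() => randomBytes(4).readUInt32LE(0));
const MIN = 2 * 1024;        // never cut before 2 KiB
const MAX = 64 * 1024;       // always cut by 64 KiB
const MASK = (1 << 13) - 1;  // ~8 KiB average chunk size

// Returns the byte offsets at which chunks end.
function cutPoints(data: Buffer): number[] {
  const cuts: number[] = [];
  let start = 0;
  let hash = 0;
  for (let i = 0; i < data.length; i++) {
    hash = ((hash << 1) + GEAR[data[i]]) >>> 0; // old bytes age out of the hash
    const len = i - start + 1;
    if ((len >= MIN && (hash & MASK) === 0) || len >= MAX) {
      cuts.push(i + 1);
      start = i + 1;
      hash = 0;
    }
  }
  if (start < data.length) cuts.push(data.length); // trailing partial chunk
  return cuts;
}
```

Because boundaries depend only on nearby content, an insertion near the start of a file shifts at most a chunk or two; the remaining chunks keep the same hashes, which is what makes variable-size chunking useful for both incremental transfer and at-rest de-dupe.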
Here's a permalink to it on master today: https://github.com/mafintosh/hyperdrive/blob/b7121f5ecc97596722af4b65ecea74b8f2158774/index.js#L404
@mafintosh Could hypertrie's customisable rolling hash function be used for deduplication and/or incremental transfers? @andrewosh Could the corestore approach, where the content feed is a dependent of the metadata feed, be used to add content-aware chunking? It seems like namespaces could be a good way to define deduplication scopes too...
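Purely as a conceptual sketch of "namespaces as deduplication scopes" (this does not reflect the real corestore or hypertrie APIs): a content-addressed block store keyed by (scope, hash) would de-duplicate within whatever scope you choose.

```ts
// Conceptual only: not the corestore/hypertrie API.
import { createHash } from "crypto";

class ScopedBlockStore {
  private blocks = new Map<string, Buffer>();

  // Store a block under (scope, content hash); identical blocks within the
  // same scope are stored once.
  put(scope: string, block: Buffer): string {
    const hash = createHash("sha256").update(block).digest("hex");
    const key = `${scope}/${hash}`;
    if (!this.blocks.has(key)) this.blocks.set(key, block);
    return hash;
  }

  get(scope: string, hash: string): Buffer | undefined {
    return this.blocks.get(`${scope}/${hash}`);
  }
}

// scope = a single archive's key -> per-archive de-dupe
// scope = a shared namespace     -> de-dupe across many archives
```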
People are very busy, I guess; just saw your comment.
Yes, we are active and busy with other things. @jmatsushita, to get the ball rolling on something like this, it would be a good idea to join a comm-comm meeting for finding allies and gaining perspective.
Three questions about the current/stable dat client and ecosystem library behavior:
Questions 1+2 are just curiosity about current behavior; I don't see anything in the spec that would prevent clients from implementing these optimizations in the future. Question 3 comes after working on an implementation; maybe I need to go back and re-read the spec.