Combining efforts around data storage / deduplication? #9
@remram44 is the author of https://github.com/remram44/adler32-rs, which seems to be used by https://gitlab.wellbehavedsoftware.com/well-behaved-software/rzbackup.
@dswd is the author of https://github.com/dswd/zvault, which has a few chunking implementations at https://github.com/dswd/zvault/tree/master/chunking/src.
I'd be OK with a shared org; my interest is more in the algorithms than in the tools. I can imagine different people wanting different things from their tools, though, so maybe it makes sense to keep those separate? But then there would only be one crate under the org, the one containing the algorithms (and maybe my rollsum testing/benchmarking repo at https://github.com/aidanhs/rollsum-tests?). I don't have particularly strong opinions, but it would be nice to deduplicate some of the work here ;)
Indeed, a few of those crates are implementing the same thing. I agree that it would be best to have at least a common interface or trait for chunking algorithms, so we all speak the same language. Looking over the APIs of the crates you linked, none of them seems general enough: my use case is iterating over chunks streamed out of a `Read`. Interestingly enough, I have another implementation which is used in a way that reuses the buffer and never holds more than a bounded amount of data in memory.

It's good to see an organized effort happening here!
I think there are at least two top-level traits here (not actual Rust): one that slices up a complete in-memory buffer, and one that accepts its input incrementally.
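The original pseudocode did not survive formatting; a minimal Rust rendering consistent with the description below might look like this (the names `SliceChunker`, `IncrementalChunker`, `chunk_slices`, and `push` are placeholders, not from the original):

```rust
/// Sketch 1: zero-copy chunking of a complete, contiguous buffer.
/// The returned slices borrow from the input, so nothing is copied.
trait SliceChunker {
    fn chunk_slices<'a>(&mut self, buf: &'a [u8]) -> Vec<&'a [u8]>;
}

/// Sketch 2: incremental chunking. Input arrives in arbitrarily small
/// pieces, and the chunker reports the boundaries it finds as offsets
/// into the stream consumed so far.
trait IncrementalChunker {
    fn push(&mut self, data: &[u8]) -> Vec<u64>;
}
```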
The former allows zero-copy slicing of a buffer into its component chunks, but requires a contiguous buffer to start with; the latter is more general and allows a finer-grained input buffer. It's trivial to adapt any `R: Read` into an iterator for the second one, for example.
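For instance, a minimal sketch of such an adapter (reusing the placeholder `IncrementalChunker` trait from above) could be:

```rust
use std::collections::VecDeque;
use std::io::Read;

/// Placeholder trait repeated from the sketch above.
trait IncrementalChunker {
    fn push(&mut self, data: &[u8]) -> Vec<u64>;
}

/// Iterates over the chunk boundaries found in the bytes of `reader`.
struct Boundaries<R: Read, C: IncrementalChunker> {
    reader: R,
    chunker: C,
    queued: VecDeque<u64>,
}

impl<R: Read, C: IncrementalChunker> Iterator for Boundaries<R, C> {
    type Item = u64;

    fn next(&mut self) -> Option<u64> {
        let mut buf = [0u8; 4096];
        loop {
            // Drain boundaries already found before reading more input.
            if let Some(offset) = self.queued.pop_front() {
                return Some(offset);
            }
            let n = self.reader.read(&mut buf).ok()?;
            if n == 0 {
                return None; // end of input
            }
            self.queued.extend(self.chunker.push(&buf[..n]));
        }
    }
}
```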
@remram44 You might be interested in an article I wrote about how this is being done.

@RAOF, yes, that looks like a nice, idiomatic API. Also, we will have to distinguish between rolling-sum engine traits and CDC traits.
@RAOF Are you sure the compiler will be able to optimize that away? If I were to propose a generic chunker interface, it would be something along these lines:
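The code block that followed appears to have been lost; judging from the `Read`/`Write` discussion below and from zvault's chunking code, the proposal presumably had roughly this shape (the names and exact signature here are an assumption, not necessarily the original):

```rust
use std::io::{self, Read, Write};

/// Whether more data remains after the chunk just emitted.
pub enum ChunkerStatus {
    Continue, // a boundary was found; the reader still has data
    Finished, // the reader is exhausted
}

pub trait Chunker {
    /// Reads bytes from `r`, writes exactly one chunk to `w`, and reports
    /// whether the input still has data left.
    fn chunk(&mut self, r: &mut dyn Read, w: &mut dyn Write) -> io::Result<ChunkerStatus>;
}
```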
@dswd Depending on the internal implementation, it should be quite simple for the compiler to optimise; for example, when the concrete chunker type is known at the call site.

Edit: particularly, I'd expect the per-byte processing to be inlined.
So how about this?

I actually tested the performance difference in my code between trait objects and monomorphism, and it came out to be almost zero. Maybe this is because the calls to the chunker happen only once per chunk, so the dispatch cost is amortised over the per-byte work.
The `Read`/`Write` interface covers the file case well. For the general interface I think an `Iterator` is more appropriate: you're not always going to be chunking files, and it's easy to go from an iterator to a `Read`/`Write` loop.
The chunker copies the data from the `Read` to the `Write` through an internal buffer as it finds boundaries.
Huh. Why is it doing that? What's the point of copying from the `Read` to the `Write` at all?
It still needs to have an internal buffer, as reading and writing bytewise is too slow; but this way that internal buffer can be of a small fixed size, since it does not have to fit a whole chunk but only the window of the rolling hash.
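A minimal sketch of that idea (illustrative only, not zvault's actual code): stream from the reader to the writer through a small fixed buffer, carrying bytes read past a boundary over to the next chunk, so no whole chunk is ever held in memory:

```rust
use std::io::{self, Read, Write};

/// Copies one chunk from `r` to `w` using only a small fixed buffer.
/// `pending` carries bytes read past the boundary over to the next call;
/// `is_boundary` stands in for the rolling-hash boundary test.
/// Returns false once the input is exhausted.
fn copy_one_chunk<R: Read, W: Write>(
    r: &mut R,
    w: &mut W,
    pending: &mut Vec<u8>,
    mut is_boundary: impl FnMut(u8) -> bool,
) -> io::Result<bool> {
    let mut buf = [0u8; 4096];
    loop {
        // Consume leftover bytes from the previous call before reading more.
        let data = if pending.is_empty() {
            let n = r.read(&mut buf)?;
            if n == 0 {
                return Ok(false); // input exhausted
            }
            buf[..n].to_vec()
        } else {
            std::mem::take(pending)
        };
        for (i, &byte) in data.iter().enumerate() {
            if is_boundary(byte) {
                w.write_all(&data[..=i])?; // finish the current chunk
                pending.extend_from_slice(&data[i + 1..]); // save the rest
                return Ok(true);
            }
        }
        w.write_all(&data)?; // no boundary yet; keep streaming
    }
}
```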
I guess I should probably read the code in context, because I still don't see the point in copying from the `Read` into an internal buffer just to write the same bytes out again.
I'm thinking about an API like this: https://github.com/remram44/cdchunking-rs/blob/fb72120869ab0ba070b30079e4d9ab735f0fb011/src/lib.rs#L19-L65
If you're reading from a file and writing chunks out, either to separate files on disk or to the network, requiring a buffer that can hold whole chunks at a time is completely unnecessary. Having a way of getting chunks a bit at a time and knowing when a new chunk starts is perfectly doable, and I don't believe it complicates the API unnecessarily (and of course both ways can be exposed).
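A minimal sketch of that streaming shape (names are illustrative, loosely modeled on the cdchunking API linked above): the consumer receives pieces of chunk data plus explicit end-of-chunk markers, so memory use stays bounded by the read buffer rather than the chunk size:

```rust
/// Events produced while chunking a stream: data arrives in pieces,
/// and an explicit marker says where each chunk ends.
enum ChunkInput<'a> {
    Data(&'a [u8]), // part of the current chunk
    End,            // the current chunk is complete; a new one starts next
}

// A consumer might, for example, hash each chunk without ever
// buffering a whole one:
//   match event {
//       ChunkInput::Data(piece) => hasher.update(piece),
//       ChunkInput::End => finish_current_chunk(&mut hasher),
//   }
```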
I think for starters we should move the discussion to a separate issue and discuss the API there. :)
Another one: @green-coder is the owner of https://github.com/green-coder/cdc.
One thought on the general structure: as pointed out in https://github.com/aidanhs/rsroll/pull/11/files#r131031976, rolling checksums (the namesake of this repo) like bupsplit, rabin, and gear are a subset of the algorithms used for chunking (which also include fastcdc etc.). So it probably makes sense to separate out the different expected APIs, possibly into different crates: one for rolling checksums (possibly with an API similar to the one in this crate), and one for the higher-level concept of chunking (with APIs dealing with readers/writers and so on); see the sketch after this paragraph.

What do people think of this structure? If there's agreement from a few people, then the org and two repos can be created and API discussion can move to the appropriate place. Or, if people have other thoughts, I'd love to hear them!

There is also the question of licenses, which I just realised is an important one: if we can't agree, then it's difficult to collaborate :) For libraries like this repo I tend to prefer 'permissive' licenses (BSD, Apache, MIT). I don't particularly have a license preference for tools/applications, if we decide the org is the correct place to host them.
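A hedged sketch of what that two-crate split could look like (the trait and method names here are illustrative; nothing has been agreed):

```rust
/// Lower-level crate: a rolling checksum over a sliding window.
pub trait RollingHash {
    fn roll_in(&mut self, byte_in: u8);            // extend the window
    fn roll(&mut self, byte_out: u8, byte_in: u8); // slide the window by one
    fn digest(&self) -> u32;                       // current hash value
}

/// Higher-level crate: content-defined chunking, typically built on a
/// rolling hash but also covering non-rolling schemes like FastCDC.
pub trait BoundaryFinder {
    /// Returns the offset just past the next chunk boundary in `data`,
    /// or None if the chunk continues beyond this buffer.
    fn find_boundary(&mut self, data: &[u8]) -> Option<usize>;
}
```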
I am personally a GPL-ish person, but since this whole effort is meant to be a backbone of the Rust ecosystem, it only makes sense to make it as permissive as possible. On top of it, we can build our own toys with licenses of our choosing.
Note that chunking is also a subset of what rolling checksums are good at (for instance, in rsync they are used to find the original chunks in modified files in linear time). A good model might be to keep the rolling checksums in their own crates, and have a chunking crate that depends on those, with each particular algorithm gated behind a Cargo feature? I'm working towards that in my tentative cdchunking crate, where a chunking method is a separate object plugged into a generic chunker.
On the other hand, rolling hashes are rather simple software-wise (merely algorithms with very few dependencies), and putting that code inside the chunking crate (possibly behind Cargo features) should be easy as well.
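For illustration, the feature-gating could look roughly like this in the chunking crate's Cargo.toml (the crate and feature names are hypothetical):

```toml
[package]
name = "chunking"
version = "0.1.0"

# Each algorithm is optional; users pay only for what they enable.
[features]
default = ["bup"]
bup = ["rollsum"]      # bupsplit via the rollsum crate
rabin = ["rabin-hash"] # hypothetical crate name
gear = ["gear-hash"]   # hypothetical crate name

[dependencies]
rollsum = { version = "0.2", optional = true }
rabin-hash = { version = "0.1", optional = true }
gear-hash = { version = "0.1", optional = true }
```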
As stated in #11, I am OK with licensing my chunking code under BSD/MIT, although I would prefer LGPL. Also, I would contribute my code to a planned chunking crate if the crate provides an interface that is efficient for my use case.
Please move the API discussion to #13. Thank you! :)
Hi everyone,

I've created this issue under `rsroll` because my initial idea was to introduce the authors of three different software packages focused on rolling sums / CDC, and `rsroll` is the one that `rdedup` is currently using, and "the oldest" to my knowledge. However, while writing this post it occurred to me that expanding the whole theme toward general data deduplication, or even the data industry (backup, data storage, etc.), might make sense.

Let me start by stating my opinion that Rust is very well positioned to excel in the data industry. The general attitude toward reliability, the ease of writing parallel code while avoiding copying, zero-cost abstractions, the possibility of exposing C APIs, etc. make it a great tool for the job.

@aidanhs is the author of `rollsum` / `rsroll`, and I am aware that he is currently busy with more Rust-core-related work.

@jmesmon is the author of https://github.com/jmesmon/hash-roll, which is quite a comparable crate that has more rolling-sum algorithms already implemented, and I really like some of the abstractions there: https://docs.rs/hash-roll/0.2.0/hash_roll/trait.Splitter.html

On Reddit I got in touch with @RAOF, who is working on https://github.com/RAOF/chunky, which is focused on Content Defined Chunking, a subject very closely related to rolling sums.

I am the author of https://github.com/dpc/rdedup, a data deduplication engine / backup tool that has reached a level of maturity and functionality comparable to more established alternatives in other programming languages. @phillipCouto is the biggest contributor to `rdedup`.

So my question would be: would you be interested in cooperating? There is no need to commit to anything (like time, which we are all probably always short on).

In practical terms: for starters, we could create a GitHub community around data / data deduplication for Rust. I would move `rdedup` under that umbrella. On top of that, we could work on combining all three libraries under a broader "data deduplication algorithms" effort (rollsum, CDC, etc.). We could ping other authors of backup and data tools in Rust, have an IRC channel, etc., just like other topic-centered communities (gamedev/graphics, embedded, webdev, networking, etc.) do.

One way or another, thanks for your attention and all the good work!