-
Notifications
You must be signed in to change notification settings - Fork 6
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Miscelanous RFC #12
Miscelanous RFC #12
Conversation
* Split CDC and rolling shum APIs * Change the `find_chunk` API to something more idiomatic. * Make `gear` be `u64` again * Improve tests a bit
They don't really help with anything
I always have the same remark, which is that loading the full file in memory is unnecessary, and a deal breaker for me. |
@remram44 Neither the existing nor this API require loading the whole file into memory. You can look how rdedup is using the current one: https://github.com/dpc/rdedup/blob/b016278070cdcc3b6432b00452080aa45da4eb19/lib/src/sg.rs#L163 . It looks a bit more complicated due to |
This fixes optimization fail rust-lang/rust#43653
Yes, it is definitely possible to do this by managing the buffers at a higher level and feeding those to the chunking library. However because this is such a common operation I do believe this should be part of the API exposed by the chunking crate. I went a bit overboard and really fleshed out this chunking library, borrowing from your design and the nested crate I had in dhstore. It ends up having a bunch of different methods/iterators for different uses. I would appreciate your feedback on it, especially the low-level stuff (it should be also possible to feed stuff manually to the rolling-hash like you do). |
@remram44 I'm fine with higher-level wrapper APIs, but the low-level API like the one here is optimal since it does not involve copying anything, or allocating, or arbitrary buffer sizes. There are plenty of situations where someone already does have a data in the buffer. Also, the higher-level APis can come up with a lot of different versions, and people can argue forever which one is best. I rather have all algos with a common, optimal, tiny API, and have everyone wrap it in whatever higher-level API that suits them best. |
@remram44 I've filled dswd/zvault#10 as an example of how much of a performance drop does copying-API cost. For |
How would you go and chunk a file of arbitrary size (e.g. larger than memory) with this API? Can you provide an example. I am asking this, because this is my use case and I guess it is quite generic. |
Just like in rdedup. Using bunch of smaller buffers in |
I adapted |
I was toying a bit, trying to find a way to remove a performance hit found in #11 , and also with the API.
I really like the
API. It gives a nice, idiomatic, low-level handling loop:
and should be easy to build iterators,
Read/Write
adapters, and other APIs on top of it.I'm not sure if this should get merged as a whole, but if you can let me know what seems OK, and what not, I could prepare more targeted patches.