Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Miscelanous RFC #12

Closed
wants to merge 3 commits into from
Closed

Miscelanous RFC #12

wants to merge 3 commits into from

Conversation

dpc
Copy link
Collaborator

@dpc dpc commented Aug 5, 2017

I was toying a bit, trying to find a way to remove a performance hit found in #11 , and also with the API.

I really like the

fn find_chunk<'a>(&mut self, buf: &'a [u8]) -> Option<(&'a [u8], &'a [u8])>;

API. It gives a nice, idiomatic, low-level handling loop:

    let mut buf = v.as_slice();
    while let Some((last, rest)) = cdc.find_chunk(buf) {
        // handle `last` piece of the chunk
        buf = rest;
    }
    // handle leftover `buf` here 

and should be easy to build iterators, Read/Write adapters, and other APIs on top of it.

I'm not sure if this should get merged as a whole, but if you can let me know what seems OK, and what not, I could prepare more targeted patches.

dpc added 2 commits August 4, 2017 23:21
* Split CDC and rolling shum APIs
* Change the `find_chunk` API to something more idiomatic.
* Make `gear` be `u64` again
* Improve tests a bit
They don't really help with anything
@dpc dpc changed the title Reword Miscelanous RFC Aug 5, 2017
@dpc dpc mentioned this pull request Aug 5, 2017
@remram44
Copy link

remram44 commented Aug 5, 2017

I always have the same remark, which is that loading the full file in memory is unnecessary, and a deal breaker for me.

@dpc
Copy link
Collaborator Author

dpc commented Aug 5, 2017

@remram44 Neither the existing nor this API require loading the whole file into memory. You can look how rdedup is using the current one: https://github.com/dpc/rdedup/blob/b016278070cdcc3b6432b00452080aa45da4eb19/lib/src/sg.rs#L163 . It looks a bit more complicated due to OwningRef. Read more about the whole zero-copy, minimum memory usage processing in rdedup: https://dpc.pw/blog/2017/04/rusts-fearless-concurrency-in-rdedup/ .

@dpc dpc mentioned this pull request Aug 6, 2017
@remram44
Copy link

remram44 commented Aug 6, 2017

Yes, it is definitely possible to do this by managing the buffers at a higher level and feeding those to the chunking library. However because this is such a common operation I do believe this should be part of the API exposed by the chunking crate.

I went a bit overboard and really fleshed out this chunking library, borrowing from your design and the nested crate I had in dhstore. It ends up having a bunch of different methods/iterators for different uses. I would appreciate your feedback on it, especially the low-level stuff (it should be also possible to feed stuff manually to the rolling-hash like you do).

@dpc
Copy link
Collaborator Author

dpc commented Aug 6, 2017

@remram44 I'm fine with higher-level wrapper APIs, but the low-level API like the one here is optimal since it does not involve copying anything, or allocating, or arbitrary buffer sizes. There are plenty of situations where someone already does have a data in the buffer.

Also, the higher-level APis can come up with a lot of different versions, and people can argue forever which one is best. I rather have all algos with a common, optimal, tiny API, and have everyone wrap it in whatever higher-level API that suits them best.

@dpc dpc mentioned this pull request Aug 6, 2017
@dpc
Copy link
Collaborator Author

dpc commented Aug 6, 2017

@remram44 I've filled dswd/zvault#10 as an example of how much of a performance drop does copying-API cost.

For rdedupI already have a zero-copy chunking, that I want to be as fast as possible, as it's the only piece of rdedup that is not easily multi-threaded.

@dswd
Copy link

dswd commented Aug 6, 2017

How would you go and chunk a file of arbitrary size (e.g. larger than memory) with this API? Can you provide an example. I am asking this, because this is my use case and I guess it is quite generic.

@dpc
Copy link
Collaborator Author

dpc commented Aug 6, 2017

Just like in rdedup. Using bunch of smaller buffers in OwningArc. I can paste more links later.

@dpc
Copy link
Collaborator Author

dpc commented Aug 10, 2017

@dpc dpc closed this Jul 3, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants