Implement pub fn de/compress_into_slice(input: &[u8], output: &mut [u8]) -> usize #11

Closed
milesgranger opened this issue Feb 15, 2021 · 17 comments

Comments

@milesgranger

Hi!

Thanks for taking the time to put this together, I think it's great! 🚀

I'm in a situation where I tell some other lib what size slice I want (pre-allocating it in Python), and then I get a mutable reference to it. However, it seems all de/compress APIs in lz4_flex either take or return a Vec<u8>.

Is it worthwhile to add something like what's mentioned in the title?
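
Roughly the shape I have in mind, purely as a sketch (the names come from the issue title; nothing like this exists in lz4_flex yet):

// Sketch of the requested additions. The returned usize would be the number of
// bytes actually written into the caller-provided output slice.
pub fn compress_into_slice(input: &[u8], output: &mut [u8]) -> usize {
    todo!("compress `input` into `output` and return the compressed length")
}

pub fn decompress_into_slice(input: &[u8], output: &mut [u8]) -> usize {
    todo!("decompress `input` into `output` and return the decompressed length")
}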

@PSeitz
Owner

PSeitz commented Feb 15, 2021

Hi,

Generally it's possible, but currently more space is allocated than needed to allow more aggressive optimizations. That means the slice size would not reflect the actual length of the uncompressed output.

The de/compress methods which currently push into the Vec would need to be changed to copy to the current output position instead.

@milesgranger
Author

This may be too specific to my use case.

However, while I agree that, optimally and ergonomically, the existing API is best when staying within Rust, it turns out that making a single (over-)allocation in Python, doing the compression op, and then resizing the Python bytearray to the actual size is quite a bit faster than going through two allocations: one in Rust, and another to create a Python object and copy that buffer into it.

I understand if it's not worth it; it's already slightly faster than the existing Python lz4 bindings, but having this ability would make it decisively faster. At least that's my experience with snappy, because it implements the Read trait: that gives me the option to pre-allocate and then resize the buffer (when there is a way to calculate the expected size), or to read into a regular &mut Vec<u8> when the size is unknown (read vs read_to_end). But again, I'm not sure it'd be worthwhile here for your project.

@milesgranger
Author

milesgranger commented Feb 24, 2021

I'd formally like to change my request to have de/compress_into take anything that implements Write for the output. Looking briefly at the source code, I'm not sure how feasible that would be; but in any case, having the ability to pass &mut [u8] would still be helpful.
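
To make the Write idea concrete, something along these lines is what I mean (purely an illustrative sketch, with a made-up function name and a pass-through body, not an existing or proposed lz4_flex signature):

use std::io::{self, Write};

// Hypothetical shape: compress `input` and write the result to any Write
// implementor (a File, a Vec<u8>, a Cursor<&mut [u8]>, ...), returning the
// number of bytes written.
pub fn compress_to_writer<W: Write>(input: &[u8], output: &mut W) -> io::Result<usize> {
    // Stand-in body so the sketch compiles: just copy the input through unchanged.
    output.write_all(input)?;
    Ok(input.len())
}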

@PSeitz
Owner

PSeitz commented Feb 26, 2021

Write trait support would be part of the LZ4 frame format, which is currently missing; only the block format is implemented.

From a performance point of view, the frame API would probably be a little slower, because of some additional overhead and allocations.

@milesgranger
Author

Right, that makes sense. I suppose I could still very much make use of the original request here, and I would use the frame format when/if that comes about. Thanks for the clarification on framed vs. block format here. 👍

@athre0z

athre0z commented Mar 31, 2021

I'd also like to have this kind of API. I'd be totally okay with having to over-allocate the buffer and having the function return how many bytes were actually written. As long as the padding/alignment requirements are documented – or, perhaps even better, a function for calculating the required buffer size is provided – I don't think that's a problem.

My more concrete use case is placing the LZ4 outputs into a bump allocator. In that sense, generalizing compress_into to be

pub fn compress_into<A: std::alloc::Allocator>(input: &[u8], compressed: &mut Vec<u8, A>)

would work as well, however with allocator_api currently still being nightly-only, that's not really a reasonable thing to request.
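
For illustration, this is what the nightly allocator_api looks like in isolation (using Global as a stand-in for the bump allocator; just a sketch, not lz4_flex code):

#![feature(allocator_api)]

use std::alloc::Global;

// Nightly-only illustration of the allocator_api mentioned above: a Vec tied to
// an explicit allocator. With an Allocator-generic compress_into, lz4_flex could
// write directly into memory owned by a caller-chosen allocator.
fn main() {
    let mut buf: Vec<u8, Global> = Vec::with_capacity_in(64, Global);
    buf.extend_from_slice(b"compressed bytes would land here");
    assert!(buf.capacity() >= 64);
}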

@PSeitz
Owner

PSeitz commented Apr 1, 2021

@athre0z thanks, good to know there would be multiple use cases. Internally that's already how it works: it over-allocates with get_maximum_output_size and afterwards sets the actual size of the vector. I thought about it, and this API would make sense for the block format even before the frame format, since it could be utilized by the frame format to avoid double allocations.
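
In caller terms the contract boils down to the following self-contained sketch (fake_compress stands in for the real compressor, and the worst-case bound mirrors LZ4's usual input_len + input_len/255 + 16 instead of calling get_maximum_output_size):

// Over-allocate-then-shrink, as described above. `fake_compress` is only a
// stand-in that copies the input through and reports the bytes written.
fn fake_compress(input: &[u8], output: &mut [u8]) -> usize {
    output[..input.len()].copy_from_slice(input);
    input.len()
}

fn main() {
    let input = b"example payload";
    let worst_case = input.len() + input.len() / 255 + 16; // LZ4's usual worst-case bound
    let mut out = vec![0u8; worst_case];
    let written = fake_compress(input, &mut out);
    out.truncate(written); // keep only the bytes actually produced
    assert_eq!(&out[..], &input[..]);
}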

@PSeitz
Owner

PSeitz commented Apr 4, 2021

On the mut_slice branch, compress_into now accepts a Sink as the output parameter. A Sink is a &mut [u8] slice with additional position information. I'm not sure if the API design will stay like this or if a plain &mut [u8] is fine.
You can give it a try if you like. It would also be interesting to know if this improves performance for you.
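
For anyone following along, a Sink along those lines boils down to roughly this (illustrative sketch only; the actual type on the branch may differ):

// Minimal sketch of a Sink-style wrapper: a caller-provided byte slice plus a
// write position, so the compressor can append without growing a Vec.
pub struct Sink<'a> {
    output: &'a mut [u8], // caller-provided buffer
    pos: usize,           // number of bytes written so far
}

impl<'a> Sink<'a> {
    pub fn new(output: &'a mut [u8]) -> Self {
        Sink { output, pos: 0 }
    }

    /// Append `data` at the current position and advance `pos`.
    /// Panics if the remaining space is too small.
    pub fn extend_from_slice(&mut self, data: &[u8]) {
        self.output[self.pos..self.pos + data.len()].copy_from_slice(data);
        self.pos += data.len();
    }

    /// Number of bytes written so far.
    pub fn pos(&self) -> usize {
        self.pos
    }
}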

@athre0z

athre0z commented Apr 4, 2021

Thanks for implementing this! I can't really compare against previous performance, because having to copy the buffer was a reason for not trying to use the library previously, but I can compare against the lz4 crate, which I used before!

Trivially compressible data

LZ4 flex safe compression: 1.32 GiB/s
LZ4 flex unsafe compression: 1.32 GiB/s (8.3% exec time spent doing compression)
lz4 crate compression (level = 1): 1.00 GiB/s (33.8% exec time spent doing compression)

Mostly incompressible data

LZ4 flex safe compression: 348 MiB/s (70% exec time)
LZ4 flex unsafe compression: 455 MiB/s (61% exec time)
lz4 crate compression (level = 1): 420 MiB/s (64% exec time)

These throughput rates include not only compression but also processing in my application (a sort of message queue geared towards high throughput). I profiled and annotated the amount of CPU time spent within the respective LZ4 implementation. It's worth noting that this is an unfair comparison, because I had been running the lz4 library in framed mode (without noticing it), and it spends quite a bit of execution time in XXH32_update, which I suspect comes from the frame format's checksumming. Nevertheless, the results look very promising!

Do you plan on also reworking the decompression API in the same manner?

@PSeitz
Owner

PSeitz commented Apr 6, 2021

Yes, I updated the master branch to also include decompression with the sink. It's even around 10% faster for compressible data with safe-decoding, which makes sense, since push has additional overhead to check the capacity, which is not required here.
I think you should either use checked-decode, or hash the data and verify the hash (which is part of the frame format, which is not yet implemented); otherwise it could lead to memory corruption if the data is corrupted or manipulated. Memory corruption would not be a big issue if the program is sandboxed, like in wasm, but I don't think that's the case for Python.

@athre0z

athre0z commented Apr 7, 2021

Nice, thanks! Yes -- since my application is networked, I already had checked-decode enabled. :) It seems to be working great in general, but I noticed that the pos field in the Sink object didn't get updated properly. My compressor looks as follows:

// Assumed context, not shown in the original snippet: nightly allocator_api for
// Vec::with_capacity_in/Allocator, and the byteorder crate for the length prefix.
// Compressor, CompressResult and CompressError come from my own crate.
use std::alloc::Allocator;
use std::convert::TryInto;

use byteorder::{ReadBytesExt, WriteBytesExt, LE};

struct LZ4Compressor(i32);

impl Compressor for LZ4Compressor {
    fn compress<A: Allocator + Copy>(&self, input: Vec<u8, A>) -> CompressResult<Vec<u8, A>> {
        // Write uncompressed payload size.
        let len: u32 = input
            .len()
            .try_into()
            .map_err(|_| CompressError::PayloadTooLarge(input.len()))?;

        let cap = lz4_flex::block::compress::get_maximum_output_size(input.len()) + 4;
        let mut out = Vec::with_capacity_in(cap, *input.allocator());
        out.write_u32::<LE>(len).unwrap();

        // Write compressed data.
        unsafe { out.set_len(cap) };
        let actual = lz4_flex::compress_into(&input, &mut (&mut out[4..]).into());
        assert!(actual + 4 <= cap);
        unsafe { out.set_len(actual + 4) };

        Ok(out)
    }

    fn decompress<A: Allocator + Copy>(&self, input: Vec<u8, A>) -> CompressResult<Vec<u8, A>> {
        // Read uncompressed payload size.
        let mut read_slice = &input[..];
        let uncompressed_len = read_slice.read_u32::<LE>().unwrap();

        // Decompress buffer.
        let mut decompressed = Vec::with_capacity_in(uncompressed_len as _, *input.allocator());
        unsafe { decompressed.set_len(uncompressed_len as _) };
        let mut sink: lz4_flex::block::Sink = (&mut decompressed[..]).into();
        lz4_flex::decompress_into(read_slice, &mut sink).unwrap();
        assert_eq!(sink.pos(), uncompressed_len as _);

        Ok(decompressed)
    }
}

And the assert! in decompress fails because pos() is always 0. Perhaps I'm just misunderstanding the field.

@PSeitz
Owner

PSeitz commented Apr 8, 2021

Ah yes, the pos parameter is currently not updated in the decompression case. It's probably a very niche issue, where decompression succeeds but the output is smaller than uncompressed_len, but it could be covered by checking pos.

There is currently an issue I'm looking into where over-allocation is required in the decompression case, but it should not be. I think the logic needs to change a little in how the fast loop is exited and the last bytes are copied. This was previously not an issue because of the owned Vec.

PSeitz mentioned this issue Apr 8, 2021
@athre0z

athre0z commented Apr 8, 2021

Just like with the compression case, I wouldn't mind having to over-allocate the decompression buffer in the presence of a function calculating the required size.

On a more general note, I think it would be best to keep Sink out of the public interface. I do understand the purpose, but feel that the public API would preferably abstract this implementation detail away from the user.

I also wonder if std::io::Cursor could be used internally (which is pretty much the same thing as Sink, implementing Write on Cursor<&mut [u8]>). It might make unsafe writes more complicated, so if you had previously known about this and it simply isn't a good fit, feel very much free to disregard this comment. Sometimes, reusing existing things just isn't a good solution.
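
To make the Cursor idea concrete, this is all it takes on stable std (just an illustration of the mechanism, not a claim about how lz4_flex should implement it):

use std::io::{Cursor, Write};

// Cursor<&mut [u8]> implements Write and tracks the write position, which is
// essentially what Sink provides.
fn main() -> std::io::Result<()> {
    let mut buf = [0u8; 16];
    let mut cursor = Cursor::new(&mut buf[..]);
    cursor.write_all(b"hello")?;
    assert_eq!(cursor.position(), 5); // bytes written so far
    Ok(())
}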

@athre0z

athre0z commented Apr 10, 2021

#[test]
fn lz4_test() {
    let data = vec![4u8, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0];
    let compressor = LZ4Compressor(0); // level field is unused by the impl above
    let encoded = compressor.compress(data.clone()).unwrap();
    let decoded = compressor.decompress(encoded).unwrap();
    assert_eq!(data, decoded);
}

Currently fails with OutputTooSmall { expected_size: 0, actual_size: 20 }. I suppose this is one of the cases covered by "There is currently an issue I'm looking into where over-allocation is required in the decompression case"? It looks like the expected_size is wrong here.

@PSeitz
Owner

PSeitz commented Apr 11, 2021

Yes, that's the issue I meant. I agree that Sink should be replaced; it's mainly there to have a smooth transition from the Vec methods, and Cursor would have been the better choice. I will consider it in the refactoring.

Over-allocation for decompression is not compatible with the frame format, because it could overwrite parts of other blocks.

@PSeitz
Owner

PSeitz commented May 16, 2021

@athre0z This should work now, can you retest? I replaced the Sink API with &mut [u8] + pos.

@athre0z

athre0z commented Jun 3, 2021

Sorry for the late response! Thanks, everything seems to be working as expected now, and the interface is a lot less cumbersome. :) The output_pos argument is slightly redundant (because you can also just do &mut output[output_pos..]), but I also don't mind it.

@PSeitz PSeitz closed this as completed Sep 27, 2021