Support for memoryview and PickleBuffer #99

Closed
crusaderky opened this issue Apr 26, 2023 · 17 comments · Fixed by #100

@crusaderky

As of 2.7.0rc1, cramjam seems to be incompatible with memoryview and PickleBuffer objects.
This is a blocker for adoption in dask/distributed.

>>> cramjam.bzip2.compress(memoryview(b"123"))
TypeError: argument 'data': failed to extract enum BytesType ('bytes | bytearray | File | Buffer | numpy')
- variant Bytes (bytes): TypeError: failed to extract field BytesType::Bytes.0, caused by TypeError: 'memoryview' object cannot be converted to 'PyBytes'
- variant ByteArray (bytearray): TypeError: failed to extract field BytesType::ByteArray.0, caused by TypeError: 'memoryview' object cannot be converted to 'PyByteArray'
- variant RustyFile (File): TypeError: failed to extract field BytesType::RustyFile.0, caused by TypeError: 'memoryview' object cannot be converted to 'File'
- variant RustyBuffer (Buffer): TypeError: failed to extract field BytesType::RustyBuffer.0, caused by TypeError: 'memoryview' object cannot be converted to 'Buffer'
- variant NumpyArray (numpy): TypeError: failed to extract field BytesType::NumpyArray.0, caused by TypeError: 'memoryview' object cannot be converted to 'PyArray<T, D>'

>>> import pickle, numpy
>>> a = numpy.ones(10)
>>> buffers = []
>>> pickle.dumps(a, buffer_callback=buffers.append)
>>> buffers
[<pickle.PickleBuffer at 0x7ff1f5b28540>]
>>> cramjam.bzip2.compress(buffers[0])
TypeError: argument 'data': failed to extract enum BytesType ('bytes | bytearray | File | Buffer | numpy')
- variant Bytes (bytes): TypeError: failed to extract field BytesType::Bytes.0, caused by TypeError: 'PickleBuffer' object cannot be converted to 'PyBytes'
- variant ByteArray (bytearray): TypeError: failed to extract field BytesType::ByteArray.0, caused by TypeError: 'PickleBuffer' object cannot be converted to 'PyByteArray'
- variant RustyFile (File): TypeError: failed to extract field BytesType::RustyFile.0, caused by TypeError: 'PickleBuffer' object cannot be converted to 'File'
- variant RustyBuffer (Buffer): TypeError: failed to extract field BytesType::RustyBuffer.0, caused by TypeError: 'PickleBuffer' object cannot be converted to 'Buffer'
- variant NumpyArray (numpy): TypeError: failed to extract field BytesType::NumpyArray.0, caused by TypeError: 'PickleBuffer' object cannot be converted to 'PyArray<T, D>'
@milesgranger
Owner

Can leverage the buffer protocol to make a zero-copy view into numpy and then pass that; would this be suitable? (The same goes for PickleBuffer.)

Prefer to steer clear of implementing support for an arbitrary number of types, especially if those types already implement the buffer protocol.

>>> import numpy as np
>>> data = memoryview(b'data')
>>> buf = np.frombuffer(data, dtype=np.uint8)
>>> cramjam.lz4.compress(buf)
cramjam.Buffer(len=23)
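
The same trick should cover the PickleBuffer case from the report above, since PickleBuffer also exposes the buffer protocol; a rough sketch (illustrative, not run against this exact version):

>>> import pickle
>>> import numpy as np
>>> a = np.ones(10)
>>> buffers = []
>>> _ = pickle.dumps(a, protocol=5, buffer_callback=buffers.append)
>>> buf = np.frombuffer(buffers[0], dtype=np.uint8)  # zero-copy byte view of the PickleBuffer
>>> compressed = cramjam.lz4.compress(buf)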

@milesgranger
Owner

In retrospect, can likely add a generic PyObject variant, poke at it at runtime to see if it implements the buffer protocol, and get the buffer from there. Suspect it may be slightly slower, but it would be a good general solution for arbitrary objects implementing it.

@crusaderky
Author

Suspect it may be slightly slower

I seriously doubt the slowdown will be noticeable as long as the compression itself deals with 10kiB+

@crusaderky
Author

Prefer to steer clear of implementing support for an arbitrary amount of types

I agree that the individual types should not be explicitly implemented (numpy shouldn't either!)

@milesgranger
Owner

Will the suggested numpy.frombuffer work for you for now?

@milesgranger
Owner

Was interested and prototyped it this morning:

In [1]: import cramjam

In [2]: data = memoryview(b'data')

In [3]: cramjam.lz4.compress(data)
Out[3]: cramjam.Buffer(len=23)

In [4]: import array

In [5]: out = array.array('B', list(range(23)))

In [6]: out
Out[6]: array('B', [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22])

In [7]: cramjam.lz4.compress_into(data, out)
Out[7]: 23

In [8]: out
Out[8]: array('B', [4, 34, 77, 24, 68, 64, 94, 4, 0, 0, 128, 100, 97, 116, 97, 0, 0, 0, 0, 53, 138, 34, 33])

Guess a proper implementation would take a day or two. It may take some time for me to get around to it in my free time; hopefully np.frombuffer is sufficient for you in the meantime.

@martindurant

Like this?

use pyo3::buffer::PyBuffer;
use std::slice;

// value: &PyAny
let buf: PyBuffer<u8> = value.extract().unwrap(); // expose possible error
let rustbuf: &[u8] = unsafe {
    slice::from_raw_parts(buf.buf_ptr() as *const u8, buf.len_bytes())
};

@milesgranger
Owner

milesgranger commented Apr 27, 2023

Roughly, that's basically how the prototype goes. Probably would need to wrap it similarly to PyBytes/PyByteArray, since all variants in BytesType have Read/Write implemented and would need to maintain some positional state between the read/write call(s). PyBuffer has the handy checks for things like readonly, but as_slice gives &[ReadOnlyCell<T>], which is less ergonomic than getting the buffer directly like you did here.


Think whatever the wrapper ends up being, BytesType(PythonBuffer) for example, PythonBuffer ought to have a TryFrom<PyBuffer<u8>> impl which checks for c_contiguous, dimensions == 1, that readonly is appropriate for the context (i.e. writable is required for the _into calls), and maybe some other invariants. If all that checks out, use the underlying buffer directly from there.
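
A minimal sketch of what that validation could look like (the PythonBuffer name and the exact checks are illustrative, not cramjam's actual implementation):

use pyo3::buffer::PyBuffer;
use pyo3::exceptions::PyBufferError;
use pyo3::PyErr;

// Illustrative wrapper; `pos` holds the positional state shared by Read/Write impls.
struct PythonBuffer {
    inner: PyBuffer<u8>,
    pos: usize,
}

impl TryFrom<PyBuffer<u8>> for PythonBuffer {
    type Error = PyErr;

    fn try_from(buf: PyBuffer<u8>) -> Result<Self, Self::Error> {
        // Check the invariants discussed above before trusting the raw pointer.
        if !buf.is_c_contiguous() {
            return Err(PyBufferError::new_err("buffer must be C-contiguous"));
        }
        if buf.dimensions() != 1 {
            return Err(PyBufferError::new_err("buffer must be one-dimensional"));
        }
        // *_into calls would additionally require !buf.readonly().
        Ok(PythonBuffer { inner: buf, pos: 0 })
    }
}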

@crusaderky
Author

Will the suggested numpy.frombuffer work for you for now?

Yes, the hack works. I didn't performance-test it, but the overhead should be negligible.

@milesgranger
Owner

Can try out 2.7.0rc2 for the buffer updates done in #100

@crusaderky
Author

It doesn't seem to work:

>>> cramjam.__version__
'2.7.0-rc2'
>>> cramjam.lz4.compress_block(bytearray(b"123"))
cramjam.Buffer<len=8>
>>> cramjam.lz4.compress_block(memoryview(b"123"))
cramjam.Buffer<len=8>
>>> cramjam.lz4.compress_block(numpy.ones(10))
TypeError: argument 'data': failed to extract enum BytesType ('Buffer | File | pybuffer')
- variant RustyBuffer (Buffer): TypeError: failed to extract field BytesType::RustyBuffer.0, caused by TypeError: 'ndarray' object cannot be converted to 'Buffer'
- variant RustyFile (File): TypeError: failed to extract field BytesType::RustyFile.0, caused by TypeError: 'ndarray' object cannot be converted to 'File'
- variant PyBuffer (pybuffer): TypeError: failed to extract field BytesType::PyBuffer.0, caused by BufferError: buffer contents are not compatible with u8
>>> cramjam.lz4.compress_block(memoryview(numpy.ones(10)))
TypeError: argument 'data': failed to extract enum BytesType ('Buffer | File | pybuffer')
- variant RustyBuffer (Buffer): TypeError: failed to extract field BytesType::RustyBuffer.0, caused by TypeError: 'memoryview' object cannot be converted to 'Buffer'
- variant RustyFile (File): TypeError: failed to extract field BytesType::RustyFile.0, caused by TypeError: 'memoryview' object cannot be converted to 'File'
- variant PyBuffer (pybuffer): TypeError: failed to extract field BytesType::PyBuffer.0, caused by BufferError: buffer contents are not compatible with u8

@milesgranger
Owner

milesgranger commented May 2, 2023

The buffer contents need to be u8 (bytes):

>>> np.ones(10).dtype
dtype('float64')
>>> cramjam.lz4.compress_block(np.ones(10, dtype=np.uint8))  # or np.ones(10).tobytes()
cramjam.Buffer<len=15>

@milesgranger
Owner

Or better yet np.ones(10).view(np.uint8) for no copies.

@crusaderky
Author

Or

m = memoryview(a).cast("B")
cramjam.lz4.compress_block(m)

None of the other compression libraries have this caveat; they're happy to ingest any PickleBuffer. Could you fix it? (no rush)

@milesgranger
Owner

milesgranger commented May 2, 2023

All other compression libraries don't have this caveat and are happy to ingest any PickleBuffer; could you fix it?

I agree, this is not great.

I'd like to get the bytes view of buffers directly, which can be done, but the current implementation of PyBuffer<T> requires an element type, in this case u8, and we don't actually need that since we can always just get the bytes view. So it would 'only' need a refactor to use the C FFI more directly and avoid using PyBuffer<T> at all.
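
For what it's worth, a rough sketch of what going through the C FFI directly might look like (hypothetical helper, not the actual refactor): request a plain byte view via PyObject_GetBuffer so the exporter's element type never matters, then hand the bytes to the compressor.

use pyo3::prelude::*;
use pyo3::{ffi, AsPyPointer};

// Illustrative only: obtain a contiguous byte view of any buffer-protocol object,
// pass it to `f`, and release the view afterwards.
fn with_byte_view<R>(obj: &PyAny, f: impl FnOnce(&[u8]) -> R) -> PyResult<R> {
    unsafe {
        let mut view: ffi::Py_buffer = std::mem::zeroed();
        if ffi::PyObject_GetBuffer(obj.as_ptr(), &mut view, ffi::PyBUF_SIMPLE) != 0 {
            // A Python BufferError is already set; surface it to the caller.
            return Err(PyErr::fetch(obj.py()));
        }
        let bytes = std::slice::from_raw_parts(view.buf as *const u8, view.len as usize);
        let result = f(bytes);
        ffi::PyBuffer_Release(&mut view);
        Ok(result)
    }
}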

@milesgranger
Owner

@crusaderky v2.7.0rc3 ought to work for you.

@crusaderky
Author

Yep works great! 👍
