Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve performance when converting python bytes/bytearray to Vec<u8> #2888

Open
vxgmichel opened this issue Jan 18, 2023 · 8 comments · Fixed by #2899
Open

Improve performance when converting python bytes/bytearray to Vec<u8> #2888

vxgmichel opened this issue Jan 18, 2023 · 8 comments · Fixed by #2899

Comments

@vxgmichel
Copy link
Contributor

Hi all, I noticed a performance issue when extracting a PyBytes or a PyByteArray object into a Vec<u8>.

This is an issue one can easily run into without realizing it. Here's a scenario, let's say we'd like to expose a simple checksum function:

#[pyfunction]
fn checksum(data: &[u8]) -> PyResult<u8> {
   let mut result = 0;
   for x in data {
       result ^= x;
   }
   Ok(result)
}

See how it performs against the equivalent python implementation, processing 1MB a hundred times:

2.65s call     test_checksum.py::test_perf[py-bytes]
0.00s call     test_checksum.py::test_perf[rs-bytes]

Looks really fast! However, it won't accept a bytearray as an argument:

TypeError: argument 'data': 'bytearray' object cannot be converted to 'PyBytes'

So we update our implementation to take a Vec<u8> instead:

#[pyfunction]
fn checksum(data: Vec<u8>) -> PyResult<u8> {
    let mut result = 0;
    for x in data {
        result ^= x;
    }
    Ok(result)
}

And now the results:

2.61s call     test_checksum.py::test_perf[py-bytearray]
2.55s call     test_checksum.py::test_perf[py-bytes]
1.92s call     test_checksum.py::test_perf[rs-bytearray]
1.87s call     test_checksum.py::test_perf[rs-bytes]

It performs roughly the same as python, which makes sense if we look at the FromPyObject implementation for Vec<T>:

pyo3/src/types/sequence.rs

Lines 314 to 318 in bed4f9d

let mut v = Vec::with_capacity(seq.len().unwrap_or(0));
for item in seq.iter()? {
v.push(item?.extract::<T>()?);
}
Ok(v)

The bytes/bytearray object is iterated and each item (i.e a python integer) is separately extracted into a u8.

This could be fixed by specializing the extract logic in the case of a Vec<u8> and use specific methods such as PyBytes::as_bytes().to_vec() and PyByteArray::to_vec(). Here's a possible patch:
https://gist.github.com/vxgmichel/367e01e8504cb9c9e700a22525e8b68d

With this patch applied, the performance is now similar to what we had with the &[u8] slice:

2.70s call     test_checksum.py::test_perf[py-bytearray]
2.65s call     test_checksum.py::test_perf[py-bytes]
0.00s call     test_checksum.py::test_perf[rs-bytes]
0.00s call     test_checksum.py::test_perf[rs-bytearray]
@adamreichold
Copy link
Member

I understand this might be unsatisfying, but I think without stable specialization, the preferred course of action is manual downcasting to the relevant types like PyBytes and PyByteArray. Using an enum with #[derive(FromPyObject)] can make limit the required boilerplate. Especially since downcasting to PyByteArray opens the unsafe route towards not copying at all.

The Option-based specialization seems to be a bit of wart for this special case. And if applied, I would probably suggest limiting it to Vec<u8> as anything beyond "bytes" goes right into buffer protocol and/or NumPy array territory.

@davidhewitt
Copy link
Member

I agree that adding extract_sequence to FromPyObject as proposed in the gist is very unsatisfying.

That said, I've hit the Vec<u8> performance cliff myself in production (it can also cause OOM), which took me some time to debug, so I would be open to a discussion if there's something we can do to make this case less of a footgun.

@vxgmichel
Copy link
Contributor Author

the preferred course of action is manual downcasting to the relevant types like PyBytes and PyByteArray. Using an enum with #[derive(FromPyObject)] can make limit the required boilerplate.

I'm not sure I understand this part correctly, is that a suggestion for pyo3 itself or the user code? At the moment the best solution I found for the user code is using a dedicated wrapper type that implements FromPyObject:

struct BytesWrapper(Vec<u8>);

impl<'source> FromPyObject<'source> for BytesWrapper {
    fn extract(object: &'source PyAny) -> PyResult<Self> {
        Ok(BytesWrapper(match object.extract::<&PyByteArray>() {
            Ok(x) => x.to_vec(),
            Err(_) => object.extract::<&PyBytes>()?.as_bytes().to_vec(),
        }))
    }
}

impl From<BytesWrapper> for Vec<u8> {
    fn from(data: BytesWrapper) -> Self {
        data.0
    }
}

Then it can be used as such:

#[pyfunction]
fn checksum_fixed(data: BytesWrapper) -> PyResult<u8> {
    let data: Vec<u8> = data.into();
    let mut result = 0;
    for x in data {
        result ^= x;
    }
    Ok(result)
}

It's not entirely satisfactory though because the unwrapping can be quite annoying in some cases, say if the argument is something like Option<Vec<Vec<u8>>>. In this case, the type conversion would look like this:

#[pyfunction]
fn some_function(data: Option<Vec<BytesWrapper>>) -> PyResult<u8> {
    let data: Option<Vec<Vec<u8>>> = data.map(|x| x.into_iter().map(|y| y.into()).collect());
    ...
}

Any suggestion on how to address this issue?

@adamreichold
Copy link
Member

adamreichold commented Jan 20, 2023

I'm not sure I understand this part correctly, is that a suggestion for pyo3 itself or the user code?

This was a suggestion for user code, but less about wrapping the result, but more about manually writing the "extract chains", e.g.

#[derive(FromPyObject)]
enum BytesWrapper<'py> {
  Bytes(&'py PyBytes),
  ByteArray(&'py PyByteArray),
}

impl From<BytesWrapper<'_>> for Vec<u8> {
  fn from(wrapper: BytesWrapper) -> Self {
    match wrapper {
      BytesWrapper::Bytes(bytes) => bytes.as_bytes().to_vec(),
      BytesWrapper::ByteArray(byte_array) => byte_array.to_vec(),
    }
  }
}

but of course this does not help with getting rid of the wrapper type. It might enable zero-copy usage though, e.g.

impl<'py> From<BytesWrapper<'py>> for Cow<'py, [u8]> {
  fn from(wrapper: BytesWrapper<'py>) -> Self {
    match wrapper {
      BytesWrapper::Bytes(bytes) => Cow::Borrowed(bytes.as_bytes()),
      BytesWrapper::ByteArray(byte_array) => Cow::Owned(byte_array.to_vec()),
    }
  }
}

In the approach you outlined above, I think implementing Deref and DerefMut might help. It would not remove the wrapper, maybe you could use it transparently in most places? (Beyond that, your wrapper could use the exact same representation as Vec<u8> via #[repr(transparent)] so complex types built on top could be unsafely transmuted, but I would not recommend that.)

As written above, the proper solution here would be specialization so that we could provide specialized impls for impl FromPyObject<'a> for Vec<u8> which considers PyBytes and PyByteArray.

That said, I've hit the Vec performance cliff myself in production (it can also cause OOM), which took me some time to debug, so I would be open to a discussion if there's something we can do to make this case less of a footgun.

For now, I see two hacks which would be palatable to me personally:

  • impl FromPyObject<'a> for Cow<'a, [u8]> which explicitly handles PyBytes and PyByteArray. I think this would work as we have no more general impl for Cow<'a, T> with which it could conflict ATM.
  • Use TypeId::of to check TypeId::of::<T>() == TypeId::of::<u8>() in the generic impl FromPyObject<'a> for Vec<T> and if that condition holds, handle PyBytes and PyByteArray and unsafely transmute the resulting Vec<u8> into Vec<T> (which are the same in that case). Ideally, the compiler optimize away this poor man's specialization as both TypeId are known at compile time.

@adamreichold
Copy link
Member

Use TypeId::of to check TypeId::of::() == TypeId::of::() in the generic impl FromPyObject<'a> for Vec and if that condition holds, handle PyBytes and PyByteArray and unsafely transmute the resulting Vec into Vec (which are the same in that case). Ideally, the compiler optimize away this poor man's specialization as both TypeId are known at compile time.

Ah, this one of course does not work as the T: 'static bound is not fulfilled. Sorry for the noise...

@vxgmichel
Copy link
Contributor Author

vxgmichel commented Jan 20, 2023

This was a suggestion for user code, but less about wrapping the result, but more about manually writing the "extract chains", e.g. [...]

Looks nice, I'll do that :)

In the approach you outlined above, I think implementing Deref and DerefMut might help. It would not remove the wrapper, maybe you could use it transparently in most places? (Beyond that, your wrapper could use the exact same representation as Vec via #[repr(transparent)] so complex types built on top could be unsafely transmuted, but I would not recommend that.)

I think I'll implement an UnwrapBytesWrapper trait as such:

pub trait UnwrapBytesWrapper {
    type ResultType;
    fn unwrap_bytes(self) -> Self::ResultType;
}

impl UnwrapBytesWrapper for BytesWrapper<'_> {
    type ResultType = Vec<u8>;
    fn unwrap_bytes(self) -> Self::ResultType {
        self.into()
    }
}

impl<T> UnwrapBytesWrapper for Option<T>
where
    T: UnwrapBytesWrapper,
{
    type ResultType = Option<T::ResultType>;
    fn unwrap_bytes(self) -> Self::ResultType {
        self.map(|x| x.unwrap_bytes())
    }
}

impl<T> UnwrapBytesWrapper for Vec<T>
where
    T: UnwrapBytesWrapper,
{
    type ResultType = Vec<T::ResultType>;
    fn unwrap_bytes(self) -> Self::ResultType {
        self.into_iter().map(|x| x.unwrap_bytes()).collect()
    }
}

This way I can use the same conversion code everywhere:

#[pyfunction]
fn some_function(data: Option<Vec<BytesWrapper>>) -> PyResult<u8> {
    let data = data.unwrap_bytes();
    ...
}

As written above, the proper solution here would be specialization so that we could provide specialized impls for impl FromPyObject<'a> for Vec which considers PyBytes and PyByteArray.

I'm looking forward to that, thanks for your help :)

vxgmichel added a commit to Scille/parsec-cloud that referenced this issue Jan 20, 2023
vxgmichel added a commit to Scille/parsec-cloud that referenced this issue Jan 20, 2023
vxgmichel added a commit to Scille/parsec-cloud that referenced this issue Jan 26, 2023
bors bot added a commit that referenced this issue Feb 22, 2023
2899: RFC: Provide a special purpose FromPyObject impl for byte slices r=davidhewitt a=adamreichold

This enables efficiently and safely getting a byte slice from either bytes or byte arrays.

The main issue I see here is discoverability, i.e. should this be mention in the docs of `PyBytes` and `PyByteArray` or in the guide?

It is also not completely clear whether this really _fixes_ the issue.

Closes #2888 

Co-authored-by: Adam Reichold <[email protected]>
@bors bors bot closed this as completed in bbaceb8 Feb 22, 2023
@vxgmichel
Copy link
Contributor Author

vxgmichel commented Feb 27, 2023

Hi all, I think the bot went a bit too quick on closing this issue.

PR #2899 is an interesting feature to PyO3 but I don't think it addresses the specific performance issue I reported. In my opinion, either bytes -> Vec<u8> is supported or it's not.

If it is, the python iteration over the individual bytes should really be avoided as it makes programs much slower than their pure python alternative, which defeats an important purpose of PyO3.

If it is not, it should not appear in the docs like it does today:

| `bytes` | `Vec<u8>`, `&[u8]`, `Cow<[u8]>` | `&PyBytes` |

I also quickly read through the comments in #2899 about banning Vec<u8> using a linter. I'm not an experienced rust developer but isn't it more convoluted than the fix I suggested in my original post, for specializing the specific bytes -> Vec<u8> conversion using as_bytes().to_vec()? I understand it does not conform to your quality standards, but it does achieve its goals as far as I could tell. Maybe someone more experienced could get a satisfying implementation of this specialization?

@davidhewitt
Copy link
Member

I am happy to reopen this, as I agree the current situation around Vec<u8> is a performance footgun, both as an input and output argument to #[pyfunction].

Agreed that either we need to make this clear in the documentation (presumably recommending the new Cow implementation) or change the design.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants