Avoiding struct method call overhead of `extract_pyclass_ref_mut` #3843

Thell · 2024-02-15T22:08:56Z

Firstly, thanks for PyO3, it's great!

This issue is being opened following a [chat post on gitter].(https://matrix.to/#/!AAhjIWoaKSExrkkhlG:gitter.im/$9gOL75NMgSzKVrKugcQUMBUFcy5QMpLvxuJafAHLieU?via=gitter.im&via=matrix.org&via=nitro.chat)

In short, consider these performance ordered cargo bench results:

running 5 tests
test tests::bench_xoshiro_struct       ... bench:  56,930,651 ns/iter (+/- 145,472)
test tests::bench_lcg_struct           ... bench:  86,424,781 ns/iter (+/- 337,028)
test tests::bench_lcg_static_lazy      ... bench:  86,614,389 ns/iter (+/- 781,944)
test tests::bench_lcg_static           ... bench:  87,029,922 ns/iter (+/- 353,495)
test tests::bench_xoshiro_static_lazy  ... bench: 108,713,707 ns/iter (+/- 1,500,070)

and contrast that to these timeit results:

lcg_static:            2.2382071590000123
lcg_static_lazy:       2.2475717499983148
xoshiro_static_lazy:   2.333924032998766
lcg_struct:            2.603333091999957
xoshiro_struct:        2.6484997750012553

See something odd there? 🤓 Hey structs! What is going on!?!? Sure we have overhead but all of these test functions are returning a simple type to python and, yeah, the structs have to be mutable but for xoshiro_struct to go from being about twice as fast as the lazy to about 14% slower is unexpected, and to see both of the 'struct' versions leading in the Rust benches to trailing in the Python timeits is unexpected.

Running some long pytest loops on these to get enough samples to see what's going on reveals that our two structs are getting penalized for being structs:

The question is can this be avoided through some hint/annotation to short-circuit that extract/type-info stack?

Here's one of the minimal structs:

#[pyclass]
pub struct XoshiroStruct {
    state: Xoshiro256Plus,
}

impl XoshiroStruct {
    fn next_state(&mut self) -> u64 {
        self.state.next_u64()
    }
}

#[pymethods]
impl XoshiroStruct {
    #[new]
    pub fn new() -> Self {
        Self {
            state: Xoshiro256Plus::from_entropy(),
        }
    }

    pub fn do_something(&mut self) -> u64 {
        do_it(self.next_state())
    }
}

Here's the profile: pytest 2024-02-15 22.59 profile.json.gz

And here's a repo with these examples: https://github.com/Thell/struct_perf

adamreichold · 2024-02-16T05:42:11Z

Generally speaking, some amount of overhead is unavoidable due pervasive shared mutability applied in Python which requires us to use interior mutability patterns to enable safe Rust code from obtaining a &mut Self at all.

You might be able to reduce that overhead by opting out of dynamic borrow check by marking your pyclass as frozen using #[pyclass(frozen)] (check the guide for details). This does imply that do_something must take &self and hence you need to manage the interior mutability yourself. But since you seem to be benchmarking small PRNG, using std::cell::Cell should be possible with doing reference checking, e.g.

let mut state = self.state.get();
let value = self.next_state();
self.state.set(state);

Thell · 2024-02-16T17:53:59Z

Thanks for the reply @adamreichold. Unfortunately that doesn't seem to have negated the "extract" cost, although it is now extract_pyclass_ref instead of extract_pyclass_ref_mut. :)

The image cut off the struct declaration:

#[pyclass(frozen)]
pub struct XoshiroStruct {
    state: Cell<Xoshiro256Plus>,
}

note: if anyone else tries to do Cell<Xoshiro256Plus> be warned you'll need to use a local rand_xoshiro and add the impl for Copy since it only has Clone in the official repo.

davidhewitt · 2024-02-16T22:31:57Z

So from your flamegraphs it looks like the time is dominated by retrieval of the type object, which in general is a necessary part of the type check. I've got a couple of thoughts here:

@Thell can you try using lto = true in your Cargo.toml if you are not already doing so? I wonder if more aggressive inlining can help here.
I wonder if this is a hint that we should be exploring storing these type objects within in the module state; the "lazy type object" machinery seems to be doing a lot of work. (This would be a bigger refactor of PyO3).
We don't currently have a flag to opt out of the type check. Depending on the protection that CPython gives us, this type check may actually not be needed at all.

@samuelcolvin might have hit the same thing when observing "methods overhead" in #3827

Thell · 2024-02-16T22:42:21Z

@davidhewitt It's set already.

ariebovenberg mentioned this issue Mar 10, 2024

Ability to include bare ffi methods in a #[pyclass] #3949

Open

This was referenced Mar 29, 2024

add benchmark for class / method calls #4016

Merged

skip type check of method receivers where unnecessary #4026

Draft

davidhewitt mentioned this issue Oct 11, 2024

Performance: calling overhead #3827

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Avoiding struct method call overhead of `extract_pyclass_ref_mut` #3843

Avoiding struct method call overhead of `extract_pyclass_ref_mut` #3843

Thell commented Feb 15, 2024

adamreichold commented Feb 16, 2024

Thell commented Feb 16, 2024 •

edited

Loading

davidhewitt commented Feb 16, 2024

Thell commented Feb 16, 2024

Avoiding struct method call overhead of extract_pyclass_ref_mut #3843

Avoiding struct method call overhead of extract_pyclass_ref_mut #3843

Comments

Thell commented Feb 15, 2024

adamreichold commented Feb 16, 2024

Thell commented Feb 16, 2024 • edited Loading

davidhewitt commented Feb 16, 2024

Thell commented Feb 16, 2024

Avoiding struct method call overhead of `extract_pyclass_ref_mut` #3843

Avoiding struct method call overhead of `extract_pyclass_ref_mut` #3843

Thell commented Feb 16, 2024 •

edited

Loading