perf: Add object rvalue overload for accessors. Enables reference stealing #3970

Skylion007 · 2022-05-24T16:43:15Z

Description

The attr accessors had a fairly inefficient code path. Specifically, every access that used a py::object as the key (such as a py::str, py::int_, etc.) converted the key to a handle and then copied it back into an object, causing unnecessary reference count operations (INCREF/DECREF). We can simplify this significantly for the common case of accessing an attr using an rvalue. We just need to add an additional object&& overload and use the corresponding move ctor. This doesn't change directly observable behavior; it's just a simple performance optimization that eliminates unnecessary reference count operations by using the object's move ctor.

Testing has observed a speedup of up to 30% in code with heavy attr accesses. Any code that accesses keys or values of an accessor with rvalues should benefit.
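To make the two code paths concrete, here is a minimal, self-contained sketch. It does not use pybind11: `FakePyObject`, the global counters, `accessor`, `attr_via_handle`, and `attr_via_rvalue` are hypothetical stand-ins invented for illustration, and the toy `handle`/`object` classes only model which operations touch the reference count. Under those assumptions it counts the simulated INCREF/DECREF calls for a temporary key passed through a handle versus one passed as an rvalue.

```cpp
#include <cassert>
#include <utility>

// Toy stand-ins for pybind11's handle/object. These are NOT the real classes;
// they only model who performs INCREF/DECREF.
struct FakePyObject {
    long refcnt = 1; // starts with one reference, owned by its creator
};

static int g_increfs = 0; // number of simulated Py_INCREF calls
static int g_decrefs = 0; // number of simulated Py_DECREF calls

class handle {
public:
    handle() = default;
    explicit handle(FakePyObject *p) : m_ptr(p) {}
    void inc_ref() const { if (m_ptr) { ++m_ptr->refcnt; ++g_increfs; } }
    void dec_ref() const { if (m_ptr) { --m_ptr->refcnt; ++g_decrefs; } }

protected:
    FakePyObject *m_ptr = nullptr; // non-owning pointer
};

class object : public handle {
public:
    struct borrowed_t {};
    struct stolen_t {};
    object(handle h, borrowed_t) : handle(h) { inc_ref(); } // takes a new reference
    object(handle h, stolen_t) : handle(h) {}               // adopts an existing one
    object(const object &o) : handle(o) { inc_ref(); }
    object(object &&o) noexcept : handle(o) { o.m_ptr = nullptr; } // no refcount traffic
    ~object() { dec_ref(); }
};

// Hypothetical accessor: a non-owning target plus an owning key.
struct accessor {
    handle obj;
    object key;
};

// Old-style path: the key decays to a handle and is borrowed back into an object.
accessor attr_via_handle(handle obj, handle key) {
    return {obj, object(key, object::borrowed_t{})};
}

// New-style path: an rvalue key is moved, so its reference is stolen.
accessor attr_via_rvalue(handle obj, object &&key) {
    return {obj, std::move(key)};
}

struct ref_ops {
    int increfs;
    int decrefs;
};

ref_ops count_handle_path() {
    FakePyObject target, name;
    g_increfs = g_decrefs = 0;
    {
        accessor acc = attr_via_handle(handle(&target),
                                       object(handle(&name), object::stolen_t{}));
        (void) acc;
    }
    return {g_increfs, g_decrefs};
}

ref_ops count_rvalue_path() {
    FakePyObject target, name;
    g_increfs = g_decrefs = 0;
    {
        accessor acc = attr_via_rvalue(handle(&target),
                                       object(handle(&name), object::stolen_t{}));
        (void) acc;
    }
    return {g_increfs, g_decrefs};
}
```

In this toy model the handle path pays one INCREF and one extra DECREF per access that the rvalue path avoids. The real classes are more involved, so treat the counts as illustrative only.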

Suggested changelog entry:

* Added an accessor overload of ``(object &&key)`` to reference steal the object when using Python types as keys. This prevents unnecessary reference count overhead for attr, dictionary, tuple, and sequence lookups. Added additional regression tests.
* Fixed a performance bug that caused accessor assignments to potentially perform unnecessary copies.

rwgk · 2022-05-25T23:30:15Z

I tend to be deliberately a bit ignorant towards low-level mechanics***, but trying to understand this PR, is the following correct?

Both handle and object are just PyObject * underneath.
Copying a pointer or moving a pointer does not make a difference in terms of runtime performance.
Deleting an object means DECREF, and creating a new one with reinterpret_borrow<object> means INCREF.

Are those DECREF/INCREF what this PR is eliminating?

If that's the point, what kind of test could ensure that this optimization is not accidentally undone?

Also, did you already run a micro benchmark to quantify the best-case performance gain? — Something temporary and quick & dirty to get a rough idea would seem sufficient.

*** I'm generally more focused on high-level aspects & safety than low-level efficiency considerations, trusting that modern compilers will optimize out obvious inefficiencies. I believe cluttering or complicating code to squeeze out a tiny bit of extra efficiency negatively impacts readability & long-term maintainability, and it makes future development work & refactoring more bug-prone. That could result in a significant waste of human time, the most expensive resource we have, just to save a tiny amount of very cheap machine time.

Skylion007 · 2022-05-25T23:45:40Z

Are those DECREF/INCREF what this PR is eliminating?

Yes. I didn't do a speed benchmark per se; I just tried to eliminate as many INCREFs and DECREFs as possible, both for performance reasons and to remove potential race conditions if the GIL is not held.

rwgk · 2022-05-25T23:47:09Z

Are those DECREF/INCREF what this PR is eliminating?
Yes

What test could ensure that this optimization is not accidentally undone?

Skylion007 · 2022-05-25T23:51:31Z

What test could ensure that this optimization is not accidentally undone?

Writing the tests for that would be pretty difficult. I would normally use constructor_stats.h, but we would need to override the ctors and dtor of the object class to measure that. Any thoughts on how to do it @rwgk? I think it's more of a style issue TBH. reinterpret_borrow should not be used on objects. Maybe we could add an overload to reinterpret_borrow that throws an exception if it's a py::object (to prevent the implicit conversion to a handle). There is no reason to use it with objects, but it can be done due to the implicit handle conversion. Thoughts @rwgk?
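One hypothetical way to realize the guard idea from this comment is sketched below. Note the swap: instead of an overload that throws at runtime, it uses a deleted overload, so the misuse is rejected at compile time. Everything here is an assumption for illustration: the toy `handle`/`object`, the simplified `reinterpret_borrow`, and the `can_borrow_from` detection trait are not presented as pybind11's API.

```cpp
#include <cassert>
#include <type_traits>
#include <utility>

struct FakePyObject {
    long refcnt = 1;
};

// Toy non-owning handle (NOT pybind11's class).
class handle {
public:
    handle() = default;
    explicit handle(FakePyObject *p) : m_ptr(p) {}
    void inc_ref() const { if (m_ptr) { ++m_ptr->refcnt; } }
    void dec_ref() const { if (m_ptr) { --m_ptr->refcnt; } }

protected:
    FakePyObject *m_ptr = nullptr;
};

// Toy owning object (NOT pybind11's class).
class object : public handle {
public:
    struct borrowed_t {};
    object(handle h, borrowed_t) : handle(h) { inc_ref(); }
    object(const object &o) : handle(o) { inc_ref(); }
    ~object() { dec_ref(); }
};

// Legitimate use: borrowing a new reference from a non-owning handle.
template <typename T>
T reinterpret_borrow(handle h) {
    return T(h, typename T::borrowed_t{});
}

// Hypothetical guard: an argument that already is an object binds to this
// overload more closely than to the handle one, and using it is an error.
template <typename T>
T reinterpret_borrow(const object &) = delete;

// Detection trait: true only if reinterpret_borrow<object>(Arg) is usable.
template <typename Arg, typename = void>
struct can_borrow_from : std::false_type {};

template <typename Arg>
struct can_borrow_from<Arg,
                       decltype(void(reinterpret_borrow<object>(std::declval<Arg>())))>
    : std::true_type {};
```

With these toy types, `can_borrow_from<handle>::value` is true while `can_borrow_from<object &>::value` is false, so the object -> handle -> object round trip no longer compiles. Whether such a guard would be acceptable in the real code base is a separate question, since it would also reject any existing call sites that pass an object.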

Skylion007 · 2022-05-25T23:57:31Z

@rwgk TLDR: the real evil here is the antipattern of objects -> handle -> objects; it does increfs / decrefs for no reason. I am not sure of a way to guard against it though, since handle -> object is a valid use case. I normally agree with you about trusting the compiler to do this stuff for us, but this is an obvious case where it is failing, and the overhead of maintaining these overloads is negligible.

Skylion007 · 2022-05-25T23:58:55Z

@rwgk Also, I wouldn't normally add these extra overloads just for that, but the fact that this is such a hot code path (any attr access) made me want to optimize it.

rwgk · 2022-05-26T00:09:50Z

To me it feels like we're driving blindly (no performance measurements) and without guard rails (test to ensure the optimization is not accidentally undone). Re performance measurements: I believe that is only a few minutes worth of effort, to get a rough idea at least; i.e. we're driving blindly unnecessarily. Re tests, letting basic best practices slip needs to be clearly stated in the PR description, with a rationale, to prove (incl. to ourselves) that we carefully thought about it.

Skylion007 · 2022-05-26T00:41:22Z

@rwgk Care to do the benchmarking then? We currently don't have any benchmarking framework on master at all.

Skylion007 · 2022-05-26T00:45:40Z

@rwgk Googling, this is effectively the same as benchmarking incref / decref, and I don't expect any performance difference there from the CPython benchmarks. I do not expect any significant speed difference from this PR anyhow. I'll update the PR description with this explanation.

rwgk · 2022-05-26T00:46:48Z

@rwgk Care to do the benchmarking then? We currently don't have any benchmarking framework on master at all.

We don't need one for this (although it would be nice to have), a quick 10 minute thing like this will do:

https://github.com/rwgk/rwgk_tbx/blob/main/test_perf_error_already_set.cpp
https://github.com/rwgk/rwgk_tbx/blob/main/test_perf_error_already_set.py

Just post a link to your code and the results here.

rwgk · 2022-05-26T01:36:09Z

I've been staring at this code for a while:

    template <typename T>
    void operator=(T &&value) & {
        get_cache() = ensure_object(object_or_cast(std::forward<T>(value)));
    }

With

    private:
        static object ensure_object(object &&o) { return std::move(o); }
        static object ensure_object(handle h) { return reinterpret_borrow<object>(h); }

        object &get_cache() const {
            if (!cache) {
                cache = Policy::get(obj, key);
            }
            return cache;
        }
    private:
        handle obj;
        key_type key;
        mutable object cache;

I.e. in operator=(T &&value), on the lhs we have an object & (l-value) and on the right side an object (r-value), IIUC.

Is the compiler forced or only permitted to elide the copy here?

Mainly I was looking for a way to test that the INCREF/DECREF cycle is in fact eliminated, but I failed to see one.

But what stood out to me, does cache have to be object? Could we make it handle? Then we have obvious & full control over when INCREF and DECREF are invoked. AFAICT, if we update the (copy, move) x (ctor, operator=) and dtor accordingly we should be fine. Could that unlock other manual optimizations?

Generally thinking:

If we had a test that proves the INCREF/DECREF cycle is in fact eliminated, ok. We don't know how much that helps, but at least we're sure it's gone and won't come back.
If we could at least measure a gain, at least one platform, ok.

But without either I feel very uneasy about making the code more complex.

If none of us can see a way to implement a test, then a rough benchmark would be useful. If and only if that gives us measurements that prove it's worth the trouble, the handle approach might be interesting, although with such measurements I'd also be OK with this PR as-is.

Skylion007 · 2022-05-26T02:46:06Z

Is the compiler forced or only permitted to elide the copy here?

It's explicitly allowed to in C++14, but not required to elide it. It's explicitly required to elide the copy in C++17. I think it should elide it in C++11.

It doesn't have to be an object; it could be a handle, but then we would have to handle the inc_ref and dec_ref ourselves, and the cleanest way to do that is as an object. I also don't see how this makes the code any more complex here. I could reduce the duplication with templates, but that would just make it even harder to read IMO. @rwgk If you have an idea about how to handle the increfs and decrefs, go ahead, but I don't see it being any cleaner than it being an object like it currently is.

But without either I feel very uneasy about making the code more complex.

I don't see how it makes the code more complex, it just adds a few overloads.

rwgk · 2022-05-26T04:43:38Z

It's explicitly allowed to in C++14, but not required to elide it. It's explicitly required to elide the copy in C++17. I think it should elide it in C++11.

I looked around some more and now I'm thinking that my "Is the compiler forced or only permitted to elide the copy here?" was actually beside the point. The correct question: "Is the compiler forced or only permitted to use the move ctor here?" I don't know all the rules that determine that, but my guess is it is actually forced to use the move ctor here. So that's great.

Could you suggest a code snippet for a best-case micro benchmark? Just the part to loop over, I could plug that into my ad-hoc benchmark code I pointed out earlier. That's really quick.

Skylion007 · 2022-05-26T05:01:33Z

The correct question: "Is the compiler forced or only permitted to use the move ctor here?" I don't know all the rules that determine that, but my guess is it is actually forced to use the move ctor here. So that's great.

Yes, it's forced to use the move ctor.
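The overload selection behind that answer can be checked in isolation. The sketch below is self-contained and uses toy `handle`/`object` classes, a `FakePyObject`, a global counter, and an `owned` helper, all hypothetical stand-ins for illustration. Only the two `ensure_object` overloads mirror the snippet quoted earlier in this thread. It counts simulated INCREFs when assigning into a cache from an rvalue object, an lvalue object, and a plain handle.

```cpp
#include <cassert>
#include <utility>

struct FakePyObject {
    long refcnt = 1;
};

static int g_increfs = 0; // number of simulated Py_INCREF calls

// Toy non-owning handle (NOT pybind11's class).
class handle {
public:
    handle() = default;
    explicit handle(FakePyObject *p) : m_ptr(p) {}
    void inc_ref() const { if (m_ptr) { ++m_ptr->refcnt; ++g_increfs; } }
    void dec_ref() const { if (m_ptr) { --m_ptr->refcnt; } }

protected:
    FakePyObject *m_ptr = nullptr;
};

// Toy owning object (NOT pybind11's class).
class object : public handle {
public:
    struct borrowed_t {};
    struct stolen_t {};
    object() = default;
    object(handle h, borrowed_t) : handle(h) { inc_ref(); }
    object(handle h, stolen_t) : handle(h) {}
    object(const object &o) : handle(o) { inc_ref(); }
    object(object &&o) noexcept : handle(o) { o.m_ptr = nullptr; }
    object &operator=(object &&o) noexcept {
        if (this != &o) {
            handle old(*this); // remember the reference we currently own
            m_ptr = o.m_ptr;   // take over the incoming reference
            o.m_ptr = nullptr;
            old.dec_ref();     // release the previous one
        }
        return *this;
    }
    ~object() { dec_ref(); }
};

// Same shape as the two overloads quoted in the thread.
static object ensure_object(object &&o) { return std::move(o); }
static object ensure_object(handle h) { return object(h, object::borrowed_t{}); }

// Hypothetical helper: wrap a FakePyObject, adopting its initial reference.
static object owned(FakePyObject &o) { return object(handle(&o), object::stolen_t{}); }

int increfs_assigning_rvalue() {
    FakePyObject value;
    object cache;
    g_increfs = 0;
    cache = ensure_object(owned(value)); // binds to ensure_object(object &&)
    return g_increfs;
}

int increfs_assigning_lvalue() {
    FakePyObject value;
    object named = owned(value);
    object cache;
    g_increfs = 0;
    cache = ensure_object(named); // lvalue cannot bind to object &&: handle overload
    return g_increfs;
}

int increfs_assigning_handle() {
    FakePyObject value;
    object cache;
    g_increfs = 0;
    cache = ensure_object(handle(&value)); // plain handle: must borrow
    return g_increfs;
}
```

In this model only the rvalue case avoids the INCREF: the named object and the plain handle both go through the borrowing overload, because the caller still needs its own reference afterwards.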

Skylion007 · 2022-05-26T15:24:09Z

Could you suggest a code snippet for a best-case micro benchmark? Just the part to loop over, I could plug that into my ad-hoc benchmark code I pointed out earlier.

Now, I was thinking I could change tuple, list, and other classes to use a deduced template arg with perfect forwarding, that would reduce the code duplication a bit, but make it a bit more annoying to read / debug. Thoughts @rwgk? That way, this code would only add two additional overloads.

Something like this for the benchmark @rwgk. Here, I think you can see why it matters even if the performance change is relatively minor: it's such a hot code path.

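    auto tup = py::tuple(100000);
    auto l = py::list();
    auto obj = py::object();
    for (size_t i = 0; i < tup.size(); i++) {
        // rvalue keys on both sides exercise the new object&& overloads
        tup[py::int_(i)] = py::int_(i);
        // sketch only: obj must hold a real Python object, and attr names must be str
        obj.attr(py::int_(i)) = py::int_(i);
        // ... likewise for dict, list, etc.
    }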

include/pybind11/pytypes.h

rwgk · 2022-05-27T17:35:44Z

That's exactly why I want to get rid of the incref / decrefs, more than the optimization use case. Reduce the number of incs and decs to the global counter.

That is a very slippery slope:

I'd need to check what the documentation says at the minute, but I hope it's either nothing or "For any operations involving pytype.h types you must hold the GIL."
If someone asks, we could say "There are a few operations that do not actually require the GIL, but we do not recommend exploiting that."

Because a year later someone else looks at their code, thinking "sure they are holding the GIL here", and adds an innocent-looking line that does require the GIL. Two weeks later we have an emergency-level production bug. Better not to play with fire.

In regards to the optimization being undone, this is also a pretty stable, rather untouched part of the code base (the methods I am adding overloads to have been untouched since 2016).

It's like running a red light. You do it once, twice, ten times, eventually you need a new car. And a new arm. — I'd want to make exceptions only for really tiny things that are OK if we lose them, or if it is unreasonably onerous. I don't think it is here. On the contrary, establishing a way to test such optimization could pay off for other things as well.

pybind11.h triggers those code paths in several places in the py::enum_base and py::class_ classes.

Yesterday or the day before, I inserted pybind11_fail() in operator=, ran all unit tests, and there was only exactly one subtest that failed: test_pytypes.py test_accessors() d = m.accessor_assignment(). But that's the only code path for which I did that.

We have global performance monitoring of some sorts looking at production jobs. That might be one way to pin-point high-value optimization targets (assuming our usage is representative for the world at large; probably). I never use those, though, would need to learn.

Skylion007 · 2022-05-27T17:51:47Z

@rwgk I've also gone ahead and added some additional unit tests to address your concerns. This ensures that regardless of other refactoring, we should have coverage of the new overload path.

Skylion007 · 2022-05-27T18:01:49Z

Yesterday or the day before, I inserted pybind11_fail() in operator=, ran all unit tests, and there was only exactly one subtest that failed: test_pytypes.py test_accessors() d = m.accessor_assignment(). But that's the only code path for which I did that.

When testing, I inserted pybind11_fail for object_api<D>::operator[](object &&) and object_api<D>::attr(object &&) and I wasn't even able to import any pybind11 modules due to that codepath being used so often in pybind11. The operator= bit is definitely more of a niche improvement.

rwgk · 2022-05-27T18:09:01Z

(I'll be out for a few hours. I'll catch up when I'm back.)

rwgk · 2022-05-27T23:15:07Z

The new tests look great!

I'm trying thread_local for the handle::inc_ref_counter() (d6dda2d). If that works reliably on just some platforms, I believe that's all we need for a simple and rock-solid test that ensures the moves are working and will not get lost accidentally.

I don't think we need to drag the benchmark code with us, even though it's pretty simple, too. But it's easy enough to pull in as needed to answer specific questions.
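The counter-based test idea above can be sketched without pybind11. In the block below, `inc_ref_counter` only borrows its name from this thread. Its body, the toy `handle`/`object` classes, `FakePyObject`, `owned`, and `inc_refs_during` are assumptions made for illustration. The counter is `thread_local`, so each thread sees only its own INCREFs, and a test compares the counter before and after the operation under test.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>

// Per-thread running total of simulated INCREFs. Passing `add` bumps the
// total; calling with no argument just reads it. (Assumed shape, see lead-in.)
static std::size_t inc_ref_counter(std::size_t add = 0) {
    thread_local std::size_t counter = 0;
    counter += add;
    return counter;
}

struct FakePyObject {
    long refcnt = 1;
};

// Toy non-owning handle (NOT pybind11's class).
class handle {
public:
    handle() = default;
    explicit handle(FakePyObject *p) : m_ptr(p) {}
    void inc_ref() const { if (m_ptr) { ++m_ptr->refcnt; inc_ref_counter(1); } }
    void dec_ref() const { if (m_ptr) { --m_ptr->refcnt; } }

protected:
    FakePyObject *m_ptr = nullptr;
};

// Toy owning object (NOT pybind11's class).
class object : public handle {
public:
    struct stolen_t {};
    object(handle h, stolen_t) : handle(h) {}
    object(const object &o) : handle(o) { inc_ref(); }
    object(object &&o) noexcept : handle(o) { o.m_ptr = nullptr; }
    ~object() { dec_ref(); }
};

// Hypothetical helper: wrap a FakePyObject, adopting its initial reference.
static object owned(FakePyObject &o) { return object(handle(&o), object::stolen_t{}); }

// Runs `op` and returns how many INCREFs this thread performed meanwhile.
template <typename F>
std::size_t inc_refs_during(F &&op) {
    const std::size_t before = inc_ref_counter();
    std::forward<F>(op)();
    return inc_ref_counter() - before;
}

std::size_t inc_refs_for_copy() {
    FakePyObject value;
    object src = owned(value);
    return inc_refs_during([&] { object dst(src); (void) dst; });
}

std::size_t inc_refs_for_move() {
    FakePyObject value;
    object src = owned(value);
    return inc_refs_during([&] { object dst(std::move(src)); (void) dst; });
}
```

A regression test built this way would fail as soon as a move silently turned back into a copy, because the measured delta would change from 0 to 1.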

rwgk · 2022-05-28T05:57:32Z

thread_local works with all 16 debug builds that we currently have (below; determined by intentionally breaking #3977).

15 Windows debug builds.
Only 1 non-Windows: deadsnakes Valgrind

Not a great sampling, but plenty of coverage for tests using handle::inc_ref_counter(). Even just one would give us the assurance that the new moves will not regress.

I'll fix up #3977 again.

From https://github.com/pybind/pybind11/actions/runs/2399290506, logs_20303.zip:

1______3.6_____MSVC_2019_____x86.txt
1______3.8_____MSVC_2019__Debug______x86_-DCMAKE_CXX_STANDARD=17.txt
1______3.9-dbg__deadsnakes______Valgrind_____x64.txt
1______3.9_____MSVC_2022_C++20_____x64.txt
2______3.7_____MSVC_2019_____x86.txt
2______3.9_____MSVC_2019__Debug______x86_-DCMAKE_CXX_STANDARD=20.txt
3______3.8_____MSVC_2019_____x86_-DCMAKE_CXX_STANDARD=17.txt
4______3.9_____MSVC_2019_____x86_-DCMAKE_CXX_STANDARD=20.txt
7______3.6_____windows-2022_____x64.txt
8______3.9_____windows-2022_____x64.txt
9______3.10_____windows-2022_____x64.txt
10______pypy-3.7_____windows-2022_____x64.txt
11______pypy-3.8_____windows-2022_____x64.txt
12______pypy-3.9_____windows-2022_____x64.txt
19______3.6_____windows-2019_____x64_-DPYBIND11_FINDPYTHON=ON.txt
20______3.9_____windows-2019_____x64.txt

rwgk · 2022-05-28T07:12:35Z

#3977 is now almost what I believe we need here. The perf test is gone. The inc_refs code is in test_pytypes. It covers list and tuple, but sequence and attr are still missing. It's a pretty small test, except for the really clunky repetitive added code in test_pytypes.cpp. But I'm too sleepy to complete it right now.

rwgk · 2022-05-28T23:23:12Z

I'm done with #3977, see the PR description.

Could you please review the new unit tests? I know they are fully covering the behavior changes, but if you can think of improvements, could you please apply them directly to 3977? After you're done, I will go through again top-to-bottom to verify that the behavior change coverage is still complete.

When we're both happy with 3977, could you please transfer the unit tests here, without the PYBIND11_PR3970 #ifdefs?

rwgk · 2022-05-28T23:25:55Z

include/pybind11/pytypes.h

@@ -2057,6 +2073,10 @@ iterator object_api<D>::end() const {
    return iterator::sentinel();
 }
 template <typename D>
+item_accessor object_api<D>::operator[](object &&key) const {


The declaration appears after the overload for handle. Could you please put them in the same order here?

rwgk · 2022-05-28T23:27:35Z

include/pybind11/pytypes.h

@@ -2065,6 +2085,10 @@ item_accessor object_api<D>::operator[](const char *key) const {
    return {derived(), pybind11::str(key)};
 }
 template <typename D>
+obj_attr_accessor object_api<D>::attr(object &&key) const {


Please keep the order of declarations here, too.

…ion007/attr-key-reference-stealing

* Add test_perf_accessors (to be merged into test_pytypes).
* Python < 3.8 f-string compatibility
* Use thread_local in inc_ref_counter()
* Intentional breakage, brute-force way to quickly find out how many platforms reach the PYBIND11_HANDLE_REF_DEBUG code, with and without threads.
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* Remove Intentional breakage
* Drop perf test, move inc_refs tests to test_pytypes
* Fold in PR #3970 with `#ifdef`s
* Complete test coverage for all newly added code.
* Condense new unit tests via a simple local helper macro.
* Remove PYBIND11_PR3970 define. See #3977 (comment)
* Move static keyword first (fixes silly oversight).

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

rwgk

I think now you just have to git rebase master (or merge), adjust the inc_refs line in test_pytypes.py, and merge.

…ion007/attr-key-reference-stealing

rwgk · 2022-05-31T20:16:45Z

I went ahead and modified the PR description, to be explicit about the reference counting overhead. Please correct in case I didn't get it right.

Skylion007 · 2022-05-31T20:26:04Z

@rwgk Do I need to add a special define to the CI? It doesn't seem to be running the incref tests.

rwgk · 2022-05-31T20:29:18Z

@rwgk Do I need to add a special define to the CI? It doesn't seem to be running the incref tests.

Nope, I expect to see 16 failures again. The current CI run is effectively repeating this small experiment: #3970 (comment)

Skylion007 added 4 commits May 24, 2022 12:19

Add object rvalue overload for accessors. Enables reference stealing

79f35a9

Fix comments

df69ada

Fix more comment typos

7df50dc

Fix bug

9a14da2

Skylion007 requested review from rwgk and henryiii May 24, 2022 16:43

henryiii approved these changes May 24, 2022

Skylion007 changed the title ~~(perf): Add object rvalue overload for accessors. Enables reference stealing~~ (=perf: Add object rvalue overload for accessors. Enables reference stealing May 25, 2022

Skylion007 changed the title ~~(=perf: Add object rvalue overload for accessors. Enables reference stealing~~ perf: Add object rvalue overload for accessors. Enables reference stealing May 25, 2022

Skylion007 added 4 commits May 25, 2022 12:15

Merge branch 'master' into skylion007/attr-key-reference-stealing

1ee3e17

reorder declarations for clarity

653eed3

fix another perf bug

8955551

should be static

65d511a

Skylion007 commented May 26, 2022

future proof operator overloads

382515d

add object attr tests

29250ab

Optimize STL map caster and cleanup enum

33fb072

rwgk pushed a commit to rwgk/pybind11 that referenced this pull request May 28, 2022

Fold in PR pybind#3970 with #ifdefs

9758f6e

rwgk mentioned this pull request May 28, 2022

addl unit tests for PR #3970 #3977

Merged

rwgk reviewed May 28, 2022

Skylion007 added 2 commits May 31, 2022 12:55

Reorder to match declarations

6de8287

Merge branch 'master' of https://github.com/pybind/pybind11 into skyl…

3a444cd

…ion007/attr-key-reference-stealing

rwgk pushed a commit to rwgk/pybind11 that referenced this pull request May 31, 2022

Fold in PR pybind#3970 with #ifdefs

f1752ff

rwgk approved these changes May 31, 2022

Merge branch 'master' of https://github.com/pybind/pybind11 into skyl…

a82bfcf

…ion007/attr-key-reference-stealing

Skylion007 added 4 commits May 31, 2022 16:31

adjust increfs

3c3d346

Remove comment

fbe2404

revert value change

1603ac2

add missing move

75bf0bd

Skylion007 merged commit 58802de into pybind:master Jun 1, 2022

github-actions bot added the needs changelog Possibly needs a changelog entry label Jun 1, 2022

henryiii removed the needs changelog Possibly needs a changelog entry label Jul 7, 2022

rwgk mentioned this pull request Feb 10, 2023

FWD pybind11 google/pybind11clif#3970

Closed
