perf: Add object rvalue overload for accessors. Enables reference stealing #3970

Merged

Skylion007
Collaborator

@Skylion007 Skylion007 commented May 24, 2022

Description

The attr accessors had a fairly inefficient code path. Specifically, every access that used a py::object as the key (like a py::str, py::int_, etc.) converted the key to a handle and then copied it back into an object, causing unnecessary reference count operations (INCREF/DECREF). We can simplify this significantly for the common case of accessing an attr using an rvalue: we just need to add an additional object&& overload and use the corresponding move ctor. This doesn't change directly observable behavior; it's just a simple performance optimization that eliminates unnecessary reference count operations by using the object's move ctor.

Testing has shown up to a 30% speedup in code with heavy attr accesses. Any code that accesses keys or values of an accessor with rvalues should benefit.
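To make the eliminated operations concrete, here is a minimal, self-contained sketch. It does not use pybind11 or CPython: `FakePyObject`, `Handle`, `Object`, `Accessor`, `attr_via_handle`, `attr_via_rvalue`, and `count_ops` are all hypothetical stand-ins that only mimic the borrow/steal semantics described above, and the two global counters stand in for INCREF/DECREF calls.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>

// Toy stand-in for a Python object (assumption: only the refcount matters here).
struct FakePyObject {
    long refcnt = 1;
};

// Global tallies standing in for INCREF / DECREF calls.
static std::size_t g_increfs = 0;
static std::size_t g_decrefs = 0;

// Non-owning pointer wrapper, analogous in spirit to py::handle.
class Handle {
public:
    Handle() = default;
    explicit Handle(FakePyObject *p) : ptr_(p) {}
    void inc_ref() const { if (ptr_) { ++ptr_->refcnt; ++g_increfs; } }
    void dec_ref() const { if (ptr_) { --ptr_->refcnt; ++g_decrefs; } }

protected:
    FakePyObject *ptr_ = nullptr;
};

// Owning wrapper, analogous in spirit to py::object.
class Object : public Handle {
public:
    struct borrowed_t {};
    struct stolen_t {};
    Object(Handle h, borrowed_t) : Handle(h) { inc_ref(); } // takes a new reference
    Object(Handle h, stolen_t) : Handle(h) {}               // adopts an existing one
    Object(const Object &o) : Handle(o) { inc_ref(); }
    Object(Object &&o) noexcept : Handle(o) { o.ptr_ = nullptr; } // steals, no refcount traffic
    ~Object() { dec_ref(); }
};

struct Accessor {
    Object key;
};

// Old path: the key decays to a Handle and is borrowed back into an Object.
Accessor attr_via_handle(Handle key) { return Accessor{Object(key, Object::borrowed_t{})}; }

// New path: an rvalue key is moved straight into the accessor.
Accessor attr_via_rvalue(Object &&key) { return Accessor{std::move(key)}; }

struct RefOps {
    std::size_t increfs;
    std::size_t decrefs;
};

// Builds one accessor from a temporary key and reports the refcount traffic.
RefOps count_ops(bool use_rvalue_overload) {
    FakePyObject raw;
    g_increfs = g_decrefs = 0;
    {
        Accessor acc = use_rvalue_overload
                           ? attr_via_rvalue(Object(Handle(&raw), Object::stolen_t{}))
                           : attr_via_handle(Object(Handle(&raw), Object::stolen_t{}));
        (void) acc;
    } // the accessor and its key are destroyed here
    return RefOps{g_increfs, g_decrefs};
}
```

In this toy model the handle path performs one INCREF and two DECREFs per access, while the rvalue path performs only the final DECREF, so one INCREF/DECREF pair disappears. How much that matters in real code depends on how hot the accessor path is.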

Suggested changelog entry:

* Added an accessor overload of ``(object &&key)`` to reference steal the object when using python types as keys. This prevents unnecessary reference count overhead for attr, dictionary, tuple, and sequence lookups. Added additional regression tests.
* Fixed a performance bug that caused accessor assignments to potentially perform unnecessary copies.

@Skylion007 Skylion007 requested review from rwgk and henryiii May 24, 2022 16:43
@Skylion007 Skylion007 changed the title (perf): Add object rvalue overload for accessors. Enables reference stealing (=perf: Add object rvalue overload for accessors. Enables reference stealing May 25, 2022
@Skylion007 Skylion007 changed the title (=perf: Add object rvalue overload for accessors. Enables reference stealing perf: Add object rvalue overload for accessors. Enables reference stealing May 25, 2022
@rwgk
Collaborator

rwgk commented May 25, 2022

I tend to be deliberately a bit ignorant towards low-level mechanics***, but trying to understand this PR, is the following correct?

  • Both handle and object are just PyObject * underneath.
  • Copying a pointer or moving a pointer does not make a difference in terms of runtime performance.
  • Deleting an object means DECREF, and creating a new one with reinterpret_borrow<object> means INCREF.

Are those DECREF/INCREF what this PR is eliminating?

If that's the point, what kind of test could ensure that this optimization is not accidentally undone?

Also, did you already run a micro benchmark to quantify the best-case performance gain? — Something temporary and quick & dirty to get a rough idea would seem sufficient.


*** I'm generally more focused on high-level aspects & safety than low-level efficiency considerations, trusting that modern compilers will optimize out obvious inefficiencies. I believe cluttering or complicating code to squeeze out a tiny bit of extra efficiency negatively impacts readability & long-term maintainability, and it makes future development work & refactoring more bug-prone. That could result in a significant waste of human time, the most expensive resource we have, just to save a tiny amount of very cheap machine time.

@Skylion007
Collaborator Author

Skylion007 commented May 25, 2022

> Are those DECREF/INCREF what this PR is eliminating?

Yes. I didn't do a speed benchmark per se; I just tried to eliminate as many INCREFs and DECREFs as possible, both for performance reasons and to remove potential race conditions if the GIL is not held.

@rwgk
Collaborator

rwgk commented May 25, 2022

> > Are those DECREF/INCREF what this PR is eliminating?
>
> Yes

What test could ensure that this optimization is not accidentally undone?

@Skylion007
Collaborator Author

Skylion007 commented May 25, 2022

> What test could ensure that this optimization is not accidentally undone?

Writing the tests for that would be pretty difficult. I would normally use constructor_stats.h, but we would need to override the ctors and dtor of the object class to measure that. Any thoughts on how to do it @rwgk? I think it's more of a style issue TBH: reinterpret_borrow should not be used on objects. Maybe we could add an overload to reinterpret_borrow that throws an exception if it's passed a py::object (to prevent the implicit conversion to a handle). There is no reason to use it with objects, but it can be done due to the implicit handle conversion. Thoughts @rwgk?
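As a rough illustration of the guard idea, the sketch below swaps the proposed throwing overload for a compile-time variant: a deleted overload. This is a different technique from the one suggested above, and nothing here is pybind11 API. `Handle`, `Object`, `toy_borrow`, and `can_borrow` are hypothetical names used only to show that overload resolution can reject the object -> handle -> object round trip.

```cpp
#include <cassert>
#include <type_traits>
#include <utility>

// Toy stand-ins (assumptions, not the pybind11 classes).
struct Handle {
    void *ptr = nullptr;
};
struct Object : Handle {};

// Hypothetical borrow helper: accepts a plain Handle.
template <typename T>
T toy_borrow(Handle h) {
    T result;
    result.ptr = h.ptr; // a real implementation would INCREF here
    return result;
}

// Hypothetical guard: an Object argument is an exact match for this overload,
// so it beats the Object -> Handle conversion and the call fails to compile.
template <typename T>
T toy_borrow(const Object &) = delete;

// Detects at compile time whether toy_borrow<Object>(Arg) is callable.
template <typename Arg, typename = void>
struct can_borrow : std::false_type {};

template <typename Arg>
struct can_borrow<Arg, decltype(void(toy_borrow<Object>(std::declval<Arg>())))>
    : std::true_type {};

static_assert(can_borrow<Handle>::value, "borrowing from a Handle stays valid");
static_assert(!can_borrow<Object>::value, "borrowing from an Object is rejected");
```

Whether such a guard would be acceptable for the real `reinterpret_borrow` is exactly the open question in this comment; the sketch only shows that the compiler can be made to flag the pattern.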

@Skylion007
Collaborator Author

@rwgk TLDR: the real evil here is the antipattern of object -> handle -> object, which does INCREFs / DECREFs for no reason. I am not sure of a way to guard against it, though, since handle -> object is a valid use case. I normally agree with you about trusting the compiler to do this stuff for us, but this is an obvious case where it is failing, and the overhead of maintaining these overloads is negligible.

@Skylion007
Collaborator Author

@rwgk Also, I wouldn't normally add these extra overloads just for that, but the fact that this is such a hot code path (any attr access) made me want to optimize it.

@rwgk
Collaborator

rwgk commented May 26, 2022

To me it feels like we're driving blindly (no performance measurements) and without guard rails (test to ensure the optimization is not accidentally undone). Re performance measurements: I believe that is only a few minutes worth of effort, to get a rough idea at least; i.e. we're driving blindly unnecessarily. Re tests, letting basic best practices slip needs to be clearly stated in the PR description, with a rationale, to prove (incl. to ourselves) that we carefully thought about it.

@Skylion007
Collaborator Author

@rwgk Care to do the benchmarking then? We currently don't have any benchmarking framework on master at all.

@Skylion007
Collaborator Author

@rwgk Googling, this is effectively the same as benchmarking incref / decref, and I don't expect any performance difference there from the CPython benchmarks. I do not expect any significant speed difference from this PR anyhow. I'll update the PR description with this explanation.

@rwgk
Collaborator

rwgk commented May 26, 2022

> @rwgk Care to do the benchmarking then? We currently don't have any benchmarking framework on master at all.

We don't need one for this (although it would be nice to have), a quick 10 minute thing like this will do:

https://github.com/rwgk/rwgk_tbx/blob/main/test_perf_error_already_set.cpp
https://github.com/rwgk/rwgk_tbx/blob/main/test_perf_error_already_set.py

Just post a link to your code and the results here.

@rwgk
Collaborator

rwgk commented May 26, 2022

I've been staring at this code for a while:

    template <typename T>
    void operator=(T &&value) & {
        get_cache() = ensure_object(object_or_cast(std::forward<T>(value)));
    }

With

    private:
        static object ensure_object(object &&o) { return std::move(o); }
        static object ensure_object(handle h) { return reinterpret_borrow<object>(h); }

        object &get_cache() const {
            if (!cache) {
                cache = Policy::get(obj, key);
            }
            return cache;
        }
    private:
        handle obj;
        key_type key;
        mutable object cache;

I.e. in operator=(T &&value), on the lhs we have an object & (l-value) and on the right side an object (r-value), IIUC.

Is the compiler forced or only permitted to elide the copy here?

Mainly I was looking for a way to test that the INCREF/DECREF cycle is in fact eliminated, but I failed to see one.

But what stood out to me, does cache have to be object? Could we make it handle? Then we have obvious & full control over when INCREF and DECREF are invoked. AFAICT, if we update the (copy, move) x (ctor, operator=) and dtor accordingly we should be fine. Could that unlock other manual optimizations?

Generally thinking:

  • If we had a test that proves the INCREF/DECREF cycle is in fact eliminated, ok. We don't know how much that helps, but at least we're sure it's gone and won't come back.
  • If we could at least measure a gain, at least one platform, ok.

But without either I feel very uneasy about making the code more complex.

If none of us can see a way to implement a test, then a rough benchmark would be useful. If and only if that gives us measurements that prove it's worth the trouble, the handle approach might be interesting, although with such measurements I'd also be OK with this PR as-is.

@Skylion007
Collaborator Author

Skylion007 commented May 26, 2022

> Is the compiler forced or only permitted to elide the copy here?

It's explicitly allowed, but not required, to elide it in C++14. It's explicitly required to elide the copy in C++17. I think it should elide it in C++11.

It doesn't have to be an object; it could be a handle, but then we would have to handle the inc_ref and dec_ref ourselves, and the cleanest way to do that is as an object. I also don't see how this makes the code any more complex here. I could reduce the duplication with templates, but that would just make it even harder to read IMO. @rwgk If you have an idea about how to handle the increfs and decrefs, go ahead, but I don't see it being any cleaner than it being an object like it currently is.

> But without either I feel very uneasy about making the code more complex.

I don't see how it makes the code more complex, it just adds a few overloads.

@rwgk
Collaborator

rwgk commented May 26, 2022

> It's explicitly allowed, but not required, to elide it in C++14. It's explicitly required to elide the copy in C++17. I think it should elide it in C++11.

I looked around some more and now I'm thinking that my "Is the compiler forced or only permitted to elide the copy here?" was actually beside the point. The correct question: "Is the compiler forced or only permitted to use the move ctor here?" I don't know all the rules that determine that, but my guess is it is actually forced to use the move ctor here. So that's great.

Could you suggest a code snippet for a best-case micro benchmark? Just the part to loop over, I could plug that into my ad-hoc benchmark code I pointed out earlier. That's really quick.

@Skylion007
Collaborator Author

> The correct question: "Is the compiler forced or only permitted to use the move ctor here?" I don't know all the rules that determine that, but my guess is it is actually forced to use the move ctor here. So that's great.

Yes, it's forced to use the move ctor.
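The overload selection behind that answer can be checked with a small standalone sketch. Only the two `ensure_object` signatures mirror the snippet quoted earlier in this thread; `Handle`, `Object`, `Tally`, and the two `assign_from_*` helpers are hypothetical toy types that count constructions instead of touching real reference counts.

```cpp
#include <cassert>
#include <utility>

struct Handle {
    int *ptr = nullptr;
};

// Toy owning wrapper that tallies how it was constructed.
struct Object : Handle {
    static int copies;  // copy-constructions (these would INCREF in a real wrapper)
    static int borrows; // constructions from a Handle (these would INCREF as well)

    Object() = default;
    explicit Object(Handle h) : Handle(h) { ++borrows; }
    Object(const Object &o) : Handle(o) { ++copies; }
    Object(Object &&o) noexcept : Handle(o) { o.ptr = nullptr; }
};
int Object::copies = 0;
int Object::borrows = 0;

// Same shape as the two overloads quoted from the accessor class.
Object ensure_object(Object &&o) { return std::move(o); }
Object ensure_object(Handle h) { return Object(h); }

struct Tally {
    int copies;
    int borrows;
};

// Passing a temporary: Object&& binds directly, so the first overload is chosen.
Tally assign_from_rvalue() {
    Object::copies = Object::borrows = 0;
    Object cache = ensure_object(Object());
    (void) cache;
    return Tally{Object::copies, Object::borrows};
}

// Passing an lvalue: Object&& cannot bind, so the Handle overload is chosen.
Tally assign_from_lvalue() {
    Object::copies = Object::borrows = 0;
    Object source;
    Object cache = ensure_object(source);
    (void) cache;
    return Tally{Object::copies, Object::borrows};
}
```

In this toy model the rvalue call triggers neither a copy nor a borrow, while the lvalue call goes through the `Handle` overload and borrows once. Copy elision does not change either tally, since only moves could be elided on these paths.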

@Skylion007
Collaborator Author

> Could you suggest a code snippet for a best-case micro benchmark? Just the part to loop over, I could plug that into my ad-hoc benchmark code I pointed out earlier.

Now, I was thinking I could change tuple, list, and other classes to use a deduced template arg with perfect forwarding, that would reduce the code duplication a bit, but make it a bit more annoying to read / debug. Thoughts @rwgk? That way, this code would only add two additional overloads.

Something like this for the benchmark @rwgk. Here, I think you can see why even if the performance change is relatively minor it's such a hot code path.

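    // Sketch only: assumes `namespace py = pybind11;` and an initialized interpreter.
    auto tup = py::tuple(100000);
    auto l = py::list();
    // A default-constructed py::object is null and cannot hold attributes,
    // so a SimpleNamespace instance is used here as a stand-in.
    auto obj = py::module_::import("types").attr("SimpleNamespace")();
    for (size_t i = 0; i < tup.size(); i++) {
        tup[py::int_(i)] = py::int_(i); // rvalue key selects operator[](object &&)
        // Attribute names must be strings, so the key is a py::str rvalue.
        obj.attr(py::str(std::to_string(i))) = py::int_(i);
        // ... likewise for dict, list, etc.
    }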

@rwgk
Collaborator

rwgk commented May 27, 2022

> That's exactly why I want to get rid of the incref / decrefs, more than the optimization use case. Reduce the number of incs and decs to the global counter.

That is a very slippery slope:

  • I'd need to check what the documentation says at the minute, but I hope it's either nothing or "For any operations involving pytype.h types you must hold the GIL."
  • If someone asks, we could say "There are a few operations that do not actually require the GIL, but we do not recommend exploiting that."

Because a year later someone else looks at their code, thinking "sure they are holding the GIL here", and adds an innocent-looking line that does require the GIL. Two weeks later we have an emergency-level production bug. Better not to play with fire.

> In regards to the optimization being undone, this is also a pretty stable, rather untouched part of the code base (the methods I am adding overloads to have been untouched since 2016).

It's like running a red light. You do it once, twice, ten times, eventually you need a new car. And a new arm. — I'd want to make exceptions only for really tiny things that are OK if we lose them, or if it is unreasonably onerous. I don't think it is here. On the contrary, establishing a way to test such optimization could pay off for other things as well.

> pybind11.h triggers those code paths in several places in the py::enum_base and py::class_ classes.

Yesterday or the day before, I inserted pybind11_fail() in operator=, ran all unit tests, and there was only exactly one subtest that failed: test_pytypes.py test_accessors() d = m.accessor_assignment(). But that's the only code path for which I did that.

We have global performance monitoring of some sorts looking at production jobs. That might be one way to pin-point high-value optimization targets (assuming our usage is representative for the world at large; probably). I never use those, though, would need to learn.

@Skylion007
Collaborator Author

@rwgk I've also gone ahead and added some additional unit tests to address your concerns. This ensures that regardless of other refactoring, we should have coverage of the new overload path.

@Skylion007
Collaborator Author

> Yesterday or the day before, I inserted pybind11_fail() in operator=, ran all unit tests, and there was only exactly one subtest that failed: test_pytypes.py test_accessors() d = m.accessor_assignment(). But that's the only code path for which I did that.

When testing, I inserted pybind11_fail for object_api<D>::operator[](object &&) and object_api<D>::attr(object &&) and I wasn't even able to import any pybind11 modules due to that codepath being used so often in pybind11. The operator= bit is definitely more of a niche improvement.

@rwgk
Collaborator

rwgk commented May 27, 2022

(I'll be out for a few hours. I'll catch up when I'm back.)

@rwgk
Collaborator

rwgk commented May 27, 2022

The new tests look great!

I'm trying thread_local for the handle::inc_ref_counter() (d6dda2d). If that works reliably on just some platforms, I believe that's all we need for a simple and rock-solid test that ensures the moves are working and will not get lost accidentally.
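The counter-based test idea can be sketched in isolation. The code below is not the pybind11 implementation: `ToyHandle`, `inc_refs_during`, `borrowing_path`, and `moving_path` are hypothetical names, and the sketch only assumes that `inc_ref()` bumps a function-local `thread_local` tally that a test can read before and after the code under test.

```cpp
#include <cassert>
#include <cstddef>

// Toy handle whose inc_ref() bumps a per-thread counter (hypothetical stand-in).
struct ToyHandle {
    // Function-local thread_local: each thread sees its own tally, so
    // reference counting done on other threads cannot disturb a test.
    static std::size_t inc_ref_counter(std::size_t add = 0) {
        thread_local std::size_t counter = 0;
        counter += add;
        return counter;
    }
    void inc_ref() const { inc_ref_counter(1); }
};

// Returns how many inc_ref() calls `fn` made on the calling thread.
template <typename Fn>
std::size_t inc_refs_during(Fn fn) {
    const std::size_t before = ToyHandle::inc_ref_counter();
    fn();
    return ToyHandle::inc_ref_counter() - before;
}

// Stand-ins for a code path before and after an optimization.
void borrowing_path() {
    ToyHandle h;
    h.inc_ref();
}
void moving_path() { /* moves only: no inc_ref() expected */ }
```

A regression test built this way asserts on the delta rather than on absolute counts, so it stays valid no matter how many references were taken earlier in the same thread.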

I don't think we need to drag the benchmark code with us, even though it's pretty simple, too. But it's easy enough to pull in as needed to answer specific questions.

@rwgk
Collaborator

rwgk commented May 28, 2022

thread_local works with all 16 debug builds that we currently have (below; determined by intentionally breaking #3977).

15 Windows debug builds.
Only 1 non-Windows: deadsnakes Valgrind

Not a great sampling, but plenty of coverage for tests using handle::inc_ref_counter(). Even just one would give us the assurance that the new moves will not regress.

I'll fix up #3977 again.

From https://github.com/pybind/pybind11/actions/runs/2399290506, logs_20303.zip:

1______3.6_____MSVC_2019_____x86.txt
1______3.8_____MSVC_2019__Debug______x86_-DCMAKE_CXX_STANDARD=17.txt
1______3.9-dbg__deadsnakes______Valgrind_____x64.txt
1______3.9_____MSVC_2022_C++20_____x64.txt
2______3.7_____MSVC_2019_____x86.txt
2______3.9_____MSVC_2019__Debug______x86_-DCMAKE_CXX_STANDARD=20.txt
3______3.8_____MSVC_2019_____x86_-DCMAKE_CXX_STANDARD=17.txt
4______3.9_____MSVC_2019_____x86_-DCMAKE_CXX_STANDARD=20.txt
7______3.6_____windows-2022_____x64.txt
8______3.9_____windows-2022_____x64.txt
9______3.10_____windows-2022_____x64.txt
10______pypy-3.7_____windows-2022_____x64.txt
11______pypy-3.8_____windows-2022_____x64.txt
12______pypy-3.9_____windows-2022_____x64.txt
19______3.6_____windows-2019_____x64_-DPYBIND11_FINDPYTHON=ON.txt
20______3.9_____windows-2019_____x64.txt

@rwgk
Collaborator

rwgk commented May 28, 2022

#3977 is now almost what I believe we need here. The perf test is gone. The inc_refs code is in test_pytypes. It covers list and tuple, but sequence and attr are still missing. It's a pretty small test, except for the really clunky repetitive added code in test_pytypes.cpp. But I'm too sleepy to complete it right now.

rwgk pushed a commit to rwgk/pybind11 that referenced this pull request May 28, 2022
@rwgk
Collaborator

rwgk commented May 28, 2022

I'm done with #3977, see the PR description.

Could you please review the new unit tests? I know they are fully covering the behavior changes, but if you can think of improvements, could you please apply them directly to 3977? After you're done, I will go through again top-to-bottom to verify that the behavior change coverage is still complete.

When we're both happy with 3977, could you please transfer the unit tests here, without the PYBIND11_PR3970 #ifdefs?

@@ -2057,6 +2073,10 @@ iterator object_api<D>::end() const {
return iterator::sentinel();
}
template <typename D>
item_accessor object_api<D>::operator[](object &&key) const {
Collaborator

The declaration appears after the overload for handle. Could you please put them in the same order here?

@@ -2065,6 +2085,10 @@ item_accessor object_api<D>::operator[](const char *key) const {
return {derived(), pybind11::str(key)};
}
template <typename D>
obj_attr_accessor object_api<D>::attr(object &&key) const {
Collaborator

Please keep the order of declarations here, too.

rwgk pushed a commit to rwgk/pybind11 that referenced this pull request May 31, 2022
rwgk pushed a commit that referenced this pull request May 31, 2022
* Add test_perf_accessors (to be merged into test_pytypes).

* Python < 3.8 f-string compatibility

* Use thread_local in inc_ref_counter()

* Intentional breakage, brute-force way to quickly find out how many platforms reach the PYBIND11_HANDLE_REF_DEBUG code, with and without threads.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Remove Intentional breakage

* Drop perf test, move inc_refs tests to test_pytypes

* Fold in PR #3970 with `#ifdef`s

* Complete test coverage for all newly added code.

* Condense new unit tests via a simple local helper macro.

* Remove PYBIND11_PR3970 define. See #3977 (comment)

* Move static keyword first (fixes silly oversight).

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Collaborator

@rwgk rwgk left a comment

I think now you just have to git rebase master (or merge), adjust the inc_refs line in test_pytypes.py, and merge.

@rwgk
Collaborator

rwgk commented May 31, 2022

I went ahead and modified the PR description, to be explicit about the reference counting overhead. Please correct in case I didn't get it right.

@Skylion007
Collaborator Author

@rwgk Do I need to add a special define to the CI? It doesn't seem to be running the incref tests.

@rwgk
Collaborator

rwgk commented May 31, 2022

> @rwgk Do I need to add a special define to the CI? It doesn't seem to be running the incref tests.

Nope, I expect to see 16 failures again. The current CI run is effectively repeating this small experiment: #3970 (comment)

@Skylion007 Skylion007 merged commit 58802de into pybind:master Jun 1, 2022
@github-actions github-actions bot added the needs changelog Possibly needs a changelog entry label Jun 1, 2022
@henryiii henryiii removed the needs changelog Possibly needs a changelog entry label Jul 7, 2022