-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
perf: Add object rvalue overload for accessors. Enables reference stealing #3970
perf: Add object rvalue overload for accessors. Enables reference stealing #3970
Conversation
I tend to be deliberately a bit ignorant towards low-level mechanics***, but trying to understand this PR, is the following correct?
Are those DECREF/INCREF what this PR is eliminating? If that's the point, what kind of test could ensure that this optimization is not accidentally undone? Also, did you already run a micro benchmark to quantify the best-case performance gain? — Something temporary and quick & dirty to get a rough idea would seem sufficient. *** I'm generally more focused on high-level aspects & safety than low-level efficiency considerations, trusting that modern compilers will optimize out obvious inefficiencies. I believe cluttering or complicating code to squeeze out a tiny bit of extra efficiency negatively impacts readability & long-term maintainability, and it makes future development work & refactoring more bug-prone. That could result in a significant waste of human time, the most expensive resource we have, just to save a tiny amount of very cheap machine time. |
Yes. I didn't do a speed benchmark per say, I just tried to eliminate as many INCREFs and DECREFs as possible for both performance reasons and to remove potential race conditions if the GIL. |
What test could ensure that this optimization is not accidentally undone? |
Writing the tests for that would be pretty difficult. I would normally use constructor_stats.h, but we would need to override the ctors and dtor of the object class to measure that. Any thoughts on how to do it @rwgk? I think it's more of a style issue TBH. reinterpret_borrow should not be used on objects. Maybe we could add an overload to reinterpret_borrow that throws an exception if it's an pyobject (to prevent the implicit conversion of a handle). There is no reason to use it with objects, but it can be done due to the implicit handle conversion. Thoughts @rwgk? |
@rwgk TLDR: the real evil here is the antipattern of |
@rwgk Also, I wouldn't normally add these extra overloads just for that, but the fact that is such a hot code path (any attr accesses) made me want to optimize it. |
To me it feels like we're driving blindly (no performance measurements) and without guard rails (test to ensure the optimization is not accidentally undone). Re performance measurements: I believe that is only a few minutes worth of effort, to get a rough idea at least; i.e. we're driving blindly unnecessarily. Re tests, letting basic best practices slip needs to be clearly stated in the PR description, with a rationale, to prove (incl. to ourselves) that we carefully thought about it. |
@rwgk Care to do the benchmarking then? We currently don't have any benchmarking framework on master at all currently. |
@rwgk Googling, this is effectively the same as benchmarking incref / decref and I don't expect any performance difference there from the CPython benchmarks. Do not expect any speed significant difference from this PR anyhow. I'll update the PR description with this explanation. |
We don't need one for this (although it would be nice to have), a quick 10 minute thing like this will do: https://github.com/rwgk/rwgk_tbx/blob/main/test_perf_error_already_set.cpp Just post a link to your code and the results here. |
I've been staring at this code for a while:
With
I.e. in Is the compiler forced or only permitted to elide the copy here? Mainly I was looking for a way to test that the INCREF/DECREF cycle is in fact eliminated, but I failed to see one. But what stood out to me, does Generally thinking:
But without either I feel very uneasy about making the code more complex. If none of us can see a way to implement a test, then a rough benchmark would be useful. If and only if that gives us measurements that prove it's worth the trouble, the |
It's explicitly allowed to on C++14, but not requited to elide it. It's explicitly required to elide the copy on C++17. I think it should elide it on C++11. It doesn't have to be an object, it could be a handle, but when we would to handle the inc_ref and dec_ref and the cleanest way to do that is as an object. I also don't see how this makes the code any more complex here. I could reduce the duplication with templates, but that would just make it even harder to read IMO. @rwgk If you have an idea about how to handle the incref and decref's go ahead, but I don't see it being any cleaner than it being an object like it currently is.
I don't see how it makes the code more complex, it just adds a few overloads. |
I looked around some more and now I'm thinking that my "Is the compiler forced or only permitted to elide the copy here?" was actually besides the point. The correct question: "Is the compiler forced or only permitted to use the move ctor here?" I don't know all the rules that determine that, but my guess is it is actually forced to use the move ctor here. So that's great. Could you suggest a code snippet for a best-case micro benchmark? Just the part to loop over, I could plug that into my ad-hoc benchmark code I pointed out earlier. That's really quick. |
Yes, it's forced to use the move ctor. |
Now, I was thinking I could change tuple, list, and other classes to use a deduced template arg with perfect forwarding, that would reduce the code duplication a bit, but make it a bit more annoying to read / debug. Thoughts @rwgk? That way, this code would only add two additional overloads. Something like this for the benchmark @rwgk. Here, I think you can see why even if the performance change is relatively minor it's such a hot code path.
|
That is a very slippery slope:
Because a year later someone else looks at their code, thinking "sure they are holding the GIL here", and adds an innocently looking line that does require the GIL. Two weeks later we have an emergency-level production bug. — Better don't play with fire.
It's like running a red light. You do it once, twice, ten times, eventually you need a new car. And a new arm. — I'd want to make exceptions only for really tiny things that are OK if we lose them, or if it is unreasonably onerous. I don't think it is here. On the contrary, establishing a way to test such optimization could pay off for other things as well.
Yesterday or the day before, I inserted We have global performance monitoring of some sorts looking at production jobs. That might be one way to pin-point high-value optimization targets (assuming our usage is representative for the world at large; probably). I never use those, though, would need to learn. |
@rwgk I've also gone ahead and added some additional unit tests to address your concerns. This ensures that regardless of other refactoring, we should have coverage of the new overload path. |
When testing, I inserted pybind11_fail for |
(I'll be out for a few hours. I'll catch up when I'm back.) |
The new tests look great! I'm trying I don't think we need to drag the benchmark code with us, even though it's pretty simple, too. But it's easy enough to pull in as needed to answer specific questions. |
15 Windows debug builds. Not a great sampling, but plenty of coverage for tests using handle::inc_ref_counter(). Only just one would give us the assurance that the new moves will not regress. I'll fix up #3977 again. From https://github.com/pybind/pybind11/actions/runs/2399290506, logs_20303.zip:
|
#3977 is now almost what I believe we need here. The perf test is gone. The inc_refs code is in test_pytypes. It covers list and tuple, but sequence and attr are still missing. It's a pretty small test, except for the really clunky repetitive added code in test_pytypes.cpp. But I'm too sleepy to complete it right now. |
I'm done with #3977, see the PR description. Could you please review the new unit tests? I know they are fully covering the behavior changes, but if you can think of improvements, could you please apply them directly to 3977? After you're done, I will go through again top-to-bottom to verify that the behavior change coverage is still complete. When we're both happy with 3977, could you please transfer the unit tests here, without the |
include/pybind11/pytypes.h
Outdated
@@ -2057,6 +2073,10 @@ iterator object_api<D>::end() const { | |||
return iterator::sentinel(); | |||
} | |||
template <typename D> | |||
item_accessor object_api<D>::operator[](object &&key) const { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The declaration appears after the overload for handle
. Could you please put them in the same order here?
include/pybind11/pytypes.h
Outdated
@@ -2065,6 +2085,10 @@ item_accessor object_api<D>::operator[](const char *key) const { | |||
return {derived(), pybind11::str(key)}; | |||
} | |||
template <typename D> | |||
obj_attr_accessor object_api<D>::attr(object &&key) const { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please keep the order of declarations here, too.
…ion007/attr-key-reference-stealing
* Add test_perf_accessors (to be merged into test_pytypes). * Python < 3.8 f-string compatibility * Use thread_local in inc_ref_counter() * Intentional breakage, brute-force way to quickly find out how many platforms reach the PYBIND11_HANDLE_REF_DEBUG code, with and without threads. * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Remove Intentional breakage * Drop perf test, move inc_refs tests to test_pytypes * Fold in PR #3970 with `#ifdef`s * Complete test coverage for all newly added code. * Condense new unit tests via a simple local helper macro. * Remove PYBIND11_PR3970 define. See #3977 (comment) * Move static keyword first (fixes silly oversight). Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think now you just have to git rebase master (or merge), adjust the inc_refs
line in test_pytypes.py, and merge.
…ion007/attr-key-reference-stealing
I went ahead and modified the PR description, to be explicit out the reference counting overhead. Please correct in case I didn't get it right. |
@rwgk Do I need to add a special define to the CI? It doesn't seem to be running the incref tests. |
Nope, I expect to see 16 failures again. The current CI run is effectively repeating this small experiment: #3970 (comment) |
Description
The attr accessors had a fairly inefficient code path. Specifically, every access of that use a py::object to access it (like a py::str, py::int_, etc...), was converted to a handle and then copied back into an object, causing unnecessary reference count operations (INCREF/DECREF). We can simplify this significantly for the common case of accessing an attr using an rvalue. We just need to add an additional
object&&
specialization and using the corresponding move ctor. This doesn't change directly observable behavior, it's just a simple performance optimization eliminating unnecessary reference count operations by using the object's move ctor.Testing has observed up to 30% speed up in code with heavy attr accesses. Any code that accesses key or values of accessor with rvalues should benefit.
Suggested changelog entry:
* Added an accessor overload of ``(object &&key)`` to reference steal the object when using python types as keys. This prevents unnecessary reference count overhead for attr, dictionary, tuple, and sequence look ups. Added additional regression tests. * Fixed a performance bug the caused accessor assignments to potentially perform unnecessary copies.