[fix] [python client] Better Python garbage collection management for C++-owned objects #16535
WIP notes: the segfault is fixed as expected, but threads now leak if the log level is low enough to result in at least one call to the Python logger's […]. Pull on the thread of "why" and you arrive at an unconditional call to […]. Fortunately, there's a way to turn that off. Unfortunately, that's the […]. The easiest thing I know how to do is write a Python wrapper for the logger that temporarily manipulates that value, which I'll try next.
Edit: this is fixed; the issue was due to some kinda silly assumptions
…jects to outlive the Python interpreter
WIP update:
Resolved
Any update on this? To my knowledge it should be mergeable.
Could you fix this test?
@BewareMyPower fixing that test is proving troublesome. Something happening in the destructor is corrupting the Python interpreter state and causing that test to fail. The failure's not a testing issue though; it's a real problem. I'll keep working on it.
@zbentley Could you merge latest master to have some CI fixes? @merlimat @Demogorgon314 @RobertIndie Could anyone take a second look?
@tuteng done
/pulsarbot run-failure-checks
Could you please cherry-pick this PR to branch-2.9? Thanks.
@congbobo184 Done.
…nt for C++-owned objects (#16535) (#18921) Co-authored-by: Zac Bentley
…nt for C++-owned objects (apache#16535) (apache#18921) Co-authored-by: Zac Bentley (cherry picked from commit 5b67614)
Fixes #16527
Motivation
Copied from issue discussion:
The problem is the GIL. When Python object reference counts are manipulated from compiled code on a thread that does not hold the GIL, those manipulations are not atomic and are not protected in any way. Incrementing a refcount is often coincidentally safe to do without the GIL, since the data structures in the Python interpreter that are altered by a refcount bump are few and not widely shared. However, decrementing a refcount without the GIL is extremely dangerous. A decrement can trigger object destruction, which can then trigger more object destruction, and so on. In other words, a single decrement can cause an arbitrary number of user functions (destructors) to run in Python, and can cause wide-ranging changes (system calls, memory allocation and deallocation, basically anything) across the interpreter's internal state.
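To make the cascade concrete, here is a minimal pure-Python sketch, assuming CPython's reference-counting behavior. The `Node` class and `drop_chain` function are hypothetical names used only for illustration; they are not part of the Pulsar client. Dropping the single reference to the head of a chain runs every destructor in the chain.

```python
class Node:
    """Hypothetical object whose destructor records that it ran."""

    def __init__(self, name, log, child=None):
        self.name = name
        self.log = log
        self.child = child

    def __del__(self):
        # Arbitrary user code: this runs as a side effect of a refcount
        # reaching zero, not because anyone called it explicitly.
        self.log.append(self.name)


def drop_chain():
    log = []
    # "a" owns the only reference to "b", which owns the only reference to "c".
    head = Node("a", log, Node("b", log, Node("c", log)))
    # One decrement: "a" is destroyed, which releases "b", which releases "c".
    del head
    return log


print(drop_chain())  # on CPython: ['a', 'b', 'c']
```

In this sketch the whole cascade runs while the GIL is held, so it is safe. The hazard described above is the same cascade starting from a thread that does not hold the GIL.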
Running such operations in true multi-threaded parallel in Python is basically guaranteed to break things. In most cases (I'm guessing here, as I don't know Boost/C++ well), I think the attempt to clean up the reference either blocks or fails in such a way that the C++ runtime won't properly clean up an object, which prevents internal thread reaping from running. In some cases, the racy Python GC operations overlap with shared interpreter data structures and cause segfaults. In rare cases, "impossible" errors (e.g. random objects changing type) are raised in Python itself, though you may have to run the snippet from the issue discussion for a long time to see one of those.
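One reason this bites background threads in particular: in CPython, a destructor runs on whichever thread drops the last reference. The sketch below assumes CPython and uses hypothetical names (`Tracked`, `finalizer_thread`); it uses ordinary Python threads, which do hold the GIL, purely to show where the destructor executes.

```python
import threading


class Tracked:
    """Hypothetical object that records which thread ran its destructor."""

    def __init__(self, log):
        self.log = log

    def __del__(self):
        self.log.append(threading.current_thread().name)


def finalizer_thread():
    log = []
    holder = [Tracked(log)]  # the list holds the only reference

    def worker():
        # pop() removes the list's reference; the result is discarded, so the
        # refcount reaches zero and __del__ runs right here, on this thread.
        holder.pop()

    t = threading.Thread(target=worker, name="worker")
    t.start()
    t.join()
    return log


print(finalizer_thread())  # on CPython: ['worker']
```

A C++ thread that releases the last handle to a Python object is in the same position as `worker`, except that it may not hold the GIL when the destructor starts running.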
Such GIL-unprotected refcount manipulation occurs in the Pulsar client here, here, here, and here, though the first and third may be safe from this condition by dint of the fact that they're only invoked directly from calling Python code which already has the GIL.
Modifications
Verifying this change
This change added tests and can be verified via the Python client test suite:
python pulsar_test.py
doc-not-needed
Bugfix; external contracts do not change.