Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[fix] [python client] Better Python garbage collection management for C++-owned objects #16535

Merged
merged 20 commits into from
Sep 22, 2022

Conversation

zbentley
Copy link
Contributor

@zbentley zbentley commented Jul 12, 2022

Fixes #16527

Motivation

Copied from issue discussion:

The problem is the GIL. When Python object reference counts are manipulated from compiled code, those manipulations are not atomic or protected by the GIL in any way. Incrementing a refcount is often coincidentally safe to do without the GIL, since the data structures in the Python interpreter that are altered by a refcount-bump are few and not terribly shared. However, decrementing a refcount without the GIL is extremely dangerous; the act of decrementing a refcount can trigger object destruction, which can then trigger more object destruction, and so on: decrementing a refcount triggers an arbitrary number of user functions (destructors) to run in Python, and can trigger wide-ranging changes (including system calls, memory allocation/deallocation--basically anything) across the interpreter's internal state.

Running such operations in true multi-threaded parallel in Python is basically guaranteed to break things. In most cases (I'm guessing here, as I don't know Boost/C++ well), I think the attempt to clean up the reference either blocks or fails in such a way that the C++ runtime won't properly clean up an object, preventing thread reaping from running internally. In some cases, the racy python GC operations overlap with shared interpreter data structures and cause segfaults. In rare cases, "impossible" (i.e. random objects changing type) errors are raised in Python itself, though you may have to run the above snippet for a long time to see one of those.

Such GIL-unprotected refcount manipulation occurs in the Pulsar client here, here, here, and here, though the first and third may be safe from this condition by dint of the fact that they're only invoked directly from calling Python code which already has the GIL.

Modifications

  • Done protected logger handle refcount mutations with the GIL, resolving a segfault risk.
  • Done resolve thread leak.
  • Done add tests.
  • Done don't acquire GIL for logging messages that won't be emitted due to log level.

Verifying this change

This change added tests and can be verified via the Python client test suite: python pulsar_test.py.

  • doc-not-needed
    Bugfix; external contracts do not change.

@zbentley
Copy link
Contributor Author

zbentley commented Jul 12, 2022

WIP notes: the segfault is fixed as expected, but threads now leak if the log level's low enough to result in at least one call to the Python logger's _log internal function.

Pull on the thread of "why" and you arrive at an unconditional call to threading.currentThread down in Python's logging machinery. If that method is called inside a Pulsar thread, the thread leaks thereafter; somehow, a reference to it is manipulated such that the thread doesn't go away on the system.

Fortunately, there's a way to turn that off. Unfortunately, that's the logging.logThreads = 1 or = 0 module global. That's a weird and sad Python config API--not only is there no way to modify it in a race-proof way, but boost-python doesn't make it easy to manipulate the value within the client's C++ code, so the usual import("logging").attr("logThreads") = object(0) doesn't seem to work (at least for me, my C++ is basically nonexistent, and my Boost is worse).

The easiest thing I know how to do is write a Python wrapper for the logger that temporarily manipulates that value, which I'l try next.

Edit: this is fixed; issue was due to some kinda silly assumptions boost::python makes when .attr accessing a const py::object.

@github-actions github-actions bot added the doc-not-needed Your PR changes do not impact docs label Jul 12, 2022
@zbentley
Copy link
Contributor Author

WIP update:

  • Fixed thread leak by setting logThreads to False inside the C++ code, which should prevent thread-switching by the Python interpreter. I might be wrong about that, but the worst case behavior is a very occasional thread leak of the sort that occurs 100% of the time now, so this is still an improvement.
  • Added unit test.
  • Reduced unnecessary GIL-acquire operations that happened when log messages were sent at levels that weren't enabled for logging.

@zbentley zbentley changed the title [fix] [c++ and python client] Better Python garbage collection management for C++-owned objects [fix] [python client] Better Python garbage collection management for C++-owned objects Jul 13, 2022
@zbentley
Copy link
Contributor Author

zbentley commented Jul 13, 2022

Assistance requested from a maintainer: the license check is failing, but I didn't add any new files. How can I tell what files it would like me to add a license for?

Resolved

@BewareMyPower BewareMyPower added this to the 2.11.0 milestone Jul 21, 2022
@codelipenghui codelipenghui modified the milestones: 2.11.0, 2.12.0 Jul 26, 2022
@zbentley
Copy link
Contributor Author

Any update on this? To my knowledge it should be mergeable.

@BewareMyPower
Copy link
Contributor

Could you fix this test?

======================================================================
FAIL: test_async_func_with_custom_logger (__main__.CustomLoggingTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "custom_logger_test.py", line 48, in test_async_func_with_custom_logger
    self.assertEqual(value, result)
AssertionError: 'foo' != None

@zbentley
Copy link
Contributor Author

@BewareMyPower fixing that test is proving troublesome. Something happening in the destructor is corrupting the python interpreter state and causing that test to fail. The failure's not a testing issue though; it's a real problem.

I'll keep working on it.

@BewareMyPower
Copy link
Contributor

@zbentley Could you merge latest master to have some CI fixes?

@merlimat @Demogorgon314 @RobertIndie Could anyone take a second look?

@zbentley
Copy link
Contributor Author

@tuteng done

@tisonkun
Copy link
Member

/pulsarbot run-failure-checks

@BewareMyPower BewareMyPower reopened this Sep 21, 2022
@BewareMyPower BewareMyPower merged commit a8b265d into apache:master Sep 22, 2022
@congbobo184
Copy link
Contributor

could you please cherry-pick this PR to branch-2.9? thanks.

BewareMyPower pushed a commit that referenced this pull request Nov 15, 2022
@BewareMyPower BewareMyPower added the cherry-picked/branch-2.9 Archived: 2.9 is end of life label Nov 15, 2022
@BewareMyPower
Copy link
Contributor

@congbobo184 Done.

congbobo184 pushed a commit that referenced this pull request Dec 7, 2022
… C++-owned objects (#16535)

Fixes #16527

(cherry picked from commit a8b265d)
liangyepianzhou added a commit that referenced this pull request Dec 15, 2022
nicoloboschi pushed a commit to datastax/pulsar that referenced this pull request Jan 10, 2023
…nt for C++-owned objects (apache#16535) (apache#18921)

Co-authored-by: Zac Bentley <[email protected]>
(cherry picked from commit 5b67614)
nicoloboschi pushed a commit to datastax/pulsar that referenced this pull request Jan 11, 2023
…nt for C++-owned objects (apache#16535) (apache#18921)

Co-authored-by: Zac Bentley <[email protected]>
(cherry picked from commit 5b67614)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cherry-picked/branch-2.9 Archived: 2.9 is end of life cherry-picked/branch-2.10 doc-not-needed Your PR changes do not impact docs ready-to-test release/2.8.5 release/2.9.4 release/2.10.4 type/bug The PR fixed a bug or issue reported a bug
Projects
None yet
9 participants