
Daemonic threads not killed in some circumstances in python 3.13 #123940

Open
rhpvorderman opened this issue Sep 11, 2024 · 12 comments
Labels
3.13 (bugs and security fixes), 3.14 (new features, bugs and security fixes), deferred-blocker, type-bug (An unexpected behavior, bug, or error)

Comments

@rhpvorderman
Contributor

rhpvorderman commented Sep 11, 2024

Bug report

Bug description:

I got a bug report on python-zlib-ng where, when using the threaded open, the program would hang if an exception occurred and no context manager was used.

I fixed it by changing the threads to be daemonic: pycompression/python-zlib-ng#54.

This fixes the issue on Python 3.9, 3.10, 3.11 and 3.12, but not on 3.13-rc2; the latest main branch is also affected. Unfortunately I could not create a minimal reproducer other than running the test directly. Daemonic threads seem not to work in this particular case, where some BufferedIO streams and queues are involved.

git clone -b issue53 https://github.com/pycompression/python-zlib-ng
# Build CPython (e.g. ./configure && make)
./python -m ensurepip
./python -m pip install pytest pytest-timeout <path to python-zlib-ng repo>
./python -m pytest <path to python-zlib-ng repo>/tests/test_gzip_ng_threaded.py::test_threaded_program_can_exit_on_error

Since there are many code changes between 3.12 and 3.13, the underlying cause is surely in there, but it is hard to pinpoint.

CPython versions tested on:

3.13, CPython main branch

Operating systems tested on:

Linux

Linked PRs

@rhpvorderman rhpvorderman added the type-bug (An unexpected behavior, bug, or error) label Sep 11, 2024
@rhpvorderman rhpvorderman changed the title Daemonic threads not killed in python 3.13 Daemonic threads not killed in some circumstances in python 3.13 Sep 11, 2024
@y5c4l3
Contributor

y5c4l3 commented Sep 12, 2024

Joining daemon threads in the finalizer of _ThreadedGzipReader and _ThreadedGzipWriter (i.e. close()) causes a deadlock in recent versions of Python. I am still investigating this and trying to produce an MRE. The following patch makes the test pass, but I am not sure whether it will break other versions (tested on Python 3.12 and latest).

diff --git a/src/zlib_ng/gzip_ng_threaded.py b/src/zlib_ng/gzip_ng_threaded.py
index 7fa7249..65bd95c 100644
--- a/src/zlib_ng/gzip_ng_threaded.py
+++ b/src/zlib_ng/gzip_ng_threaded.py
@@ -156,7 +156,6 @@ class _ThreadedGzipReader(io.RawIOBase):
         if self._closed:
             return
         self.running = False
-        self.worker.join()
         self.fileobj.close()
         if self.closefd:
             self.raw.close()
@@ -329,7 +328,7 @@ class _ThreadedGzipWriter(io.RawIOBase):
         if self._closed:
             return
         self.flush()
-        self.stop()
+        self.running = False
         if self.exception:
             self.raw.close()
             self._closed = True

Is it possible to work around it by starting the worker on the first call to read() / write() and using non-daemon threads?
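
For illustration, a minimal sketch of that idea (hypothetical names, not the actual python-zlib-ng code): the worker thread is non-daemonic and only started on the first read(), and close() clears a flag so the worker finishes on its own before being joined.

# lazy_reader.py (sketch)
import io
import queue
import threading

class _LazyWorkerReader(io.RawIOBase):
    """Hypothetical reader: non-daemonic worker, lazily started."""

    def __init__(self, raw):
        self.raw = raw
        self.queue = queue.Queue()  # unbounded, so put() never blocks
        self.buffer = b""
        self.eof = False
        self.running = False
        self.worker = None  # no thread is created at construction time

    def _ensure_worker(self):
        if self.worker is None:
            self.running = True
            self.worker = threading.Thread(target=self._fill_queue)
            self.worker.start()

    def _fill_queue(self):
        # Runs until EOF or until close() clears self.running.
        while self.running:
            data = self.raw.read(128 * 1024)
            self.queue.put(data)
            if not data:
                break

    def readable(self):
        return True

    def readinto(self, b):
        self._ensure_worker()
        if not self.buffer and not self.eof:
            self.buffer = self.queue.get()
            if not self.buffer:
                self.eof = True
        n = min(len(self.buffer), len(b))
        b[:n] = self.buffer[:n]
        self.buffer = self.buffer[n:]
        return n

    def close(self):
        if self.closed:
            return
        self.running = False
        if self.worker is not None:
            self.worker.join()  # the non-daemon worker exits by itself
        self.raw.close()
        super().close()

Caveat: if close() is never reached and the underlying read() blocks, a non-daemon worker keeps the interpreter alive at shutdown, which is the trade-off discussed further down in this thread.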

@rhpvorderman
Contributor Author

@y5c4l3 Thanks! I will investigate and report back.

@y5c4l3
Contributor

y5c4l3 commented Sep 14, 2024

Reproducer:

# mwe.py
import io
import threading
from spawner import ModuleSpawnerIO

class SpawnerIO(io.RawIOBase):
    def __init__(self):
        self.task = threading.Thread(target=self._process, daemon=True)
        self.task.start()
    def _process(self):
        while True:
            pass
    def readable(self):
        return True
    def close(self):
        print('finalizing...')
        self.task.join()

# exit as expected, not calling finalizer
# r = io.BufferedReader(SpawnerIO())

# finalized and stuck
r = io.BufferedReader(ModuleSpawnerIO())
# spawner.py
import io
import threading

class ModuleSpawnerIO(io.RawIOBase):
    def __init__(self):
        self.task = threading.Thread(target=self._process, daemon=True)
        self.task.start()
    def _process(self):
        while True:
            pass
    def readable(self):
        return True
    def close(self):
        print('finalizing...')
        self.task.join()

Wrapping ModuleSpawnerIO from another module inside BufferedReader ensures that the finalizer will be called. The spawned thread dies without notifying the main thread of thread_is_exiting:

cpython/Python/ceval_gil.c

Lines 294 to 302 in 1de4613

if (_PyThreadState_MustExit(tstate)) {
    /* bpo-39877: If Py_Finalize() has been called and tstate is not the
       thread which called Py_Finalize(), exit immediately the thread.
       This code path can be reached by a daemon thread after Py_Finalize()
       completes. In this case, tstate is a dangling pointer: points to
       PyThreadState freed memory. */
    PyThread_exit_thread();
}

Simpler reproducer with atexit

import io
import threading
import atexit

def process():
    while True:
        pass

t1 = threading.Thread(target=process, daemon=True)
t1.start()
atexit.register(lambda: t1.join())

@hauntsaninja hauntsaninja added the 3.13 (bugs and security fixes) and 3.14 (new features, bugs and security fixes) labels Sep 15, 2024
@hauntsaninja
Contributor

hauntsaninja commented Sep 15, 2024

Thanks for the repro! Bisects to #114839
cc @mpage in case you have time to take a look :-)

@mpage
Contributor

mpage commented Sep 15, 2024

Thanks for the minimal repro! I can take a look on Monday.

@mpage mpage self-assigned this Sep 15, 2024
@mpage
Contributor

mpage commented Sep 15, 2024

@hauntsaninja - What were you using as the test case for bisection? The simpler reproducer using atexit has the same behavior on 3.12.4 and main. Both hang until interrupted, which is what I would expect to happen.

@hauntsaninja
Contributor

I was using the mwe.py + spawner.py repro (which exits in 3.12.5). I think y5c4l3 edited in the atexit example after I posted.

mpage added a commit to mpage/cpython that referenced this issue Sep 17, 2024
During finalization, daemon threads are forced to exit immediately (without returning
through the call-stack normally) upon acquiring the GIL. Finalizers that run after
this must be able to join the forcefully terminated threads.

The current implementation notified of thread completion before returning
from `thread_run`. This code will never execute if the thread is forced to exit during
finalization. Any code that attempts to join such a thread will block indefinitely.

To fix this, use the old approach of notifying of thread completion when the thread state is
cleared. This happens both when `thread_run` exits normally and when thread states are
destroyed as part of finalization (which happens immediately after forcing daemon threads
to exit, before any python code can run).
mpage added a commit to mpage/cpython that referenced this issue Sep 17, 2024
mpage added a commit to mpage/cpython that referenced this issue Sep 17, 2024
mpage added a commit to mpage/cpython that referenced this issue Sep 17, 2024
@colesbury
Contributor

Matt is working on a fix, but I don't think the approach used in https://github.com/pycompression/python-zlib-ng is a good idea.

In particular:

  • Don't join threads in destructors. That seems likely to cause problems because now you have blocking synchronization potentially at arbitrary places. In general, you can't be sure which thread will call destructors due to GC.
  • Don't rely on daemon threads being joinable at shutdown.

On the second point: both the join and shutdown behavior have changed in recent releases and are likely to continue to change even with Matt's fix. For example, join() now calls the OS join API, so if something keeps the daemon thread alive (like blocking I/O), the join() will block forever in 3.13 (even with Matt's fix). Additionally, it relies on daemon threads being force killed, which started in 3.9 and has been a continual source of problems, especially with interoperability with C++ code and C++ destructors. In other words, I think it's something we will want to move away from.

Finally, slight variants of the reproducer will hard crash in 3.11 and 3.12 (and possibly earlier versions), which makes me think it's not a robust approach.
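
To make that concrete, here is a hedged sketch (hypothetical names, not code from python-zlib-ng or CPython) of a shutdown approach that does not depend on daemon threads being force-killed or joinable at shutdown: the worker exits on its own when it sees a sentinel on its input queue, and the join happens only in an explicit close() reached through a context manager.

# queue_worker.py (sketch)
import queue
import threading

_SENTINEL = object()  # hypothetical stop marker

class _QueueWorker:
    """Sketch: the worker always terminates by itself once asked to stop."""

    def __init__(self):
        self.input_queue = queue.Queue()
        # Non-daemon: the interpreter waits for it at shutdown, but it
        # always exits promptly after the sentinel is queued.
        self.worker = threading.Thread(target=self._run)
        self.worker.start()

    def _run(self):
        while True:
            item = self.input_queue.get()
            if item is _SENTINEL:
                return  # normal, self-initiated exit
            # ... process item ...

    def submit(self, item):
        self.input_queue.put(item)

    def close(self):
        # Deterministic shutdown from a known thread (the caller),
        # not from __del__ or an interpreter-shutdown hook.
        self.input_queue.put(_SENTINEL)
        self.worker.join()

    def __enter__(self):
        return self

    def __exit__(self, *exc):
        self.close()

Usage would be with _QueueWorker() as w: w.submit(data). If close() never runs, the non-daemon worker blocks interpreter exit on input_queue.get(), which is why explicit closing has to be part of the API contract.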

mpage added a commit to mpage/cpython that referenced this issue Sep 18, 2024
mpage added a commit to mpage/cpython that referenced this issue Sep 18, 2024
@rhpvorderman
Contributor Author

Don't join threads in destructors. That seems likely to cause problems because now you have blocking synchronization potentially at arbitrary places. In general, you can't be sure which thread will call destructors due to GC.

There are no custom destructors in python-zlib-ng that join threads. Where did you see this code? In fact, for the classes that cause the issue, there are no custom destructors.

Don't rely on daemon threads being joinable at shutdown [...] not a robust approach.

I really appreciate your explanation. What patterns do you recommend to avoid this sort of behaviour? I have a thread that does the IO. Are there any well-documented patterns to prevent this from blocking?

There should be at least some documentation of this in the release notes if it is going to break currently used code. python-isal uses the same code and is marked as an essential package for the Python ecosystem. I'd like it to work with Python 3.13 when it comes out.

@colesbury
Contributor

There is no custom destructors in python-zlib-ng that join threads. Where did you see this code?

The __del__ for RawIOBase calls close() and you have a thread.join() called from your close() (via stop()).

What patterns do you recommend to avoid this sort of behaviour?

Don't call thread.join() in close() if it's being called from __del__(). Something like:

class _ThreadedGzipReader(io.RawIOBase):
    def stop(self):
        if self.running:
            self.running = False
            if not self._finalizing:
                for worker in self.compression_workers:
                    worker.join()
                self.output_worker.join()

    def __del__(self):
        self._finalizing = True
        super().__del__()
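
For context, a self-contained sketch of the same idea (hypothetical class name, not the actual python-zlib-ng code): io.IOBase.__del__ calls close(), so the flag set in __del__ lets close() distinguish an explicit call, where joining is safe, from one reached during garbage collection or interpreter shutdown, where the join is skipped.

# finalizing_flag.py (sketch)
import io
import threading

class _JoinOnlyWhenExplicit(io.RawIOBase):
    """Minimal sketch of the _finalizing pattern shown above."""

    def __init__(self):
        self._finalizing = False
        self.running = True
        self.worker = threading.Thread(target=self._work, daemon=True)
        self.worker.start()

    def _work(self):
        while self.running:
            pass  # stand-in for real work

    def readable(self):
        return True

    def close(self):
        if self.closed:
            return
        self.running = False
        # Join only when close() was called explicitly; when it is reached
        # via __del__, the daemon worker may already have been forced to
        # exit and joining it could block forever.
        if not self._finalizing:
            self.worker.join()
        super().close()

    def __del__(self):
        self._finalizing = True
        super().__del__()

With this, an explicit close() (or a with-block) still joins the worker, while a close() triggered by finalization returns without blocking.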

@rhpvorderman
Contributor Author

@colesbury Thank you so much for this additional information. Now I understand the problem and can fix it appropriately.

@hauntsaninja
Contributor

People are leaning towards not treating this as blocking for 3.13.0 (x-ref colesbury's message on internal Discord)

colesbury also mentioned #116514 as a reference for hard crashes on 3.11/3.12 (I wasn't able to reproduce #124150 (comment))
