reset internal state on fork to prevent deadlocks in worker threads #139
Conversation
self.create_log_stream = create_log_stream
self.log_group_retention_days = log_group_retention_days
self._init_state()
This will be called with the parent's `__init__`, which creates its lock. Since it's OK to init the state twice, I made it more obvious that it's being called by calling it directly. I can add a comment on `super().__init__` if that seems better (I went with this way in case the stdlib changes, but that seems unlikely).
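The pattern described above can be sketched as follows. This is a minimal illustration, not watchtower's actual code; the class name `ForkSafeHandler` and its attributes are hypothetical:

```python
import logging
import queue


class ForkSafeHandler(logging.Handler):
    """Hypothetical sketch of the initialization pattern discussed above."""

    def __init__(self):
        # logging.Handler.__init__ calls createLock(), so the handler's
        # lock exists before we touch our own state.
        super().__init__()
        # Called explicitly even though the parent's __init__ path could
        # reach it, so the (re)initialization is obvious to readers.
        self._init_state()

    def _init_state(self):
        # Fresh queue and no worker thread: safe to call a second time,
        # e.g. in a forked child where the old worker no longer exists.
        self.queue = queue.Queue()
        self.worker = None
```

Because `_init_state` only replaces the queue and worker reference, calling it twice is harmless, which is what makes the explicit call safe.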
I agree with this change in principle for reinitializing state in …, though reinitializing state in …
Sorry, I got pulled away from this, but it looks like we're actually having issues with this on shutdown of uWSGI, so my attention is back on it.
That would only allow this code to support Python 3.9 and above, as mentioned. In other scenarios it would cause a deadlock. If there is something else you'd like to see in …
I'm not sure I follow the issue exactly. Yes, I agree there may be data loss. However, after forking there will not be a thread listening to the queue, and it will need to be "restarted". This code resets the internal state so that the child can restart its queue and threads.

Alternatively, we could store a weakref, or just check whether the thread is alive and try to restart it if not. The reason I hadn't gone with that approach is that it might send too much data. The only time I really see a scenario for data loss is if the parent forks and then execs itself into something else. Otherwise the thread in the parent process will eventually send the messages that are expected to be sent, and the child should be able to continue as expected. The only other scenario I can see messing things up is if someone else is calling the …
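The "check if the thread is alive and restart it" alternative mentioned above could look roughly like this. It's a sketch under stated assumptions; `Sender`, `_ensure_worker`, and `_drain` are hypothetical names, not watchtower's API:

```python
import queue
import threading


class Sender:
    """Hypothetical sketch: lazily (re)start the worker instead of
    resetting all state on fork."""

    def __init__(self):
        self.queue = queue.Queue()
        self._thread = None

    def _ensure_worker(self):
        # In a forked child the old worker thread does not exist, so
        # is_alive() is False and a fresh thread is started. Note the
        # trade-off discussed above: the child's inherited queue may
        # still contain the parent's messages, which would be resent.
        if self._thread is None or not self._thread.is_alive():
            self._thread = threading.Thread(target=self._drain, daemon=True)
            self._thread.start()

    def _drain(self):
        # Worker loop: consume until a None sentinel arrives.
        while True:
            item = self.queue.get()
            if item is None:
                break

    def send(self, msg):
        self._ensure_worker()
        self.queue.put(msg)
```

This avoids losing queued messages on fork but, as noted above, can deliver duplicates, since both parent and child inherit the same queued items.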
OK, thanks. I'll have to think more about this and what kinds of warnings to add in the docs, but on the face of it this seems correct. The safest thing to do is always to share nothing between threads/processes. That's the gist of what I was trying to get at, and what I'll try to advise in the docs.
I'm opening this PR up to start a discussion on behavior while forking. I believe this may be related to #89 and #31. It looks like there has been recent activity on #31 (comment) and this may possibly need some adjustment.
This change allows subprocesses not to deadlock when calling `os.fork()`, because it resets the logger's internal state on forking (threads don't survive a fork, but `self.queue[s].put` may be locked while forking). It does seem like boto3 sessions could also end up causing issues, but I'm not sure if the thread safety comment is actually an issue for this library.

This is still not "perfect", because the documentation on `os._exit` suggests that it should be called when exiting a fork, which bypasses the logging module's default shutdown routine, including flushing the logs. If the documentation suggests that watchtower should be flushed before exiting a fork, then I believe this change, plus making sure `flush` is called, will ensure that the logs are delivered.

Even without documentation suggesting to flush the logs, this change will at least make sure new threads are created to handle the reset queues. Otherwise forked copies of this library believe the threads are still alive even though they are not. An alternative implementation could check whether the thread for the queue is still alive, but that case would still need to handle what to do with the queued logs, and in the case of forking those logs are duplicates.
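One standard way to hook the reset described above is `os.register_at_fork` (available on POSIX since Python 3.7). The sketch below is illustrative only, not this PR's implementation; the class `State` and its attributes are hypothetical, and real code would register the hook once rather than per instance:

```python
import os
import queue
import threading


class State:
    """Hypothetical sketch: re-create queues and locks in a forked child."""

    def __init__(self):
        self._init_state()
        # After fork, only the forking thread survives; a lock held
        # mid-fork by a now-dead worker would otherwise deadlock the
        # child on the next put(). register_at_fork lets the child
        # rebuild its state before it runs any more code.
        if hasattr(os, "register_at_fork"):
            os.register_at_fork(after_in_child=self._init_state)

    def _init_state(self):
        # Fresh, unheld lock and an empty queue for this process.
        self.queue = queue.Queue()
        self.lock = threading.Lock()
```

Registering a bound method keeps the instance alive for the life of the process, which is one reason a library might prefer a module-level hook over a per-instance one.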