Enabling External Logging can result in a deadlock that prevents jobs from running #4181
While it's fairly contrived, this code change will cause dispatcher workers to fork and then exit immediately. This means that if you throw many simultaneous tasks onto the message bus in short bursts, there will be a lot of process creation and destruction constantly going on, making the race condition described at https://bugs.python.org/issue37429 much more likely to occur:

```
~/dev/awx git diff awx/main/dispatch/

diff --git a/awx/main/dispatch/worker/base.py b/awx/main/dispatch/worker/base.py
index a48ed7d1d3..c18004a59d 100644
--- a/awx/main/dispatch/worker/base.py
+++ b/awx/main/dispatch/worker/base.py
@@ -119,6 +119,7 @@ class BaseWorker(object):
         ppid = os.getppid()
         signal_handler = WorkerSignalHandler()
         while not signal_handler.kill_now:
+            signal_handler.kill_now = True
             # if the parent PID changes, this process has been orphaned
             # via e.g., segfault or sigkill, we should exit too
             if os.getppid() != ppid:
```
The symptoms of this bug range in severity, but they generally mean that at least one process in the dispatcher's worker pool becomes permanently hung and unresponsive, and cannot be recovered without forcibly killing the process. In the worst case scenario, the task which becomes hung is the global task manager; when this is hung, no new jobs can launch at all, and all launched jobs will remain stuck in the pending state. Processes in this state can be discovered with standard process-inspection tools.
Closer inspection with gdb shows the actual Python stack trace for this deadlock:

```
~ yum install gdb
~ debuginfo-install python2-2.7.15-2.fc28.x86_64   # py2
~ debuginfo-install python3-3.6.5-1.fc28.x86_64    # py3
~ gdb -p <pid-of-hung-worker>                      # e.g., attach to a hung worker
(gdb) py-bt                                        # Python-level backtrace via the debuginfo's gdb helpers
```
Further details on the specifics of this bug, and a longer gdb stack trace from a simplified reproduction, can be found in the upstream cpython bug report: https://bugs.python.org/issue37429
Ultimately, I've got some ideas for a way to work around this bug in Python, but long-term I'd like to rethink how we've implemented the external logging functionality - including potentially moving the log shipping out of band into a separate background daemon - so that we can prevent these types of critical language bugs going forward (the bug still appears to be present for me in a py3.7 I compiled this morning).

Until that time, given that this is a race condition bug in Python at fork time, I think our best option is simply to work around it. My current plan is to introduce a change which will cause the dispatcher's main thread to no longer ship external logs when the feature is enabled. This should help us avoid the fork race entirely, because the process we're forking shouldn't ever be entering the critical path for the lock in question.
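For illustration, here is a minimal, hypothetical sketch of what that workaround could look like - not the actual AWX patch; the `"awx"` logger name and the `ExternalLogHandler` class name are assumptions made for the example. The idea is simply that the dispatcher's parent process strips the external log handler from its own logging configuration before it starts forking workers, so none of its emitter threads can be holding that handler's lock at fork time.

```python
# Hypothetical sketch of the proposed workaround, not the actual AWX change:
# remove the external (HTTP/HTTPS) log handler from the dispatcher's parent
# process so its background emitter threads never hold a handler lock while
# the dispatcher is forking new workers.
import logging


def disable_external_logging(logger_name="awx", handler_class_name="ExternalLogHandler"):
    """Strip the external log handler from ``logger_name`` in this process only.

    ``handler_class_name`` is a placeholder for whatever handler class actually
    ships logs to the external aggregator.
    """
    logger = logging.getLogger(logger_name)
    for handler in list(logger.handlers):
        if handler.__class__.__name__ == handler_class_name:
            logger.removeHandler(handler)


# Called once in the dispatcher's main process, before the worker pool forks.
```

Workers forked after this point would either inherit the stripped configuration or re-enable external logging themselves after fork, at which point no competing emitter threads exist yet in the child.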
this is a fairly esoteric change that attempts to work around a bug we've discovered in cpython itself (context: ansible#4181)
thread locals don't care about your feelings (see: ansible#4181)
ISSUE TYPE
Bug Report
ENVIRONMENT
STEPS TO REPRODUCE
AWX's task dispatcher manages a pool of worker processes to execute playbooks in the background when jobs are launched. Much like its predecessor, Celery (http://www.celeryproject.org), it forks new processes on demand as necessary to scale up when multiple jobs are running.
When external logging is enabled and configured to send logs to an HTTP/HTTPS endpoint (e.g., logstash, splunk), every AWX python process maintains a pool of background threads that emit log lines asynchronously to the destination log aggregator.
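For reference, here is a minimal sketch of that "emit from a background thread" pattern using only the Python 3 standard library. The QueueHandler/QueueListener pair is an illustrative stand-in; AWX's actual external handler is its own HTTP/HTTPS implementation, but the threading shape is the same: the application thread only enqueues the record, and a listener-owned background thread performs the slow emit.

```python
# Illustrative sketch of asynchronous log shipping via a background thread,
# using only the standard library (Python 3); the real external handler is a
# custom HTTP/HTTPS emitter, but the pattern is the same.
import logging
import logging.handlers
import queue

log_queue = queue.Queue()

# Application threads see only this cheap, non-blocking handler...
queue_handler = logging.handlers.QueueHandler(log_queue)

# ...while a background thread owned by the listener performs the (slow,
# potentially network-bound) emit on their behalf.
slow_handler = logging.StreamHandler()  # stand-in for an HTTP/HTTPS emitter
listener = logging.handlers.QueueListener(log_queue, slow_handler)
listener.start()

logger = logging.getLogger("external")
logger.addHandler(queue_handler)
logger.setLevel(logging.DEBUG)
logger.debug("this record is emitted from the listener's background thread")

listener.stop()
```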
cpython (2.7 and 3.6) has a race condition on fork which can cause a deadlock when this external logging mechanism is in use:
https://bugs.python.org/issue37429
Due to the nature of this race condition, and the very small critical path of the underlying locking mechanism in cpython, it's very difficult to observe this bug, but in the reproductions I've seen, it happens when:

- external logging is enabled and emitting at DEBUG level, and
- os.fork() occurs at nearly the same moment in time that one of the background threads is beginning to emit a log.
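To make the failure mode concrete, here is a generic, self-contained sketch of this class of fork-vs-logging-thread race. It is an illustration of the mechanism only - not the exact cpython critical section from the bug report, and not AWX code:

```python
# Generic illustration of the fork-vs-logging-thread race (an assumption for
# demonstration, not the precise cpython code path): a background thread is
# almost always inside handler.emit(), holding the handler's lock, while the
# main thread forks repeatedly. Occasionally a child inherits the lock already
# held and hangs on its first log call, mirroring the hung dispatcher worker
# described above.
import logging
import os
import threading

handler = logging.StreamHandler(open(os.devnull, "w"))
logger = logging.getLogger("repro")
logger.addHandler(handler)
logger.setLevel(logging.DEBUG)


def spam_logs():
    # Background emitter thread: acquires and releases the handler lock in a
    # tight loop, keeping the critical path "hot".
    while True:
        logger.debug("background emitter thread is writing a log line")


threading.Thread(target=spam_logs, daemon=True).start()

for i in range(500):
    pid = os.fork()
    if pid == 0:
        # Child: if the fork raced with the emitter thread, this call can block
        # forever on a lock whose owning thread does not exist in this process.
        logger.debug("child %s logging after fork", i)
        os._exit(0)
    os.waitpid(pid, 0)

print("finished without hanging (the race is timing-dependent)")
```

Because the window is tiny, most iterations succeed; the reproduction diff in the comments above widens the window simply by forcing constant fork churn in the dispatcher's worker pool.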