Enabling External Logging can result in a deadlock that prevents jobs from running #4181
While it's fairly contrived, this code change will cause dispatcher workers to fork and then exit immediately. This means that if you throw many simultaneous tasks onto the message bus in short bursts, there will be a lot of process creation and destruction constantly going on, making the race condition described at https://bugs.python.org/issue37429 much more likely to occur:

```
~/dev/awx git diff awx/main/dispatch/

diff --git a/awx/main/dispatch/worker/base.py b/awx/main/dispatch/worker/base.py
index a48ed7d1d3..c18004a59d 100644
--- a/awx/main/dispatch/worker/base.py
+++ b/awx/main/dispatch/worker/base.py
@@ -119,6 +119,7 @@ class BaseWorker(object):
         ppid = os.getppid()
         signal_handler = WorkerSignalHandler()
         while not signal_handler.kill_now:
+            signal_handler.kill_now = True
             # if the parent PID changes, this process has been orphaned
             # via e.g., segfault or sigkill, we should exit too
             if os.getppid() != ppid:
```
The symptoms of this bug range in severity, but they generally mean that at least one process in the dispatcher's worker pool becomes permanently hung and unresponsive, and cannot be recovered without forcibly killing the process. In the worst case scenario, the task which becomes hung is the global task manager; when this is hung, no new jobs can launch at all, and all launched jobs will remain stuck in the pending state. Processes in this state can be discovered with standard process-inspection tools.
Closer inspection with gdb shows the actual Python stack trace for this deadlock:

```
~ yum install gdb
~ debuginfo-install python2-2.7.15-2.fc28.x86_64   # py2
~ debuginfo-install python3-3.6.5-1.fc28.x86_64    # py3
~ gdb -p <pid-of-hung-worker>                      # e.g., attach to a hung worker
(gdb) py-bt                                        # Python-level backtrace via the debuginfo's gdb helpers
```
Further details on the specifics of this bug, and a longer gdb stack trace from a simplified reproduction, can be found in the upstream cpython bug report: https://bugs.python.org/issue37429
Ultimately, I've got some ideas for a way to work around this bug in Python, but long-term I'd like to rethink how we've implemented the external logging functionality - including potentially moving the log shipping out of band into a separate background daemon - so that we can prevent these types of critical language bugs going forward (the bug still appears to be present for me in a py3.7 I compiled this morning).

Until that time, given that this is a race condition bug in Python at fork time, I think our best option is simply to work around it. My current plan is to introduce a change which will cause the dispatcher's main thread to no longer ship external logs when the feature is enabled. This should help us avoid the fork race entirely, because the process we're forking shouldn't ever be entering the critical path for the lock in question.
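For illustration, here is a minimal, hypothetical sketch of what that workaround could look like - not the actual AWX patch; the `"awx"` logger name and the `ExternalLogHandler` class name are assumptions made for the example. The idea is simply that the dispatcher's parent process strips the external log handler from its own logging configuration before it starts forking workers, so none of its emitter threads can be holding that handler's lock at fork time.

```python
# Hypothetical sketch of the proposed workaround, not the actual AWX change:
# remove the external (HTTP/HTTPS) log handler from the dispatcher's parent
# process so its background emitter threads never hold a handler lock while
# the dispatcher is forking new workers.
import logging


def disable_external_logging(logger_name="awx", handler_class_name="ExternalLogHandler"):
    """Strip the external log handler from ``logger_name`` in this process only.

    ``handler_class_name`` is a placeholder for whatever handler class actually
    ships logs to the external aggregator.
    """
    logger = logging.getLogger(logger_name)
    for handler in list(logger.handlers):
        if handler.__class__.__name__ == handler_class_name:
            logger.removeHandler(handler)


# Called once in the dispatcher's main process, before the worker pool forks.
```

Workers forked after this point would either inherit the stripped configuration or re-enable external logging themselves after fork, at which point no competing emitter threads exist yet in the child.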
this is a fairly esoteric change that attempts to work around a bug we've discovered in cpython itself (context: ansible#4181)
thread locals don't care about your feelings (see: ansible#4181)
ISSUE TYPE
Bug Report
ENVIRONMENT
STEPS TO REPRODUCE
AWX's task dispatcher manages a pool of worker processes to execute playbooks in the background when jobs are launched. Much like its predecessor, Celery (http://www.celeryproject.org), it forks new processes on demand as necessary to scale up when multiple jobs are running.
When external logging is enabled and configured to send logs to an HTTP/HTTPS endpoint (e.g., logstash, splunk), every AWX python process maintains a pool of background threads that emit log lines asynchronously to the destination log aggregator.
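For reference, here is a minimal sketch of that "emit from a background thread" pattern using only the Python 3 standard library. The QueueHandler/QueueListener pair is an illustrative stand-in; AWX's actual external handler is its own HTTP/HTTPS implementation, but the threading shape is the same: the application thread only enqueues the record, and a listener-owned background thread performs the slow emit.

```python
# Illustrative sketch of asynchronous log shipping via a background thread,
# using only the standard library (Python 3); the real external handler is a
# custom HTTP/HTTPS emitter, but the pattern is the same.
import logging
import logging.handlers
import queue

log_queue = queue.Queue()

# Application threads see only this cheap, non-blocking handler...
queue_handler = logging.handlers.QueueHandler(log_queue)

# ...while a background thread owned by the listener performs the (slow,
# potentially network-bound) emit on their behalf.
slow_handler = logging.StreamHandler()  # stand-in for an HTTP/HTTPS emitter
listener = logging.handlers.QueueListener(log_queue, slow_handler)
listener.start()

logger = logging.getLogger("external")
logger.addHandler(queue_handler)
logger.setLevel(logging.DEBUG)
logger.debug("this record is emitted from the listener's background thread")

listener.stop()
```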
cpython (2.7 and 3.6) has a race condition on fork which can cause a deadlock when this external logging mechanism is in use:
https://bugs.python.org/issue37429
Due to the nature of this race condition, and the very small critical path of the underlying locking mechanism in cpython, it's very difficult to observe this bug, but in the reproductions I've seen, it happens when:

- external logging is enabled and emitting at DEBUG level, and
- os.fork() occurs at nearly the same moment in time that one of the background threads is beginning to emit a log.
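To make the failure mode concrete, here is a generic, self-contained sketch of this class of fork-vs-logging-thread race. It is an illustration of the mechanism only - not the exact cpython critical section from the bug report, and not AWX code:

```python
# Generic illustration of the fork-vs-logging-thread race (an assumption for
# demonstration, not the precise cpython code path): a background thread is
# almost always inside handler.emit(), holding the handler's lock, while the
# main thread forks repeatedly. Occasionally a child inherits the lock already
# held and hangs on its first log call, mirroring the hung dispatcher worker
# described above.
import logging
import os
import threading

handler = logging.StreamHandler(open(os.devnull, "w"))
logger = logging.getLogger("repro")
logger.addHandler(handler)
logger.setLevel(logging.DEBUG)


def spam_logs():
    # Background emitter thread: acquires and releases the handler lock in a
    # tight loop, keeping the critical path "hot".
    while True:
        logger.debug("background emitter thread is writing a log line")


threading.Thread(target=spam_logs, daemon=True).start()

for i in range(500):
    pid = os.fork()
    if pid == 0:
        # Child: if the fork raced with the emitter thread, this call can block
        # forever on a lock whose owning thread does not exist in this process.
        logger.debug("child %s logging after fork", i)
        os._exit(0)
    os.waitpid(pid, 0)

print("finished without hanging (the race is timing-dependent)")
```

Because the window is tiny, most iterations succeed; the reproduction diff in the comments above widens the window simply by forcing constant fork churn in the dispatcher's worker pool.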