Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enabling External Logging can result in a deadlock that prevents jobs from running #4181

Closed
ryanpetrello opened this issue Jun 27, 2019 · 3 comments

Comments

@ryanpetrello
Copy link
Contributor

ryanpetrello commented Jun 27, 2019

ISSUE TYPE
  • Bug Report
ENVIRONMENT
  • AWX version: 4.0.0+
  • AWX install method: any
  • Ansible version: any
STEPS TO REPRODUCE

AWX's task dispatcher manages a pool of worker processes to execute playbooks in the background when jobs are launched. Much like its predecessor, http://www.celeryproject.org, it forks new processes on demand as necessary to scale up when multiple jobs are running.

When external logging is enabled and configured to send logs to an HTTP/HTTPS endpoint (e.g., logstash, splunk), every AWX python process maintains a pool of background threads that emit log lines asynchronously to the destination log aggregator.

cpython (2.7 and 3.6) have a race condition on fork which can cause a deadlock when this external logging mechanism is in use:

https://bugs.python.org/issue37429

Due to nature of this race condition, and the very small critical path of the underlying locking mechanism in cpython, it's very difficult to observe this bug, but in the reproductions I've seen, it happens when:

  1. External logging is enabled, particularly at the DEBUG level.
  2. A number of simultaneous jobs are scheduled at nearly the same time.
  3. The dispatcher scales up by forking several new worker processes.
  4. Critically, this os.fork() occurs at nearly the same moment in time that one of the background threads is beginning to emit a log.
@ryanpetrello
Copy link
Contributor Author

ryanpetrello commented Jun 27, 2019

While it's fairly contrived, this code change will cause dispatcher workers to fork and then exit immediately. This means that if you throw many simultaneous tasks onto the message bus in short bursts, there will be a lot of process creation and destruction constantly going on, making the race condition described at https://bugs.python.org/issue37429 much more likely to occur:

~/dev/awx git diff awx/main/dispatch/
diff --git a/awx/main/dispatch/worker/base.py b/awx/main/dispatch/worker/base.py
index a48ed7d1d3..c18004a59d 100644
--- a/awx/main/dispatch/worker/base.py
+++ b/awx/main/dispatch/worker/base.py
@@ -119,6 +119,7 @@ class BaseWorker(object):
         ppid = os.getppid()
         signal_handler = WorkerSignalHandler()
         while not signal_handler.kill_now:
+            signal_handler.kill_now = True
             # if the parent PID changes, this process has been orphaned
             # via e.g., segfault or sigkill, we should exit too
             if os.getppid() != ppid:

@ryanpetrello
Copy link
Contributor Author

ryanpetrello commented Jun 27, 2019

The symptoms of this bug range in severity, but they generally mean that at least one process in the dispatcher's worker pool becomes permanently hung and unresponsive, and cannot be recovered without forcibly killing the process.

In the worst case scenario, the task which becomes hung is the global task manager; when this is hung, no new jobs can launch at all, and all launched jobs will remain in pending state until services are restarted.

Processes in this state can be discovered using awx-manage run_dispatcher --status. In the reproductions I've seen, the processes in question are always hung on a futex:

~ strace -s 100000 -ffttv -p 9268
strace: Process 9268 attached
11:06:36.426652 futex(0x478f800, FUTEX_WAIT_PRIVATE, 0, NULL

Closer inspection with gdb shows the actual Python stack trace for this deadlock:

~ yum install gdb
~ debuginfo-install python2-2.7.15-2.fc28.x86_64  # py2
~ debuginfo-install python3-3.6.5-1.fc28.x86_64  # py3
#4 Waiting for a lock (e.g. GIL)
#5 Waiting for a lock (e.g. GIL)
#7 Frame 0x7f8521b04bf0, for file /usr/lib64/python2.7/Queue.py, line 118, in put (self=<Queue(unfinished_tasks=2, queue=<collections.deque at remote 0x7f8521b5f1a0>, maxsize=0, all_tasks_done=<_Condition(_Verbose__verbose=False, _Condition__lock=<thread.lock at remote 0x7f8521c03950>, acquire=<built-in method acquire of thread.lock object at remote 0x7f8521c03950>, _Condition__waiters=[], release=<built-in method release of thread.lock object at remote 0x7f8521c03950>) at remote 0x7f8521afb710>, mutex=<thread.lock at remote 0x7f8521c03950>, not_full=<_Condition(_Verbose__verbose=False, _Condition__lock=<thread.lock at remote 0x7f8521c03950>, acquire=<built-in method acquire of thread.lock object at remote 0x7f8521c03950>, _Condition__waiters=[], release=<built-in method release of thread.lock object at remote 0x7f8521c03950>) at remote 0x7f8521afb650>, not_empty=<_Condition(_Verbose__verbose=False, _Condition__lock=<thread.lock at remote 0x7f8521c03950>, acquire=<built-in method acquire of thread.lock object at...(truncated)
    self.not_full.acquire()
#11 Frame 0x41f5d10, for file /var/lib/awx/venv/awx/lib/python2.7/site-packages/concurrent/futures/thread.py, line 129, in submit (self=<VerboseThreadPoolExecutor(_shutdown_lock=<thread.lock at remote 0x7f8521c03970>, _shutdown=False, _max_workers=2, _thread_name_prefix='ThreadPoolExecutor-0', _threads=set([<...>, <...>]), _work_queue=<...>) at remote 0x7f8521afbb10>, fn=<function at remote 0x7f8520a90e60>, args=('POST', u'https://splunkhf.app.vumc.org/services/collector/event'), kwargs={'verify': False, 'json': None, 'data': '{"event": {"cluster_host_id": "ans1001lp.hs.it.vumc.io", "level": "INFO", "tower_uuid": "", "@timestamp": "2019-04-18T14:47:34.069Z", "host": "ans1001lp", "logger_name": "awx.main.tasks", "message": "inventory_update 47961 (running) finished running, producing 7044 events.", "type": "splunk"}}', 'timeout': 5}, f=<Future(_exception=None, _result=None, _condition=<_Condition(_Condition__lock=<_RLock(_Verbose__verbose=False, _RLock__owner=None, _RLock__block=<thread.lock at remote 0x7f8520a4c4...(truncated)
    self._work_queue.put(w)
#16 Frame 0x7f8521b63c20, for file /lib/python2.7/site-packages/awx/main/utils/handlers.py, line 79, in submit (self=<VerboseThreadPoolExecutor(_shutdown_lock=<thread.lock at remote 0x7f8521c03970>, _shutdown=False, _max_workers=2, _thread_name_prefix='ThreadPoolExecutor-0', _threads=set([<...>, <...>]), _work_queue=<...>) at remote 0x7f8521afbb10>, func=<function at remote 0x7f8520a90b18>, args=('POST', u'https://redacted/splunk/http/event/collector'), kwargs={'verify': False, 'json': None, 'data': '{"event": {"cluster_host_id": "ans1001lp.hs.it.vumc.io", "level": "INFO", "tower_uuid": "", "@timestamp": "2019-04-18T14:47:34.069Z", "host": "ans1001lp", "logger_name": "awx.main.tasks", "message": "inventory_update 47961 (running) finished running, producing 7044 events.", "type": "splunk"}}', 'timeout': 5}, _wrapped=<function at remote 0x7f8520a90e60>)
Python Exception <type 'exceptions.IOError'> [Errno 2] No such file or directory: '/lib/python2.7/site-packages/awx/main/utils/handlers.py':
Error occurred in Python command: [Errno 2] No such file or directory: '/lib/python2.7/site-packages/awx/main/utils/handlers.py'```

Further details on the specifics of this bug, and a longer gdb stack example of a simplified reproduction can be found at:

https://bugs.python.org/issue37429

@ryanpetrello
Copy link
Contributor Author

ryanpetrello commented Jun 27, 2019

Ultimately, I've got some ideas for a way to work around this bug in Python, but long-term I'd like to rethink how we've implemented the external logging functionality - including potentially moving the log shipping out of band into a separate background daemon so that we can prevent these types of critical language bugs going forward (the bug still appears present for me in a py3.7 I compiled this morning).

Until that time, given that this is a race condition bug in Python at fork time, I think our best option is simply to work around it. My current plan is to introduce a change which will cause the dispatcher's main thread to no longer ship external logs when the feature is enabled. This should help us avoid the fork race entirely, because the process we're forking shouldn't ever be entering the critical path for the lock in question.

@ryanpetrello ryanpetrello self-assigned this Jun 27, 2019
ryanpetrello added a commit to ryanpetrello/awx that referenced this issue Jun 27, 2019
this is a fairly esoteric change that attempts to work around a bug
we've discovered in cpython itself

context: ansible#4181
ryanpetrello added a commit to ryanpetrello/awx that referenced this issue Jun 27, 2019
this is a fairly esoteric change that attempts to work around a bug
we've discovered in cpython itself

context: ansible#4181
@Spredzy Spredzy closed this as completed Jul 8, 2019
ryanpetrello added a commit to ryanpetrello/awx that referenced this issue Sep 12, 2019
thread locals don't care about your feelings
see: ansible#4181
ryanpetrello added a commit to ryanpetrello/awx that referenced this issue Sep 12, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants