Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Mitogen intermittent hangs on "Connection timed out" target #598

Closed
antigenius0910 opened this issue Jun 29, 2019 · 5 comments
Closed

Mitogen intermittent hangs on "Connection timed out" target #598

antigenius0910 opened this issue Jun 29, 2019 · 5 comments

Comments

@antigenius0910
Copy link

antigenius0910 commented Jun 29, 2019

* Which version of Ansible are you running?
     ansible 2.4.6.0
     config file = /etc/ansible/ansible.cfg
     configured module search path = [u'/home/yen306/.ansible/plugins/modules', u'/usr/share/ansible/plugins/modules']
     ansible python module location = /usr/local/lib/python2.7/dist-packages/ansible
     executable location = /usr/local/bin/ansible
     python version = 2.7.15rc1 (default, Nov 12 2018, 14:31:15) [GCC 7.3.0]

* Is your version of Ansible patched in any way?
     No

* Are you running with any custom modules, or `module_utils` loaded?
     No, /etc/ansible/ansible.cfg:18:#module_utils   = /usr/share/my_module_utils/ 

* Have you tried the latest master version from Git?
     Yes

* Do you have some idea of what the underlying problem may be?
    NO, using "-vvv -e mitogen_ssh_debug_level=3" for log capturing

* Mention your host and target OS and versions
    Linux  4.15.0-1014 #16-Ubuntu SMP Tue Dec 11 11:19:10 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

* Mention your host and target Python versions
    Python 2.7.15rc1

* If reporting any kind of problem with Ansible, please include the Ansible
  version along with output of "ansible-config dump --only-changed".

ANSIBLE_SSH_ARGS(/etc/ansible/ansible.cfg) = -o UserKnownHostsFile=/dev/null
DEFAULT_DEBUG(env: ANSIBLE_DEBUG) = True
DEFAULT_FORKS(/etc/ansible/ansible.cfg) = 10
DEFAULT_STRATEGY(/etc/ansible/ansible.cfg) = mitogen_linear
DEFAULT_STRATEGY_PLUGIN_PATH(/etc/ansible/ansible.cfg) = [u'/home/mitogen-0.2.7/ansible_mitogen/plugins/strategy']
HOST_KEY_CHECKING(/etc/ansible/ansible.cfg) = False
INVENTORY_ENABLED(/etc/ansible/ansible.cfg) = ['host_list', 'virtualbox', 'yaml', 'constructed', 'ini', 'script']

First thanks for the awesome project! I had a 11 times performance improvement!!! after using mitogen. However, I am facing below problem and wondering if you can take a look.

Submit ansible playbook jobs in a for loop for 20 time (96 hosts)

export ANSIBLE_DEBUG=1; for ((n=0;n<20;n++)); do time ansible-playbook -i mzoneinis/mzoneXXX.ini ping.yml -vvv -e mitogen_ssh_debug_level=3; done

and it will intermittently hang on (Observed output)

[mux  60775] 20:27:59.833484 D mitogen: mitogen.core.Stream('unix_client.61010').on_disconnect()
[task 61010] 20:27:59.833479 D mitogen: Waker(Broker(0x7fbb6fef01d0) rfd=41, wfd=42).on_disconnect()
[task 61010] 20:27:59.834034 D mitogen: Router(Broker(0x7fbb6fef01d0)): stats: 0 module requests in 0 ms, 0 sent (0 ms minify time), 0 negative responses. Sent 0.0 kb total, 0.0 kb avg.
 61010 1561840079.83510: done running TaskExecutor() for hostname/TASK: ping [506b4bf5-3974-a649-31ae-000000000056]
 61010 1561840079.83533: sending task result for task 506b4bf5-3974-a649-31ae-000000000056
 61010 1561840079.83567: done sending task result for task 506b4bf5-3974-a649-31ae-000000000056
 61010 1561840079.83575: WORKER PROCESS EXITING
fatal: [hostname]: UNREACHABLE! => {
    "changed": false, 
    "msg": "Connection timed out.", 
    "unreachable": true
}
[mux  60775] 20:28:03.169269 D mitogen: ssh.10.249.3.28: debug3: send packet: type 80
[mux  60775] 20:28:03.169646 D mitogen: ssh.10.249.3.28: debug3: receive packet: type 82
[mux  60775] 20:28:03.172083 D mitogen: ssh.10.249.3.33: debug3: send packet: type 80
[mux  60775] 20:28:03.172364 D mitogen: ssh.10.249.3.33: debug3: receive packet: type 82
[mux  60775] 20:28:03.172592 D mitogen: ssh.10.249.3.32: debug3: send packet: type 80
[mux  60775] 20:28:03.172764 D mitogen: ssh.10.249.3.32: debug3: receive packet: type 82

If it doesn't hang it looks like below (Expected output)

task 9426] 19:32:07.468708 D mitogen: mitogen.core.Stream('unix_listener.9171').on_disconnect()
[mux  9171] 19:32:07.469339 D mitogen: mitogen.core.Stream('unix_client.9426').on_disconnect()
[task 9426] 19:32:07.469341 D mitogen: Waker(Broker(0x7fca93433810) rfd=41, wfd=42).on_disconnect()
[task 9426] 19:32:07.469874 D mitogen: Router(Broker(0x7fca93433810)): stats: 0 module requests in 0 ms, 0 sent (0 ms minify time), 0 negative responses. Sent 0.0 kb total, 0.0 kb avg.
  9426 1561836727.47090: done running TaskExecutor() for hostname/TASK: ping [506b4bf5-3974-0ca7-6a9c-000000000056]
  9426 1561836727.47111: sending task result for task 506b4bf5-3974-0ca7-6a9c-000000000056
  9426 1561836727.47143: done sending task result for task 506b4bf5-3974-0ca7-6a9c-000000000056
  9426 1561836727.47149: WORKER PROCESS EXITING
fatal: [hostname]: UNREACHABLE! => {
    "changed": false, 
    "msg": "Connection timed out.", 
    "unreachable": true
}
  9158 1561836727.47340: no more pending results, returning what we have

if I take out strategy_plugins = /home/zabbixserver/zabbixautomation/mitogen-0.2.7/ansible_mitogen/plugins/strategy strategy = mitogen_linearI can loop it for 50 times and still won't hang

Below is the content of ping.yml playbook

---
- hosts: all
  gather_facts: no
  tasks:
    - ping:

@antigenius0910
Copy link
Author

more log with "MITOGEN_DUMP_THREAD_STACKS=10"

-# PID 41572 ThreadID: (mitogen.broker) 140670864463616; <frame object at 0x7ff07c262de0>
+# PID 41572 ThreadID: (mitogen.broker) 140670864463616; <frame object at 0x7ff07c26d450>
 File: "/usr/lib/python2.7/threading.py", line 774, in __bootstrap
     self.__bootstrap_inner()
 File: "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner

[mux  23927] 22:02:48.050536 D mitogen: mitogen.ssh.Stream('ssh.10.249.3.122').on_disconnect()
[mux  23927] 22:02:48.050908 D mitogen: mitogen.parent.DiagLogStream(fd=243, 'ssh.10.249.3.122').on_disconnect()
[mux  23927] 22:02:48.051213 D mitogen: mitogen.ssh.Stream('ssh.10.249.3.122'): child process still alive, sending SIGTERM
 24244 1561845768.05236: _execute() done
 24244 1561845768.05249: dumping result to json
 24244 1561845768.05263: done dumping result, returning
[task 24244] 22:02:48.052871 D mitogen: mitogen.core.Stream('unix_listener.23927').on_disconnect()
[mux  23927] 22:02:48.053470 D mitogen: mitogen.core.Stream('unix_client.24244').on_disconnect()
[task 24244] 22:02:48.053491 D mitogen: Waker(Broker(0x7f1a391d7210) rfd=41, wfd=42).on_disconnect()
[task 24244] 22:02:48.054089 D mitogen: Router(Broker(0x7f1a391d7210)): stats: 0 module requests in 0 ms, 0 sent (0 ms minify time), 0 negative responses. Sent 0.0 kb total, 0.0 kb avg.
 24244 1561845768.05521: done running TaskExecutor() for dal3-qz1-sr1-rk314-s24/TASK: PWD [506b4bf5-3974-0960-3023-000000000056]
 24244 1561845768.05545: sending task result for task 506b4bf5-3974-0960-3023-000000000056
 24244 1561845768.05579: done sending task result for task 506b4bf5-3974-0960-3023-000000000056
 24244 1561845768.05587: WORKER PROCESS EXITING
fatal: [dal3-qz1-sr1-rk314-s24]: UNREACHABLE! => {
    "changed": false, 
    "msg": "Connection timed out.", 
    "unreachable": true
}
[mux  23927] 22:02:48.475726 I mitogen.ctx.ssh.10.249.3.111: mitogen.debug: PERIODIC THREAD DUMP



# PID 35706 ThreadID: (mitogen.service.Pool.7fa239d6fe50.worker-15) 140333476206336; <frame object at 0x7fa239d18810>
File: "/usr/lib/python2.7/threading.py", line 774, in __bootstrap
    self.__bootstrap_inner()
File: "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
    self.run()
File: "/usr/lib/python2.7/threading.py", line 754, in run
    self.__target(*self.__args, **self.__kwargs)
File: "<stdin>", line 561, in _profile_hook
File: "master:/home/zabbixserver/zabbixautomation/mitogen-0.2.7/mitogen/service.py", line 598, in _worker_main
    self._worker_run()
File: "master:/home/zabbixserver/zabbixautomation/mitogen-0.2.7/mitogen/service.py", line 584, in _worker_run
    event = self._select.get_event()
File: "master:/home/zabbixserver/zabbixautomation/mitogen-0.2.7/mitogen/select.py", line 315, in get_event
    recv = self._latch.get(timeout=timeout, block=block)
File: "<stdin>", line 2208, in get
File: "<stdin>", line 2225, in _get_sleep
File: "master:/home/zabbixserver/zabbixautomation/mitogen-0.2.7/mitogen/parent.py", line 927, in _poll
    events, _ = mitogen.core.io_op(self._pollobj.poll, timeout)
File: "<stdin>", line 482, in io_op

# PID 35706 ThreadID: (mitogen.service.Pool.7fa239d6fe50.worker-14) 140333484599040; <frame object at 0x7fa239d18240>
File: "/usr/lib/python2.7/threading.py", line 774, in __bootstrap
    self.__bootstrap_inner()
File: "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
    self.run()
File: "/usr/lib/python2.7/threading.py", line 754, in run
    self.__target(*self.__args, **self.__kwargs)
File: "<stdin>", line 561, in _profile_hook
File: "master:/home/zabbixserver/zabbixautomation/mitogen-0.2.7/mitogen/service.py", line 598, in _worker_main
    self._worker_run()
File: "master:/home/zabbixserver/zabbixautomation/mitogen-0.2.7/mitogen/service.py", line 584, in _worker_run
    event = self._select.get_event()
File: "master:/home/zabbixserver/zabbixautomation/mitogen-0.2.7/mitogen/select.py", line 315, in get_event
    recv = self._latch.get(timeout=timeout, block=block)
File: "<stdin>", line 2208, in get
File: "<stdin>", line 2225, in _get_sleep
File: "master:/home/zabbixserver/zabbixautomation/mitogen-0.2.7/mitogen/parent.py", line 927, in _poll
    events, _ = mitogen.core.io_op(self._pollobj.poll, timeout)
File: "<stdin>", line 482, in io_op

# PID 35706 ThreadID: (mitogen.service.Pool.7fa239d6fe50.worker-0) 140334625449728; <frame object at 0x7fa239d7d050>
File: "/usr/lib/python2.7/threading.py", line 774, in __bootstrap
    self.__bootstrap_inner()
File: "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
    self.run()
File: "/usr/lib/python2.7/threading.py", line 754, in run
    self.__target(*self.__args, **self.__kwargs)
File: "<stdin>", line 561, in _profile_hook
File: "master:/home/zabbixserver/zabbixautomation/mitogen-0.2.7/mitogen/service.py", line 598, in _worker_main
    self._worker_run()
File: "master:/home/zabbixserver/zabbixautomation/mitogen-0.2.7/mitogen/service.py", line 584, in _worker_run
    event = self._select.get_event()
File: "master:/home/zabbixserver/zabbixautomation/mitogen-0.2.7/mitogen/select.py", line 315, in get_event
    recv = self._latch.get(timeout=timeout, block=block)
File: "<stdin>", line 2208, in get
File: "<stdin>", line 2225, in _get_sleep
File: "master:/home/zabbixserver/zabbixautomation/mitogen-0.2.7/mitogen/parent.py", line 927, in _poll
    events, _ = mitogen.core.io_op(self._pollobj.poll, timeout)
File: "<stdin>", line 482, in io_op

# PID 35706 ThreadID: (mitogen.service.Pool.7fa239d6fe50.worker-6) 140334029862656; <frame object at 0x7fa239d09240>
File: "/usr/lib/python2.7/threading.py", line 774, in __bootstrap
    self.__bootstrap_inner()
File: "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
    self.run()
File: "/usr/lib/python2.7/threading.py", line 754, in run
    self.__target(*self.__args, **self.__kwargs)
File: "<stdin>", line 561, in _profile_hook
File: "master:/home/zabbixserver/zabbixautomation/mitogen-0.2.7/mitogen/service.py", line 598, in _worker_main
    self._worker_run()
File: "master:/home/zabbixserver/zabbixautomation/mitogen-0.2.7/mitogen/service.py", line 584, in _worker_run
    event = self._select.get_event()
File: "master:/home/zabbixserver/zabbixautomation/mitogen-0.2.7/mitogen/select.py", line 315, in get_event
    recv = self._latch.get(timeout=timeout, block=block)
File: "<stdin>", line 2208, in get
File: "<stdin>", line 2225, in _get_sleep
File: "master:/home/zabbixserver/zabbixautomation/mitogen-0.2.7/mitogen/parent.py", line 927, in _poll
    events, _ = mitogen.core.io_op(self._pollobj.poll, timeout)
File: "<stdin>", line 482, in io_op

@antigenius0910
Copy link
Author

more logs

strace -ff -p 1361
strace: Process 1361 attached with 2 threads
[pid  1361] select(0, NULL, NULL, NULL, {tv_sec=0, tv_usec=386} <unfinished ...>
[pid  1370] read(3,  <unfinished ...>
[pid  1361] <... select resumed> )      = 0 (Timeout)
[pid  1361] wait4(1814, 0x7ffe7e040864, WNOHANG, NULL) = 0
[pid  1361] wait4(1814, 0x7ffe7e040884, WNOHANG, NULL) = 0
[pid  1361] select(0, NULL, NULL, NULL, {tv_sec=0, tv_usec=1000}) = 0 (Timeout)
[pid  1361] wait4(1814, 0x7ffe7e040864, WNOHANG, NULL) = 0
[pid  1361] wait4(1814, 0x7ffe7e040884, WNOHANG, NULL) = 0
[pid  1361] select(0, NULL, NULL, NULL, {tv_sec=0, tv_usec=1000}) = 0 (Timeout)
[pid  1361] wait4(1814, 0x7ffe7e040864, WNOHANG, NULL) = 0
[pid  1361] wait4(1814, 0x7ffe7e040884, WNOHANG, NULL) = 0
[pid  1361] select(0, NULL, NULL, NULL, {tv_sec=0, tv_usec=1000}) = 0 (Timeout)
[pid  1361] wait4(1814, 0x7ffe7e040864, WNOHANG, NULL) = 0
[pid  1361] wait4(1814, 0x7ffe7e040884, WNOHANG, NULL) = 0
[pid  1361] select(0, NULL, NULL, NULL, {tv_sec=0, tv_usec=1000}) = 0 (Timeout)
strace -ff -p 1371
strace: Process 1371 attached with 35 threads
[pid  1396] restart_syscall(<... resuming interrupted poll ...> <unfinished ...>
[pid  1403] restart_syscall(<... resuming interrupted poll ...> <unfinished ...>
[pid  1402] restart_syscall(<... resuming interrupted poll ...> <unfinished ...>
[pid  1401] restart_syscall(<... resuming interrupted poll ...> <unfinished ...>
[pid  1400] restart_syscall(<... resuming interrupted poll ...> <unfinished ...>
[pid  1404] restart_syscall(<... resuming interrupted poll ...> <unfinished ...>
[pid  1405] restart_syscall(<... resuming interrupted poll ...> <unfinished ...>
[pid  1399] restart_syscall(<... resuming interrupted poll ...> <unfinished ...>
[pid  1398] restart_syscall(<... resuming interrupted poll ...> <unfinished ...>
[pid  1397] restart_syscall(<... resuming interrupted poll ...> <unfinished ...>
[pid  1395] restart_syscall(<... resuming interrupted poll ...> <unfinished ...>
[pid  1394] restart_syscall(<... resuming interrupted poll ...> <unfinished ...>
[pid  1393] restart_syscall(<... resuming interrupted poll ...> <unfinished ...>
[pid  1392] restart_syscall(<... resuming interrupted poll ...> <unfinished ...>
[pid  1391] restart_syscall(<... resuming interrupted poll ...> <unfinished ...>
[pid  1390] restart_syscall(<... resuming interrupted poll ...> <unfinished ...>
[pid  1389] restart_syscall(<... resuming interrupted poll ...> <unfinished ...>
[pid  1388] restart_syscall(<... resuming interrupted poll ...> <unfinished ...>
[pid  1387] restart_syscall(<... resuming interrupted poll ...> <unfinished ...>
[pid  1386] restart_syscall(<... resuming interrupted poll ...> <unfinished ...>
[pid  1385] restart_syscall(<... resuming interrupted poll ...> <unfinished ...>
[pid  1384] restart_syscall(<... resuming interrupted poll ...> <unfinished ...>
[pid  1383] restart_syscall(<... resuming interrupted poll ...> <unfinished ...>
[pid  1382] restart_syscall(<... resuming interrupted poll ...> <unfinished ...>
[pid  1381] restart_syscall(<... resuming interrupted poll ...> <unfinished ...>
[pid  1380] restart_syscall(<... resuming interrupted poll ...> <unfinished ...>
[pid  1379] restart_syscall(<... resuming interrupted poll ...> <unfinished ...>
[pid  1378] restart_syscall(<... resuming interrupted poll ...> <unfinished ...>
[pid  1377] restart_syscall(<... resuming interrupted poll ...> <unfinished ...>
[pid  1376] restart_syscall(<... resuming interrupted poll ...> <unfinished ...>
[pid  1375] restart_syscall(<... resuming interrupted poll ...> <unfinished ...>
[pid  1374] restart_syscall(<... resuming interrupted poll ...> <unfinished ...>
[pid  1373] select(0, NULL, NULL, NULL, {tv_sec=2, tv_usec=237919} <unfinished ...>
[pid  1371] recvfrom(29,  <unfinished ...>
[pid  1372] epoll_wait(31, [{EPOLLIN, {u32=113, u64=94484985544817}}], 32, -1) = 1
[pid  1372] read(113, "MI\0\0\0\0\0\0ur\0\0ur\0\0\0f\0\0\0\0\0\0T\215mitoge"..., 131072) = 21671
[pid  1372] getpid()                    = 1371
[pid  1372] getpid()                    = 1371
[pid  1372] getpid()                    = 1371
[pid  1372] stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=127, ...}) = 0
[pid  1372] write(1, "\33[0;34m[mux  1371] 22:49:04.7569"..., 1024) = 1024
[pid  1372] write(1, "ter:/home/zabbixserver/zabbixaut"..., 24576) = 24576
[pid  1372] write(1, "99c450>\33[0m\n", 12) = 12
[pid  1372] write(1, "\33[0;34m+# PID 38273 ThreadID: (m"..., 101) = 101
[pid  1372] write(1, "\33[0;34m File: \"/usr/lib/python2."..., 78) = 78
[pid  1372] write(1, "\33[0;34m     self.__bootstrap_inn"..., 41) = 41
[pid  1372] write(1, "\33[0;34m File: \"/usr/lib/python2."..., 84) = 84
[pid  1372] write(1, "\33[0;34m\33[0m\n", 12) = 12
[pid  1372] epoll_wait(31, [{EPOLLIN, {u32=204, u64=94484985544908}}], 32, -1) = 1
[pid  1372] read(204, "MI\0\0\0\0\0\0ux\0\0ux\0\0\0f\0\0\0\0\0\0T\213mitoge"..., 131072) = 21669
[pid  1372] getpid()                    = 1371
[pid  1372] getpid()                    = 1371
[pid  1372] getpid()                    = 1371
[pid  1372] stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=127, ...}) = 0
[pid  1372] write(1, "\33[0;34m[mux  1371] 22:49:05.0195"..., 1024) = 1024
[pid  1372] write(1, "ter:/home/zabbixserver/zabbixaut"..., 24576) = 24576
[pid  1372] write(1, "8450>\33[0m\n", 10) = 10
[pid  1372] write(1, "\33[0;34m+# PID 13499 ThreadID: (m"..., 101) = 101
[pid  1372] write(1, "\33[0;34m File: \"/usr/lib/python2."..., 78) = 78
[pid  1372] write(1, "\33[0;34m     self.__bootstrap_inn"..., 41) = 41
[pid  1372] write(1, "\33[0;34m File: \"/usr/lib/python2."..., 84) = 84
[pid  1372] write(1, "\33[0;34m\33[0m\n", 12) = 12
[pid  1372] epoll_wait(31, [{EPOLLIN, {u32=242, u64=94484985544946}}], 32, -1) = 1
[pid  1372] read(242, "MI\0\0\0\0\0\0u\203\0\0u\203\0\0\0f\0\0\0\0\0\0T\215mitoge"..., 131072) = 21671
[pid  1372] getpid()                    = 1371
[pid  1372] getpid()                    = 1371
[pid  1372] getpid()                    = 1371
[pid  1372] stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=127, ...}) = 0
[pid  1372] write(1, "\33[0;34m[mux  1371] 22:49:05.0482"..., 1024) = 1024
[pid  1372] write(1, "ter:/home/zabbixserver/zabbixaut"..., 24576) = 24576
[pid  1372] write(1, "39c450>\33[0m\n", 12) = 12
[pid  1372] write(1, "\33[0;34m+# PID 51294 ThreadID: (m"..., 101) = 101
[pid  1372] write(1, "\33[0;34m File: \"/usr/lib/python2."..., 78) = 78
[pid  1372] write(1, "\33[0;34m     self.__bootstrap_inn"..., 41) = 41
[pid  1372] write(1, "\33[0;34m File: \"/usr/lib/python2."..., 84) = 84
[pid  1372] write(1, "\33[0;34m\33[0m\n", 12) = 12
[pid  1372] epoll_wait(31, [{EPOLLIN, {u32=169, u64=94484985544873}}], 32, -1) = 1
[pid  1372] read(169, "MI\0\0\0\0\0\0ut\0\0ut\0\0\0f\0\0\0\0\0\0T\215mitoge"..., 131072) = 21671
[pid  1372] getpid()                    = 1371
[pid  1372] getpid()                    = 1371
[pid  1372] getpid()                    = 1371
[pid  1372] stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=127, ...}) = 0
[pid  1372] write(1, "\33[0;34m[mux  1371] 22:49:05.0566"..., 1024) = 1024
[pid  1372] write(1, "ter:/home/zabbixserver/zabbixaut"..., 24576) = 24576
[pid  1372] write(1, "cac450>\33[0m\n", 12) = 12
[pid  1372] write(1, "\33[0;34m+# PID 49442 ThreadID: (m"..., 101) = 101
[pid  1372] write(1, "\33[0;34m File: \"/usr/lib/python2."..., 78) = 78
[pid  1372] write(1, "\33[0;34m     self.__bootstrap_inn"..., 41) = 41
[pid  1372] write(1, "\33[0;34m File: \"/usr/lib/python2."..., 84) = 84
[pid  1372] write(1, "\33[0;34m\33[0m\n", 12) = 12
[pid  1372] epoll_wait(31, [{EPOLLIN, {u32=208, u64=94484985544912}}], 32, -1) = 1
[pid  1372] read(208, "MI\0\0\0\0\0\0u~\0\0u~\0\0\0f\0\0\0\0\0\0T\215mitoge"..., 131072) = 21671
[pid  1372] getpid()                    = 1371
[pid  1372] getpid()                    = 1371
[pid  1372] getpid()                    = 1371
[pid  1372] stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=127, ...}) = 0
[pid  1372] write(1, "\33[0;34m[mux  1371] 22:49:05.0706"..., 1024) = 1024
[pid  1372] write(1, "ter:/home/zabbixserver/zabbixaut"..., 24576) = 24576
[pid  1372] write(1, "cf5450>\33[0m\n", 12) = 12
[pid  1372] write(1, "\33[0;34m+# PID 62442 ThreadID: (m"..., 101) = 101
[pid  1372] write(1, "\33[0;34m File: \"/usr/lib/python2."..., 78) = 78
[pid  1372] write(1, "\33[0;34m     self.__bootstrap_inn"..., 41) = 41
[pid  1372] write(1, "\33[0;34m File: \"/usr/lib/python2."..., 84) = 84
[pid  1372] write(1, "\33[0;34m\33[0m\n", 12) = 12
[pid  1372] epoll_wait(31, [{EPOLLIN, {u32=231, u64=94484985544935}}], 32, -1) = 1
[pid  1372] read(231, "MI\0\0\0\0\0\0u\200\0\0u\200\0\0\0f\0\0\0\0\0\0T\215mitoge"..., 131072) = 21671
[pid  1372] getpid()                    = 1371
[pid  1372] getpid()                    = 1371
[pid  1372] getpid()                    = 1371
strace -ff -p 1814
strace: Process 1814 attached with 3 threads
[pid  1822] epoll_wait(43,  <unfinished ...>
[pid  1821] futex(0x55ef7717ecc0, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, 0xffffffff <unfinished ...>
[pid  1814] restart_syscall(<... resuming interrupted poll ...>

@dw
Copy link
Member

dw commented Aug 3, 2019

Sorry for the late acknowledgement, I've been busy elsewhere. :) Thanks for an amazing bug report! It sounds like you might be hitting a deadlock early during startup.

I will aim to set up a reproduction 'real soon'. The current master branch needs a soak test before release, will try running it against 100 nodes to see if we can flush this one out.

There are some forking-related deadlocks on current master that might explain this. I have not changed in the recent releases relating to forking, and the last soak I did was fine.

How is the quality of networking? I don't think it is a network issue, but worth asking just in case

@dw
Copy link
Member

dw commented Aug 11, 2019

I reproduced your issue using 96 Google Cloud nodes ( https://github.com/dw/mitogen/blob/master/tests/ansible/gcloud/mitogen-load-testing.tf ), 0.2.7 fails very quickly.

I have found and fixed 2 deadlocks, one during startup in the target that looks like it could have impacted a lot of people ( 769a8b2 ), and one in the master ( f78a5f0 ). After 120 runs (11,520 connections) I can no longer reproduce your issue

Thanks for reporting this, and apologies for the delay in responding.

This is now on the master branch and will make it into the next release. To be updated when a new release is made, subscribe to https://networkgenomics.com/mail/mitogen-announce/

@dw dw closed this as completed Aug 11, 2019
dw added a commit that referenced this issue Aug 11, 2019
dw added a commit that referenced this issue Aug 12, 2019
dw added a commit that referenced this issue Aug 12, 2019
* origin/dmw:
  issue #613: must await 'exit' and 'disconnect' in wait=False test
  Import LGTM config to disable some stuff
  Fix up another handful of LGTM errors.
  tests: work around AnsibleModule.run_command() race.
  docs: mention another __main__ safeguard
  docs: tweaks
  formatting error
  docs: make Sphinx install soft fail on Python 2.
  issue #598: allow disabling preempt in terraform
  issue #598: update Changelog.
dw added a commit that referenced this issue Aug 18, 2019
* origin/v028: (383 commits)
  Bump version for release.
  docs: update Changelog for 0.2.8.
  issue #627: add test and tweak Reaper behaviour.
  docs: lots more changelog concision
  docs: changelog concision
  docs: more changelog tweaks
  docs: reorder chapters
  docs: versionless <title>
  docs: update supported Ansible version, mention unsupported features
  docs: changelog fixes/tweaks
  issue #590: update Changelog.
  issue #621: send ADD_ROUTE earlier and add test for early logging.
  issue #590: whoops, import missing test modules
  issue #590: rework ParentEnumerationMethod to recursively handle bad modules
  issue #627: reduce the default pool size in a child to 2.
  tests: add a few extra service tests.
  docs: some more hyperlink joy
  docs: more hyperlinks
  docs: add domainrefs plugin to make link aliases everywhere \o/
  docs: link IS_DEAD in changelog
  docs: tweaks to better explain changelog race
  issue #533: update routing to account for DEL_ROUTE propagation race
  tests: use defer_sync() Rather than defer() + ancient sync_with_broker()
  tests: one case from doas_test was invoking su
  tests: hide memory-mapped files from lsof output
  issue #615: remove meaningless test
  issue #625: ignore SIGINT within MuxProcess
  issue #625: use exec() instead of subprocess in mitogen_ansible_playbook
  issue #615: regression test
  issue #615: update Changelog.
  issue #615: ensure 4GB max_message_size is configured for task workers.
  issue #615: update Changelog.
  issue #615: route a dead message to recipients when no reply is expected
  issue #615: fetch_file() might be called with AnsibleUnicode.
  issue #615: redirect 'fetch' action to 'mitogen_fetch'.
  issue #615: extricate slurp brainwrong from mitogen_fetch
  issue #615: ansible: import Ansible fetch.py action plug-in
  issue #533: include object identity of Stream in repr()
  docs: lots more changelog
  issue #595: add buildah to docs and changelog.
  docs: a few more internals.rst additions
  ci: update to Ansible 2.8.3
  tests: another random string changed in 2.8.3
  tests: fix sudo_flags_failure for Ansible 2.8.3
  ci: fix procps command line format warning
  Whoops, merge together lgtm.yml and .lgtm.yml
  issue #440: log Python version during bootstrap.
  docs: update changelog
  issue #558: disable test on OSX to cope with boundless mediocrity
  issue #558, #582: preserve remote tmpdir if caller did not supply one
  issue #613: must await 'exit' and 'disconnect' in wait=False test
  Import LGTM config to disable some stuff
  Fix up another handful of LGTM errors.
  tests: work around AnsibleModule.run_command() race.
  docs: mention another __main__ safeguard
  docs: tweaks
  formatting error
  docs: make Sphinx install soft fail on Python 2.
  issue #598: allow disabling preempt in terraform
  issue #598: update Changelog.
  issue #605: update Changelog.
  issue #605: ansible: share a sem_t instead of a pthread_mutex_t
  issue #613: add tests for all the weird shutdown methods
  Add mitogen.core.now() and use it everywhere; closes #614.
  docs: move decorator docs into core.py and use autodecorator
  preamble_size: make it work on Python 3.
  docs: upgrade Sphinx to 2.1.2, require Python 3 to build docs.
  docs: fix Sphinx warnings, add LogHandler, more docstrings
  docs: tidy up some Changelog text
  issue #615: fix up FileService tests for new logic
  issue #615: another Py3x fix.
  issue #615: Py3x fix.
  issue #615: update Changelog.
  issue #615: use FileService for target->controll file transfers
  issue #482: another Py3 fix
  ci: try removing exclude: to make Azure jobs work again
  compat: fix Py2.4 SyntaxError
  issue #482: remove 'ssh' from checked processes
  ci: Py3 fix
  issue #279: add one more test for max_message_size
  issue #482: ci: add stray process checks to all jobs
  tests: fix format string error
  core: MitogenProtocol.is_privileged was not set in children
  issue #482: tests: fail DockerMixin tests if stray processes exist
  docs: update Changelog.
  issue #586: update Changelog.
  docs: update Changelog.
  [security] core: undirectional routing wasn't respected in some cases
  docs: tidy up Select.all()
  issue #612: update Changelog.
  master: fix TypeError
  pkgutil: fix Python3 compatibility
  parent: use protocol for getting remote_id
  docs: merge signals.rst into internals.rst
  os_fork: do not attempt to cork the active thread.
  parent: fix get_log_level() for split out loggers.
  issue #547: fix service_test failures.
  issue #547: update Changelog.
  issue #547: core/service: race/deadlock-free service pool init
  docs: update Changelog.
  ...
@backbord
Copy link

backbord commented Oct 5, 2019

Hi,

first and foremost: thanks for the great work!

Perhaps this might be of interest to you:
In my setup with strategy mitogen_linear and forks=64, pipelining=True for roughly 125 hosts, I still see the problem described in the bug report: Connection timed out.

This is a very minimal playbook which shows the problem for roughly 5 or more random hosts.

- hosts: all
  strategy: mitogen_linear
  gather_facts: no
  tasks:
    - setup:

If I reset pipelining or forks or both to the default settings, the problem still shows up.
If I set forks to a value greater than the number of hosts, the problem still occurs.

Any advice would be appreciated.

Thanks,
Tim

EDIT: From time to time this issue shows up even when I limit to a single machine.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants