-
Notifications
You must be signed in to change notification settings - Fork 199
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[0.2.4] Ansible with sudo results in channel disconnect #481
Comments
This is some fabulous debugging on your part, thanks so much! It definitely looks like some kind of early startup crash, because there are no messages after ec0_received suggesting e.g. the child started importing modules. The 'DiagLogStream.on_disconnect()' means the bootstrap had completed entirely, and only the main process remained on the remote - there is no good debug log meaning "we registered the new stream with the io loop", instead you can only see disconnect messages generated by the io loop Do you have strace available on one of those machines? Is there any possibility you could try out this trick? Otherwise I will start combing the work done in November that you have isolated. Currently working on some other stuff, but will get to this ASAP. Thanks again |
Ah, doh, forget that. The strace won't work because strace can't trace a setuid app like sudo. You'll just get a completely different failure instead. If you're feeling brave, you could use the same 'strace -ff' attached to the top-level SSHd on the box. When it forks to accept the Ansible client, the tracer will eventually be installed on every process including the Python interpreter that is dying. This will produce a ton of files.. easiest way to find a relevant one is 'grep -E execve.*python /tmp/....' |
The commit you flagged was painful, if I remember it broke a few things. Will find time to look at all the setup code it touches, its very possibly the culprit. Note that ticket was also about cleaning up many FD leaks during connect. One case in particular that I definitely recall was having my process killed by the kernel because the controlling side of the child's PTY ended up owned by the parent. This code touched the order in which FDs got opened and closed, and it could be somehow related to the parent's interaction with the master side of the PTY.
Hangup! That is very TTY-related. Okay, working theory, just need to find time to look into this. Again thanks a ton for the bisecting, it helps so much |
I'm starting to recall a variant of this very issue, because I had the sudoers manpage open scanning for FD-related options very recently. Do you have any strangeish 'Defaults' lines in your sudoers, or maybe an SELinux policy loaded? It would be nice to know why I'm not seeing this here |
The only reason sudo could be getting SIGHUP from the kernel is if the parent process closed the master side of the TTY, so I don't think this can be explained by any sudoers options.. it doesn't even receive a copy of the relevant FD |
No selinux. The only potentially relevant sudo option -- I think -- is not-requiretty It's an old machine. LTS, but old, so it could be a kernel thing. It's also Python 2.6, so I'm trying to see if I can get a different version side-by-side. I want a later Python (3.x) for other reasons anyway; but it's going to take a while. btw, sudo will quit if run under strace. I did try sudo debug logging, but I got minimal information, so I'd have to compile a debug version. |
Its fine, don't worry about strace, SIGHUP really is the smoking gun. The only place that gets sent from in this scenario is by the kernel, and only when the Python bits in the parent close the master FD while any child still has the slave end as its controlling TTY. Another possibility is that the master FD is being overwritten. There are existing tickets open about dup2() safety in the code, and it's bitten a few times before. Those bugs tend to reveal themselves at strange times, and always manifest by an unrelated component failing |
FWIW when I'm debugging this stuff, I find it useful to litter the parent.py connect functions with "os.system('ls- l /proc/%d/fd' % os.getpid())" :) total hack, but the only way to get useful diagnostic info out a the exact moment it's meaningful |
Hi Berend :) Just a few questions to narrow this a bit:
|
Another idea is that for whatever reason on your config, the sudo invocation is closing the last FD connected to the slave PTY. That causes DiagLogStream.on_disconnect() to fire, which in turn causes the master to close the master side. That would trigger the SIGHUP. If you're feeling daring, comment out these two lines in parent.py (line ~1330):
That would prevent it from trying to read from the master TTY, and thus never noticing slave disconnect. The master FD would still be cleaned up as normal during the data FD's on_disconnect(). |
Yes, I checked forward/backwards twice. It happens reliably on older machines, but not newer machines, so it's possibly kernel/python/sudo version. I'm collecting more data on where it's a problem, ie. bisect OS. Broken: python 2.6.8 / old kernel / sudo 1.6.8 (RHEL patched) I think I found my solution. I shouldn't be running sudo 1.6.8 anywhere anyway. 1.7 still received critical fixes. |
Hi Berend, I'd like to reproduce this before release, could you tell me which precise version of RHEL/CentOS the sudo comes from? Otherwise I can try with CentOS 5.11 |
I have reproduced it on 5.0 Log
|
I think I have an idea of what's going on.. this bizarre old version of sudo doesn't hang around after it starts the child process. In fact, it overwrites itself with Python. Python then forks into a 'first stage', which keeps the TTY FD around, while the parent half of the fork re-execs Python to clean the inherited argv. Parent half setup completes, including by overwriting the inherited TTY FD with an IoLogger to capture spurious output generated by subprocesses, and the first stage exits -- closing the last remaining TTY FD that is open. |
Since 802de6a, sudo on CentOS 5 had begun failing due to a TTY FD leak in the parent process being fixed. The old versions of sudo doesn't hang around after starting a child -- they exec the privilege-escalated child process on top of themselves, meaning no spare copy of the TTY FD is kept alive by sudo. When the child starts up, it replaces stdio with IoLoggers, including the inherited stderr FD connected to DiagLogStream/the slave PTY. When the last process closes a slave PTY, the kernel sends SIGHUP to any processes still having it as the controlling TTY. Therefore we must either ignore SIGHUP until the first stage has been waited on (since the first stage also preserve the FD), or dup the inherited TTY FD and keep it around forever. Wasting one FD seems less annoying than modifying process signals for all potential library users, so that is the approach taken here.
* origin/dmw: docs: update Changelog; closes #481 issue #481: core: preserve stderr TTY FD if one is present. issue #481: avoid crash if disconnect occurs during forward_modules() Add a few more important modules to preamble_size.py. .ci: add verbiage for run_batches() too. .ci: add README.md. docs: update thanks docs: lose "approaching stability" language, we're pretty good now docs: fix changelog syntax/order/"20KB"
thanks a ton for persisting with this! This is now on the master branch and will make it into the next release. To be updated when a new release is made, subscribe to https://networkgenomics.com/mail/mitogen-announce/ Thanks for reporting this! |
* origin/master: (661 commits) Bump version for release. docs: update Changelog; closes #481 issue #481: core: preserve stderr TTY FD if one is present. issue #481: avoid crash if disconnect occurs during forward_modules() Add a few more important modules to preamble_size.py. .ci: add verbiage for run_batches() too. .ci: add README.md. docs: update thanks docs: lose "approaching stability" language, we're pretty good now docs: fix changelog syntax/order/"20KB" tests: add new compression parameter to mitogen_get_stack results tests: disable affinity_test on Travis :/ issue #508: fix responder stats test due to new smaller parent.py. issue #508: tests: skip minify_test Py2.4/2.5 for profiler.py. tests: fix fallout from 36fb318 issue #520: add AIX auth failure string to su. tests: move affinity_test to Ansible tests. core: cProfile is not available in 2.4. issue #505: docs: add new detail graph for one scenario. docs: update and re-record profile graphs in docs; closes #505 service: fix PushFileService exception tests: pad out localhost-* service: start pool shutdown on broker shutdown. master: .encode() needed for Py3. ansible: stash PID files in CWD if requested for debugging. issue #508: master: minify_safe_re must be bytes for Py3. bench: tidy up and cpu-pin some more files. tests: add localhost-x100 ansible: double the default pool size. ansible: raise error with correct exception type. issue #508: master: minify all Mitogen/ansible_mitogen sources. parent: PartialZlib docstrings. ansible: hacky parser to alow bools to be specified on command line parent: pre-cache bootstrap if possible. docs: update Changelog. ansible: add mitogen_ssh_compression variable. service: PushFileService never recorded a file as sent. parent: synchronize get_core_source() service: use correct profile aggregation name. SyntaxError. ansible: don't pin controller if <4 cores. tests: make soak testing work reliably on vanilla. docs: changelog tidyups. ansible: document and make affinity stuff portable to non-Linux ansible: fix affinity.py test failure on 2 cores. ansible: preheat PluginLoader caches before fork. tests: make mitogen_shutdown_all be run_once by default. docs: update Changelog. ansible: use Poller for WorkerProcess; closes #491. ansible: new multiplexer/workers configuration docs: update Changelog. docs: update Changelog. ansible: pin connection multiplexer to a single core utils: pad out reset_affinity() and integrate with detach_popen() utils: import reset_affinity() function. master: set Router.profiling if MITOGEN_PROFILING variable present. parent: don't kill children when profiling is active. ansible: hook strategy and worker processes into profiler profiler: import from linear2 branch core: tidy up existing profiling code and support MITOGEN_PROFILE_FMT issue #260: redundant if statement. ansible: ensure MuxProcess MITOGEN_PROFILING results reach disk. ansible/bench: make end= configurable. master: cache sent/forwarded module names Aggregate code coverage data across tox all runs Allow independant control of coverage erase and reporting Fix incorrect attempt to use coverage docs: update Changelog; closes #527. issue #527: catch new-style module tracebacks like vanilla. Fix DeprecationWarning in mitogen.utils.run_with_router() Generate coverage report even if some tests fail ci: fix incorrect partition/rpartition from 8a4caea issue #260: hide force-disconnect messages. issue #498: fix shutdown crash issue #260: avoid start_transmit()/on_transmit()/stop_transmit() core: ensure broker profiling output reaches disk master: keep is_stdlib_path() result as negative cache entry ci: Allow DISTROS="debian*32" variable, and KEEP=1 Use develop mode in tox issue #429: fix sudo regression. misc: rename to scripts. tab completion!! core: Latch._wake improvements issue #498: prevent crash on double 'disconnect' signal. issue #413: don't double-propagate DEL_ROUTE to parent. issue #498: wrap Router dict mutations in a lock issue #429: enable en_US locale to unbreak debops test. issue #499: fix another mind-numbingly stupid vanilla inconsistency issue #497: do our best to cope with crap upstream code ssh: fix test to match updated log format. issue #429: update Changelog. issue #429: update Changelog. issue #429: teach sudo about every know i18n password string. issue #429: install i18n-related bits in test images. ssh: tidy up logs and stream names. tests: ensure file is closed in connection_test. gcloud: small updates tests: give ansible/gcloud/ its own requirements file. issue #499: another totally moronic implementation difference issue #499: disable new test on vanilla. docs: update Changelog; closes #499. ...
* origin/v024: (662 commits) docs: update Changelog release date. Bump version for release. docs: update Changelog; closes #481 issue #481: core: preserve stderr TTY FD if one is present. issue #481: avoid crash if disconnect occurs during forward_modules() Add a few more important modules to preamble_size.py. .ci: add verbiage for run_batches() too. .ci: add README.md. docs: update thanks docs: lose "approaching stability" language, we're pretty good now docs: fix changelog syntax/order/"20KB" tests: add new compression parameter to mitogen_get_stack results tests: disable affinity_test on Travis :/ issue #508: fix responder stats test due to new smaller parent.py. issue #508: tests: skip minify_test Py2.4/2.5 for profiler.py. tests: fix fallout from 36fb318 issue #520: add AIX auth failure string to su. tests: move affinity_test to Ansible tests. core: cProfile is not available in 2.4. issue #505: docs: add new detail graph for one scenario. docs: update and re-record profile graphs in docs; closes #505 service: fix PushFileService exception tests: pad out localhost-* service: start pool shutdown on broker shutdown. master: .encode() needed for Py3. ansible: stash PID files in CWD if requested for debugging. issue #508: master: minify_safe_re must be bytes for Py3. bench: tidy up and cpu-pin some more files. tests: add localhost-x100 ansible: double the default pool size. ansible: raise error with correct exception type. issue #508: master: minify all Mitogen/ansible_mitogen sources. parent: PartialZlib docstrings. ansible: hacky parser to alow bools to be specified on command line parent: pre-cache bootstrap if possible. docs: update Changelog. ansible: add mitogen_ssh_compression variable. service: PushFileService never recorded a file as sent. parent: synchronize get_core_source() service: use correct profile aggregation name. SyntaxError. ansible: don't pin controller if <4 cores. tests: make soak testing work reliably on vanilla. docs: changelog tidyups. ansible: document and make affinity stuff portable to non-Linux ansible: fix affinity.py test failure on 2 cores. ansible: preheat PluginLoader caches before fork. tests: make mitogen_shutdown_all be run_once by default. docs: update Changelog. ansible: use Poller for WorkerProcess; closes #491. ansible: new multiplexer/workers configuration docs: update Changelog. docs: update Changelog. ansible: pin connection multiplexer to a single core utils: pad out reset_affinity() and integrate with detach_popen() utils: import reset_affinity() function. master: set Router.profiling if MITOGEN_PROFILING variable present. parent: don't kill children when profiling is active. ansible: hook strategy and worker processes into profiler profiler: import from linear2 branch core: tidy up existing profiling code and support MITOGEN_PROFILE_FMT issue #260: redundant if statement. ansible: ensure MuxProcess MITOGEN_PROFILING results reach disk. ansible/bench: make end= configurable. master: cache sent/forwarded module names Aggregate code coverage data across tox all runs Allow independant control of coverage erase and reporting Fix incorrect attempt to use coverage docs: update Changelog; closes #527. issue #527: catch new-style module tracebacks like vanilla. Fix DeprecationWarning in mitogen.utils.run_with_router() Generate coverage report even if some tests fail ci: fix incorrect partition/rpartition from 8a4caea issue #260: hide force-disconnect messages. issue #498: fix shutdown crash issue #260: avoid start_transmit()/on_transmit()/stop_transmit() core: ensure broker profiling output reaches disk master: keep is_stdlib_path() result as negative cache entry ci: Allow DISTROS="debian*32" variable, and KEEP=1 Use develop mode in tox issue #429: fix sudo regression. misc: rename to scripts. tab completion!! core: Latch._wake improvements issue #498: prevent crash on double 'disconnect' signal. issue #413: don't double-propagate DEL_ROUTE to parent. issue #498: wrap Router dict mutations in a lock issue #429: enable en_US locale to unbreak debops test. issue #499: fix another mind-numbingly stupid vanilla inconsistency issue #497: do our best to cope with crap upstream code ssh: fix test to match updated log format. issue #429: update Changelog. issue #429: update Changelog. issue #429: teach sudo about every know i18n password string. issue #429: install i18n-related bits in test images. ssh: tidy up logs and stream names. tests: ensure file is closed in connection_test. gcloud: small updates tests: give ansible/gcloud/ its own requirements file. issue #499: another totally moronic implementation difference issue #499: disable new test on vanilla. ...
* origin/dmw: docs: update Changelog; closes #511, closes #536. docs: update Changelog release date. issue #536: disable transport_config tests on vanilla issue #536: restore correct Python interpreter selection behaviour. issue #536: connection_delegation/ tests were erroneously broken tests: define MITOGEN_INVENTORY_FILE even if -i unspecified. issue #536: add tests for each ansible_python_interpreter case. issue #536: stop defining explicit localhost in inventory. tests: allow running Ansible tests locally without -udmw again. stable: fix preamble_size on stable docs. issue #481: add test.
* origin/dmw: issue #537: disable just the trivial LinuxPolicyTest on Travis. docs: update Changelog; closes #537. ansible: refactor affinity class and add abstract tests. Bump version for release. docs: update Changelog. core: serialize calls to _service_stub_main(). docs: update Changelog; closes #532. issue #532: PushFileService race. docs: more concise Changelog. issue #541: changelog typos. ansible: quiesce boto logger; closes #541. docs: update Changelog. tests/ansible: Spec.port() test & mitogen_via= fix. Update copyright year everywhere. tests/ansible: Spec.become_pass() test. docs: remove top "Table of Contents" link docs: remove a little more top margin wastage tests/ansible: Spec.become_user() test. docs: update Changelog; closes #539. issue #539: disable logger propagation. ansible: capture stderr stream of async tasks. Closes #540. docs: update Changelog. issue #536: rework how 2.3-compatible simplejson is served .github: add some more questions to issue template docs: duplicate word docs: update Changelog. tests/ansible: Spec.become_method() test & mitogen_via= fix. setup.py: include LICENSE; closes #538. tests/ansible: Spec.become() test tests/ansible: Spec.password() test, document interactive pw limitation. tests/ansible: Spec.remote_user() test & mitogen_via= fix. tests/ansible: Spec.remote_addr() test & mitogen_via= fix. tests/ansible: Spec.transport() test. docs: lighter pink docs: add 'Fixes' heading docs: more margin tweaks for changelog docs: tighter <p> margins, even less shouting, red headings docs: tidy up footer and GitHub link docs: enable fixed_sidebar docs: sans-serif fonts, reduce shouty headings issue #536: add mitogen_via= tests too. ansible: fix a crash on 2.3 when mitogen_via= host is missing. tests: for 2.3 compatibility, disable gcloud.py for now docs: update Changelog; closes #511, closes #536. docs: update Changelog release date. issue #536: disable transport_config tests on vanilla issue #536: restore correct Python interpreter selection behaviour. issue #536: connection_delegation/ tests were erroneously broken tests: define MITOGEN_INVENTORY_FILE even if -i unspecified. issue #536: add tests for each ansible_python_interpreter case. issue #536: stop defining explicit localhost in inventory. tests: allow running Ansible tests locally without -udmw again. stable: fix preamble_size on stable docs. issue #481: add test.
…-link 15096-austria-sip-link
Ansible using mitogen_linear results in a cryptic ansible error
"Channel was disconnected while connection attempt was in progress; this may be caused by an abnormal Ansible exit, or due to an unreliable target."
This only happens with --become. This happens after MITO000 is received. So login and initial Python run happens fine.
This started with commit 802de6a (using git bisect) for issue #406.
Log
The text was updated successfully, but these errors were encountered: