Clear Ansible SSH control path on reboot #4366

Merged (1 commit) on Jan 11, 2021

Conversation

@rmol (Contributor) commented on Apr 24, 2019

Status

Ready for review

Description of Changes

When rebooting the SecureDrop servers in securedrop-admin install, remove the local Ansible SSH control path directory, ensuring stale connections won't break later steps.

Fixes #4364.
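
The change is roughly the following; this is an illustrative sketch, not the exact task as committed, and the ~/.ansible/cp location (Ansible's default control path directory) is an assumption:

    - name: Remove local Ansible SSH control path directory before rebooting
      # Runs on the admin workstation, not on the servers.
      local_action:
        module: file
        path: "{{ lookup('env', 'HOME') }}/.ansible/cp"
        state: absent
      become: no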

Testing

I was able to reliably induce a failure doing a clean installation of the release/0.12.2 branch. Without this change, the installation should fail with an error like Timeout (62s) waiting for privilege escalation prompt at the Set sysctl flags for grsecurity task.

With this change in place, the installation should make it past that step.

Deployment

This shouldn't affect the installed product; it only manipulates the filesystem on the admin workstation, reusing logic from restart-tor-carefully.yml.

Checklist

If you made changes to securedrop-admin:

  • Linting and tests (make -C admin test) pass in the admin development container

If you made non-trivial code changes:

  • I have written a test plan and validated it for this PR

@emkll (Contributor) commented on Apr 24, 2019

Tested as follows:

  • Provisioned prod VMs on this branch with SecureDrop 0.12.2~rc1 (apt-test.freedom.press)
  • Used SSH over Tor
  • Ran ./securedrop-admin install several times

On the second run, I get the following error at the very end of the run:

PLAY [Lock down firewall configuration for Application and Monitor Servers.] ***********************************

TASK [Gathering Facts] *****************************************************************************************
ok: [app]
fatal: [mon]: FAILED! => {"msg": "Timeout (62s) waiting for privilege escalation prompt: "}

NO MORE HOSTS LEFT *********************************************************************************************

NO MORE HOSTS LEFT *********************************************************************************************
	to retry, use: --limit @/home/amnesia/Persistent/securedrop/install_files/ansible-base/securedrop-prod.retry

PLAY RECAP *****************************************************************************************************
app                        : ok=106  changed=1    unreachable=0    failed=0   
localhost                  : ok=4    changed=0    unreachable=0    failed=0   
mon                        : ok=97   changed=1    unreachable=0    failed=1   

ERROR (run with -v for more): Command '['/home/amnesia/Persistent/securedrop/install_files/ansible-base/securedrop-prod.yml', '--ask-become-pass']' returned non-zero exit status 2

This should resolve the issue for first-time installs, but since the reboot task is not invoked on subsequent installs, it might not completely resolve it. Manually removing the control persist file resolves the problem, and the install succeeds on the subsequent run. Perhaps accompanying this with a docs update would be appropriate given the short turnaround for 0.12.2.

@rmol (Contributor, Author) commented on Apr 24, 2019

Maddening. I tested on hardware, three runs of direct SSH and two with SSH over Tor, and couldn't get the error to happen at firewall configuration.

@rmol (Contributor, Author) commented on Dec 5, 2019

Now that we're using Ansible 2.7, we may want to look into the reboot module for handling reboots more gracefully.
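
For reference, a minimal use of the module would look something like this (an illustrative sketch; the task name and timeout value are not taken from our playbooks):

    - name: Reboot server and wait for it to come back
      become: yes
      reboot:
        reboot_timeout: 600   # seconds to wait for the host to return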

We could also reduce ControlPersist=600s in ansible.cfg to something like ControlPersist=30s. If I understand ControlPersist correctly, it's the time the connection will be held open after the last client connection is closed. It doesn't need to cover any long-running Ansible operation, as the connection would be open during those. I don't know why we'd need it to hang around for ten minutes after the last connection to the servers was closed.
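
That would mean changing the ssh_args line in install_files/ansible-base/ansible.cfg along these lines (an illustrative excerpt; any other options already on that line would stay as they are):

    [ssh_connection]
    ssh_args = -o ControlMaster=auto -o ControlPersist=30s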

@zenmonkeykstop (Contributor) commented:

Looking at the reboot module: it works well in most cases, but not for the reboot after a switch from direct SSH to SSH-over-Tor, so we probably need to keep the old method around for that case.

@zenmonkeykstop (Contributor) commented:

The reboot module works fine. Reducing ControlPersist actually results in more connection errors for me; I'd recommend bumping it up to 1200s rather than reducing it.

Two other points:

  • I tried increasing ConnectTimeout to 120s, but looking at verbose Ansible output, it's being set twice in the ssh commands, and the second setting at 60s overrides anything set in ansible.cfg. Still looking into that.
  • One option I looked into was to always purge ControlPath files when the playbook errors out. Doing this within Ansible is a PITA and I don't have a reliable method that doesn't involve a refactor of all roles. But doing it within the securedrop-admin script would probably work (sketched below), so I'm going to give that a try instead.
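
A minimal sketch of that idea in Python, assuming securedrop-admin removes whatever directory its ansible.cfg uses for ControlPath whenever ansible-playbook exits non-zero; the helper name and the ~/.ansible/cp default are assumptions, not the repository's actual code:

    import os
    import shutil

    # Hypothetical helper: the directory must match the ControlPath that
    # install_files/ansible-base/ansible.cfg configures in ssh_args.
    CONTROL_PATH_DIR = os.path.expanduser("~/.ansible/cp")

    def clear_ssh_control_path(path=CONTROL_PATH_DIR):
        """Remove stale SSH control sockets so the next run starts fresh."""
        shutil.rmtree(path, ignore_errors=True)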

@zenmonkeykstop (Contributor) commented:

More timeout fun! The SSH connection plugin always sets the ConnectTimeout option after setting the args defined in the [ssh_connection] section, so if you want to change it you need to modify the timeout setting in [defaults] instead (or set the ANSIBLE_TIMEOUT environment variable).
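
Concretely, that means something like this (the 120s value is just an example):

    [defaults]
    timeout = 120

or, equivalently, exporting ANSIBLE_TIMEOUT=120 before running the playbook; the SSH plugin then passes that value as -o ConnectTimeout.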

@eloquence (Member) commented:

@rmol Could you clarify the state of this PR, given recent Ansible changes? Should we still aim to pick it up again in a future sprint?

@rmol (Contributor, Author) commented on May 28, 2020

@eloquence I think it's worth putting on a sprint, as there are still reports of timeouts at reboots. I'd like to incorporate the reboot module changes @zenmonkeykstop tried, and to see if we can do better than removing the control sockets.

@eloquence (Member) commented:

@rmol reports he's not seen the "waiting for privilege escalation prompt" error recently. Let's monitor playbook runs during QA and prioritize accordingly. Bumping this off the sprint but keeping it in the backlog for visibility.

@eloquence (Member) commented:

Sadly @rmol confirms that the issue is still present, so we should still revisit this soon, in the next sprint at the latest.

@eloquence (Member) commented on Jul 8, 2020

Per discussion at sprint planning, let's try to land a minimal version of this that may mitigate, e.g., by clearing the control path after any playbook failure. Not a must-have for 1.5.0 so not adding to milestone.

@eloquence (Member) commented:

Deferring further investigation again, though there may be a relation with the timeouts we're seeing in #5401, which we'll poke at during the 7/23-8/5 sprint.

@eloquence added this to the 1.7.0 milestone on Dec 16, 2020
@eloquence (Member) commented:

Given that we'll be poking admins to finish up the v3 migration, it seems worth investing some effort to reduce the likelihood of Ansible failures; tentatively added to the 1.7.0 milestone (may need to be timeboxed).

Commit: The server alive messages make detection of failed connections quicker.
@rmol (Contributor, Author) commented on Jan 7, 2021

TL;DR -- I finally had some luck testing this, and have updated the branch with a change I think will mitigate the problem.

I wrote a simple playbook that reboots the servers, debugs the result, and updates the apt cache. Numerous runs were completely boring. Then mon took twice as long to reboot as app (~80 seconds versus ~40). That happened a couple of times.

Then one run saw the app reboot work fine, again, but the playbook just froze on the mon reboot. The mon SSH control master process was still running 45 minutes later. The Ansible reboot ssh command (shutdown -r 0 ...) never ended. The mon server had in fact rebooted and was fine.

I tried a separate SSH command using the same mon control master, which also froze, for about the length of ControlPersist (20 minutes) until there was a mux_client_request_session: read from master failed: Broken pipe error.

Then the playbook completed successfully. The mon server reboot was reported as taking 32 seconds. 🤯

The reboot task was using the default reboot_timeout, 600 seconds. The module docs suggest that timeout is evaluated for both verifying the reboot and running the test command. That the reboot was reported as taking 32 seconds suggests it is getting the output it needs to verify, but then hanging. 😐

I tore down the SSH control master connections, and on the next run, the reboot SSH commands for both servers were hanging. Both had in fact rebooted and were available. I waited over an hour, and the SSH processes sending the shutdown command to each server still existed.

I tried using the persistent connections and again, after the ControlPersist duration, they got the read from master failed error, the playbook completed, and both reboots were reported as taking 32 seconds.

Going through the Ansible issues on GitHub, there are many suspects. "It's become!" "It's the reboot module!" "It's pipelining!" The ways Ansible can hang seem endless and there's not much hope they'll ever all be fixed. And we're using SSH proxied via nc over Tor. Maybe we should be grateful it works as well as it does.

According to that comment, the clean way to destroy the persistent connections is meta: reset_connection after reboot, but that just caused an error, with stderr containing what would seem to me to be the expected Shared connection to 10.20.2.2 closed message.

Alternatively, SSH can be told to check the connection periodically, with -o ServerAliveInterval 10 -o ServerAliveCountMax 3. If it doesn't get a response after three tries (so 30 seconds) it will drop the connection.

I added those SSH options to install_files/ansible-base/ansible.cfg, stopped the SSH connections, and reran the playbook. The SSH connections went away while the servers were rebooting. The playbook completed immediately after the actual reboots, which took 46 and 51 seconds. The SSH connections were left up, as expected given ControlPersist=1200.

A full install and subsequent reinstall both ran without incident.

So I've updated this branch to just use the ServerAlive options in ansible.cfg.
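
The resulting ssh_args line looks roughly like this (an illustrative excerpt; the exact set and order of options is whatever the branch's ansible.cfg now contains):

    [ssh_connection]
    ssh_args = -o ControlMaster=auto -o ControlPersist=1200 -o ServerAliveInterval=10 -o ServerAliveCountMax=3

With ServerAliveInterval=10 and ServerAliveCountMax=3, the client notices a dead connection after roughly 30 seconds and tears it down, instead of blocking until ControlPersist expires.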

@conorsch (Contributor) left a comment

Impressive research, @rmol! Given your explanations, this sounds like a solid improvement to the connection handling. Given that this sets approximately a 30s window for retrying connections, I'd like to confirm that it works on reboots that are much longer—for instance, the 1U servers we do QA on usually take 2-3 minutes to reboot. We don't need to block merge on that condition, as normal QA procedures for 1.7.0 should suffice.

Review comment on install_files/ansible-base/ansible.cfg (resolved)
@codecov-io commented on Jan 7, 2021

Codecov Report

Merging #4366 (cf39ee3) into develop (8897a79) will not change coverage.
The diff coverage is n/a.

@@           Coverage Diff            @@
##           develop    #4366   +/-   ##
========================================
  Coverage    81.41%   81.41%           
========================================
  Files           53       53           
  Lines         3965     3965           
  Branches       496      496           
========================================
  Hits          3228     3228           
  Misses         632      632           
  Partials       105      105           

@rmol (Contributor, Author) commented on Jan 7, 2021

> Given that this sets approximately a 30s window for retrying connections, I'd like to confirm that it works on reboots that are much longer—for instance, the 1U servers we do QA on usually take 2-3 minutes to reboot.

In the admittedly few tests I ran, my NUCs took longer than 30s to reboot, enough that the SSH connections were terminated, and everything worked. The 600-second reboot module timeout should be what determines success here, so I think we should see the same result on machines that take longer to boot, but yes, I'll be interested to see how this works with those servers.

@kushaldas (Contributor) commented:

I can reproduce the issue by running molecule converge -s libvirt-staging-focal multiple times on the development branch. When I try the same from this branch, the reboot issue is gone.

@conorsch merged commit baf0bfa into freedomofpress:develop on Jan 11, 2021
Successfully merging this pull request may close these issues.

Stale Ansible SSH control master failures in securedrop-admin install
8 participants