Clear Ansible SSH control path on reboot #4366

Merged (1 commit) on Jan 11, 2021

Conversation

@rmol (Contributor) commented on Apr 24, 2019

Status

Ready for review

Description of Changes

When rebooting the SecureDrop servers in securedrop-admin install, remove the local Ansible SSH control path directory, ensuring stale connections won't break later steps.

Fixes #4364.
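
The change is roughly the following; this is an illustrative sketch, not the exact task as committed, and the ~/.ansible/cp location (Ansible's default control path directory) is an assumption:

    - name: Remove local Ansible SSH control path directory before rebooting
      # Runs on the admin workstation, not on the servers.
      local_action:
        module: file
        path: "{{ lookup('env', 'HOME') }}/.ansible/cp"
        state: absent
      become: no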

Testing

I was able to reliably induce a failure doing a clean installation of the release/0.12.2 branch. Without this change, the installation should fail with an error like Timeout (62s) waiting for privilege escalation prompt at the Set sysctl flags for grsecurity task.

With this change in place, the installation should make it past that step.

Deployment

This shouldn't affect the installed product; it only manipulates the filesystem on the admin workstation, reusing logic from restart-tor-carefully.yml.

Checklist

If you made changes to securedrop-admin:

  • Linting and tests (make -C admin test) pass in the admin development container

If you made non-trivial code changes:

  • I have written a test plan and validated it for this PR

@emkll (Contributor) commented on Apr 24, 2019

Tested as follows:

  • Provisioned prod VMs on this branch with SecureDrop 0.12.2~rc1 (apt-test.freedom.press)
  • Used SSH over Tor
  • Ran ./securedrop-admin install several times

On the second run, I get the following error at the very end of the run:

PLAY [Lock down firewall configuration for Application and Monitor Servers.] ***********************************

TASK [Gathering Facts] *****************************************************************************************
ok: [app]
fatal: [mon]: FAILED! => {"msg": "Timeout (62s) waiting for privilege escalation prompt: "}

NO MORE HOSTS LEFT *********************************************************************************************

NO MORE HOSTS LEFT *********************************************************************************************
	to retry, use: --limit @/home/amnesia/Persistent/securedrop/install_files/ansible-base/securedrop-prod.retry

PLAY RECAP *****************************************************************************************************
app                        : ok=106  changed=1    unreachable=0    failed=0   
localhost                  : ok=4    changed=0    unreachable=0    failed=0   
mon                        : ok=97   changed=1    unreachable=0    failed=1   

ERROR (run with -v for more): Command '['/home/amnesia/Persistent/securedrop/install_files/ansible-base/securedrop-prod.yml', '--ask-become-pass']' returned non-zero exit status 2

This should resolve the issue for first-time installs, but since the reboot task is not invoked on subsequent installs, it might not completely resolve it. Manually removing the control persist file resolves the problem, and the install succeeds on the subsequent run. Perhaps accompanying this with a docs update would be appropriate given the short turnaround for 0.12.2.

@rmol (Contributor, Author) commented on Apr 24, 2019

Maddening. I tested on hardware, three runs of direct SSH and two with SSH over Tor, and couldn't get the error to happen at firewall configuration.

@rmol (Contributor, Author) commented on Dec 5, 2019

Now that we're using Ansible 2.7, we may want to look into the reboot module for handling reboots more gracefully.
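
For reference, a minimal use of the module would look something like this (an illustrative sketch; the task name and timeout value are not taken from our playbooks):

    - name: Reboot server and wait for it to come back
      become: yes
      reboot:
        reboot_timeout: 600   # seconds to wait for the host to return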

We could also reduce ControlPersist=600s in ansible.cfg to something like ControlPersist=30s. If I understand ControlPersist correctly, it's the time the connection will be held open after the last client connection is closed. It doesn't need to cover any long-running Ansible operation, as the connection would be open during those. I don't know why we'd need it to hang around for ten minutes after the last connection to the servers was closed.
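
That would mean changing the ssh_args line in install_files/ansible-base/ansible.cfg along these lines (an illustrative excerpt; any other options already on that line would stay as they are):

    [ssh_connection]
    ssh_args = -o ControlMaster=auto -o ControlPersist=30s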

@zenmonkeykstop (Contributor) commented:

Looking at the reboot module: it works well in most cases, but not for the reboot after a switch from direct SSH to SSH-over-Tor, so we probably need to keep the old method around for that case.

@zenmonkeykstop (Contributor) commented:

The reboot module works fine. Reducing ControlPersist actually results in more connection errors for me; I'd recommend bumping it up to 1200s rather than reducing it.

Two other points:

  • I tried increasing ConnectTimeout to 120s, but looking at verbose Ansible output, it's being set twice in the ssh commands, and the second setting at 60s overrides anything set in ansible.cfg. Still looking into that.
  • One option I looked into was to always purge ControlPath files when the playbook errors out. Doing this within Ansible is a PITA and I don't have a reliable method that doesn't involve a refactor of all roles. But doing it within the securedrop-admin script would probably work (sketched below), so I'm going to give that a try instead.
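
A minimal sketch of that idea in Python, assuming securedrop-admin removes whatever directory its ansible.cfg uses for ControlPath whenever ansible-playbook exits non-zero; the helper name and the ~/.ansible/cp default are assumptions, not the repository's actual code:

    import os
    import shutil

    # Hypothetical helper: the directory must match the ControlPath that
    # install_files/ansible-base/ansible.cfg configures in ssh_args.
    CONTROL_PATH_DIR = os.path.expanduser("~/.ansible/cp")

    def clear_ssh_control_path(path=CONTROL_PATH_DIR):
        """Remove stale SSH control sockets so the next run starts fresh."""
        shutil.rmtree(path, ignore_errors=True)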

@zenmonkeykstop (Contributor) commented:

More timeout fun! The SSH connection plugin always sets the ConnectTimeout option after setting the args defined in the [ssh_connection] section, so if you want to change it you need to modify the timeout setting in [defaults] instead (or set the ANSIBLE_TIMEOUT environment variable).
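
Concretely, that means something like this (the 120s value is just an example):

    [defaults]
    timeout = 120

or, equivalently, exporting ANSIBLE_TIMEOUT=120 before running the playbook; the SSH plugin then passes that value as -o ConnectTimeout.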

@eloquence (Member) commented:

@rmol Could you clarify the state of this PR, given recent Ansible changes? Should we still aim to pick it up again in a future sprint?

@rmol (Contributor, Author) commented on May 28, 2020

@eloquence I think it's worth putting on a sprint, as there are still reports of timeouts at reboots. I'd like to incorporate the reboot module changes @zenmonkeykstop tried, and to see if we can do better than removing the control sockets.

@eloquence (Member) commented:

@rmol reports he's not seen the "waiting for privilege escalation prompt" error recently. Let's monitor playbook runs during QA and prioritize accordingly. Bumping this off the sprint but keeping it in the backlog for visibility.

@eloquence (Member) commented:

Sadly @rmol confirms that the issue is still present, so we should still revisit this soon, in the next sprint at the latest.

@eloquence (Member) commented on Jul 8, 2020

Per discussion at sprint planning, let's try to land a minimal version of this that may mitigate, e.g., by clearing the control path after any playbook failure. Not a must-have for 1.5.0 so not adding to milestone.

@eloquence (Member) commented:

Deferring further investigation again, though there may be a relation with the timeouts we're seeing in #5401, which we'll poke at during the 7/23-8/5 sprint.

@eloquence added this to the 1.7.0 milestone on Dec 16, 2020
@eloquence (Member) commented:

Given that we'll be poking admins to finish up the v3 migration, it seems worth investing some effort to reduce the likelihood of Ansible failures; tentatively added to the 1.7.0 milestone (may need to be timeboxed).

Commit: The server alive messages make detection of failed connections quicker.
@rmol (Contributor, Author) commented on Jan 7, 2021

TL;DR -- I finally had some luck testing this, and have updated the branch with a change I think will mitigate the problem.

I wrote a simple playbook that reboots the servers, debugs the result, and updates the apt cache. Numerous runs were completely boring. Then mon took twice as long to reboot as app (~80 seconds versus ~40). That happened a couple of times.

Then one run saw the app reboot work fine, again, but the playbook just froze on the mon reboot. The mon SSH control master process was still running 45 minutes later. The Ansible reboot ssh command (shutdown -r 0 ...) never ended. The mon server had in fact rebooted and was fine.

I tried a separate SSH command using the same mon control master, which also froze, for about the length of ControlPersist (20 minutes) until there was a mux_client_request_session: read from master failed: Broken pipe error.

Then the playbook completed successfully. The mon server reboot was reported as taking 32 seconds. 🤯

The reboot task was using the default reboot_timeout, 600 seconds. The module docs suggest that timeout is evaluated for both verifying the reboot and running the test command. That the reboot was reported as taking 32 seconds suggests it is getting the output it needs to verify, but then hanging. 😐

I tore down the SSH control master connections, and on the next run, the reboot SSH commands for both servers were hanging. Both had in fact rebooted and were available. I waited over an hour, and the SSH processes sending the shutdown command to each server still existed.

I tried using the persistent connections and again, after the ControlPersist duration, they got the read from master failed error, the playbook completed, and both reboots were reported as taking 32 seconds.

Going through the Ansible issues on GitHub, there are many suspects. "It's become!" "It's the reboot module!" "It's pipelining!" The ways Ansible can hang seem endless and there's not much hope they'll ever all be fixed. And we're using SSH proxied via nc over Tor. Maybe we should be grateful it works as well as it does.

According to that comment, the clean way to destroy the persistent connections is meta: reset_connection after reboot, but that just caused an error, with stderr containing what would seem to me to be the expected Shared connection to 10.20.2.2 closed message.

Alternatively, SSH can be told to check the connection periodically, with -o ServerAliveInterval 10 -o ServerAliveCountMax 3. If it doesn't get a response after three tries (so 30 seconds) it will drop the connection.

I added those SSH options to install_files/ansible-base/ansible.cfg, stopped the SSH connections, and reran the playbook. The SSH connections went away while the servers were rebooting. The playbook completed immediately after the actual reboots, which took 46 and 51 seconds. The SSH connections were left up, as expected given ControlPersist=1200.

A full install and subsequent reinstall both ran without incident.

So I've updated this branch to just use the ServerAlive options in ansible.cfg.
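
The resulting ssh_args line looks roughly like this (an illustrative excerpt; the exact set and order of options is whatever the branch's ansible.cfg now contains):

    [ssh_connection]
    ssh_args = -o ControlMaster=auto -o ControlPersist=1200 -o ServerAliveInterval=10 -o ServerAliveCountMax=3

With ServerAliveInterval=10 and ServerAliveCountMax=3, the client notices a dead connection after roughly 30 seconds and tears it down, instead of blocking until ControlPersist expires.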

@conorsch (Contributor) left a comment

Impressive research, @rmol! Given your explanations, this sounds like a solid improvement to the connection handling. Given that this sets approximately a 30s window for retrying connections, I'd like to confirm that it works on reboots that are much longer—for instance, the 1U servers we do QA on usually take 2-3 minutes to reboot. We don't need to block merge on that condition, as normal QA procedures for 1.7.0 should suffice.

Review comment on install_files/ansible-base/ansible.cfg (resolved)
@codecov-io commented on Jan 7, 2021

Codecov Report

Merging #4366 (cf39ee3) into develop (8897a79) will not change coverage.
The diff coverage is n/a.

@@           Coverage Diff            @@
##           develop    #4366   +/-   ##
========================================
  Coverage    81.41%   81.41%           
========================================
  Files           53       53           
  Lines         3965     3965           
  Branches       496      496           
========================================
  Hits          3228     3228           
  Misses         632      632           
  Partials       105      105           

@rmol (Contributor, Author) commented on Jan 7, 2021

> Given that this sets approximately a 30s window for retrying connections, I'd like to confirm that it works on reboots that are much longer—for instance, the 1U servers we do QA on usually take 2-3 minutes to reboot.

In the admittedly few tests I ran, my NUCs took longer than 30s to reboot, enough that the SSH connections were terminated, and everything worked. The 600-second reboot module timeout should be what determines success here, so I think we should see the same result on machines that take longer to boot, but yes, I'll be interested to see how this works with those servers.

@kushaldas (Contributor) commented:

I can reproduce the issue by running molecule converge -s libvirt-staging-focal multiple times on the development branch. When I try the same from this branch, the reboot issue is gone.

@conorsch merged commit baf0bfa into freedomofpress:develop on Jan 11, 2021
Successfully merging this pull request may close these issues.

Stale Ansible SSH control master failures in securedrop-admin install
8 participants