Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Try harder to attach to an existing tmux session #4231

Merged
merged 3 commits into from
Mar 14, 2019

Conversation

rmol
Copy link
Contributor

@rmol rmol commented Mar 6, 2019

Status

Ready for review

Description of Changes

If in the course of an ssh/tmux session the tmux package is upgraded and the ssh connection broken, the next ssh attempt will encounter an error because of the tmux protocol version mismatch.

The old tmux can be used to reattach by searching for the tmux process under /proc and using its reference to the old tmux executable. This change attempts to do that automatically if a first "tmux attach" attempt fails, but we can see that a previous session still exists.

Fixes #4221.

Testing

This is not deterministically reproducible, but I've had it happen on at least one of the servers during each upgrade from Trusty to Xenial, provided they are set up to use SSH over Tor. It seems more likely to occur on the app server, so try running securedrop-admin install with this branch, then execute sudo do-release-upgrade on the app server first. When you reconnect after libssl et al. are upgraded and Tor restarted, you should automatically resume the tmux session. If instead you get an error about the protocol mismatch, the fix hasn't worked.

Checklist

If you made changes to the system configuration:

If you made non-trivial code changes:

  • I have written a test plan and validated it for this PR

@rmol rmol force-pushed the fix-4221 branch 2 times, most recently from e6d1b7e to 830411c Compare March 7, 2019 01:28
@emkll
Copy link
Contributor

emkll commented Mar 7, 2019

I could reliably reproduce the tmux bug using prod VMs: setting the host to sleep or disconnecting the network would very reliably trigger the bug. I can confirm that for instances provisioned on this branch, this appears to fix the problem! 🎉

However, the fix only gets applied for new installs, or installs on which the installer was run, as the bashrc additions are added to the servers as part of the create_user task of the common role.

Our documentation currently does not instruct users to run ./securedrop-admin install before beginning the upgrade process [0]. This file could be updated on running instances via the securedrop-config package, since it's installed on both machines. Is this problem severe/frequent enough to warrant us deploying the fix in an unattended fashion? Can someone think of another way to deploy this to existing instances given our March 19 release for 0.12.1?

[0] : https://docs.securedrop.org/en/latest/upgrade/xenial_upgrade_in_place.html

@conorsch
Copy link
Contributor

conorsch commented Mar 7, 2019

Agreed with @emkll's concerns: in order for this change to be useful to Admins performing the Trusty -> Xenial upgrade, we want the helper functions in place before the upgrade is started—ideally without requesting yet another run of ./securedrop-admin install prior to the upgrade.

@rmol Take a stab at porting this new logic to the securedrop-config package (package files at install_files/securedrop-config/. Note that we already have a file for tmux settings written to /etc/; see here: https://github.com/freedomofpress/securedrop/blob/543357a5e691be62e4a19d87f46dc9e00a866e46/install_files/ansible-base/roles/common/files/bashrc_securedrop_additions

Those customizations, with the improvements you're presenting here, should be taken out of the Ansible logic and placed in the securedrop-config packaging logic. Haven't run the config tests on this branch yet, but I have a few comments there—will tack on in-line.

Copy link
Contributor

@conorsch conorsch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Requesting changes to convert solution from Ansible (which requires an install run prior to changes taking effect) to securedrop-config packaging logic (which will be applied automatically as part of the 0.12.1 release).

@rmol
Copy link
Contributor Author

rmol commented Mar 8, 2019

Moved /etc/profile.d/securedrop_additions.sh creation to the securedrop-config package. Started with Trusty, installed the updated .deb manually, and confirmed that during do-release-upgrade I could reconnect and reattach.

Also confirmed that having the existing file in /etc/profile.d doesn't cause problems when installing the updated package, either on Trusty or Xenial.

@rmol rmol force-pushed the fix-4221 branch 2 times, most recently from 15ac1df to 8502ca9 Compare March 11, 2019 18:52
@conorsch
Copy link
Contributor

conorsch commented Mar 12, 2019

Beginning local test. Given that the change is implemented via debian package modifications, we must use staging to test. Intend to use the following test plan:

  • Run rm -rf build/* to ensure no old debs are present
  • 🔶 Run make build-debs-trusty to create new debs used for testing - test failures reported
  • Set enable_ssh_over_true: true in install_files/ansible-base/group_vars/staging.yml
  • Run make staging-trusty to create Trusty VMs
  • Add app-ssh-aths info to host torrc
  • Open an interactive shell via SSH-over-Tor
  • Run sudo apt-mark showhold | xargs sudo apt-mark unhold to enable do-release-upgrade package resolution
  • Run sudo do-release-upgrade while connected over Tor
  • Confirm able to reconnect to tmux after service bounces disconnect the session

Will report back when testing is done.

@conorsch
Copy link
Contributor

Finished test plan. Can confirm I'm able to reconnect to tmux after a service bounce on the App Server. Encountered a few test failures that are quick fixes, so I'll comment in-line with some pointers.

@conorsch conorsch dismissed their stale review March 13, 2019 00:56

Requested changes have been implemented

@conorsch conorsch self-requested a review March 13, 2019 00:57
Copy link
Contributor

@conorsch conorsch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Packaging logic refactor looks great. Requesting minor changes from @rmol:

  • a few config tests are failing—see comments in-line

Also requesting additional review, specifically via the upgrade scenario. Given the plan to have these changes land in 0.12.1 in support of Xenial upgrades, we should also make sure to run upgrade tests from Trusty 0.12.0 -> Trusty 0.12.1~rc1. As written, the packaging logic should simply clobber the target file (that was previously configured via Ansible), but we must confirm that.

(Additional context: in modern debian package logic, any file in /etc/ will automatically be marked a conffile, which would not only not clobber, but would break unattended upgrades. Due to the legacy packaging logic we're using for securedrop-config, this not relevant. In support of guarding against regressions, it's worth considering adding a packaging config test to validate that the securedrop-config file has 0 conffiles.)

@@ -4,7 +4,7 @@ Priority: optional
Maintainer: SecureDrop Team <[email protected]>
Homepage: https://securedrop.org
Package: securedrop-config
Version: 0.1.2+0.13.0~rc1
Version: 0.1.3+0.13.0~rc1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The version bump here is absolutely correct. It requires matching changes in other places, though:

  • group_vars for staging (already handled)
  • config tests vars (specifically, see molecule/builder-trusty/tests/vars.yml)

The rationale here is that we want to ensure we're installing exactly a certain version, and also running the package checks on exactly that same version. By default, multiple package versions can pile up in build/. We could automatically run rm -rf build/* to avoid the pile-up, but that seems a bit heavy-handed for the dev env.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed config_version in vars.yml.


assert 'test -z "$TMUX" && (tmux attach || tmux new-session)' in f.content
assert f.contains(tmux_check)
assert host_file.sha256sum == h.hexdigest()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your motivation to DRY out the config tests here is understandable, but most of the config tests are intentionally duplicative, to enable us to refactor the provision logic confidently. For example, if you modify install_files/securedrop-config/etc/profile.d/securedrop_additions.sh locally, then build packages and install them, you should reasonably expect the corresponding config tests to fail. These changes would cause them to pass.

A broader discussion is warranted here about the pattern used for hardcoding values in the config tests. For now, simply for the sake of consistency, I'd prefer to keep them. Even a multiline string asserting an exact match on host_file.contents or host_file.content_string would be sufficient.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reverted to comparing the exact current content.

if test -z "$TMUX"
then
(tmux attach || tmux_attach_via_proc || tmux new-session)
fi
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not requesting additional comments in this file because the logic is cleanly presented and the corresponding commit message is top-notch.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That was a good not-request, though. No one should have to go find my top-notch commit message when trying to understand this. 😄

Copy link
Contributor

@kushaldas kushaldas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make staging-trusty failed with the following error.

    TASK [reboot-if-first-install : Wait for server to come back.] *****************
    fatal: [mon-staging -> localhost]: FAILED! => {"changed": false, "elapsed": 301, "msg": "Timeout when waiting for search string OpenSSH in 192.168.121.57:22"}
    fatal: [app-staging -> localhost]: FAILED! => {"changed": false, "elapsed": 301, "msg": "Timeout when waiting for search string OpenSSH in 192.168.121.163:22"}

@conorsch
Copy link
Contributor

@kushaldas The wait timeout you describe should only occur if you've overridden enable_ssh_over_tor: true, as part of the modified test plan. The staging playbook isn't smart enough to read that var, which is a shortcoming, but needn't block additional testing. After the reboot, you can continue to follow the subsequent test plan steps.

If in the course of an ssh/tmux session the tmux package is upgraded
and the ssh connection broken, the next ssh attempt will encounter an
error because of the tmux protocol version mismatch.

The old tmux can be used to reattach by searching for the tmux process
under /proc and using its reference to the old tmux executable. This
change attempts to do that automatically if a first "tmux attach"
attempt fails, but we can see that a previous session still exists.
Comment the tmux_attach_via_proc function in
securedrop_additions.sh.

Fix version check in builder-trusty/tests/vars.yml.

Fix test_sudoers_config so that it requires explicit content in
securedrop_additions.sh, instead of just checking that the installed
file exactly matches whatever's in the current revision.
emkll
emkll previously approved these changes Mar 13, 2019
Copy link
Contributor

@emkll emkll left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the fixes and the rebase @rmol ! This looks good to merge from me 👍

Since I've tested and confirmed the functionality in #4231 (comment), I focused on testing the refactoring into the deb packages using the following test plan:

  • Upgrade testing (xenial) confirm that /etc/profile.d/securedrop_additions.sh has been correctly updated
  • Upgrade testing (trusty) confirm that /etc/profile.d/securedrop_additions.sh has been correctly updated
  • Ensure CI passes, specifically test_sudoers_config
  • Inspect securedrop-config package and ensure securedrop_additions.sh is marked as conffile

None of the files in /etc/ are marked as conffiles per the deb package, so should be squashed for each upgrade, though this may change once the compat level is changed.

$ dpkg-deb -I build/trusty/securedrop-config-0.1.3+0.13.0~rc1-amd64.deb 
 new debian package, version 2.0.
 size 3194 bytes: control archive=2325 bytes.
     311 bytes,    10 lines      control              
    5341 bytes,   135 lines   *  postinst             #!/bin/sh
 Source: securedrop
 Section: web
 Priority: optional
 Maintainer: SecureDrop Team <[email protected]>
 Homepage: https://securedrop.org
 Package: securedrop-config
 Version: 0.1.3+0.13.0~rc1
 Architecture: all
 Description: Establishes baseline system state for running SecureDrop.
  Configures apt repositories.

I've also appended 690f0e3 to this branch to add some build-time tests.

At build time, let's ensure:
- no conffiles are present so that files in /etc are properly squashed
- securedrop-config contains the expected files
@conorsch
Copy link
Contributor

New test report from @emkll looks great, and the additional config tests are a mighty fine addition. Running the build logic locally to confirm all tests passing, then will mark approved and merge.

@conorsch conorsch dismissed kushaldas’s stale review March 14, 2019 01:02

Sufficient review provided by other maintainers, problem noted was artifact of test plan

@conorsch conorsch merged commit d154a43 into freedomofpress:develop Mar 14, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

During 14.04->16.04 upgrade, tmux is unable to re-attach
4 participants