Use poweroff + polling instead of qvm-kill for forced shutdowns #534

Closed
wants to merge 6 commits into from

Changes from 3 commits

94 changes: 83 additions & 11 deletions launcher/sdw_updater_gui/Updater.py
@@ -11,9 +11,19 @@
import logging
import os
import subprocess
import time
from datetime import datetime, timedelta
from enum import Enum

# Used for VM checks when run in Qubes dom0
try:
import qubesadmin

qubes = qubesadmin.Qubes()
except ImportError:
qubes = None

Contributor:

We should log something when qubes=None, and probably test these lines

Member Author:

If you have ideas for the error handling here, I'd appreciate guidance/collaboration on the branch. The issues I see:

  • handling the error, but only if we're on Qubes (platform.linux_distribution is deprecated; platform.uname seems our best bet, though it really only tells us that we're on a host called dom0)
  • ensuring we don't log to disk when the file is imported as part of the test suite
  • cleanly unit testing any added code without overcomplicating things
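
One possible shape for this, sketched here purely for illustration — it assumes a module-level logger is available at import time and uses the platform.uname() heuristic mentioned above to guess whether we are running in dom0:

import logging
import platform

sdlog = logging.getLogger(__name__)

try:
    import qubesadmin

    qubes = qubesadmin.Qubes()
except ImportError:
    qubes = None
    # Heuristic: only treat the missing module as an error when we appear to
    # be running in dom0; elsewhere (e.g. in the test suite) it is expected.
    if platform.uname().node == "dom0":
        sdlog.error("qubesadmin could not be imported; VM checks are disabled")

Since this logs through the module logger rather than writing to a file directly, nothing should reach disk unless the launcher has configured a file handler, which would keep test-suite imports quiet.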



DATE_FORMAT = "%Y-%m-%d %H:%M:%S"
DEFAULT_HOME = ".securedrop_launcher"
FLAG_FILE_STATUS_SD_APP = "/home/user/.securedrop_client/sdw-update-status"
@@ -403,8 +413,8 @@ def shutdown_and_start_vms():
on, for example.

All system AppVMs (sys-net, sys-firewall and sys-usb) need to be restarted.
We use qvm-kill for sys-firewall and sys-net, because a shutdown may fail
if they are currently in use as NetVMs by any of the user's other VMs.
We use a forced shutdown for sys-net and sys-firewall because a regular
shutdown command will return an error if they are in use as NetVMs.
"""

sdw_vms_in_order = [
@@ -433,17 +443,10 @@ def shutdown_and_start_vms():
sdlog.info("Safely shutting down system VM: {}".format(vm))
_safely_shutdown_vm(vm)

# TODO: Use of qvm-kill should be considered unsafe and may have unexpected
# side effects. We should aim for a more graceful shutdown strategy.
unsafe_sys_vms_in_order = ["sys-firewall", "sys-net"]
for vm in unsafe_sys_vms_in_order:
sdlog.info("Killing system VM: {}".format(vm))
try:
subprocess.check_output(["qvm-kill", vm], stderr=subprocess.PIPE)
except subprocess.CalledProcessError as e:
sdlog.error("Error while killing system VM: {}".format(vm))
sdlog.error(str(e))
sdlog.error(str(e.stderr))
sdlog.info("Forcing shutdown of system VM: {}".format(vm))
_force_shutdown_vm(vm)

all_sys_vms_in_order = safe_sys_vms_in_order + unsafe_sys_vms_in_order
sdlog.info("Starting fedora-based system VMs after updates")
@@ -465,6 +468,75 @@ def _safely_shutdown_vm(vm):
return UpdateStatus.UPDATES_FAILED


def _force_shutdown_vm(vm):
"""
Qubes does not yet provide an option to force shutdown for VMs that act as
NetVMs. Issuing a poweroff command and polling for the VM to stop running
is the recommended workaround until then.

Return value:
- True - if VM was successfully shut down or is not running
- False - if there was a problem shutting down the VM (logged as error)
"""
if vm not in qubes.domains:
sdlog.error(
"Error shutting down VM '{}'. No VM with this name found.".format(vm)
)
return False

if qubes.domains[vm].is_running() is False:
sdlog.info("VM '{}' is not running. No shutdown necessary.".format(vm))
return True

if qubes.domains[vm].is_paused():
sdlog.info("VM '{}' is paused. Unpausing before shutdown.".format(vm))
qubes.domains[vm].unpause()

try:
qubes.domains[vm].run("poweroff", user="root")
except subprocess.CalledProcessError as e:
# Exit codes 1 and 143 may occur with successful shutdown; log others
if e.returncode != 1 and e.returncode != 143:

Contributor:

How did you choose these return codes? Should we not log in all cases, in case the error code is unexpected and non-zero?

Member Author:

In my testing I was getting return code 1 almost all the time, and 143 some of the time. I'll confirm if those results hold, but if so, I'm not sure how useful it is to log return code 1.

Member Author:

Testing qvm-run -u root sys-net poweroff a few times, I mostly got exit code 143 and occasionally exit code 1, without any clear conditional differences; nevertheless the command completed in all cases. poweroff itself triggers SIGTERM, which is what exit code 143 signifies; I'm guessing that its parent shell process may sometimes terminate prematurely. I do not know why the command sometimes returns exit code 1, but I never got exit code 0.

CC'ing @marmarek in case he can shed light on the exit code behavior.

Reply:

I think the poweroff command doesn't return until the system is really off. SIGTERM is probably systemd terminating all the user processes during shutdown. And exit code 1 is the qrexec connection being terminated (because of domain shutdown) without sending any exit code first.

Member Author:

That explanation makes sense to me, thanks for jumping in. :)
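
For reference, exit status 143 matches the common shell convention of 128 plus the number of the terminating signal (SIGTERM is 15), which is consistent with the explanation above. A quick sanity check in Python:

import signal

# A process killed by SIGTERM (signal 15) is conventionally reported
# with exit status 128 + 15 = 143.
assert 128 + signal.SIGTERM == 143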

sdlog.error(
"Error shutting down VM '{}'. "
"poweroff command returned unexpected exit code {}.".format(
vm, e.returncode
)
)
return False

return _wait_for_is_running(vm, False)


def _wait_for_is_running(vm, expected, timeout=60, interval=0.2):

Contributor:

Do we expect to reuse this polling method with other VM states? If so, we should consider using the get_power_state() function of a domain in the Qubes API, which would give us more standard Qubes terminology for power state.

qubes.domains["sys-net"].get_power_state()
Running


Member Author:

I'm going to try making the _wait_for function generic, with this signature:

_wait_for(vm, condition, timeout=60, interval=0.2)

That way we can pass in any lambda we want. I personally prefer using a simple Boolean check for is_running but this gives us the flexibility to use different checks for different purposes.
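
A minimal sketch of what that generic helper could look like (illustrative only — passing the domain object to the condition is an assumption here, and the logging done by _wait_for_is_running is omitted for brevity):

def _wait_for(vm, condition, timeout=60, interval=0.2):
    """
    Poll until condition(domain) returns True for the named VM, giving up
    once the timeout is reached. Returns True on success, False on timeout.
    """
    stop_time = time.time() + timeout
    while time.time() < stop_time:
        if condition(qubes.domains[vm]):
            return True
        time.sleep(interval)
    return False

A shutdown caller could then use something like _wait_for(vm, lambda domain: not domain.is_running()).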

Member Author:

This is now done in 111ccb0, haven't tested in Qubes yet. 111ccb0#diff-ad9e526e0e4fea47e3196b3e293a4d50R508 shows an example invocation.

"""
Poll for a VM to enter the given is_running state, and give up after a
timeout is reached.

Return value:
- True if the VM reached the expected state
- False if it did not
"""
start_time = time.time()
stop_time = start_time + timeout

Contributor:

Have you tested several values for interval here?

Member Author:

No, happy to tweak / test a lower value. The motivation for an interval is to avoid causing undue load, especially during a 10-20 second wait.

while time.time() < stop_time:
state = qubes.domains[vm].is_running()
elapsed = time.time() - start_time
if state == expected:

Contributor:

It seems like using the get_power_state() state as described above might make more sense here, though it's functionally equivalent.

Member Author:

(See above)

sdlog.info(

Contributor:

For 5 runs, I get the following measurements:

  • sys-firewall: [17.94, 27.42, 17.94, 27.45, 17.50]
  • sys-net: [3.38, 3.25, 3.22, 0.00, 0.00]

Do you have any insight into the variance for sys-firewall? Have you seen a value of 0.00 for sys-net before?

Member Author (@eloquence, Apr 15, 2020):

If the API run command completes after the is_running state has already reached False, no polling will be required, and the elapsed time will be 0.00 (which is the best case). I've seen fairly inconsistent behavior when using qvm-run; sometimes it exits immediately, sometimes it aligns nicely with the shutdown sequence. I think we're at the mercy of the scheduler here. Fortunately, all poweroff invocations I've tried have been successful.

"VM '{}' entered expected state (is_running() is {}) "
"after {:.2f} seconds of polling.".format(vm, expected, elapsed)
)
return True
time.sleep(interval)

sdlog.error(
"VM '{}' did not enter expected state (is_running() is {}) "
"in the provided timeout of {} seconds.".format(vm, expected, timeout)
)
return False


def _safely_start_vm(vm):
try:
running_vms = subprocess.check_output(
88 changes: 28 additions & 60 deletions launcher/tests/test_updater.py
@@ -1,6 +1,7 @@
import json
import os
import pytest
import time
import subprocess
from importlib.machinery import SourceFileLoader
from datetime import datetime, timedelta
@@ -504,6 +505,24 @@ def test_apply_updates_dom0_failure(mocked_info, mocked_error, mocked_call):
mocked_error.assert_has_calls(error_log)


@mock.patch("Updater.sdlog.error")
@mock.patch("Updater.sdlog.info")
def test_vm_polling(mocked_info, mocked_error):
def mock_api(results):
for r in results:
yield r
time.sleep(0.1)

with mock.patch("Updater.qubes") as mocked_qubes:
mocked_qubes.domains = {"sys-net": mock.MagicMock()}
mocked_qubes.domains["sys-net"].is_running = mock.MagicMock(
side_effect=mock_api((True, True, False))
)
assert updater._wait_for_is_running("sys-net", False, timeout=1) is True
assert mocked_info.called
assert not mocked_error.called


@pytest.mark.parametrize("vm", current_templates.keys())
@mock.patch("subprocess.check_call", side_effect="0")
@mock.patch("Updater.sdlog.error")
@@ -668,15 +687,18 @@ def test_safely_shutdown_fails(mocked_info, mocked_error, mocked_call, vm):
@mock.patch("subprocess.check_output")
@mock.patch("Updater._safely_start_vm")
@mock.patch("Updater._safely_shutdown_vm")
@mock.patch("Updater._force_shutdown_vm")
@mock.patch("Updater.sdlog.error")
@mock.patch("Updater.sdlog.info")
def test_shutdown_and_start_vms(
mocked_info, mocked_error, mocked_shutdown, mocked_start, mocked_output
mocked_info,
mocked_error,
mocked_force_shutdown,
mocked_shutdown,
mocked_start,
mocked_output,
):
sys_vm_kill_calls = [
call(["qvm-kill", "sys-firewall"], stderr=-1),
call(["qvm-kill", "sys-net"], stderr=-1),
]
force_shutdown_calls = [call("sys-firewall"), call("sys-net")]
sys_vm_shutdown_calls = [
call("sys-usb"),
call("sys-whonix"),
@@ -705,7 +727,7 @@ def test_shutdown_and_start_vms(
call("sd-log"),
]
updater.shutdown_and_start_vms()
mocked_output.assert_has_calls(sys_vm_kill_calls)
mocked_force_shutdown.assert_has_calls(force_shutdown_calls)
mocked_shutdown.assert_has_calls(
template_vm_calls + app_vm_calls + sys_vm_shutdown_calls
)
@@ -714,60 +736,6 @@
assert not mocked_error.called


@mock.patch(
"subprocess.check_output",
side_effect=subprocess.CalledProcessError(1, "check_output"),
)
@mock.patch("Updater._safely_start_vm")
@mock.patch("Updater._safely_shutdown_vm")
@mock.patch("Updater.sdlog.error")
@mock.patch("Updater.sdlog.info")
def test_shutdown_and_start_vms_sysvm_fail(
mocked_info, mocked_error, mocked_shutdown, mocked_start, mocked_output
):
sys_vm_kill_calls = [
call(["qvm-kill", "sys-firewall"], stderr=-1),
call(["qvm-kill", "sys-net"], stderr=-1),
]
sys_vm_start_calls = [
call("sys-net"),
call("sys-firewall"),
call("sys-whonix"),
call("sys-usb"),
]
app_vm_calls = [
call("sd-app"),
call("sd-proxy"),
call("sd-whonix"),
call("sd-gpg"),
call("sd-log"),
]
template_vm_calls = [
call("fedora-30"),
call("sd-viewer-buster-template"),
call("sd-app-buster-template"),
call("sd-log-buster-template"),
call("sd-devices-buster-template"),
call("sd-proxy-buster-template"),
call("whonix-gw-15"),
call("securedrop-workstation-buster"),
]
error_calls = [
call("Error while killing system VM: sys-firewall"),
call("Command 'check_output' returned non-zero exit status 1."),
call("None"),
call("Error while killing system VM: sys-net"),
call("Command 'check_output' returned non-zero exit status 1."),
call("None"),
]
updater.shutdown_and_start_vms()
mocked_output.assert_has_calls(sys_vm_kill_calls)
mocked_shutdown.assert_has_calls(template_vm_calls + app_vm_calls)
app_vm_calls_reversed = list(reversed(app_vm_calls))
mocked_start.assert_has_calls(sys_vm_start_calls + app_vm_calls_reversed)
mocked_error.assert_has_calls(error_calls)


@pytest.mark.parametrize("status", UpdateStatus)
@mock.patch("subprocess.check_call")
@mock.patch("os.path.expanduser", return_value=temp_dir)