Use poweroff + polling instead of qvm-kill for forced shutdowns #534
Conversation
In my own testing, I was not able to reproduce #498 with these changes applied, but I've had that experience before. On the other hand, I was able to see #498 quickly (this time with …). If this approach looks sound and we still see #498 after it, my next best bet would be to use a similar method for starting VMs.
Thanks @eloquence, I took a first pass through this; the changes look good in principle, and I left some comments inline.
As you already mentioned earlier, coverage is still not at 100% for Updater.py:
sdw_updater_gui/Updater.py 310 19 94% 22, 481-508, 533-537
```python
    qubes.domains[vm].run("poweroff", user="root")
except subprocess.CalledProcessError as e:
    # Exit codes 1 and 143 may occur with successful shutdown; log others
    if e.returncode != 1 and e.returncode != 143:
```
How did you choose these return codes? Should we not log in all cases, in case the error code is unexpected and non-zero?
In my testing I was getting return code 1 almost all the time, and 143 some of the time. I'll confirm if those results hold, but if so, I'm not sure how useful it is to log return code 1.
Testing `qvm-run -u root sys-net poweroff` a few times, I mostly get exit code 143, and got exit code 1 a few times, without any clear conditional differences; nevertheless the command completed in all cases. `poweroff` itself triggers SIGTERM, which is what exit code 143 signifies; I'm guessing that its parent shell process may sometimes terminate prematurely. I do not know why the command sometimes returns exit code 1, but I never get exit code 0.
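(By the usual shell convention, 143 is 128 + 15, i.e. the process was terminated by signal 15, SIGTERM.)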
CC'ing @marmarek in case he can shed light on the exit code behavior.
I think the `poweroff` command doesn't return until the system is really off. SIGTERM is probably systemd terminating all the user processes during shutdown. And exit code 1 is the qrexec connection being terminated (because of domain shutdown) without sending any exit code first.
That explanation makes sense to me, thanks for jumping in. :)
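Putting the exit-code discussion together, here is a minimal, self-contained sketch of a forced shutdown that tolerates the two benign codes; the function name, logger setup, and messages are illustrative assumptions, not the exact code in this PR:

```python
import logging
import subprocess

import qubesadmin

sdlog = logging.getLogger(__name__)
qubes = qubesadmin.Qubes()


def force_shutdown(vm):
    """Ask `vm` to power itself off, tolerating benign exit codes."""
    try:
        # poweroff does not return until the VM is going down, so the qrexec
        # connection is often torn down before a real exit code is delivered.
        qubes.domains[vm].run("poweroff", user="root")
    except subprocess.CalledProcessError as e:
        # 143 = terminated by SIGTERM during shutdown; 1 = qrexec connection
        # closed without an exit code. Anything else is worth logging.
        if e.returncode not in (1, 143):
            sdlog.error(
                "Unexpected return code {} shutting down {}".format(e.returncode, vm)
            )
```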
launcher/sdw_updater_gui/Updater.py (outdated diff)
```python
    return _wait_for_is_running(vm, False)


def _wait_for_is_running(vm, expected, timeout=60, interval=0.2):
```
Do we expect to reuse this polling method with other VM states? If so, we should consider using the `get_power_state()` function of a domain in the qubes API, which would give us more standard Qubes terminology for power state:

```python
>>> qubes.domains["sys-net"].get_power_state()
'Running'
```
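For illustration, a power-state-based shutdown check might look roughly like this (a sketch; the helper name is hypothetical, and "Halted" is the state qubesadmin reports for a powered-off VM):

```python
import qubesadmin

qubes = qubesadmin.Qubes()


def _is_halted(vm):
    # Roughly equivalent to "not is_running()", in Qubes power-state terms.
    return qubes.domains[vm].get_power_state() == "Halted"
```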
I'm going to try making the `_wait_for` function generic, with this signature: `_wait_for(vm, condition, timeout=60, interval=0.2)`. That way we can pass in any lambda we want. I personally prefer using a simple Boolean check for `is_running`, but this gives us the flexibility to use different checks for different purposes.
This is now done in 111ccb0; I haven't tested it in Qubes yet. 111ccb0#diff-ad9e526e0e4fea47e3196b3e293a4d50R508 shows an example invocation.
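For readers without access to that commit, a rough sketch of what such a generic polling helper could look like (names, log messages, and the example lambda are assumptions based on this thread, not the contents of 111ccb0; `sdlog` and `qubes` stand for the module-level logger and `qubesadmin.Qubes()` collection used elsewhere in Updater.py):

```python
import time


def _wait_for(vm, condition, timeout=60, interval=0.2):
    """Poll condition(vm) until it returns True or `timeout` seconds elapse.

    Returns True if the condition was met in time, False otherwise.
    """
    start_time = time.time()
    stop_time = start_time + timeout
    while time.time() < stop_time:
        elapsed = time.time() - start_time
        if condition(vm):
            sdlog.info("Condition met for {} after {:.2f} seconds".format(vm, elapsed))
            return True
        time.sleep(interval)
    sdlog.error("Condition not met for {} within {} seconds".format(vm, timeout))
    return False


# Example: wait for a forced shutdown to complete.
# _wait_for("sys-net", lambda vm: not qubes.domains[vm].is_running())
```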
```python
    - False if it did not
    """
    start_time = time.time()
    stop_time = start_time + timeout
```
Have you tested several values for `interval` here?
No, happy to tweak / test a lower value. The motivation for an interval is to avoid causing undue load, especially during a 10-20 second wait.
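For a rough sense of scale: with the current interval of 0.2 seconds, a 20-second shutdown means about 100 `is_running()` calls over the qubesadmin API; dropping the interval to 0.1 would roughly double that, which should still be a negligible amount of load.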
launcher/sdw_updater_gui/Updater.py (outdated diff)
```python
    while time.time() < stop_time:
        state = qubes.domains[vm].is_running()
        elapsed = time.time() - start_time
        if state == expected:
```
It seems like using the `get_power_state()` state as described above might make more sense here, though it's functionally equivalent.
(See above)
```python
    qubes = qubesadmin.Qubes()
except ImportError:
    qubes = None
```
We should log something when `qubes = None`, and probably test these lines.
If you have ideas for the error handling here, I'd appreciate guidance/collaboration on the branch (see the sketch after this list for one possible starting point). The issues I see:
- handling the error, but only if we're on Qubes (`platform.linux_distribution` is deprecated; `platform.uname` seems our best bet, though it really only tells us that we're on a host called `dom0`);
- ensuring we don't log to disk when the file is imported as part of the test suite;
- cleanly unit testing any added code without overcomplicating things.
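One possible shape for the guard and logging discussed above, purely as a sketch (the `platform.uname().node == "dom0"` check and the log message are assumptions, not code from this branch):

```python
import logging
import platform

sdlog = logging.getLogger(__name__)

try:
    import qubesadmin

    qubes = qubesadmin.Qubes()
except ImportError:
    qubes = None
    # Only complain if we appear to actually be in dom0 on Qubes; on a dev
    # machine or under the test suite, qubesadmin is expected to be missing
    # and staying quiet avoids writing to the log during import.
    if platform.uname().node == "dom0":
        sdlog.error("qubesadmin is unavailable; VM shutdown/startup will fail")
```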
```python
        state = qubes.domains[vm].is_running()
        elapsed = time.time() - start_time
        if state == expected:
            sdlog.info(
```
For 5 runs, I get the following measurements (in seconds):
- sys-firewall: [17.94, 27.42, 17.94, 27.45, 17.50]
- sys-net: [3.38, 3.25, 3.22, 0.00, 0.00]

Do you have any insight into the variance for sys-firewall? Have you seen a value of 0.00 for sys-net before?
If the API `run` command completes after the `is_running` state has already reached `False`, no polling will be required, and the elapsed time will be 0.00 (which is the best case). I've seen fairly inconsistent behavior when using `qvm-run`; sometimes it exits immediately, sometimes it aligns nicely with the shutdown sequence. I think we're at the mercy of the scheduler here. Fortunately, all `poweroff` invocations I've tried have been successful.
Ran this version in Qubes a few times without errors. Like @emkll I'm noticing that polling is often not required after issuing a `poweroff` (it'll show polling for 0.00 seconds in the log), but it's intended to save our bacon in the cases when it is.
Except it maybe doesn't, which raises the question whether the added complexity is worth it. See the new report in #498 (comment). As a good practice, it's probably still best to avoid `qvm-kill`.
Per further investigation of #498, it now seems clear that at least some instances of the problem are not caused by our use of `qvm-kill`. I would still recommend proceeding with the removal of `qvm-kill`.
Since this does not in fact fully resolve #498, we've agreed it can wait until after the next RPM release (0.3.0). Deferring from current sprint (4/22-5/6) to near-term backlog.
@conorsch @emkll Judging by the latest reports on #498, it now seems likely that the […]. I'm curious how you both feel about the value of this change, with this in mind. I would suggest that we wait for the […].
@marmarek We're eager to stop using `qvm-kill`. You had mentioned here that you could look into backporting the […].
Yes, can do this week.
Thanks a lot @marmarek for poking at this. Leaving a note here that we'll also need to use this […].
I'm going to close this for now until we determine we need it. Would appreciate keeping the branch on the remote for the time being.
Status
Towards #498
Test plan
Preparatory steps
- Patch Updater.py (`patch Updater.py instant_karma.patch`) in order to skip most of the update run, since this PR only impacts the shut-down-and-reboot sequence.
- Apply the patch in both `/opt/securedrop/launcher` and `/srv/salt/launcher` (if only the `/opt` copy is overwritten, the updater itself will replace it on the next run).
- `tail -f ~/.securedrop_launcher/launcher.log` to follow along as you test.

Testing

- Run `/opt/securedrop/launcher/sdw-launcher.py --skip-delta 0`. This forces an updater run.
- Observe that the VMs (`sys-usb`, `sys-firewall`, `sys-net`) are powered off and back on.
- Observe whether you can repro "[0.2.3-rpm] libxenlight failed to create new domain sd-log" #498 (:crossed_fingers:).
- […] `0.1` and observe if you can now repro "[0.2.3-rpm] libxenlight failed to create new domain sd-log" #498 (it is expected that you'll now see it again). Caveat that the file needs to again be patched in two places (`/opt` and `srv/`) for repeat runs.

Checklist

- `make test` in `dom0` not run yet (but should not be impacted; launcher has its own test suite)