vm: call after-shutdown cleanup also from vm.kill and vm.shutdown
Cleaning up after domain shutdown (the domain-stopped and domain-shutdown
events) relies on libvirt events, which may be unreliable in some cases
(events may be processed with some delay, or, if libvirt was restarted in
the meantime, may not arrive at all). So, instead of only ensuring proper
ordering between the shutdown cleanup and the next startup, also trigger
the cleanup whenever we know for sure the domain isn't running (sketched
below):
 - at vm.kill() - after libvirt confirms domain was destroyed
 - at vm.shutdown(wait=True) - after successful shutdown
 - at vm.remove_from_disk() - after ensuring it isn't running but just
 before actually removing it
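
A condensed sketch of that idea (illustrative names only, not the actual
qubesvm.py code, which is shown in the diff below and uses
@asyncio.coroutine / yield from rather than async/await): every path that
positively knows the domain is down calls the cleanup itself, under the
startup lock, instead of relying solely on the libvirt event.

    import asyncio

    class DomainSketch:
        """Illustrative only; the real logic lives in qubes/vm/qubesvm.py."""

        def __init__(self):
            self.startup_lock = asyncio.Lock()
            self._shutdown_handled = False
            self._halted = False

        async def _ensure_shutdown_handled(self):
            # Idempotent, so every path that knows the domain is down
            # (kill, shutdown --wait, remove_from_disk, libvirt event)
            # may call it without firing the events twice.
            if self._shutdown_handled:
                return
            self._shutdown_handled = True
            await self.fire_event('domain-stopped')
            await self.fire_event('domain-shutdown')

        async def kill(self):
            async with self.startup_lock:      # serialize with the next start()
                self._destroy()                # returns once destruction is confirmed
                await self._ensure_shutdown_handled()

        async def shutdown(self, wait=False, timeout=5.0):
            self._request_shutdown()
            if wait:
                while timeout > 0 and not self._halted:
                    await asyncio.sleep(0.25)
                    timeout -= 0.25
                async with self.startup_lock:  # cleanup only while provably halted
                    if self._halted:
                        await self._ensure_shutdown_handled()

        # trivial stand-ins so the sketch is self-contained
        async def fire_event(self, name):
            print('event:', name)

        def _destroy(self):
            self._halted = True

        def _request_shutdown(self):
            self._halted = True

    asyncio.run(DomainSketch().kill())

The point is that kill()/shutdown() and the libvirt event handler all funnel
into one idempotent helper, so the ordering between them stops mattering.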

This fixes various race conditions:
 - qvm-kill && qvm-remove: the remove could happen before the shutdown
 cleanup was done, and the storage driver would be confused about that
 - qvm-shutdown --wait && qvm-clone: the clone could happen before the new
 content was committed to the original volume, making the copy reflect the
 previous VM state
(and probably more)

Previously this wasn't such a big issue in the default configuration,
because the LVM driver was fully synchronous, effectively blocking the
whole qubesd for the time the cleanup took.

To avoid code duplication, factor out an _ensure_shutdown_handled function
that performs the actual cleanup (and cancels the one scheduled from the
libvirt event, if any). Note that now the "Duplicated stopped event from
libvirt received!" warning may happen under normal circumstances, not only
because of some bug.
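
Roughly, the helper behaves like the following sketch (hypothetical names;
the real implementation additionally serializes on a per-domain lock and
marks the libvirt event as received): a handler scheduled by the libvirt
event is cancelled, and the Qubes events are fired here only if nobody
fired them yet.

    import asyncio

    class CleanupSketch:
        def __init__(self):
            self._stopped_future = None     # handler scheduled by the libvirt event
            self._events_fired = False      # Qubes domain-stopped/shutdown fired?

        def on_libvirt_domain_stopped(self):
            # Event path: schedule the cleanup instead of running it inline.
            self._stopped_future = asyncio.ensure_future(self._handle_stopped())

        async def _handle_stopped(self):
            await asyncio.sleep(0)          # events may arrive with some delay
            if not self._events_fired:
                self._events_fired = True
                print('domain-stopped / domain-shutdown (event path)')

        async def ensure_shutdown_handled(self):
            # Explicit path (kill/shutdown/remove_from_disk): cancel any pending
            # event handler, then fire the events here if nobody did yet.
            if self._stopped_future is not None:
                self._stopped_future.cancel()
                self._stopped_future = None
            if not self._events_fired:
                self._events_fired = True
                print('domain-stopped / domain-shutdown (explicit path)')

    async def demo():
        vm = CleanupSketch()
        vm.on_libvirt_domain_stopped()        # the event arrives...
        await vm.ensure_shutdown_handled()    # ...but kill() gets there first

    asyncio.run(demo())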

It is very important that the post-shutdown cleanup happens only when the
domain is not running. To ensure that, take startup_lock and, while holding
it, 1) make sure the domain is halted and only then 2) execute the cleanup.
This isn't necessary when removing the domain from disk, because it's
already removed from the collection at that point, which also prevents
other calls to it (see also the "vm/dispvm: fix DispVM cleanup" commit).
Actually, taking the startup_lock in remove_from_disk would cause a
deadlock in the DispVM auto-cleanup code, as the sketch after this call
chain illustrates:
 - vm.kill (or other trigger for the cleanup)
   - vm.startup_lock acquire   <====
     - vm._ensure_shutdown_handled
       - domain-shutdown event
         - vm._auto_cleanup (in DispVM class)
           - vm.remove_from_disk
             - cannot take vm.startup_lock again
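
For illustration, asyncio.Lock is not reentrant, so the second acquire in
the chain above would wait forever; a minimal, self-contained demonstration
with stand-in functions (not the Qubes code):

    import asyncio

    async def remove_from_disk(lock):
        # Reached (indirectly) from the domain-shutdown handler while kill()
        # still holds the lock, so this acquire never succeeds.
        async with lock:
            print('removing from disk')

    async def kill(lock):
        async with lock:                      # first acquire: vm.startup_lock
            # domain-shutdown -> DispVM._auto_cleanup -> remove_from_disk
            await remove_from_disk(lock)

    async def main():
        lock = asyncio.Lock()
        try:
            await asyncio.wait_for(kill(lock), timeout=1)
        except asyncio.TimeoutError:
            print('deadlock: startup_lock cannot be re-acquired by the same task')

    asyncio.run(main())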
marmarek committed Oct 26, 2018
1 parent 5be003d commit 2c1629d
Showing 1 changed file with 56 additions and 34 deletions: qubes/vm/qubesvm.py
@@ -878,6 +878,36 @@ def on_property_pre_del_autostart(self, event, name, oldvalue=None):
     # methods for changing domain state
     #
 
+    @asyncio.coroutine
+    def _ensure_shutdown_handled(self):
+        '''Make sure previous shutdown is fully handled.
+        MUST NOT be called when domain is running.
+        '''
+        with (yield from self._domain_stopped_lock):
+            # Don't accept any new stopped event's till a new VM has been
+            # created. If we didn't received any stopped event or it wasn't
+            # handled yet we will handle this in the next lines.
+            self._domain_stopped_event_received = True
+
+            if self._domain_stopped_future is not None:
+                # Libvirt stopped event was already received, so cancel the
+                # future. If it didn't generate the Qubes events yet we
+                # will do it below.
+                self._domain_stopped_future.cancel()
+                self._domain_stopped_future = None
+
+            if not self._domain_stopped_event_handled:
+                # No Qubes domain-stopped events have been generated yet.
+                # So do this now.
+
+                # Set this immediately such that we don't generate events
+                # twice if an exception gets thrown.
+                self._domain_stopped_event_handled = True
+
+                yield from self.fire_event_async('domain-stopped')
+                yield from self.fire_event_async('domain-shutdown')
+
+
     @asyncio.coroutine
     def start(self, start_guid=True, notify_function=None,
             mem_required=None):
@@ -894,29 +924,7 @@ def start(self, start_guid=True, notify_function=None,
         if self.get_power_state() != 'Halted':
             return self
 
-        with (yield from self._domain_stopped_lock):
-            # Don't accept any new stopped event's till a new VM has been
-            # created. If we didn't received any stopped event or it wasn't
-            # handled yet we will handle this in the next lines.
-            self._domain_stopped_event_received = True
-
-            if self._domain_stopped_future is not None:
-                # Libvirt stopped event was already received, so cancel the
-                # future. If it didn't generate the Qubes events yet we
-                # will do it below.
-                self._domain_stopped_future.cancel()
-                self._domain_stopped_future = None
-
-            if not self._domain_stopped_event_handled:
-                # No Qubes domain-stopped events have been generated yet.
-                # So do this now.
-
-                # Set this immediately such that we don't generate events
-                # twice if an exception gets thrown.
-                self._domain_stopped_event_handled = True
-
-                yield from self.fire_event_async('domain-stopped')
-                yield from self.fire_event_async('domain-shutdown')
+        yield from self._ensure_shutdown_handled()
 
         self.log.info('Starting {}'.format(self.name))
 
@@ -1030,8 +1038,8 @@ def on_libvirt_domain_stopped(self):
             return
 
         if self._domain_stopped_event_received:
-            self.log.warning('Duplicated stopped event from libvirt received!')
-            # ignore this unexpected event
+            # ignore this event - already triggered by shutdown(), kill(),
+            # or subsequent start()
             return
 
         self._domain_stopped_event_received = True
@@ -1088,8 +1096,12 @@ def shutdown(self, force=False, wait=False, timeout=None):
             while timeout > 0 and not self.is_halted():
                 yield from asyncio.sleep(0.25)
                 timeout -= 0.25
-            if not self.is_halted():
-                raise qubes.exc.QubesVMShutdownTimeoutError(self)
+            with (yield from self.startup_lock):
+                if self.is_halted():
+                    # make sure all shutdown tasks are completed
+                    yield from self._ensure_shutdown_handled()
+                else:
+                    raise qubes.exc.QubesVMShutdownTimeoutError(self)
 
         return self
 
@@ -1104,13 +1116,17 @@ def kill(self):
         if not self.is_running() and not self.is_paused():
             raise qubes.exc.QubesVMNotStartedError(self)
 
-        try:
-            self.libvirt_domain.destroy()
-        except libvirt.libvirtError as e:
-            if e.get_error_code() == libvirt.VIR_ERR_OPERATION_INVALID:
-                raise qubes.exc.QubesVMNotStartedError(self)
-            else:
-                raise
+        with (yield from self.startup_lock):
+            try:
+                self.libvirt_domain.destroy()
+            except libvirt.libvirtError as e:
+                if e.get_error_code() == libvirt.VIR_ERR_OPERATION_INVALID:
+                    raise qubes.exc.QubesVMNotStartedError(self)
+                else:
+                    raise
+
+            # make sure all shutdown tasks are completed
+            yield from self._ensure_shutdown_handled()
 
         return self
 
@@ -1477,6 +1493,12 @@ def remove_from_disk(self):
                 "Can't remove VM {!s}, beacuse it's in state {!r}.".format(
                     self, self.get_power_state()))
 
+        # make sure shutdown is handled before removing anything, but only if
+        # handling is pending; if not, we may be called from within
+        # domain-shutdown event (DispVM._auto_cleanup), which would deadlock
+        if not self._domain_stopped_event_handled:
+            yield from self._ensure_shutdown_handled()
+
         yield from self.fire_event_async('domain-remove-from-disk')
         try:
             # TODO: make it async?
