Add 'InstanceForceStop' action to forcefully stop all VCPUs #1534

andrew-d · 2020-01-21T01:00:51Z

Reason for This PR

In the case that the guest kernel is either non-cooperative (i.e. does not shut down when Ctrl-Alt-Delete is sent), or hangs and does not respond, it's convenient to be able to forcefully stop a microVM.

Description of Changes

Building off the work done in #1414, I simply wired through use of VcpuEvent::Exit.

Feedback is appreciated - this is my first time contributing to Firecracker!

License Acceptance

By submitting this pull request, I confirm that my contribution is made under
the terms of the Apache 2.0 license.

PR Checklist

[Reviewer TODO: Verify that these criteria are met. Request changes if not]

All commits in this PR are signed (git commit -s).
Either this PR is linked to an issue, or, the reason for this PR is
clearly provided.
The description of changes is clear and encompassing.
Either no docs need to be updated as part of this PR, or, the required
doc changes are included in this PR. Docs in scope are all *.md files
located either in the repository root, or in the docs/ directory.
Either no code has been touched, or, code-level documentation for touched
code is included in this PR.
Either no API changes are included in this PR, or, the API changes are
reflected in firecracker/swagger.yaml.
Either the changes in this PR have no user impact, or, the changes in
this PR have user impact and have been added to the CHANGELOG.md file.
Either no new unsafe code has been added, or, the newly added unsafe
code is unavoidable and properly documented.

In the case that the guest kernel is either non-cooperative (i.e. does not shut down when Ctrl-Alt-Delete is sent), or hangs and does not respond, it's convenient to be able to forcefully stop a microVM. The newly-added 'InstanceForceStop' action accomplishes this by telling all VCPUs in the microVM to exit, building off the work done in firecracker-microvm#1414. Signed-off-by: Andrew Dunham <[email protected]>

Signed-off-by: Andrew Dunham <[email protected]>

andrew-d · 2020-01-21T04:55:58Z

It looks like the integration tests are failing with a code coverage error:

integration_tests/build/test_coverage.py:94: in test_coverage
    assert coverage >= min_coverage, coverage_low_msg
E   AssertionError: Current code coverage (84.70%) is below the target (84.9%).
E   assert 84.7033661893249 >= 84.85000000000001

I would appreciate advice on how to test the new code - do the python-based integration tests not count towards code coverage?

acatangiu · 2020-01-21T12:17:34Z

@andrew-d the code you added looks good, but I don't see a clear usecase for it. The idea for the SendCtrlAltDel command is for the guest to get the chance to cleanly shut down.

From a use-case point of view, the functionality added by this PR is no different than simply kill-ing the firecracker process. As such, I don't see added value in adding and maintaining this proposed new functionality.

andrew-d · 2020-01-21T16:25:50Z

@acatangiu - Hello, and thank you for taking a look! I asked #1370 a while back about how to force-stop a Firecracker VM, and was told that SIGKILLing the Firecracker process was the recommended method, but that "the Firecracker process is killed, there is no guarantee of immediate cleanup". I don't know enough about KVM to know if there would be a resource leak or other issue with simply SIGKILLing the main Firecracker process; if doing that is perfectly safe and doesn't leak any KVM resources, then yes, this code is redundant.

I'm happy to look into the KVM side further, if you have any pointers?

The other minor use-case is that it's possible to grant access to an unprivileged client to call InstanceForceStop via the control socket. AFAIK, allowing someone to send SIGKILL to the Firecracker process would require that either they're (a) running as the same uid, which means that that uid also has direct access to /dev/kvm since Firecracker needs it, or (b) are running as root.

acatangiu · 2020-01-21T16:57:10Z

I don't know enough about KVM to know if there would be a resource leak or other issue with simply SIGKILLing the main Firecracker process; if doing that is perfectly safe and doesn't leak any KVM resources, then yes, this code is redundant.

Killing the Firecracker process will not leak any host resources, kvm or otherwise.

The problem with killing the firecracker process is that the guest does not get a chance to do any shutdown operations. IO buffers both in the guest kernel and in the firecracker process don't get flushed so disk drives might end up in dirty states. By using VcpuEvent::Exit events to stop the guest we could make sure to flush the firecracker internal buffers before closing the process, but without flushing the guest OS ones as well, it will not solve much - maybe just slightly decrease the probability of dirty disk states.

I'm happy to look into the KVM side further, if you have any pointers?

Looking into KVM further regarding what exactly?

andrew-d · 2020-01-21T17:05:11Z

Killing the Firecracker process will not leak any host resources, kvm or otherwise.

Good to know - thank you!

IO buffers both in the guest kernel and in the firecracker process don't get flushed so disk drives might end up in dirty states.

Yep, I would expect this to be true if I ever had to kill a Firecracker VM.

We could make sure to flush the firecracker internal buffers before closing the process, but without flushing the guest OS ones as well, it will not solve much - maybe just slightly decrease the probability of dirty disk states.

Does Firecracker guarantee that buffers are written to disk in the same order as the guest submits them? Ignoring Firecracker's buffers for a second: if the guest is using a journaling filesystem, then killing the VM while it's running is the same as a power loss event, and as long as the guest code properly uses fsync() and a journaling filesystem, modern systems should handle this just fine.

However, if Firecracker does not guarantee that it writes its' internal buffers to disk in order, then using InstanceForceStop becomes important, since that could break many filesystems' assumptions about how the disk works.

I'm happy to look into the KVM side further, if you have any pointers?

Looking into KVM further regarding what exactly?

Ah, sorry - I meant "if you didn't know off the top of your head whether there'd be a leak".

acatangiu · 2020-01-30T17:15:04Z

Ignoring Firecracker's buffers for a second: if the guest is using a journaling filesystem, then killing the VM while it's running is the same as a power loss event, and as long as the guest code properly uses fsync() and a journaling filesystem, modern systems should handle this just fine.

I think the API you propose would help in such a case. Let us think on it for a couple of days.

acatangiu · 2020-01-31T16:21:42Z

@andrew-d We've had a quick look at the code and our findings are that kill-ing the Firecracker process is as safe as this proposed API.

For the net and vsock devices we don't really care about guaranteeing flushing pending data for unresponsive guests. I don't believe that'd bring any value to an actual customer use-case. Please correct me if I'm wrong.

For block devices (emulated disks), Firecracker doesn't have any intermediary buffers outside of the virtio queues and all write()s are blocking writes. Virtio in-flight requests are handled according to the virtio protocol and as long as the guest driver is following that correctly, journalled filesystems will work fine regardless of when Firecracker is killed. For the guest a kill -9 fc-pid should be equivalent to a power loss.

For now, the extra API proposed in this PR doesn't bring any value. We should revisit this if/when we start looking at async-io.

andrew-d · 2020-01-31T21:25:07Z

Thank you very much for investigating this - it's greatly appreciated! Given the results, agreed that this change is unnecessary; thank you for your time.

andrew-d added 3 commits January 20, 2020 16:53

Update integration tests

2012469

Signed-off-by: Andrew Dunham <[email protected]>

Update documentation and CHANGELOG

0ae9913

Signed-off-by: Andrew Dunham <[email protected]>

andrew-d requested a review from acatangiu January 21, 2020 01:01

andrew-d added 2 commits January 20, 2020 20:39

rustfmt

d413655

Signed-off-by: Andrew Dunham <[email protected]>

python formatting

6846ab2

Signed-off-by: Andrew Dunham <[email protected]>

lauralt assigned andrew-d Jan 21, 2020

alxiord added the Status: Awaiting review Indicates that a pull request is ready to be reviewed label Jan 31, 2020

acatangiu closed this Jan 31, 2020

andrew-d deleted the andrew/force-stop branch January 31, 2020 21:25

richardcase mentioned this pull request Oct 20, 2021

Kill firecracker on shutdown if CtrlAltDel action did not work (ARM) liquidmetal-dev/flintlock#152

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add 'InstanceForceStop' action to forcefully stop all VCPUs #1534

Add 'InstanceForceStop' action to forcefully stop all VCPUs #1534

andrew-d commented Jan 21, 2020

andrew-d commented Jan 21, 2020

acatangiu commented Jan 21, 2020

andrew-d commented Jan 21, 2020

acatangiu commented Jan 21, 2020

andrew-d commented Jan 21, 2020

acatangiu commented Jan 30, 2020

acatangiu commented Jan 31, 2020

andrew-d commented Jan 31, 2020

Add 'InstanceForceStop' action to forcefully stop all VCPUs #1534

Add 'InstanceForceStop' action to forcefully stop all VCPUs #1534

Conversation

andrew-d commented Jan 21, 2020

Reason for This PR

Description of Changes

License Acceptance

PR Checklist

andrew-d commented Jan 21, 2020

acatangiu commented Jan 21, 2020

andrew-d commented Jan 21, 2020

acatangiu commented Jan 21, 2020

andrew-d commented Jan 21, 2020

acatangiu commented Jan 30, 2020

acatangiu commented Jan 31, 2020

andrew-d commented Jan 31, 2020