diff --git a/CHANGELOG.md b/CHANGELOG.md index 28e2b8f7bcc..e0ab1066e5a 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -17,6 +17,16 @@ and this project adheres to without MPTable support. Please see our [kernel policy documentation](docs/kernel-policy.md) for more information regarding relevant kernel configurations. +- [#4487](https://github.com/firecracker-microvm/firecracker/pull/4487): Added + support for the Virtual Machine Generation Identifier (VMGenID) device on + x86_64 platforms. VMGenID is a virtual device that allows VMMs to notify + guests when they are resumed from a snapshot. Linux includes VMGenID support + since version 5.18. It uses notifications from the device to reseed its + internal CSPRNG. Please refer to + [snapshot support](docs/snapshotting/snapshot-support.md) and + [random for clones](docs/snapshotting/random-for-clones.md) documentation for + more info on VMGenID. VMGenID state is part of the snapshot format of + Firecracker. As a result, the Firecracker snapshot version is now 2.0.0. ### Changed diff --git a/docs/snapshotting/random-for-clones.md b/docs/snapshotting/random-for-clones.md index f5598728f4f..18f0cf452d3 100644 --- a/docs/snapshotting/random-for-clones.md +++ b/docs/snapshotting/random-for-clones.md @@ -22,17 +22,19 @@ which wraps the [`AWS-LC` cryptographic library][9]. Traditionally, `/dev/random` has been considered a source of “true” randomness, with the downside that reads block when the pool of entropy gets depleted. On -the other hand, `/dev/urandom` doesn’t block, but provides lower quality -results. It turns out the distinction in output quality is actually very hard to -make. According to [this article][2], for kernel versions prior to 4.8, both -devices draw their output from the same pool, with the exception that -`/dev/random` will block when the system estimates the entropy count has -decreased below a certain threshold. 
The `/dev/urandom` output is considered -secure for virtually all purposes, with the caveat that using it before the -system gathers sufficient entropy for initialization may indeed produce low -quality random numbers. The `getrandom` syscall helps with this situation; it -uses the `/dev/urandom` source by default, but will block until it gets properly -initialized (the behavior can be altered via configuration flags). +the other hand, `/dev/urandom` doesn’t block, which led people to believe it +provides lower quality results. + +It turns out the distinction in output quality is actually very hard to make. +According to [this article][2], for kernel versions prior to 4.8, both devices +draw their output from the same pool, with the exception that `/dev/random` will +block when the system estimates the entropy count has decreased below a certain +threshold. The `/dev/urandom` output is considered secure for virtually all +purposes, with the caveat that using it before the system gathers sufficient +entropy for initialization may indeed produce low quality random numbers. The +`getrandom` syscall helps with this situation; it uses the `/dev/urandom` source +by default, but will block until it gets properly initialized (the behavior can +be altered via configuration flags). Newer kernels (4.8+) have switched to an implementation where `/dev/random` output comes from a pool called the blocking pool, the output of `/dev/urandom` @@ -41,6 +43,8 @@ and there’s also an input pool which gathers entropy from various sources available on the system, and is used to feed into or seed the other two components. A very detailed description is available [here][3]. +### Linux kernels from 4.8 until 5.17 (included) + The details of this newer implementation are used to make the recommendations present in the document. 
There are in-kernel interfaces used to obtain random numbers as well, but they are similar to using `/dev/urandom` (or `getrandom` @@ -99,6 +103,42 @@ not increase the current entropy estimation. There is also an `ioctl` interface which, given the appropriate privileges, can be used to add data to the input entropy pool while also increasing the count, or completely empty all pools. +### Linux kernels from 5.18 onwards + +Since version 5.18, Linux has support for the +[Virtual Machine Generation Identifier](https://learn.microsoft.com/en-us/windows/win32/hyperv_v2/virtual-machine-generation-identifier). +The purpose of VMGenID is to notify the guest about time shift events, such as +resuming from a snapshot. The device exposes a 16-byte cryptographically random +identifier in guest memory. Firecracker implements VMGenID. When resuming a +microVM from a snapshot, Firecracker writes a new identifier and injects a +notification to the guest. Linux +[uses this value](https://elixir.bootlin.com/linux/v5.18.19/source/drivers/virt/vmgenid.c#L77) +[as new randomness for its CSPRNG](https://elixir.bootlin.com/linux/v5.18.19/source/drivers/char/random.c#L908). +Quoting the random.c implementation of the kernel: + +``` +/* + * Handle a new unique VM ID, which is unique, not secret, so we + * don't credit it, but we do immediately force a reseed after so + * that it's used by the crng posthaste. + */ +``` + +As a result, values returned by `getrandom()` and `/dev/(u)random` are distinct +in all VMs started from the same snapshot, **after** the kernel handles the +VMGenID notification. This leaves a race window between resuming vCPUs and the +Linux CSPRNG getting successfully re-seeded. In Linux 6.8, we +[extended VMGenID](https://lore.kernel.org/lkml/20230531095119.11202-2-bchalios@amazon.es/) +to emit a uevent to user space when it handles the notification. User space can +poll this uevent to know when it is safe to use `getrandom()` et al., avoiding +the race condition. 
+ +Please note that Firecracker will always enable VMGenID. In kernels earlier +than 5.18, where there is no VMGenID driver, the device will not have any effect +in the guest. + +### User space considerations + Init systems (such as `systemd` used by AL2 and other distros) might save a random seed file after boot. For `systemd`, the path is `/var/lib/systemd/random-seed`. Just to be on the safe side, any such file @@ -121,8 +161,8 @@ alter the read result via bind mounting another file on top of and should be sufficient for most cases. - Use `virtio-rng`. When present, the guest kernel uses the device as an additional source of entropy. -- To be as safe as possible, the direct approach is to do the following (before - customer code is resumed in the clone): +- On kernels before 5.18, to be as safe as possible, the direct approach is to + do the following (before customer code is resumed in the clone): 1. Open one of the special devices files (either `/dev/random` or `/dev/urandom`). Take note that `RNDCLEARPOOL` no longer [has any effect][7] on the entropy pool. @@ -133,6 +173,13 @@ alter the read result via bind mounting another file on top of 1. Issue a `RNDRESEEDCRNG` ioctl call ([4.14][5], [5.10][6], (requires `CAP_SYS_ADMIN`)) that specifically causes the `CSPRNG` to be reseeded from the input pool. +- On kernels from 5.18 onwards, the CSPRNG will be automatically + reseeded when the guest kernel handles the VMGenID notification. To completely + avoid the race condition, users should follow the same steps as with kernels + \< 5.18. +- On kernels starting from 6.8, users can poll for the VMGenID uevent that the + driver sends when the CSPRNG is reseeded after handling the VMGenID + notification. 
**Annex 1 contains the source code of a C program which implements the previous three steps.** As soon as the guest kernel version switches to 4.19 (or higher), diff --git a/docs/snapshotting/snapshot-support.md b/docs/snapshotting/snapshot-support.md index ccd57e08ee4..136c9b6bce8 100644 --- a/docs/snapshotting/snapshot-support.md +++ b/docs/snapshotting/snapshot-support.md @@ -146,6 +146,10 @@ The snapshot functionality is still in developer preview due to the following: - If a [CPU template](../cpu_templates/cpu-templates.md) is not used on x86_64, overwrites of `MSR_IA32_TSX_CTRL` MSR value will not be preserved after restoring from a snapshot. +- Resuming from a snapshot that was taken during early stages of the guest + kernel boot might lead to crashes upon snapshot resume. We suggest that users + take snapshots after the guest microVM kernel has booted. Please see + [VMGenID device limitation](#vmgenid-device-limitation). ## Firecracker Snapshotting characteristics @@ -571,15 +575,32 @@ we also consider microVM A insecure if it resumes execution. ### Reusing snapshotted states securely -We are currently working to add a functionality that will notify guest operating -systems of the snapshot event in order to enable secure reuse of snapshotted -microVM states, guest operating systems, language runtimes, and cryptographic -libraries. In some cases, user applications will need to handle the snapshot -create/restore events in such a way that the uniqueness and randomness -properties are preserved and guaranteed before resuming the workload. - -We've started a discussion on how the Linux operating system might securely -handle being snapshotted [here](https://lkml.org/lkml/2020/10/16/629). +[Virtual Machine Generation Identifier](https://learn.microsoft.com/en-us/windows/win32/hyperv_v2/virtual-machine-generation-identifier) +(VMGenID) is a virtual device that allows VM guests to detect when they have +resumed from a snapshot. 
It works by exposing a cryptographically random +16-byte identifier to the guest. The VMM ensures that the value of the +identifier changes every time a time shift happens in the lifecycle of +the VM, e.g. when it resumes from a snapshot. + +Linux has supported VMGenID since version 5.18. When Linux detects a change in +the identifier, it uses its value to reseed its internal PRNG. Moreover, +[since version 6.8](https://lkml.org/lkml/2023/5/31/414), the Linux VMGenID +driver also emits a uevent to user space. User space processes can monitor this +uevent to detect snapshot resume events. + +Firecracker supports the VMGenID device on x86_64 platforms and always enables +it. During snapshot resume, Firecracker will update the 16-byte +generation ID and inject a notification in the guest before resuming its vCPUs. + +As a result, guests that run Linux versions >= 5.18 will re-seed their in-kernel +PRNG upon snapshot resume. User space applications can rely on the guest kernel +for randomness. State other than the guest kernel entropy pool, such as unique +identifiers, cached random numbers, cryptographic tokens, etc. **will** still be +replicated across multiple microVMs resumed from the same snapshot. Users need +to implement mechanisms for ensuring de-duplication of such state, where needed. +On guests that run Linux versions >= 6.8, users can make use of the uevent that +the VMGenID driver emits upon resuming from a snapshot, to be notified about +snapshot resume events. ## Vsock device limitation @@ -605,6 +626,16 @@ section 5.10.6.6 Device Events. Firecracker handles sending the `reset` event to the vsock driver, thus the customers are no longer responsible for closing active connections. +## VMGenID device limitation + +During snapshot resume, Firecracker updates the 16-byte generation ID of the +VMGenID device and injects an interrupt in the guest before resuming vCPUs. 
If +the snapshot was taken at a very early stage of the guest kernel boot process, +proper interrupt handling might not be in place yet. As a result, the kernel +might fail to handle the injected notification and crash. We suggest that +users take snapshots only after the guest kernel has completed +booting, to avoid this issue. + ## Snapshot compatibility across kernel versions We have a mechanism in place to experiment with snapshot compatibility across diff --git a/src/vmm/src/acpi/mod.rs b/src/vmm/src/acpi/mod.rs index 917b1f0edf8..ad108217d08 100644 --- a/src/vmm/src/acpi/mod.rs +++ b/src/vmm/src/acpi/mod.rs @@ -2,13 +2,14 @@ // SPDX-License-Identifier: Apache-2.0 use acpi_tables::fadt::{FADT_F_HW_REDUCED_ACPI, FADT_F_PWR_BUTTON, FADT_F_SLP_BUTTON}; -use acpi_tables::{Dsdt, Fadt, Madt, Rsdp, Sdt, Xsdt}; +use acpi_tables::{Aml, Dsdt, Fadt, Madt, Rsdp, Sdt, Xsdt}; use log::{debug, error}; use vm_allocator::AllocPolicy; use crate::acpi::x86_64::{ apic_addr, rsdp_addr, setup_arch_dsdt, setup_arch_fadt, setup_interrupt_controllers, }; +use crate::device_manager::acpi::ACPIDeviceManager; use crate::device_manager::mmio::MMIODeviceManager; use crate::device_manager::resources::ResourceAllocator; use crate::vstate::memory::{GuestAddress, GuestMemoryMmap}; @@ -74,12 +75,19 @@ impl<'a> AcpiTableWriter<'a> { } /// Build the DSDT table for the guest - fn build_dsdt(&mut self, mmio_device_manager: &MMIODeviceManager) -> Result { + fn build_dsdt( + &mut self, + mmio_device_manager: &MMIODeviceManager, + acpi_device_manager: &ACPIDeviceManager, + ) -> Result { let mut dsdt_data = Vec::new(); // Virtio-devices DSDT data dsdt_data.extend_from_slice(&mmio_device_manager.dsdt_data); + // Add GED and VMGenID AML data. 
+ acpi_device_manager.append_aml_bytes(&mut dsdt_data); + // Architecture specific DSDT data setup_arch_dsdt(&mut dsdt_data); @@ -155,6 +163,7 @@ pub(crate) fn create_acpi_tables( mem: &GuestMemoryMmap, resource_allocator: &mut ResourceAllocator, mmio_device_manager: &MMIODeviceManager, + acpi_device_manager: &ACPIDeviceManager, vcpus: &[Vcpu], ) -> Result<(), AcpiError> { let mut writer = AcpiTableWriter { @@ -162,7 +171,7 @@ pub(crate) fn create_acpi_tables( resource_allocator, }; - let dsdt_addr = writer.build_dsdt(mmio_device_manager)?; + let dsdt_addr = writer.build_dsdt(mmio_device_manager, acpi_device_manager)?; let fadt_addr = writer.build_fadt(dsdt_addr)?; let madt_addr = writer.build_madt(vcpus.len().try_into().unwrap())?; let xsdt_addr = writer.build_xsdt(fadt_addr, madt_addr)?; diff --git a/src/vmm/src/builder.rs b/src/vmm/src/builder.rs index 3ebc62ec0bb..36f49a65e39 100644 --- a/src/vmm/src/builder.rs +++ b/src/vmm/src/builder.rs @@ -37,10 +37,18 @@ use crate::cpu_config::templates::{ KvmCapability, }; #[cfg(target_arch = "x86_64")] +use crate::device_manager::acpi::ACPIDeviceManager; +#[cfg(target_arch = "x86_64")] use crate::device_manager::legacy::PortIODeviceManager; use crate::device_manager::mmio::MMIODeviceManager; use crate::device_manager::persist::MMIODevManagerConstructorArgs; +#[cfg(target_arch = "x86_64")] +use crate::device_manager::persist::{ + ACPIDeviceManagerConstructorArgs, ACPIDeviceManagerRestoreError, +}; use crate::device_manager::resources::ResourceAllocator; +#[cfg(target_arch = "x86_64")] +use crate::devices::acpi::vmgenid::{VmGenId, VmGenIdError}; use crate::devices::legacy::serial::SerialOut; #[cfg(target_arch = "aarch64")] use crate::devices::legacy::RTCDevice; @@ -70,6 +78,9 @@ use crate::{device_manager, EventManager, Vmm, VmmError}; pub enum StartMicrovmError { /// Unable to attach block device to Vmm: {0} AttachBlockDevice(io::Error), + /// Unable to attach the VMGenID device: {0} + #[cfg(target_arch = "x86_64")] + 
AttachVmgenidDevice(kvm_ioctls::Error), /// System configuration error: {0} ConfigureSystem(crate::arch::ConfigurationError), /// Failed to create guest config: {0} @@ -81,6 +92,9 @@ pub enum StartMicrovmError { /// Error creating legacy device: {0} #[cfg(target_arch = "x86_64")] CreateLegacyDevice(device_manager::legacy::LegacyDeviceError), + /// Error creating VMGenID device: {0} + #[cfg(target_arch = "x86_64")] + CreateVMGenID(VmGenIdError), /// Invalid Memory Configuration: {0} GuestMemory(crate::vstate::memory::MemoryError), /// Cannot load initrd due to an invalid memory configuration. @@ -160,6 +174,10 @@ fn create_vmm_and_vcpus( // Instantiate the MMIO device manager. let mmio_device_manager = MMIODeviceManager::new(); + // Instantiate ACPI device manager. + #[cfg(target_arch = "x86_64")] + let acpi_device_manager = ACPIDeviceManager::new(); + // For x86_64 we need to create the interrupt controller before calling `KVM_CREATE_VCPUS` // while on aarch64 we need to do it the other way around. 
#[cfg(target_arch = "x86_64")] @@ -215,6 +233,8 @@ fn create_vmm_and_vcpus( mmio_device_manager, #[cfg(target_arch = "x86_64")] pio_device_manager, + #[cfg(target_arch = "x86_64")] + acpi_device_manager, }; Ok((vmm, vcpus)) @@ -327,6 +347,9 @@ pub fn build_microvm_for_boot( #[cfg(target_arch = "aarch64")] attach_legacy_devices_aarch64(event_manager, &mut vmm, &mut boot_cmdline).map_err(Internal)?; + #[cfg(target_arch = "x86_64")] + attach_vmgenid_device(&mut vmm)?; + configure_system_for_boot( &mut vmm, vcpus.as_mut(), @@ -425,6 +448,11 @@ pub enum BuildMicrovmFromSnapshotError { MissingVmmSeccompFilters, /// Failed to apply VMM secccomp filter: {0} SeccompFiltersInternal(#[from] seccompiler::InstallationError), + /// Failed to restore ACPI device manager: {0} + #[cfg(target_arch = "x86_64")] + ACPIDeviManager(#[from] ACPIDeviceManagerRestoreError), + /// VMGenID update failed: {0} + VMGenIDUpdate(std::io::Error), } /// Builds and starts a microVM based on the provided MicrovmState. @@ -498,7 +526,7 @@ pub fn build_microvm_from_snapshot( // Restore devices states. let mmio_ctor_args = MMIODevManagerConstructorArgs { - mem: guest_memory, + mem: &guest_memory, vm: vmm.vm.fd(), event_manager, resource_allocator: &mut vmm.resource_allocator, @@ -511,6 +539,25 @@ pub fn build_microvm_from_snapshot( .map_err(MicrovmStateError::RestoreDevices)?; vmm.emulate_serial_init()?; + #[cfg(target_arch = "x86_64")] + { + let acpi_ctor_args = ACPIDeviceManagerConstructorArgs { + mem: &guest_memory, + resource_allocator: &mut vmm.resource_allocator, + vm: vmm.vm.fd(), + }; + + vmm.acpi_device_manager = + ACPIDeviceManager::restore(acpi_ctor_args, &microvm_state.acpi_dev_state)?; + + // Inject the notification to VMGenID that we have resumed from a snapshot. + // This needs to happen before we resume vCPUs, so that we minimize the time between vCPUs + // resuming and notification being handled by the driver. 
+ vmm.acpi_device_manager + .notify_vmgenid() + .map_err(BuildMicrovmFromSnapshotError::VMGenIDUpdate)?; + } + // Move vcpus to their own threads and start their state machine in the 'Paused' state. vmm.start_vcpus( vcpus, @@ -803,6 +850,7 @@ pub fn configure_system_for_boot( &vmm.guest_memory, &mut vmm.resource_allocator, &vmm.mmio_device_manager, + &vmm.acpi_device_manager, vcpus, )?; } @@ -868,6 +916,18 @@ pub(crate) fn attach_boot_timer_device( Ok(()) } +#[cfg(target_arch = "x86_64")] +fn attach_vmgenid_device(vmm: &mut Vmm) -> Result<(), StartMicrovmError> { + let vmgenid = VmGenId::new(&vmm.guest_memory, &mut vmm.resource_allocator) + .map_err(StartMicrovmError::CreateVMGenID)?; + + vmm.acpi_device_manager + .attach_vmgenid(vmgenid, vmm.vm.fd()) + .map_err(StartMicrovmError::AttachVmgenidDevice)?; + + Ok(()) +} + fn attach_entropy_device( vmm: &mut Vmm, cmdline: &mut LoaderKernelCmdline, @@ -1066,6 +1126,8 @@ pub mod tests { vm.memory_init(&guest_memory, false).unwrap(); let mmio_device_manager = MMIODeviceManager::new(); #[cfg(target_arch = "x86_64")] + let acpi_device_manager = ACPIDeviceManager::new(); + #[cfg(target_arch = "x86_64")] let pio_device_manager = PortIODeviceManager::new( Arc::new(Mutex::new(BusDevice::Serial(SerialWrapper { serial: Serial::with_events( @@ -1104,6 +1166,8 @@ pub mod tests { mmio_device_manager, #[cfg(target_arch = "x86_64")] pio_device_manager, + #[cfg(target_arch = "x86_64")] + acpi_device_manager, } } @@ -1221,6 +1285,12 @@ pub mod tests { .is_some()); } + #[cfg(target_arch = "x86_64")] + pub(crate) fn insert_vmgenid_device(vmm: &mut Vmm) { + attach_vmgenid_device(vmm).unwrap(); + assert!(vmm.acpi_device_manager.vmgenid.is_some()); + } + pub(crate) fn insert_balloon_device( vmm: &mut Vmm, cmdline: &mut Cmdline, diff --git a/src/vmm/src/device_manager/acpi.rs b/src/vmm/src/device_manager/acpi.rs new file mode 100644 index 00000000000..216039a7644 --- /dev/null +++ b/src/vmm/src/device_manager/acpi.rs @@ -0,0 +1,85 @@ +// 
Copyright 2024 Amazon.com, Inc. or its affiliates. All Rights Reserved. +// SPDX-License-Identifier: Apache-2.0 + +use acpi_tables::{aml, Aml}; +use kvm_ioctls::VmFd; + +use crate::devices::acpi::vmgenid::VmGenId; + +#[derive(Debug)] +pub struct ACPIDeviceManager { + /// VMGenID device + pub vmgenid: Option<VmGenId>, +} + +impl ACPIDeviceManager { + /// Create a new ACPIDeviceManager object + pub fn new() -> Self { + Self { vmgenid: None } + } + + /// Attach a new VMGenID device to the microVM + /// + /// This will register the device's interrupt with KVM + pub fn attach_vmgenid( + &mut self, + vmgenid: VmGenId, + vm_fd: &VmFd, + ) -> Result<(), kvm_ioctls::Error> { + vm_fd.register_irqfd(&vmgenid.interrupt_evt, vmgenid.gsi)?; + self.vmgenid = Some(vmgenid); + Ok(()) + } + + /// If it exists, notify guest VMGenID device that we have resumed from a snapshot. + pub fn notify_vmgenid(&mut self) -> Result<(), std::io::Error> { + if let Some(vmgenid) = &mut self.vmgenid { + vmgenid.notify_guest()?; + } + Ok(()) + } +} + +impl Aml for ACPIDeviceManager { + fn append_aml_bytes(&self, v: &mut Vec<u8>) { + // If we have a VMGenID device, create the AML for the device and GED interrupt handler + self.vmgenid.as_ref().inspect(|vmgenid| { + // AML for GED + aml::Device::new( + "_SB_.GED_".into(), + vec![ + &aml::Name::new("_HID".into(), &"ACPI0013"), + &aml::Name::new( + "_CRS".into(), + &aml::ResourceTemplate::new(vec![&aml::Interrupt::new( + true, + true, + false, + false, + vmgenid.gsi, + )]), + ), + &aml::Method::new( + "_EVT".into(), + 1, + true, + vec![&aml::If::new( + // We know that the maximum IRQ number fits in a u8. 
We have up to 32 + // IRQs in x86 and up to 128 in ARM (look into + // `vmm::crate::arch::layout::IRQ_MAX`) + #[allow(clippy::cast_possible_truncation)] + &aml::Equal::new(&aml::Arg(0), &(vmgenid.gsi as u8)), + vec![&aml::Notify::new( + &aml::Path::new("\\_SB_.VGEN"), + &0x80usize, + )], + )], + ), + ], + ) + .append_aml_bytes(v); + // AML for VMGenID itself. 
+ vmgenid.append_aml_bytes(v); + }); + } +} diff --git a/src/vmm/src/device_manager/mod.rs b/src/vmm/src/device_manager/mod.rs index 42ad46ca0a8..24b9c373bc1 100644 --- a/src/vmm/src/device_manager/mod.rs +++ b/src/vmm/src/device_manager/mod.rs @@ -5,6 +5,9 @@ // Use of this source code is governed by a BSD-style license that can be // found in the THIRD-PARTY file. +/// ACPI device manager. +#[cfg(target_arch = "x86_64")] +pub mod acpi; /// Legacy Device Manager. pub mod legacy; /// Memory Mapped I/O Manager. diff --git a/src/vmm/src/device_manager/persist.rs b/src/vmm/src/device_manager/persist.rs index da1adf0dc98..c9a5a9f2db8 100644 --- a/src/vmm/src/device_manager/persist.rs +++ b/src/vmm/src/device_manager/persist.rs @@ -12,10 +12,14 @@ use log::{error, warn}; use serde::{Deserialize, Serialize}; use vm_allocator::AllocPolicy; +#[cfg(target_arch = "x86_64")] +use super::acpi::ACPIDeviceManager; use super::mmio::*; use super::resources::ResourceAllocator; #[cfg(target_arch = "aarch64")] use crate::arch::DeviceType; +#[cfg(target_arch = "x86_64")] +use crate::devices::acpi::vmgenid::{VMGenIDState, VMGenIdConstructorArgs, VmGenId, VmGenIdError}; use crate::devices::virtio::balloon::persist::{BalloonConstructorArgs, BalloonState}; use crate::devices::virtio::balloon::{Balloon, BalloonError}; use crate::devices::virtio::block::device::Block; @@ -206,7 +210,7 @@ pub enum SharedDeviceType { } pub struct MMIODevManagerConstructorArgs<'a> { - pub mem: GuestMemoryMmap, + pub mem: &'a GuestMemoryMmap, pub vm: &'a VmFd, pub event_manager: &'a mut EventManager, pub resource_allocator: &'a mut ResourceAllocator, @@ -226,6 +230,59 @@ impl fmt::Debug for MMIODevManagerConstructorArgs<'_> { } } +#[cfg(target_arch = "x86_64")] +#[derive(Default, Debug, Clone, Serialize, Deserialize)] +pub struct ACPIDeviceManagerState { + vmgenid: Option<VMGenIDState>, +} + +#[cfg(target_arch = "x86_64")] +pub struct ACPIDeviceManagerConstructorArgs<'a> { + pub mem: &'a GuestMemoryMmap, + pub 
resource_allocator: &'a mut ResourceAllocator, + pub vm: &'a VmFd, +} + +#[cfg(target_arch = "x86_64")] +#[derive(Debug, thiserror::Error, displaydoc::Display)] +pub enum ACPIDeviceManagerRestoreError { + /// Could not register device: {0} + Interrupt(#[from] kvm_ioctls::Error), + /// Could not create VMGenID device: {0} + VMGenID(#[from] VmGenIdError), +} + +#[cfg(target_arch = "x86_64")] +impl<'a> Persist<'a> for ACPIDeviceManager { + type State = ACPIDeviceManagerState; + type ConstructorArgs = ACPIDeviceManagerConstructorArgs<'a>; + type Error = ACPIDeviceManagerRestoreError; + + fn save(&self) -> Self::State { + ACPIDeviceManagerState { + vmgenid: self.vmgenid.as_ref().map(|dev| dev.save()), + } + } + + fn restore( + constructor_args: Self::ConstructorArgs, + state: &Self::State, + ) -> std::result::Result<Self, Self::Error> { + let mut dev_manager = ACPIDeviceManager::new(); + if let Some(vmgenid_args) = &state.vmgenid { + let vmgenid = VmGenId::restore( + VMGenIdConstructorArgs { + mem: constructor_args.mem, + resource_allocator: constructor_args.resource_allocator, + }, + vmgenid_args, + )?; + dev_manager.attach_vmgenid(vmgenid, constructor_args.vm)?; + } + Ok(dev_manager) + } +} + impl<'a> Persist<'a> for MMIODeviceManager { type State = DeviceStates; type ConstructorArgs = MMIODevManagerConstructorArgs<'a>; @@ -360,7 +417,7 @@ impl<'a> Persist<'a> for MMIODeviceManager { state: &Self::State, ) -> Result { let mut dev_manager = MMIODeviceManager::new(); - let mem = &constructor_args.mem; + let mem = constructor_args.mem; let vm = constructor_args.vm; #[cfg(target_arch = "aarch64")] @@ -748,7 +805,7 @@ mod tests { let device_states: DeviceStates = Snapshot::deserialize(&mut buf.as_slice()).unwrap(); let vm_resources = &mut VmResources::default(); let restore_args = MMIODevManagerConstructorArgs { - mem: vmm.guest_memory().clone(), + mem: vmm.guest_memory(), vm: vmm.vm.fd(), event_manager: &mut event_manager, resource_allocator: &mut resource_allocator, 
diff --git 
a/src/vmm/src/devices/acpi/mod.rs b/src/vmm/src/devices/acpi/mod.rs new file mode 100644 index 00000000000..5151bddd231 --- /dev/null +++ b/src/vmm/src/devices/acpi/mod.rs @@ -0,0 +1,4 @@ +// Copyright 2024 Amazon.com, Inc. or its affiliates. All Rights Reserved. +// SPDX-License-Identifier: Apache-2.0 + +pub mod vmgenid; diff --git a/src/vmm/src/devices/acpi/vmgenid.rs b/src/vmm/src/devices/acpi/vmgenid.rs new file mode 100644 index 00000000000..b60343e473b --- /dev/null +++ b/src/vmm/src/devices/acpi/vmgenid.rs @@ -0,0 +1,179 @@ +// Copyright 2024 Amazon.com, Inc. or its affiliates. All Rights Reserved. +// SPDX-License-Identifier: Apache-2.0 + +use acpi_tables::{aml, Aml}; +use aws_lc_rs::error::Unspecified as RandError; +use aws_lc_rs::rand; +use log::{debug, error}; +use serde::{Deserialize, Serialize}; +use utils::eventfd::EventFd; +use vm_memory::{GuestAddress, GuestMemoryError}; +use vm_superio::Trigger; + +use super::super::legacy::EventFdTrigger; +use crate::device_manager::resources::ResourceAllocator; +use crate::snapshot::Persist; +use crate::vstate::memory::{Bytes, GuestMemoryMmap}; + +/// Virtual Machine Generation ID device +/// +/// VMGenID is an emulated device which exposes to the guest a 128-bit cryptographically random +/// integer value which will be different every time the virtual machine executes from a different +/// configuration file. In Firecracker terms this translates to a different value every time a new +/// microVM is created, either from scratch or restored from a snapshot. +/// +/// The device specification can be found here: https://go.microsoft.com/fwlink/?LinkId=260709 +#[derive(Debug)] +pub struct VmGenId { + /// Current generation ID of guest VM + pub gen_id: u128, + /// Interrupt line for notifying the device about generation ID changes + pub interrupt_evt: EventFdTrigger, + /// Guest physical address where VMGenID data lives. 
+ pub guest_address: GuestAddress, + /// GSI number for the device + pub gsi: u32, +} + +#[derive(Debug, thiserror::Error, displaydoc::Display)] +pub enum VmGenIdError { + /// Error with VMGenID interrupt: {0} + Interrupt(#[from] std::io::Error), + /// Error accessing VMGenID memory: {0} + GuestMemory(#[from] GuestMemoryError), + /// Create generation ID error: {0} + GenerationId(#[from] RandError), + /// Failed to allocate requested resource: {0} + Allocator(#[from] vm_allocator::Error), +} + +impl VmGenId { + /// Create a new Vm Generation Id device using an address in the guest for writing the + /// generation ID and a GSI for sending device notifications. + pub fn from_parts( + guest_address: GuestAddress, + gsi: u32, + mem: &GuestMemoryMmap, + ) -> Result<Self, VmGenIdError> { + debug!( + "vmgenid: building VMGenID device. Address: {:#010x}. IRQ: {}", + guest_address.0, gsi + ); + let interrupt_evt = EventFdTrigger::new(EventFd::new(libc::EFD_NONBLOCK)?); + let gen_id = Self::make_genid()?; + + // Write generation ID in guest memory + debug!( + "vmgenid: writing new generation ID to guest: {:#034x}", + gen_id + ); + mem.write_slice(&gen_id.to_le_bytes(), guest_address) + .inspect_err(|err| error!("vmgenid: could not write generation ID to guest: {err}"))?; + + Ok(Self { + gen_id, + interrupt_evt, + guest_address, + gsi, + }) + } + + /// Create a new VMGenID device + /// + /// Allocate memory and a GSI for sending notifications and build the device + pub fn new( + mem: &GuestMemoryMmap, + resource_allocator: &mut ResourceAllocator, + ) -> Result<Self, VmGenIdError> { + let gsi = resource_allocator.allocate_gsi(1)?; + let addr = resource_allocator.allocate_system_memory( + 4096, + 8, + vm_allocator::AllocPolicy::LastMatch, + )?; + + Self::from_parts(GuestAddress(addr), gsi[0], mem) + } + + // Create a 16-byte random number + fn make_genid() -> Result<u128, RandError> { + let mut gen_id_bytes = [0u8; 16]; + rand::fill(&mut gen_id_bytes) + .inspect_err(|err| 
error!("vmgenid: could not create new generation ID: 
{err}"))?; + Ok(u128::from_le_bytes(gen_id_bytes)) + } + + /// Send an ACPI notification to guest device. + /// + /// This will only have effect if we have updated the generation ID in guest memory, i.e. when + /// re-creating the device after snapshot resumption. + pub fn notify_guest(&mut self) -> Result<(), std::io::Error> { + self.interrupt_evt + .trigger() + .inspect_err(|err| error!("vmgenid: could not send guest notification: {err}"))?; + debug!("vmgenid: notifying guest about new generation ID"); + Ok(()) + } +} + +/// Logic to save/restore the state of a VMGenID device + +#[derive(Default, Debug, Clone, Serialize, Deserialize)] +pub struct VMGenIDState { + /// GSI used for VMGenID device + pub gsi: u32, + /// memory address of generation ID + pub addr: u64, +} + +#[derive(Debug)] +pub struct VMGenIdConstructorArgs<'a> { + pub mem: &'a GuestMemoryMmap, + pub resource_allocator: &'a mut ResourceAllocator, +} + +impl<'a> Persist<'a> for VmGenId { + type State = VMGenIDState; + type ConstructorArgs = VMGenIdConstructorArgs<'a>; + type Error = VmGenIdError; + + fn save(&self) -> Self::State { + VMGenIDState { + gsi: self.gsi, + addr: self.guest_address.0, + } + } + + fn restore( + constructor_args: Self::ConstructorArgs, + state: &Self::State, + ) -> std::result::Result<Self, Self::Error> { + constructor_args.resource_allocator.allocate_system_memory( + 4096, + 8, + vm_allocator::AllocPolicy::ExactMatch(state.addr), + )?; + Self::from_parts(GuestAddress(state.addr), state.gsi, constructor_args.mem) + } +} + +impl Aml for VmGenId { + fn append_aml_bytes(&self, v: &mut Vec<u8>) { + #[allow(clippy::cast_possible_truncation)] + let addr_low = self.guest_address.0 as u32; + let addr_high = (self.guest_address.0 >> 32) as u32; + aml::Device::new( + "_SB_.VGEN".into(), + vec![ + &aml::Name::new("_HID".into(), &"FCVMGID"), + &aml::Name::new("_CID".into(), &"VM_Gen_Counter"), + &aml::Name::new("_DDN".into(), &"VM_Gen_Counter"), + &aml::Name::new( + "ADDR".into(), + 
&aml::Package::new(vec![&addr_low, &addr_high]), + ), + ], + ) + .append_aml_bytes(v) + } +} diff --git a/src/vmm/src/devices/mod.rs b/src/vmm/src/devices/mod.rs index 42938002733..393b4234515 100644 --- a/src/vmm/src/devices/mod.rs +++ b/src/vmm/src/devices/mod.rs @@ -9,6 +9,8 @@ use std::io; +#[cfg(target_arch = "x86_64")] +pub mod acpi; pub mod bus; pub mod legacy; pub mod pseudo; diff --git a/src/vmm/src/lib.rs b/src/vmm/src/lib.rs index 3dc5b081fee..1584bd19158 100644 --- a/src/vmm/src/lib.rs +++ b/src/vmm/src/lib.rs @@ -117,7 +117,11 @@ use std::sync::mpsc::RecvTimeoutError; use std::sync::{Arc, Barrier, Mutex}; use std::time::Duration; +#[cfg(target_arch = "x86_64")] +use device_manager::acpi::ACPIDeviceManager; use device_manager::resources::ResourceAllocator; +#[cfg(target_arch = "x86_64")] +use devices::acpi::vmgenid::VmGenIdError; use event_manager::{EventManager as BaseEventManager, EventOps, Events, MutEventSubscriber}; use seccompiler::BpfProgram; use userfaultfd::Uffd; @@ -257,6 +261,9 @@ pub enum VmmError { VmmObserverInit(utils::errno::Error), /// Error thrown by observer object on Vmm teardown: {0} VmmObserverTeardown(utils::errno::Error), + /// VMGenID error: {0} + #[cfg(target_arch = "x86_64")] + VMGenID(#[from] VmGenIdError), } /// Shorthand type for KVM dirty page bitmap. 
@@ -318,6 +325,8 @@ pub struct Vmm { mmio_device_manager: MMIODeviceManager, #[cfg(target_arch = "x86_64")] pio_device_manager: PortIODeviceManager, + #[cfg(target_arch = "x86_64")] + acpi_device_manager: ACPIDeviceManager, } impl Vmm { @@ -522,6 +531,8 @@ impl Vmm { let device_states = self.mmio_device_manager.save(); let memory_state = self.guest_memory().describe(); + #[cfg(target_arch = "x86_64")] + let acpi_dev_state = self.acpi_device_manager.save(); Ok(MicrovmState { vm_info: vm_info.clone(), @@ -529,6 +540,8 @@ impl Vmm { vm_state, vcpu_states, device_states, + #[cfg(target_arch = "x86_64")] + acpi_dev_state, }) } diff --git a/src/vmm/src/persist.rs b/src/vmm/src/persist.rs index e2985e795ed..6c8058899f2 100644 --- a/src/vmm/src/persist.rs +++ b/src/vmm/src/persist.rs @@ -26,6 +26,8 @@ use crate::cpu_config::templates::StaticCpuTemplate; use crate::cpu_config::x86_64::cpuid::common::get_vendor_id_from_host; #[cfg(target_arch = "x86_64")] use crate::cpu_config::x86_64::cpuid::CpuidTrait; +#[cfg(target_arch = "x86_64")] +use crate::device_manager::persist::ACPIDeviceManagerState; use crate::device_manager::persist::{DevicePersistError, DeviceStates}; use crate::logger::{info, warn}; use crate::resources::VmResources; @@ -83,6 +85,9 @@ pub struct MicrovmState { pub vcpu_states: Vec<VcpuState>, /// Device states. pub device_states: DeviceStates, + /// ACPI devices state. + #[cfg(target_arch = "x86_64")] + pub acpi_dev_state: ACPIDeviceManagerState, } /// This describes the mapping between Firecracker base virtual address and @@ -155,7 +160,7 @@ pub enum CreateSnapshotError { } /// Snapshot version -pub const SNAPSHOT_VERSION: Version = Version::new(1, 0, 0); +pub const SNAPSHOT_VERSION: Version = Version::new(2, 0, 0); /// Creates a Microvm snapshot. 
pub fn create_snapshot( @@ -626,6 +631,8 @@ mod tests { use utils::tempfile::TempFile; use super::*; + #[cfg(target_arch = "x86_64")] + use crate::builder::tests::insert_vmgenid_device; use crate::builder::tests::{ default_kernel_cmdline, default_vmm, insert_balloon_device, insert_block_devices, insert_net_device, insert_vsock_device, CustomBlockConfig, @@ -686,6 +693,9 @@ mod tests { insert_vsock_device(&mut vmm, &mut cmdline, &mut event_manager, vsock_config); + #[cfg(target_arch = "x86_64")] + insert_vmgenid_device(&mut vmm); + vmm } @@ -717,6 +727,8 @@ mod tests { vm_state: vmm.vm.save_state(&mpidrs).unwrap(), #[cfg(target_arch = "x86_64")] vm_state: vmm.vm.save_state().unwrap(), + #[cfg(target_arch = "x86_64")] + acpi_dev_state: vmm.acpi_device_manager.save(), }; let mut buf = vec![0; 10000]; diff --git a/tests/conftest.py b/tests/conftest.py index 91c88d08c19..b98c00a5403 100644 --- a/tests/conftest.py +++ b/tests/conftest.py @@ -35,7 +35,7 @@ import host_tools.cargo_build as build_tools from framework import defs, utils -from framework.artifacts import kernel_params, rootfs_params +from framework.artifacts import kernel_params, kernels_unfiltered, rootfs_params from framework.microvm import MicroVMFactory from framework.properties import global_props from framework.utils_cpu_templates import ( @@ -352,6 +352,12 @@ def rootfs_fxt(request, record_property): guest_kernel_linux_5_10 = pytest.fixture( guest_kernel_fxt, params=kernel_params("vmlinux-5.10*") ) +# Use the unfiltered selector, since we don't officially support 6.1 yet. +# TODO: switch to default selector once we add full 6.1 support. 
+guest_kernel_linux_6_1 = pytest.fixture( + guest_kernel_fxt, + params=kernel_params("vmlinux-6.1*", select=kernels_unfiltered), +) # Fixtures for all Ubuntu rootfs, and specific versions rootfs = pytest.fixture(rootfs_fxt, params=rootfs_params("*.squashfs")) diff --git a/tests/framework/artifacts.py b/tests/framework/artifacts.py index da95746a766..37882c74e48 100644 --- a/tests/framework/artifacts.py +++ b/tests/framework/artifacts.py @@ -58,14 +58,27 @@ def kernels(glob) -> Iterator: break +def kernels_unfiltered(glob) -> Iterator: + """Return kernels from the CI artifacts. This one does not filter for + supported kernels. It will return any kernel in the CI artifacts folder + that matches the 'glob' + """ + all_kernels = [r"vmlinux-\d.\d+.\d+", r"vmlinux-5.10-no-sve-bin"] + for kernel in sorted(ARTIFACT_DIR.rglob(glob)): + for kernel_regex in all_kernels: + if re.fullmatch(kernel_regex, kernel.name): + yield kernel + break + + def disks(glob) -> Iterator: """Return supported rootfs""" yield from sorted(ARTIFACT_DIR.glob(glob)) -def kernel_params(glob="vmlinux-*") -> Iterator: +def kernel_params(glob="vmlinux-*", select=kernels) -> Iterator: """Return supported kernels""" - for kernel in kernels(glob): + for kernel in select(glob): yield pytest.param(kernel, id=kernel.name) diff --git a/tests/integration_tests/functional/test_max_devices.py b/tests/integration_tests/functional/test_max_devices.py index 5991cbd6926..a86724b25ab 100644 --- a/tests/integration_tests/functional/test_max_devices.py +++ b/tests/integration_tests/functional/test_max_devices.py @@ -6,9 +6,9 @@ import pytest -# IRQs are available from 5 to 23, so the maximum number of devices -# supported at the same time is 19. -MAX_DEVICES_ATTACHED = 19 +# IRQs are available from 5 to 23. We always use one IRQ for VMGenID device, so +# the maximum number of devices supported at the same time is 18. 
+MAX_DEVICES_ATTACHED = 18 @pytest.mark.skipif( diff --git a/tests/integration_tests/functional/test_snapshot_basic.py b/tests/integration_tests/functional/test_snapshot_basic.py index 02ed8d09c77..1014c84ecd2 100644 --- a/tests/integration_tests/functional/test_snapshot_basic.py +++ b/tests/integration_tests/functional/test_snapshot_basic.py @@ -13,6 +13,7 @@ import host_tools.drive as drive_tools from framework.microvm import SnapshotType +from framework.properties import global_props from framework.utils import check_filesystem, run_cmd, wait_process_termination from framework.utils_vsock import ( ECHO_SERVER_PORT, @@ -25,6 +26,20 @@ start_guest_echo_server, ) +# Kernel emits this message when it resumes from a snapshot with VMGenID device +# present +DMESG_VMGENID_RESUME = "random: crng reseeded due to virtual machine fork" + + +def check_vmgenid_update_count(vm, resume_count): + """ + Kernel will emit the DMESG_VMGENID_RESUME every time we resume + from a snapshot + """ + rc, stdout, stderr = vm.ssh.run("dmesg") + assert rc == 0, stderr + assert resume_count == stdout.count(DMESG_VMGENID_RESUME) + def _get_guest_drive_size(ssh_connection, guest_dev_name="/dev/vdb"): # `lsblk` command outputs 2 lines to STDOUT: @@ -118,6 +133,7 @@ def test_5_snapshots( microvm = microvm_factory.build() microvm.spawn() microvm.restore_from_snapshot(snapshot, resume=True) + # TODO: SIGCONT here and SIGSTOP later before creating snapshot # is a temporary fix to avoid vsock timeout in # _vsock_connect_to_guest(). This will be removed once we @@ -187,6 +203,7 @@ def test_patch_drive_snapshot(uvm_nano, microvm_factory): vm = microvm_factory.build() vm.spawn() vm.restore_from_snapshot(snapshot, resume=True) + # Attempt to connect to resumed microvm and verify the new microVM has the # right scratch drive. 
guest_drive_size = _get_guest_drive_size(vm.ssh) @@ -495,3 +512,50 @@ def test_snapshot_overwrite_self(guest_kernel, rootfs, microvm_factory): # restored, with a new snapshot of this vm, does not break the VM rc, _, stderr = vm.ssh.run("true") assert rc == 0, stderr + + +@pytest.mark.parametrize("snapshot_type", [SnapshotType.DIFF, SnapshotType.FULL]) +def test_vmgenid(guest_kernel_linux_6_1, rootfs, microvm_factory, snapshot_type): + """ + Test VMGenID device upon snapshot resume + """ + if global_props.cpu_architecture != "x86_64": + pytest.skip("At the moment we only support VMGenID on x86_64") + + base_vm = microvm_factory.build(guest_kernel_linux_6_1, rootfs) + base_vm.spawn() + base_vm.basic_config(track_dirty_pages=True) + base_vm.add_net_iface() + base_vm.start() + + # Wait for microVM to be booted + rc, _, stderr = base_vm.ssh.run("true") + assert rc == 0, stderr + + snapshot = base_vm.make_snapshot(snapshot_type) + base_snapshot = snapshot + base_vm.kill() + + for i in range(5): + vm = microvm_factory.build() + vm.spawn() + vm.restore_from_snapshot(snapshot, resume=True) + + # Make sure the microVM is up + rc, _, stderr = vm.ssh.run("true") + assert rc == 0, stderr + + # We should have as many DMESG_VMGENID_RESUME messages as + # snapshots we have resumed + check_vmgenid_update_count(vm, i + 1) + + snapshot = vm.make_snapshot(snapshot_type) + vm.kill() + + # If we are testing incremental snapshots we must merge the base with + # current layer. 
+ if snapshot.is_diff: + snapshot = snapshot.rebase_snapshot(base_snapshot) + + # Update the base for next iteration + base_snapshot = snapshot diff --git a/tests/integration_tests/security/test_vulnerabilities.py b/tests/integration_tests/security/test_vulnerabilities.py index b22a76a2ab6..15445d7c1e8 100644 --- a/tests/integration_tests/security/test_vulnerabilities.py +++ b/tests/integration_tests/security/test_vulnerabilities.py @@ -122,6 +122,9 @@ def with_restore(factory, microvm_factory): def restore(firecracker=None, jailer=None): microvm = factory(firecracker, jailer) + # Ensure that we have booted before getting the snapshot. + rc, _, stderr = microvm.ssh.run("true") + assert rc == 0, stderr snapshot = microvm.snapshot_full() if firecracker:
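
For reference, the check in `check_vmgenid_update_count` comes down to counting occurrences of one kernel log line in the guest's `dmesg` output. The sketch below shows that counting step in isolation, without a microVM or SSH connection. The helper name `count_vmgenid_resumes` and the sample log (including its timestamps and the boot line) are hypothetical and not part of this change; only the message string is taken from the test above.

```python
# Message the guest kernel logs when it reseeds its CSPRNG after a VMGenID
# notification (string copied from the test in this change).
DMESG_VMGENID_RESUME = "random: crng reseeded due to virtual machine fork"


def count_vmgenid_resumes(dmesg_output: str) -> int:
    """Hypothetical helper: count reseed messages in captured dmesg text."""
    return dmesg_output.count(DMESG_VMGENID_RESUME)


# Made-up dmesg capture standing in for a guest that was resumed twice.
sample_dmesg = "\n".join(
    [
        "[    0.000000] Linux version 6.1.0",
        "[   12.345678] " + DMESG_VMGENID_RESUME,
        "[   45.678901] " + DMESG_VMGENID_RESUME,
    ]
)

resumes = count_vmgenid_resumes(sample_dmesg)
print(resumes)
```

Because the count is cumulative over the guest's log, a test that resumes the same lineage of snapshots repeatedly would expect the count to grow by one per resume, which is why the loop above passes `i + 1`.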