diff --git a/CHANGELOG.md b/CHANGELOG.md
index 28e2b8f7bcc..e0ab1066e5a 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -17,6 +17,16 @@ and this project adheres to
   without MPTable support. Please see our [kernel policy
   documentation](docs/kernel-policy.md) for more information regarding relevant
   kernel configurations.
+- [#4487](https://github.com/firecracker-microvm/firecracker/pull/4487): Added
+  support for the Virtual Machine Generation Identifier (VMGenID) device on
+  x86_64 platforms. VMGenID is a virtual device that allows VMMs to notify
+  guests when they are resumed from a snapshot. Linux includes VMGenID support
+  since version 5.18. It uses notifications from the device to reseed its
+  internal CSPRNG. Please refer to
+  [snapshot support](docs/snapshotting/snapshot-support.md) and
+  [random for clones](docs/snapshotting/random-for-clones.md) documentation for
+  more info on VMGenID. VMGenID state is part of the snapshot format of
+  Firecracker. As a result, Firecracker snapshot version is now 2.0.0.
 
 ### Changed
 
diff --git a/docs/snapshotting/random-for-clones.md b/docs/snapshotting/random-for-clones.md
index f5598728f4f..18f0cf452d3 100644
--- a/docs/snapshotting/random-for-clones.md
+++ b/docs/snapshotting/random-for-clones.md
@@ -22,17 +22,19 @@ which wraps the [`AWS-LC` cryptographic library][9].
 
 Traditionally, `/dev/random` has been considered a source of “true” randomness,
 with the downside that reads block when the pool of entropy gets depleted. On
-the other hand, `/dev/urandom` doesn’t block, but provides lower quality
-results. It turns out the distinction in output quality is actually very hard to
-make. According to [this article][2], for kernel versions prior to 4.8, both
-devices draw their output from the same pool, with the exception that
-`/dev/random` will block when the system estimates the entropy count has
-decreased below a certain threshold. The `/dev/urandom` output is considered
-secure for virtually all purposes, with the caveat that using it before the
-system gathers sufficient entropy for initialization may indeed produce low
-quality random numbers. The `getrandom` syscall helps with this situation; it
-uses the `/dev/urandom` source by default, but will block until it gets properly
-initialized (the behavior can be altered via configuration flags).
+the other hand, `/dev/urandom` doesn’t block, which led people to believe that
+it provides lower quality results.
+
+It turns out the distinction in output quality is actually very hard to make.
+According to [this article][2], for kernel versions prior to 4.8, both devices
+draw their output from the same pool, with the exception that `/dev/random` will
+block when the system estimates the entropy count has decreased below a certain
+threshold. The `/dev/urandom` output is considered secure for virtually all
+purposes, with the caveat that using it before the system gathers sufficient
+entropy for initialization may indeed produce low quality random numbers. The
+`getrandom` syscall helps with this situation; it uses the `/dev/urandom` source
+by default, but will block until it gets properly initialized (the behavior can
+be altered via configuration flags).
 
 Newer kernels (4.8+) have switched to an implementation where `/dev/random`
 output comes from a pool called the blocking pool, the output of `/dev/urandom`
@@ -41,6 +43,8 @@ and there’s also an input pool which gathers entropy from various sources
 available on the system, and is used to feed into or seed the other two
 components. A very detailed description is available [here][3].
 
+### Linux kernels from 4.8 until 5.17 (included)
+
 The details of this newer implementation are used to make the recommendations
 present in the document. There are in-kernel interfaces used to obtain random
 numbers as well, but they are similar to using `/dev/urandom` (or `getrandom`
@@ -99,6 +103,42 @@ not increase the current entropy estimation. There is also an `ioctl` interface
 which, given the appropriate privileges, can be used to add data to the input
 entropy pool while also increasing the count, or completely empty all pools.
 
+### Linux kernels from 5.18 onwards
+
+Since version 5.18, Linux has support for the
+[Virtual Machine Generation Identifier](https://learn.microsoft.com/en-us/windows/win32/hyperv_v2/virtual-machine-generation-identifier).
+The purpose of VMGenID is to notify the guest about time shift events, such as
+resuming from a snapshot. The device exposes a 16-byte cryptographically random
+identifier in guest memory. Firecracker implements VMGenID. When resuming a
+microVM from a snapshot, Firecracker writes a new identifier and injects a
+notification to the guest. Linux
+[uses this value](https://elixir.bootlin.com/linux/v5.18.19/source/drivers/virt/vmgenid.c#L77)
+[as new randomness for its CSPRNG](https://elixir.bootlin.com/linux/v5.18.19/source/drivers/char/random.c#L908).
+Quoting the random.c implementation of the kernel:
+
+```
+/*
+ * Handle a new unique VM ID, which is unique, not secret, so we
+ * don't credit it, but we do immediately force a reseed after so
+ * that it's used by the crng posthaste.
+ */
+```
+
+As a result, values returned by `getrandom()` and `/dev/(u)random` are distinct
+in all VMs started from the same snapshot, **after** the kernel handles the
+VMGenID notification. This leaves a race window between resuming vCPUs and Linux
+CSPRNG getting successfully re-seeded. In Linux 6.8, we
+[extended VMGenID](https://lore.kernel.org/lkml/20230531095119.11202-2-bchalios@amazon.es/)
+to emit a uevent to user space when it handles the notification. User space can
+poll this uevent to know when it is safe to use `getrandom()` et al., avoiding
+the race condition.
+
+Please note that Firecracker will always enable VMGenID. In kernels earlier
+than 5.18, where there is no VMGenID driver, the device will not have any effect
+in the guest.
+
+### User space considerations
+
 Init systems (such as `systemd` used by AL2 and other distros) might save a
 random seed file after boot. For `systemd`, the path is
 `/var/lib/systemd/random-seed`. Just to be on the safe side, any such file
@@ -121,8 +161,8 @@ alter the read result via bind mounting another file on top of
   and should be sufficient for most cases.
 - Use `virtio-rng`. When present, the guest kernel uses the device as an
   additional source of entropy.
-- To be as safe as possible, the direct approach is to do the following (before
-  customer code is resumed in the clone):
+- On kernels before 5.18, to be as safe as possible, the direct approach is to
+  do the following (before customer code is resumed in the clone):
   1. Open one of the special devices files (either `/dev/random` or
      `/dev/urandom`). Take note that `RNDCLEARPOOL` no longer
      [has any effect][7] on the entropy pool.
@@ -133,6 +173,13 @@ alter the read result via bind mounting another file on top of
   1. Issue a `RNDRESEEDCRNG` ioctl call ([4.14][5], [5.10][6], (requires
      `CAP_SYS_ADMIN`)) that specifically causes the `CSPRNG` to be reseeded from
      the input pool.
+- On kernels from 5.18 onwards, the CSPRNG will be automatically
+  reseeded when the guest kernel handles the VMGenID notification. To completely
+  avoid the race condition, users should follow the same steps as with kernels
+  \< 5.18.
+- On kernels starting from 6.8, users can poll for the VMGenID uevent that the
+  driver sends when the CSPRNG is reseeded after handling the VMGenID
+  notification.
 
 **Annex 1 contains the source code of a C program which implements the previous
 three steps.** As soon as the guest kernel version switches to 4.19 (or higher),
diff --git a/docs/snapshotting/snapshot-support.md b/docs/snapshotting/snapshot-support.md
index ccd57e08ee4..136c9b6bce8 100644
--- a/docs/snapshotting/snapshot-support.md
+++ b/docs/snapshotting/snapshot-support.md
@@ -146,6 +146,10 @@ The snapshot functionality is still in developer preview due to the following:
 - If a [CPU template](../cpu_templates/cpu-templates.md) is not used on x86_64,
   overwrites of `MSR_IA32_TSX_CTRL` MSR value will not be preserved after
   restoring from a snapshot.
+- Resuming from a snapshot that was taken during early stages of the guest
+  kernel boot might lead to crashes upon snapshot resume. We suggest that users
+  take snapshots after the guest microVM kernel has booted. Please see
+  [VMGenID device limitation](#vmgenid-device-limitation).
 
 ## Firecracker Snapshotting characteristics
 
@@ -571,15 +575,32 @@ we also consider microVM A insecure if it resumes execution.
 
 ### Reusing snapshotted states securely
 
-We are currently working to add a functionality that will notify guest operating
-systems of the snapshot event in order to enable secure reuse of snapshotted
-microVM states, guest operating systems, language runtimes, and cryptographic
-libraries. In some cases, user applications will need to handle the snapshot
-create/restore events in such a way that the uniqueness and randomness
-properties are preserved and guaranteed before resuming the workload.
-
-We've started a discussion on how the Linux operating system might securely
-handle being snapshotted [here](https://lkml.org/lkml/2020/10/16/629).
+[Virtual Machine Generation Identifier](https://learn.microsoft.com/en-us/windows/win32/hyperv_v2/virtual-machine-generation-identifier)
+(VMGenID) is a virtual device that allows VM guests to detect when they have
+resumed from a snapshot. It works by exposing a cryptographically random
+16-byte identifier to the guest. The VMM ensures that the value of the
+identifier changes every time a time shift happens in the lifecycle of
+the VM, e.g. when it resumes from a snapshot.
+
+Linux supports VMGenID since version 5.18. When Linux detects a change in the
+identifier, it uses its value to reseed its internal PRNG. Moreover,
+[since version 6.8](https://lkml.org/lkml/2023/5/31/414) the Linux VMGenID
+driver also emits a uevent to user space. User space processes can monitor this
+uevent to detect snapshot resume events.
+
+Firecracker supports the VMGenID device on x86 platforms and always enables it.
+During snapshot resume, Firecracker updates the 16-byte generation ID and
+injects a notification into the guest before resuming its vCPUs.
+
+As a result, guests that run Linux versions >= 5.18 will re-seed their in-kernel
+PRNG upon snapshot resume. User space applications can rely on the guest kernel
+for randomness. State other than the guest kernel entropy pool, such as unique
+identifiers, cached random numbers, cryptographic tokens, etc. **will** still be
+replicated across multiple microVMs resumed from the same snapshot. Users need
+to implement mechanisms for ensuring de-duplication of such state, where needed.
+On guests that run Linux versions >= 6.8, users can make use of the uevent that
+the VMGenID driver emits upon resuming from a snapshot, to be notified about
+snapshot resume events.
 
 ## Vsock device limitation
 
@@ -605,6 +626,16 @@ section 5.10.6.6 Device Events.
 Firecracker handles sending the `reset` event to the vsock driver, thus the
 customers are no longer responsible for closing active connections.
 
+## VMGenID device limitation
+
+During snapshot resume, Firecracker updates the 16-byte generation ID of the
+VMGenID device and injects an interrupt in the guest before resuming vCPUs. If
+the snapshot was taken at the very early stages of the guest kernel boot
+process, proper interrupt handling might not be in place yet. As a result, the
+kernel might not be able to handle the injected notification and might crash. We
+suggest that users take snapshots only after the guest kernel has completed
+booting, to avoid this issue.
+
 ## Snapshot compatibility across kernel versions
 
 We have a mechanism in place to experiment with snapshot compatibility across