From af19d42d6ab7761f9eca6e11e34e519a7d4c84b2 Mon Sep 17 00:00:00 2001
From: Babis Chalios <bchalios@amazon.es>
Date: Thu, 4 Apr 2024 13:18:14 +0000
Subject: [PATCH] acpi: add documentation about VMGenID

Extend our current documentation for snapshotting and entropy
recommendations with context about VMGenID. Mention the available
VMGenID features depending on Linux version and also provide
recommendations for entropy on VM clones based on VMGenID availability.

Also, add CHANGELOG entry for VMGenID support.

Signed-off-by: Babis Chalios <bchalios@amazon.es>
---
 CHANGELOG.md                           |  9 ++++
 docs/snapshotting/random-for-clones.md | 69 +++++++++++++++++++++-----
 docs/snapshotting/snapshot-support.md  | 49 ++++++++++++++----
 3 files changed, 105 insertions(+), 22 deletions(-)

diff --git a/CHANGELOG.md b/CHANGELOG.md
index 28e2b8f7bcca..78ef783d67c8 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -17,6 +17,15 @@ and this project adheres to
   without MPTable support. Please see our
   [kernel policy documentation](docs/kernel-policy.md) for more information
   regarding relevant kernel configurations.
+- [#4487](https://github.com/firecracker-microvm/firecracker/pull/4487): Added
+  support for the Virtual Machine Generation Identifier (VMGenID) device on
+  x86_64 platforms. VMGenID is a virtual device that allows VMMs to notify
+  guests when they are resumed from a snapshot. Linux has supported VMGenID
+  since version 5.18 and uses notifications from the device to reseed its
+  internal CSPRNG. Please refer to
+  [snapshot support](docs/snapshotting/snapshot-support.md) and
+  [random for clones](docs/snapshotting/random-for-clones.md) documentation for
+  more info on VMGenID.
 
 ### Changed
 
diff --git a/docs/snapshotting/random-for-clones.md b/docs/snapshotting/random-for-clones.md
index f5598728f4fa..4ab74e47c107 100644
--- a/docs/snapshotting/random-for-clones.md
+++ b/docs/snapshotting/random-for-clones.md
@@ -22,17 +22,19 @@ which wraps the [`AWS-LC` cryptographic library][9].
 
 Traditionally, `/dev/random` has been considered a source of “true” randomness,
 with the downside that reads block when the pool of entropy gets depleted. On
-the other hand, `/dev/urandom` doesn’t block, but provides lower quality
-results. It turns out the distinction in output quality is actually very hard to
-make. According to [this article][2], for kernel versions prior to 4.8, both
-devices draw their output from the same pool, with the exception that
-`/dev/random` will block when the system estimates the entropy count has
-decreased below a certain threshold. The `/dev/urandom` output is considered
-secure for virtually all purposes, with the caveat that using it before the
-system gathers sufficient entropy for initialization may indeed produce low
-quality random numbers. The `getrandom` syscall helps with this situation; it
-uses the `/dev/urandom` source by default, but will block until it gets properly
-initialized (the behavior can be altered via configuration flags).
+the other hand, `/dev/urandom` doesn’t block, which led people to believe that
+it provides lower quality results.
+
+It turns out the distinction in output quality is actually very hard to make.
+According to [this article][2], for kernel versions prior to 4.8, both devices
+draw their output from the same pool, with the exception that `/dev/random` will
+block when the system estimates the entropy count has decreased below a certain
+threshold. The `/dev/urandom` output is considered secure for virtually all
+purposes, with the caveat that using it before the system gathers sufficient
+entropy for initialization may indeed produce low quality random numbers. The
+`getrandom` syscall helps with this situation; it uses the `/dev/urandom` source
+by default, but will block until it gets properly initialized (the behavior can
+be altered via configuration flags).
 
 Newer kernels (4.8+) have switched to an implementation where `/dev/random`
 output comes from a pool called the blocking pool, the output of `/dev/urandom`
@@ -41,6 +43,8 @@ and there’s also an input pool which gathers entropy from various sources
 available on the system, and is used to feed into or seed the other two
 components. A very detailed description is available [here][3].
 
+### Linux kernels from 4.8 until 5.17 (inclusive)
+
 The details of this newer implementation are used to make the recommendations
 present in the document. There are in-kernel interfaces used to obtain random
 numbers as well, but they are similar to using `/dev/urandom` (or `getrandom`
@@ -99,6 +103,38 @@ not increase the current entropy estimation. There is also an `ioctl` interface
 which, given the appropriate privileges, can be used to add data to the input
 entropy pool while also increasing the count, or completely empty all pools.
 
+### Linux kernels from 5.18 onwards
+
+Since version 5.18, Linux has support for the
+[Virtual Machine Generation Identifier](https://learn.microsoft.com/en-us/windows/win32/hyperv_v2/virtual-machine-generation-identifier).
+The purpose of VMGenID is to notify the guest about time shift events, such as
+resuming from a snapshot. The device exposes a 16-byte cryptographically random
+identifier in guest memory. Firecracker implements VMGenID. When resuming a
+microVM from a snapshot, Firecracker writes a new identifier and injects a
+notification to the guest. Linux
+[uses this value](https://elixir.bootlin.com/linux/v5.18.19/source/drivers/virt/vmgenid.c#L77)
+[as new randomness for its CSPRNG](https://elixir.bootlin.com/linux/v5.18.19/source/drivers/char/random.c#L908).
+Quoting the random.c implementation of the kernel:
+
+```
+/*
+ * Handle a new unique VM ID, which is unique, not secret, so we
+ * don't credit it, but we do immediately force a reseed after so
+ * that it's used by the crng posthaste.
+ */
+```
+
+As a result, values returned by `getrandom()` and `/dev/(u)random` are distinct
+in all VMs started from the same snapshot, **after** the kernel handles the
+VMGenID notification. This leaves a race window between resuming vCPUs and the
+Linux CSPRNG being successfully reseeded. In Linux 6.8, we
+[extended VMGenID](https://lore.kernel.org/lkml/20230531095119.11202-2-bchalios@amazon.es/)
+to emit a uevent to user space when it handles the notification. User space can
+poll this uevent to know when it is safe to use `getrandom()`, et al., avoiding
+the race condition.
+
+### User space considerations
+
 Init systems (such as `systemd` used by AL2 and other distros) might save a
 random seed file after boot. For `systemd`, the path is
 `/var/lib/systemd/random-seed`. Just to be on the safe side, any such file
@@ -121,8 +157,8 @@ alter the read result via bind mounting another file on top of
   and should be sufficient for most cases.
 - Use `virtio-rng`. When present, the guest kernel uses the device as an
   additional source of entropy.
-- To be as safe as possible, the direct approach is to do the following (before
-  customer code is resumed in the clone):
+- On kernels before 5.18, to be as safe as possible, the direct approach is to
+  do the following (before customer code is resumed in the clone):
   1. Open one of the special devices files (either `/dev/random` or
      `/dev/urandom`). Take note that `RNDCLEARPOOL` no longer
      [has any effect][7] on the entropy pool.
@@ -133,6 +169,13 @@ alter the read result via bind mounting another file on top of
   1. Issue a `RNDRESEEDCRNG` ioctl call ([4.14][5], [5.10][6], (requires
      `CAP_SYS_ADMIN`)) that specifically causes the `CSPRNG` to be reseeded from
      the input pool.
+- On kernels from 5.18 onwards, the CSPRNG will be automatically reseeded when
+  the guest kernel handles the VMGenID notification. To completely avoid the
+  race condition, users should follow the same steps as with kernels
+  \< 5.18.
+- On kernels starting from 6.8, users can poll for the VMGenID uevent that the
+  driver sends when the CSPRNG is reseeded after handling the VMGenID
+  notification.
 
 **Annex 1 contains the source code of a C program which implements the previous
 three steps.** As soon as the guest kernel version switches to 4.19 (or higher),
diff --git a/docs/snapshotting/snapshot-support.md b/docs/snapshotting/snapshot-support.md
index ccd57e08ee40..347829c4890b 100644
--- a/docs/snapshotting/snapshot-support.md
+++ b/docs/snapshotting/snapshot-support.md
@@ -146,6 +146,10 @@ The snapshot functionality is still in developer preview due to the following:
 - If a [CPU template](../cpu_templates/cpu-templates.md) is not used on x86_64,
   overwrites of `MSR_IA32_TSX_CTRL` MSR value will not be preserved after
   restoring from a snapshot.
+- Resuming from a snapshot that was taken during early stages of the guest
+  kernel boot might lead to crashes upon resume. We suggest that users take
+  snapshots after the guest microVM kernel has booted. Please see
+  [VMGenID device limitation](#vmgenid-device-limitation).
 
 ## Firecracker Snapshotting characteristics
 
@@ -571,15 +575,32 @@ we also consider microVM A insecure if it resumes execution.
 
 ### Reusing snapshotted states securely
 
-We are currently working to add a functionality that will notify guest operating
-systems of the snapshot event in order to enable secure reuse of snapshotted
-microVM states, guest operating systems, language runtimes, and cryptographic
-libraries. In some cases, user applications will need to handle the snapshot
-create/restore events in such a way that the uniqueness and randomness
-properties are preserved and guaranteed before resuming the workload.
-
-We've started a discussion on how the Linux operating system might securely
-handle being snapshotted [here](https://lkml.org/lkml/2020/10/16/629).
+[Virtual Machine Generation Identifier](https://learn.microsoft.com/en-us/windows/win32/hyperv_v2/virtual-machine-generation-identifier)
+(VMGenID) is a virtual device that allows VM guests to detect when they have
+resumed from a snapshot. It works by exposing a cryptographically random
+16-byte identifier to the guest. The VMM ensures that the value of the
+identifier changes every time a time shift happens in the lifecycle of the VM,
+e.g. when it resumes from a snapshot.
+
+Linux has supported VMGenID since version 5.18. When Linux detects a change in
+the identifier, it uses its value to reseed its internal PRNG. Moreover,
+[since version 6.8](https://lkml.org/lkml/2023/5/31/414) the Linux VMGenID driver
+also emits a uevent to user space. User space processes can monitor this uevent
+to detect snapshot resume events.
+
+Firecracker supports the VMGenID device on x86_64 platforms and always enables
+it. During snapshot resume, Firecracker will update the 16-byte generation ID
+and inject a notification in the guest before resuming its vCPUs.
+
+As a result, guests that run Linux versions >= 5.18 will reseed their in-kernel
+PRNG upon snapshot resume. User space applications can rely on the guest kernel
+for randomness. State other than the guest kernel entropy pool, such as unique
+identifiers, cached random numbers, cryptographic tokens, etc., **will** still be
+replicated across multiple microVMs resumed from the same snapshot. Users need
+to implement mechanisms for ensuring de-duplication of such state, where needed.
+On guests that run Linux versions >= 6.8, users can make use of the uevent that
+the VMGenID driver emits upon resuming from a snapshot, to be notified about
+snapshot resume events.
 
 ## Vsock device limitation
 
@@ -605,6 +626,16 @@ section 5.10.6.6 Device Events.
 Firecracker handles sending the `reset` event to the vsock driver, thus the
 customers are no longer responsible for closing active connections.
 
+## VMGenID device limitation
+
+During snapshot resume, Firecracker updates the 16-byte generation ID of the
+VMGenID device and injects an interrupt in the guest before resuming vCPUs. If
+the snapshot was taken at the very early stages of the guest kernel boot process,
+proper interrupt handling might not be in place yet. As a result, the kernel
+might not be able to handle the injected notification and might crash. We
+suggest that users take snapshots only after the guest kernel has completed
+booting, to avoid this issue.
+
 ## Snapshot compatibility across kernel versions
 
 We have a mechanism in place to experiment with snapshot compatibility across