When restoring MSRs, we currently do not enforce any sort of relative
ordering. However, some MSRs have implicit dependencies on other MSRs
being restored before them, and failing to fulfill these dependencies
can result in incorrect VM behavior after resuming.
One example of such an implicit dependency between MSRs is the pair
(`MSR_IA32_TSC_DEADLINE`, `MSR_IA32_TSC`). When restoring
`MSR_IA32_TSC_DEADLINE`, KVM internally checks whether the value of this
restoration implies that the guest was waiting for the tsc_deadline
timer to expire at the time of being paused for snapshotting. If so, it
primes a timer on the host (hardware or software, depending on support)
to make sure the guest will receive the expected interrupt
after restoration. To determine whether this is needed, KVM looks at the
guest's timestamp counter (TSC) and compares it with the requested
tsc_deadline value. The value KVM reads for the guest's TSC depends on
the value of `MSR_IA32_TSC`. Thus, if `MSR_IA32_TSC_DEADLINE` is set
before `MSR_IA32_TSC` is restored, this comparison will yield a wrong
result (as the deadline value is compared with something uninitialized).
This can
either result in KVM determining the guest wasn't waiting for a timer
expiry at the time of snapshotting, or cause it to schedule the timer
interrupt too far into the future.
Note that the former is only a problem on AMD platforms, which do not
support the TSC_DEADLINE feature at the hardware level. Here, KVM falls
back to a software timer, which explicitly does not notify the vCPU if
the deadline value is "in the past". The hardware timer used on other
x86 platforms, by contrast, always fires (immediately, if the deadline
value is in the past).
This commit fixes the above by ensuring we always restore `MSR_IA32_TSC`
before restoring `MSR_IA32_TSC_DEADLINE`. We realize this by splitting
off, from the lists of MSRs that KVM gives us, one additional chunk
containing all "deferred" MSRs that need to be restored "as late as
possible". This splitting happens at snapshot creation time, to keep it
off the hot path.
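A minimal sketch of the splitting idea, assuming a simplified entry
type: `MsrEntry`, `is_deferred`, and `split_deferred` are hypothetical
names for illustration and do not necessarily match the code in this
commit. The MSR indices are the architectural ones.

```rust
const MSR_IA32_TSC: u32 = 0x10;
const MSR_IA32_TSC_DEADLINE: u32 = 0x6e0;

/// Simplified stand-in for a saved MSR (index plus value).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct MsrEntry {
    index: u32,
    data: u64,
}

/// MSRs that must be restored after everything else.
fn is_deferred(index: u32) -> bool {
    index == MSR_IA32_TSC_DEADLINE
}

/// Moves all deferred MSRs out of the given chunks into one extra
/// chunk appended at the end, so that restoring the chunks in order
/// writes the deferred MSRs last.
fn split_deferred(chunks: Vec<Vec<MsrEntry>>) -> Vec<Vec<MsrEntry>> {
    let mut deferred = Vec::new();
    let mut out = Vec::with_capacity(chunks.len() + 1);
    for chunk in chunks {
        let (late, regular): (Vec<MsrEntry>, Vec<MsrEntry>) =
            chunk.into_iter().partition(|e| is_deferred(e.index));
        deferred.extend(late);
        out.push(regular);
    }
    out.push(deferred);
    out
}
```

Because the extra chunk is always last, `MSR_IA32_TSC` (index
`MSR_IA32_TSC`, in a regular chunk) is written before
`MSR_IA32_TSC_DEADLINE` regardless of the order in which KVM reported
them.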
Fixes firecracker-microvm#4099
Signed-off-by: Patrick Roy <[email protected]>