[Snaps] Filesystem errors when loading a firecracker VM snapshot on a different machine #759

Open
CuriousGeorgiy opened this issue Aug 22, 2023 · 1 comment


CuriousGeorgiy commented Aug 22, 2023

I am hacking on firecracker-containerd to support firecracker snapshots, and I am facing the following problem (see also firecracker-microvm/firecracker#4036).

I create a snapshot of a VM running an nginx container and then try to load this snapshot on a different machine. To preserve the disk state, I simply commit a container snapshot using source code from nerdctl commit.

When restoring the disk state, I patch (see also firecracker-microvm/firecracker#4014) the snapshot's disk device path for the container snapshot device to point to a fresh containerd snapshot mount point (the previously committed image is mounted).

Snapshot loading succeeds and the container even responds to HTTP requests (I am omitting the network setup details since I don't have problems with it now), but the nginx container returns internal server errors, and the following error appears in the VM's kernel logs:

DEBU[2023-08-15T07:21:33.072516371-04:00] [   38.858355] EXT4-fs error (device vdb): ext4_find_entry:1447: inode #262: comm nginx: checksumming directory block 0  jailer=noop runtime=aws.firecracker vmID=1 vmm_stream=stdout
DEBU[2023-08-15T07:21:33.076584363-04:00] [   38.862540] EXT4-fs error (device vdb): ext4_find_entry:1447: inode #262: comm nginx: checksumming directory block 0  jailer=noop runtime=aws.firecracker vmID=1 vmm_stream=stdout

It seems like the restored disk state (via container snapshot commit) is inconsistent with the VM's disk state.
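One way to test this hypothesis is to hash the backing device on the first machine while the VM is paused, and compare it with the device the snapshot is pointed at on the second machine. The sketch below is only an illustration: the function names are hypothetical, and it assumes both devices (or image files) are readable from the host and are not being written to while they are hashed.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 of a file or block device, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # iter() with a sentinel stops at the first empty read (end of file).
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_disk_state(source_digest, restored_path):
    """Compare a digest taken on the source machine with the restored device."""
    return source_digest == sha256_of(restored_path)
```

If the digests differ, the guest's cached filesystem metadata no longer matches the blocks on the device, which would be consistent with the ext4 checksum errors above.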

For other containers, such as simple Python or Golang HTTP servers, sending a request to the container loaded from a snapshot makes it crash (the Python interpreter traps on an invalid opcode, or the Golang runtime panics).

I have tried manually doing the same thing, i.e.:

  1. Pull image using firecracker-ctr
  2. Prepare snapshot using firecracker-ctr
  3. Set up firecracker using the getting started guide, adding a stub drive like firecracker-containerd
  4. Send a patch drive request replacing the stub drive with the container snapshot mount
  5. Manually mount the container snapshot inside the VM and launch the nginx server
  6. Pause the VM and create a snapshot
  7. Transfer the VM state files to a different machine using rsync
  8. Repeat steps 1-2 on the second machine.
  9. Resume the VM on the second machine.
  10. Everything works, the nginx server responds with no errors.
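For reference, the Firecracker API calls behind steps 4, 6 and 9 can be sketched roughly as below. This is not what firecracker-containerd does internally; it only builds the request bodies. The helper names and all arguments are hypothetical, and the endpoint and field names are written as I recall them from the Firecracker API, so they should be checked against the API specification of the Firecracker version in use.

```python
import json


def snapshot_requests(drive_id, host_path, snapshot_path, mem_path):
    """Requests for the source VM: patch the stub drive, pause, snapshot."""
    return [
        # Step 4: point the stub drive at the container snapshot mount.
        ("PATCH", f"/drives/{drive_id}",
         {"drive_id": drive_id, "path_on_host": host_path}),
        # (Step 5 happens inside the guest and is not an API call.)
        # Step 6: pause the VM, then write the VM state and memory files.
        ("PATCH", "/vm", {"state": "Paused"}),
        ("PUT", "/snapshot/create",
         {"snapshot_type": "Full",
          "snapshot_path": snapshot_path,
          "mem_file_path": mem_path}),
    ]


def restore_requests(snapshot_path, mem_path):
    """Request for a fresh Firecracker process on the second machine (step 9)."""
    return [
        ("PUT", "/snapshot/load",
         {"snapshot_path": snapshot_path,
          "mem_backend": {"backend_path": mem_path, "backend_type": "File"},
          "resume_vm": True}),
    ]


if __name__ == "__main__":
    # Hypothetical identifiers and paths, for illustration only.
    calls = snapshot_requests("stub0", "/dev/mapper/snap", "vm.snap", "vm.mem")
    for method, path, body in calls + restore_requests("vm.snap", "vm.mem"):
        print(method, path, json.dumps(body))
```

Each tuple would be sent to the Firecracker API Unix socket in order; the sketch leaves the transport out so the sequence itself stays easy to read.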

Even though the disk state is technically not the same (creating a container commit and pushing it to a registry would require patching firecracker-ctr), the container responds with a greeting (as opposed to the internal server error I get when doing the same thing through firecracker-containerd) and seems healthy. I do see the same error in the kernel log, but I believe it is caused by the disk state difference.

Manually loading snapshots of simpler setups (manually creating the container snapshot for a simple Golang HTTP server) also works.

After discussing this issue with the firecracker folks in firecracker-microvm/firecracker#4036, we came to the conclusion that the problem is more likely in firecracker-containerd than in firecracker.

AFAICT, I have studied all firecracker-containerd interactions with container snapshots and firecracker (both the VM and the agent running in the VM), and I didn't find any problems or any special filesystem actions other than those I did manually.

This leads me to the conclusion that the problem may be with the shim and container filesystem setup.

My patch to firecracker-containerd can be found in #760.

I can provide more context and more detailed steps for reproducing this issue, if needed. I would really appreciate any help or suggestions on this effort to support firecracker snapshots in firecracker-containerd.

CuriousGeorgiy changed the title from "[Snaps] Filesystem errors when loading a snapshot on a different machine" to "[Snaps] Filesystem errors when loading a firecracker VM snapshot on a different machine" on Aug 22, 2023