VM Supervisor #198

jmickey · 2021-11-03T11:39:37Z

The Supervisor will be responsible for monitoring running MicroVMs and reacting to changes that drift from the desired state.

Why do we need this?

We don't currently monitor the state of running VMs continuously. If a VM drifts away from its desired state (e.g. a VM crashes and is left in a failed state), we have to wait until the next time the reconciler runs for the VM to be recreated/restarted.

Additionally, we don't currently track whether a VM is continuously failing. Flintlock will continue to recreate the VM every time a resync occurs. The reconciler doesn't know whether a VM has already been started; it only cares about reconciling the existing state with the desired state.

What do we need?

The VM Supervisor should exist as a background process/goroutine.
As firecracker does not have a background daemon or event system, the supervisor will need to have knowledge of the VMs that should/do exist and their desired state, and should continuously (on a short timer) check the status of each VM.
The supervisor should probably utilise the containerd state to store events (e.g. VM stopped, VM started, VM restarted X times, etc), and the event bus to notify the reconciler to take action outside the delayed reconciler resync loop.

How this looks on an implementation level is unknown, and it's likely one or more ADRs will need to be produced as a result.

Subtasks

Create a supervisor that starts when flintlockd starts, and that detects and continuously lists all VMs and their state directly from Firecracker.
Extend MicroVM model to track microvm events - started, stopped, reboots, etc.
Detect when VMs are in a failed state and trigger an event on the event bus to reconcile the VM state.
Track number of failures/restarts and emit metrics.
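For the last subtask, a minimal sketch of what per-VM failure and restart tracking could look like. `RestartTracker` and `VMCounters` are hypothetical names introduced here for illustration; where the counts are actually stored (model, containerd state, metrics exporter) is left open, as in the issue.

```go
package main

import (
	"fmt"
	"sync"
)

// VMCounters holds the per-VM numbers mentioned in the subtasks (hypothetical type).
type VMCounters struct {
	Failures int
	Restarts int
}

// RestartTracker is a hypothetical in-memory tracker, safe for concurrent use.
type RestartTracker struct {
	mu     sync.Mutex
	counts map[string]VMCounters
}

// NewRestartTracker returns an empty tracker.
func NewRestartTracker() *RestartTracker {
	return &RestartTracker{counts: make(map[string]VMCounters)}
}

// RecordFailure increments the failure count for vmID and returns the new totals.
func (t *RestartTracker) RecordFailure(vmID string) VMCounters {
	t.mu.Lock()
	defer t.mu.Unlock()
	c := t.counts[vmID]
	c.Failures++
	t.counts[vmID] = c
	return c
}

// RecordRestart increments the restart count for vmID and returns the new totals.
func (t *RestartTracker) RecordRestart(vmID string) VMCounters {
	t.mu.Lock()
	defer t.mu.Unlock()
	c := t.counts[vmID]
	c.Restarts++
	t.counts[vmID] = c
	return c
}

// Get returns the current totals for vmID (zero values if the VM is unknown).
func (t *RestartTracker) Get(vmID string) VMCounters {
	t.mu.Lock()
	defer t.mu.Unlock()
	return t.counts[vmID]
}

func main() {
	tracker := NewRestartTracker()
	tracker.RecordFailure("vm-a")
	tracker.RecordRestart("vm-a")
	tracker.RecordFailure("vm-a")

	c := tracker.Get("vm-a")
	fmt.Printf("vm-a failures=%d restarts=%d\n", c.Failures, c.Restarts)
}
```

The totals returned by `Get` are the values a metrics exporter would read; the sketch deliberately does not pick a metrics library.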


richardcase · 2021-11-03T15:38:13Z

I don't think the supervisor itself will store events......just raise them. And it probably shouldn't make any modifications to the microvm spec....so it's a read-only consumer of the specs. wdyt?

jmickey · 2021-11-04T01:24:50Z

@richardcase The usage of Events here might be a little overloaded. By "Events" I am more referring to a running kind of log I guess? Similar to the events that are shown when you kubectl describe a resource. An "event" in this case might be that the supervisor has detected that a previously running VM is no longer running, if that makes sense?

Maybe that is actually part of the reconciler, wdyt?

I don't think the supervisor should make changes to the spec, but maybe the status? Again, maybe I didn't think it through enough and it actually belongs in the reconciler.

jmickey · 2021-11-04T01:54:28Z

Actually, maybe you're right. It should probably be the reconciler that updates if a VM has been started, how many times it's been restarted, etc. Then it can also control the back-off if we choose to implement one in the future.

e.g. It can mark the VM as CrashLoopBackOff (I couldn't think of a better name so I just went with the Kubernetes vernacular) and create a gradually increasing ticker to retry?

richardcase · 2021-11-17T13:32:04Z

As part of the implementation, we need to revisit the sleep introduced in #255....and hopefully remove the need for it.

github-actions · 2022-01-17T07:22:22Z

This issue is stale because it has been open 60 days with no activity.

richardcase · 2022-01-19T06:51:42Z

This is still required

Callisto13 · 2022-06-01T14:43:42Z

Would be good to have this soon. I just started flintlock on a machine which I apparently did not clean up and have rebooted a couple of times since I last did LM, and flintlock is like "woah look how many mvms I have" and I am like "bruh, there are no firecracker processes running".

richardcase · 2022-06-01T16:14:48Z

We should also look at this again: https://github.com/asynkron/protoactor-go

github-actions · 2023-05-19T07:20:07Z

This issue is stale because it has been open 60 days with no activity.

github-actions · 2024-05-18T07:20:35Z

This issue was closed because it has been stalled for 365 days with no activity.

github-actions · 2025-01-08T07:25:10Z

This issue is stale because it has been open 180 days with no activity.

yitsushi mentioned this issue Nov 16, 2021

feat: Add retry counter on failed reconciliation #255

Merged

4 tasks

yitsushi mentioned this issue Dec 1, 2021

Mvm status should reflect usability of mvm #286

Closed

github-actions bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 17, 2022

Callisto13 removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 23, 2022

Callisto13 added this to Liquid Metal Roadmap - Public Sep 22, 2022

Callisto13 moved this to Backlog in Liquid Metal Roadmap - Public Sep 22, 2022

github-actions bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 19, 2023

github-actions bot added the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label May 18, 2024

github-actions bot closed this as completed May 18, 2024

github-project-automation bot moved this from Backlog to Closed in Liquid Metal Roadmap - Public May 18, 2024

richardcase reopened this Jul 10, 2024

richardcase removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. labels Jul 10, 2024

github-actions bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 8, 2025
