Actually provision read-only downstairs! #728

Merged

Conversation


@jmpesp jmpesp commented May 9, 2023

My normal flow for booting an instance is:

  • create an image
  • create a disk with that image as a source
  • attach that disk to an instance, and boot

This masked the problem that this PR fixes: the crucible agent does not call `apply_smf` when creating or destroying "running snapshots" (aka read-only downstairs for snapshots), only when creating or destroying regions.

If a user creates an image, the read-only downstairs are not provisioned. If the user then creates a new disk with that image as a source, `apply_smf` is called as part of creating the region for that new disk, and this also provisions the read-only downstairs. If a user only creates a disk from a bulk import (like oxidecomputer/omicron#3034), the read-only downstairs are never started. If that disk is then attached to an instance, it will not boot, because the Upstairs cannot connect to the non-existent read-only downstairs:

May 07 06:23:09.145 INFO [1] connecting to [fd00:1122:3344:102::a]:19001, looper: 1
May 07 06:23:09.146 INFO [0] connecting to [fd00:1122:3344:105::5]:19001, looper: 0
May 07 06:23:09.146 INFO [2] connecting to [fd00:1122:3344:10b::b]:19001, looper: 2
May 07 06:23:19.155 INFO [1] connecting to [fd00:1122:3344:102::a]:19001, looper: 1
May 07 06:23:19.158 INFO [0] connecting to [fd00:1122:3344:105::5]:19001, looper: 0
May 07 06:23:19.158 INFO [2] connecting to [fd00:1122:3344:10b::b]:19001, looper: 2

Fixes oxidecomputer/omicron#3034
Fixes #704
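
For readers outside the agent codebase, the shape of the fix is roughly the following. This is an illustrative sketch only: the function and type names (`service_running_snapshot_request`, `DataFile`, `update_running_snapshot`) are assumptions, not the actual patch.

    // Sketch: the running-snapshot worker must reconcile SMF after updating
    // the datafile, exactly as the region worker already does.
    fn service_running_snapshot_request(
        log: &Logger,
        df: &DataFile,
        request: &RunningSnapshotRequest,
    ) -> Result<()> {
        // Record the requested (or destroyed) state in the datafile...
        df.update_running_snapshot(request)?;

        // ...then apply that state to SMF so the read-only downstairs
        // service instances are actually created (or torn down). Before
        // this change, only the region create/destroy path made this call.
        apply_smf(log, df)?;

        Ok(())
    }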

@jmpesp jmpesp requested a review from leftwo May 9, 2023 14:13
@jmpesp jmpesp changed the title from "Actually provision running read-only downstairs!" to "Actually provision read-only downstairs!" May 9, 2023

jmpesp commented May 9, 2023

Possibly also fixes #542

@leftwo leftwo left a comment

Some quick questions to start :)

let mut rs = inner
    .running_snapshots
    .get_mut(region_id)
    .unwrap()
@leftwo (Contributor):

What happens if these unwraps fail, and when would they possibly fail?

@jmpesp (Contributor, Author):

The agent would panic, and the service would restart, scan the data file, and attempt the same operation again. It may then succeed, or it may crash again. Too many crashes would put the service into the maintenance state, and it would remain disabled.

@leftwo (Contributor):

Do we know for sure that a panic'd agent will restart and retry? I know there are places where we don't
correctly handle panics, but I was not sure what happens to the agent.

@jmpesp (Contributor, Author):

I'll verify this but I'm pretty sure.
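
(Purely for illustration, not something this PR changes: a non-panicking form of that lookup would surface an error instead of crashing the agent. The map types below are simplified placeholders, not the agent's real types.)

    use std::collections::BTreeMap;

    // Placeholder shape: region id -> (snapshot name -> some record).
    type RunningSnapshots = BTreeMap<String, BTreeMap<String, String>>;

    // Return an error instead of panicking when a region has no
    // running-snapshot entry (e.g. a stale or corrupted datafile).
    fn running_snapshots_for<'a>(
        inner: &'a mut RunningSnapshots,
        region_id: &str,
    ) -> Result<&'a mut BTreeMap<String, String>, String> {
        inner
            .get_mut(region_id)
            .ok_or_else(|| format!("no running snapshots for region {region_id}"))
    }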

agent/src/datafile.rs
);
rs.state = nstate;

// XXX shouldn't this remove the record?
@leftwo (Contributor):

XXX How can we tell?

@jmpesp (Contributor, Author):

If we don't remove the record, then any attempt by Nexus to provision a running snapshot with the same name for the same region wouldn't work: it would get back the existing record with state == destroyed. We currently keep the record around for debugging purposes.

@leftwo (Contributor):

I guess I'm confused about what the XXX was for.

I think removing the record is what we want, right? Otherwise we are leaking resources.

@jmpesp (Contributor, Author):

I think leaving the records there is important to keep Nexus sagas idempotent. If Nexus requests that a region be destroyed and the saga node then reruns, we want the "destroy" command to return a 204 (the region is already destroyed) rather than a 404 (because the record is gone). I've removed the XXX comments and added a note about sagas in 03086e6.
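
(A sketch of that idempotency argument, using hypothetical names and a simplified state enum; the real agent's types and HTTP plumbing differ.)

    // Why the destroyed record is kept: a rerun of the saga's "destroy"
    // node should see 204 (already destroyed), not 404 (record missing).
    enum State {
        Requested,
        Created,
        Destroyed,
    }

    enum DestroyResponse {
        NoContent, // HTTP 204: destroyed (or already destroyed)
        NotFound,  // HTTP 404: no such record
    }

    fn destroy_running_snapshot(record: Option<&State>) -> DestroyResponse {
        match record {
            // Already destroyed: report success again, so a saga replay
            // is a no-op rather than an error.
            Some(State::Destroyed) => DestroyResponse::NoContent,
            // Any other state: request the teardown (elided) and report
            // success.
            Some(_) => DestroyResponse::NoContent,
            // If the record were deleted outright, a replay would land
            // here and the saga would see an unexpected 404.
            None => DestroyResponse::NotFound,
        }
    }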

@@ -536,6 +538,10 @@ fn apply_smf(

    for (_, region_snapshots) in running_snapshots.iter_mut() {
        for snapshot in region_snapshots.values_mut() {
            if snapshot.state != State::Requested {
@leftwo (Contributor):

Stupid GitHub. I want to comment on lines 370 and 380 in this file, but GitHub won't let me. So I'm putting the comment on this line even though it is not related to the line itself.

Just FYI though: I thought those lines might have been where we were leaking memory in the agent (relating to #542).

@jmpesp (Contributor, Author):

You mean

                /*
                 * XXX just disable for now.
                 */
                inst.disable(false)?;

@leftwo (Contributor):

Yeah. I think there was something additional we needed here to clean up the resources, but in my limited analysis I never figured out exactly what that was. When this change goes back, we will see whether it fixed the 542 problem or not :)

let res = worker_region_destroy(&log, &r, region_dataset)
    .and_then(|_| df.destroyed(&r.id));
_ => {
    eprintln!(
@leftwo (Contributor):

Should we be using the error! macro here instead of eprintln!?

There is another eprintln! down below as well that we should probably change.

@jmpesp (Contributor, Author):

done in 473cf73
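
(For reference, the slog pattern being swapped in looks roughly like this; the discard-drain logger is only there to make the snippet self-contained and is not how the agent builds its logger.)

    use slog::{error, o, Discard, Logger};

    fn main() {
        // Stand-in logger; the agent constructs its own drain. The point is
        // that error! routes messages through the structured logger instead
        // of writing directly to stderr the way eprintln! does.
        let log = Logger::root(Discard, o!());

        let err = "example failure";
        error!(log, "region destroy failed: {:?}", err);
    }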

@leftwo leftwo self-requested a review May 10, 2023 14:28
@leftwo leftwo left a comment

The comments I have are either questions or minor nits.
Let's get this merged to main and roll it into the rest of the control plane.


jmpesp commented May 10, 2023

> The comments I have are either questions or minor nits.
> Let's get this merged to main and roll it into the rest of the control plane.

Sounds good, I'll merge this and create follow-up issues.

@jmpesp jmpesp merged commit 9f69dea into oxidecomputer:main May 10, 2023
@jmpesp jmpesp deleted the actually_provision_running_snapshots branch May 10, 2023 15:03