Actually provision read-only downstairs! #728

Merged

Conversation


@jmpesp jmpesp commented May 9, 2023

My normal flow for booting an instance is:

  • create an image
  • create a disk with that image as a source
  • attach that disk to an instance, and boot

This masked the problem that this PR fixes: the crucible agent does not call `apply_smf` when creating or destroying "running snapshots" (aka read-only downstairs for snapshots), only when creating or destroying regions.

If a user creates an image, the read-only downstairs are not provisioned. If the user then creates a new disk with that image as a source, `apply_smf` is called as part of creating the region for that new disk, and this also provisions the read-only downstairs. If a user only creates a disk from a bulk import (like oxidecomputer/omicron#3034), the read-only downstairs are never started. If that disk is then attached to an instance, it will not boot, because the Upstairs cannot connect to the non-existent read-only downstairs:

May 07 06:23:09.145 INFO [1] connecting to [fd00:1122:3344:102::a]:19001, looper: 1
May 07 06:23:09.146 INFO [0] connecting to [fd00:1122:3344:105::5]:19001, looper: 0
May 07 06:23:09.146 INFO [2] connecting to [fd00:1122:3344:10b::b]:19001, looper: 2
May 07 06:23:19.155 INFO [1] connecting to [fd00:1122:3344:102::a]:19001, looper: 1
May 07 06:23:19.158 INFO [0] connecting to [fd00:1122:3344:105::5]:19001, looper: 0
May 07 06:23:19.158 INFO [2] connecting to [fd00:1122:3344:10b::b]:19001, looper: 2

Fixes oxidecomputer/omicron#3034
Fixes #704
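
For readers outside the agent codebase, the shape of the fix is roughly the following. This is an illustrative sketch only: the function and type names (`service_running_snapshot_request`, `DataFile`, `update_running_snapshot`) are assumptions, not the actual patch.

    // Sketch: the running-snapshot worker must reconcile SMF after updating
    // the datafile, exactly as the region worker already does.
    fn service_running_snapshot_request(
        log: &Logger,
        df: &DataFile,
        request: &RunningSnapshotRequest,
    ) -> Result<()> {
        // Record the requested (or destroyed) state in the datafile...
        df.update_running_snapshot(request)?;

        // ...then apply that state to SMF so the read-only downstairs
        // service instances are actually created (or torn down). Before
        // this change, only the region create/destroy path made this call.
        apply_smf(log, df)?;

        Ok(())
    }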

@jmpesp jmpesp requested a review from leftwo May 9, 2023 14:13
@jmpesp jmpesp changed the title from "Actually provision running read-only downstairs!" to "Actually provision read-only downstairs!" May 9, 2023

jmpesp commented May 9, 2023

Possibly also fixes #542

@leftwo leftwo left a comment

Some quick questions to start :)

let mut rs = inner
    .running_snapshots
    .get_mut(region_id)
    .unwrap()
@leftwo (Contributor):

What happens if these unwraps fail, and when would they possibly fail?

@jmpesp (Contributor, Author):

The agent would panic, and the service would restart, scan the data file, and attempt the same operation again. It may then succeed, or it may crash again. Too many crashes would put the service into the maintenance state, and it would remain disabled.

@leftwo (Contributor):

Do we know for sure that a panic'd agent will restart and retry? I know there are places where we don't
correctly handle panics, but I was not sure what happens to the agent.

@jmpesp (Contributor, Author):

I'll verify this but I'm pretty sure.
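
(Purely for illustration, not something this PR changes: a non-panicking form of that lookup would surface an error instead of crashing the agent. The map types below are simplified placeholders, not the agent's real types.)

    use std::collections::BTreeMap;

    // Placeholder shape: region id -> (snapshot name -> some record).
    type RunningSnapshots = BTreeMap<String, BTreeMap<String, String>>;

    // Return an error instead of panicking when a region has no
    // running-snapshot entry (e.g. a stale or corrupted datafile).
    fn running_snapshots_for<'a>(
        inner: &'a mut RunningSnapshots,
        region_id: &str,
    ) -> Result<&'a mut BTreeMap<String, String>, String> {
        inner
            .get_mut(region_id)
            .ok_or_else(|| format!("no running snapshots for region {region_id}"))
    }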

agent/src/datafile.rs
);
rs.state = nstate;

// XXX shouldn't this remove the record?
@leftwo (Contributor):

XXX How can we tell?

@jmpesp (Contributor, Author):

If we don't remove the record, then any attempt by Nexus to provision a running snapshot with the same name for the same region wouldn't work: it would get back the existing record with state == destroyed. We currently keep the record around for debugging purposes.

@leftwo (Contributor):

I guess I'm confused about what the XXX was for.

I think removing the record is what we want, right? Otherwise we are leaking resources.

@jmpesp (Contributor, Author):

I think leaving the records there is important to keep Nexus sagas idempotent. If Nexus requests that a region be destroyed and the saga node then reruns, we want the "destroy" command to return a 204 (the region is already destroyed) rather than a 404 (because the record is gone). I've removed the XXX comments and added a note about sagas in 03086e6.
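
(A sketch of that idempotency argument, using hypothetical names and a simplified state enum; the real agent's types and HTTP plumbing differ.)

    // Why the destroyed record is kept: a rerun of the saga's "destroy"
    // node should see 204 (already destroyed), not 404 (record missing).
    enum State {
        Requested,
        Created,
        Destroyed,
    }

    enum DestroyResponse {
        NoContent, // HTTP 204: destroyed (or already destroyed)
        NotFound,  // HTTP 404: no such record
    }

    fn destroy_running_snapshot(record: Option<&State>) -> DestroyResponse {
        match record {
            // Already destroyed: report success again, so a saga replay
            // is a no-op rather than an error.
            Some(State::Destroyed) => DestroyResponse::NoContent,
            // Any other state: request the teardown (elided) and report
            // success.
            Some(_) => DestroyResponse::NoContent,
            // If the record were deleted outright, a replay would land
            // here and the saga would see an unexpected 404.
            None => DestroyResponse::NotFound,
        }
    }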

@@ -536,6 +538,10 @@ fn apply_smf(

    for (_, region_snapshots) in running_snapshots.iter_mut() {
        for snapshot in region_snapshots.values_mut() {
            if snapshot.state != State::Requested {
@leftwo (Contributor):

Stupid GitHub. I want to comment on lines 370 and 380 in this file, but GitHub won't let me. So I'm putting the comment on this line even though it is not related to the line itself.

Just FYI though: I thought those lines might have been where we were leaking memory in the agent (relating to #542).

@jmpesp (Contributor, Author):

You mean

                /*
                 * XXX just disable for now.
                 */
                inst.disable(false)?;

@leftwo (Contributor):

Yeah. I think there was something additional we needed here to clean up the resources, but in my limited analysis I never figured out exactly what that was. When this change goes back, we will see whether it fixed the 542 problem or not :)

let res = worker_region_destroy(&log, &r, region_dataset)
    .and_then(|_| df.destroyed(&r.id));
_ => {
    eprintln!(
@leftwo (Contributor):

Should we be using the error! macro here instead of eprintln!?

There is another eprintln! down below as well that we should probably change.

@jmpesp (Contributor, Author):

done in 473cf73
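
(For reference, the slog pattern being swapped in looks roughly like this; the discard-drain logger is only there to make the snippet self-contained and is not how the agent builds its logger.)

    use slog::{error, o, Discard, Logger};

    fn main() {
        // Stand-in logger; the agent constructs its own drain. The point is
        // that error! routes messages through the structured logger instead
        // of writing directly to stderr the way eprintln! does.
        let log = Logger::root(Discard, o!());

        let err = "example failure";
        error!(log, "region destroy failed: {:?}", err);
    }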

@leftwo leftwo self-requested a review May 10, 2023 14:28
@leftwo leftwo left a comment

The comments I have are either questions or minor nits.
Let's get this merged to main and roll it into the rest of the control plane.


jmpesp commented May 10, 2023

> The comments I have are either questions or minor nits.
> Let's get this merged to main and roll it into the rest of the control plane.

Sounds good, I'll merge this and create follow-up issues.

@jmpesp jmpesp merged commit 9f69dea into oxidecomputer:main May 10, 2023
@jmpesp jmpesp deleted the actually_provision_running_snapshots branch May 10, 2023 15:03