Actually provision read-only downstairs! #728
Conversation
My normal flow for booting an instance is:

- create an image
- create a disk with that image as a source
- attach that disk to an instance, and boot

This masked the problem that this PR fixes: the crucible agent does not call `apply_smf` when creating or destroying "running snapshots" (aka read-only downstairs for snapshots), **only** when creating or destroying regions.

If a user creates an image, the read-only downstairs are not provisioned. If a user then creates a new disk with that image as a source, `apply_smf` is called as part of creating the region for that new disk, and this will also provision the read-only downstairs. If a user only created a disk from a bulk import (like oxidecomputer/omicron#3034) then the read-only downstairs would not be started. If that disk is then attached to an instance, it will not boot because the Upstairs cannot connect to the non-existent read-only downstairs:

May 07 06:23:09.145 INFO [1] connecting to [fd00:1122:3344:102::a]:19001, looper: 1
May 07 06:23:09.146 INFO [0] connecting to [fd00:1122:3344:105::5]:19001, looper: 0
May 07 06:23:09.146 INFO [2] connecting to [fd00:1122:3344:10b::b]:19001, looper: 2
May 07 06:23:19.155 INFO [1] connecting to [fd00:1122:3344:102::a]:19001, looper: 1
May 07 06:23:19.158 INFO [0] connecting to [fd00:1122:3344:105::5]:19001, looper: 0
May 07 06:23:19.158 INFO [2] connecting to [fd00:1122:3344:10b::b]:19001, looper: 2

Fixes oxidecomputer/omicron#3034
Fixes oxidecomputer#704
Possibly also fixes #542
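As a rough, self-contained sketch of the idea behind the fix (the names and data shapes below are illustrative stand-ins, not the actual agent code): every change to a running snapshot should go through the same SMF reconciliation that region changes already trigger.

```rust
#[derive(Debug)]
enum State {
    Requested,
    Created,
    Destroyed,
}

struct Region {
    id: String,
    state: State,
}

struct RunningSnapshot {
    id: String,
    state: State,
}

// Stand-in for the agent's apply_smf(): reconcile desired state against SMF
// services for both regions and read-only downstairs.
fn apply_smf(regions: &[Region], running_snapshots: &[RunningSnapshot]) {
    for r in regions {
        println!("ensure region service {} is {:?}", r.id, r.state);
    }
    for rs in running_snapshots {
        // Before this PR, snapshot-only changes never reached a reconcile
        // pass like this, so the read-only downstairs for an image were
        // never started.
        println!("ensure read-only downstairs {} is {:?}", rs.id, rs.state);
    }
}

fn main() {
    let regions = vec![Region { id: "r-1".into(), state: State::Created }];
    let snaps = vec![RunningSnapshot { id: "r-1/snap-1".into(), state: State::Requested }];

    // The fix in spirit: call apply_smf() when running snapshots are created
    // or destroyed, not only when regions are.
    apply_smf(&regions, &snaps);
}
```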
Some quick questions to start :)
let mut rs = inner
    .running_snapshots
    .get_mut(region_id)
    .unwrap()
What happens if these unwraps fail, and when could they possibly fail?
The agent would panic, and the service would restart, scan the data file, and attempt the same operation again. It may then succeed, or it may crash again. Too many crashes would put the service into the maintenance state, and it would remain disabled.
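For context, a hypothetical illustration of what the unwrap() is doing and what a non-panicking alternative could look like. The types and the map layout are simplified stand-ins, not the real datafile.rs structures:

```rust
use std::collections::BTreeMap;

#[derive(Debug)]
enum State {
    Requested,
    Created,
    Destroyed,
}

#[derive(Debug)]
struct RunningSnapshot {
    state: State,
}

// The unwrap()-style version panics if either record is missing; SMF would
// then restart the agent, which rescans the data file and retries. This
// version surfaces the inconsistency as an error instead of crashing.
fn set_running_snapshot_state(
    running_snapshots: &mut BTreeMap<String, BTreeMap<String, RunningSnapshot>>,
    region_id: &str,
    snapshot_name: &str,
    nstate: State,
) -> Result<(), String> {
    let region = running_snapshots
        .get_mut(region_id)
        .ok_or_else(|| format!("no running snapshots for region {region_id}"))?;
    let rs = region
        .get_mut(snapshot_name)
        .ok_or_else(|| format!("no running snapshot named {snapshot_name}"))?;
    rs.state = nstate;
    Ok(())
}

fn main() {
    let mut rs = BTreeMap::new();
    rs.insert(
        "region-1".to_string(),
        BTreeMap::from([("snap-1".to_string(), RunningSnapshot { state: State::Requested })]),
    );
    set_running_snapshot_state(&mut rs, "region-1", "snap-1", State::Created).unwrap();
    println!("{rs:?}");
}
```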
Do we know for sure that a panic'd agent will restart and retry? I know there are places where we don't correctly handle panics, but I was not sure what happens to the agent.
I'll verify this but I'm pretty sure.
agent/src/datafile.rs (Outdated)
);
rs.state = nstate;

// XXX shouldn't this remove the record?
XXX How can we tell?
If we don't remove the record, then any attempt by Nexus to provision a running snapshot with the same name for the same region wouldn't work; it would return the existing record with state == destroyed. We currently keep the record around for debugging purposes.
I guess I'm confused about what the XXX was for.
I think removing the record is what we want, right? Otherwise we are leaking resources.
I think leaving the records there is important to keep Nexus sagas idempotent. If they request that a region be destroyed and the saga node then reruns, we'll want the "destroy" command to not return a 404 (because the record is gone), but instead a 204 because the region is already destroyed. I've removed the XXX comments and added a note about sagas in 03086e6.
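A tiny sketch of that idempotency argument (illustrative only; the enum, record shape, and return values are simplified stand-ins for the agent's real HTTP handler and state machine):

```rust
#[derive(Debug, PartialEq)]
enum State {
    Created,
    Tombstoned,
    Destroyed,
}

#[derive(Debug)]
struct RunningSnapshot {
    state: State,
}

#[derive(Debug, PartialEq)]
enum Response {
    NoContent204,
    NotFound404,
}

// Because destroyed records are kept around, a saga node that re-runs its
// "destroy" request finds the record in state Destroyed and gets a 204. If
// the record were removed after destroy, the re-run would fall through to
// the 404 arm and the saga would see an error for work that already
// succeeded.
fn delete_running_snapshot(record: Option<&mut RunningSnapshot>) -> Response {
    match record {
        Some(r) if r.state == State::Destroyed => Response::NoContent204,
        Some(r) => {
            r.state = State::Tombstoned;
            Response::NoContent204
        }
        None => Response::NotFound404,
    }
}

fn main() {
    let mut rs = RunningSnapshot { state: State::Created };
    assert_eq!(delete_running_snapshot(Some(&mut rs)), Response::NoContent204);
    rs.state = State::Destroyed;
    // A second (re-run) delete is still a success because the record survives.
    assert_eq!(delete_running_snapshot(Some(&mut rs)), Response::NoContent204);
    assert_eq!(delete_running_snapshot(None), Response::NotFound404);
}
```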
@@ -536,6 +538,10 @@ fn apply_smf(

for (_, region_snapshots) in running_snapshots.iter_mut() {
    for snapshot in region_snapshots.values_mut() {
        if snapshot.state != State::Requested {
Stupid github. I want to comment on lines 370 and 380 in this file, but github won't let me. So, I'm putting the comment on this line even though it is not related to the line itself. Just FYI though: I thought those lines might have been where we were leaking memory in agent (relating to #542).
You mean:

/*
 * XXX just disable for now.
 */
inst.disable(false)?;
Yeah. I think there was something additional here we needed to clean up the resources, but in my limited analysis I never figured out exactly what we needed to do. When this change goes back, we will see if it fixed the 542 problem or not :)
agent/src/main.rs (Outdated)
let res = worker_region_destroy(&log, &r, region_dataset)
    .and_then(|_| df.destroyed(&r.id));
_ => {
    eprintln!(
Should we be using the `error!` macro here instead of `eprintln!`? There is another `eprintln!` down below as well that we should probably change.
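A small sketch of the suggested change, assuming the agent's existing slog Logger is in scope as `log` (the message text and key are illustrative, since the real `eprintln!` contents are truncated in the diff above; the slog-term/slog-async setup is only there so the example runs standalone):

```rust
use slog::{error, o, Drain, Logger};

fn main() {
    // Minimal slog setup just so the example runs on its own; in the agent
    // the Logger already exists and is passed around as `log`.
    let decorator = slog_term::TermDecorator::new().build();
    let drain = slog_term::FullFormat::new(decorator).build().fuse();
    let drain = slog_async::Async::new(drain).build().fuse();
    let log = Logger::root(drain, o!());

    let region_id = "r-1234";

    // Before: eprintln!("unexpected work request for region {}", region_id);
    // After: the same information goes through the structured logger.
    error!(log, "unexpected work request"; "region_id" => region_id);
}
```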
done in 473cf73
The comments I have are either questions, or minor nits.
Let's get this merged to main and roll it into the rest of the control plane.
Sounds good, I'll merge this and create follow-up issues.