
503 error when snapshotting disks attached to stopped instances #3289

Closed
askfongjojo opened this issue Jun 4, 2023 · 4 comments · Fixed by #5221
Labels: known issue (To include in customer documentation and training)

@askfongjojo commented:

I think I have seen this 503 before but mistook it for a control plane issue:

```
$ oxide snapshot create --name loadgen-snap-stoppedvm --disk loadgen --description "disk attached to vm" --project try
error
Error Response: status: 503 Service Unavailable; headers: {"content-type": "application/json", "x-request-id": "7c2b9888-1d8f-43c7-8dd7-8bec8167ca8f", "content-length": "133", "date": "Sun, 04 Jun 2023 06:29:02 GMT"}; value: Error { error_code: Some("ServiceNotAvailable"), message: "Service Unavailable", request_id: "7c2b9888-1d8f-43c7-8dd7-8bec8167ca8f" }
```

(Once I started up the VM, I was able to create snapshots for both of the disks attached to this vm on rack2.)

If snapshots are prohibited on disks attached to stopped instances, we should prevent the action with a more explicit error: a 503 response implies that the service is only temporarily unavailable and that the user may retry.

If snapshots should be allowed on such disks, then this is a functional issue.

@askfongjojo askfongjojo added this to the FCS milestone Jun 4, 2023
@jmpesp (Contributor) commented Jun 5, 2023:

In this case, it's a functional issue. Taking a snapshot of an attached disk is allowed.

@askfongjojo askfongjojo added the known issue To include in customer documentation and training label Jul 6, 2023
@jmpesp (Contributor) commented Jul 7, 2023:

Here's the bug in the snapshot create saga:

```rust
match db_disk.state().into() {
    external::DiskState::Detached => {
        info!(log, "setting state of {} to maintenance", params.disk_id);
        osagactx
            .datastore()
            .disk_update_runtime(
                &opctx,
                &authz_disk,
                &db_disk.runtime().maintenance(),
            )
            .await
            .map_err(ActionError::action_failed)?;
    }

    external::DiskState::Finalizing => {
        // This saga is a sub-saga of the finalize saga if the user has
        // specified an optional snapshot should be taken. No state change
        // is required.
        info!(log, "disk {} in state finalizing", params.disk_id);
    }

    _ => {
        // Return a 503 indicating that the user should retry
        return Err(ActionError::action_failed(
            Error::ServiceUnavailable {
                internal_message: format!(
                    "disk is in state {:?}",
                    db_disk.state(),
                ),
            },
        ));
    }
}
```

The disk state here is expected to be Detached, but if the disk is attached to a stopped instance this match will return 503. The part of Nexus that checks whether or not the Pantry should be used to take a snapshot says to use the Pantry if the instance is stopped:
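A minimal sketch of that dispatch (hypothetical, simplified types; not the actual Nexus code) makes the bug easy to see: any state other than Detached or Finalizing, including a disk attached to a stopped instance, falls into the catch-all arm and surfaces to the user as a 503:

```rust
// Hypothetical simplified model of the snapshot saga's disk-state check.
#[derive(Debug, Clone, Copy, PartialEq)]
enum DiskState {
    Detached,
    Finalizing,
    Attached, // attached to an instance, whether running or stopped
}

#[derive(Debug, Clone, Copy, PartialEq)]
enum SagaOutcome {
    SetMaintenance,
    NoStateChange,
    ServiceUnavailable,
}

fn snapshot_saga_state_check(state: DiskState) -> SagaOutcome {
    match state {
        // Only a detached disk gets moved to maintenance for the snapshot.
        DiskState::Detached => SagaOutcome::SetMaintenance,
        // Sub-saga of the finalize saga: no state change required.
        DiskState::Finalizing => SagaOutcome::NoStateChange,
        // Everything else, including Attached-to-a-stopped-instance,
        // hits this arm and is returned to the user as a 503.
        _ => SagaOutcome::ServiceUnavailable,
    }
}
```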

```rust
match instance_state {
    // If there's a propolis running, use that
    InstanceState::Running |
    // Rebooting doesn't deactivate the volume
    InstanceState::Rebooting
    => false,

    // If there's definitely no propolis running, then use the
    // pantry
    InstanceState::Stopped | InstanceState::Destroyed => true,

    // If there *may* be a propolis running, then fail: we can't
    // know if that propolis has activated the Volume or not, or if
    // it's in the process of deactivating.
    _ => {
        return Err(
            Error::invalid_request(
                &format!("cannot snapshot attached disk for instance in state {}", instance_state)
            )
        );
    }
}
```

There is more work needed here: at a minimum, the disk's state changes to Maintenance as part of this saga, and that has to block the instance from starting. This may not be a candidate for FCS, though, since a workaround exists: start the instance first.
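A sketch of the guard this implies (hypothetical names and types, not the actual implementation): an instance must not start while any of its attached disks is in Maintenance, i.e. while a Pantry snapshot is in flight.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum DiskState {
    Attached,
    Maintenance,
}

// Instance start must be refused while any attached disk is in
// Maintenance, so the Pantry's snapshot cannot race a new Propolis.
fn can_start_instance(attached_disks: &[DiskState]) -> bool {
    !attached_disks
        .iter()
        .any(|state| *state == DiskState::Maintenance)
}
```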

@askfongjojo (Author) commented:

Thanks for root-causing/sizing this. Let's re-target this to MVP given the effort and impact involved. I'll document this known issue in the release notes, because users will likely want to create clean snapshots on stopped instances (so that things such as temporary files and locks created by running applications won't be part of the snapshot).

@askfongjojo askfongjojo modified the milestones: FCS, MVP, 1.0.1 Jul 8, 2023
@askfongjojo (Author) commented:

Want to note that a customer indicated they would like to see a fix for this issue sooner, for the same reason I mentioned above (i.e. their best practice is to create snapshots on stopped instances).

@askfongjojo askfongjojo modified the milestones: MVP, 1.0.2 Aug 14, 2023
@askfongjojo askfongjojo modified the milestones: 1.0.2, 1.0.3 Aug 22, 2023
@askfongjojo askfongjojo modified the milestones: 1.0.3, 3 Sep 1, 2023
@askfongjojo askfongjojo modified the milestones: 3, 4 Oct 17, 2023
@morlandi7 morlandi7 modified the milestones: 4, 5 Nov 14, 2023
@morlandi7 morlandi7 modified the milestones: 5, 6 Dec 5, 2023
@leftwo leftwo modified the milestones: 6, MVP+1 Jan 19, 2024
jmpesp added a commit to jmpesp/omicron that referenced this issue Mar 7, 2024
Volumes are "checked out" from Nexus for many reasons, some of which
include sending to another service for use in `Volume::construct`. When
that service activates the resulting Volume, this will forcibly take
over any existing downstairs connections based on the Upstairs'
generation number. This is intentional, and was designed so Nexus, in
handing out Volumes with increasing generation numbers, can be sure that
the resulting Volume works no matter what (for example, even if a
previous Upstairs is wedged somehow, even if the service that is running
the previous Upstairs is no longer accepting network connections).

Up until now, Nexus wouldn't allow checking out a Volume if there is any
chance a Propolis could be running that may use that Volume. This meant
restricting certain operations, like creating a snapshot when a disk is
attached to an instance that is stopped: any action Nexus would take to
attempt a snapshot using a Pantry would race with a user's request to
start that instance, and if the Volume checkouts occur in the wrong
order the Pantry would take over connections from Propolis, resulting in
guest OS errors.

Nexus _can_ do this safely though: it has all the information required
to know when a checkout is safe to do, and when it may not be safe.
This commit adds checks to the Volume checkout transaction that are
based on the reason that checkout is occurring, and requires call sites
that are performing a checkout to say why they are. Because these checks
are performed inside a transaction, Nexus can say for sure when it is
safe to allow a Volume to be checked out for a certain reason.

For example, in the scenario of taking a snapshot of a disk attached to
an instance that is stopped, there are two checkout operations that have
the possibility of racing:

1) the one that Nexus will send to a Pantry during a snapshot create
   saga.

2) the one that Nexus will send to a Propolis during an instance start
   saga.

If 1 occurs before 2, then Propolis will take over the downstairs
connections that the Pantry has established, and the snapshot create
saga will fail, but the guest OS for that Propolis will not see any
errors. If 2 occurs before 1, then the 1 checkout will fail due to one
of the conditions added in this commit: the checkout is being performed
for use with a Pantry, and a Propolis _may_ exist, so reject the
checkout attempt.

Fixes oxidecomputer#3289.
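The reason-based check the commit describes could look roughly like this (hypothetical enum and function names; the actual check runs inside Nexus's Volume checkout database transaction):

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum CheckoutReason {
    // Checkout for use by the Pantry (e.g. snapshot of a detached or
    // stopped-instance disk).
    Pantry,
    // Checkout for use by a Propolis during instance start.
    InstanceStart,
}

#[derive(Debug, Clone, Copy, PartialEq)]
enum PropolisPresence {
    DefinitelyNone,
    MaybeRunning,
}

// Performed inside the checkout transaction: a Pantry checkout is rejected
// whenever a Propolis may exist, so the later of two racing checkouts
// fails safely instead of stealing the downstairs connections.
fn checkout_allowed(reason: CheckoutReason, propolis: PropolisPresence) -> bool {
    match (reason, propolis) {
        (CheckoutReason::Pantry, PropolisPresence::MaybeRunning) => false,
        _ => true,
    }
}
```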
@askfongjojo askfongjojo modified the milestones: MVP+1, 8 Mar 9, 2024
jmpesp added a commit that referenced this issue Apr 3, 2024