Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Set instance state to failed on bad instance_put #1715

Merged
merged 3 commits into from
Sep 16, 2022

Conversation

jmpesp
Copy link
Contributor

@jmpesp jmpesp commented Sep 15, 2022

If a call to sled-agent's instance_put fails with anything other than a 400 (bad request), set instance's state to failed - we cannot know what state the instance is in.

This exposed another two problems:

  • Nexus would only delete network interfaces off instances that were stopped
  • Nexus would only detach disks off instances that were creating or stopped.

This commit changes the relevant network interface queries to allow deletion of network interfaces for instances that are either stopped or failed, and adds Failed to the list of ok_to_detach_instance_states for disks. Note that network interface insertion still requires an instance to be stopped, not failed.

Closes #1713

If a call to sled-agent's `instance_put` fails with anything other than
a 400 (bad request), set instance's state to failed - we cannot know
what state the instance is in.

This exposed another two problems:

- Nexus would only delete network interfaces off instances that were stopped
- Nexus would only detach disks off instances that were creating or stopped.

This commit changes the relevant network interface queries to allow
deletion of network interfaces for instances that are either stopped or
failed, and adds Failed to the list of ok_to_detach_instance_states for
disks. Note that network interface insertion still requires an instance
to be stopped, not failed.

Closes oxidecomputer#1713
@jmpesp jmpesp requested a review from bnaecker September 15, 2022 21:28
@@ -194,6 +194,13 @@ impl SledAgent {
initial_hardware: InstanceHardware,
target: InstanceRuntimeStateRequested,
) -> Result<InstanceRuntimeState, Error> {
let ncpus: i64 = (&initial_hardware.runtime.ncpus).into();
if ncpus > 16 {
return Err(Error::internal_error(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know this should probably be bad_request instead, but for test_instance_fails_to_boot_with_disk I needed it to be an internal_error. I welcome any input for better ways to do this :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First thought was InvalidValue but that also just gets mapped to 400 as well :/

Either way, add a comment here to indicate why this specific error

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added comments in be9f7a3

nexus/src/app/instance.rs Outdated Show resolved Hide resolved
Comment on lines +644 to +647
// this is unfortunate, but sled_agent_client::Error doesn't
// implement Copy, and can't be match'ed upon below without this
// line.
let e = e.into();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does the Copy bit actually matter? It'd doesn't seem like you need to copy anywhere, and the .into() is gonna be needed regardless.

for e.g. you could do:

match e {
    e @ sled_agent_client::Error::InvalidRequest { .. } => Err(e.into()),
    e => {
        // ...
       Err(e.into())
    }
}

or just hoist the .into():

match e.into() {
    e @ Error::InvalidRequest { .. } => Err(e),
    e => {
        // ...
       Err(e)
    }
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've tried all variants of moving around the borrow, and .into(), and what's there is the only thing I could make work unfortunately.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you shouldn't need a borrow at all?

nexus/tests/integration_tests/instances.rs Show resolved Hide resolved
@@ -194,6 +194,13 @@ impl SledAgent {
initial_hardware: InstanceHardware,
target: InstanceRuntimeStateRequested,
) -> Result<InstanceRuntimeState, Error> {
let ncpus: i64 = (&initial_hardware.runtime.ncpus).into();
if ncpus > 16 {
return Err(Error::internal_error(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First thought was InvalidValue but that also just gets mapped to 400 as well :/

Either way, add a comment here to indicate why this specific error

Copy link
Contributor

@luqmana luqmana left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Small nits but thanks!

I think I might've inadvertently exposed this by changing exactly when an instance transitions from Creating to Starting.

Copy link
Collaborator

@bnaecker bnaecker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, thanks for doing this!

@jmpesp jmpesp merged commit 92cf5cb into oxidecomputer:main Sep 16, 2022
@jmpesp jmpesp deleted the set_instance_to_failed branch September 16, 2022 23:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Disk-attach subsaga can panic Nexus if the instance fails to provision
3 participants