
Disk-attach subsaga can panic Nexus if the instance fails to provision #1713

Closed · bnaecker opened this issue Sep 15, 2022 · 4 comments · Fixed by #1715
Labels: nexus (Related to nexus)

bnaecker (Collaborator) commented:

I'm testing some updates to OPTE, and ran into a situation where Nexus panics if the instance fails to start. This appears to be independent of the OPTE changes, and I believe it should be possible to reproduce on the latest main branch as well. I'll try to provide as much context as I can.

I built and installed the control plane as normal, using ./tools/create_virtual_hardware.sh and omicron-package {package,install}. All of Omicron appeared to come up as expected. I then created an IP pool, a few global images (populated with ./tools/populate/populate-images.sh), a disk, and an instance that should attach that disk, all using the oxide CLI. The instance provision failed with the following:

bnaecker@feldspar : ~/omicron $ oxide api /organizations/o/projects/p/instances --method POST --input - <<EOF
        {
          "name": "i$i",
          "description": "i",
          "hostname": "myinst",
          "memory": 1073741824,
          "ncpus": 2,
          "disks": [
            {
              "type": "attach",
              "name": "debian$i"
            }
          ],
          "external_ips": [{"type": "ephemeral"}]
        }
EOF
error sending request for url (http://192.168.1.20//organizations/o/projects/p/instances): connection closed before message completed
bnaecker@feldspar : ~/omicron $

This isn't a normal 400; it looks like Nexus actually crashed. We can double-check that by looking at the Nexus log file after logging into the oxz_nexus zone:

root@oxz_nexus:~# tail -f $(svcs -L nexus)
<SNIP>
{"msg":"update for saga cached state","v":0,"name":"nexus","level":30,"time":"2022-09-15T17:07:05.554247251Z","hostname":"oxz_nexus","pid":27116,"sec_id":"e6bff1ff-24fb-49dc-a54e-c6a350cd4d6c","component":"SEC","component":"nexus","component":"ServerContext","name":"e6bff1ff-24fb-49dc-a54e-c6a350cd4d6c","new_state":"Unwinding","saga_id":"27ac2b12-5a0b-434c-97a1-ac1effbb9cf5"}
{"msg":"updating state","v":0,"name":"nexus","level":30,"time":"2022-09-15T17:07:05.555095628Z","hostname":"oxz_nexus","pid":27116,"component":"SecStore","component":"nexus","component":"ServerContext","name":"e6bff1ff-24fb-49dc-a54e-c6a350cd4d6c","new_state":"unwinding","saga_id":"27ac2b12-5a0b-434c-97a1-ac1effbb9cf5"}
thread 'tokio-runtime-worker' panicked at 'called `Result::unwrap()` on an `Err` value: action failed', /home/bnaecker/.cargo/registry/src/github.com-1ecc6299db9ec823/steno-0.2.0/src/saga_exec.rs:1149:27
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
[ Sep 15 10:07:06 Stopping because all processes in service exited. ]
[ Sep 15 10:07:06 Executing stop method (:kill). ]
[ Sep 15 10:07:06 Executing start method ("ctrun -l child -o noorphan,regent /opt/oxide/nexus/bin/nexus /var/svc/manifest/site/nexus/config.toml &"). ]
[ Sep 15 10:07:06 Method "start" exited with status 0. ]
note: configured to log to "/dev/stdout"
{"msg":"setting up nexus server","v":0,"name":"nexus","level":30,"time":"2022-09-15T17:07:06.4843665Z","hostname":"oxz_nexus","pid":18440,"name":"e6bff1ff-24fb-49dc-a54e-c6a350cd4d6c"}
<SNIP>

So Nexus appears to panic during a saga unwind. That error comes from Steno, which currently just unwrap()s any error from an undo action, because it's not yet clear what to do there. So which saga node failed? That's not obvious from the log file, which only shows info-level messages and higher. I ran the whole process again, this time capturing debug messages with DTrace:

bnaecker@feldspar : ~/omicron $ pfexec dtrace -p $(pgrep nexus) -x strsize=4k -qn 'slog$target:::debug { printf("%s\n", copyinstr(arg0)); }'
{"ok":{"location":{"module":"omicron_nexus::authn::external::spoof","file":"nexus/src/authn/external/spoof.rs","line":97},"level":"DEBUG","timestamp":"2022-09-15T17:05:06.843174524Z","message":"looking up silo for user","kv":{"component":"dropshot_external","local_addr":"192.168.1.20:80","method":"POST","name":"e6bff1ff-24fb-49dc-a54e-c6a350cd4d6c","remote_addr":"192.168.1.145:54836","req_id":"381dfffe-3561-424c-841f-348650469062","silo_user_id":"001de000-05e4-4000-8000-000000004007","uri":"//organizations/o/projects/p/instances"}}}
<SNIP>
{"ok":{"location":{"module":"steno::sec","file":"/home/bnaecker/.cargo/registry/src/github.com-1ecc6299db9ec823/steno-0.2.0/src/sec.rs","line":1227},"level":"DEBUG","timestamp":"2022-09-15T17:07:05.856940294Z","message":"saga log event","kv":{"component":"ServerContext","name":"e6bff1ff-24fb-49dc-a54e-c6a350cd4d6c","new_state":"N046 undo_started","sec_id":"e6bff1ff-24fb-49dc-a54e-c6a350cd4d6c"}}}
{"ok":{"location":{"module":"omicron_nexus::db::sec_store","file":"nexus/src/db/sec_store.rs","line":56},"level":"DEBUG","timestamp":"2022-09-15T17:07:05.856967924Z","message":"recording saga event","kv":{"component":"ServerContext","event_type":"UndoStarted","name":"e6bff1ff-24fb-49dc-a54e-c6a350cd4d6c","node_id":"46","saga_id":"27ac2b12-5a0b-434c-97a1-ac1effbb9cf5"}}}
<SNIP>

To make that a bit clearer, here are those last two events pretty-printed:

{
  "ok": {
    "location": {
      "module": "steno::sec",
      "file": "/home/bnaecker/.cargo/registry/src/github.com-1ecc6299db9ec823/steno-0.2.0/src/sec.rs",
      "line": 1227
    },
    "level": "DEBUG",
    "timestamp": "2022-09-15T17:07:05.856940294Z",
    "message": "saga log event",
    "kv": {
      "component": "ServerContext",
      "name": "e6bff1ff-24fb-49dc-a54e-c6a350cd4d6c",
      "new_state": "N046 undo_started",
      "sec_id": "e6bff1ff-24fb-49dc-a54e-c6a350cd4d6c"
    }
  }
}
{
  "ok": {
    "location": {
      "module": "omicron_nexus::db::sec_store",
      "file": "nexus/src/db/sec_store.rs",
      "line": 56
    },
    "level": "DEBUG",
    "timestamp": "2022-09-15T17:07:05.856967924Z",
    "message": "recording saga event",
    "kv": {
      "component": "ServerContext",
      "event_type": "UndoStarted",
      "name": "e6bff1ff-24fb-49dc-a54e-c6a350cd4d6c",
      "node_id": "46",
      "saga_id": "27ac2b12-5a0b-434c-97a1-ac1effbb9cf5"
    }
  }
}

Ok, so Steno is running the undo action for the node with ID N046. (Note that the second <SNIP>d portion in that log file is all authz checks. There's nothing else after that in the file at all, since DTrace exited when the Nexus process panicked.)

So why did we start unwinding this saga, and which saga are we in? Here are a few pretty-printed lines from the DTrace output, showing where the unwind of the saga with this ID began.

{
  "ok": {
    "location": {
      "module": "sled_agent_client",
      "file": "sled-agent-client/src/lib.rs",
      "line": 12
    },
    "level": "DEBUG",
    "timestamp": "2022-09-15T17:07:05.548156369Z",
    "message": "client response",
    "kv": {
      "SledAgent": "fb0f7546-4d46-40ca-9d56-cbb810684ca7",
      "component": "ServerContext",
      "name": "e6bff1ff-24fb-49dc-a54e-c6a350cd4d6c",
      "result": "Err(reqwest::Error { kind: Request, url: Url { scheme: \"http\", cannot_be_a_base: false, username: \"\", password: None, host: Some(Ipv6(fd00:1122:3344:101::1)), port: Some(12345), path: \"/instances/4654cb63-6bd4-4d3a-b0b0-3dc95c29d868\", query: None, fragment: None }, source: TimedOut })"
    }
  }
}
{
  "ok": {
    "location": {
      "module": "steno::sec",
      "file": "/home/bnaecker/.cargo/registry/src/github.com-1ecc6299db9ec823/steno-0.2.0/src/sec.rs",
      "line": 1227
    },
    "level": "DEBUG",
    "timestamp": "2022-09-15T17:07:05.549793982Z",
    "message": "saga log event",
    "kv": {
      "component": "ServerContext",
      "name": "e6bff1ff-24fb-49dc-a54e-c6a350cd4d6c",
      "new_state": "N083 failed",
      "sec_id": "e6bff1ff-24fb-49dc-a54e-c6a350cd4d6c"
    }
  }
}
{
  "ok": {
    "location": {
      "module": "omicron_nexus::db::sec_store",
      "file": "nexus/src/db/sec_store.rs",
      "line": 56
    },
    "level": "DEBUG",
    "timestamp": "2022-09-15T17:07:05.549838653Z",
    "message": "recording saga event",
    "kv": {
      "component": "ServerContext",
      "event_type": "Failed(ActionFailed { source_error: Object {\"InternalError\": Object {\"internal_message\": String(\"CommunicationError: error sending request for url (http://[fd00:1122:3344:101::1]:12345/instances/4654cb63-6bd4-4d3a-b0b0-3dc95c29d868): operation timed out\")}} })",
      "name": "e6bff1ff-24fb-49dc-a54e-c6a350cd4d6c",
      "node_id": "83",
      "saga_id": "27ac2b12-5a0b-434c-97a1-ac1effbb9cf5"
    }
  }
}

Ok, so a request to the sled-agent timed out, node 83 failed, and the SEC started to unwind the saga, beginning with node 82. We're running saga 27ac2b12-5a0b-434c-97a1-ac1effbb9cf5, and we panic while undoing node N046. So which node is that?

I wasn't sure how to get that information, other than actually looking at the saga itself in the database. So:

root@[fd00:1122:3344:101::2]:32221/omicron> select name, saga_state from saga where id = '27ac2b12-5a0b-434c-97a1-ac1effbb9cf5';
       name       | saga_state
------------------+-------------
  instance-create | unwinding
(1 row)


Time: 1ms total (execution 1ms / network 0ms)

So we're unwinding the instance creation saga with that ID. Makes sense. What is node 46?

root@[fd00:1122:3344:101::2]:32221/omicron> select saga_dag->'graph'->'nodes'->46 from saga where id = '27ac2b12-5a0b-434c-97a1-ac1effbb9cf5';
                                                                  ?column?
---------------------------------------------------------------------------------------------------------------------------------------------
  {"Action": {"action_name": "instance-create.attach-disks-to-instance", "label": "AttachDisksToInstance-0", "name": "attach_disk_output"}}
(1 row)


Time: 1ms total (execution 1ms / network 0ms)

We're failing to run node "AttachDisksToInstance-0", specifically its undo action. That's here:

async fn ensure_instance_disk_attach_state(

There are a few places we propagate the underlying errors from database calls, but the one that's relevant appears to be here:

return Err(ActionError::action_failed(Error::invalid_request(

That's because the instance appears to still be in the starting state, according to the database.

root@[fd00:1122:3344:101::2]:32221/omicron> select id, name, state from instance where time_deleted is null;
                   id                  | name |  state
---------------------------------------+------+-----------
  4654cb63-6bd4-4d3a-b0b0-3dc95c29d868 | i    | starting
  d1e6206e-d148-4942-a7f4-5eece83cb102 | i0   | starting
(2 rows)


Time: 1ms total (execution 1ms / network 0ms)

This is the instance with the imaginative name i, which I verified from the Nexus log.
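
The shape of that check, as a minimal illustrative sketch (the names here are made up, not the actual Nexus code; the commit message further down notes that detach was only allowed for instances that were creating or stopped, which is the set used here):

// Illustrative sketch only, not the actual Nexus datastore code: the
// disk-detach path rejects instances whose state is outside a small
// allowed set, and `Starting` is not in that set.
#[derive(Debug, PartialEq)]
enum InstanceState {
    Creating,
    Starting,
    Stopped,
}

// Hypothetical stand-in for `ok_to_detach_instance_states`.
const OK_TO_DETACH: &[InstanceState] =
    &[InstanceState::Creating, InstanceState::Stopped];

fn ensure_detach_allowed(state: &InstanceState) -> Result<(), String> {
    if OK_TO_DETACH.contains(state) {
        Ok(())
    } else {
        // Corresponds to the `Error::invalid_request(...)` wrapped in
        // `ActionError::action_failed(...)` that the undo action returns.
        Err(format!("cannot detach disk from instance in state {:?}", state))
    }
}

fn main() {
    // The instance is still `Starting` when the saga unwinds, so the
    // undo action's detach fails here.
    assert!(ensure_detach_allowed(&InstanceState::Starting).is_err());
}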

To recap the whole chain of events:

  1. Instance creation times out
  2. Sled-agent tears down the zone (there's no Propolis zone hanging around)
  3. Nexus does not update the instance state. This might be correct; I'm not sure. It doesn't really have any other information from the sled-agent, so this may be the right behavior here.
  4. The instance-creation saga is unwound
  5. The undo action detaching all disks from the instance returns an Err, because the instance is still in the starting state.
  6. Steno unwrap()s the error, and Nexus panics.
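
That last step is the crux. Here is a tiny, self-contained stand-in for the failure mode (this is not Steno's actual executor code, and undo_detach_disks is a made-up name):

// Stand-in for the failure mode, not Steno's real code: an undo action
// returns Err, and the executor unwrap()s it, so the whole process
// aborts instead of the saga surfacing a recoverable error.
fn undo_detach_disks() -> Result<(), String> {
    // The real undo action failed because the instance was still in
    // the `starting` state.
    Err("cannot detach disk: instance is still starting".to_string())
}

fn main() {
    // Equivalent to the unwrap() the panic message points at
    // (steno-0.2.0/src/saga_exec.rs:1149): this panics with
    // "called `Result::unwrap()` on an `Err` value".
    undo_detach_disks().unwrap();
}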
bnaecker added the nexus label Sep 15, 2022
jmpesp self-assigned this Sep 15, 2022
jmpesp (Contributor) commented Sep 15, 2022:

I'm making a fix for this that addresses point 3 specifically, but point 6 is troubling. It looks like Steno will unwrap any error in undo_node! Many of the saga undo nodes could return Err. We shouldn't address this by panicking though.

davepacheco (Collaborator) commented:

Yeah, I think that’s oxidecomputer/steno#26

bnaecker (Collaborator, Author) commented:

@jmpesp pointed out in chat that there are other undo actions that return errors, specifically instance_delete_network_interface, which will fail because the instance is not in the Stopped state. This issue could probably be retitled to something like "Timeout provisioning instance panics the instance-create saga unwind".

jmpesp added a commit to jmpesp/omicron that referenced this issue Sep 15, 2022
If a call to sled-agent's `instance_put` fails with anything other than
a 400 (bad request), set instance's state to failed - we cannot know
what state the instance is in.

This exposed another two problems:

- Nexus would only delete network interfaces off instances that were stopped
- Nexus would only detach disks off instances that were creating or stopped.

This commit changes the relevant network interface queries to allow
deletion of network interfaces for instances that are either stopped or
failed, and adds Failed to the list of ok_to_detach_instance_states for
disks. Note that network interface insertion still requires an instance
to be stopped, not failed.

Closes oxidecomputer#1713
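
As a rough illustration of the first part of that commit message, here is what "set instance's state to failed" amounts to; the error type and function below are hypothetical, not the code from #1715:

// Sketch of the described behavior, not the code from the PR: if the
// `instance_put` call to the sled-agent fails with anything other than
// a 400, Nexus can no longer know what state the instance is in, so it
// records the instance as Failed.
#[derive(Debug, PartialEq)]
enum InstanceState {
    Failed,
}

// Hypothetical stand-in for the sled-agent client error; a timeout has
// no HTTP status at all.
struct PutError {
    status: Option<u16>,
}

fn state_after_put_error(err: &PutError) -> Option<InstanceState> {
    match err.status {
        // A 400 means the request itself was bad; leave the recorded
        // instance state alone and report the error as-is.
        Some(400) => None,
        // Anything else (5xx, timeout, connection error, ...) leaves the
        // instance in an unknown state, so mark it Failed.
        _ => Some(InstanceState::Failed),
    }
}

fn main() {
    // The timeout in this issue produced no HTTP status at all.
    let err = PutError { status: None };
    assert_eq!(state_after_put_error(&err), Some(InstanceState::Failed));
}

With the instance recorded as Failed, the unwind's disk-detach and NIC-delete steps no longer trip over the state check that caused the panic above, since the commit also adds Failed to the states those operations accept.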
jmpesp (Contributor) commented Sep 15, 2022:

#1715 addresses point 3 (and deals with the aftermath :))

jmpesp added a commit that referenced this issue Sep 16, 2022