
Disk-attach subsaga can panic Nexus if the instance fails to provision #1713

Closed · bnaecker opened this issue Sep 15, 2022 · 4 comments · Fixed by #1715
Labels: nexus (Related to nexus)

bnaecker (Collaborator) commented:

I'm testing some updates to OPTE, and ran into a situation where Nexus panics if the instance fails to start. This appears to be independent of the OPTE changes, and I believe it should be possible to reproduce on the latest main branch as well. I'll try to provide as much context as I can.

I built and installed the control plane as normal, using ./tools/create_virtual_hardware.sh and omicron-package {package,install}. All of Omicron appeared to come up as expected. I then created an IP pool, a few global images (populated with ./tools/populate/populate-images.sh), a disk, and an instance that should attach that disk, all using the oxide CLI. The instance provision failed with the following:

bnaecker@feldspar : ~/omicron $ oxide api /organizations/o/projects/p/instances --method POST --input - <<EOF
        {
          "name": "i$i",
          "description": "i",
          "hostname": "myinst",
          "memory": 1073741824,
          "ncpus": 2,
          "disks": [
            {
              "type": "attach",
              "name": "debian$i"
            }
          ],
          "external_ips": [{"type": "ephemeral"}]
        }
EOF
error sending request for url (http://192.168.1.20//organizations/o/projects/p/instances): connection closed before message completed
bnaecker@feldspar : ~/omicron $

This isn't a normal 400; it looks like Nexus actually crashed. We can double-check that by looking at the Nexus log file after logging into the oxz_nexus zone:

root@oxz_nexus:~# tail -f $(svcs -L nexus)
<SNIP>
{"msg":"update for saga cached state","v":0,"name":"nexus","level":30,"time":"2022-09-15T17:07:05.554247251Z","hostname":"oxz_nexus","pid":27116,"sec_id":"e6bff1ff-24fb-49dc-a54e-c6a350cd4d6c","component":"SEC","component":"nexus","component":"ServerContext","name":"e6bff1ff-24fb-49dc-a54e-c6a350cd4d6c","new_state":"Unwinding","saga_id":"27ac2b12-5a0b-434c-97a1-ac1effbb9cf5"}
{"msg":"updating state","v":0,"name":"nexus","level":30,"time":"2022-09-15T17:07:05.555095628Z","hostname":"oxz_nexus","pid":27116,"component":"SecStore","component":"nexus","component":"ServerContext","name":"e6bff1ff-24fb-49dc-a54e-c6a350cd4d6c","new_state":"unwinding","saga_id":"27ac2b12-5a0b-434c-97a1-ac1effbb9cf5"}
thread 'tokio-runtime-worker' panicked at 'called `Result::unwrap()` on an `Err` value: action failed', /home/bnaecker/.cargo/registry/src/github.com-1ecc6299db9ec823/steno-0.2.0/src/saga_exec.rs:1149:27
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
[ Sep 15 10:07:06 Stopping because all processes in service exited. ]
[ Sep 15 10:07:06 Executing stop method (:kill). ]
[ Sep 15 10:07:06 Executing start method ("ctrun -l child -o noorphan,regent /opt/oxide/nexus/bin/nexus /var/svc/manifest/site/nexus/config.toml &"). ]
[ Sep 15 10:07:06 Method "start" exited with status 0. ]
note: configured to log to "/dev/stdout"
{"msg":"setting up nexus server","v":0,"name":"nexus","level":30,"time":"2022-09-15T17:07:06.4843665Z","hostname":"oxz_nexus","pid":18440,"name":"e6bff1ff-24fb-49dc-a54e-c6a350cd4d6c"}
<SNIP>

So Nexus appears to panic during a saga unwind. That error comes from Steno, which currently just unwrap()s any error from an undo action, because it's not yet clear what to do there. So which saga node failed? That's not obvious from the log file, which only shows info-level messages and higher. I ran the whole process again, this time capturing debug messages with DTrace:

bnaecker@feldspar : ~/omicron $ pfexec dtrace -p $(pgrep nexus) -x strsize=4k -qn 'slog$target:::debug { printf("%s\n", copyinstr(arg0)); }'
{"ok":{"location":{"module":"omicron_nexus::authn::external::spoof","file":"nexus/src/authn/external/spoof.rs","line":97},"level":"DEBUG","timestamp":"2022-09-15T17:05:06.843174524Z","message":"looking up silo for user","kv":{"component":"dropshot_external","local_addr":"192.168.1.20:80","method":"POST","name":"e6bff1ff-24fb-49dc-a54e-c6a350cd4d6c","remote_addr":"192.168.1.145:54836","req_id":"381dfffe-3561-424c-841f-348650469062","silo_user_id":"001de000-05e4-4000-8000-000000004007","uri":"//organizations/o/projects/p/instances"}}}
<SNIP>
{"ok":{"location":{"module":"steno::sec","file":"/home/bnaecker/.cargo/registry/src/github.com-1ecc6299db9ec823/steno-0.2.0/src/sec.rs","line":1227},"level":"DEBUG","timestamp":"2022-09-15T17:07:05.856940294Z","message":"saga log event","kv":{"component":"ServerContext","name":"e6bff1ff-24fb-49dc-a54e-c6a350cd4d6c","new_state":"N046 undo_started","sec_id":"e6bff1ff-24fb-49dc-a54e-c6a350cd4d6c"}}}
{"ok":{"location":{"module":"omicron_nexus::db::sec_store","file":"nexus/src/db/sec_store.rs","line":56},"level":"DEBUG","timestamp":"2022-09-15T17:07:05.856967924Z","message":"recording saga event","kv":{"component":"ServerContext","event_type":"UndoStarted","name":"e6bff1ff-24fb-49dc-a54e-c6a350cd4d6c","node_id":"46","saga_id":"27ac2b12-5a0b-434c-97a1-ac1effbb9cf5"}}}
<SNIP>

To make that a bit clearer, here are those last two events pretty-printed:

{
  "ok": {
    "location": {
      "module": "steno::sec",
      "file": "/home/bnaecker/.cargo/registry/src/github.com-1ecc6299db9ec823/steno-0.2.0/src/sec.rs",
      "line": 1227
    },
    "level": "DEBUG",
    "timestamp": "2022-09-15T17:07:05.856940294Z",
    "message": "saga log event",
    "kv": {
      "component": "ServerContext",
      "name": "e6bff1ff-24fb-49dc-a54e-c6a350cd4d6c",
      "new_state": "N046 undo_started",
      "sec_id": "e6bff1ff-24fb-49dc-a54e-c6a350cd4d6c"
    }
  }
}
{
  "ok": {
    "location": {
      "module": "omicron_nexus::db::sec_store",
      "file": "nexus/src/db/sec_store.rs",
      "line": 56
    },
    "level": "DEBUG",
    "timestamp": "2022-09-15T17:07:05.856967924Z",
    "message": "recording saga event",
    "kv": {
      "component": "ServerContext",
      "event_type": "UndoStarted",
      "name": "e6bff1ff-24fb-49dc-a54e-c6a350cd4d6c",
      "node_id": "46",
      "saga_id": "27ac2b12-5a0b-434c-97a1-ac1effbb9cf5"
    }
  }
}

Ok, so Steno is running the undo action for the node with ID N046. (Note that the second <SNIP>d portion in that log file is all authz checks. There's nothing else after that in the file at all, since DTrace exited when the Nexus process panicked.)

So why did we start unwinding this saga, and which saga are we in? Here are a few pretty-printed lines from the DTrace output, showing where the unwind of the saga with this ID began.

{
  "ok": {
    "location": {
      "module": "sled_agent_client",
      "file": "sled-agent-client/src/lib.rs",
      "line": 12
    },
    "level": "DEBUG",
    "timestamp": "2022-09-15T17:07:05.548156369Z",
    "message": "client response",
    "kv": {
      "SledAgent": "fb0f7546-4d46-40ca-9d56-cbb810684ca7",
      "component": "ServerContext",
      "name": "e6bff1ff-24fb-49dc-a54e-c6a350cd4d6c",
      "result": "Err(reqwest::Error { kind: Request, url: Url { scheme: \"http\", cannot_be_a_base: false, username: \"\", password: None, host: Some(Ipv6(fd00:1122:3344:101::1)), port: Some(12345), path: \"/instances/4654cb63-6bd4-4d3a-b0b0-3dc95c29d868\", query: None, fragment: None }, source: TimedOut })"
    }
  }
}
{
  "ok": {
    "location": {
      "module": "steno::sec",
      "file": "/home/bnaecker/.cargo/registry/src/github.com-1ecc6299db9ec823/steno-0.2.0/src/sec.rs",
      "line": 1227
    },
    "level": "DEBUG",
    "timestamp": "2022-09-15T17:07:05.549793982Z",
    "message": "saga log event",
    "kv": {
      "component": "ServerContext",
      "name": "e6bff1ff-24fb-49dc-a54e-c6a350cd4d6c",
      "new_state": "N083 failed",
      "sec_id": "e6bff1ff-24fb-49dc-a54e-c6a350cd4d6c"
    }
  }
}
{
  "ok": {
    "location": {
      "module": "omicron_nexus::db::sec_store",
      "file": "nexus/src/db/sec_store.rs",
      "line": 56
    },
    "level": "DEBUG",
    "timestamp": "2022-09-15T17:07:05.549838653Z",
    "message": "recording saga event",
    "kv": {
      "component": "ServerContext",
      "event_type": "Failed(ActionFailed { source_error: Object {\"InternalError\": Object {\"internal_message\": String(\"CommunicationError: error sending request for url (http://[fd00:1122:3344:101::1]:12345/instances/4654cb63-6bd4-4d3a-b0b0-3dc95c29d868): operation timed out\")}} })",
      "name": "e6bff1ff-24fb-49dc-a54e-c6a350cd4d6c",
      "node_id": "83",
      "saga_id": "27ac2b12-5a0b-434c-97a1-ac1effbb9cf5"
    }
  }
}

Ok, so a request to the sled-agent timed out, node 83 failed, and the SEC started to unwind the saga, beginning with node 82. We're running saga 27ac2b12-5a0b-434c-97a1-ac1effbb9cf5, and we panic while undoing node N046. So which node is that?

I wasn't sure how to get that information, other than actually looking at the saga itself in the database. So:

root@[fd00:1122:3344:101::2]:32221/omicron> select name, saga_state from saga where id = '27ac2b12-5a0b-434c-97a1-ac1effbb9cf5';
       name       | saga_state
------------------+-------------
  instance-create | unwinding
(1 row)


Time: 1ms total (execution 1ms / network 0ms)

So we're unwinding the instance creation saga with that ID. Makes sense. What is node 46?

root@[fd00:1122:3344:101::2]:32221/omicron> select saga_dag->'graph'->'nodes'->46 from saga where id = '27ac2b12-5a0b-434c-97a1-ac1effbb9cf5';
                                                                  ?column?
---------------------------------------------------------------------------------------------------------------------------------------------
  {"Action": {"action_name": "instance-create.attach-disks-to-instance", "label": "AttachDisksToInstance-0", "name": "attach_disk_output"}}
(1 row)


Time: 1ms total (execution 1ms / network 0ms)

We're failing to run node "AttachDisksToInstance-0", specifically its undo action. That's here:

async fn ensure_instance_disk_attach_state(

There are a few places we propagate the underlying errors from database calls, but the one that's relevant appears to be here:

return Err(ActionError::action_failed(Error::invalid_request(

That's because the instance appears to still be in the starting state, according to the database.

root@[fd00:1122:3344:101::2]:32221/omicron> select id, name, state from instance where time_deleted is null;
                   id                  | name |  state
---------------------------------------+------+-----------
  4654cb63-6bd4-4d3a-b0b0-3dc95c29d868 | i    | starting
  d1e6206e-d148-4942-a7f4-5eece83cb102 | i0   | starting
(2 rows)


Time: 1ms total (execution 1ms / network 0ms)

This is the instance with the imaginative name i, which I verified from the Nexus log.
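
The shape of that check, as a minimal illustrative sketch (the names here are made up, not the actual Nexus code; the commit message further down notes that detach was only allowed for instances that were creating or stopped, which is the set used here):

// Illustrative sketch only, not the actual Nexus datastore code: the
// disk-detach path rejects instances whose state is outside a small
// allowed set, and `Starting` is not in that set.
#[derive(Debug, PartialEq)]
enum InstanceState {
    Creating,
    Starting,
    Stopped,
}

// Hypothetical stand-in for `ok_to_detach_instance_states`.
const OK_TO_DETACH: &[InstanceState] =
    &[InstanceState::Creating, InstanceState::Stopped];

fn ensure_detach_allowed(state: &InstanceState) -> Result<(), String> {
    if OK_TO_DETACH.contains(state) {
        Ok(())
    } else {
        // Corresponds to the `Error::invalid_request(...)` wrapped in
        // `ActionError::action_failed(...)` that the undo action returns.
        Err(format!("cannot detach disk from instance in state {:?}", state))
    }
}

fn main() {
    // The instance is still `Starting` when the saga unwinds, so the
    // undo action's detach fails here.
    assert!(ensure_detach_allowed(&InstanceState::Starting).is_err());
}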

To recap the whole chain of events:

  1. Instance creation times out
  2. Sled-agent tears down the zone (there's no Propolis zone hanging around)
  3. Nexus does not update the instance state. This might be correct; I'm not sure. It doesn't really have any other information from the sled-agent, so this may be the right behavior here.
  4. The instance-creation saga is unwound
  5. The undo action detaching all disks from the instance returns an Err, because the instance is still in the starting state.
  6. Steno unwrap()s the error, and Nexus panics.
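
That last step is the crux. Here is a tiny, self-contained stand-in for the failure mode (this is not Steno's actual executor code, and undo_detach_disks is a made-up name):

// Stand-in for the failure mode, not Steno's real code: an undo action
// returns Err, and the executor unwrap()s it, so the whole process
// aborts instead of the saga surfacing a recoverable error.
fn undo_detach_disks() -> Result<(), String> {
    // The real undo action failed because the instance was still in
    // the `starting` state.
    Err("cannot detach disk: instance is still starting".to_string())
}

fn main() {
    // Equivalent to the unwrap() the panic message points at
    // (steno-0.2.0/src/saga_exec.rs:1149): this panics with
    // "called `Result::unwrap()` on an `Err` value".
    undo_detach_disks().unwrap();
}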
bnaecker added the nexus label Sep 15, 2022
jmpesp self-assigned this Sep 15, 2022
jmpesp (Contributor) commented Sep 15, 2022:

I'm making a fix for this that addresses point 3 specifically, but point 6 is troubling. It looks like Steno will unwrap any error in undo_node! Many of the saga undo nodes could return Err. We shouldn't address this by panicking though.

davepacheco (Collaborator) commented:

Yeah, I think that’s oxidecomputer/steno#26

bnaecker (Collaborator, Author) commented:

@jmpesp pointed out in chat that there are other undo actions that return errors, specifically instance_delete_network_interface, which will fail because the instance is not in the Stopped state. This issue could probably be retitled to something like "Timeout provisioning instance panics the instance-create saga unwind".

jmpesp added a commit to jmpesp/omicron that referenced this issue Sep 15, 2022
If a call to sled-agent's `instance_put` fails with anything other than
a 400 (bad request), set instance's state to failed - we cannot know
what state the instance is in.

This exposed another two problems:

- Nexus would only delete network interfaces off instances that were stopped
- Nexus would only detach disks off instances that were creating or stopped.

This commit changes the relevant network interface queries to allow
deletion of network interfaces for instances that are either stopped or
failed, and adds Failed to the list of ok_to_detach_instance_states for
disks. Note that network interface insertion still requires an instance
to be stopped, not failed.

Closes oxidecomputer#1713
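
As a rough illustration of the first part of that commit message, here is what "set instance's state to failed" amounts to; the error type and function below are hypothetical, not the code from #1715:

// Sketch of the described behavior, not the code from the PR: if the
// `instance_put` call to the sled-agent fails with anything other than
// a 400, Nexus can no longer know what state the instance is in, so it
// records the instance as Failed.
#[derive(Debug, PartialEq)]
enum InstanceState {
    Failed,
}

// Hypothetical stand-in for the sled-agent client error; a timeout has
// no HTTP status at all.
struct PutError {
    status: Option<u16>,
}

fn state_after_put_error(err: &PutError) -> Option<InstanceState> {
    match err.status {
        // A 400 means the request itself was bad; leave the recorded
        // instance state alone and report the error as-is.
        Some(400) => None,
        // Anything else (5xx, timeout, connection error, ...) leaves the
        // instance in an unknown state, so mark it Failed.
        _ => Some(InstanceState::Failed),
    }
}

fn main() {
    // The timeout in this issue produced no HTTP status at all.
    let err = PutError { status: None };
    assert_eq!(state_after_put_error(&err), Some(InstanceState::Failed));
}

With the instance recorded as Failed, the unwind's disk-detach and NIC-delete steps no longer trip over the state check that caused the panic above, since the commit also adds Failed to the states those operations accept.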
jmpesp (Contributor) commented Sep 15, 2022:

#1715 addresses point 3 (and deals with the aftermath :))

jmpesp added a commit that referenced this issue Sep 16, 2022