
nexus unwrap in saga_exec.rs #3085

Closed
leftwo opened this issue May 11, 2023 · 16 comments · Fixed by #3097
leftwo (Contributor) commented May 11, 2023

Thin on details for the moment; filing this to collect data.

Saw this on the dogfood rack:

11:13:34.434Z INFO 7838d1e5-c2d5-4b51-a18a-1076a54fb103 (ServerContext): created 0 built-in silos
11:13:34.434Z DEBG 7838d1e5-c2d5-4b51-a18a-1076a54fb103 (ServerContext): roles
    roles = RoleSet { roles: {} }
11:13:34.435Z DEBG 7838d1e5-c2d5-4b51-a18a-1076a54fb103 (ServerContext): authorize result
    action = Query
    actor = Some(Actor::UserBuiltin { user_builtin_id: 001de000-05e4-4000-8000-000000000001, .. })
    resource = Database
    result = Ok(())
11:13:34.436Z DEBG 7838d1e5-c2d5-4b51-a18a-1076a54fb103 (ServerContext): roles
    roles = RoleSet { roles: {} }
11:13:34.437Z DEBG 7838d1e5-c2d5-4b51-a18a-1076a54fb103 (ServerContext): roles
    actor_id = 001de000-05e4-4000-8000-000000004007
    authenticated = true
    roles = RoleSet { roles: {} }
    saga_node = CallPantrySnapshotForDisk
    saga_type = recovery
11:13:34.437Z DEBG 7838d1e5-c2d5-4b51-a18a-1076a54fb103 (ServerContext): authorize result
    action = Modify
    actor = Some(Actor::UserBuiltin { user_builtin_id: 001de000-05e4-4000-8000-000000000001, .. })
    resource = Database
    result = Ok(())
11:13:34.437Z DEBG 7838d1e5-c2d5-4b51-a18a-1076a54fb103 (ServerContext): attempting to create silo users
11:13:34.437Z DEBG 7838d1e5-c2d5-4b51-a18a-1076a54fb103 (ServerContext): roles
    roles = RoleSet { roles: {} }
11:13:34.437Z DEBG 7838d1e5-c2d5-4b51-a18a-1076a54fb103 (ServerContext): authorize result
    action = Query
    actor = Some(Actor::UserBuiltin { user_builtin_id: 001de000-05e4-4000-8000-000000000001, .. })
    resource = Database
    result = Ok(())
11:13:34.438Z DEBG 7838d1e5-c2d5-4b51-a18a-1076a54fb103 (ServerContext): authorize result
    action = Query
    actor = Some(Actor::SiloUser { silo_user_id: 001de000-05e4-4000-8000-000000004007, silo_id: 001de000-5110-4000-8000-000000000000, .. })
    actor_id = 001de000-05e4-4000-8000-000000004007
    authenticated = true
    resource = Database
    result = Ok(())
    saga_node = CallPantrySnapshotForDisk
    saga_type = recovery
11:13:34.439Z INFO 7838d1e5-c2d5-4b51-a18a-1076a54fb103 (ServerContext): created 0 silo users
11:13:34.439Z DEBG 7838d1e5-c2d5-4b51-a18a-1076a54fb103 (ServerContext): attempting to create silo user role assignments
11:13:34.439Z DEBG 7838d1e5-c2d5-4b51-a18a-1076a54fb103 (ServerContext): roles
    roles = RoleSet { roles: {} }
11:13:34.439Z DEBG 7838d1e5-c2d5-4b51-a18a-1076a54fb103 (ServerContext): authorize result
    action = Query
    actor = Some(Actor::UserBuiltin { user_builtin_id: 001de000-05e4-4000-8000-000000000001, .. })
    resource = Database
    result = Ok(())
11:13:34.440Z DEBG 7838d1e5-c2d5-4b51-a18a-1076a54fb103 (ServerContext): roles
    actor_id = 001de000-05e4-4000-8000-000000004007
    authenticated = true
    roles = RoleSet { roles: {(Fleet, 001de000-1334-4000-8000-000000000000, "admin"), (Silo, 001de000-5110-4000-8000-000000000000, "admin")} }
    saga_node = CallPantrySnapshotForDisk
    saga_type = recovery
11:13:34.444Z DEBG 7838d1e5-c2d5-4b51-a18a-1076a54fb103 (ServerContext): authorize result
    action = Read
    actor = Some(Actor::SiloUser { silo_user_id: 001de000-05e4-4000-8000-000000004007, silo_id: 001de000-5110-4000-8000-000000000000, .. })
    actor_id = 001de000-05e4-4000-8000-000000004007
    authenticated = true
    resource = Disk { parent: Project { parent: Silo { parent: Fleet, key: 001de000-5110-4000-8000-000000000000, lookup_type: ById(001de000-5110-4000-8000-000000000000) }, key: 1631d9d9-0598-41eb-ac80-44597af0dcdf, lookup_type: ById(1631d9d9-0598-41eb-ac80-44597af0dcdf) }, key: ff78438f-2fa4-4043-bc41-f4e710372386, lookup_type: ById(ff78438f-2fa4-4043-bc41-f4e710372386) }
    result = Ok(())
    saga_node = CallPantrySnapshotForDisk
    saga_type = recovery
thread 'tokio-runtime-worker' panicked at 'called `Result::unwrap()` on an `Err` value: Error Response: status: 500 Internal Server Error; headers: {"content-type": "application/json", "x-request-id": "1dbe6656-3f40-4607-ae2f-b6d25d9b1aad", "content-length": "124", "date": "Thu, 11 May 2023 11:13:33 GMT"}; value: Error { error_code: Some("Internal"), message: "Internal Server Error", request_id: "1dbe6656-3f40-4607-ae2f-b6d25d9b1aad" }', /home/alan/.cargo/registry/src/github.com-1ecc6299db9ec823/steno-0.3.1/src/saga_exec.rs:1187:65
leftwo (Contributor, Author) commented May 11, 2023

Nexus log is here: /net/catacomb/data/staff/core/rack2/omicron-3085

davepacheco (Collaborator) commented:

The proximate cause is oxidecomputer/steno#26. It looks like some saga undo action in Nexus failed on a transient error. These need to be retried. Do we have a core file? The question is which action failed and what it was doing.
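Retrying transient errors in the undo path might look roughly like this. This is a hypothetical sketch, not the actual Nexus or steno code; `undo_with_retry`, the error strings, and the retry policy are all invented for illustration:

```rust
// Hypothetical sketch: an undo action retrying a transient error instead
// of bubbling it up, where (per oxidecomputer/steno#26) it currently
// causes the panic in saga_exec.rs.
fn undo_with_retry<F>(mut op: F, max_attempts: u32) -> Result<(), String>
where
    F: FnMut() -> Result<(), String>,
{
    let mut attempt = 0;
    loop {
        match op() {
            Ok(()) => return Ok(()),
            Err(e) if attempt + 1 < max_attempts => {
                attempt += 1;
                // A real implementation would sleep with backoff and only
                // retry errors classified as transient (e.g. 503s), not
                // every failure.
                eprintln!("transient error on attempt {attempt}: {e}; retrying");
            }
            Err(e) => return Err(e),
        }
    }
}

fn main() {
    // Simulated undo action that fails twice, then succeeds.
    let mut failures = 2;
    let result = undo_with_retry(
        || {
            if failures > 0 {
                failures -= 1;
                Err("503 Service Unavailable".to_string())
            } else {
                Ok(())
            }
        },
        5,
    );
    assert!(result.is_ok());
}
```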

davepacheco (Collaborator) commented:

Taking a closer look, I see we just finished authorizing something for the CallPantrySnapshotForDisk node, so it seems likely that's the undo action that failed. Also, InternalServerError isn't generally supposed to be transient.

So I guess the next question is: why did we see that InternalServerError? It looks like req_id 1dbe6656-3f40-4607-ae2f-b6d25d9b1aad, presumably in some crucible agent log?

jmpesp (Contributor) commented May 11, 2023

Saga 3597e6a6-ea85-4fcc-83b7-1aeea97bdcf6 is in state 'unwinding':

root@[fd00:1122:3344:10c::6]:32221/omicron> select id,saga_state from saga where saga_state = 'unwinding';
                   id                  | saga_state
---------------------------------------+-------------
  3597e6a6-ea85-4fcc-83b7-1aeea97bdcf6 | unwinding

The last node it attempted to unwind was 15:

root@[fd00:1122:3344:10c::6]:32221/omicron> select node_id,event_type from saga_node_event where saga_id
= '3597e6a6-ea85-4fcc-83b7-1aeea97bdcf6' order by event_time;
  node_id |  event_type
----------+----------------
       25 | started
       25 | succeeded
        0 | started
        0 | succeeded
        1 | started
        1 | succeeded
        2 | started
        2 | succeeded
        3 | started
        3 | succeeded
        4 | started
        4 | succeeded
        5 | started
        5 | succeeded
        6 | started
        6 | succeeded
        7 | started
        7 | succeeded
        8 | started
        8 | succeeded
        9 | started
        9 | succeeded
       10 | started
       10 | succeeded
       11 | started
       11 | succeeded
       12 | started
       12 | succeeded
       13 | started
       13 | succeeded
       14 | started
       14 | succeeded
       15 | started
       15 | succeeded
       16 | started
       16 | succeeded
       17 | started
       17 | failed
       16 | undo_started
       16 | undo_finished
       15 | undo_started

Grabbing the saga.json and moving it to my workstation (the sleds could really use jq!):

root@oxz_cockroachdb_oxp_beae56c2-d080-4be1-80bd-b12b0335e57d:~# /opt/oxide/cockroachdb/bin/cockroach sql --insecure --url=postgres://root@$(cat /data/cockroach.listen-addr)/omicron -e "select saga_dag from saga where id = '3597e6a6-ea85-4fcc-83b7-1aeea97bdcf6'" --format raw | grep -v '#' > saga.json

It's the finalize disk saga:

$ cat saga.json | jq .saga_name
"finalize-disk"

Node 15 is:

$ cat saga.json | jq .graph.nodes[15]
{
  "Action": {
    "action_name": "snapshot_create.call_pantry_snapshot_for_disk",
    "label": "CallPantrySnapshotForDisk",
    "name": "call_pantry_snapshot_for_disk"
  }
}

which is part of the snapshot subsaga. The snapshot id we should look for in the logs is:

$ cat saga.json | jq .graph.nodes[4]
{
  "Action": {
    "action_name": "common.uuid_generate",
    "label": "GenerateSnapshotId",
    "name": "snapshot_id"
  }
}

select data from saga_node_event where saga_id = '3597e6a6-ea85-4fcc-83b7-1aeea97bdcf6' and node_id = 4 and event_type = 'succeeded';
                   data
------------------------------------------
  "ffe64091-9c1e-4344-bdce-cb333e210a2e"

In the Nexus log, we see:

11:13:32.772Z INFO 7838d1e5-c2d5-4b51-a18a-1076a54fb103 (ServerContext): Undoing pantry snapshot request for ffe64091-9c1e-4344-bdce-cb333e210a2e
    saga_type = recovery

jmpesp (Contributor) commented May 11, 2023

Unfortunately, there aren't any associated Pantry logs:

root@oxz_crucible_pantry:~# wc -l /var/svc/log/oxide-crucible-pantry:default.log
       0 /var/svc/log/oxide-crucible-pantry:default.log

leftwo (Contributor, Author) commented May 11, 2023

Pantry log uploaded to catacomb:
/net/catacomb/data/staff/core/rack2/omicron-3085/oxide-crucible-pantry\:default.log.0

davepacheco (Collaborator) commented:

I think the log's just been rotated:

root@oxz_crucible_pantry:~# ls -l $(svcs -L crucible/pantry)*
-rw-r--r--   1 root     root           0 May 11 03:10 /var/svc/log/oxide-crucible-pantry:default.log
-rw-r--r--   1 root     root     433184080 May 10 22:45 /var/svc/log/oxide-crucible-pantry:default.log.0

but that req_id isn't in that log either:

root@oxz_crucible_pantry:~# grep 1dbe6656-3f40-4607-ae2f-b6d25d9b1aad /var/svc/log/oxide-crucible-pantry:default.log.0 | bunyan
-bash: bunyan: command not found
root@oxz_crucible_pantry:~# grep 1dbe6656-3f40-4607-ae2f-b6d25d9b1aad /var/svc/log/oxide-crucible-pantry:default.log.0 

From the code, I thought we were calling a crucible agent in the undo action:

let client = CrucibleAgentClient::new(&url);
client
    .region_delete_snapshot(
        &RegionId(region.id().to_string()),
        &snapshot_id.to_string(),
    )
    .await?;

davepacheco (Collaborator) commented May 11, 2023

Found the failed request in one of the crucible-agent logs:

BRM42220051-switch # pilot host exec -c 'grep 1dbe6656-3f40-4607-ae2f-b6d25d9b1aad $(svcs -H -ozone -Z crucible/agent | while read z; do svcs -L -z "$z" crucible/agent; done)' 0-31
 8  BRM44220011        failure: exit code 1: Warning: Permanently added 'fe80::aa40:25ff:fe04:3d5%tfportrear25_0' (ED25519) to the list of known hosts.
10  BRM42220009        failure: exit code 1: Warning: Permanently added 'fe80::aa40:25ff:fe04:3d4%tfportrear24_0' (ED25519) to the list of known hosts.
11  BRM42220057        failure: exit code 1: Warning: Permanently added 'fe80::aa40:25ff:fe04:393%tfportrear8_0' (ED25519) to the list of known hosts.
12  BRM42220018        failure: exit code 1: Warning: Permanently added 'fe80::aa40:25ff:fe04:1d1%tfportrear27_0' (ED25519) to the list of known hosts.
13  BRM44220005        failure: exit code 1: Warning: Permanently added 'fe80::aa40:25ff:fe04:357%tfportrear11_0' (ED25519) to the list of known hosts.
14  BRM42220051        failure: exit code 1: Warning: Permanently added 'fe80::aa40:25ff:fe04:195%tfportrear26_0' (ED25519) to the list of known hosts.
16  BRM42220086        ok: /pool/ext/6cd96f8e-05cb-439f-bf3a-1e4b1c8c3316/zone/oxz_crucible_oxp_6cd96f8e-05cb-439f-bf3a-1e4b1c8c3316/root/var/svc/log/oxide-crucible-agent:default.log:May 11 11:13:34.447 INFO request completed, error_message_external: Internal Server Error, error_message_internal: downstairs running for region 3c5524a0-47dd-4d6e-8c01-4fc33b6bcb83 snapshot ffe64091-9c1e-4344-bdce-cb333e210a2e, response_code: 500, uri: /crucible/0/regions/3c5524a0-47dd-4d6e-8c01-4fc33b6bcb83/snapshots/ffe64091-9c1e-4344-bdce-cb333e210a2e, method: DELETE, req_id: 1dbe6656-3f40-4607-ae2f-b6d25d9b1aad, remote_addr: [fd00:1122:3344:10c::4]:56631, local_addr: [fd00:1122:3344:101::6]:32345, component: dropshot
17  BRM42220006        failure: exit code 1: Warning: Permanently added 'fe80::aa40:25ff:fe04:191%tfportrear6_0' (ED25519) to the list of known hosts.
18  BRM44220004        failure: exit code 1: Warning: Permanently added 'fe80::aa40:25ff:fe04:3d3%tfportrear23_0' (ED25519) to the list of known hosts.
19  BRM42220017        failure: exit code 1: Warning: Permanently added 'fe80::aa40:25ff:fe04:192%tfportrear7_0' (ED25519) to the list of known hosts.
21  BRM42220026        failure: exit code 1: Warning: Permanently added 'fe80::aa40:25ff:fe04:351%tfportrear4_0' (ED25519) to the list of known hosts.
23  BRM42220031        failure: exit code 1: Warning: Permanently added 'fe80::aa40:25ff:fe04:353%tfportrear5_0' (ED25519) to the list of known hosts.
24  BRM42220014        failure: exit code 1: Warning: Permanently added 'fe80::aa40:25ff:fe04:352%tfportrear18_0' (ED25519) to the list of known hosts.
26  BRM42220016        failure: exit code 1: Warning: Permanently added 'fe80::aa40:25ff:fe04:395%tfportrear19_0' (ED25519) to the list of known hosts.
28  BRM44220010        failure: exit code 1: Warning: Permanently added 'fe80::aa40:25ff:fe04:354%tfportrear16_0' (ED25519) to the list of known hosts.
30  BRM42220084        failure: exit code 1: Warning: Permanently added 'fe80::aa40:25ff:fe04:3d6%tfportrear17_0' (ED25519) to the list of known hosts.
Error: some operations failed
BRM42220051-switch # 

That's:

May 11 11:13:34.447 INFO request completed, error_message_external: Internal Server Error, error_message_internal: downstairs running for region 3c5524a0-47dd-4d6e-8c01-4fc33b6bcb83 snapshot ffe64091-9c1e-4344-bdce-cb333e210a2e, response_code: 500, uri: /crucible/0/regions/3c5524a0-47dd-4d6e-8c01-4fc33b6bcb83/snapshots/ffe64091-9c1e-4344-bdce-cb333e210a2e, method: DELETE, req_id: 1dbe6656-3f40-4607-ae2f-b6d25d9b1aad, remote_addr: [fd00:1122:3344:10c::4]:56631, local_addr: [fd00:1122:3344:101::6]:32345

jmpesp (Contributor) commented May 11, 2023

That's a log file for a very unhappy agent:

May 11 05:45:08.139 INFO request completed, error_message_external: Internal Server Error, error_message_internal: downstairs running for region 3c5524a0-47dd-4d6e-8c01-4fc33b6bcb83 snapshot ffe64091-9c1e-4344-bdce-cb333e210a2e, response_code: 500, uri: /crucible/0/regions/3c5524a0-47dd-4d6e-8c01-4fc33b6bcb83/snapshots/ffe64091-9c1e-4344-bdce-cb333e210a2e, method: DELETE, req_id: 6f356b7f-0d48-49d0-a01a-2e0e1be6e7aa, remote_addr: [fd00:1122:3344:10c::4]:51641, local_addr: [fd00:1122:3344:101::6]:32345, component: dropshot
May 11 05:45:11.081 INFO accepted connection, remote_addr: [fd00:1122:3344:10c::4]:59251, local_addr: [fd00:1122:3344:101::6]:32345, component: dropshot
May 11 05:45:11.082 INFO request completed, error_message_external: Internal Server Error, error_message_internal: downstairs running for region 3c5524a0-47dd-4d6e-8c01-4fc33b6bcb83 snapshot ffe64091-9c1e-4344-bdce-cb333e210a2e, response_code: 500, uri: /crucible/0/regions/3c5524a0-47dd-4d6e-8c01-4fc33b6bcb83/snapshots/ffe64091-9c1e-4344-bdce-cb333e210a2e, method: DELETE, req_id: c4ca37a2-4217-4043-ab27-82765f76af75, remote_addr: [fd00:1122:3344:10c::4]:59251, local_addr: [fd00:1122:3344:101::6]:32345, component: dropshot
May 11 05:45:11.681 INFO accepted connection, remote_addr: [fd00:1122:3344:10c::4]:46477, local_addr: [fd00:1122:3344:101::6]:32345, component: dropshot
May 11 05:45:11.681 INFO request completed, error_message_external: Internal Server Error, error_message_internal: downstairs running for region 3c5524a0-47dd-4d6e-8c01-4fc33b6bcb83 snapshot ffe64091-9c1e-4344-bdce-cb333e210a2e, response_code: 500, uri: /crucible/0/regions/3c5524a0-47dd-4d6e-8c01-4fc33b6bcb83/snapshots/ffe64091-9c1e-4344-bdce-cb333e210a2e, method: DELETE, req_id: fab307e8-9ba0-4e2d-831b-905d09335683, remote_addr: [fd00:1122:3344:10c::4]:46477, local_addr: [fd00:1122:3344:101::6]:32345, component: dropshot
May 11 05:45:12.106 INFO accepted connection, remote_addr: [fd00:1122:3344:10c::4]:57518, local_addr: [fd00:1122:3344:101::6]:32345, component: dropshot
May 11 05:45:12.106 INFO request completed, error_message_external: Internal Server Error, error_message_internal: downstairs running for region 3c5524a0-47dd-4d6e-8c01-4fc33b6bcb83 snapshot ffe64091-9c1e-4344-bdce-cb333e210a2e, response_code: 500, uri: /crucible/0/regions/3c5524a0-47dd-4d6e-8c01-4fc33b6bcb83/snapshots/ffe64091-9c1e-4344-bdce-cb333e210a2e, method: DELETE, req_id: a4552b8e-ffac-49bf-8ee5-494aa0b810d0, remote_addr: [fd00:1122:3344:10c::4]:57518, local_addr: [fd00:1122:3344:101::6]:32345, component: dropshot
May 11 05:45:12.567 INFO accepted connection, remote_addr: [fd00:1122:3344:10c::4]:56748, local_addr: [fd00:1122:3344:101::6]:32345, component: dropshot
May 11 05:45:12.567 INFO request completed, error_message_external: Internal Server Error, error_message_internal: downstairs running for region 3c5524a0-47dd-4d6e-8c01-4fc33b6bcb83 snapshot ffe64091-9c1e-4344-bdce-cb333e210a2e, response_code: 500, uri: /crucible/0/regions/3c5524a0-47dd-4d6e-8c01-4fc33b6bcb83/snapshots/ffe64091-9c1e-4344-bdce-cb333e210a2e, method: DELETE, req_id: e4ea1836-e1b7-4b43-bea7-fdbe07483465, remote_addr: [fd00:1122:3344:10c::4]:56748, local_addr: [fd00:1122:3344:101::6]:32345, component: dropshot
May 11 11:13:32.839 INFO accepted connection, remote_addr: [fd00:1122:3344:10c::4]:62489, local_addr: [fd00:1122:3344:101::6]:32345, component: dropshot
May 11 11:13:32.839 INFO request completed, error_message_external: Internal Server Error, error_message_internal: downstairs running for region 3c5524a0-47dd-4d6e-8c01-4fc33b6bcb83 snapshot ffe64091-9c1e-4344-bdce-cb333e210a2e, response_code: 500, uri: /crucible/0/regions/3c5524a0-47dd-4d6e-8c01-4fc33b6bcb83/snapshots/ffe64091-9c1e-4344-bdce-cb333e210a2e, method: DELETE, req_id: 5cabd682-cd4c-4ebc-96b9-5774ee5bb328, remote_addr: [fd00:1122:3344:10c::4]:62489, local_addr: [fd00:1122:3344:101::6]:32345, component: dropshot
May 11 11:13:33.384 INFO accepted connection, remote_addr: [fd00:1122:3344:10c::4]:46530, local_addr: [fd00:1122:3344:101::6]:32345, component: dropshot
May 11 11:13:33.384 INFO request completed, error_message_external: Internal Server Error, error_message_internal: downstairs running for region 3c5524a0-47dd-4d6e-8c01-4fc33b6bcb83 snapshot ffe64091-9c1e-4344-bdce-cb333e210a2e, response_code: 500, uri: /crucible/0/regions/3c5524a0-47dd-4d6e-8c01-4fc33b6bcb83/snapshots/ffe64091-9c1e-4344-bdce-cb333e210a2e, method: DELETE, req_id: 14ef4316-ef4b-4916-9a0b-0d1c7b8fe87c, remote_addr: [fd00:1122:3344:10c::4]:46530, local_addr: [fd00:1122:3344:101::6]:32345, component: dropshot

jmpesp (Contributor) commented May 11, 2023

Though that might be from the saga unwind rerunning...

jmpesp (Contributor) commented May 11, 2023

downstairs running for region 3c5524a0-47dd-4d6e-8c01-4fc33b6bcb83 snapshot ffe64091-9c1e-4344-bdce-cb333e210a2e is the error message that occurs when trying to delete a snapshot if there's a read-only downstairs running for it: https://github.com/oxidecomputer/crucible/blob/9f69deab230079a5bae4a10a099d7fc1f3d29df7/agent/src/datafile.rs#L351
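A simplified model of that agent-side check (the types and function here are hypothetical; the real logic lives in crucible's `agent/src/datafile.rs`):

```rust
// Simplified sketch of the crucible-agent behavior described above.
// Types and names are invented for illustration; they are not the
// actual datafile.rs API.
#[derive(Debug)]
enum DeleteError {
    RunningSnapshot { region: String, snapshot: String },
}

struct Region {
    id: String,
    // Snapshots that still have a read-only downstairs running for them.
    running_snapshots: Vec<String>,
}

fn delete_region_snapshot(region: &Region, snapshot: &str) -> Result<(), DeleteError> {
    // Refuse the delete while a read-only downstairs is still running for
    // this snapshot; the caller sees this as a 500 from the agent.
    if region.running_snapshots.iter().any(|s| s == snapshot) {
        return Err(DeleteError::RunningSnapshot {
            region: region.id.clone(),
            snapshot: snapshot.to_string(),
        });
    }
    Ok(())
}

fn main() {
    let region = Region {
        id: "3c5524a0-47dd-4d6e-8c01-4fc33b6bcb83".to_string(),
        running_snapshots: vec!["ffe64091-9c1e-4344-bdce-cb333e210a2e".to_string()],
    };
    // This is the DELETE that kept failing with 500 in the agent log.
    let result = delete_region_snapshot(&region, "ffe64091-9c1e-4344-bdce-cb333e210a2e");
    assert!(result.is_err());
}
```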

davepacheco (Collaborator) commented:

(tangential: ran into and filed https://www.illumos.org/issues/15649 while debugging this)

jmpesp (Contributor) commented May 11, 2023

According to the entries in saga_node_event:

       16 | started
       16 | succeeded
       17 | started
       17 | failed
       16 | undo_started
       16 | undo_finished
       15 | undo_started

node 17 failed:

jwm@duster ~/src/oxidecomputer/omicron [deploy] [] $ cat saga.json | jq .graph.nodes[17]
{
  "Action": {
    "action_name": "snapshot_create.start_running_snapshot",
    "label": "StartRunningSnapshot",
    "name": "replace_sockets_map"
  }
}

which is

    START_RUNNING_SNAPSHOT -> "replace_sockets_map" {
        + ssc_start_running_snapshot
        - ssc_start_running_snapshot_undo
    }

Given how sagas work, ssc_start_running_snapshot_undo won't run if ssc_start_running_snapshot itself fails! So it makes sense that there are still running read-only downstairs.
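That unwind rule can be modeled in a few lines. This is a toy sketch of the semantics, not the steno crate itself: the failed node's own undo is skipped, so anything it created before failing leaks unless its creation is registered as a separate node.

```rust
// Toy model of saga unwind semantics: a node's undo runs only if that
// node previously *succeeded*. Node names are placeholders.
struct Node {
    name: &'static str,
    action_ok: bool,
    undone: bool,
}

/// Run nodes in order; on the first failure, undo only the nodes that
/// succeeded before it, in reverse order. The failed node is never undone.
fn run_saga(nodes: &mut [Node]) {
    let mut failed_at = None;
    for (i, node) in nodes.iter().enumerate() {
        if !node.action_ok {
            failed_at = Some(i);
            break;
        }
    }
    if let Some(i) = failed_at {
        for node in nodes[..i].iter_mut().rev() {
            node.undone = true;
        }
    }
}

fn main() {
    // Mirrors the saga_node_event table above: nodes 15 and 16 succeed,
    // node 17 (start_running_snapshot) fails, and only 16 and 15 unwind.
    let mut nodes = vec![
        Node { name: "node_15", action_ok: true, undone: false },
        Node { name: "node_16", action_ok: true, undone: false },
        Node { name: "node_17_start_running_snapshot", action_ok: false, undone: false },
    ];
    run_saga(&mut nodes);
    for node in &nodes {
        println!("{}: undone = {}", node.name, node.undone);
    }
    // The failed node's undo never runs, so its partial work leaks.
    assert!(nodes[0].undone && nodes[1].undone && !nodes[2].undone);
}
```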

askfongjojo commented May 11, 2023

I wonder if my disk/snapshot/image import request last night was causing some sort of contention on the sled in gc17. One of the crucible downstairs is also located on that sled (and strangely, the other two copies are co-located on the same sled, which is unexpected):

root@[fd00:1122:3344:10c::6]:32221/omicron> select * from disk, project where project.name = 'try' and project.id = disk.project_id;
                   id                  |          name          |                   description                   |         time_created          |         time_modified         | time_deleted | rcgen |              project_id              |              volume_id               | disk_state | attach_instance_id | state_generation |      time_state_updated      | size_bytes | block_size | origin_snapshot | origin_image |         pantry_address         |                  id                  | name |  description   |         time_created          |         time_modified         | time_deleted | rcgen |               silo_id
---------------------------------------+------------------------+-------------------------------------------------+-------------------------------+-------------------------------+--------------+-------+--------------------------------------+--------------------------------------+------------+--------------------+------------------+------------------------------+------------+------------+-----------------+--------------+--------------------------------+--------------------------------------+------+----------------+-------------------------------+-------------------------------+--------------+-------+---------------------------------------
  ff78438f-2fa4-4043-bc41-f4e710372386 | tmp-for-image-80257445 | temporary disk for importing image ubuntu-focal | 2023-05-11 05:33:47.923522+00 | 2023-05-11 05:33:47.923522+00 | NULL         |     1 | 1631d9d9-0598-41eb-ac80-44597af0dcdf | 31611db3-b950-4285-b3ac-2a842bab0f91 | finalizing | NULL               |                5 | 2023-05-11 05:45:06.81403+00 | 3221225472 |        512 | NULL            | NULL         | [fd00:1122:3344:10c::12]:17000 | 1631d9d9-0598-41eb-ac80-44597af0dcdf | try  | always trying! | 2023-05-11 05:32:42.093378+00 | 2023-05-11 05:32:42.093378+00 | NULL         |     4 | 001de000-5110-4000-8000-000000000000
(1 row)


Time: 4ms total (execution 4ms / network 1ms)

root@[fd00:1122:3344:10c::6]:32221/omicron> select * from volume where id = '31611db3-b950-4285-b3ac-2a842bab0f91';
                   id                  |         time_created          |         time_modified         | time_deleted | rcgen |                                                                                                                                                                                                                    data                                                                                                                                                                                                                     | resources_to_clean_up
---------------------------------------+-------------------------------+-------------------------------+--------------+-------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------
  31611db3-b950-4285-b3ac-2a842bab0f91 | 2023-05-11 05:33:48.836773+00 | 2023-05-11 05:33:48.836773+00 | NULL         |     1 | {"type":"volume","block_size":512,"id":"ff78438f-2fa4-4043-bc41-f4e710372386","sub_volumes":[{"type":"region","block_size":512,"blocks_per_extent":131072,"extent_count":48,"gen":2,"opts":{"id":"ff78438f-2fa4-4043-bc41-f4e710372386","key":"sABsfTBVbpNyKa+Mx53LCOCtzrrhAIGVZIvwlXzHpfs=","lossy":false,"read_only":false,"target":["[fd00:1122:3344:101::8]:19008","[fd00:1122:3344:101::6]:19008","[fd00:1122:3344:10b::8]:19007"]}}]} | NULL
(1 row)

root@[fd00:1122:3344:10c::6]:32221/omicron> select serial_number from sled where ip = 'fd00:1122:3344:101::1';
  serial_number
-----------------
  BRM42220086
(1 row)

root@[fd00:1122:3344:10c::6]:32221/omicron> select serial_number from sled where ip = 'fd00:1122:3344:10b::1';
  serial_number
-----------------
  BRM42220006
(1 row)

It may be unrelated, but I just want to come forward and admit that I was at the crime scene as well.

leftwo (Contributor, Author) commented May 11, 2023

The Nexus log had also been rotated; the .0 log has been uploaded to catacomb as well:
/net/catacomb/data/staff/core/rack2/omicron-3085

leftwo (Contributor, Author) commented May 11, 2023

From the .0 nexus log file, I believe these are the saga events leading up to the error:
grep 3597e6a6-ea85-4fcc-83b7-1aeea97bdcf6 /net/catacomb/data/staff/core/rack2/omicron-3085/oxide-nexus\:default.log.0 | looker

05:45:08.018Z INFO 7838d1e5-c2d5-4b51-a18a-1076a54fb103 (ServerContext): dataset Dataset { identity: DatasetIdentity { id: 0ce06c51-98c4-450d-a97c-73a60ff96cd2, time_created: 2023-05-10T23:56:10.241077Z, time_modified: 2023-05-10T23:56:10.241077Z }, time_deleted: None, rcgen: Generation(Generation(1)), pool_id: 310b934d-2693-419d-a9b1-74a7f192fde6, ip: fd00:1122:3344:101::8, port: SqlU16(32345), kind: Crucible, size_used: Some(173946175488) } region Region { identity: RegionIdentity { id: fb6f6bcb-9d55-482d-a189-ffc450097168, time_created: 2023-05-11T05:33:48.016725Z, time_modified: 2023-05-11T05:33:48.016725Z }, dataset_id: 0ce06c51-98c4-450d-a97c-73a60ff96cd2, volume_id: 31611db3-b950-4285-b3ac-2a842bab0f91, block_size: ByteCount(ByteCount(512)), blocks_per_extent: 131072, extent_count: 48 } url http://[fd00:1122:3344:101::8]:32345
    saga_id = 3597e6a6-ea85-4fcc-83b7-1aeea97bdcf6
    saga_name = finalize-disk
05:45:08.019Z INFO 7838d1e5-c2d5-4b51-a18a-1076a54fb103 (ServerContext): crucible region Region { block_size: 512, cert_pem: None, encrypted: true, extent_count: 48, extent_size: 131072, id: RegionId("fb6f6bcb-9d55-482d-a189-ffc450097168"), key_pem: None, port_number: 19008, root_pem: None, state: Created }
    saga_id = 3597e6a6-ea85-4fcc-83b7-1aeea97bdcf6
    saga_name = finalize-disk
05:45:08.051Z INFO 7838d1e5-c2d5-4b51-a18a-1076a54fb103 (ServerContext): crucible snapshot Snapshot { created: 2023-05-11T05:45:07Z, name: "ffe64091-9c1e-4344-bdce-cb333e210a2e" }
    saga_id = 3597e6a6-ea85-4fcc-83b7-1aeea97bdcf6
    saga_name = finalize-disk
05:45:08.080Z INFO 7838d1e5-c2d5-4b51-a18a-1076a54fb103 (ServerContext): crucible running snapshot RunningSnapshot { id: RegionId("fb6f6bcb-9d55-482d-a189-ffc450097168"), name: "ffe64091-9c1e-4344-bdce-cb333e210a2e", port_number: 19009, state: Requested }
    saga_id = 3597e6a6-ea85-4fcc-83b7-1aeea97bdcf6
    saga_name = finalize-disk
05:45:08.080Z INFO 7838d1e5-c2d5-4b51-a18a-1076a54fb103 (ServerContext): map [fd00:1122:3344:101::8]:19008 to [fd00:1122:3344:101::8]:19009
    saga_id = 3597e6a6-ea85-4fcc-83b7-1aeea97bdcf6
    saga_name = finalize-disk
05:45:08.082Z INFO 7838d1e5-c2d5-4b51-a18a-1076a54fb103 (ServerContext): dataset Dataset { identity: DatasetIdentity { id: fd0c1955-3f47-4242-b244-da16793fa049, time_created: 2023-05-10T23:56:10.241072Z, time_modified: 2023-05-10T23:56:10.241072Z }, time_deleted: None, rcgen: Generation(Generation(1)), pool_id: c4d03dce-3b90-45f1-9f6c-48146238563c, ip: fd00:1122:3344:10b::8, port: SqlU16(32345), kind: Crucible, size_used: Some(170724950016) } region Region { identity: RegionIdentity { id: e13962e0-4b19-49f8-96f1-d59a9ba0c1bf, time_created: 2023-05-11T05:33:48.016725Z, time_modified: 2023-05-11T05:33:48.016725Z }, dataset_id: fd0c1955-3f47-4242-b244-da16793fa049, volume_id: 31611db3-b950-4285-b3ac-2a842bab0f91, block_size: ByteCount(ByteCount(512)), blocks_per_extent: 131072, extent_count: 48 } url http://[fd00:1122:3344:10b::8]:32345
    saga_id = 3597e6a6-ea85-4fcc-83b7-1aeea97bdcf6
    saga_name = finalize-disk
05:45:08.083Z INFO 7838d1e5-c2d5-4b51-a18a-1076a54fb103 (ServerContext): crucible region Region { block_size: 512, cert_pem: None, encrypted: true, extent_count: 48, extent_size: 131072, id: RegionId("e13962e0-4b19-49f8-96f1-d59a9ba0c1bf"), key_pem: None, port_number: 19007, root_pem: None, state: Created }
    saga_id = 3597e6a6-ea85-4fcc-83b7-1aeea97bdcf6
    saga_name = finalize-disk
05:45:08.084Z DEBG 7838d1e5-c2d5-4b51-a18a-1076a54fb103 (ServerContext): recording saga event
    event_type = Failed(ActionFailed { source_error: String("Error Response: status: 404 Not Found; headers: {\\"content-type\\": \\"application/json\\", \\"x-request-id\\": \\"e3f702f0-198e-4015-a99d-7daf4aef5e46\\", \\"content-length\\": \\"84\\", \\"date\\": \\"Thu, 11 May 2023 05:45:08 GMT\\"}; value: Error { error_code: None, message: \\"Not Found\\", request_id: \\"e3f702f0-198e-4015-a99d-7daf4aef5e46\\" }") })
    node_id = 17
    saga_id = 3597e6a6-ea85-4fcc-83b7-1aeea97bdcf6
05:45:08.085Z INFO 7838d1e5-c2d5-4b51-a18a-1076a54fb103 (ServerContext): update for saga cached state
    new_state = Unwinding

jmpesp added a commit to jmpesp/omicron that referenced this issue May 12, 2023
When Saga nodes fail, the corresponding undo function is not executed.
If a saga node can partially succeed before failing, the undo function
has to be split into a separate saga step in order to properly unwind
the created resources. This wasn't being done for running snapshots, so
this commit corrects that and adds a test for it.

Fixes oxidecomputer#3085
jmpesp added a commit that referenced this issue May 15, 2023
When Saga nodes fail, the corresponding undo function is not executed.
If a saga node can partially succeed before failing, the undo function
has to be split into a separate saga step in order to properly unwind
the created resources. This wasn't being done for running snapshots, so
this commit corrects that and adds a test for it.

Fixes #3085