Nexus crashed on failed disk snapshot operation #2835

Closed
askfongjojo opened this issue Apr 13, 2023 · 3 comments

askfongjojo commented Apr 13, 2023

This is another instance of oxidecomputer/steno#26, similar to a few other issues linked to that ticket. The scenario is creating a snapshot of an unattached disk that was made from the propolis alpine image:

Apr 13 00:29:34.003 INFO request completed, error_message_external: Internal Server Error, error_message_internal: Error: e No such file or directory (os error 2) No extent file found for "/opt/oxide/propolis-server/blob/alpine.iso", response_code: 500, uri: /crucible/pantry/0/volume/72427479-2b8f-4d7b-a4a7-2ddaa69c4ee1, method: POST, req_id: 187b8073-1eab-4c48-9334-63c341e78a25, remote_addr: [fd00:1122:3344:101::3]:33049, local_addr: [fd00:1122:3344:101::a]:17000, component: dropshot

Real images won't hit this error, but snapshotting could conceivably fail for other reasons. When it does, Nexus panics and can't be recovered: on every restart it retries unwinding the saga and panics again, even after I manually svcadm clear it:

{"msg":"authorize result","v":0,"name":"nexus","level":20,"time":"2023-04-12T17:34:19.524915113-07:00","hostname":"oxz_nexus","pid":14511,"component":"DataLoader","component":"nexus","component":"ServerContext","name":"91ac0ab6-194e-4fc6-aafe-e46546eeffd6","result":"Ok(())","resource":"Database","action":"Query","actor":"Some(Actor::UserBuiltin { user_builtin_id: 001de000-05e4-4000-8000-000000000001, .. })"}
{"msg":"saga resume","v":0,"name":"nexus","level":30,"time":"2023-04-12T17:34:19.537782834-07:00","hostname":"oxz_nexus","pid":14511,"saga_name":"snapshot-create","saga_id":"a336a6a4-d1e1-4dfe-9b7b-69287b3325fd","sec_id":"91ac0ab6-194e-4fc6-aafe-e46546eeffd6","component":"SEC","component":"nexus","component":"ServerContext","name":"91ac0ab6-194e-4fc6-aafe-e46546eeffd6","dag":"{\"end_node\":18,\"graph\":{\"edge_property\":\"directed\",\"edges\":[[0,1,null],[1,2,null],[2,3,null],[3,4,null],[4,5,null],[5,6,null],[6,7,null],[7,8,null],[8,9,null],[9,10,null],[10,11,null],[11,12,null],[12,13,null],[13,14,null],[14,15,null],[15,16,null],[17,0,null],[16,18,null]],\"node_holes\":[],\"nodes\":[{\"Action\":{\"action_name\":\"common.uuid_generate\",\"label\":\"GenerateSnapshotId\",\"name\":\"snapshot_id\"}},{\"Action\":{\"action_name\":\"common.uuid_generate\",\"label\":\"GenerateVolumeId\",\"name\":\"volume_id\"}},{\"Action\":{\"action_name\":\"common.uuid_generate\",\"label\":\"GenerateDestinationVolumeId\",\"name\":\"destination_volume_id\"}},{\"Action\":{\"action_name\":\"snapshot_create.regions_alloc\",\"label\":\"RegionsAlloc\",\"name\":\"datasets_and_regions\"}},{\"Action\":{\"action_name\":\"snapshot_create.regions_ensure\",\"label\":\"RegionsEnsure\",\"name\":\"regions_ensure\"}},{\"Action\":{\"action_name\":\"snapshot_create.create_destination_volume_record\",\"label\":\"CreateDestinationVolumeRecord\",\"name\":\"created_destination_volume\"}},{\"Action\":{\"action_name\":\"snapshot_create.create_snapshot_record\",\"label\":\"CreateSnapshotRecord\",\"name\":\"created_snapshot\"}},{\"Action\":{\"action_name\":\"snapshot_create.space_account\",\"label\":\"SpaceAccount\",\"name\":\"no_result\"}},{\"Action\":{\"action_name\":\"snapshot_create.get_pantry_address\",\"label\":\"GetPantryAddress\",\"name\":\"pantry_address\"}},{\"Action\":{\"action_name\":\"snapshot_create.attach_disk_to_pantry\",\"label\":\"AttachDiskToPantry\",\"name\":\"disk_generation_number\"}},{\"Action\":{\"action_name\":\"snapshot_create.call_pantry_attach_for_disk\",\"label\":\"CallPantryAttachForDisk\",\"name\":\"call_pantry_attach_for_disk\"}},{\"Action\":{\"action_name\":\"snapshot_create.call_pantry_snapshot_for_disk\",\"label\":\"CallPantrySnapshotForDisk\",\"name\":\"call_pantry_snapshot_for_disk\"}},{\"Action\":{\"action_name\":\"snapshot_create.call_pantry_detach_for_disk\",\"label\":\"CallPantryDetachForDisk\",\"name\":\"call_pantry_detach_for_disk\"}},{\"Action\":{\"action_name\":\"snapshot_create.start_running_snapshot\",\"label\":\"StartRunningSnapshot\",\"name\":\"replace_sockets_map\"}},{\"Action\":{\"action_name\":\"snapshot_create.create_volume_record\",\"label\":\"CreateVolumeRecord\",\"name\":\"created_volume\"}},{\"Action\":{\"action_name\":\"snapshot_create.finalize_snapshot_record\",\"label\":\"FinalizeSnapshotRecord\",\"name\":\"finalized_snapshot\"}},{\"Action\":{\"action_name\":\"snapshot_create.detach_disk_from_pantry\",\"label\":\"DetachDiskFromPantry\",\"name\":\"detach_disk_from_pantry\"}},{\"Start\":{\"params\":{\"create_params\":{\"description\":\"snap\",\"disk\":\"server-image-disk\",\"name\":\"server-image-snap\"},\"disk_id\":\"72427479-2b8f-4d7b-a4a7-2ddaa69c4ee1\",\"project_id\":\"43acf783-a348-4d4c-ac3c-04ded1bcbd7a\",\"serialized_authn\":{\"kind\":{\"Authenticated\":{\"actor\":{\"SiloUser\":{\"silo_id\":\"001de000-5110-4000-8000-000000000000\",\"silo_user_id\":\"001de000-05e4-4000-8000-000000004007\"}}}}},\"silo_id\":\"001de000-5110-4000-8000-000000000000\",\"use_the_pantry\":true}}},\"End\"]},\"saga_name\":\"snapshot-create\",\"start_node\":17}"}
{"msg":"ssc_regions_ensure_undo: Deleting crucible regions","v":0,"name":"nexus","level":40,"time":"2023-04-12T17:34:19.538472666-07:00","hostname":"oxz_nexus","pid":14511,"saga_type":"recovery","component":"nexus","component":"ServerContext","name":"91ac0ab6-194e-4fc6-aafe-e46546eeffd6"}
thread 'tokio-runtime-worker' panicked at 'called `Result::unwrap()` on an `Err` value: Internal Error: Internal Server Error', /home/angela/.cargo/registry/src/github.com-1ecc6299db9ec823/steno-0.3.1/src/saga_exec.rs:1187:65
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
[ Apr 12 17:34:20 Stopping because all processes in service exited. ]
[ Apr 12 17:34:20 Executing stop method (:kill). ]
[ Apr 12 17:34:20 Executing start method ("ctrun -l child -o noorphan,regent /opt/oxide/nexus/bin/nexus /var/svc/manifest/site/nexus/config.toml &"). ]

askfongjojo added this to the MVP milestone Apr 13, 2023

davepacheco (Collaborator) commented

Note that the answer for oxidecomputer/steno#26 is probably going to be something like: undo actions cannot fail -- they need to continue trying until they succeed, or maybe put the saga into a "needs support" state. That is, steno can stop panicking, but I think we shouldn't be blocked on that because we need to do something else in the saga when we're otherwise failing.
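
To make the "keep trying, or park the saga" idea concrete, here is a minimal sketch in plain Rust. It does not use steno's API: `UndoOutcome`, `run_undo_with_retry`, the attempt limit, and the backoff values are all hypothetical names and numbers chosen for illustration. The point is only that the undo result is matched rather than unwrapped, so a failing undo ends in a "needs support" outcome rather than a panic.

```rust
use std::thread::sleep;
use std::time::Duration;

/// Outcome of unwinding one saga node (hypothetical type, not part of steno).
#[derive(Debug, PartialEq)]
enum UndoOutcome {
    /// The undo action succeeded on the given attempt.
    Undone { attempts: u32 },
    /// Retries were exhausted; the saga is parked for an operator.
    NeedsSupport { attempts: u32, last_error: String },
}

/// Retry `undo` with exponential backoff instead of unwrapping its result.
/// `max_attempts` and `base_delay` are illustrative knobs, not real settings.
fn run_undo_with_retry<F>(mut undo: F, max_attempts: u32, base_delay: Duration) -> UndoOutcome
where
    F: FnMut() -> Result<(), String>,
{
    let mut delay = base_delay;
    let mut last_error = String::new();
    for attempt in 1..=max_attempts {
        match undo() {
            Ok(()) => return UndoOutcome::Undone { attempts: attempt },
            Err(e) => last_error = e,
        }
        // Back off before the next attempt, but not after the final one.
        if attempt < max_attempts {
            sleep(delay);
            delay = delay.saturating_mul(2);
        }
    }
    UndoOutcome::NeedsSupport { attempts: max_attempts, last_error }
}

fn main() {
    // Simulate a region delete that fails twice and then succeeds.
    let mut calls = 0;
    let outcome = run_undo_with_retry(
        || {
            calls += 1;
            if calls < 3 {
                Err("Internal Server Error".to_string())
            } else {
                Ok(())
            }
        },
        5,
        Duration::from_millis(1),
    );
    println!("{:?}", outcome);
}
```

A bounded retry count is a design choice for this sketch. An undo that must never give up would loop indefinitely instead, and the "needs support" branch is where a real system would presumably record the saga state for an operator.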

askfongjojo transferred this issue from oxidecomputer/steno Apr 13, 2023

askfongjojo (Author) commented

Discussed in #3085

askfongjojo (Author) commented

Fixed in #3085
