another nexus saga unwrap failure #3133

Closed
askfongjojo opened this issue May 16, 2023 · 6 comments
@askfongjojo

askfongjojo commented May 16, 2023

This might be another manifestation of #3085, but I thought I should report it because it happened while creating a disk from an image. Please feel free to close it if it is a duplicate.

The volume in trouble is: e5fbdb81-53f9-48f0-b5ce-2de6bade4d80
and the saga id is: d8cf8a60-588a-4fe2-a2d3-f1154872b3e8

Here are the extracted nexus log lines: https://gist.github.com/askfongjojo/3af284392c5026da1949c98b5da0fa5e

@askfongjojo
Author

@rcgoodfellow put the complete nexus log here: catacomb:/data/staff/dogfood/may-15/oxide-nexus:default.log

@jmpesp
Contributor

jmpesp commented May 16, 2023

This unwrap is different:

06:19:03.576Z DEBG d24831fc-5e19-45e7-8f05-b5fc2a9f0af4 (ServerContext): client response
    SledAgent = bb801695-d909-467a-9fe4-f1fd40cf6107
    result = Err(reqwest::Error { kind: Request, url: Url { scheme: "http", cannot_be_a_base: false, username: "", password: None, host: Some(Ipv6(fd00:1122:3344:109::1)), port: Some(12345), path: "/v2p/f8c353b2-962e-42c5-9fcc-43abdcaa6985", query: None, fragment: None }, source: TimedOut })

06:19:03.576Z ERRO d24831fc-5e19-45e7-8f05-b5fc2a9f0af4 (ServerContext): Err(Communication Error: error sending request for url (http://[fd00:1122:3344:109::1]:12345/v2p/f8c353b2-962e-42c5-9fcc-43abdcaa6985): operation timed out)

thread 'tokio-runtime-worker' panicked at 'called `Result::unwrap()` on an `Err` value: action failed', /home/alan/.cargo/registry/src/github.com-1ecc6299db9ec823/steno-0.3.1/src/saga_exec.rs:1187:65
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
[ May 15 23:19:04 Stopping because all processes in service exited. ]
[ May 15 23:19:04 Executing stop method (:kill). ]

This looks like it hit the following in sic_v2p_ensure_undo:

    osagactx
        .nexus()
        .delete_instance_v2p_mappings(&opctx, instance_id)
        .await
        .map_err(ActionError::action_failed)?;
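
To illustrate why a timed-out sled agent request can take down the whole process: an undo action that propagates its error with `?` hands an `Err` back to the executor, and the panic message above shows an `unwrap()` on that result at `saga_exec.rs:1187`. Below is a minimal synchronous sketch of that shape. It is not the actual steno or Nexus code: `ActionError`, `FlakySledAgent`, `undo_propagating`, `undo_with_retry`, and `run_undo_unwrapping` are all hypothetical stand-ins, and bounded retry is shown only as one possible mitigation, not as what any eventual fix does.

```rust
/// Hypothetical stand-in for a saga action error (not steno's real type).
#[derive(Debug, PartialEq)]
struct ActionError(String);

/// Hypothetical stand-in for the sled agent client: the first
/// `failures_left` calls time out, later calls succeed.
struct FlakySledAgent {
    failures_left: u32,
    calls: u32,
}

impl FlakySledAgent {
    fn delete_v2p_mappings(&mut self) -> Result<(), String> {
        self.calls += 1;
        if self.failures_left > 0 {
            self.failures_left -= 1;
            Err("operation timed out".to_string())
        } else {
            Ok(())
        }
    }
}

/// Undo action that propagates the first error, like the
/// `.map_err(...)?` in the snippet above.
fn undo_propagating(agent: &mut FlakySledAgent) -> Result<(), ActionError> {
    agent.delete_v2p_mappings().map_err(ActionError)?;
    Ok(())
}

/// Undo action that retries a bounded number of times before giving up.
fn undo_with_retry(
    agent: &mut FlakySledAgent,
    max_attempts: u32,
) -> Result<(), ActionError> {
    let mut last_err = ActionError("no attempts made".to_string());
    for _ in 0..max_attempts {
        match agent.delete_v2p_mappings() {
            Ok(()) => return Ok(()),
            Err(e) => last_err = ActionError(e),
        }
    }
    Err(last_err)
}

/// Executor that assumes undo actions cannot fail: `unwrap()` turns any
/// `Err` from the undo action into a panic.
fn run_undo_unwrapping(undo: impl FnOnce() -> Result<(), ActionError>) {
    undo().unwrap();
}

fn main() {
    // With retries, two transient timeouts are absorbed and no panic occurs.
    let mut agent = FlakySledAgent { failures_left: 2, calls: 0 };
    run_undo_unwrapping(|| undo_with_retry(&mut agent, 5));
    println!("undo succeeded after {} calls", agent.calls);

    // Without retries, a single timeout surfaces as an Err. Handing this
    // to `run_undo_unwrapping` would panic, so it is only printed here.
    let mut agent = FlakySledAgent { failures_left: 1, calls: 0 };
    println!("propagating undo returned: {:?}", undo_propagating(&mut agent));
}
```

Retrying only helps with transient timeouts; a sled that stays unresponsive would still exhaust the attempts and return `Err`, so the executor side also needs a way to handle a failed undo without panicking.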

jmpesp self-assigned this May 16, 2023

@askfongjojo
Author

There was another occurrence of an unsuccessful unwind, which @rcgoodfellow worked around (we updated the saga state from unwinding to done to temporarily bypass the unwinding retries so that others could use the system). The saga id is cec7cf92-cc4d-4d42-a816-a46e216aec7d, also for an instance of mine. I was trying to delete it and didn't realize that doing so had brought down nexus again.

It is also a v2p communication error:

{"msg":"client response","v":0,"name":"nexus","level":20,"time":"2023-05-16T00:38:57.917588018-07:00","hostname":"oxz_nexus","pid":18826,"SledAgent":"bb801695-d909-467a-9fe4-f1fd40cf6107","component":"nexus","component":"ServerContext","name":"d24831fc-5e19-45e7-8f05-b5fc2a9f0af4","result":"Err(reqwest::Error { kind: Request, url: Url { scheme: \"http\", cannot_be_a_base: false, username: \"\", password: None, host: Some(Ipv6(fd00:1122:3344:109::1)), port: Some(12345), path: \"/v2p/f8c353b2-962e-42c5-9fcc-43abdcaa6985\", query: None, fragment: None }, source: TimedOut })"}
{"msg":"Err(Communication Error: error sending request for url (http://[fd00:1122:3344:109::1]:12345/v2p/f8c353b2-962e-42c5-9fcc-43abdcaa6985): operation timed out)","v":0,"name":"nexus","level":50,"time":"2023-05-16T00:38:57.917638114-07:00","hostname":"oxz_nexus","pid":18826,"component":"nexus","component":"ServerContext","name":"d24831fc-5e19-45e7-8f05-b5fc2a9f0af4"}
thread 'tokio-runtime-worker' panicked at 'called `Result::unwrap()` on an `Err` value: action failed', /home/alan/.cargo/registry/src/github.com-1ecc6299db9ec823/steno-0.3.1/src/saga_exec.rs:1187:65
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
[ May 16 00:38:58 Stopping because all processes in service exited. ]
[ May 16 00:38:58 Executing stop method (:kill). ]
[ May 16 00:38:58 Executing start method ("ctrun -l child -o noorphan,regent /opt/oxide/nexus/bin/nexus /var/svc/manifest/site/nexus/config.toml &"). ]
[ May 16 00:38:58 Method "start" exited with status 0. ]

@askfongjojo
Author

I failed to mention that, prior to these two unwinding issues that brought down nexus, one of my VMs had failed to boot (from the stopped state), which I was going to investigate. This might have been the cause of the subsequent v2p errors.

{"msg":"request completed","v":0,"name":"nexus","level":30,"time":"2023-05-15T23:17:08.177706912-07:00","hostname":"oxz_nexus","pid":13159,"uri":"https://venus.oxide-preview.com/v1/instances/expensive-toy/disks?project=try","method":"GET","req_id":"b896f1e3-0546-4f48-9551-8db71327cf41","remote_addr":"172.20.17.42:50644","local_addr":"172.30.1.5:443","component":"dropshot_external","component":"ServerContext","name":"d24831fc-5e19-45e7-8f05-b5fc2a9f0af4","response_code":"200"}
{"msg":"client response","v":0,"name":"nexus","level":20,"time":"2023-05-15T23:17:12.748559005-07:00","hostname":"oxz_nexus","pid":13159,"SledAgent":"bb801695-d909-467a-9fe4-f1fd40cf6107","component":"nexus","component":"ServerContext","name":"d24831fc-5e19-45e7-8f05-b5fc2a9f0af4","result":"Err(reqwest::Error { kind: Request, url: Url { scheme: \"http\", cannot_be_a_base: false, username: \"\", password: None, host: Some(Ipv6(fd00:1122:3344:109::1)), port: Some(12345), path: \"/instances/338a0cf2-4e71-4310-8421-02d8d00a124b\", query: None, fragment: None }, source: TimedOut })"}
{"msg":"Handling sled agent instance PUT result","v":0,"name":"nexus","level":20,"time":"2023-05-15T23:17:12.748628417-07:00","hostname":"oxz_nexus","pid":13159,"component":"nexus","component":"ServerContext","name":"d24831fc-5e19-45e7-8f05-b5fc2a9f0af4","result":"Err(Communication Error: error sending request for url (http://[fd00:1122:3344:109::1]:12345/instances/338a0cf2-4e71-4310-8421-02d8d00a124b): operation timed out)"}
{"msg":"saw Communication Error: error sending request for url (http://[fd00:1122:3344:109::1]:12345/instances/338a0cf2-4e71-4310-8421-02d8d00a124b): operation timed out from instance_put!","v":0,"name":"nexus","level":50,"time":"2023-05-15T23:17:12.748651-07:00","hostname":"oxz_nexus","pid":13159,"component":"nexus","component":"ServerContext","name":"d24831fc-5e19-45e7-8f05-b5fc2a9f0af4"}
{"msg":"saw Ok(true) from setting InstanceState::Failed after bad instance_put","v":0,"name":"nexus","level":50,"time":"2023-05-15T23:17:12.753116335-07:00","hostname":"oxz_nexus","pid":13159,"component":"nexus","component":"ServerContext","name":"d24831fc-5e19-45e7-8f05-b5fc2a9f0af4"}
{"msg":"request completed","v":0,"name":"nexus","level":30,"time":"2023-05-15T23:17:12.753278494-07:00","hostname":"oxz_nexus","pid":13159,"uri":"https://venus.oxide-preview.com/v1/instances/expensive-toy/start?project=try","method":"POST","req_id":"b731ee32-72cc-4334-9374-cd52083ad483","remote_addr":"172.20.17.42:50644","local_addr":"172.30.1.5:443","component":"dropshot_external","component":"ServerContext","name":"d24831fc-5e19-45e7-8f05-b5fc2a9f0af4","error_message_external":"Internal Server Error","error_message_internal":"CommunicationError: error sending request for url (http://[fd00:1122:3344:109::1]:12345/instances/338a0cf2-4e71-4310-8421-02d8d00a124b): operation timed out","response_code":"500"}

The instance was marked as failed, and that failure did not cause any panic.

@askfongjojo
Author

Failures due to unresponsive sleds will have to be handled as part of #2483. I'm fine with closing this ticket if no further data points or root cause analysis are needed.

@jmpesp
Contributor

jmpesp commented May 25, 2023

I think this ticket will be addressed by #3225 and #3211, so it's probably safe to close.
