nexus unwrap in saga_exec.rs #3085
Nexus log is here: /net/catacomb/data/staff/core/rack2/omicron-3085
The proximate cause is oxidecomputer/steno#26. It looks like some saga undo action in Nexus failed on a transient error; transient failures like this need to be retried. Do we have a core file? The question is which action failed and what it was doing.
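For illustration, here's a minimal sketch of what retrying a transient failure inside an undo action could look like. The error type, helper, and backoff policy are made up for the example; they are not the real steno or Nexus APIs.

```rust
use std::time::Duration;

/// Illustrative error split: transient errors are worth retrying, permanent
/// ones are not. (Hypothetical type, not the real Nexus/steno error type.)
#[derive(Debug)]
#[allow(dead_code)]
enum UndoError {
    Transient(String),
    Permanent(String),
}

/// Retry an undo step a bounded number of times with exponential backoff,
/// giving up only on a permanent error or once retries are exhausted.
async fn undo_with_retries<F, Fut>(mut step: F, max_attempts: u32) -> Result<(), UndoError>
where
    F: FnMut() -> Fut,
    Fut: std::future::Future<Output = Result<(), UndoError>>,
{
    let mut delay = Duration::from_millis(250);
    let mut attempt = 0;
    loop {
        attempt += 1;
        match step().await {
            Ok(()) => return Ok(()),
            Err(UndoError::Transient(msg)) if attempt < max_attempts => {
                eprintln!("transient undo failure (attempt {attempt}): {msg}; retrying");
                tokio::time::sleep(delay).await;
                delay = delay.saturating_mul(2);
            }
            Err(e) => return Err(e),
        }
    }
}

#[tokio::main]
async fn main() {
    // Hypothetical undo step that fails transiently on its first attempt.
    let mut calls = 0;
    let result = undo_with_retries(
        || {
            calls += 1;
            let succeed = calls > 1;
            async move {
                if succeed {
                    Ok(())
                } else {
                    Err(UndoError::Transient("503 from the crucible agent".into()))
                }
            }
        },
        5,
    )
    .await;
    println!("undo result: {result:?}");
}
```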
Taking a closer look, I see we had just finished authorizing something. So I guess the next question is: why did we see that InternalServerError? It looks like it was req_id 1dbe6656-3f40-4607-ae2f-b6d25d9b1aad, which should presumably show up in some crucible agent log?
Saga
The last node it attempted to unwind was 15:
Grabbing the saga.json and moving it to my workstation (the sleds could really use jq!):
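(Not the actual commands run here, but as a rough stand-in for jq: a small serde_json sketch for dumping node events out of the saga log. The field names are guesses, not the real steno schema.)

```rust
use serde_json::Value;
use std::fs;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Path is illustrative: the saga log copied off the sled.
    let raw = fs::read_to_string("saga.json")?;
    let saga: Value = serde_json::from_str(&raw)?;

    // Assumed shape (hypothetical): an "events" array whose entries carry
    // "node_id" and "event_type"; adjust to whatever the dump actually holds.
    if let Some(events) = saga.get("events").and_then(Value::as_array) {
        for ev in events {
            let node = ev.get("node_id").and_then(Value::as_u64);
            let kind = ev.get("event_type");
            println!("node {:?}: {:?}", node, kind);
        }
    }
    Ok(())
}
```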
It's the finalize disk saga:
Node 15 is:
which is part of the snapshot subsaga. The snapshot id we should look for in the logs is:
in the Nexus log, we see:
Unfortunately, there aren't any associated Pantry logs:
Pantry log uploaded to catacomb:
I think the log's just been rotated:
but that req_id isn't in that log either:
From the code, I thought we were calling the crucible agents in the undo action: omicron/nexus/src/app/sagas/snapshot_create.rs, lines 1044 to 1051 at 8b0ab46
Found the failed request in one of the crucible-agent logs:
That's:
That's a log file for a very unhappy agent:
Though that might be from the saga unwind rerunning...
(tangential: ran into and filed https://www.illumos.org/issues/15649 while debugging this)
According to the entries in
node 17 failed:
which is
According to how sagas work, when a node's action fails, that node's own undo action is not run; only the nodes that had already completed get unwound.
I wonder if my disk/snapshot/image import request last night was causing some sort of contention on the sled in gc17. One of the Crucible downstairs instances is also located on that sled (and strangely, the other two copies are co-located with each other on a single sled, which is unexpected):
It may be unrelated, but I just want to come forward and admit that I was at the crime scene as well.
The nexus log had also been rotated.
From the .0 nexus log file, I believe these are the saga events leading up to the error:
When Saga nodes fail, the corresponding undo function is not executed. If a saga node can partially succeed before failing, the undo function has to be split into a separate saga step in order to properly unwind the created resources. This wasn't being done for running snapshots, so this commit corrects that and adds a test for it. Fixes #3085
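A stripped-down sketch of the pattern that commit message describes, using a toy executor instead of the real steno API: because a failed node's own undo never runs, the resource-creating step has to be its own node (with its own undo) ahead of the failure-prone step. All names below are illustrative.

```rust
/// A toy saga node: an action plus an undo, just to illustrate unwinding.
struct Node {
    name: &'static str,
    action: fn() -> Result<(), String>,
    undo: fn(),
}

/// Minimal executor with the same rule as a real saga: when a node's action
/// fails, that node's own undo is NOT run; only the nodes that already
/// completed are unwound, in reverse order.
fn run(nodes: &[Node]) {
    let mut done: Vec<&Node> = Vec::new();
    for node in nodes {
        match (node.action)() {
            Ok(()) => done.push(node),
            Err(e) => {
                println!("node '{}' failed: {e}; unwinding", node.name);
                for prev in done.iter().rev() {
                    (prev.undo)();
                }
                return;
            }
        }
    }
    println!("saga completed");
}

fn main() {
    // Fixed shape: creating the "running snapshot" record is its own node
    // with its own undo, and the failure-prone start is a separate node, so
    // the record is cleaned up even though the start's undo never runs.
    let fixed = [
        Node {
            name: "create running snapshot record",
            action: || {
                println!("created record");
                Ok(())
            },
            undo: || println!("undo: deleted record"),
        },
        Node {
            name: "start running snapshot",
            action: || Err("transient agent error".to_string()),
            undo: || println!("undo: stop running snapshot (never reached)"),
        },
    ];
    run(&fixed);
}
```

Running this prints the record creation, the failure of the second node, and then the undo of the first node, which is the unwinding behavior the fix relies on.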
Thin on details for the moment, filing this to collect data.
saw this on the dogfood rack: