Disk-attach subsaga can panic Nexus if the instance fails to provision #1713
Comments
I'm making a fix for this that addresses point 3 specifically, but point 6 is troubling. It looks like Steno will `unwrap` any error in `undo_node`! Many of the saga undo nodes could return `Err`. We shouldn't address this by panicking, though.
Yeah, I think that's oxidecomputer/steno#26
@jmpesp pointed out in chat that there are other undo actions that return errors, specifically:
If a call to sled-agent's `instance_put` fails with anything other than a 400 (bad request), set the instance's state to failed — we cannot know what state the instance is in. This exposed another two problems:

- Nexus would only delete network interfaces off instances that were stopped
- Nexus would only detach disks off instances that were creating or stopped

This commit changes the relevant network interface queries to allow deletion of network interfaces for instances that are either stopped or failed, and adds Failed to the list of `ok_to_detach_instance_states` for disks. Note that network interface insertion still requires an instance to be stopped, not failed. Closes oxidecomputer#1713
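As a rough illustration of the fix described in that commit message, the logic might look like the following sketch. All names and types here are hypothetical stand-ins, not Omicron's actual code: a non-400 failure from `instance_put` marks the instance Failed, and Failed is added to the set of states from which disks may be detached.

```rust
// Hypothetical sketch of the fix described above; names are illustrative,
// not Omicron's real types or functions.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum InstanceState {
    Creating,
    Starting,
    Stopped,
    Failed,
}

// States from which a disk may be detached. The fix adds `Failed`, so that
// unwinding a provision that left the instance in an unknown state can
// still detach its disks.
const OK_TO_DETACH_INSTANCE_STATES: &[InstanceState] = &[
    InstanceState::Creating,
    InstanceState::Stopped,
    InstanceState::Failed,
];

// Handling an `instance_put` error: anything other than a 400 means we
// cannot know what state the instance is in, so mark it Failed.
fn state_after_put_error(http_status: u16) -> Option<InstanceState> {
    if http_status == 400 {
        None // bad request: propagate to the caller, state unchanged
    } else {
        Some(InstanceState::Failed)
    }
}

fn can_detach_disk(state: InstanceState) -> bool {
    OK_TO_DETACH_INSTANCE_STATES.contains(&state)
}
```

With this shape, the disk-detach step of the unwind succeeds for a Failed instance instead of erroring out.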
#1715 addresses point 3 (and deals with the aftermath :))
I'm testing some updates to OPTE, and ran into a situation where Nexus panics if the instance fails to start. This appears independent of the OPTE changes, and I believe it should be possible to repro this on the latest `main` branch as well. I'll try to provide as much context as I can.

I built and installed the control plane as normal, using `./tools/create_virtual_hardware.sh` and `omicron-package {package,install}`. All of Omicron appeared to come up as expected. I then created an IP pool, a few global images (populated with `./tools/populate/populate-images.sh`), a disk, and an instance that should attach that disk. All of this was using the `oxide` CLI. The instance provision failed with the following:

This isn't a normal 400; it looks like Nexus actually crashed. We can double-check that by looking at the `nexus` log file, after logging into the `oxz_nexus` zone:

So Nexus appears to panic during a saga unwind. That error comes from Steno, which currently just `unwrap()`s any errors from an unwind action, because it's not clear what to do there yet. So which saga node failed? It's not super clear from the log file, which only shows info-level messages and higher. I ran the whole process again, but this time capturing debug messages using DTrace:

To make that a bit more clear:
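The hazard described above can be boiled down to a tiny sketch. This is not Steno's real API — the function names below are invented for illustration — but it shows the problematic pattern: if the saga executor simply `unwrap()`s the result of a fallible undo action, any `Err` becomes a process-wide panic.

```rust
// Minimal illustration (NOT Steno's actual API) of why a fallible undo
// action can take down the whole process when its error is unwrapped.

// Mirrors the failure in this issue: the instance is not in a detachable
// state, so the disk-detach undo step returns Err.
fn undo_attach_disks(instance_state: &str) -> Result<(), String> {
    if instance_state == "stopped" {
        Ok(())
    } else {
        Err(format!("cannot detach disk: instance is {}", instance_state))
    }
}

// The problematic pattern: the executor unwraps the undo result, so an
// Err from the undo action panics the process.
fn run_undo_like_steno(instance_state: &str) {
    undo_attach_disks(instance_state).unwrap();
}
```

Calling `run_undo_like_steno("starting")` panics, while `run_undo_like_steno("stopped")` completes normally — which is exactly the shape of the crash observed here.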
Ok, so Steno is running the undo action for the node with ID `N046`. (Note that the second `<SNIP>`'d portion in that log file is all authz checks. There's nothing else after that in the file at all, since DTrace exited when the Nexus process panicked.)

So why did we start unwinding this saga, and which saga are we in? Here are a few pretty-printed lines from the DTrace output, showing the initial place where we start to unwind the saga with this ID:
Ok, so a request to the sled-agent timed out, and the SEC started to unwind the saga, beginning with node 82. We're running saga `27ac2b12-5a0b-434c-97a1-ac1effbb9cf5`, and we panic while undoing node `N046`. So which node is that?

I wasn't sure how to get that information, other than actually looking at the saga itself in the database. So:
So we're unwinding the instance creation saga with that ID. Makes sense. What is node 46?
We're failing to run node `"AttachDisksToInstance-0"`, specifically its undo action. That's here:

`omicron/nexus/src/app/sagas/instance_create.rs`, line 756 in `9614b57`
There are a few places we propagate the underlying errors from database calls, but the one that's relevant appears to be here:
`omicron/nexus/src/app/sagas/instance_create.rs`, line 786 in `9614b57`
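To make the failing check concrete, here is a hypothetical sketch — names and shapes are illustrative, not Omicron's actual code — of the pre-fix behavior at that point: the disk-detach query only succeeds when the instance is in one of a small set of states, so an instance still in `starting` causes the undo action to return an error.

```rust
// Illustrative sketch only -- not Omicron's actual code. Pre-fix, disk
// detach only succeeded for instances that were creating or stopped, so
// an instance stuck in "starting" made the undo action return Err.
#[derive(Debug, PartialEq)]
enum DetachError {
    BadInstanceState(String),
}

fn detach_disk(instance_state: &str) -> Result<(), DetachError> {
    match instance_state {
        "creating" | "stopped" => Ok(()),
        other => Err(DetachError::BadInstanceState(other.to_string())),
    }
}
```

That `Err` is what the undo action propagates back to Steno.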
That's because the instance appears to still be in the starting state, according to the database. This is the instance with the imaginative name `i`, which I verified from the Nexus log.

To recap the whole chain of events:
A request to sled-agent timed out, so the SEC began unwinding the instance-creation saga. The undo action for `AttachDisksToInstance-0` returned `Err`, because the instance is still in the starting state. Steno `unwrap()`s the error, and Nexus panics.