Failure to delete VNIC when stopping an instance results in bad OPTE state #1364
So, this actually may not be a bug. I intended to put the OPTE state into the running zone, which is dropped here. That's dropped because the instance is being stopped.
That did not fail the second time I stopped the instance. We also no longer see the OPTE port for that instance or the VNIC:
That shows the original. I'm not sure what the right call is here. It seems like relying on a fallible Drop implementation is asking for trouble. But I'm not sure an explicit call to do the work of the Drop implementation would fare any better: if deleting the VNIC failed in the Drop impl, I don't see why it couldn't also fail in an explicit function call.
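Since the comment above weighs a fallible Drop against an explicit call, here is a minimal Rust sketch of one way to split the two: an explicit, fallible `delete()` that the caller can react to (retry, queue, report), with `Drop` reduced to an infallible best-effort log. `Vnic`, `DeleteVnicError`, and the injected `delete_fn` closure are hypothetical stand-ins, not sled-agent's actual types.

```rust
// Hypothetical sketch: explicit fallible teardown, infallible Drop.
#[derive(Debug)]
struct DeleteVnicError(String);

struct Vnic {
    name: String,
    deleted: bool,
}

impl Vnic {
    fn new(name: &str) -> Self {
        Vnic { name: name.to_string(), deleted: false }
    }

    // Explicit teardown: the caller sees the error and can retry or
    // queue the VNIC for later deletion, instead of losing it in Drop.
    // The actual deletion (e.g. a dladm call) is injected as a closure
    // so this sketch runs anywhere.
    fn delete(
        &mut self,
        delete_fn: impl Fn(&str) -> Result<(), String>,
    ) -> Result<(), DeleteVnicError> {
        match delete_fn(&self.name) {
            Ok(()) => {
                self.deleted = true;
                Ok(())
            }
            Err(e) => Err(DeleteVnicError(e)),
        }
    }
}

impl Drop for Vnic {
    // Drop becomes best-effort and infallible: it only notes that the
    // explicit call was skipped or failed.
    fn drop(&mut self) {
        if !self.deleted {
            eprintln!("leaked VNIC {}; deletion must be retried", self.name);
        }
    }
}
```

The point isn't that the explicit call succeeds where Drop would fail; it's that its failure is visible to the caller instead of being swallowed.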
I think something we really need is the ability to:
Eventually we want to be able to surface this ongoing fault condition to the operator, who might choose to drain and reboot the sled; to Oxide, so that we can investigate and fix the bug; and possibly also to Nexus, which might choose to prefer provisioning resources on nodes that don't currently have faults reported. I realise we don't have much of this infrastructure today. I think it would be good to consider how we get from a fallible Drop handler to an infallible one, even if it's just something that notes the failure in a persistent in-memory list of faults and/or followup actions that need to be retried.
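The "persistent in-memory list of faults" idea could be as small as this sketch. `Fault` and `FaultQueue` are hypothetical names (nothing like this exists verbatim in omicron today): the Drop handler records the failure and stays infallible, and the queue can later be drained for retry or surfaced to Nexus or an operator.

```rust
use std::collections::VecDeque;
use std::sync::Mutex;

// Hypothetical record of a cleanup that failed and needs followup.
struct Fault {
    instance_id: String,
    description: String,
}

// A shared, in-memory fault log the sled agent could keep so that
// recording a failure never itself fails.
#[derive(Default)]
struct FaultQueue {
    faults: Mutex<VecDeque<Fault>>,
}

impl FaultQueue {
    // Record a failed cleanup; callable from a Drop handler.
    fn record(&self, instance_id: &str, description: &str) {
        self.faults.lock().unwrap().push_back(Fault {
            instance_id: instance_id.to_string(),
            description: description.to_string(),
        });
    }

    // Number of faults awaiting retry or operator attention.
    fn pending(&self) -> usize {
        self.faults.lock().unwrap().len()
    }
}
```

Being in-memory only, this would not survive a sled-agent restart; a persistent version would need to write through to durable storage.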
Thanks for the thoughts Josh, good points all. As far as reporting faults or failures, I agree that reporting failures like these to Nexus and ultimately an operator is desirable. And you're right that there is little today; at least, little that would surface those to consumers outside Nexus. There is infrastructure today for storing such errors, such as through

For the practical effort of fixing this issue, there are a few things. First, the sled agent already deletes all VNICs and OPTE ports on startup. I've not seen that fail on something like this, though I don't see any reason it couldn't. I'll experiment later today, but I'm expecting that whatever's still on the link will release that hold when completely uninstalling Omicron. Second, I believe that, whatever the bug is, it will not be resolved by trying to clean up later. I'm coming back to this several hours later, and deleting the VNIC manually still claims the link is busy.

All that is to say that we could start a task to delete these things at some point, or immediately when trying to restart the instance, but I'm not confident that would work. There's clearly something keeping the link busy, and I expect we'd need to resolve that to fix this.

On a related point, part of why this matters is that we create a new OPTE port every time we start the guest, with a new VNIC on top. Suppose we detected this problem and queued the VNIC / port for deletion later. Then we'd actually want to check that queue when creating the ports for the instance on startup. If we find an extant OPTE port, we need to use it rather than delete it; otherwise we'll run into duplicate MAC / IP address issues like we've seen. Does that sound reasonable? I'm a bit leery, since I don't know exactly what state is kept around in either the VNIC or OPTE that would prevent associating OPTE with a different VNIC. I can experiment, however. Thanks again for the suggestions!
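The check-the-queue-on-startup idea in the last paragraph might look roughly like this sketch. `StalePort`, the map, and the port naming are hypothetical illustrations, not the real instance-manager types:

```rust
use std::collections::HashMap;

// Hypothetical leftover state from a teardown that couldn't complete.
struct StalePort {
    opte_port: String,
    vnic: String,
}

// Returns the OPTE port name to use for a starting instance: an extant
// one left over from a failed teardown, or a fresh one otherwise.
fn port_for_instance(
    stale: &mut HashMap<String, StalePort>,
    instance_id: &str,
) -> String {
    if let Some(p) = stale.remove(instance_id) {
        // Reuse rather than delete-and-recreate: a second port with the
        // same MAC / private IP is exactly the duplicate-state problem
        // described above.
        p.opte_port
    } else {
        // Stand-in for real port creation.
        format!("opte-{instance_id}")
    }
}
```

As the comment notes, whether OPTE state can actually be re-associated with a new VNIC this way is an open question that would need experimentation.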
So this particular problem can be chalked up to "too many shells open". I had

However, it's not all gravy: the OPTE port was still around. Trying to delete that failed with:
Unloading the
This is actually still something we need to resolve. It'll still result in an unusable sled in the case where we can't delete the VNIC (for whatever reason). We should queue it for later deletion (periodically retrying, maybe?), or for deletion when the sled gets a request to restart the stopped instance. If neither of those works, we should return a 500 to Nexus (or maybe a 503), rather than barreling ahead and making things worse.

I think it makes sense to keep around the OPTE port, the VNIC, and the instance UUID. When the sled agent gets a request to start a stopped instance, it checks for the existence of the UUID in that list, and uses that to (1) retry deletion immediately, or (2) fail the request. We should also move this from the Drop impl to an explicit function call. That should try to delete the VNIC, and not delete the OPTE port if that fails. Otherwise we've already primed the sled to panic when we unload xde, given the existence of oxidecomputer/opte#178.
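The start-path check proposed above (keep the leaked state keyed by instance UUID; on a start request, retry deletion or fail the request) reduces to a short sketch. All names here are hypothetical, and the real deletion call is injected as a closure:

```rust
use std::collections::HashMap;

// Hypothetical record of state we failed to tear down at stop time.
struct Leaked {
    vnic: String,
    opte_port: String,
}

#[derive(Debug, PartialEq)]
enum StartOutcome {
    Proceed,        // no leaked state, or the retry succeeded
    Unavailable503, // deletion still failing; don't make things worse
}

// On a start request, retry deletion of any leaked VNIC for this
// instance; if it still fails, refuse the request rather than creating
// a duplicate port.
fn check_start(
    leaked: &mut HashMap<String, Leaked>,
    instance_id: &str,
    delete_vnic: impl Fn(&str) -> Result<(), String>,
) -> StartOutcome {
    let vnic = match leaked.get(instance_id) {
        None => return StartOutcome::Proceed,
        Some(l) => l.vnic.clone(),
    };
    if delete_vnic(&vnic).is_err() {
        return StartOutcome::Unavailable503;
    }
    leaked.remove(instance_id);
    StartOutcome::Proceed
}
```

A 503 maps naturally onto "retry later once the sled has recovered", which is why it may fit better than a 500 here.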
Dumb question, but why are we bothering to keep all this OPTE state around? If the instance is stopped, it may take days before it is restarted. It may never be rescheduled on the same machine. Trying to manage a cache of partially-destroyed OPTE state seems like an optimization we may not yet need. What would be so bad about doing unconditional teardown, like we do in the
The issue is what we do when deleting the VNIC fails. The current implementation goes ahead and deletes the OPTE port as well, which causes oxidecomputer/opte#178. That's a separate bug, and it's possible that fixing it is both easy and obviates all this. That'd be great, but I don't know if that's the case.

If not, then we need to be a bit more careful. The drop impl can't just barrel ahead and delete the OPTE port too. If we do that, we create a time-bomb, priming the sled to run into oxidecomputer/opte#178 and panicking the machine when the driver is unloaded. It also means that restarting the instance will cause more problems: we'll attempt to create an OPTE port with the same MAC, private IP, and VNI, which will confuse the hell out of OPTE. That could probably also be resolved, since @rzezeski is planning to disallow that, but we're in that situation today, and I'm not sure we'll get out of it soon.

It's not a dumb question; I just think we might be in a position where it's necessary for a while. To be clear, I'm not going to work on this now. I wanted to keep some record until we know it won't be a problem because the above situation resolves.
With the merger of oxidecomputer/opte#185, we should really not be able to hit this again. The deletion of the VNIC could still fail if another process opens it, say
Closed by #1418
When we create an instance, we create an OPTE port and, currently, a VNIC over that. We correctly destroy this state when the instance is completely deleted, but not when it's just stopped. If you stop and restart an instance with the CLI, you should see something like this:
That's two OPTE ports for the guest, with exactly the same information. It's not really a bug in OPTE that this doesn't raise an error, since it's theoretically possible to get to this state normally. (The current implementation in Nexus will never assign the same MAC twice, but that's a convenience, not a requirement.) The sled agent also needs to destroy this state when it stops an instance, consistent with the idea that stopped instances consume no resources on the sled.