Failure to delete VNIC when stopping an instance results in bad OPTE state #1364

Closed
bnaecker opened this issue Jul 7, 2022 · 9 comments
Labels: bug, networking, Sled Agent

Comments

bnaecker (Collaborator) commented Jul 7, 2022

When we create an instance, we create an OPTE port and, currently, a VNIC over it. We correctly destroy this state when the instance is completely deleted, but not when it is merely stopped. If you stop and restart an instance with the CLI, you should see something like this:

bnaecker@feldspar : ~/omicron $ pfexec opteadm list-ports
LINK                             MAC ADDRESS              IPv4 ADDRESS     STATE
opte0                            A8:40:25:F2:86:DB        172.30.0.5       running
opte1                            A8:40:25:F2:86:DB        172.30.0.5       running
bnaecker@feldspar : ~/omicron $

That's two OPTE ports for the guest, with the exact same information. It's not really a bug in OPTE that this doesn't raise an error, since it's theoretically possible to get into this state legitimately. (The current implementation in Nexus will never assign the same MAC twice, but that's a convenience rather than a requirement.) The sled-agent also needs to destroy this state when it stops an instance, consistent with the idea that stopped instances consume no resources on the sled.

@bnaecker added the bug, networking, and Sled Agent labels Jul 7, 2022
@bnaecker bnaecker self-assigned this Jul 7, 2022
bnaecker (Collaborator, Author) commented Jul 7, 2022

So, this actually may not be a bug. I intended to put the OPTE state into the running zone, which is dropped here. That's dropped because

  • The Instance is removed from the map, and goes out of scope at the end of that function.
  • Instance contains InstanceInner
  • InstanceInner contains RunningState
  • RunningState contains the RunningZone
  • RunningZone contains the opte_ports for the zone.

The Drop impl for OptePort should delete the VNIC and then the OPTE port itself. That seems to fail in some cases. From the sled agent log after stopping the instance:

[2022-07-07T00:02:18.468275299Z]  INFO: SledAgent/InstanceManager/18856 on feldspar: Stopped and uninstalled zone (instance_id=04d9934a-e70d-4c31-8761-7280a6197948)
    zone: oxz_propolis-server_9657cde2-9c94-42b6-ab29-540f27f5ed0d
Failed to delete VNIC: Failed to delete vnic vopte0: Command [/usr/sbin/dladm delete-vnic vopte0] executed and failed with status: exit status: 1. Stdout: , Stderr: dladm: vnic deletion failed: link busy

WARNING: Failed to delete OPTE port 'opte0'

That did not fail the second time I stopped the instance. We also no longer see the OPTE port for that instance or the VNIC:

bnaecker@feldspar : ~/omicron $ dladm
LINK        CLASS     MTU    STATE    BRIDGE     OVER
igb0        phys      1500   up       --         --
net0        vnic      1500   up       --         igb0
net1        vnic      1500   up       --         igb0
stub0       etherstub 9000   up       --         --
underlay0   vnic      9000   up       --         stub0
oxControlService0 vnic 9000  up       --         stub0
oxControlStorage0 vnic 9000  up       --         stub0
oxControlStorage1 vnic 9000  up       --         stub0
oxControlStorage2 vnic 9000  up       --         stub0
oxControlStorage3 vnic 9000  up       --         stub0
oxControlStorage4 vnic 9000  up       --         stub0
oxControlService1 vnic 9000  up       --         stub0
oxControlService2 vnic 9000  up       --         stub0
vopte0      vnic      1500   up       --         opte0
bnaecker@feldspar : ~/omicron $ pfexec opteadm list-ports
LINK                             MAC ADDRESS              IPv4 ADDRESS     STATE
opte0                            A8:40:25:F2:86:DB        172.30.0.5       running
bnaecker@feldspar : ~/omicron $

That shows the original opte0 port and VNIC, which could not be deleted. But the new ones created the second time we started the instance (opte1 and vopte1, respectively) have been successfully removed.

I'm not sure what the right call is here. It seems like relying on a fallible Drop implementation is asking for trouble. But I'm not sure if an explicit call to do the work of the drop implementation would work any better. That is, if deleting the VNIC failed in the drop impl, I don't see why it couldn't also fail in an explicit function call.

jclulow (Collaborator) commented Jul 7, 2022

I think something we really need is the ability to:

  • report the failure somehow as a software fault that can eventually be reported to the operator and/or to Oxide
  • try again to clean up later, if sled agent hasn't been restarted yet
  • if sled agent has been restarted we need to detect and try to clean up this detritus on start, and if we can't do that, also report a software fault

Eventually we want to be able to surface this ongoing fault condition to the operator, who might choose to drain and reboot the sled, and to Oxide, so that we can investigate and fix the bug. Possibly also to Nexus, which might then prefer to provision resources on sleds that don't currently have faults reported.

I realise we don't have much of this infrastructure today. I think it would be good to consider how we get from fallible Drop handler to infallible Drop handler, even if it's just something that notes the failure in a persistent in-memory list of faults and/or followup actions that need to be retried.

bnaecker (Collaborator, Author) commented Jul 7, 2022

Thanks for the thoughts Josh, good points all.

As far as reporting faults or failures, I agree that reporting failures like these to Nexus and ultimately an operator is desirable. And you're right that there is little today. At least, little that would surface those to consumers outside Nexus. There is infrastructure today for storing such errors, such as through oximeter. That'd at least get the data into durable storage. Is that something worth pursuing in the short-term?

For the practical effort of fixing this issue, there are a few things. First, the sled agent already deletes all VNICs and OPTE ports on startup. I've not seen that fail on something like this, though I don't see any reason it couldn't fail. I'll experiment later today, but I'm expecting that whatever's still on the link will release that hold when completely uninstalling Omicron.

Second, I believe that, whatever the bug is, it will not be resolved by trying to clean up later. I'm coming back to this several hours later, and deleting the VNIC manually still claims the link is busy. All that is to say, we could start a task to delete these things at some point, or immediately when trying to restart the instance, but I'm not confident that would work. There's clearly something keeping the link busy, and I expect we'd need to resolve that to fix this.

On a related point, part of why this matters is that we create a new OPTE port every time we start the guest, with a new VNIC on top. Let's suppose we detected this problem, and queued the VNIC / port for deletion later. Then we'd actually want to check that queue when creating the ports for the instance on startup. If we find an extant OPTE port, we need to use it rather than delete it, otherwise we'll run into duplicate MAC / IP address issues like we've seen. Does that sound reasonable? I'm a bit leery, since I don't know exactly what state is kept around in either the VNIC or OPTE that would prevent associating OPTE with a different VNIC. I can experiment, however.

Thanks again for the suggestions!

@bnaecker bnaecker changed the title Stopping instance does not remove guest networking state Failure to delete VNIC when stopping an instance results in bad OPTE state Jul 7, 2022
bnaecker (Collaborator, Author) commented Jul 7, 2022

So this particular problem can be chalked up to "too many shells open". I had snoop open on the VNIC in a window I lost track of, which kept a reference to the VNIC and thus prevented it from being deleted. Once I interrupted snoop, the VNIC could be deleted fine.

However, it's not all gravy. The OPTE port was still around. Trying to delete that failed with:

bnaecker@feldspar : ~/omicron $ pfexec opteadm list-ports
LINK                             MAC ADDRESS              IPv4 ADDRESS     STATE
opte0                            A8:40:25:F2:86:DB        172.30.0.5       ready
bnaecker@feldspar : ~/omicron $ pfexec opteadm delete-xde opte0
ERROR: command DeleteXde failed: System { errno: 2, msg: "failed to destroy DLS devnet: 2" }
bnaecker@feldspar : ~/omicron $

Unloading the xde driver could presumably clear this entirely; however, that runs afoul of this issue, which panics the machine. This issue itself is resolved, thankfully without a lurking bug.

bnaecker (Collaborator, Author) commented Jul 7, 2022

This is actually still something we need to resolve. It'll still result in an unusable sled, in the case where we can't delete the VNIC (for whatever reason). We should queue it for later deletion (periodically trying, maybe?), or for deletion when the sled gets a request to restart the stopped instance. If neither of those work, we should return a 500 to Nexus (or maybe 503), rather than barreling ahead and making things worse.

I think it makes sense to keep around the OPTE port, the VNIC, and the instance UUID. When the sled agent gets a request to start a stopped instance, it checks for the existence of the UUID in that list, and uses that to (1) retry deletion immediately, or (2) fail the request.

We should also move this from the Drop impl to an explicit function call. That should try to delete the VNIC, and not delete the OPTE port if that fails. Otherwise we've already primed the sled to panic when we unload xde, given the existence of oxidecomputer/opte#178.

@bnaecker bnaecker reopened this Jul 7, 2022
smklein (Collaborator) commented Jul 7, 2022

Dumb question, but why are we bothering to keep all this OPTE state around?

If the instance is stopped, it may take days before it is restarted. It may never be rescheduled on the same machine.

Trying to manage a cache of partially-destroyed OPTE state seems like an optimization that we may not yet need - what would be so bad about doing unconditional teardown - like we do in the Drop impls for most of our VNICs - when the zone is torn down?

bnaecker (Collaborator, Author) commented Jul 7, 2022

The issue is what we do when deleting the VNIC fails. The current implementation goes ahead and deletes the OPTE port as well. That causes oxidecomputer/opte#178. That's a separate bug, and it's possible that fixing that is both easy and obviates all this. That'd be great, but I don't know if that's the case.

If not, then we need to be a bit more careful. The drop impl can't just barrel ahead and delete the OPTE port too. If we do that, we create a time bomb, priming the sled to run into oxidecomputer/opte#178, panicking the machine when the driver is unloaded. That also means that restarting the instance will cause more problems. We'll attempt to create an OPTE port with the same MAC, private IP, and VNI, which will confuse the hell out of OPTE. That could also be resolved, probably, since @rzezeski is planning to disallow that. But we're in that situation today, and I'm not sure we'll get out of it soon.

It's not a dumb question, I just think we might be in a position where it's necessary for a while. To be clear, I'm not going to work on this now. I wanted to keep some record until we know it won't be a problem because the above situation resolves.

bnaecker (Collaborator, Author) commented

With the merger of oxidecomputer/opte#185, we should really not be able to hit this again. The deletion of the VNIC could still fail if another process opens it, say snoop(1M), but we'll no longer be able to confuse OPTE into creating a duplicate port. Also with that change, failing to delete the VNIC will not remove the DLS devnet entry for OPTE. The TL;DR is that, once we integrate OPTE beyond those two commits, we should be pretty well protected against creating bad OPTE state.

bnaecker (Collaborator, Author) commented

Closed by #1418

leftwo pushed a commit that referenced this issue Jun 26, 2024
Added a new package, crucible-dtrace that pulls from buildomat a package
that contains a set of DTrace scripts.  These scripts are extracted into
the global zone at /opt/oxide/crucible_dtrace/

Update Crucible to latest includes these updates:
Clean up dependency checking, fixing space leak (#1372)
Make a DTrace package (#1367)
Use a single context in all messages (#1363)
Remove `DownstairsWork`, because it's redundant (#1371)
Remove `WorkState`, because it's implicit (#1370)
Do work immediately upon receipt of a job, if possible (#1366)
Move 'do work for one job' into a helper function (#1365)
Remove `DownstairsWork` from map when handling it (#1361)
Using `block_in_place` for IO operations (#1357)
update omicron deps; use re-exported dropshot types in oximeter-producer configuration (#1369)
Parameterize more tests (#1364)
Misc cleanup, remove sqlite references. (#1360)
Fix `Extent::close` docstring (#1359)
Make many `Region` functions synchronous (#1356)
Remove `Workstate::Done` (unused) (#1355)
Return a sorted `VecDeque` directly (#1354)
Combine `proc_frame` and `do_work_for` (#1351)
Move `do_work_for` and `do_work` into `ActiveConnection` (#1350)
Support arbitrary Volumes during replace compare (#1349)
Remove the SQLite backend (#1352)
Add a custom timeout for buildomat tests (#1344)
Move `proc_frame` into `ActiveConnection` (#1348)
Remove `UpstairsConnection` from `DownstairsWork` (#1341)
Move Work into ConnectionState (#1340)
Make `ConnectionState` an enum type (#1339)
Parameterize `test_repair.sh` directories (#1345)
Remove `Arc<Mutex<Downstairs>>` (#1338)
Send message to Downstairs directly (#1336)
Consolidate `on_disconnected` and `remove_connection` (#1333)
Move disconnect logic to the Downstairs (#1332)
Remove invalid DTrace probes. (#1335)
Fix outdated comments (#1331)
Use message passing when a new connection starts (#1330)
Move cancellation into Downstairs, using a token to kill IO tasks (#1329)
Make the Downstairs own per-connection state (#1328)
Move remaining local state into a `struct ConnectionState` (#1327)
Consolidate negotiation + IO operations into one loop (#1322)
Allow replacement of a target in a read_only_parent (#1281)
Do all IO through IO tasks (#1321)
Make `reqwest_client` only present if it's used (#1326)
Move negotiation into Downstairs as well (#1320)
Update Rust crate clap to v4.5.4 (#1301)
Reuse a reqwest client when creating Nexus clients (#1317)
Reuse a reqwest client when creating repair client (#1324)
Add % to keep buildomat happy (#1323)
Downstairs task cleanup (#1313)
Update crutest replace test, and mismatch printing. (#1314)
Added more DTrace scripts. (#1309)
Update Rust crate async-trait to 0.1.80 (#1298)
leftwo added a commit that referenced this issue Jun 26, 2024
Update Crucible and Propolis to the latest

Added a new package, crucible-dtrace that pulls from buildomat a package
that contains a set of DTrace scripts. These scripts are extracted into the 
global zone at /opt/oxide/crucible_dtrace/

Crucible latest includes the same updates listed in the commit above.
Propolis just has this one update:
Allow boot order config in propolis-standalone
---------

Co-authored-by: Alan Hanson <[email protected]>