-
Notifications
You must be signed in to change notification settings - Fork 40
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Better cleanup of networking resources #1352
Conversation
8eb8b02
to
4bf48e8
Compare
- Remove IP addresses, VNICs, and etherstub during `omicron-package uninstall` - Fail `destroy_virtual_hardware.sh` if Omicron zones are still installed - In `destroy_virtual_hardware.sh`, also unload xde driver and delete interfaces over the simulated Chelsio VNICs - Update how-to-run.adoc.
4bf48e8
to
f55b44f
Compare
Here are some details explaining the improvements here. I started with a completely fresh system:
No Omicron zones, no simulated Chelios or U.2s, no nothing. From here, we can create the virtual hardware as usual:
I then built and installed Omicron itself:
We can see there are a bunch of resources and zones now:
At this point, trying to destroy the virtual hardware fails helpfully:
Ok, so let's uninstall Omicron now:
This one of the big changes. Now, all the IP addresses we created for the various Omicron services are destroyed, as are the IP interfaces and datalinks under them. The etherstub is also gone.
Let's try to destroy the hardware again:
That looks OK. We've now unloaded xde, destroyed the VNICs (and IP interfaces if needed), and removed the vdevs. I've also confirmed that we can run the whole thing again at this point, and it all looks good. This should resolve #1213 and #1212. I also went ahead and improved the "success" messages in
That should resolve #1214 as well, I hope, though I'm open to even clearer messages. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(All my comments aside) I think this is great! I like the new slog logging. I'm out of my element around the specific networking commands.
It would be really neat if we had tests for install_virtual_hardware.sh/destroy_virtual_hardware.sh, even absent omicron-package install
. But I imagine that we cannot run any of this under buildomat because it assumes root access in a GZ?
The biggest thing I don't really follow are how we manage the interdependencies between these different components. It seems like:
- we have resources created by install_virtual_hardware.sh,
omicron-package install
, and Sled Agent - these resources have dependencies between them, presumably only in one direction
- these resources are cleaned up by
omicron-package uninstall
and destroy_virtual_hardware.sh
I don't have a handle on how we guide people toward making sure that if they've added a new resource or dependency, there's code in the right spot to clean it up, and it's idempotent, etc. Or how we enforce or test any of that. That question is really much bigger than this PR -- in fact, this PR tightens that up, which is a big improvement. It's just that I'm left with the lingering feeling that there remain many ways for this to go wrong.
tools/destroy_virtual_hardware.sh
Outdated
} | ||
|
||
function verify_omicron_uninstalled { | ||
svcs "svc:/system/illumos/sled-agent:default" 2>&1 > /dev/null |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is probably the right choice, and also seems like it won't always work in some edge cases. I'm thinking about cases where install
started and got far enough to create dependencies that would prevent this script from running, but not far enough to set up sled agent. We might also have a case where uninstall
failed after successfully removing sled-agent
, but before having removed all the things that this script expects to be gone. Still, this helps in the most common case, so it's worth doing. To address this more deeply I think would require first-classing more of these scripts. I think we expect them to function like a stack of modules with idempotent init/teardown that have assumptions about the modules underneath them, but that seems really hard to enforce between a complex Rust program like Sled Agent and these shell scripts.
set -x | ||
} | ||
|
||
function verify_omicron_uninstalled { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like this function would succeed erroneously if, say, configd were down, or other failure modes that would cause this command to fail even if sled-agent was still installed. It seems like quite a lot of work to make this more robust though and I don't think it's worth doing in this shell script.
Thanks for the helpful comments @davepacheco. We talked a bit about this in chat, and I wanted to capture the main points here. First, this PR represents and improvement, and I'm planning to merge it when CI is happy. Second, there's still a lot to be desired, as Dave mentioned. Most of that is around error-handling; the general opacity and fragility of the bash scripts; and the sprawl of the requirements and assumptions throughout the various scripts and binaries. An overall better approach would be something like this. We'd like one location where we define a sequence of initialization and teardown steps, each of which is idempotent and captures its expected invariants. For example, the sled agent expects that the The second major improvement would be to express the "environment" in one or more configuration files. We currently do some of this, for example in A second configuration could be for something closer to an actual Gimlet, which might include actual physical disks rather than file-backed vdevs and the Chelsio physical links in lieu of the combination of This configuration would then be fed into the saga, provided as input to the actions. E.g., one step would create / remove the file-backed vdevs for the "emulated Gimlet" config, while the real Gimlet config would do nothing here (the disks are there already). All of this is a good bit of work. However, it does seems significantly better. Real Rust programs would take the place of bash. The environment configuration files would express all the expectations about the machine(s) on which Omicron is running. There are other considerations, too, such as the coming ramdisk image(s), how long we'll need to support commodity hardware such as most of us have at home, etc. But it does seem like a worthwhile improvement, since we've already seen so much difficulty with the existing system. |
Just as a data point, these changes allowed my Helios box to run Omicron again, which it could not with the previous version. So: thanks! |
Added a new package, crucible-dtrace that pulls from buildomat a package that contains a set of DTrace scripts. These scripts are extracted into the global zone at /opt/oxide/crucible_dtrace/ Update Crucible to latest includes these updates: Clean up dependency checking, fixing space leak (#1372) Make a DTrace package (#1367) Use a single context in all messages (#1363) Remove `DownstairsWork`, because it's redundant (#1371) Remove `WorkState`, because it's implicit (#1370) Do work immediately upon receipt of a job, if possible (#1366) Move 'do work for one job' into a helper function (#1365) Remove `DownstairsWork` from map when handling it (#1361) Using `block_in_place` for IO operations (#1357) update omicron deps; use re-exported dropshot types in oximeter-producer configuration (#1369) Parameterize more tests (#1364) Misc cleanup, remove sqlite references. (#1360) Fix `Extent::close` docstring (#1359) Make many `Region` functions synchronous (#1356) Remove `Workstate::Done` (unused) (#1355) Return a sorted `VecDeque` directly (#1354) Combine `proc_frame` and `do_work_for` (#1351) Move `do_work_for` and `do_work` into `ActiveConnection` (#1350) Support arbitrary Volumes during replace compare (#1349) Remove the SQLite backend (#1352) Add a custom timeout for buildomat tests (#1344) Move `proc_frame` into `ActiveConnection` (#1348) Remove `UpstairsConnection` from `DownstairsWork` (#1341) Move Work into ConnectionState (#1340) Make `ConnectionState` an enum type (#1339) Parameterize `test_repair.sh` directories (#1345) Remove `Arc<Mutex<Downstairs>>` (#1338) Send message to Downstairs directly (#1336) Consolidate `on_disconnected` and `remove_connection` (#1333) Move disconnect logic to the Downstairs (#1332) Remove invalid DTrace probes. (#1335) Fix outdated comments (#1331) Use message passing when a new connection starts (#1330) Move cancellation into Downstairs, using a token to kill IO tasks (#1329) Make the Downstairs own per-connection state (#1328) Move remaining local state into a `struct ConnectionState` (#1327) Consolidate negotiation + IO operations into one loop (#1322) Allow replacement of a target in a read_only_parent (#1281) Do all IO through IO tasks (#1321) Make `reqwest_client` only present if it's used (#1326) Move negotiation into Downstairs as well (#1320) Update Rust crate clap to v4.5.4 (#1301) Reuse a reqwest client when creating Nexus clients (#1317) Reuse a reqwest client when creating repair client (#1324) Add % to keep buildomat happy (#1323) Downstairs task cleanup (#1313) Update crutest replace test, and mismatch printing. (#1314) Added more DTrace scripts. (#1309) Update Rust crate async-trait to 0.1.80 (#1298)
Update Crucible and Propolis to the latest Added a new package, crucible-dtrace that pulls from buildomat a package that contains a set of DTrace scripts. These scripts are extracted into the global zone at /opt/oxide/crucible_dtrace/ Crucible latest includes these updates: Clean up dependency checking, fixing space leak (#1372) Make a DTrace package (#1367) Use a single context in all messages (#1363) Remove `DownstairsWork`, because it's redundant (#1371) Remove `WorkState`, because it's implicit (#1370) Do work immediately upon receipt of a job, if possible (#1366) Move 'do work for one job' into a helper function (#1365) Remove `DownstairsWork` from map when handling it (#1361) Using `block_in_place` for IO operations (#1357) update omicron deps; use re-exported dropshot types in oximeter-producer configuration (#1369) Parameterize more tests (#1364) Misc cleanup, remove sqlite references. (#1360) Fix `Extent::close` docstring (#1359) Make many `Region` functions synchronous (#1356) Remove `Workstate::Done` (unused) (#1355) Return a sorted `VecDeque` directly (#1354) Combine `proc_frame` and `do_work_for` (#1351) Move `do_work_for` and `do_work` into `ActiveConnection` (#1350) Support arbitrary Volumes during replace compare (#1349) Remove the SQLite backend (#1352) Add a custom timeout for buildomat tests (#1344) Move `proc_frame` into `ActiveConnection` (#1348) Remove `UpstairsConnection` from `DownstairsWork` (#1341) Move Work into ConnectionState (#1340) Make `ConnectionState` an enum type (#1339) Parameterize `test_repair.sh` directories (#1345) Remove `Arc<Mutex<Downstairs>>` (#1338) Send message to Downstairs directly (#1336) Consolidate `on_disconnected` and `remove_connection` (#1333) Move disconnect logic to the Downstairs (#1332) Remove invalid DTrace probes. (#1335) Fix outdated comments (#1331) Use message passing when a new connection starts (#1330) Move cancellation into Downstairs, using a token to kill IO tasks (#1329) Make the Downstairs own per-connection state (#1328) Move remaining local state into a `struct ConnectionState` (#1327) Consolidate negotiation + IO operations into one loop (#1322) Allow replacement of a target in a read_only_parent (#1281) Do all IO through IO tasks (#1321) Make `reqwest_client` only present if it's used (#1326) Move negotiation into Downstairs as well (#1320) Update Rust crate clap to v4.5.4 (#1301) Reuse a reqwest client when creating Nexus clients (#1317) Reuse a reqwest client when creating repair client (#1324) Add % to keep buildomat happy (#1323) Downstairs task cleanup (#1313) Update crutest replace test, and mismatch printing. (#1314) Added more DTrace scripts. (#1309) Update Rust crate async-trait to 0.1.80 (#1298) Propolis just has this one update: Allow boot order config in propolis-standalone --------- Co-authored-by: Alan Hanson <[email protected]>
omicron-package uninstall
destroy_virtual_hardware.sh
if Omicron zones are stillinstalled
destroy_virtual_hardware.sh
, also unload xde driver and deleteinterfaces over the simulated Chelsio VNICs