Better cleanup of networking resources #1352

Merged
merged 5 commits into from
Jul 8, 2022

@bnaecker bnaecker commented Jul 5, 2022

  • Remove IP addresses, VNICs, and etherstub during omicron-package uninstall
  • Fail destroy_virtual_hardware.sh if Omicron zones are still
    installed
  • In destroy_virtual_hardware.sh, also unload xde driver and delete
    interfaces over the simulated Chelsio VNICs
  • Update how-to-run.adoc.

@bnaecker bnaecker requested a review from davepacheco July 5, 2022 17:49
@bnaecker bnaecker force-pushed the cleanup-networking-on-uninstall branch from 8eb8b02 to 4bf48e8 Compare July 5, 2022 17:59
@bnaecker bnaecker force-pushed the cleanup-networking-on-uninstall branch from 4bf48e8 to f55b44f Compare July 5, 2022 18:05

bnaecker commented Jul 5, 2022

Here are some details explaining the improvements here.

I started with a completely fresh system:

bnaecker@feldspar : ~/omicron $ ipadm
ADDROBJ           TYPE     STATE        ADDR
lo0/v4            static   ok           127.0.0.1/8
igb0/dhcp         dhcp     ok           192.168.1.145/24
lo0/v6            static   ok           ::1/128
bnaecker@feldspar : ~/omicron $ dladm
LINK        CLASS     MTU    STATE    BRIDGE     OVER
igb0        phys      1500   up       --         --
bnaecker@feldspar : ~/omicron $ zoneadm list
global
bnaecker@feldspar : ~/omicron $

No Omicron zones, no simulated Chelsios or U.2s, no nothing. From here, we can create the virtual hardware as usual:

bnaecker@feldspar : ~/omicron $ pfexec ./tools/create_virtual_hardware.sh
+++ dirname ./tools/create_virtual_hardware.sh
++ cd ./tools
++ pwd
+ SOURCE_DIR=/home/bnaecker/omicron/tools
+ OMICRON_TOP=/home/bnaecker/omicron/tools/..
+ MARKER=/etc/opt/oxide/NO_INSTALL
+ [[ -f /etc/opt/oxide/NO_INSTALL ]]
+ [[ 0 -ge 1 ]]
++ dladm show-phys -p -o LINK
++ head -1
+ PHYSICAL_LINK=igb0
+ echo 'Using igb0 as physical link'
Using igb0 as physical link
+ ensure_run_as_root
++ id -u
+ [[ 0 -ne 0 ]]
+ ensure_zpools
+ readarray -t ZPOOLS
++ grep '"oxp_' /home/bnaecker/omicron/tools/../smf/sled-agent/config.toml
++ sed 's/[ ",]//g'
+ for ZPOOL in "${ZPOOLS[@]}"
+ VDEV_PATH=/home/bnaecker/omicron/tools/../oxp_d462a7f7-b628-40fe-80ff-4e4189e2d62b.vdev
+ [[ -f /home/bnaecker/omicron/tools/../oxp_d462a7f7-b628-40fe-80ff-4e4189e2d62b.vdev ]]
+ truncate -s 10GB /home/bnaecker/omicron/tools/../oxp_d462a7f7-b628-40fe-80ff-4e4189e2d62b.vdev
+ success 'ZFS vdev /home/bnaecker/omicron/tools/../oxp_d462a7f7-b628-40fe-80ff-4e4189e2d62b.vdev exists'
+ echo -e '\e[1;36mZFS vdev /home/bnaecker/omicron/tools/../oxp_d462a7f7-b628-40fe-80ff-4e4189e2d62b.vdev exists\e[0m'
ZFS vdev /home/bnaecker/omicron/tools/../oxp_d462a7f7-b628-40fe-80ff-4e4189e2d62b.vdev exists
++ zpool list -o name
++ grep oxp_d462a7f7-b628-40fe-80ff-4e4189e2d62b
+ [[ -z '' ]]
+ zpool create -f oxp_d462a7f7-b628-40fe-80ff-4e4189e2d62b /home/bnaecker/omicron/tools/../oxp_d462a7f7-b628-40fe-80ff-4e4189e2d62b.vdev
+ success 'ZFS zpool oxp_d462a7f7-b628-40fe-80ff-4e4189e2d62b exists'
+ echo -e '\e[1;36mZFS zpool oxp_d462a7f7-b628-40fe-80ff-4e4189e2d62b exists\e[0m'
ZFS zpool oxp_d462a7f7-b628-40fe-80ff-4e4189e2d62b exists
+ for ZPOOL in "${ZPOOLS[@]}"
+ VDEV_PATH=/home/bnaecker/omicron/tools/../oxp_e4b4dc87-ab46-49fb-a4b4-d361ae214c03.vdev
+ [[ -f /home/bnaecker/omicron/tools/../oxp_e4b4dc87-ab46-49fb-a4b4-d361ae214c03.vdev ]]
+ truncate -s 10GB /home/bnaecker/omicron/tools/../oxp_e4b4dc87-ab46-49fb-a4b4-d361ae214c03.vdev
+ success 'ZFS vdev /home/bnaecker/omicron/tools/../oxp_e4b4dc87-ab46-49fb-a4b4-d361ae214c03.vdev exists'
+ echo -e '\e[1;36mZFS vdev /home/bnaecker/omicron/tools/../oxp_e4b4dc87-ab46-49fb-a4b4-d361ae214c03.vdev exists\e[0m'
ZFS vdev /home/bnaecker/omicron/tools/../oxp_e4b4dc87-ab46-49fb-a4b4-d361ae214c03.vdev exists
++ zpool list -o name
++ grep oxp_e4b4dc87-ab46-49fb-a4b4-d361ae214c03
+ [[ -z '' ]]
+ zpool create -f oxp_e4b4dc87-ab46-49fb-a4b4-d361ae214c03 /home/bnaecker/omicron/tools/../oxp_e4b4dc87-ab46-49fb-a4b4-d361ae214c03.vdev
+ success 'ZFS zpool oxp_e4b4dc87-ab46-49fb-a4b4-d361ae214c03 exists'
+ echo -e '\e[1;36mZFS zpool oxp_e4b4dc87-ab46-49fb-a4b4-d361ae214c03 exists\e[0m'
ZFS zpool oxp_e4b4dc87-ab46-49fb-a4b4-d361ae214c03 exists
+ for ZPOOL in "${ZPOOLS[@]}"
+ VDEV_PATH=/home/bnaecker/omicron/tools/../oxp_f4b4dc87-ab46-49fb-a4b4-d361ae214c03.vdev
+ [[ -f /home/bnaecker/omicron/tools/../oxp_f4b4dc87-ab46-49fb-a4b4-d361ae214c03.vdev ]]
+ truncate -s 10GB /home/bnaecker/omicron/tools/../oxp_f4b4dc87-ab46-49fb-a4b4-d361ae214c03.vdev
+ success 'ZFS vdev /home/bnaecker/omicron/tools/../oxp_f4b4dc87-ab46-49fb-a4b4-d361ae214c03.vdev exists'
+ echo -e '\e[1;36mZFS vdev /home/bnaecker/omicron/tools/../oxp_f4b4dc87-ab46-49fb-a4b4-d361ae214c03.vdev exists\e[0m'
ZFS vdev /home/bnaecker/omicron/tools/../oxp_f4b4dc87-ab46-49fb-a4b4-d361ae214c03.vdev exists
++ zpool list -o name
++ grep oxp_f4b4dc87-ab46-49fb-a4b4-d361ae214c03
+ [[ -z '' ]]
+ zpool create -f oxp_f4b4dc87-ab46-49fb-a4b4-d361ae214c03 /home/bnaecker/omicron/tools/../oxp_f4b4dc87-ab46-49fb-a4b4-d361ae214c03.vdev
+ success 'ZFS zpool oxp_f4b4dc87-ab46-49fb-a4b4-d361ae214c03 exists'
+ echo -e '\e[1;36mZFS zpool oxp_f4b4dc87-ab46-49fb-a4b4-d361ae214c03 exists\e[0m'
ZFS zpool oxp_f4b4dc87-ab46-49fb-a4b4-d361ae214c03 exists
+ ensure_simulated_chelsios igb0
+ local PHYSICAL_LINK=igb0
+ VNIC_NAMES=("net0" "net1")
+ for VNIC in "${VNIC_NAMES[@]}"
++ get_vnic_name_if_exists net0
+++ dladm show-vnic -p -o LINK net0
dladm: invalid vnic name 'net0': object not found
++ NAME=
++ [[ 1 -eq 0 ]]
++ echo ''
+ [[ -z '' ]]
+ dladm create-vnic -t -l igb0 net0
+ success 'VNIC net0 exists'
+ echo -e '\e[1;36mVNIC net0 exists\e[0m'
VNIC net0 exists
+ for VNIC in "${VNIC_NAMES[@]}"
++ get_vnic_name_if_exists net1
+++ dladm show-vnic -p -o LINK net1
dladm: invalid vnic name 'net1': object not found
++ NAME=
++ [[ 1 -eq 0 ]]
++ echo ''
+ [[ -z '' ]]
+ dladm create-vnic -t -l igb0 net1
+ success 'VNIC net1 exists'
+ echo -e '\e[1;36mVNIC net1 exists\e[0m'
VNIC net1 exists
bnaecker@feldspar : ~/omicron $

I then built and installed Omicron itself:

bnaecker@feldspar : ~/omicron $ ./target/release/omicron-package package
    Finished release [optimized] target(s) in 0.35s
[00:00:11] ######################################## 36920154/36920154 crucible: done
[00:00:22] ######################################## 90627584/90627584 maghemite: done
[00:00:16] ######################################## 60808521/60808521 propolis-server: done
[00:00:02] ########################################       6/6       clickhouse: done
[00:00:03] ########################################      12/12      cockroachdb: done
[00:00:03] ########################################       4/4       internal-dns: done
[00:00:00] ########################################     102/102     omicron-nexus: done
[00:00:00] ########################################       5/5       omicron-sled-agent: done
[00:00:00] ########################################       4/4       oximeter-collector: done
bnaecker@feldspar : ~/omicron $ pfexec ./target/release/omicron-package install
Jul 05 17:42:05.053 INFO Installing service, dst: /opt/oxide/sled-agent.tar, src: out/sled-agent.tar
Jul 05 17:42:05.053 INFO Installing service, dst: /opt/oxide/oximeter.tar.gz, src: out/oximeter.tar.gz
Jul 05 17:42:05.053 INFO Installing service, dst: /opt/oxide/internal-dns.tar.gz, src: out/internal-dns.tar.gz
Jul 05 17:42:05.053 INFO Installing service, dst: /opt/oxide/crucible.tar.gz, src: out/crucible.tar.gz
Jul 05 17:42:05.053 INFO Installing service, dst: /opt/oxide/cockroachdb.tar.gz, src: out/cockroachdb.tar.gz
Jul 05 17:42:05.053 INFO Installing service, dst: /opt/oxide/clickhouse.tar.gz, src: out/clickhouse.tar.gz
Jul 05 17:42:05.053 INFO Installing service, dst: /opt/oxide/mg-ddm.tar, src: out/mg-ddm.tar
Jul 05 17:42:05.053 INFO Installing service, dst: /opt/oxide/nexus.tar.gz, src: out/nexus.tar.gz
Jul 05 17:42:05.053 INFO Installing service, dst: /opt/oxide/propolis-server.tar.gz, src: out/propolis-server.tar.gz
Jul 05 17:42:05.244 INFO Unpacking service tarball, service_path: /opt/oxide/sled-agent, tar_path: /opt/oxide/sled-agent.tar
Jul 05 17:42:05.281 INFO Unpacking service tarball, service_path: /opt/oxide/mg-ddm, tar_path: /opt/oxide/mg-ddm.tar
Jul 05 17:42:05.347 INFO Installing boostrap service from /opt/oxide/sled-agent/pkg/manifest.xml
bnaecker@feldspar : ~/omicron $

We can see there are a bunch of resources and zones now:

bnaecker@feldspar : ~/omicron $ zoneadm list
global
oxz_internal-dns
oxz_crucible_oxp_d462a7f7-b628-40fe-80ff-4e4189e2d62b
oxz_crucible_oxp_e4b4dc87-ab46-49fb-a4b4-d361ae214c03
oxz_crucible_oxp_f4b4dc87-ab46-49fb-a4b4-d361ae214c03
oxz_cockroachdb_oxp_d462a7f7-b628-40fe-80ff-4e4189e2d62b
oxz_clickhouse_oxp_d462a7f7-b628-40fe-80ff-4e4189e2d62b
oxz_oximeter
oxz_nexus
bnaecker@feldspar : ~/omicron $ ipadm
ADDROBJ           TYPE     STATE        ADDR
lo0/v4            static   ok           127.0.0.1/8
igb0/dhcp         dhcp     ok           192.168.1.145/24
lo0/v6            static   ok           ::1/128
net0/linklocal    addrconf ok           fe80::8:20ff:fe2e:6eb4/10
net1/linklocal    addrconf ok           fe80::8:20ff:fe5e:77cb/10
underlay0/linklocal addrconf ok         fe80::8:20ff:fec9:b125/10
underlay0/bootstrap6 static ok          fdb0:b42e:99fe:859::1/64
underlay0/sled6   static   ok           fd00:1122:3344:101::1/64
underlay0/internaldns static ok         fd00:1122:3344:1::2/64
bnaecker@feldspar : ~/omicron $ dladm
LINK        CLASS     MTU    STATE    BRIDGE     OVER
igb0        phys      1500   up       --         --
net0        vnic      1500   up       --         igb0
net1        vnic      1500   up       --         igb0
stub0       etherstub 9000   up       --         --
underlay0   vnic      9000   up       --         stub0
oxControlService0 vnic 9000  up       --         stub0
oxControlStorage0 vnic 9000  up       --         stub0
oxControlStorage1 vnic 9000  up       --         stub0
oxControlStorage2 vnic 9000  up       --         stub0
oxControlStorage3 vnic 9000  up       --         stub0
oxControlStorage4 vnic 9000  up       --         stub0
oxControlService1 vnic 9000  up       --         stub0
oxControlService2 vnic 9000  up       --         stub0
bnaecker@feldspar : ~/omicron $

At this point, trying to destroy the virtual hardware fails helpfully:

bnaecker@feldspar : ~/omicron $ pfexec ./tools/destroy_virtual_hardware.sh
+++ dirname ./tools/destroy_virtual_hardware.sh
++ cd ./tools
++ pwd
+ SOURCE_DIR=/home/bnaecker/omicron/tools
+ cd /home/bnaecker/omicron/tools/..
+ OMICRON_TOP=/home/bnaecker/omicron
+ MARKER=/etc/opt/oxide/NO_INSTALL
+ [[ -f /etc/opt/oxide/NO_INSTALL ]]
++ id -u
+ [[ 0 -ne 0 ]]
+ verify_omicron_uninstalled
++ svcs svc:/system/illumos/sled-agent:default
+ [[ 0 -eq 0 ]]
+ set +x
Omicron is still installed, please run `omicron-package uninstall`, and then re-run this script
+ exit 1
bnaecker@feldspar : ~/omicron $

Ok, so let's uninstall Omicron now:

Jul 05 17:45:19.243 INFO Removing all Omicron zones
Jul 05 17:45:23.720 INFO Uninstalling all packages
Jul 05 17:45:23.978 INFO Removing artifacts in: out
Jul 05 17:45:23.978 INFO Keeping: 'out/console-assets'
Jul 05 17:45:23.978 INFO Removing: 'out/crucible.tar.gz'
Jul 05 17:45:23.978 INFO Keeping: 'out/xde'
Jul 05 17:45:23.979 INFO Removing: 'out/oximeter.tar.gz'
Jul 05 17:45:23.979 INFO Removing: 'out/propolis-server.tar.gz'
Jul 05 17:45:23.979 INFO Keeping: 'out/clickhouse'
Jul 05 17:45:23.979 INFO Removing: 'out/mg-ddm.tar'
Jul 05 17:45:23.979 INFO Removing: 'out/clickhouse.tar.gz'
Jul 05 17:45:23.979 INFO Keeping: 'out/downloads'
Jul 05 17:45:23.979 INFO Removing: 'out/nexus.tar.gz'
Jul 05 17:45:23.979 INFO Removing: 'out/internal-dns.tar.gz'
Jul 05 17:45:23.979 INFO Removing: 'out/sled-agent.tar'
Jul 05 17:45:23.979 INFO Removing: 'out/cockroachdb.tar.gz'
Jul 05 17:45:23.979 INFO Keeping: 'out/cockroachdb'
Jul 05 17:45:23.979 INFO Removing installed objects in: /opt/oxide
Jul 05 17:45:23.979 INFO Removing: '/opt/oxide/mg-ddm.tar'
Jul 05 17:45:23.979 INFO Removing: '/opt/oxide/sled-agent'
Jul 05 17:45:23.981 INFO Removing: '/opt/oxide/propolis-server.tar.gz'
Jul 05 17:45:23.981 INFO Removing: '/opt/oxide/crucible.tar.gz'
Jul 05 17:45:23.981 INFO Removing: '/opt/oxide/nexus.tar.gz'
Jul 05 17:45:23.981 INFO Keeping: '/opt/oxide/opte'
Jul 05 17:45:23.981 INFO Removing: '/opt/oxide/internal-dns.tar.gz'
Jul 05 17:45:23.981 INFO Removing: '/opt/oxide/clickhouse.tar.gz'
Jul 05 17:45:23.981 INFO Removing: '/opt/oxide/oximeter.tar.gz'
Jul 05 17:45:23.981 INFO Removing: '/opt/oxide/sled-agent.tar'
Jul 05 17:45:23.981 INFO Removing: '/opt/oxide/mg-ddm'
Jul 05 17:45:23.982 INFO Removing: '/opt/oxide/cockroachdb.tar.gz'
Jul 05 17:45:23.988 WARN Deleting existing Omicron IP address, addrobj: lo0/v4
Jul 05 17:45:23.988 WARN Deleting existing Omicron IP address, addrobj: igb0/dhcp
Jul 05 17:45:23.988 WARN Deleting existing Omicron IP address, addrobj: lo0/v6
Jul 05 17:45:23.988 WARN Deleting existing Omicron IP address, addrobj: net0/linklocal
Jul 05 17:45:23.988 WARN Deleting existing Omicron IP address, addrobj: net1/linklocal
Jul 05 17:45:23.988 WARN Deleting existing Omicron IP address, addrobj: underlay0/linklocal
Jul 05 17:45:23.993 WARN Deleting existing Omicron IP address, addrobj: underlay0/bootstrap6
Jul 05 17:45:23.998 WARN Deleting existing Omicron IP address, addrobj: underlay0/sled6
Jul 05 17:45:24.004 WARN Deleting existing Omicron IP address, addrobj: underlay0/internaldns
Jul 05 17:45:24.014 WARN Deleting existing Omicron IP address, addrobj: lo0/v4
Jul 05 17:45:24.014 WARN Deleting existing Omicron IP address, addrobj: igb0/dhcp
Jul 05 17:45:24.014 WARN Deleting existing Omicron IP address, addrobj: lo0/v6
Jul 05 17:45:24.014 WARN Deleting existing Omicron IP address, addrobj: net0/linklocal
Jul 05 17:45:24.018 WARN Deleting existing Omicron IP address, addrobj: net1/linklocal
Jul 05 17:45:24.037 WARN Deleting existing VNIC, vnic_kind: OxideControl, vnic_name: oxControlService0
Jul 05 17:45:24.040 WARN Deleting existing VNIC, vnic_kind: OxideControl, vnic_name: oxControlStorage1
Jul 05 17:45:24.040 WARN Deleting existing VNIC, vnic_kind: OxideControl, vnic_name: oxControlStorage3
Jul 05 17:45:24.040 WARN Deleting existing VNIC, vnic_kind: OxideControl, vnic_name: oxControlStorage2
Jul 05 17:45:24.040 WARN Deleting existing VNIC, vnic_kind: OxideControl, vnic_name: oxControlStorage0
Jul 05 17:45:24.040 WARN Deleting existing VNIC, vnic_kind: OxideControl, vnic_name: oxControlStorage4
Jul 05 17:45:24.042 WARN Deleting existing VNIC, vnic_kind: OxideControl, vnic_name: oxControlService1
Jul 05 17:45:24.045 WARN Deleting existing VNIC, vnic_kind: OxideControl, vnic_name: oxControlService2
Jul 05 17:45:24.056 WARN Deleting Omicron underlay VNIC, vnic_name: underlay0
Jul 05 17:45:24.075 WARN Deleting Omicron etherstub, stub_name: stub0
bnaecker@feldspar : ~/omicron $

This is one of the big changes. Now, all the IP addresses we created for the various Omicron services are destroyed, as are the IP interfaces and datalinks under them. The etherstub is also gone.

bnaecker@feldspar : ~/omicron $ ipadm
ADDROBJ           TYPE     STATE        ADDR
lo0/v4            static   ok           127.0.0.1/8
igb0/dhcp         dhcp     ok           192.168.1.145/24
lo0/v6            static   ok           ::1/128
bnaecker@feldspar : ~/omicron $ dladm
LINK        CLASS     MTU    STATE    BRIDGE     OVER
igb0        phys      1500   up       --         --
net0        vnic      1500   up       --         igb0
net1        vnic      1500   up       --         igb0
bnaecker@feldspar : ~/omicron $
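
For readers less familiar with the layering, the teardown order matters: addresses sit on IP interfaces, which sit on VNICs, which sit on the etherstub. A rough shell sketch of that order for a single VNIC follows. The function name `teardown_underlay` is hypothetical and this is not the code `omicron-package uninstall` actually runs (that lives in Rust); it only uses the `ipadm` and `dladm` subcommands to illustrate the dependency order.

```bash
# Hypothetical helper: tear down one VNIC and its etherstub in dependency order.
# Any other VNICs over the same etherstub would need the same treatment first.
teardown_underlay() {
    local vnic="$1" stub="$2" addrobj
    # 1. Addresses on top of the VNIC's IP interface go first.
    for addrobj in $(ipadm show-addr -p -o ADDROBJ | grep "^${vnic}/"); do
        ipadm delete-addr "$addrobj" || return 1
    done
    # 2. Then the IP interface itself, if one exists.
    if ipadm show-if -p -o IFNAME "$vnic" >/dev/null 2>&1; then
        ipadm delete-if "$vnic" || return 1
    fi
    # 3. Then the VNIC datalink.
    if dladm show-vnic -p -o LINK "$vnic" >/dev/null 2>&1; then
        dladm delete-vnic "$vnic" || return 1
    fi
    # 4. Finally the etherstub, once nothing is layered over it.
    if dladm show-etherstub "$stub" >/dev/null 2>&1; then
        dladm delete-etherstub "$stub" || return 1
    fi
}
```

With the names from the output above, this would be called as `teardown_underlay underlay0 stub0`.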

Let's try to destroy the hardware again:

bnaecker@feldspar : ~/omicron $ pfexec ./tools/destroy_virtual_hardware.sh
+++ dirname ./tools/destroy_virtual_hardware.sh
++ cd ./tools
++ pwd
+ SOURCE_DIR=/home/bnaecker/omicron/tools
+ cd /home/bnaecker/omicron/tools/..
+ OMICRON_TOP=/home/bnaecker/omicron
+ MARKER=/etc/opt/oxide/NO_INSTALL
+ [[ -f /etc/opt/oxide/NO_INSTALL ]]
++ id -u
+ [[ 0 -ne 0 ]]
+ verify_omicron_uninstalled
+ svcs svc:/system/illumos/sled-agent:default
svcs: Pattern 'svc:/system/illumos/sled-agent:default' doesn't match any instances
+ [[ 1 -eq 0 ]]
+ unload_xde_driver
++ modinfo
++ grep xde
++ cut -d ' ' -f 1
+ local ID=
+ [[ -n '' ]]
+ success 'Unloaded xde kernel driver'
+ set +x
Unloaded xde kernel driver
+ try_remove_vnics
+ try_remove_address lo0/underlay
+ local ADDRESS=lo0/underlay
+ RC=0
++ ipadm show-addr -p -o addr lo0/underlay
ipadm: Address object not found+ [[ -n '' ]]
+ [[ 0 -eq 0 ]]
+ success 'Address lo0/underlay destroyed'
+ set +x
Address lo0/underlay destroyed
+ VNIC_LINKS=("net0" "net1")
+ for LINK in "${VNIC_LINKS[@]}"
+ try_remove_interface net0
+ local IFACE=net0
+ RC=0
++ ipadm show-if -p -o IFNAME net0
ipadm: Could not get interface(s): Interface does not exist+ [[ -n '' ]]
+ [[ 0 -eq 0 ]]
+ success 'Interface net0 destroyed'
+ set +x
Interface net0 destroyed
+ try_remove_vnic net0
+ local LINK=net0
+ RC=0
++ dladm show-vnic -p -o LINK net0
+ [[ -n net0 ]]
+ dladm delete-vnic net0
+ RC=0
+ [[ 0 -eq 0 ]]
+ success 'VNIC link net0 destroyed'
+ set +x
VNIC link net0 destroyed
+ for LINK in "${VNIC_LINKS[@]}"
+ try_remove_interface net1
+ local IFACE=net1
+ RC=0
++ ipadm show-if -p -o IFNAME net1
ipadm: Could not get interface(s): Interface does not exist+ [[ -n '' ]]
+ [[ 0 -eq 0 ]]
+ success 'Interface net1 destroyed'
+ set +x
Interface net1 destroyed
+ try_remove_vnic net1
+ local LINK=net1
+ RC=0
++ dladm show-vnic -p -o LINK net1
+ [[ -n net1 ]]
+ dladm delete-vnic net1
+ RC=0
+ [[ 0 -eq 0 ]]
+ success 'VNIC link net1 destroyed'
+ set +x
VNIC link net1 destroyed
+ try_destroy_zpools
+ readarray -t ZPOOLS
++ zfs list -d 0 -o name
++ grep '^oxp_'
+ for ZPOOL in "${ZPOOLS[@]}"
+ RC=0
+ VDEV_FILE=/home/bnaecker/omicron/oxp_d462a7f7-b628-40fe-80ff-4e4189e2d62b.vdev
+ zfs destroy -r oxp_d462a7f7-b628-40fe-80ff-4e4189e2d62b
+ zfs unmount oxp_d462a7f7-b628-40fe-80ff-4e4189e2d62b
+ zpool destroy oxp_d462a7f7-b628-40fe-80ff-4e4189e2d62b
+ rm -f /home/bnaecker/omicron/oxp_d462a7f7-b628-40fe-80ff-4e4189e2d62b.vdev
+ RC=0
+ [[ 0 -eq 0 ]]
+ success 'Removed ZFS pool and vdev: oxp_d462a7f7-b628-40fe-80ff-4e4189e2d62b'
+ set +x
Removed ZFS pool and vdev: oxp_d462a7f7-b628-40fe-80ff-4e4189e2d62b
+ for ZPOOL in "${ZPOOLS[@]}"
+ RC=0
+ VDEV_FILE=/home/bnaecker/omicron/oxp_e4b4dc87-ab46-49fb-a4b4-d361ae214c03.vdev
+ zfs destroy -r oxp_e4b4dc87-ab46-49fb-a4b4-d361ae214c03
+ zfs unmount oxp_e4b4dc87-ab46-49fb-a4b4-d361ae214c03
+ zpool destroy oxp_e4b4dc87-ab46-49fb-a4b4-d361ae214c03
+ rm -f /home/bnaecker/omicron/oxp_e4b4dc87-ab46-49fb-a4b4-d361ae214c03.vdev
+ RC=0
+ [[ 0 -eq 0 ]]
+ success 'Removed ZFS pool and vdev: oxp_e4b4dc87-ab46-49fb-a4b4-d361ae214c03'
+ set +x
Removed ZFS pool and vdev: oxp_e4b4dc87-ab46-49fb-a4b4-d361ae214c03
+ for ZPOOL in "${ZPOOLS[@]}"
+ RC=0
+ VDEV_FILE=/home/bnaecker/omicron/oxp_f4b4dc87-ab46-49fb-a4b4-d361ae214c03.vdev
+ zfs destroy -r oxp_f4b4dc87-ab46-49fb-a4b4-d361ae214c03
+ zfs unmount oxp_f4b4dc87-ab46-49fb-a4b4-d361ae214c03
+ zpool destroy oxp_f4b4dc87-ab46-49fb-a4b4-d361ae214c03
+ rm -f /home/bnaecker/omicron/oxp_f4b4dc87-ab46-49fb-a4b4-d361ae214c03.vdev
+ RC=0
+ [[ 0 -eq 0 ]]
+ success 'Removed ZFS pool and vdev: oxp_f4b4dc87-ab46-49fb-a4b4-d361ae214c03'
+ set +x
Removed ZFS pool and vdev: oxp_f4b4dc87-ab46-49fb-a4b4-d361ae214c03
bnaecker@feldspar : ~/omicron $

That looks OK. We've now unloaded xde, destroyed the VNICs (and IP interfaces if needed), and removed the vdevs. I've also confirmed that we can run the whole thing again at this point, and it all looks good.

This should resolve #1213 and #1212. I also went ahead and improved the "success" messages in destroy_virtual_hardware.sh. Each step there is supposed to ensure that a resource doesn't exist: it'll remove something if it exists, or just report success if it doesn't. The messages now look like:

+ try_remove_vnic net1
+ local LINK=net1
+ RC=0
++ dladm show-vnic -p -o LINK net1
dladm: invalid vnic name 'net1': object not found
+ [[ -n '' ]]
+ [[ 0 -eq 0 ]]
+ success 'Verified VNIC link net1 does not exist'
+ set +x
Verified VNIC link net1 does not exist

That should resolve #1214 as well, I hope, though I'm open to even clearer messages.
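
The "remove it if it exists, otherwise just confirm it is gone" behavior described above could be sketched as follows. This is an illustration only: the function name `ensure_vnic_absent` is made up, and the real `try_remove_vnic` in destroy_virtual_hardware.sh may be structured differently. The two success messages are the ones shown in the traces in this comment.

```bash
# Hypothetical idempotent "ensure absent" helper for a VNIC.
ensure_vnic_absent() {
    local link="$1"
    # `dladm show-vnic` fails when the VNIC does not exist, so there is
    # nothing to remove and we can report that instead of "destroyed".
    if ! dladm show-vnic -p -o LINK "$link" >/dev/null 2>&1; then
        echo "Verified VNIC link $link does not exist"
        return 0
    fi
    if dladm delete-vnic "$link"; then
        echo "VNIC link $link destroyed"
    else
        echo "Failed to destroy VNIC link $link" >&2
        return 1
    fi
}
```

Running it twice in a row should succeed both times, with the second run printing the "does not exist" message.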

sled-agent/src/opte/opte.rs Outdated Show resolved Hide resolved

@davepacheco davepacheco left a comment

(All my comments aside) I think this is great! I like the new slog logging. I'm out of my element around the specific networking commands.

It would be really neat if we had tests for install_virtual_hardware.sh/destroy_virtual_hardware.sh, even absent omicron-package install. But I imagine that we cannot run any of this under buildomat because it assumes root access in a GZ?

The biggest thing I don't really follow is how we manage the interdependencies between these different components. It seems like:

  • we have resources created by install_virtual_hardware.sh, omicron-package install, and Sled Agent
  • these resources have dependencies between them, presumably only in one direction
  • these resources are cleaned up by omicron-package uninstall and destroy_virtual_hardware.sh

I don't have a handle on how we guide people toward making sure that if they've added a new resource or dependency, there's code in the right spot to clean it up, and it's idempotent, etc. Or how we enforce or test any of that. That question is really much bigger than this PR -- in fact, this PR tightens that up, which is a big improvement. It's just that I'm left with the lingering feeling that there remain many ways for this to go wrong.

}

function verify_omicron_uninstalled {
svcs "svc:/system/illumos/sled-agent:default" 2>&1 > /dev/null

This is probably the right choice, and also seems like it won't always work in some edge cases. I'm thinking about cases where install started and got far enough to create dependencies that would prevent this script from running, but not far enough to set up sled agent. We might also have a case where uninstall failed after successfully removing sled-agent, but before having removed all the things that this script expects to be gone. Still, this helps in the most common case, so it's worth doing. To address this more deeply I think would require first-classing more of these scripts. I think we expect them to function like a stack of modules with idempotent init/teardown that have assumptions about the modules underneath them, but that seems really hard to enforce between a complex Rust program like Sled Agent and these shell scripts.

set -x
}

function verify_omicron_uninstalled {

It looks like this function would succeed erroneously if, say, configd were down, or other failure modes that would cause this command to fail even if sled-agent was still installed. It seems like quite a lot of work to make this more robust though and I don't think it's worth doing in this shell script.
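
To make the concern concrete, here is one way the check could tell "sled-agent is definitely not there" apart from "svcs failed for some other reason". It is only a sketch: matching on the error text is an assumption based on the `svcs` output shown earlier in this PR, it is fragile, and the return code 2 is an arbitrary choice for illustration.

```bash
# Hypothetical stricter variant of verify_omicron_uninstalled.
# Returns 0 only if svcs positively reports that the instance is absent,
# 1 if the service is still present, 2 if we cannot tell.
verify_omicron_uninstalled() {
    local fmri="svc:/system/illumos/sled-agent:default"
    local err
    # Capture stderr only; stdout (the service listing) is discarded.
    if err=$(svcs "$fmri" 2>&1 >/dev/null); then
        echo "Omicron is still installed, please run 'omicron-package uninstall'" >&2
        return 1
    fi
    case "$err" in
        *"doesn't match any instances"*)
            return 0
            ;;
        *)
            echo "Could not determine whether Omicron is installed: $err" >&2
            return 2
            ;;
    esac
}
```

This still does nothing about the partial install or partial uninstall cases mentioned above; it only avoids treating an unrelated `svcs` failure as proof that Omicron is gone.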

tools/destroy_virtual_hardware.sh Outdated Show resolved Hide resolved
tools/destroy_virtual_hardware.sh Outdated Show resolved Hide resolved

bnaecker commented Jul 8, 2022

Thanks for the helpful comments @davepacheco. We talked a bit about this in chat, and I wanted to capture the main points here.

First, this PR represents an improvement, and I'm planning to merge it when CI is happy. Second, there's still a lot to be desired, as Dave mentioned. Most of that is around error-handling; the general opacity and fragility of the bash scripts; and the sprawl of the requirements and assumptions throughout the various scripts and binaries.

An overall better approach would be something like this. We'd like one location where we define a sequence of initialization and teardown steps, each of which is idempotent and captures its expected invariants. For example, the sled agent expects that the net0 and net1 VNICs exist, and have no addresses or interfaces on top of them. So one init step could be to ensure those VNICs are there, with an undo step of removing them. If this is sounding like a saga, that's the point. We could write these various actions and undo actions in one location and run them via omicron-package install/uninstall as a saga. This would greatly improve the error handling and reporting, and make more clear what each step is trying to do and why.
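
As a very rough illustration of the action / undo-action idea, here is a toy saga runner. It is written in shell only to match the scripts in this PR; the proposal is for this to live in Rust behind `omicron-package install/uninstall`. The `do_<step>` / `undo_<step>` naming convention and the `run_saga` function are hypothetical, and `PHYSICAL_LINK` is assumed to be set the way create_virtual_hardware.sh sets it.

```bash
# Run each step's action in order. If one fails, run the undo actions of the
# steps that already completed, in reverse order, and report failure.
run_saga() {
    local completed="" step done_step
    for step in "$@"; do
        if "do_$step"; then
            completed="$step $completed"   # prepend, so the list is already reversed
        else
            echo "step '$step' failed, unwinding" >&2
            for done_step in $completed; do
                "undo_$done_step" || echo "undo of '$done_step' failed" >&2
            done
            return 1
        fi
    done
}

# Example step: ensure the simulated Chelsio VNIC net0 exists (idempotent).
do_vnic_net0() {
    dladm show-vnic -p -o LINK net0 >/dev/null 2>&1 ||
        dladm create-vnic -t -l "$PHYSICAL_LINK" net0
}

# Its undo action: ensure net0 does not exist (also idempotent).
undo_vnic_net0() {
    if dladm show-vnic -p -o LINK net0 >/dev/null 2>&1; then
        dladm delete-vnic net0
    fi
}
```

An install would then be something like `run_saga vnic_net0 vnic_net1 ...`, with each step stating its own invariant in one place.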

The second major improvement would be to express the "environment" in one or more configuration files. We currently do some of this, for example in smf/sled-agent/config.toml. You can see the list of zpools we expect there, for example. Taking this further, we could have a configuration file which describes the "emulated Gimlet" that create_virtual_hardware.sh is currently trying to achieve. This would describe the file-backed vdevs; the VNICs net{0,1}; and the physical link(s) over which those are created.

A second configuration could be for something closer to an actual Gimlet, which might include actual physical disks rather than file-backed vdevs and the Chelsio physical links in lieu of the combination of net{0,1} and the physical data link.

This configuration would then be fed into the saga, provided as input to the actions. E.g., one step would create / remove the file-backed vdevs for the "emulated Gimlet" config, while the real Gimlet config would do nothing here (the disks are there already).
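
A toy version of "the configuration is input to the actions" might look like this. Everything here is hypothetical: the environment names, the variables (`VDEV_KIND`, `SIMULATED_VNICS`, `ZPOOLS`, `VDEV_DIR`), and the `do_vdevs` / `undo_vdevs` step names are made up for illustration, and a real version would read a configuration file rather than set shell variables. The `truncate -s 10GB` call mirrors what create_virtual_hardware.sh does today.

```bash
# Hypothetical environment descriptions.
load_environment() {
    case "$1" in
        emulated-gimlet)
            VDEV_KIND=file              # file-backed vdevs
            SIMULATED_VNICS="net0 net1"
            ;;
        gimlet)
            VDEV_KIND=disk              # real disks, nothing to create
            SIMULATED_VNICS=""          # real Chelsio links instead of net{0,1}
            ;;
        *)
            echo "unknown environment: $1" >&2
            return 1
            ;;
    esac
}

# Action: create file-backed vdevs, but only for the emulated environment.
do_vdevs() {
    [ "$VDEV_KIND" = file ] || return 0
    local pool
    for pool in $ZPOOLS; do
        [ -f "$VDEV_DIR/$pool.vdev" ] ||
            truncate -s 10GB "$VDEV_DIR/$pool.vdev" || return 1
    done
}

# Undo action: remove the files again; a no-op for real disks.
undo_vdevs() {
    [ "$VDEV_KIND" = file ] || return 0
    local pool
    for pool in $ZPOOLS; do
        rm -f "$VDEV_DIR/$pool.vdev"
    done
}
```

The same step list can then be run against either environment, and the steps that don't apply simply do nothing.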

All of this is a good bit of work. However, it does seem significantly better. Real Rust programs would take the place of bash. The environment configuration files would express all the expectations about the machine(s) on which Omicron is running. There are other considerations, too, such as the coming ramdisk image(s), how long we'll need to support commodity hardware such as most of us have at home, etc. But it does seem like a worthwhile improvement, since we've already seen so much difficulty with the existing system.

plotnick commented Jul 8, 2022

Just as a data point, these changes allowed my Helios box to run Omicron again, which it could not with the previous version. So: thanks!

leftwo pushed a commit that referenced this pull request Jun 26, 2024
Added a new package, crucible-dtrace that pulls from buildomat a package
that contains a set of DTrace scripts.  These scripts are extracted into
the global zone at /opt/oxide/crucible_dtrace/

Update Crucible to latest includes these updates:
Clean up dependency checking, fixing space leak (#1372)
Make a DTrace package (#1367)
Use a single context in all messages (#1363)
Remove `DownstairsWork`, because it's redundant (#1371)
Remove `WorkState`, because it's implicit (#1370)
Do work immediately upon receipt of a job, if possible (#1366)
Move 'do work for one job' into a helper function (#1365)
Remove `DownstairsWork` from map when handling it (#1361)
Using `block_in_place` for IO operations (#1357)
update omicron deps; use re-exported dropshot types in oximeter-producer configuration (#1369)
Parameterize more tests (#1364)
Misc cleanup, remove sqlite references. (#1360)
Fix `Extent::close` docstring (#1359)
Make many `Region` functions synchronous (#1356)
Remove `Workstate::Done` (unused) (#1355)
Return a sorted `VecDeque` directly (#1354)
Combine `proc_frame` and `do_work_for` (#1351)
Move `do_work_for` and `do_work` into `ActiveConnection` (#1350)
Support arbitrary Volumes during replace compare (#1349)
Remove the SQLite backend (#1352)
Add a custom timeout for buildomat tests (#1344)
Move `proc_frame` into `ActiveConnection` (#1348)
Remove `UpstairsConnection` from `DownstairsWork` (#1341)
Move Work into ConnectionState (#1340)
Make `ConnectionState` an enum type (#1339)
Parameterize `test_repair.sh` directories (#1345)
Remove `Arc<Mutex<Downstairs>>` (#1338)
Send message to Downstairs directly (#1336)
Consolidate `on_disconnected` and `remove_connection` (#1333)
Move disconnect logic to the Downstairs (#1332)
Remove invalid DTrace probes. (#1335)
Fix outdated comments (#1331)
Use message passing when a new connection starts (#1330)
Move cancellation into Downstairs, using a token to kill IO tasks (#1329)
Make the Downstairs own per-connection state (#1328)
Move remaining local state into a `struct ConnectionState` (#1327)
Consolidate negotiation + IO operations into one loop (#1322)
Allow replacement of a target in a read_only_parent (#1281)
Do all IO through IO tasks (#1321)
Make `reqwest_client` only present if it's used (#1326)
Move negotiation into Downstairs as well (#1320)
Update Rust crate clap to v4.5.4 (#1301)
Reuse a reqwest client when creating Nexus clients (#1317)
Reuse a reqwest client when creating repair client (#1324)
Add % to keep buildomat happy (#1323)
Downstairs task cleanup (#1313)
Update crutest replace test, and mismatch printing. (#1314)
Added more DTrace scripts. (#1309)
Update Rust crate async-trait to 0.1.80 (#1298)
leftwo added a commit that referenced this pull request Jun 26, 2024
Update Crucible and Propolis to the latest

Added a new package, crucible-dtrace that pulls from buildomat a package
that contains a set of DTrace scripts. These scripts are extracted into the 
global zone at /opt/oxide/crucible_dtrace/

Crucible latest includes these updates:
Clean up dependency checking, fixing space leak (#1372) Make a DTrace
package (#1367)
Use a single context in all messages (#1363)
Remove `DownstairsWork`, because it's redundant (#1371) Remove
`WorkState`, because it's implicit (#1370)
Do work immediately upon receipt of a job, if possible (#1366) Move 'do
work for one job' into a helper function (#1365) Remove `DownstairsWork`
from map when handling it (#1361) Using `block_in_place` for IO
operations (#1357)
update omicron deps; use re-exported dropshot types in oximeter-producer
configuration (#1369) Parameterize more tests (#1364)
Misc cleanup, remove sqlite references. (#1360)
Fix `Extent::close` docstring (#1359)
Make many `Region` functions synchronous (#1356)
Remove `Workstate::Done` (unused) (#1355)
Return a sorted `VecDeque` directly (#1354)
Combine `proc_frame` and `do_work_for` (#1351)
Move `do_work_for` and `do_work` into `ActiveConnection` (#1350) Support
arbitrary Volumes during replace compare (#1349) Remove the SQLite
backend (#1352)
Add a custom timeout for buildomat tests (#1344)
Move `proc_frame` into `ActiveConnection` (#1348)
Remove `UpstairsConnection` from `DownstairsWork` (#1341) Move Work into
ConnectionState (#1340)
Make `ConnectionState` an enum type (#1339)
Parameterize `test_repair.sh` directories (#1345)
Remove `Arc<Mutex<Downstairs>>` (#1338)
Send message to Downstairs directly (#1336)
Consolidate `on_disconnected` and `remove_connection` (#1333) Move
disconnect logic to the Downstairs (#1332)
Remove invalid DTrace probes. (#1335)
Fix outdated comments (#1331)
Use message passing when a new connection starts (#1330) Move
cancellation into Downstairs, using a token to kill IO tasks (#1329)
Make the Downstairs own per-connection state (#1328) Move remaining
local state into a `struct ConnectionState` (#1327) Consolidate
negotiation + IO operations into one loop (#1322) Allow replacement of a
target in a read_only_parent (#1281) Do all IO through IO tasks (#1321)
Make `reqwest_client` only present if it's used (#1326) Move negotiation
into Downstairs as well (#1320)
Update Rust crate clap to v4.5.4 (#1301)
Reuse a reqwest client when creating Nexus clients (#1317) Reuse a
reqwest client when creating repair client (#1324) Add % to keep
buildomat happy (#1323)
Downstairs task cleanup (#1313)
Update crutest replace test, and mismatch printing. (#1314) Added more
DTrace scripts. (#1309)
Update Rust crate async-trait to 0.1.80 (#1298)

Propolis just has this one update:
Allow boot order config in propolis-standalone
---------

Co-authored-by: Alan Hanson