Sled agent hung on disabling fmd service. #4252

Open

rcgoodfellow opened this issue Oct 11, 2023 · 8 comments
@rcgoodfellow
Contributor

Spinning up an omicron development environment on my workstation hung at the following call. Sled agent was running but not making any progress: it had brought up the switch zone, but no other zones.

smf::Adm::new()
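    // Equivalent to running `svcadm disable -s -t <pattern>`: temporarily
    // disable the matching service and wait synchronously for it to reach
    // the disabled state. This is the call sled-agent was blocked in.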
    .disable()
    .temporary()
    .synchronous()
    .run(smf::AdmSelection::ByPattern(&[service]))?;

We were able to see this on the host OS as follows.

ry@masaka:~/src/omicron$ pgrep -lf sled-agent
15742 /opt/oxide/sled-agent/sled-agent run /opt/oxide/sled-agent/pkg/config.toml
15741 ctrun -l child -o noorphan,regent /opt/oxide/sled-agent/sled-agent run /opt/oxi
ry@masaka:~/src/omicron$ pfexec ptree 15742
15741  ctrun -l child -o noorphan,regent /opt/oxide/sled-agent/sled-agent run /opt/oxi
  15742  /opt/oxide/sled-agent/sled-agent run /opt/oxide/sled-agent/pkg/config.toml
    15903  /usr/sbin/svcadm disable -s -t svc:/system/fmd:default

Looking at the fmd service, we found it in maintenance mode.

ry@masaka:~/src/omicron$ svcs | grep fmd
maintenance    17:05:33 svc:/system/fmd:default

Clearing the service with svcadm clear fmd resulted in sled agent springing to life. By that point it had been stuck for over an hour while I worked on other things. After getting unstuck, the control plane came up completely and appears to be functioning normally.

I looked at the logs for fmd with cat $(svcs -L fmd), which contained this:

[ Sep 18 22:28:05 Method "start" exited with status 0. ]
[ Oct 10 15:59:10 Stopping because service disabled. ]
[ Oct 10 15:59:10 Executing stop method (:kill). ]
[ Oct 10 15:59:11 Enabled. ]
[ Oct 10 15:59:11 Executing start method ("/usr/lib/fm/fmd/fmd"). ]
[ Oct 10 16:00:33 Method "start" exited with status 0. ]
[ Oct 10 17:04:33 Stopping because service disabled. ]
[ Oct 10 17:04:33 Executing stop method (:kill). ]
[ Oct 10 17:05:33 Method or service exit timed out.  Killing contract 61062. ]
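
A minimal sketch of how the blocking disable call above could be bounded, so a wedged stop method fails fast instead of hanging sled-agent indefinitely. This is hypothetical, not the actual omicron code: it assumes a tokio runtime, anyhow for errors, the same smf crate API shown above, and an arbitrary 60-second deadline; the helper name is made up for illustration.

use std::time::Duration;

// Hypothetical helper: run the blocking `svcadm disable -s -t <pattern>`
// equivalent on a blocking thread and give up after a fixed deadline so a
// hung stop method (like fmd's here) surfaces as an error instead of
// wedging the caller. Note that timing out does not kill the svcadm child.
async fn disable_service_with_timeout(service: &str) -> anyhow::Result<()> {
    let pattern = service.to_string();
    let handle = tokio::task::spawn_blocking(move || {
        smf::Adm::new()
            .disable()
            .temporary()
            .synchronous()
            .run(smf::AdmSelection::ByPattern(&[pattern]))
    });

    match tokio::time::timeout(Duration::from_secs(60), handle).await {
        // The blocking task finished; propagate any svcadm failure.
        Ok(joined) => joined?
            .map_err(|e| anyhow::anyhow!("svcadm disable {service} failed: {e}")),
        // The deadline expired before svcadm returned.
        Err(_) => Err(anyhow::anyhow!("timed out disabling {service}")),
    }
}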
@rcgoodfellow
Contributor Author

Prior to this, one of my destroy_virtual_hardware.sh runs had failed in a way that is probably related.

Verified ZFS pool and vdev oxp_f4b4dc87-ab46-49fb-a4b4-d361ae214c03 does not exist
+ for ZPOOL_TYPE in "${ZPOOL_TYPES[@]}"
+ readarray -t ZPOOLS
++ zfs list -d 0 -o name
++ grep '^oxi_'
+ for ZPOOL in "${ZPOOLS[@]}"
+ VDEV_FILE=/home/ry/src/omicron/oxi_a462a7f7-b628-40fe-80ff-4e4189e2d62b.vdev
+ zfs destroy -r oxi_a462a7f7-b628-40fe-80ff-4e4189e2d62b
cannot unmount '/var/fm/fmd': Device busy
+ warn 'Failed to remove ZFS pool and vdev: oxi_a462a7f7-b628-40fe-80ff-4e4189e2d62b'
+ set +x
Failed to remove ZFS pool and vdev: oxi_a462a7f7-b628-40fe-80ff-4e4189e2d62b

I "fixed" this with

pfexec zfs destroy -fr oxi_a462a7f7-b628-40fe-80ff-4e4189e2d62b

and then reran the destroy virtual hardware script, which ran to completion without error.

@andrewjstone
Contributor

cc @citrus-it who introduced this code in #4201

Another thing to note is that @rcgoodfellow had seen this issue several times today, except that each time it resolved itself after about 15 minutes, which wasn't the case here. Ry pointed out that the log messages being spewed from sled-agent were actually timestamped 20 minutes in the future, based on running date -u. Once Ry resolved the issue by kicking fmd, sled-agent progressed and the log timestamps were back to normal. I mention this only for completeness and because it was quite odd.

One crazy hypothesis I had was that the call to /usr/sbin/svcadm disable -s -t svc:/system/fmd:default was hung because the fmd disable was waiting for some sort of time-related catch-up that occurred in the prior runs but not in the latest run that we debugged. This is just a WAG, and I have no reason to believe it to be the case, other than that the timestamps were totally out of whack in sled-agent before we kicked fmd. However, the timestamp mix-up could have been a result of the hang, or completely unrelated.

@rcgoodfellow
Contributor Author

Noting that the

cannot unmount '/var/fm/fmd': Device busy

error from the post above now seems to be a recurring issue when destroying virtual hardware.

Running svcadm disable fmd before running the destroy virtual hardware script allows it to run successfully.

@rcgoodfellow
Contributor Author

Another thing to note is that @rcgoodfellow had seen this issue several times today, except that each time it resolved itself after about 15 minutes,

I think this delay might actually be zpool import.

I can see these as sled agent is getting going

ry@masaka:~$ pfexec ptree 6263
6262   ctrun -l child -o noorphan,regent /opt/oxide/sled-agent/sled-agent run /opt/
  6263   /opt/oxide/sled-agent/sled-agent run /opt/oxide/sled-agent/pkg/config.toml
    8562   /usr/sbin/zpool import -f oxp_e4b4dc87-ab46-49fb-a4b4-d361ae214c03

and it goes through them one at a time, each taking a few minutes, it seems.

@jclulow
Collaborator

jclulow commented Oct 11, 2023

FWIW, I don't think we should be dorking with fmd on a workstation. It makes sense on a Gimlet where the system boots from a ramdisk, but on a workstation we're almost certainly going to ruin things (like you see here) and also we're going to discard some of your actual FMA data each time you destroy and recreate the (virtual) pools with those files in them.

@andrewjstone
Contributor

I'm now also running into this, but @rcgoodfellow's workaround isn't doing the trick. FMD is not in maintenance mode and I'm getting:

cannot destroy 'oxi_a462a7f7-b628-40fe-80ff-4e4189e2d62b/swap': dataset is busy

@andrewjstone
Contributor

andrewjstone commented Oct 26, 2023

I'm now also running into this, but @rcgoodfellow's workaround isn't doing the trick. FMD is not in maintenance mode and I'm getting:

cannot destroy 'oxi_a462a7f7-b628-40fe-80ff-4e4189e2d62b/swap': dataset is busy

I ended up resolving this by deleting the swap device:

pfexec swap -d /dev/zvol/dsk/oxi_a462a7f7-b628-40fe-80ff-4e4189e2d62b/swap

@karencfv
Contributor

This may be related to #3651.
