Live migration failures after upgrading from 4.5.2 to 4.5.4 #372
Can you confirm you have the multipath blacklist rule? `$ multipathd show config | grep rbd`
Yep, I already checked that after seeing it mentioned in the archive post. It does appear to be blacklisted properly on all of my hosts.
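For reference, the blacklist rule that check matches looks roughly like the sketch below; the drop-in path is an assumption and the rule may instead live directly in /etc/multipath.conf.

```
# Hedged example: a typical rbd blacklist rule for multipath, via a conf.d drop-in.
# The file path is an assumption; on other setups it may sit in /etc/multipath.conf.
cat /etc/multipath/conf.d/blacklist-rbd.conf
# blacklist {
#     devnode "^rbd[0-9]*"
# }
```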
I just added Ceph storage to our 4.5.4-1 install on Rocky 8.7, and see the same issues. I first noticed ours as failures to delete test images, and now I see it when moving things around between nodes. I've resorted to manually doing `rbd unmap` and finding/removing entries in the ovirt_cinderlib volume_attachment table for some of them. Our multipath blacklist is in place too, I looked for that.
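Roughly what I mean by the manual cleanup; the device name, volume UUID, and the exact psql invocation are placeholders, and you should back up the engine database before touching it:

```
# On the affected host: find and unmap the stale rbd device.
rbd showmapped                      # lists pool/image -> /dev/rbdN mappings
rbd unmap /dev/rbd0                 # /dev/rbd0 is a placeholder for the stale device

# On the engine: remove the stale attachment row from the cinderlib DB.
# DB and table names are from the setup above; the UUID is a placeholder and
# the exact psql invocation may differ on your install.
sudo -u postgres psql ovirt_cinderlib \
  -c "DELETE FROM volume_attachment WHERE volume_id = '<stale-volume-uuid>';"
```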
To follow up, this was resolved by both upgrading to Rocky 8.8 and oVirt nightly; I missed that oVirt releases aren't happening any more. Our issues were compounded by this kernel bug on 8.7: https://bugzilla.redhat.com/show_bug.cgi?id=2079311
Everything was running on oVirt 4.5.2 on Rocky 8.6 with no issues for months. On Friday I updated our standalone engine to Rocky 8.7 and oVirt 4.5.4, decommissioned one of the hosts, then built a replacement on new hardware (same CPU) on 4.5.4. I was able to migrate VMs to and from the new host successfully and called it good, then upgraded the rest of my hosts in sequence to Rocky 8.7 and oVirt 4.5.4. The only issue I ran into was that I had to manually run
```
dnf update --allow-erasing
```
due to some ovs package changes. No issues were apparent in operation at this time.

On Monday morning, one VM appeared to have hung and was manually restarted. A couple of hours later a few VMs stopped running with the error in the web UI: "VM is down. Exit message: User shut down from within the guest". This may have been triggered by an oVirt migration attempt after the VMs were marked unresponsive. Within 10 minutes of this, the host they were on was marked as unresponsive. oVirt tried to migrate them and failed with the error below. My suspicion was that the VM host was holding onto the storage lease despite the VMs being dead. I saw errors like this in the log on the host marked Nonoperational:
And on the target migration host:
After this I believe the VMs failed and were restarted on another host manually, but there were several still running on this host. I was able to successfully migrate them off to another host later. I rebooted that host and was able to migrate other VMs to and from it without issues, so I initially thought that was the end of the problem, and left it running with a small number of VMs to see if it recurs.
After this I was manually rebalancing some of the VMs across other hosts and found that some of them fail to migrate between certain hosts with the same error, but can migrate to others. These are VMs that were never on the host that had the initial problem.
Connectivity looks good. I see no disconnections on any network interface, and only a handful of packet drops on the host side on my storage network (not increasing, and likely related to the storage network being set up after the host booted).
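The kind of counters I'm going by here; the interface name is a placeholder and the exact commands are only illustrative:

```
# Per-interface counters: look at the RX/TX "dropped" and "errors" columns.
ip -s link show dev enp65s0f0        # interface name is a placeholder

# NIC/driver-level counters, where the driver exposes them:
ethtool -S enp65s0f0 | grep -iE 'drop|err'
```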
I googled for the error and found these previous occurrences:
https://bugzilla.redhat.com/show_bug.cgi?id=1755801
https://lists.ovirt.org/archives/list/[email protected]/thread/PVGTQPXCTEQI4LUUSXDRLSIH3GXXQC2N/?sort=date
This sounds like either a bug that was presumed fixed, or an issue caused by an external mount of the same volume. The latter is definitely not the case here: we use a dedicated Ceph RBD volume for oVirt, along with a dedicated CephFS volume for the primary storage domain (including some of our VMs which we haven't yet migrated to RBD, since that is currently a long manual process involving VM downtime).
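In case it helps with the stale-attachment theory above, a quick way to check whether some host still has one of the RBD images mapped or is still watching it; pool and image names are placeholders:

```
# Ask the cluster which clients still hold a watch on the image; any host
# still hanging onto it shows up as a watcher (client IP:port).
rbd status <pool>/<image>

# On each hypervisor, check for a lingering local mapping:
rbd showmapped | grep <image>
```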