The virtual machine cannot use the GPU #27

Open
newhuangchuan opened this issue Nov 2, 2021 · 16 comments

Comments

@newhuangchuan

Hello, I am trying to pass my GPU through directly to a virtual machine in my kubevirt using the method from this project, but after I create the virtual machine, describe vmi reports the following:

Events:
  Type     Reason            Age                 From                       Message
  ----     ------            ----                ----                       -------
  Normal   SuccessfulCreate  6m6s                virtualmachine-controller  Created virtual machine pod virt-launcher-vmi-gpu-n4lh7
  Warning  SyncFailed        117s                virt-handler, k8s-master   server error. command SyncVMI failed: "LibvirtError(Code=1, Domain=10, Message='internal error: qemu unexpectedly closed the monitor: 2021-11-02T02:15:38.830552Z qemu-kvm: -device vfio-pci,host=0000:01:00.0,id=hostdev0,bus=pci.6,addr=0x0: vfio 0000:01:00.0: group 1 is not viable\nPlease ensure all devices within the iommu_group are bound to their vfio bus driver.')"
  Warning  SyncFailed        117s                virt-handler, k8s-master   server error. command SyncVMI failed: "LibvirtError(Code=1, Domain=10, Message='internal error: qemu unexpectedly closed the monitor: 2021-11-02T02:15:39.372653Z qemu-kvm: -device vfio-pci,host=0000:01:00.0,id=hostdev0,bus=pci.6,addr=0x0: vfio 0000:01:00.0: group 1 is not viable\nPlease ensure all devices within the iommu_group are bound to their vfio bus driver.')"
  Warning  SyncFailed        117s                virt-handler, k8s-master   server error. command SyncVMI failed: "LibvirtError(Code=1, Domain=10, Message='internal error: qemu unexpectedly closed the monitor: 2021-11-02T02:15:39.745594Z qemu-kvm: -device vfio-pci,host=0000:01:00.0,id=hostdev0,bus=pci.6,addr=0x0: vfio 0000:01:00.0: group 1 is not viable\nPlease ensure all devices within the iommu_group are bound to their vfio bus driver.')"
  Warning  SyncFailed        116s                virt-handler, k8s-master   server error. command SyncVMI failed: "LibvirtError(Code=1, Domain=10, Message='internal error: qemu unexpectedly closed the monitor: 2021-11-02T02:15:40.131267Z qemu-kvm: -device vfio-pci,host=0000:01:00.0,id=hostdev0,bus=pci.6,addr=0x0: vfio 0000:01:00.0: group 1 is not viable\nPlease ensure all devices within the iommu_group are bound to their vfio bus driver.')"
  Warning  SyncFailed        116s                virt-handler, k8s-master   server error. command SyncVMI failed: "LibvirtError(Code=1, Domain=10, Message='internal error: qemu unexpectedly closed the monitor: 2021-11-02T02:15:40.514658Z qemu-kvm: -device vfio-pci,host=0000:01:00.0,id=hostdev0,bus=pci.6,addr=0x0: vfio 0000:01:00.0: group 1 is not viable\nPlease ensure all devices within the iommu_group are bound to their vfio bus driver.')"
  Warning  SyncFailed        115s                virt-handler, k8s-master   server error. command SyncVMI failed: "LibvirtError(Code=1, Domain=10, Message='internal error: qemu unexpectedly closed the monitor: 2021-11-02T02:15:40.843721Z qemu-kvm: -device vfio-pci,host=0000:01:00.0,id=hostdev0,bus=pci.6,addr=0x0: vfio 0000:01:00.0: group 1 is not viable\nPlease ensure all devices within the iommu_group are bound to their vfio bus driver.')"
  Warning  SyncFailed        115s                virt-handler, k8s-master   server error. command SyncVMI failed: "LibvirtError(Code=1, Domain=10, Message='internal error: qemu unexpectedly closed the monitor: 2021-11-02T02:15:41.185059Z qemu-kvm: -device vfio-pci,host=0000:01:00.0,id=hostdev0,bus=pci.6,addr=0x0: vfio 0000:01:00.0: group 1 is not viable\nPlease ensure all devices within the iommu_group are bound to their vfio bus driver.')"
  Warning  SyncFailed        115s                virt-handler, k8s-master   server error. command SyncVMI failed: "LibvirtError(Code=1, Domain=10, Message='internal error: qemu unexpectedly closed the monitor: 2021-11-02T02:15:41.521218Z qemu-kvm: -device vfio-pci,host=0000:01:00.0,id=hostdev0,bus=pci.6,addr=0x0: vfio 0000:01:00.0: group 1 is not viable\nPlease ensure all devices within the iommu_group are bound to their vfio bus driver.')"
  Warning  SyncFailed        114s                virt-handler, k8s-master   server error. command SyncVMI failed: "LibvirtError(Code=1, Domain=10, Message='internal error: qemu unexpectedly closed the monitor: 2021-11-02T02:15:42.421670Z qemu-kvm: -device vfio-pci,host=0000:01:00.0,id=hostdev0,bus=pci.6,addr=0x0: vfio 0000:01:00.0: group 1 is not viable\nPlease ensure all devices within the iommu_group are bound to their vfio bus driver.')"
  Warning  SyncFailed        34s (x6 over 114s)  virt-handler, k8s-master   (combined from similar events): server error. command SyncVMI failed: "LibvirtError(Code=1, Domain=10, Message='internal error: qemu unexpectedly closed the monitor: 2021-11-02T02:17:02.182163Z qemu-kvm: -device vfio-pci,host=0000:01:00.0,id=hostdev0,bus=pci.6,addr=0x0: vfio 0000:01:00.0: group 1 is not viable\nPlease ensure all devices within the iommu_group are bound to their vfio bus driver.')"

The GPU resource pool is as follows:

$ kubectl get node k8s-master -o json | jq '.status.allocatable'
{
  "cpu": "8",
  "devices.kubevirt.io/kvm": "110",
  "devices.kubevirt.io/tun": "110",
  "devices.kubevirt.io/vhost-net": "110",
  "ephemeral-storage": "891766110780",
  "hugepages-1Gi": "0",
  "hugepages-2Mi": "0",
  "memory": "16338732Ki",
  "nvidia.com/GM107M_GeForce_GTX_860M": "1",
  "pods": "110"
}

My understanding is that I can either pass the GPU through directly or create a vGPU. I am currently passing the GPU through directly. Do I need any other configuration?

@rthallisey
Collaborator

If you're using the latest kubevirt (> 0.45), have a look at #19

@newhuangchuan
Author

Do you mean that I need to upgrade kubevirt to 0.45 or higher for this to work?

At present I am using direct GPU pass-through, not the vGPU method.

@rthallisey
Collaborator

No, I mean that if you are using a new kubevirt version, it might be issue #19.

The error says: Please ensure all devices within the iommu_group are bound to their vfio bus driver.

Can you share the output of this command: lspci -nnk -d 10de:

@newhuangchuan
Author

For now, describe vmi still reports this error: Please ensure all devices within the iommu_group are bound to their vfio bus driver.

$ lspci -nnk -d 10de: 
01:00.0 VGA compatible controller [0300]: NVIDIA Corporation GM107M [GeForce GTX 860M] [10de:1392] (rev a2)
        Subsystem: ASUSTeK Computer Inc. GM107M [GeForce GTX 860M] [1043:861e]
        Kernel driver in use: vfio-pci
        Kernel modules: nouveau, nvidia_drm, nvidia
01:00.1 Audio device [0403]: NVIDIA Corporation GM107 High Definition Audio Controller [GeForce 940MX] [10d
        Subsystem: ASUSTeK Computer Inc. Device [1043:861e]
        Kernel driver in use: snd_hda_intel
        Kernel modules: snd_hda_intel

@rthallisey
Collaborator

Try binding 01:00.1 to vfio-pci: echo 01:00.1 > /sys/bus/pci/drivers/vfio-pci/bind

@newhuangchuan
Author

It tells me that the device cannot be found, even though lspci shows that the device exists.

One more question about GPU pass-through: do I also need to attach the sound card?
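
One possible reason for the "cannot find this device" response is that sysfs identifies PCI functions by their full address including the domain (0000:01:00.1 rather than 01:00.1), and that a function still claimed by another driver has to be released before vfio-pci can take it. A minimal sketch of such a rebind, under those assumptions, is shown here. It assumes the vfio-pci module is already loaded and that it runs as root; the function name and the optional second argument (an alternative sysfs root for dry runs) are illustrative and not part of any tool mentioned in this thread.

```shell
#!/bin/sh
# Rebind one PCI function to vfio-pci via the sysfs driver_override file.
# Usage (as root): rebind_to_vfio 0000:01:00.1
# The second argument is a hypothetical sysfs-root override for dry runs.
rebind_to_vfio() {
    addr="$1"
    root="${2:-/sys}"
    dev="$root/bus/pci/devices/$addr"

    # sysfs names devices with the PCI domain prefix, e.g. 0000:01:00.1.
    if [ ! -d "$dev" ]; then
        echo "no such device: $addr (is the 0000: domain prefix missing?)" >&2
        return 1
    fi

    # Release the function from its current driver, if it has one.
    if [ -e "$dev/driver" ]; then
        echo "$addr" > "$dev/driver/unbind"
    fi

    # Ask the kernel to use vfio-pci for this function only, then re-probe it.
    echo vfio-pci > "$dev/driver_override"
    echo "$addr" > "$root/bus/pci/drivers_probe"
}
```

Calling rebind_to_vfio 0000:01:00.1 on the host would then be expected to move the audio function off snd_hda_intel; lspci -nnk -d 10de: can be used afterwards to confirm which driver is in use.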

@rthallisey
Collaborator

From the device plugin's perspective, everything looks OK. You exposed the device to Kubernetes ("nvidia.com/GM107M_GeForce_GTX_860M": "1"), and the device is present in the virt-launcher pod, but libvirt fails to attach it. Look into the IOMMU group on the host to see which device isn't using vfio-pci.
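
As a rough illustration of that check, the sketch below prints every device that shares an IOMMU group with a given PCI function, along with the driver currently bound to it. It relies on the iommu_group link that the kernel exposes under each PCI device in sysfs; the function name and the optional sysfs-root argument are hypothetical helpers for this example only.

```shell
#!/bin/sh
# Print each device in the same IOMMU group as the given PCI function,
# followed by the kernel driver currently bound to it.
# Usage: iommu_group_drivers 0000:01:00.0
# The second argument is a hypothetical sysfs-root override for dry runs.
iommu_group_drivers() {
    addr="$1"
    root="${2:-/sys}"
    for peer in "$root/bus/pci/devices/$addr/iommu_group/devices"/*; do
        # Skip the unexpanded glob when the group directory is missing.
        [ -e "$peer" ] || continue
        if [ -e "$peer/driver" ]; then
            # The driver entry is a symlink; its target's basename is the driver name.
            driver="$(basename "$(readlink -f "$peer/driver")")"
        else
            driver="(none)"
        fi
        printf '%s %s\n' "${peer##*/}" "$driver"
    done
}
```

Any endpoint in the output that still shows a host driver (for example snd_hda_intel) is a candidate for the "group is not viable" error. PCI bridges in the same group generally do not need to be rebound.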

@newhuangchuan
Author

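From the device plugin's perspective, everything looks ok. You exposed the device to Kubernetes "nvidia.com/GM107M_GeForce_GTX_860M": "1", and libvirt fails to attach the device, but it's in the virt-launcher pod. Look into the iommu group on the host to see what device isn't using vfio-pci.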

Hello, how can I view the IOMMU groups on the host?

@newhuangchuan
Author

@rthallisey Do I need to switch my host machine to runlevel 3 and shut down its graphical session? I wonder whether the cause is that the physical machine is already using the graphics card, so it cannot be passed through.

@rthallisey
Collaborator

Try this gist.

@newhuangchuan
Author

Hello, the following is the output of the script. It seems that the NVIDIA device is already in an IOMMU group.

$ ./iommu.sh 
IOMMU Group 0 00:00.0 Host bridge [0600]: Intel Corporation 4th Gen Core Processor DRAM Controller [8086:0c00] (rev 06)
IOMMU Group 1 00:01.0 PCI bridge [0604]: Intel Corporation Xeon E3-1200 v3/4th Gen Core Processor PCI Express x16 Controller [8086:0c01] (rev 06)
IOMMU Group 1 01:00.0 VGA compatible controller [0300]: NVIDIA Corporation GM107M [GeForce GTX 860M] [10de:1392] (rev a2)
IOMMU Group 1 01:00.1 Audio device [0403]: NVIDIA Corporation GM107 High Definition Audio Controller [GeForce 940MX] [10de:0fbc] (rev a1)
IOMMU Group 2 00:14.0 USB controller [0c03]: Intel Corporation 9 Series Chipset Family USB xHCI Controller [8086:8cb1]
IOMMU Group 3 00:16.0 Communication controller [0780]: Intel Corporation 9 Series Chipset Family ME Interface #1 [8086:8cba]
IOMMU Group 4 00:1a.0 USB controller [0c03]: Intel Corporation 9 Series Chipset Family USB EHCI Controller #2 [8086:8cad]
IOMMU Group 5 00:1b.0 Audio device [0403]: Intel Corporation 9 Series Chipset Family HD Audio Controller [8086:8ca0]
IOMMU Group 6 00:1c.0 PCI bridge [0604]: Intel Corporation 9 Series Chipset Family PCI Express Root Port 1 [8086:8c90] (rev d0)
IOMMU Group 7 00:1c.5 PCI bridge [0604]: Intel Corporation 9 Series Chipset Family PCI Express Root Port 6 [8086:8c9a] (rev d0)
IOMMU Group 8 00:1c.6 PCI bridge [0604]: Intel Corporation 82801 PCI Bridge [8086:244e] (rev d0)
IOMMU Group 8 04:00.0 PCI bridge [0604]: ASMedia Technology Inc. ASM1083/1085 PCIe to PCI Bridge [1b21:1080] (rev 03)
IOMMU Group 9 00:1d.0 USB controller [0c03]: Intel Corporation 9 Series Chipset Family USB EHCI Controller #1 [8086:8ca6]
IOMMU Group 10 00:1f.0 ISA bridge [0601]: Intel Corporation Z97 Chipset LPC Controller [8086:8cc4]
IOMMU Group 10 00:1f.2 SATA controller [0106]: Intel Corporation 9 Series Chipset Family SATA Controller [AHCI Mode] [8086:8c82]
IOMMU Group 10 00:1f.3 SMBus [0c05]: Intel Corporation 9 Series Chipset Family SMBus Controller [8086:8ca2]
IOMMU Group 11 03:00.0 Ethernet controller [0200]: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller [10ec:8168] (rev 0c)

@newhuangchuan
Author

Try this gist.

Hello, with the settings here, the virtual machine now starts successfully, but there are currently two problems:

First: I downloaded the 750ti graphics driver for Windows 10 from the official NVIDIA website. After installation, the device is shown as working normally, but the GPU information is not displayed in Task Manager.

Second: After the virtual machine is restarted, the graphics card driver disappears.

@newhuangchuan
Author

This is what the virtual machine looks like after restarting:
image

@dimm0

dimm0 commented Mar 10, 2022

Hello, with the settings here, the virtual machine now starts successfully, but there are currently two problems:

@newhuangchuan what did you do to start the VM? I'm hitting the same issue...

@newhuangchuan
Author

Hello, with the settings here, the virtual machine now starts successfully, but there are currently two problems:

@newhuangchuan what did you do to start the VM? I'm hitting the same issue...

Hello, what I did was unbind the graphics card from the host and then change the graphics card driver to vfio-pci:

$ modprobe pci_stub

$ echo "10de 1b80" > /sys/bus/pci/drivers/pci-stub/new_id

$ echo 0000:01:00.0 > /sys/bus/pci/devices/0000:01:00.0/driver/unbind

$ echo 0000:01:00.0 > /sys/bus/pci/drivers/pci-stub/bind

$ echo "10de 10f0" > /sys/bus/pci/drivers/pci-stub/new_id

$ echo 0000:01:00.1 > /sys/bus/pci/devices/0000:01:00.1/driver/unbind

$ echo 0000:01:00.1 > /sys/bus/pci/drivers/pci-stub/bind

Sorry for the late reply. I hope this is useful to you.
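
The vendor and device IDs written to new_id in the commands above (10de 1b80 and 10de 10f0) differ from the IDs that lspci reported earlier in this thread (10de:1392 and 10de:0fbc), so they are presumably specific to a different card. A small sketch for reading the right pair from sysfs for any given function is shown here; the function name and the optional sysfs-root argument are illustrative assumptions.

```shell
#!/bin/sh
# Print "<vendor> <device>" for a PCI function, in the form written to new_id.
# Usage: pci_id_pair 0000:01:00.0
# The second argument is a hypothetical sysfs-root override for dry runs.
pci_id_pair() {
    addr="$1"
    root="${2:-/sys}"
    dev="$root/bus/pci/devices/$addr"
    vendor="$(cat "$dev/vendor")" || return 1
    device="$(cat "$dev/device")" || return 1
    # sysfs reports the IDs as e.g. 0x10de and 0x1392; drop the 0x prefix
    # to match the form used in the commands above.
    printf '%s %s\n' "${vendor#0x}" "${device#0x}"
}
```

The printed pair can be substituted into the echo "..." > .../new_id lines for each function being handed over.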

@rthallisey
Collaborator

Hello, with the settings here, the virtual machine now starts successfully, but there are currently two problems:

First: I downloaded the 750ti graphics driver for Windows 10 from the official NVIDIA website. After installation, the device is shown as working normally, but the GPU information is not displayed in Task Manager.

Second: After the virtual machine is restarted, the graphics card driver disappears.

I finally saw a repro for this issue after switching to a newer kubevirt version. PR kubevirt/kubevirt#6664 changed the default behavior in kubevirt, which can lead to this issue.
Try disabling the ramFB in your VMI spec and see if the issue goes away, @newhuangchuan:

spec:
  domain:
    devices:
      gpus:
      - deviceName: nvidia.com/GRID_T4-2Q
        name: myvgpu
        virtualGPUOptions:
          display:
            enabled: false
