
OpenShift Virtualization testing—VMs with GPUs #725

Closed
Tracked by #764
computate opened this issue Sep 11, 2024 · 30 comments

computate commented Sep 11, 2024

Edit by naved001: Blocked as of 10/22/2024 on getting access to a GPU.

@naved001

I imagine we would want to test passing both a single GPU and multiple GPUs.

jtriley commented Sep 11, 2024

Some docs to read up on:

https://docs.openshift.com/container-platform/4.14/virt/virtual_machines/advanced_vm_management/virt-configuring-virtual-gpus.html

We should be able to test this soon given that we now have a V100 host in the ocp-test cluster.

jtriley commented Sep 11, 2024

Looks like there's device mediation (i.e., vGPUs) and PCI passthrough support, depending on which cards are supported. For mediation there are two approaches: one that uses the NVIDIA GPU Operator to do the mediation, and one that relies on the Red Hat OpenShift Virtualization operator to do the setup. Need to read up on that and on PCI passthrough.
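
For reference, once a host is prepared, the passthrough side appears to boil down to allow-listing the device in the HyperConverged CR; a rough, untested sketch (using the V100's vendor:device ID as an example):

# Rough sketch (untested): allow-list the V100 for PCI passthrough in the HyperConverged CR.
# pciDeviceSelector is the PCI vendor:device ID; resourceName is what VMs will request.
oc patch hyperconverged kubevirt-hyperconverged -n openshift-cnv --type=merge -p \
  '{"spec":{"permittedHostDevices":{"pciHostDevices":[{"pciDeviceSelector":"10DE:1DB6","resourceName":"nvidia.com/GV100GL_Tesla_V100"}]}}}'

# The device should then show up as an allocatable resource on the node:
oc describe node <gpu-node> | grep nvidia.com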

@computate

Note from the NERC HU/BU Weekly Team Meeting: Dan McPherson would like to test GPUs on OpenShift Virtualization.

dystewart self-assigned this on Oct 2, 2024

naved001 commented Oct 3, 2024

@jtriley Is there a reason why the only GPU node in the test cluster has scheduling disabled?

➜  ~    oc get nodes -l 'nvidia.com/gpu.product=Tesla-V100-PCIE-32GB'
NAME    STATUS                     ROLES    AGE   VERSION
wrk-3   Ready,SchedulingDisabled   worker   22d   v1.28.12+396c881

I want to start testing PCI pass-through for GPUs.

jtriley commented Oct 3, 2024

> @jtriley Is there a reason why the only GPU node in the test cluster has scheduling disabled?

Not that I'm aware of; maybe @dystewart has it temporarily disabled? I think he's working on GPU scheduling (#495) on that cluster, IIRC.

naved001 commented Oct 3, 2024

@dystewart let me know once you're done with your testing; I can then proceed with this issue when the GPU is available.

naved001 commented Oct 3, 2024

> For mediation there are two approaches - one that uses the NVIDIA GPU operator to do the mediation and one that relies on RH OpenShift Virtualization operator to do the setup.

Apparently both of those methods require the NVIDIA vGPU software, which requires a license. Do we have such a license for these GPUs?

@computate

@hpdempsey are we able to get an NVIDIA vGPU Software license to test VMs with GPUs? See above ^.

naved001 commented Oct 4, 2024

@computate Just to be clear, that software is only required if we want to test VMs with vGPUs, which are partitioned NVIDIA GPUs. For PCI passthrough of a whole GPU we do not need that license (I plan to do that once the GPU becomes available).

@dystewart

@naved001 Yeah, sorry, I'm still playing around with a couple of things on the GPU, so I have it cordoned right now; very close to finishing up, though!

naved001 commented Oct 7, 2024

@dystewart no rush, thank you for the heads up!

okrieg commented Oct 8, 2024

This is all about functionality, but one of the things we will need to do is evaluate the performance of virtualized versus physical GPUs; Apoorve is working on this at IBM.

@joachimweyl

@naved001 please provide an update on how things are going and the next steps.

@naved001

@joachimweyl I am blocked on getting access to a GPU to test this. I have a draft PR that will enable GPU pass-through for the V100.

naved001 added the blocked label on Oct 22, 2024
@joachimweyl

@naved001 are you still blocked on this, or did the NVIDIA fix and access to the V100 resolve the blockage?

naved001 commented Nov 6, 2024

I merged the PR that should enable testing this, but it appears that the machineconfig update hasn't rolled out and is stuck in the updating state, so I need to take a look at that.

naved001 commented Nov 6, 2024

@computate The machineconfig didn't apply because the nodes can't be drained. I see this in the logs:

[machine-config-controller-6c9484cd9-bgcmb machine-config-controller] I1106 19:40:05.732003       1 drain_controller.go:152] evicting pod knative-serving/activator-7cbf6b7785-sb4zv
[machine-config-controller-6c9484cd9-bgcmb machine-config-controller] E1106 19:40:05.751303       1 drain_controller.go:152] error when evicting pods/"activator-7cbf6b7785-sb4zv" -n "knative-serving" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
[machine-config-controller-6c9484cd9-bgcmb machine-config-controller] I1106 19:40:05.762227       1 drain_controller.go:152] evicting pod knative-serving/webhook-6cd8bdbdc7-n5g8w
[machine-config-controller-6c9484cd9-bgcmb machine-config-controller] E1106 19:40:05.779561       1 drain_controller.go:152] error when evicting pods/"webhook-6cd8bdbdc7-n5g8w" -n "knative-serving" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.

Do you know where those pods in the knative-serving namespace come from and whether we can do something about them?

@computate
Copy link
Member Author

@naved001 I wouldn't worry about evicting knative-serving pods in the prod cluster. They just come with the operator. You can delete them if you want.
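
If it helps, a rough way to unblock the rollout (assuming it's fine to delete those pods, as noted above):

# Delete the PDB-blocked pods so the drain can proceed (the operator recreates them),
# then watch the worker machine config pool finish updating.
oc delete pod -n knative-serving activator-7cbf6b7785-sb4zv webhook-6cd8bdbdc7-n5g8w
oc get machineconfigpool worker -w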

naved001 commented Nov 7, 2024

After the machineconfig changes were applied, I can see that the GPU device is bound to the vfio-pci driver:

[core@wrk-3 ~]$ lspci -nnk -d 10de:
3b:00.0 3D controller [0302]: NVIDIA Corporation GV100GL [Tesla V100 PCIe 32GB] [10de:1db6] (rev a1)
	Subsystem: NVIDIA Corporation Device [10de:124a]
	Kernel driver in use: vfio-pci
	Kernel modules: nouveau

And if we describe node wrk-3, we can see that the 1 GPU device shows up as allocatable:

nvidia.com/GV100GL_Tesla_V100:  1

I will now test passing it to a VM.
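
For reference, the machineconfig change is roughly along the lines of the sketch below (the actual change may differ; the docs also describe binding vfio-pci via /etc/modprobe.d and /etc/modules-load.d files rather than kernel arguments):

# Rough sketch of the kind of MachineConfig involved (the real change may differ):
# enable the IOMMU and have vfio-pci claim the V100's vendor:device ID at boot.
oc apply -f - <<'EOF'
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  name: 100-worker-iommu-vfio
  labels:
    machineconfiguration.openshift.io/role: worker
spec:
  kernelArguments:
    - intel_iommu=on
    - vfio-pci.ids=10de:1db6
EOF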

naved001 commented Nov 7, 2024

I can confirm that I can launch a VM with 1 GPU on wrk-3 (it only has 1 GPU).

You can SSH to the VM with virtctl -n virt-test ssh centos@naved-test-gpu

I launched my VM from a CentOS 9 template, so I edited it to give it access to the GPU:

➜  ~ oc get vm -n virt-test naved-test-gpu -o yaml | yq .spec.template.spec.domain.devices.hostDevices
[
  {
    "deviceName": "nvidia.com/GV100GL_Tesla_V100",
    "name": "hostdevices1"
  }
]

Once the VM launched, I could see the GPU device in lspci. After that I installed the NVIDIA drivers and could run nvidia-smi:

[centos@naved-test-gpu ~]$ lspci |grep -i nvidia
09:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 PCIe 32GB] (rev a1)
[centos@naved-test-gpu ~]$ nvidia-smi
Thu Nov  7 14:41:31 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 565.57.01              Driver Version: 565.57.01      CUDA Version: 12.7     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Tesla V100-PCIE-32GB           Off |   00000000:09:00.0 Off |                    0 |
| N/A   35C    P0             25W /  250W |       1MiB /  32768MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

@jtriley @computate what other tests do we want to perform for this issue? I am thinking of maybe testing the A100 machine, since it has multiple GPUs. In that case I'll reset this machine so that @dystewart can use it.

@computate

@naved001 you could try a simple Tensorflow test.

# Test Python TensorFlow with GPU:

pip install tensorflow numpy matplotlib torch --upgrade
python3 -m pip install 'tensorflow[and-cuda]' --upgrade

# Make sure this command lists a GPU device:
python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"
# Expected output:
# [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]

@computate

@naved001 or you could try running InstructLab. Something like this:

# Verify the GPU is visible in the VM first.
sudo dnf install pciutils
lspci -nnk | grep -A 2 -e VGA -e 3D

# Set up InstructLab in a Python 3.11 virtualenv.
git clone https://github.com/instructlab/instructlab.git
cd instructlab/
sudo dnf install python3.11 python3.11-devel
python3.11 -m venv venv
python3 -m venv --upgrade-deps venv
source venv/bin/activate
pip install packaging wheel torch
pip install 'instructlab[cuda]' \
   -C cmake.args="-DLLAMA_CUDA=on" \
   -C cmake.args="-DLLAMA_NATIVE=off"

# Rebuild llama_cpp_python with CUDA support.
CUDACXX=/usr/local/cuda-12/bin/nvcc CMAKE_ARGS="-DLLAMA_CUBLAS=on -DCMAKE_CUDA_ARCHITECTURES=native" FORCE_CMAKE=1 CUDAHOSTCXX=$(which clang++-17) pip install --force-reinstall --no-deps llama_cpp_python==0.2.79 -C cmake.args="-DLLAMA_CUDA=on"

# Initialize, download a model, serve it, generate data, and chat.
ilab init
ilab download
ilab model serve
ilab data generate --pipeline=full --num-cpus 8 --gpus 1 --taxonomy-base=empty
ilab data generate --taxonomy-base=origin/cmb-run-2024-08-26
ilab chat

@computate

@Milstein has an awesome model training Jupyter notebook with examples as well!

naved001 commented Nov 8, 2024

@computate I did the simple test and can confirm that the GPU device is usable in tensorflow.

>>> gpus = tensorflow.config.list_physical_devices('GPU')
>>> gpus
[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
>>> print(tensorflow.config.experimental.get_device_details(gpus[0]))
{'compute_capability': (7, 0), 'device_name': 'Tesla V100-PCIE-32GB'}
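
As a quick follow-up sanity check (a minimal sketch), a small computation pinned to the GPU:

python3 -c "
import tensorflow as tf
# Pin a matmul to the GPU and print which device it actually ran on.
with tf.device('/GPU:0'):
    c = tf.matmul(tf.random.normal([1024, 1024]), tf.random.normal([1024, 1024]))
print(c.device)  # expect a path ending in device:GPU:0
"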

naved001 removed the blocked label on Nov 9, 2024
@naved001

I tested the following configurations:

  • A single VM on a host with one GPU (V100 machine).
  • Two VMs running simultaneously on a single machine: one VM with two A100 GPUs and another with a single A100 GPU.
  • A VM using all GPUs on a machine, specifically four A100 GPUs in a single VM.

In all cases, I could view the GPU devices using lspci and successfully ran nvidia-smi.

Observations and Concerns

  • The machineconfig policy I submitted was applied to all worker nodes. This means that in a cluster, all GPU nodes will operate in PCI passthrough mode. To apply this configuration to only a subset of nodes, we will need custom machine config pools to configure IOMMU and VFIO passthrough for specific nodes.

  • The GPUs are available as new allocatable resources based on the hyperconverged resource configuration. For instance, nvidia.com/A100_SXM4_40GB appears as a resource for A100 GPUs, and nvidia.com/GV100GL_Tesla_V100 for V100 GPUs. As a result, we will need to update the resource quotas for projects and adjust the ColdFront configuration to manage access to these resources (see the quota sketch after this list).

  • For billing and usage, various kubevirt_vmi_* metrics are available:

    • kubevirt_vmi_memory_domain_bytes represents the memory value in the domain XML file, which provides RAM usage.
    • Each VM creates a launcher pod with some CPU (100m) and memory allocated that does not match the VM’s requested resources. However, the pod correctly requests the expected number of GPUs. We can track GPU allocation with a query like kube_pod_resource_request{namespace="virt-test", resource="nvidia.com/GV100GL_Tesla_V100"}.
    • I am unsure how to retrieve CPU requests for a VM without examining the pod definition.
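
As a rough, illustrative sketch of the quota piece mentioned above (project name and numbers are examples only):

# Sketch (illustrative): extend a project's quota to cover the passthrough GPU resource names.
oc create quota gpu-passthrough -n <project> \
  --hard=requests.nvidia.com/GV100GL_Tesla_V100=1,requests.nvidia.com/A100_SXM4_40GB=4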

I did not test GPUs in VMs with mediated devices, as I believe we need a subscription to the NVIDIA vGPU software to use that.

I am going to mark this issue as done and then undo the changes to the test cluster.

computate commented Nov 14, 2024

@naved001 I got some feedback from @hpdempsey: can we still do a demo of GPUs on VMs with @waygil, @jtriley, and @aabaris, using some GPUs from ESI?

computate reopened this on Nov 14, 2024
@naved001

@computate I undid the changes I made to the test cluster (ocp-test) so that Dylan can use the GPUs for the scheduling issues. Once that's done, we can reapply the changes for GPU testing.

> with some GPUs from ESI?

What OpenShift cluster would these be a part of?

@joachimweyl

@naved001, with your imminent parental leave, would you please break this into multiple issues, close out the parts you completed, and pass along the other issues to Chris and/or Thorsten?

naved001 commented Dec 4, 2024

@joachimweyl the testing is actually complete. @computate only reopened this issue so that we could have a demo. I am going to create another issue just for the demo then.
