`operator-inventory` reports excessively large GPU number when `'nvidia.com/gpu' device marked unhealthy` by the `nvdp-nvidia-device-plugin` #244

andy108369 · 2024-08-09T15:04:28Z

`operator-inventory` reports an excessively large GPU count when the `nvdp-nvidia-device-plugin` marks an `'nvidia.com/gpu'` device as unhealthy (see `node4` in the output below).

nvdp-nvidia-device-plugin version: 0.15.0

Workaround: restart `nvdp-nvidia-device-plugin-zcht7`, wait for it to fully initialize (its logs should show the line `Registered device plugin for 'nvidia.com/gpu' with Kubelet`), then run `kubectl -n akash-services rollout restart deployment/operator-inventory`. However, the Xid 94 errors on the GPU are serious and the node must be rebooted. Reloading the nvidia kernel module is not enough for the vllm app to start working again.

$ kubectl -n akash-services get pods -o custom-columns='NAME:.metadata.name,IMAGE:.spec.containers[*].image'
NAME                                          IMAGE
akash-node-1-0                                ghcr.io/akash-network/node:0.36.0
akash-provider-0                              ghcr.io/akash-network/provider:0.6.2
operator-hostname-6dddc6db79-4vr45            ghcr.io/akash-network/provider:0.6.2
operator-inventory-5758fb6b8b-bgnnn           ghcr.io/akash-network/provider:0.6.2
operator-inventory-hardware-discovery-node1   ghcr.io/akash-network/provider:0.6.2
operator-inventory-hardware-discovery-node2   ghcr.io/akash-network/provider:0.6.2
operator-inventory-hardware-discovery-node3   ghcr.io/akash-network/provider:0.6.2
operator-inventory-hardware-discovery-node4   ghcr.io/akash-network/provider:0.6.2
operator-inventory-hardware-discovery-node5   ghcr.io/akash-network/provider:0.6.2
operator-inventory-hardware-discovery-node6   ghcr.io/akash-network/provider:0.6.2
operator-inventory-hardware-discovery-node7   ghcr.io/akash-network/provider:0.6.2
operator-inventory-hardware-discovery-node8   ghcr.io/akash-network/provider:0.6.2

$ provider_info2.sh provider.h100.mon.obl.akash.pub
PROVIDER INFO
BALANCE: 9251.955566
"hostname"                         "address"
"provider.h100.mon.obl.akash.pub"  "akash1g7az2pus6atgeufgttlcnl0wzlzwd0lrsy6d7s"

Total/Available/Used (t/a/u) per node:
"name"   "cpu(t/a/u)"          "gpu(t/a/u)"                                    "mem(t/a/u GiB)"          "ephemeral(t/a/u GiB)"
"node1"  "252/184.88/67.12"    "8/6/2"                                         "1417.21/1039.01/378.2"   "5756.74/5244.74/512"
"node2"  "252/246.45/5.55"     "8/8/0"                                         "1417.21/1410.04/7.17"    "5756.74/5756.74/0"
"node3"  "252/181.295/70.705"  "8/0/8"                                         "1417.21/1136.57/280.64"  "5756.74/5655.74/101"
"node4"  "252/185.175/66.825"  "6/18446744073709552000/-18446744073709552000"  "1417.21/1273.4/143.82"   "5756.74/5656.74/100"
"node5"  "252/183.675/68.325"  "8/0/8"                                         "1417.21/1271.3/145.92"   "5756.74/5656.74/100"
"node6"  "252/184.975/67.025"  "8/0/8"                                         "1417.21/1273.27/143.94"  "5756.74/5656.74/100"
"node7"  "252/249.675/2.325"   "8/8/0"                                         "1417.21/1411.44/5.77"    "5756.74/5756.74/0"
"node8"  "252/184.075/67.925"  "8/0/8"                                         "1417.21/1272.3/144.92"   "5756.74/5656.74/100"

ACTIVE TOTAL:
"cpu(cores)"  "gpu"  "mem(GiB)"  "ephemeral(GiB)"  "beta1(GiB)"  "beta2(GiB)"  "beta3(GiB)"
384           42     1128.3      1013              0             0             4705

PERSISTENT STORAGE:
"storage class"  "available space(GiB)"
"beta3"          3580.81

PENDING TOTAL:
"cpu(cores)"  "gpu"  "mem(GiB)"  "ephemeral(GiB)"  "beta1(GiB)"  "beta2(GiB)"  "beta3(GiB)"

$ kubectl -n nvidia-device-plugin get pods -o wide 
NAME                              READY   STATUS    RESTARTS   AGE    IP               NODE    NOMINATED NODE   READINESS GATES
nvdp-nvidia-device-plugin-47kwj   1/1     Running   0          2d5h   10.233.90.31     node8   <none>           <none>
nvdp-nvidia-device-plugin-b9nt5   1/1     Running   1          12d    10.233.97.184    node5   <none>           <none>
nvdp-nvidia-device-plugin-fw2ft   1/1     Running   0          12d    10.233.75.39     node6   <none>           <none>
nvdp-nvidia-device-plugin-h9htp   1/1     Running   0          10m    10.233.100.159   node7   <none>           <none>
nvdp-nvidia-device-plugin-hpxns   1/1     Running   0          11d    10.233.74.113    node4   <none>           <none>
nvdp-nvidia-device-plugin-m6zmn   1/1     Running   0          14m    10.233.75.69     node2   <none>           <none>
nvdp-nvidia-device-plugin-q8jgd   1/1     Running   0          30h    10.233.102.189   node1   <none>           <none>
nvdp-nvidia-device-plugin-ts88k   1/1     Running   0          29h    10.233.71.48     node3   <none>           <none>

$ kubectl -n nvidia-device-plugin logs nvdp-nvidia-device-plugin-hpxns --timestamps
2024-07-29T10:22:37.959919504Z I0729 10:22:37.952312       1 main.go:178] Starting FS watcher.
2024-07-29T10:22:37.960005662Z I0729 10:22:37.959785       1 main.go:185] Starting OS watcher.
2024-07-29T10:22:37.960407223Z I0729 10:22:37.960253       1 main.go:200] Starting Plugins.
2024-07-29T10:22:37.960422136Z I0729 10:22:37.960354       1 main.go:257] Loading configuration.
2024-07-29T10:22:37.962864350Z I0729 10:22:37.962728       1 main.go:265] Updating config with default resource matching patterns.
2024-07-29T10:22:37.963189486Z I0729 10:22:37.963086       1 main.go:276] 
2024-07-29T10:22:37.963195946Z Running with config:
2024-07-29T10:22:37.963200493Z {
2024-07-29T10:22:37.963204889Z   "version": "v1",
2024-07-29T10:22:37.963209326Z   "flags": {
2024-07-29T10:22:37.963214494Z     "migStrategy": "none",
2024-07-29T10:22:37.963218991Z     "failOnInitError": true,
2024-07-29T10:22:37.963223476Z     "mpsRoot": "/run/nvidia/mps",
2024-07-29T10:22:37.963227813Z     "nvidiaDriverRoot": "/",
2024-07-29T10:22:37.963232129Z     "gdsEnabled": false,
2024-07-29T10:22:37.963236616Z     "mofedEnabled": false,
2024-07-29T10:22:37.963240933Z     "useNodeFeatureAPI": null,
2024-07-29T10:22:37.963245299Z     "plugin": {
2024-07-29T10:22:37.963249736Z       "passDeviceSpecs": false,
2024-07-29T10:22:37.963254162Z       "deviceListStrategy": [
2024-07-29T10:22:37.963258659Z         "volume-mounts"
2024-07-29T10:22:37.963263026Z       ],
2024-07-29T10:22:37.963267453Z       "deviceIDStrategy": "uuid",
2024-07-29T10:22:37.963271899Z       "cdiAnnotationPrefix": "cdi.k8s.io/",
2024-07-29T10:22:37.963276316Z       "nvidiaCTKPath": "/usr/bin/nvidia-ctk",
2024-07-29T10:22:37.963280782Z       "containerDriverRoot": "/driver-root"
2024-07-29T10:22:37.963285299Z     }
2024-07-29T10:22:37.963289746Z   },
2024-07-29T10:22:37.963294203Z   "resources": {
2024-07-29T10:22:37.963298640Z     "gpus": [
2024-07-29T10:22:37.963303076Z       {
2024-07-29T10:22:37.963307543Z         "pattern": "*",
2024-07-29T10:22:37.963313592Z         "name": "nvidia.com/gpu"
2024-07-29T10:22:37.963318209Z       }
2024-07-29T10:22:37.963322756Z     ]
2024-07-29T10:22:37.963327242Z   },
2024-07-29T10:22:37.963333872Z   "sharing": {
2024-07-29T10:22:37.963338489Z     "timeSlicing": {}
2024-07-29T10:22:37.963343016Z   }
2024-07-29T10:22:37.963348284Z }
2024-07-29T10:22:37.963353431Z I0729 10:22:37.963105       1 main.go:279] Retrieving plugins.
2024-07-29T10:22:37.964383151Z I0729 10:22:37.964241       1 factory.go:104] Detected NVML platform: found NVML library
2024-07-29T10:22:37.964405124Z I0729 10:22:37.964307       1 factory.go:104] Detected non-Tegra platform: /sys/devices/soc0/family file not found
2024-07-29T10:22:38.054499641Z I0729 10:22:38.054102       1 server.go:216] Starting GRPC server for 'nvidia.com/gpu'
2024-07-29T10:22:38.055690683Z I0729 10:22:38.055492       1 server.go:147] Starting to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
2024-07-29T10:22:38.060179679Z I0729 10:22:38.059947       1 server.go:154] Registered device plugin for 'nvidia.com/gpu' with Kubelet
2024-07-31T23:14:17.131437716Z I0731 23:14:17.130820       1 health.go:155] Skipping event {Device:{Handle:0x7e590d3d9e38} EventType:8 EventData:43 GpuInstanceId:4294967295 ComputeInstanceId:4294967295}
2024-07-31T23:15:25.413451812Z I0731 23:15:25.413048       1 health.go:155] Skipping event {Device:{Handle:0x7e590d3d9e38} EventType:8 EventData:43 GpuInstanceId:4294967295 ComputeInstanceId:4294967295}
2024-08-05T03:54:11.159647215Z I0805 03:54:11.158903       1 health.go:155] Skipping event {Device:{Handle:0x7e590d4986e8} EventType:8 EventData:31 GpuInstanceId:4294967295 ComputeInstanceId:4294967295}
2024-08-09T13:32:51.871941955Z I0809 13:32:51.871343       1 health.go:159] Processing event {Device:{Handle:0x7e590d4f7b40} EventType:8 EventData:94 GpuInstanceId:4294967295 ComputeInstanceId:4294967295}
2024-08-09T13:32:51.871980814Z I0809 13:32:51.871484       1 health.go:185] XidCriticalError: Xid=94 on Device=GPU-c24840ec-8de1-83d5-b126-08000173ae32; marking device as unhealthy.
2024-08-09T13:32:51.871995175Z I0809 13:32:51.871641       1 server.go:292] 'nvidia.com/gpu' device marked unhealthy: GPU-c24840ec-8de1-83d5-b126-08000173ae32
2024-08-09T13:32:51.872691177Z I0809 13:32:51.872513       1 health.go:159] Processing event {Device:{Handle:0x7e590d4986e8} EventType:8 EventData:94 GpuInstanceId:4294967295 ComputeInstanceId:4294967295}
2024-08-09T13:32:51.872718788Z I0809 13:32:51.872548       1 health.go:185] XidCriticalError: Xid=94 on Device=GPU-3b6ec030-5adc-1847-e155-79f635584b4e; marking device as unhealthy.
2024-08-09T13:32:51.872763415Z I0809 13:32:51.872668       1 server.go:292] 'nvidia.com/gpu' device marked unhealthy: GPU-3b6ec030-5adc-1847-e155-79f635584b4e
2024-08-09T13:32:51.873031828Z I0809 13:32:51.872914       1 health.go:159] Processing event {Device:{Handle:0x7e590d4986e8} EventType:8 EventData:94 GpuInstanceId:4294967295 ComputeInstanceId:4294967295}
2024-08-09T13:32:51.873040991Z I0809 13:32:51.872928       1 health.go:185] XidCriticalError: Xid=94 on Device=GPU-3b6ec030-5adc-1847-e155-79f635584b4e; marking device as unhealthy.
2024-08-09T13:32:51.873045217Z I0809 13:32:51.872955       1 server.go:292] 'nvidia.com/gpu' device marked unhealthy: GPU-3b6ec030-5adc-1847-e155-79f635584b4e
2024-08-09T13:32:51.873209885Z I0809 13:32:51.873133       1 health.go:159] Processing event {Device:{Handle:0x7e590d4986e8} EventType:8 EventData:94 GpuInstanceId:4294967295 ComputeInstanceId:4294967295}
2024-08-09T13:32:51.873215353Z I0809 13:32:51.873143       1 health.go:185] XidCriticalError: Xid=94 on Device=GPU-3b6ec030-5adc-1847-e155-79f635584b4e; marking device as unhealthy.
2024-08-09T13:32:51.873285708Z I0809 13:32:51.873196       1 server.go:292] 'nvidia.com/gpu' device marked unhealthy: GPU-3b6ec030-5adc-1847-e155-79f635584b4e
2024-08-09T13:32:51.873409684Z I0809 13:32:51.873348       1 health.go:159] Processing event {Device:{Handle:0x7e590d4986e8} EventType:8 EventData:94 GpuInstanceId:4294967295 ComputeInstanceId:4294967295}
2024-08-09T13:32:51.873413149Z I0809 13:32:51.873357       1 health.go:185] XidCriticalError: Xid=94 on Device=GPU-3b6ec030-5adc-1847-e155-79f635584b4e; marking device as unhealthy.
2024-08-09T13:32:51.873483545Z I0809 13:32:51.873411       1 server.go:292] 'nvidia.com/gpu' device marked unhealthy: GPU-3b6ec030-5adc-1847-e155-79f635584b4e
2024-08-09T13:32:51.873584285Z I0809 13:32:51.873532       1 health.go:159] Processing event {Device:{Handle:0x7e590d4986e8} EventType:8 EventData:94 GpuInstanceId:4294967295 ComputeInstanceId:4294967295}
2024-08-09T13:32:51.873588972Z I0809 13:32:51.873541       1 health.go:185] XidCriticalError: Xid=94 on Device=GPU-3b6ec030-5adc-1847-e155-79f635584b4e; marking device as unhealthy.
2024-08-09T13:32:51.873595882Z I0809 13:32:51.873567       1 server.go:292] 'nvidia.com/gpu' device marked unhealthy: GPU-3b6ec030-5adc-1847-e155-79f635584b4e
2024-08-09T13:32:51.873885446Z I0809 13:32:51.873768       1 health.go:159] Processing event {Device:{Handle:0x7e590d4986e8} EventType:8 EventData:94 GpuInstanceId:4294967295 ComputeInstanceId:4294967295}
2024-08-09T13:32:51.874590633Z I0809 13:32:51.873784       1 health.go:185] XidCriticalError: Xid=94 on Device=GPU-3b6ec030-5adc-1847-e155-79f635584b4e; marking device as unhealthy.
2024-08-09T13:32:51.874594589Z I0809 13:32:51.873821       1 server.go:292] 'nvidia.com/gpu' device marked unhealthy: GPU-3b6ec030-5adc-1847-e155-79f635584b4e
2024-08-09T13:32:51.874609421Z I0809 13:32:51.873981       1 health.go:159] Processing event {Device:{Handle:0x7e590d4986e8} EventType:8 EventData:94 GpuInstanceId:4294967295 ComputeInstanceId:4294967295}
2024-08-09T13:32:51.874631013Z I0809 13:32:51.873991       1 health.go:185] XidCriticalError: Xid=94 on Device=GPU-3b6ec030-5adc-1847-e155-79f635584b4e; marking device as unhealthy.
...
...
2024-08-09T14:46:04.104808820Z I0809 14:46:04.104592       1 server.go:292] 'nvidia.com/gpu' device marked unhealthy: GPU-c24840ec-8de1-83d5-b126-08000173ae32
2024-08-09T14:46:04.104839225Z I0809 14:46:04.104759       1 health.go:159] Processing event {Device:{Handle:0x7e590d4f7b40} EventType:8 EventData:94 GpuInstanceId:4294967295 ComputeInstanceId:4294967295}
2024-08-09T14:46:04.104846957Z I0809 14:46:04.104773       1 health.go:185] XidCriticalError: Xid=94 on Device=GPU-c24840ec-8de1-83d5-b126-08000173ae32; marking device as unhealthy.
2024-08-09T14:46:04.104860397Z I0809 14:46:04.104812       1 server.go:292] 'nvidia.com/gpu' device marked unhealthy: GPU-c24840ec-8de1-83d5-b126-08000173ae32
2024-08-09T14:46:04.105095138Z I0809 14:46:04.104955       1 health.go:159] Processing event {Device:{Handle:0x7e590d4f7b40} EventType:8 EventData:94 GpuInstanceId:4294967295 ComputeInstanceId:4294967295}
2024-08-09T14:46:04.105125604Z I0809 14:46:04.104968       1 health.go:185] XidCriticalError: Xid=94 on Device=GPU-c24840ec-8de1-83d5-b126-08000173ae32; marking device as unhealthy.
2024-08-09T14:46:04.105189360Z I0809 14:46:04.105079       1 server.go:292] 'nvidia.com/gpu' device marked unhealthy: GPU-c24840ec-8de1-83d5-b126-08000173ae32
2024-08-09T14:46:04.105194868Z I0809 14:46:04.105169       1 health.go:159] Processing event {Device:{Handle:0x7e590d4f7b40} EventType:8 EventData:94 GpuInstanceId:4294967295 ComputeInstanceId:4294967295}
2024-08-09T14:46:04.105199395Z I0809 14:46:04.105179       1 health.go:185] XidCriticalError: Xid=94 on Device=GPU-c24840ec-8de1-83d5-b126-08000173ae32; marking device as unhealthy.
2024-08-09T14:46:04.105255739Z I0809 14:46:04.105199       1 server.go:292] 'nvidia.com/gpu' device marked unhealthy: GPU-c24840ec-8de1-83d5-b126-08000173ae32
2024-08-09T14:46:04.105629449Z I0809 14:46:04.105532       1 health.go:159] Processing event {Device:{Handle:0x7e590d4f7b40} EventType:8 EventData:94 GpuInstanceId:4294967295 ComputeInstanceId:4294967295}
2024-08-09T14:46:04.105635578Z I0809 14:46:04.105563       1 health.go:185] XidCriticalError: Xid=94 on Device=GPU-c24840ec-8de1-83d5-b126-08000173ae32; marking device as unhealthy.
2024-08-09T14:46:04.105705223Z I0809 14:46:04.105622       1 server.go:292] 'nvidia.com/gpu' device marked unhealthy: GPU-c24840ec-8de1-83d5-b126-08000173ae32
2024-08-09T14:46:04.105845622Z I0809 14:46:04.105781       1 health.go:159] Processing event {Device:{Handle:0x7e590d4f7b40} EventType:8 EventData:94 GpuInstanceId:4294967295 ComputeInstanceId:4294967295}
2024-08-09T14:46:04.105853614Z I0809 14:46:04.105794       1 health.go:185] XidCriticalError: Xid=94 on Device=GPU-c24840ec-8de1-83d5-b126-08000173ae32; marking device as unhealthy.
2024-08-09T14:46:04.105905342Z I0809 14:46:04.105848       1 server.go:292] 'nvidia.com/gpu' device marked unhealthy: GPU-c24840ec-8de1-83d5-b126-08000173ae32
2024-08-09T14:46:04.106141265Z I0809 14:46:04.106008       1 health.go:159] Processing event {Device:{Handle:0x7e590d4f7b40} EventType:8 EventData:94 GpuInstanceId:4294967295 ComputeInstanceId:4294967295}
2024-08-09T14:46:04.106184660Z I0809 14:46:04.106018       1 health.go:185] XidCriticalError: Xid=94 on Device=GPU-c24840ec-8de1-83d5-b126-08000173ae32; marking device as unhealthy.
2024-08-09T14:46:04.106194135Z I0809 14:46:04.106068       1 server.go:292] 'nvidia.com/gpu' device marked unhealthy: GPU-c24840ec-8de1-83d5-b126-08000173ae32
2024-08-09T14:46:04.106293753Z I0809 14:46:04.106233       1 health.go:159] Processing event {Device:{Handle:0x7e590d4f7b40} EventType:8 EventData:94 GpuInstanceId:4294967295 ComputeInstanceId:4294967295}
2024-08-09T14:46:04.106304028Z I0809 14:46:04.106242       1 health.go:185] XidCriticalError: Xid=94 on Device=GPU-c24840ec-8de1-83d5-b126-08000173ae32; marking device as unhealthy.
2024-08-09T14:46:04.106317909Z I0809 14:46:04.106264       1 server.go:292] 'nvidia.com/gpu' device marked unhealthy: GPU-c24840ec-8de1-83d5-b126-08000173ae32
arno@x1:~$
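
The log above repeats the same `marked unhealthy` line many times, so it helps to reduce it to the distinct device UUIDs. The following is a minimal sketch in Go, assuming the log line format shown above. The function name `unhealthyGPUs` and the regular expression are illustrative and not part of the device plugin or the provider.

```go
package main

import (
	"bufio"
	"fmt"
	"regexp"
	"sort"
	"strings"
)

// unhealthyRe matches the "marked unhealthy" log line shown in this issue
// and captures the GPU UUID. The pattern is an assumption based on that output.
var unhealthyRe = regexp.MustCompile(`'nvidia\.com/gpu' device marked unhealthy: (GPU-[0-9a-f-]+)`)

// unhealthyGPUs returns the distinct GPU UUIDs reported unhealthy, sorted.
func unhealthyGPUs(logs string) []string {
	seen := map[string]bool{}
	sc := bufio.NewScanner(strings.NewReader(logs))
	for sc.Scan() {
		if m := unhealthyRe.FindStringSubmatch(sc.Text()); m != nil {
			seen[m[1]] = true
		}
	}
	out := make([]string, 0, len(seen))
	for id := range seen {
		out = append(out, id)
	}
	sort.Strings(out)
	return out
}

func main() {
	// Three lines copied from the log above (timestamps trimmed).
	logs := `I0809 13:32:51.871641       1 server.go:292] 'nvidia.com/gpu' device marked unhealthy: GPU-c24840ec-8de1-83d5-b126-08000173ae32
I0809 13:32:51.872668       1 server.go:292] 'nvidia.com/gpu' device marked unhealthy: GPU-3b6ec030-5adc-1847-e155-79f635584b4e
I0809 13:32:51.872955       1 server.go:292] 'nvidia.com/gpu' device marked unhealthy: GPU-3b6ec030-5adc-1847-e155-79f635584b4e`
	for _, id := range unhealthyGPUs(logs) {
		fmt.Println(id)
	}
}
```

On the excerpt shown in this issue this yields two distinct UUIDs, which matches the GPU total for `node4` dropping from 8 to 6.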

root@node4:~# dmesg -T |grep NVRM
[Sun Jul 28 09:15:27 2024] NVRM: loading NVIDIA UNIX x86_64 Kernel Module  535.183.01  Sun May 12 19:39:15 UTC 2024
[Wed Jul 31 23:14:18 2024] NVRM: GPU at PCI:0000:00:05: GPU-efbdfde9-5798-a6e7-4c46-12518fa15375
[Wed Jul 31 23:14:18 2024] NVRM: GPU Board Serial Number: 1650423013443
[Wed Jul 31 23:14:18 2024] NVRM: Xid (PCI:0000:00:05): 43, pid=1045242, name=pt_main_thread, Ch 00000008
[Wed Jul 31 23:15:26 2024] NVRM: Xid (PCI:0000:00:05): 43, pid=1084984, name=pt_main_thread, Ch 00000008
[Mon Aug  5 03:54:13 2024] NVRM: GPU at PCI:0000:00:07: GPU-3b6ec030-5adc-1847-e155-79f635584b4e
[Mon Aug  5 03:54:13 2024] NVRM: GPU Board Serial Number: 1652923017935
[Mon Aug  5 03:54:13 2024] NVRM: Xid (PCI:0000:00:07): 31, pid=2512125, name=python3.10, Ch 00000008, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_8 faulted @ 0x73ac_1c000000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_WRITE
[Fri Aug  9 13:32:56 2024] NVRM: GPU at PCI:0000:00:08: GPU-c24840ec-8de1-83d5-b126-08000173ae32
[Fri Aug  9 13:32:56 2024] NVRM: GPU Board Serial Number: 1652923018111
[Fri Aug  9 13:32:56 2024] NVRM: Xid (PCI:0000:00:08): 94, pid='<unknown>', name=<unknown>, Contained: SM (0x1). RST: No, D-RST: No
[Fri Aug  9 13:32:56 2024] NVRM: Xid (PCI:0000:00:07): 94, pid='<unknown>', name=<unknown>, Contained: SM (0x1). RST: No, D-RST: No
[Fri Aug  9 13:32:56 2024] NVRM: Xid (PCI:0000:00:08): 94, pid=3780430, name=pt_main_thread, Ch 00000008
...
...
[Fri Aug  9 14:31:07 2024] NVRM: Xid (PCI:0000:00:07): 94, pid=3847803, name=pt_main_thread, Ch 0000000e
[Fri Aug  9 14:31:07 2024] NVRM: Xid (PCI:0000:00:08): 94, pid=3847804, name=pt_main_thread, Ch 0000000f
[Fri Aug  9 14:31:07 2024] NVRM: Xid (PCI:0000:00:07): 94, pid=3847803, name=pt_main_thread, Ch 0000000f
[Fri Aug  9 14:46:08 2024] NVRM: Xid (PCI:0000:00:08): 94, pid='<unknown>', name=<unknown>, Contained: SM (0x1). RST: No, D-RST: No
[Fri Aug  9 14:46:08 2024] NVRM: Xid (PCI:0000:00:08): 94, pid=3860364, name=pt_main_thread, Ch 00000008
[Fri Aug  9 14:46:08 2024] NVRM: Xid (PCI:0000:00:08): 94, pid=3860364, name=pt_main_thread, Ch 00000009
[Fri Aug  9 14:46:08 2024] NVRM: Xid (PCI:0000:00:08): 94, pid=3860364, name=pt_main_thread, Ch 0000000a
[Fri Aug  9 14:46:08 2024] NVRM: Xid (PCI:0000:00:08): 94, pid=3860364, name=pt_main_thread, Ch 0000000b
[Fri Aug  9 14:46:08 2024] NVRM: Xid (PCI:0000:00:08): 94, pid=3860364, name=pt_main_thread, Ch 0000000c
[Fri Aug  9 14:46:08 2024] NVRM: Xid (PCI:0000:00:08): 94, pid=3860364, name=pt_main_thread, Ch 0000000d
[Fri Aug  9 14:46:08 2024] NVRM: Xid (PCI:0000:00:08): 94, pid=3860364, name=pt_main_thread, Ch 0000000e
[Fri Aug  9 14:46:08 2024] NVRM: Xid (PCI:0000:00:08): 94, pid=3860364, name=pt_main_thread, Ch 0000000f
root@node4:~#
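
A possible explanation for the `node4` value: `18446744073709552000` is very close to 2^64 (18446744073709551616), which is what an unsigned 64-bit subtraction produces when it wraps below zero. The sketch below assumes the available count is computed as allocatable minus allocated on `uint64` values, and that 8 GPUs were still allocated on `node4` when the allocatable count dropped to 6. Both are assumptions and have not been checked against the provider code; the helper names are hypothetical.

```go
package main

import "fmt"

// availableWrapping mirrors a naive "allocatable - allocated" on unsigned counters.
// Hypothetical helper, not taken from the provider code base.
func availableWrapping(allocatable, allocated uint64) uint64 {
	return allocatable - allocated // wraps around when allocated > allocatable
}

// availableSaturating clamps the result at zero instead of wrapping.
func availableSaturating(allocatable, allocated uint64) uint64 {
	if allocated > allocatable {
		return 0
	}
	return allocatable - allocated
}

func main() {
	// node4: 6 GPUs still allocatable, assumed 8 still allocated to running pods.
	fmt.Println(availableWrapping(6, 8))   // 18446744073709551614
	fmt.Println(availableSaturating(6, 8)) // 0
}
```

The trailing digits in the reported value (`...552000` rather than `...551614`) may come from the number passing through a floating-point representation before being printed, but that is also a guess.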

andy108369 mentioned this issue Aug 21, 2024

inventory-operator: doesn't detect when nvdp-nvidia-device-plugin marks GPU as unhealthy #249

Open
