operator-inventory reports excessively large GPU number when 'nvidia.com/gpu' device marked unhealthy by the nvdp-nvidia-device-plugin
nvdp-nvidia-device-plugin - 0.15.0
Workaround: restart nvdp-nvidia-device-plugin-zcht7, wait for it to fully initialize (its logs should show the line Registered device plugin for 'nvidia.com/gpu' with Kubelet), then run kubectl -n akash-services rollout restart deployment/operator-inventory. However, the Xid 94 errors on the GPU are serious, and the node must be rebooted. Reloading the nvidia kernel module is not enough for the vllm app to start working again.
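The workaround above can be sketched as a few shell helpers. This is only a sketch under assumptions: the report does not say which namespace the device plugin runs in or how the pod was restarted, so the `NVDP_NS` default and the use of `kubectl delete pod` are guesses, and the function names are made up for illustration. Only the `operator-inventory` command and the log line are taken from the report.

```shell
#!/bin/sh
# Sketch of the workaround. Nothing runs until a function is called.

# ASSUMPTION: namespace of the device plugin; not stated in the report.
NVDP_NS="${NVDP_NS:-nvidia-device-plugin}"
# Pod name as given in the report; it differs per node and per restart.
NVDP_POD="${NVDP_POD:-nvdp-nvidia-device-plugin-zcht7}"
# Log line that signals the plugin has fully initialized (from the report).
READY_MSG="Registered device plugin for 'nvidia.com/gpu' with Kubelet"

# ASSUMPTION: "restarting" the plugin means deleting its pod so that the
# controller recreates it.
restart_plugin() {
  kubectl -n "$NVDP_NS" delete pod "$NVDP_POD"
}

# Returns 0 if the log text passed as $1 contains the registration line.
plugin_registered() {
  printf '%s\n' "$1" | grep -qF "$READY_MSG"
}

# Command from the report: restart operator-inventory so it re-reads the GPU count.
restart_inventory() {
  kubectl -n akash-services rollout restart deployment/operator-inventory
}
```

A possible sequence is `restart_plugin`, then checking `plugin_registered "$(kubectl -n "$NVDP_NS" logs <new-pod-name>)"` until it succeeds, then `restart_inventory`. The replacement pod will most likely get a new name, which has to be looked up by hand. None of this clears the Xid 94 condition; the node reboot is still required.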