Node Auto-Provisioning failing for certain GPU nodes (T4) #402
Comments
fwiw I'm seeing this on A100 nodes too now

I'm seeing this on T4 as well

For L4 GPUs it is also not working. I'm using the COS image and didn't disable automatic installation.
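One thing worth checking on an affected node (debugging sketch only, the node name is a placeholder): whether the auto-provisioned node ever received the `cloud.google.com/gke-gpu-driver-version` label that the automatic installer keys off of (see the init container script quoted further down).

```bash
# List the GPU node's labels and look for the driver-version label that the
# nvidia-driver-installer init container checks for.
kubectl get node <auto-provisioned-node-name> --show-labels | tr ',' '\n' | grep gke-gpu
```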
Seems like this is still a problem.... T4 in London: the installer just sits on curl's progress header (`% Total % Received % Xferd Average Speed Time Time Time Current`), stuck like this forever....
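For anyone trying to confirm the same hang, tailing the init container shows it. A rough sketch below; the `k8s-app=nvidia-gpu-device-plugin` label and the container name are assumptions based on the default GKE device plugin add-on and may differ on your cluster:

```bash
# Find the device plugin pod scheduled on the newly provisioned GPU node
# (label is an assumption based on the default GKE device plugin DaemonSet).
kubectl get pods -n kube-system -l k8s-app=nvidia-gpu-device-plugin -o wide

# Tail the stuck init container; in the failing case it never gets past the
# "Waiting for GPU driver libraries to be available." loop.
kubectl logs -n kube-system <device-plugin-pod-name> -c nvidia-driver-installer -f
```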
I'm seeing this for T4 GPU nodes on my end as well in ...
From the following init container's command: ...
initContainers:
- command:
  - bash
  - -c
  - |
    LABELS=$( curl --retry 5 -H "Metadata-Flavor:Google" http://metadata.google.internal/computeMetadata/v1/instance/attributes/kube-labels || exit 1 )
    IFS=,; for label in $LABELS; do
      IFS==; read -r LABEL VALUE <<< "$label"
      if [[ "${LABEL}" == "cloud.google.com/gke-gpu-driver-version" ]]; then
        GPU_DRIVER_VERSION=$VALUE
      fi
    done
    if [[ "${GPU_DRIVER_VERSION}" == "latest" ]]; then
      echo "latest" > /etc/nvidia/gpu_driver_version_config.txt
      /cos-gpu-installer install --version=latest || exit 1
    elif [[ "${GPU_DRIVER_VERSION}" == "default" ]]; then
      echo "default" > /etc/nvidia/gpu_driver_version_config.txt
      /cos-gpu-installer install || exit 1
    else
      echo "disabled" > /etc/nvidia/gpu_driver_version_config.txt
      echo "GPU driver auto installation is disabled."
    fi
    echo "Waiting for GPU driver libraries to be available."
    while ! [[ -f /usr/local/nvidia/lib64/libcuda.so ]]; do
      sleep 5
    done
    echo "GPU driver is installed."
    echo "InitContainer succeeded. Start nvidia-gpu-device-plugin container."
    exit 0

Seems like the key ...
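In other words, if `cloud.google.com/gke-gpu-driver-version` never makes it into the instance's kube-labels metadata, the script takes the "disabled" branch and then waits forever for libcuda.so. A quick way to check that from the node itself (debugging sketch only, it just mirrors the script's own lookup):

```bash
# Run on the affected GKE node (e.g. over SSH); same metadata lookup the init container does.
curl --retry 5 -H "Metadata-Flavor: Google" \
  "http://metadata.google.internal/computeMetadata/v1/instance/attributes/kube-labels" \
  | tr ',' '\n' | grep gke-gpu-driver-version \
  || echo "gke-gpu-driver-version label not present in kube-labels"
```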
Workaround for me was to use the Ubuntu OS image and manually install the driver by applying their example: https://cloud.google.com/kubernetes-engine/docs/how-to/gpus#installing_drivers
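For anyone following that route, the manual install in those docs comes down to applying the NVIDIA driver installer DaemonSet for your node image. The manifest URLs below are taken from that page and may change over time, so double-check the docs first:

```bash
# Ubuntu node images (manifest URL from the GKE docs linked above; may change):
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/ubuntu/daemonset-preloaded.yaml

# Container-Optimized OS (COS) node images:
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml
```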
How to re-create

A job that is marked as requiring `nvidia.com/gpu`, if it results in a new node being spun up in GKE, will fail to be scheduled on that node. (A minimal repro job is sketched at the end of this issue.)

Why is this bad
Details on error
- The provisioned node has a `nvidia-device-plugin` pod.
- This pod has a `nvidia-driver-installer` container, which is an `init` container.
- This container is stuck on startup.
- As a result, the kubelet never registers the `nvidia.com/gpu` resource, which means that the job (which triggered the node in the first place!) can't get its pods scheduled on it.

Prior context:
This is based off the following issue, which is no longer fixed (but which I cannot reopen)
#356
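As referenced under "How to re-create" above, here is a minimal sketch of the kind of job that triggers this. The job name, container image, and accelerator selector are illustrative placeholders, not taken from the original report:

```bash
# Minimal repro sketch: a job requesting a GPU, relying on node auto-provisioning
# to bring up a matching node. Names/selectors are illustrative; adjust for your cluster.
cat <<'EOF' | kubectl apply -f -
apiVersion: batch/v1
kind: Job
metadata:
  name: gpu-nap-repro
spec:
  template:
    spec:
      restartPolicy: Never
      nodeSelector:
        cloud.google.com/gke-accelerator: nvidia-tesla-t4
      containers:
      - name: cuda
        image: nvidia/cuda:11.0.3-base-ubuntu20.04
        command: ["nvidia-smi"]
        resources:
          limits:
            nvidia.com/gpu: 1
EOF

# In the failing case the pod stays Pending: the auto-provisioned node never
# advertises nvidia.com/gpu because the driver-installer init container is stuck.
kubectl get pods -l job-name=gpu-nap-repro
kubectl describe node <auto-provisioned-node> | grep -i "nvidia.com/gpu" || echo "nvidia.com/gpu not registered on the node"
```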