Update nvidia-driver-installer pull policy for init container #354

Open
wants to merge 1 commit into
base: master

@konturn konturn commented Feb 28, 2024

I've run into an issue where node maintenance on GPU nodes prevents the driver installer daemonset from starting up again. Specifically, our issue looks like this:

  1. GCP schedules maintenance for our H100 node (we cannot prevent this). We’re using the termination maintenance policy, so the node gets stopped.
  2. The node gets restarted, and GCP tries to reattach the local SSDs from before but cannot. These local SSDs are used for containerd image storage (via a symlink) and for Nvidia driver storage, which means that all the images are wiped from the node.
  3. The daemonset which exposes GPUs on the node cannot start, since the image no longer exists on the node and the pull policy is set to ‘Never’.

The fix here entails self-managing a modified version of the daemonset that has the adjusted pull policy. The GKE documentation should link to a daemonset that's able to work properly after node maintenance events.
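
For illustration only, here is a minimal Python sketch of the kind of adjustment described above, applied to a daemonset manifest that has already been loaded into a dict. The function name `relax_init_pull_policy`, the example manifest, the placeholder image name, and the choice of `IfNotPresent` as the replacement value are all assumptions made for the sketch; the comment above does not say which policy the PR switches to. `Always`, `IfNotPresent`, and `Never` are the three values Kubernetes accepts for `imagePullPolicy`.

```python
import copy

# The three values Kubernetes accepts for imagePullPolicy.
VALID_PULL_POLICIES = {"Always", "IfNotPresent", "Never"}


def relax_init_pull_policy(daemonset, new_policy="IfNotPresent"):
    """Return a copy of the manifest in which every init container whose
    pull policy is 'Never' uses `new_policy` instead.

    `new_policy` defaults to 'IfNotPresent', which is an assumption of this
    sketch and not something stated in the PR description.
    """
    if new_policy not in VALID_PULL_POLICIES:
        raise ValueError(f"unsupported imagePullPolicy: {new_policy!r}")

    patched = copy.deepcopy(daemonset)  # leave the caller's manifest untouched
    pod_spec = patched["spec"]["template"]["spec"]
    for container in pod_spec.get("initContainers", []):
        if container.get("imagePullPolicy") == "Never":
            container["imagePullPolicy"] = new_policy
    return patched


# Hypothetical, heavily trimmed manifest; the image name is a placeholder.
example = {
    "kind": "DaemonSet",
    "spec": {
        "template": {
            "spec": {
                "initContainers": [
                    {
                        "name": "nvidia-driver-installer",
                        "image": "example.invalid/driver-installer:placeholder",
                        "imagePullPolicy": "Never",
                    }
                ]
            }
        }
    },
}

patched = relax_init_pull_policy(example)
```

With a policy other than `Never`, the kubelet is allowed to pull the installer image again when it is missing from the node, which is the situation left behind after the local SSDs are wiped.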

google-cla bot commented Feb 28, 2024

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.
