bump nvidia-device-plugin to v0.16.1 #242

Open · 2 of 4 tasks
andy108369 opened this issue Jul 29, 2024 · 5 comments

Labels: repo/helm-charts (Akash Helm Chart repo issues), repo/provider (Akash provider-services repo issues)
andy108369 commented Jul 29, 2024

k8s-device-plugin v0.16.1 was released 3 days ago. Among other changes, it updates the CUDA base image version to 12.5.1: https://github.com/NVIDIA/k8s-device-plugin/releases

Need to test the following:

  • whether we can upgrade the current nvidia-device-plugin helm chart to 0.16.1 without impacting existing GPU deployments (we can probably pick a provider with the fewest used GPUs; the sandbox will probably do best); see the pre-upgrade check sketched after this list
  • whether it changes the CUDA version reported by nvidia-smi | grep Version (probably unrelated, but still worth checking)
  • bump the version from 0.15.1 to 0.16.1 in the docs https://akash.network/docs/providers/build-a-cloud-provider/gpu-resource-enablement/
  • upgrade nvidia-device-plugin across all the GPU providers
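
A minimal pre-upgrade check, assuming the chart is installed as the nvdp release in the nvidia-device-plugin namespace (the same names used in the workaround further down); adjust to the actual provider setup:

# Current chart version and plugin pod status
helm list -n nvidia-device-plugin
kubectl -n nvidia-device-plugin get pods -o wide

# GPUs the nodes currently advertise, to compare against after the upgrade
kubectl get nodes -o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.'nvidia\.com/gpu'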
andy108369 added the repo/provider (Akash provider-services repo issues) and repo/helm-charts (Akash Helm Chart repo issues) labels on Jul 29, 2024
andy108369 self-assigned this on Jul 29, 2024
andy108369 commented

Testing this on the Cato provider, which has had 0 leases since yesterday.
Currently hitting this issue: NVIDIA/k8s-device-plugin#856

andy108369 commented

Figured out the issue: the new nvidia-device-plugin 0.16.x helm charts (0.16.0-rc.1, 0.16.0, 0.16.1) drop the SYS_ADMIN capability, leading to the unable to create plugin manager: nvml init failed: ERROR_LIBRARY_NOT_FOUND error.

Let's keep using nvidia-device-plugin 0.15.1 until NVIDIA/k8s-device-plugin#856 is fixed or a better workaround is found, rather than modifying/customizing the helm chart manually.
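
For reference, a quick way to confirm this on a cluster; just a sketch, assuming the plugin runs as a daemonset in the nvidia-device-plugin namespace (the daemonset name below is a placeholder):

# Inspect the securityContext the 0.16.x chart renders for the plugin container
kubectl -n nvidia-device-plugin get ds
kubectl -n nvidia-device-plugin get ds <plugin-daemonset> -o jsonpath='{.spec.template.spec.containers[0].securityContext}{"\n"}'

# The nvml init failure shows up in the plugin pod logs
kubectl -n nvidia-device-plugin logs daemonset/<plugin-daemonset> --tail=20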

andy108369 commented Aug 1, 2024

For the record: restarting nvidia-device-plugin (nvdp), or even uninstalling it, does not impact already existing & active GPU workloads. It only impacts them if their pod gets restarted: the pod then goes into the Pending state until it finds a worker node with a GPU, and if the nvdp plugin is not running, the pod stays Pending forever.

And, as expected, it does not change the CUDA version reported by nvidia-smi | grep Version (that is governed by the cuda-compat-<ver> packages plus the LD_LIBRARY_PATH method of loading them).
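
A sketch of how both behaviours can be observed; the namespace and pod names below are placeholders for an actual GPU lease:

# A restarted GPU pod stays Pending while no node advertises nvidia.com/gpu
kubectl get pods -A --field-selector=status.phase=Pending
kubectl -n <lease-namespace> describe pod <gpu-pod>   # Events typically show "Insufficient nvidia.com/gpu"

# The CUDA version reported inside the workload comes from the driver/compat packages, not the plugin chart
kubectl -n <lease-namespace> exec <gpu-pod> -- nvidia-smi | grep Version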

andy108369 commented

Workaround

The quick workaround is to pass securityContext.capabilities.add[0]=SYS_ADMIN to the chart, e.g.:

helm upgrade --install nvdp nvdp/nvidia-device-plugin \
  --namespace nvidia-device-plugin \
  --create-namespace \
  --version 0.16.1 \
  --set runtimeClassName="nvidia" \
  --set deviceListStrategy=volume-mounts \
  --set securityContext.capabilities.add[0]=SYS_ADMIN
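
A minimal post-upgrade check, assuming the same release and namespace names as above (the node name is a placeholder):

# Chart version and plugin pod status after the upgrade
helm list -n nvidia-device-plugin
kubectl -n nvidia-device-plugin get pods

# The node should keep advertising its GPUs
kubectl describe node <gpu-node> | grep -A1 'nvidia.com/gpu'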

andy108369 commented

Going to update our docs after a better fix for NVIDIA/k8s-device-plugin#856 is released.
