GPU device plugin deployment issue (non default namespace) #1840

Closed
pawel-gacek opened this issue Sep 18, 2024 · 4 comments · Fixed by #1850
Labels
bug Something isn't working docs Documentation related issue

Comments

@pawel-gacek
pawel-gacek commented Sep 18, 2024

Describe the bug
The GPU device plugin does not work properly when it is installed in a namespace other than default. When the plugin is deployed with kustomization, the ServiceAccount namespace in the ClusterRoleBinding resource stays set to "default", regardless of the namespace configured for the GPU device plugin deployment:
https://github.com/intel/intel-device-plugins-for-kubernetes/blob/main/deployments/gpu_plugin/overlays/fractional_resources/gpu-manager-rolebinding.yaml

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: gpu-manager-rolebinding
subjects:
- kind: ServiceAccount
  name: gpu-manager-sa
  namespace: default   # <<< here
roleRef:
  kind: ClusterRole
  name: gpu-manager-role
  apiGroup: rbac.authorization.k8s.io

To Reproduce
Install the GPU device plugin in a non-default namespace using kustomization.

Expected behavior
For the ClusterRoleBinding resource (gpu-manager-rolebinding), the ServiceAccount namespace is set to the namespace the plugin is deployed into.
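
To illustrate the expected result, here is a minimal Python sketch that rewrites the ServiceAccount subject namespace of an already-parsed binding manifest. It is not part of the plugin or of kustomize. The function name `with_subject_namespace` and the namespace `gpu-plugins` are assumptions made up for this example, and the manifest is written out as a plain dict rather than loaded from the rendered YAML.

```python
import copy


def with_subject_namespace(binding: dict, namespace: str) -> dict:
    """Return a copy of a binding whose ServiceAccount subjects use `namespace`."""
    patched = copy.deepcopy(binding)  # leave the caller's manifest untouched
    for subject in patched.get("subjects") or []:
        # Only ServiceAccount subjects are namespaced; User and Group are not.
        if subject.get("kind") == "ServiceAccount":
            subject["namespace"] = namespace
    return patched


# The ClusterRoleBinding from this issue, as a parsed manifest.
binding = {
    "apiVersion": "rbac.authorization.k8s.io/v1",
    "kind": "ClusterRoleBinding",
    "metadata": {"name": "gpu-manager-rolebinding"},
    "subjects": [
        {"kind": "ServiceAccount", "name": "gpu-manager-sa", "namespace": "default"}
    ],
    "roleRef": {
        "kind": "ClusterRole",
        "name": "gpu-manager-role",
        "apiGroup": "rbac.authorization.k8s.io",
    },
}

# "gpu-plugins" is a hypothetical target namespace.
patched = with_subject_namespace(binding, "gpu-plugins")
print(patched["subjects"][0]["namespace"])
```

The same change can presumably be made by hand-editing the rendered manifest before applying it; the sketch only shows which field has to follow the deployment namespace.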

System (please complete the following information):

  • OS version: Ubuntu 22.04
  • Kernel version: Linux 5.15
  • Device plugins version: v0.30.0
  • Hardware info:
    Xeon 8360Y,
    System Information
    Manufacturer: Intel Corporation
    Product Name: M50CYP2SBSTD
    Version: M50CYP2UR208

Thank you
Pawel

@tkatila
Contributor

tkatila commented Sep 25, 2024

Hi @pawel-gacek, yep, you are correct. This is a limitation of the deployment. We can't change the namespace name within the YAML file. The namespace is handled properly in our operator-based deployment, though.

@pawel-gacek
Author

Hi @tkatila, got it, thanks. This can cause issues in plugin operation even though the deployment itself works fine. It would be good to document this limitation somewhere, as I believe kustomization-based deployments are still in use. In our case we simply did not notice that the GPU plugin was not working properly until we saw a GPU resource allocation failure for one of our workloads.
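
Since the deployment itself succeeds, a small check after install might catch the mismatch earlier. Below is a hedged Python sketch that inspects the JSON of the binding (for example, the output of `kubectl get clusterrolebinding gpu-manager-rolebinding -o json`) and lists ServiceAccount subjects bound in a different namespace than expected. The helper name `mismatched_subjects` and the namespace `gpu-plugins` are assumptions for illustration, and the embedded JSON is a trimmed stand-in for real cluster output.

```python
import json


def mismatched_subjects(binding: dict, expected_namespace: str) -> list:
    """Return ServiceAccount subjects whose namespace differs from the expected one."""
    return [
        subject
        for subject in binding.get("subjects") or []
        if subject.get("kind") == "ServiceAccount"
        and subject.get("namespace") != expected_namespace
    ]


# Trimmed stand-in for the JSON form of the ClusterRoleBinding in this issue.
live_json = """
{
  "kind": "ClusterRoleBinding",
  "metadata": {"name": "gpu-manager-rolebinding"},
  "subjects": [
    {"kind": "ServiceAccount", "name": "gpu-manager-sa", "namespace": "default"}
  ],
  "roleRef": {"kind": "ClusterRole", "name": "gpu-manager-role"}
}
"""

binding = json.loads(live_json)
# "gpu-plugins" is a hypothetical namespace the plugin was deployed into.
problems = mismatched_subjects(binding, "gpu-plugins")
for subject in problems:
    print(f"{subject['name']}: bound in {subject.get('namespace')!r}")
```

An empty result means every ServiceAccount subject already points at the expected namespace.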

@tkatila
Contributor

tkatila commented Sep 25, 2024

Sure. I'll add a note about it to the advanced deployments docs.

Off-topic: fractional resources is a sort of niche use case, how are you using it?

@pawel-gacek
Author

We use the GPU Aware Scheduler extender, which requires fractional resources to be enabled in the GPU device plugin.

@tkatila tkatila added bug Something isn't working docs Documentation related issue labels Sep 25, 2024