Gpu: Kubelet API use and less frequent cleanup #1363
Conversation
Codecov Report
@@            Coverage Diff             @@
##             main    #1363      +/-   ##
==========================================
- Coverage   51.17%   50.56%    -0.62%
==========================================
  Files          44       44
  Lines        4879     4952       +73
==========================================
+ Hits         2497     2504        +7
- Misses       2239     2302       +63
- Partials      143      146        +3
some quick comments
- apiGroups:
  - ""
  resources:
  - nodes/proxy
  verbs:
  - get
  - list
I think this part is wrong and should not be needed at all. For fractional resources, inteldeviceplugins-gpu-manager-role is used.
I tried it without this initially, but the API server prevented the operator from creating roles which the operator itself didn't have access to.
Looking at this again, I'll double-check this. The roles are included in reconciler.go, and I can't recall anymore whether this is really needed.
I already made and pushed the change to remove the snippet, as it seemed to work without it. But then I remembered to also test with resource management, and there I got this error in the operator:
E0330 08:39:26.403430 1 reconciler.go:419] "intel-device-plugins-manager: unable to create ClusterRoleBinding" err=<
clusterrolebindings.rbac.authorization.k8s.io "gpu-manager-rolebinding" is forbidden: user "system:serviceaccount:inteldeviceplugins-system:default" (groups=["system:serviceaccounts" "system:serviceaccounts:inteldeviceplugins-system" "system:authenticated"]) is attempting to grant RBAC permissions not currently held:
{APIGroups:[""], Resources:["nodes/proxy"], Verbs:["get" "list"]}
So the snippet is required.
So the snippet is required.
It means the operator service account always needs the same RBAC permissions required by the resource manager, and currently cluster admins have no way to drop them. This probably requires further thinking in the future.
Yeah, unfortunately. I suppose we can't reconfigure the operator's roles when someone creates a GPU CRD with the resource manager enabled.
The alternative I see is that the elevated service account for the GPU resource manager is created at operator deploy time.
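As a minimal sketch of the constraint discussed in this thread (the package and variable names below are illustrative, not the plugin's actual reconciler.go code): Kubernetes RBAC escalation prevention only lets a subject grant rules it already holds, so the operator's own ClusterRole has to carry the same nodes/proxy rule it grants to the gpu-manager role.

package rbacsketch

import rbacv1 "k8s.io/api/rbac/v1"

// nodesProxyRule mirrors the nodes/proxy snippet from the diff above.
// Because of RBAC escalation prevention, the operator's own service account
// must already hold this rule before it can grant it via the gpu-manager
// ClusterRole; otherwise the create is rejected with the "attempting to
// grant RBAC permissions not currently held" error quoted above.
var nodesProxyRule = rbacv1.PolicyRule{
	APIGroups: []string{""},
	Resources: []string{"nodes/proxy"},
	Verbs:     []string{"get", "list"},
}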
lgtm
deployments/gpu_plugin/overlays/fractional_resources/add-podresource-mount.yaml
In large clusters and with resource management, the load from the gpu-plugins can become heavy for the api-server. This change starts fetching pod listings from the kubelet and uses the api-server as a backup. Any error other than a timeout will also move the logic back to using the api-server. Signed-off-by: Tuomas Katila <[email protected]>
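As a rough illustration of the kubelet-first, api-server-fallback approach the commit message describes (the endpoint, token path, type and function names here are assumptions for the sketch, not the plugin's actual code):

package podfetch

import (
	"context"
	"crypto/tls"
	"encoding/json"
	"errors"
	"fmt"
	"net"
	"net/http"
	"os"
	"time"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// podLister prefers the node-local kubelet API and falls back to the api-server.
type podLister struct {
	client     kubernetes.Interface
	nodeName   string
	useKubelet bool // flips to false after a non-timeout kubelet error
}

func (p *podLister) listPods(ctx context.Context) (*v1.PodList, error) {
	if p.useKubelet {
		pods, err := p.listFromKubelet(ctx)
		if err == nil {
			return pods, nil
		}
		// Timeouts are treated as transient; any other error moves the
		// logic back to the api-server, as the commit message describes.
		var nerr net.Error
		if !(errors.As(err, &nerr) && nerr.Timeout()) {
			p.useKubelet = false
		}
	}
	// Fallback: list only this node's pods from the api-server.
	return p.client.CoreV1().Pods("").List(ctx, metav1.ListOptions{
		FieldSelector: "spec.nodeName=" + p.nodeName,
	})
}

func (p *podLister) listFromKubelet(ctx context.Context) (*v1.PodList, error) {
	// The service account token authenticates against the kubelet, which is
	// why the nodes/proxy RBAC rule discussed above is needed.
	token, err := os.ReadFile("/var/run/secrets/kubernetes.io/serviceaccount/token")
	if err != nil {
		return nil, err
	}
	// Skipping certificate verification only to keep the sketch short.
	httpClient := &http.Client{
		Timeout:   5 * time.Second,
		Transport: &http.Transport{TLSClientConfig: &tls.Config{InsecureSkipVerify: true}},
	}
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, "https://127.0.0.1:10250/pods", nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Authorization", "Bearer "+string(token))
	resp, err := httpClient.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("kubelet /pods returned %s", resp.Status)
	}
	var pods v1.PodList
	if err := json.NewDecoder(resp.Body).Decode(&pods); err != nil {
		return nil, err
	}
	return &pods, nil
}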
In large clusters with resource management, the gpu-plugins cause a high load on the api-server. These changes move that load from the api-server to the kubelet. If the kubelet API isn't available, fetching is done from the api-server, but then, for example, the cleanup method isn't activated as often as before.
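And a minimal, hypothetical sketch of the less-frequent-cleanup part: cleanup driven by a timer interval instead of by every fetch. The interval handling and function names are assumptions, not taken from the PR.

package cleanupsketch

import (
	"context"
	"time"
)

// runPeriodicCleanup calls cleanup on a fixed interval rather than on every
// pod fetch; both the interval and the cleanup callback are placeholders.
func runPeriodicCleanup(ctx context.Context, interval time.Duration, cleanup func()) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			cleanup()
		}
	}
}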