Gpu: Kubelet API use and less frequent cleanup #1363

tkatila · 2023-03-23T12:32:37Z

In large clusters with resource management, gpu-plugins cause high load to api-server. These changes will move the load from api-server to kubelet. If kubelet API isn't available, fetching will be done from api-server, but for example the cleanup method isn't activated as often as before.

codecov-commenter · 2023-03-23T14:21:13Z

Codecov Report

Merging #1363 (04d2600) into main (944df4e) will decrease coverage by 0.62%.
The diff coverage is 18.60%.

❗ Current head 04d2600 differs from pull request most recent head 974829f. Consider uploading reports for the commit 974829f to get more accurate results

@@            Coverage Diff             @@
##             main    #1363      +/-   ##
==========================================
- Coverage   51.17%   50.56%   -0.62%     
==========================================
  Files          44       44              
  Lines        4879     4952      +73     
==========================================
+ Hits         2497     2504       +7     
- Misses       2239     2302      +63     
- Partials      143      146       +3

Impacted Files	Coverage Δ
pkg/controllers/reconciler.go	`4.25% <ø> (ø)`
cmd/gpu_plugin/rm/gpu_plugin_resource_manager.go	`70.37% <16.66%> (-12.82%)`	⬇️
pkg/controllers/gpu/controller.go	`36.64% <100.00%> (+0.67%)`	⬆️

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

mythi

some quick comments

cmd/gpu_plugin/rm/gpu_plugin_resource_manager.go

Signed-off-by: Ukri Niemimuukko <[email protected]>

mythi · 2023-03-29T10:34:17Z

deployments/operator/rbac/role.yaml

+- apiGroups:
+  - ""
+  resources:
+  - nodes/proxy
+  verbs:
+  - get
+  - list


I think this part is wrong and should not be needed at all. For fractional resources, inteldeviceplugins-gpu-manager-role is used.

I tried it without initially, but the API server prevented operator spawning roles which the operator itself didn't have access to.

Looking at this again, I'll double check this. The roles are included in the reconciler.go and I can't recall anymore if this is really needed.

I already made and pushed the change to remove the snippet, as it seemed to work without it. But I then remembered to also test with resourcemanagement, there I got this error in the operator:

E0330 08:39:26.403430 1 reconciler.go:419] "intel-device-plugins-manager: unable to create ClusterRoleBinding" err=< clusterrolebindings.rbac.authorization.k8s.io "gpu-manager-rolebinding" is forbidden: user "system:serviceaccount:inteldeviceplugins-system:default" (groups=["system:serviceaccounts" "system:serviceaccounts:inteldeviceplugins-system" "system:authenticated"]) is attempting to grant RBAC permissions not currently held: {APIGroups:[""], Resources:["nodes/proxy"], Verbs:["get" "list"]}

So the snippet is required.

So the snippet is required.

It means the operator serviceaccount always needs the same RBAC permissions required by resourcemanager and currently cluster admins have no way to drop them. This probably requires further thinking in the future.

Yeah, unfortunately. I suppose we can't reconfigure operator's roles when someone creates a gpu crd with resource manager enabled.

the alternative I see is that the elevated serviceaccount for gpu resource manager is created during operator deploy time

uniemimu

lgtm

deployments/gpu_plugin/overlays/fractional_resources/add-podresource-mount.yaml

In large clusters and with resource management, the load from gpu-plugins can become heavy for the api-server. This change will start fetching pod listings from kubelet and use api-server as a backup. Any other error than timeout will also move the logic back to using api-server. Signed-off-by: Tuomas Katila <[email protected]>

mythi · 2023-03-31T05:44:02Z

@tkatila do we plan to backport both commits in release-0.26? If only 974829f is needed then we may want to swap the order to make it apply cleanly?

tkatila · 2023-03-31T06:41:20Z

@tkatila do we plan to backport both commits in release-0.26? If only 974829f is needed then we may want to swap the order to make it apply cleanly?

Yes, both would go to the 0.26.1 release.

tkatila requested review from hj-johannes-lee, bart0sh, mythi and uniemimu as code owners March 23, 2023 12:32

tkatila force-pushed the gpu-kubelet-and-less-frequent-cleanup branch from 0495667 to b8c88bf Compare March 23, 2023 14:05

tkatila force-pushed the gpu-kubelet-and-less-frequent-cleanup branch from b8c88bf to 56743e9 Compare March 23, 2023 15:30

mythi reviewed Mar 24, 2023

View reviewed changes

cmd/gpu_plugin/rm/gpu_plugin_resource_manager.go Show resolved Hide resolved

cmd/gpu_plugin/rm/gpu_plugin_resource_manager.go Outdated Show resolved Hide resolved

cmd/gpu_plugin/rm/gpu_plugin_resource_manager.go Outdated Show resolved Hide resolved

randomize cleanup interval and increase it to 20 minutes

3feb185

Signed-off-by: Ukri Niemimuukko <[email protected]>

tkatila force-pushed the gpu-kubelet-and-less-frequent-cleanup branch from 56743e9 to d3a6f82 Compare March 24, 2023 08:52

mythi reviewed Mar 29, 2023

View reviewed changes

uniemimu previously approved these changes Mar 29, 2023

View reviewed changes

tkatila dismissed uniemimu’s stale review via b0d2b34 March 30, 2023 07:45

tkatila force-pushed the gpu-kubelet-and-less-frequent-cleanup branch 2 times, most recently from b0d2b34 to d3a6f82 Compare March 30, 2023 08:49

hj-johannes-lee reviewed Mar 30, 2023

View reviewed changes

deployments/gpu_plugin/overlays/fractional_resources/add-podresource-mount.yaml Outdated Show resolved Hide resolved

tkatila force-pushed the gpu-kubelet-and-less-frequent-cleanup branch from d3a6f82 to 974829f Compare March 30, 2023 09:48

mythi approved these changes Apr 3, 2023

View reviewed changes

mythi merged commit 85b6795 into intel:main Apr 3, 2023

tkatila mentioned this pull request Apr 25, 2023

backport qat and gpu commits to 0.26 branch #1394

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Gpu: Kubelet API use and less frequent cleanup #1363

Gpu: Kubelet API use and less frequent cleanup #1363

tkatila commented Mar 23, 2023

codecov-commenter commented Mar 23, 2023 •

edited

Loading

mythi left a comment

mythi Mar 29, 2023

tkatila Mar 29, 2023

tkatila Mar 29, 2023

tkatila Mar 30, 2023

mythi Mar 31, 2023

tkatila Mar 31, 2023

mythi Mar 31, 2023

uniemimu left a comment

mythi commented Mar 31, 2023

tkatila commented Mar 31, 2023 •

edited

Loading

Gpu: Kubelet API use and less frequent cleanup #1363

Gpu: Kubelet API use and less frequent cleanup #1363

Conversation

tkatila commented Mar 23, 2023

codecov-commenter commented Mar 23, 2023 • edited Loading

Codecov Report

mythi left a comment

Choose a reason for hiding this comment

mythi Mar 29, 2023

Choose a reason for hiding this comment

tkatila Mar 29, 2023

Choose a reason for hiding this comment

tkatila Mar 29, 2023

Choose a reason for hiding this comment

tkatila Mar 30, 2023

Choose a reason for hiding this comment

mythi Mar 31, 2023

Choose a reason for hiding this comment

tkatila Mar 31, 2023

Choose a reason for hiding this comment

mythi Mar 31, 2023

Choose a reason for hiding this comment

uniemimu left a comment

Choose a reason for hiding this comment

mythi commented Mar 31, 2023

tkatila commented Mar 31, 2023 • edited Loading

codecov-commenter commented Mar 23, 2023 •

edited

Loading

tkatila commented Mar 31, 2023 •

edited

Loading