Avoid expesive pointer copy in capi nodegroup #6796

enxebre · 2024-05-07T11:46:11Z

What type of PR is this?

/kind bug
/kind cleanup

What this PR does / why we need it:

Autoscaler only chooses and scales up a nodegroup each run. Meaning that if you have multiple unschedulable pods targeting different groups each of them (e.g. via nodeSelector), autoscaler would only be able to find a place for them sequentially, therefore having a penalty time (that keeps degrading as the number of pods/nodegroups increases) for the latest pods in "the queue". When having ~90 scalable resources with capi provider this results in long times to provision nodes for latests pods in the "queue" (>40min).

The func (p *provider) NodeGroups() []cloudprovider.NodeGroup { is called around 16 times every runOnce loop. Currently it takes ~20 seconds with ~90 MachineSets. This avoids expensive loop and copy pointers resulting in ~5 seconds each NodeGroups call, and so in ~4min saving each run.

Which issue(s) this PR fixes:

Mitigates #6784

Special notes for your reviewer:

Does this PR introduce a user-facing change?

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

enxebre · 2024-05-07T11:46:36Z

/hold
a bit

elmiko

nice optimization @enxebre , i hesitated a little thinking about the log line but reviewing my old log files i think that line is not as useful as we might have thought originally.

/approve
/lgtm

k8s-ci-robot · 2024-05-07T13:29:28Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: elmiko, enxebre

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~cluster-autoscaler/cloudprovider/clusterapi/OWNERS~~ [elmiko,enxebre]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

elmiko · 2024-05-07T14:19:00Z

/lgtm

enxebre · 2024-05-09T14:36:40Z

some benchMarks
before

➜  cluster-autoscaler git:(release-4.16) ✗ go test -v -bench=. -count 5 -run=NodeGroupsBenchmark ./cloudprovider/clusterapi/...
goos: darwin
goarch: amd64
pkg: k8s.io/autoscaler/cluster-autoscaler/cloudprovider/clusterapi
cpu: Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
BenchmarkNodeGroups
I0509 16:35:53.758484   50802 clusterapi_controller.go:426] Using version "v1alpha3" for API group "cluster.x-k8s.io"
I0509 16:35:53.758676   50802 clusterapi_controller.go:541] Resource "machinedeployments" available
I0509 16:35:53.758740   50802 clusterapi_controller.go:541] Resource "machinedeployments" available
I0509 16:35:53.758749   50802 clusterapi_controller.go:541] Resource "machinesets" available
I0509 16:35:53.758754   50802 clusterapi_controller.go:541] Resource "machines" available
I0509 16:35:53.758760   50802 clusterapi_controller.go:541] Resource "machinepools" available
BenchmarkNodeGroups/NodeGroups
BenchmarkNodeGroups/NodeGroups-12         	     798	   1417817 ns/op
BenchmarkNodeGroups/NodeGroups-12         	     814	   1447729 ns/op
BenchmarkNodeGroups/NodeGroups-12         	     711	   1454166 ns/op
BenchmarkNodeGroups/NodeGroups-12         	     777	   1422128 ns/op
BenchmarkNodeGroups/NodeGroups-12         	     788	   1486356 ns/op
PASS
ok  	k8s.io/autoscaler/cluster-autoscaler/cloudprovider/clusterapi	7.960s

After

➜  cluster-autoscaler git:(capi-nodegroups) ✗ go test -v -bench=. -count 5 -run=NodeGroupsBenchmark ./cloudprovider/clusterapi/...
goos: darwin
goarch: amd64
pkg: k8s.io/autoscaler/cluster-autoscaler/cloudprovider/clusterapi
cpu: Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
BenchmarkNodeGroups
I0509 16:32:44.577941   50440 clusterapi_controller.go:426] Using version "v1alpha3" for API group "cluster.x-k8s.io"
I0509 16:32:44.578106   50440 clusterapi_controller.go:543] Resource "machinedeployments" available
I0509 16:32:44.578137   50440 clusterapi_controller.go:543] Resource "machinedeployments" available
I0509 16:32:44.578142   50440 clusterapi_controller.go:543] Resource "machinesets" available
I0509 16:32:44.578145   50440 clusterapi_controller.go:543] Resource "machines" available
I0509 16:32:44.578147   50440 clusterapi_controller.go:543] Resource "machinepools" available
BenchmarkNodeGroups/NodeGroups
BenchmarkNodeGroups/NodeGroups-12         	    1215	    948669 ns/op
BenchmarkNodeGroups/NodeGroups-12         	    1120	   1011799 ns/op
BenchmarkNodeGroups/NodeGroups-12         	    1159	   1052832 ns/op
BenchmarkNodeGroups/NodeGroups-12         	     999	   1023880 ns/op
BenchmarkNodeGroups/NodeGroups-12         	    1122	   1020194 ns/op
PASS
ok  	k8s.io/autoscaler/cluster-autoscaler/cloudprovider/clusterapi	7.701s

enxebre · 2024-05-09T14:42:26Z

/hold cancel

elmiko

very cool addition of the benchmarks, thanks!

/lgtm

Avoid expesive pointer copy in capi nodegroup

k8s-ci-robot requested a review from detiber May 7, 2024 11:46

k8s-ci-robot added the area/cluster-autoscaler label May 7, 2024

k8s-ci-robot requested a review from hardikdr May 7, 2024 11:46

k8s-ci-robot added area/provider/cluster-api Issues or PRs related to Cluster API provider approved Indicates a PR has been approved by an approver from all required OWNERS files. labels May 7, 2024

k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 7, 2024

enxebre mentioned this pull request May 7, 2024

cluster-autoscaler clusterapi provider performance degrades when there are a high number of node groups #6784

Open

elmiko reviewed May 7, 2024

View reviewed changes

k8s-ci-robot assigned elmiko May 7, 2024

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 7, 2024

enxebre force-pushed the capi-nodegroups branch from c6a2cd0 to 04ff5eb Compare May 7, 2024 13:43

k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed lgtm "Looks good to me", indicates that a PR is ready to be merged. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels May 7, 2024

Avoid expesive pointer copy in capi nodegroup

31fdc39

enxebre force-pushed the capi-nodegroups branch from 04ff5eb to 31fdc39 Compare May 7, 2024 13:47

k8s-ci-robot added lgtm "Looks good to me", indicates that a PR is ready to be merged. and removed lgtm "Looks good to me", indicates that a PR is ready to be merged. labels May 7, 2024

Add benchmark for capi nodeGroup

8cfe11c

enxebre force-pushed the capi-nodegroups branch from 761fd1e to 8cfe11c Compare May 9, 2024 14:42

k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 9, 2024

elmiko reviewed May 9, 2024

View reviewed changes

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 9, 2024

k8s-ci-robot merged commit 423e6c5 into kubernetes:master May 9, 2024
6 checks passed

enxebre pushed a commit to enxebre/autoscaler that referenced this pull request May 9, 2024

Merge pull request kubernetes#6796 from enxebre/capi-nodegroups

f789a93

Avoid expesive pointer copy in capi nodegroup

enxebre mentioned this pull request May 9, 2024

NO-JIRA: UPSTREAM: 6796: Avoid expesive pointer copy in capi nodegroup openshift/kubernetes-autoscaler#298

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Avoid expesive pointer copy in capi nodegroup #6796

Avoid expesive pointer copy in capi nodegroup #6796

enxebre commented May 7, 2024 •

edited

Loading

enxebre commented May 7, 2024

elmiko left a comment

k8s-ci-robot commented May 7, 2024

elmiko commented May 7, 2024

enxebre commented May 9, 2024

enxebre commented May 9, 2024

elmiko left a comment

Avoid expesive pointer copy in capi nodegroup #6796

Avoid expesive pointer copy in capi nodegroup #6796

Conversation

enxebre commented May 7, 2024 • edited Loading

What type of PR is this?

What this PR does / why we need it:

Which issue(s) this PR fixes:

Special notes for your reviewer:

Does this PR introduce a user-facing change?

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

enxebre commented May 7, 2024

elmiko left a comment

Choose a reason for hiding this comment

k8s-ci-robot commented May 7, 2024

elmiko commented May 7, 2024

enxebre commented May 9, 2024

enxebre commented May 9, 2024

elmiko left a comment

Choose a reason for hiding this comment

enxebre commented May 7, 2024 •

edited

Loading