Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Avoid expesive pointer copy in capi nodegroup #6796

Merged
merged 2 commits into from
May 9, 2024

Conversation

enxebre
Copy link
Member

@enxebre enxebre commented May 7, 2024

What type of PR is this?

/kind bug
/kind cleanup

What this PR does / why we need it:

Autoscaler only chooses and scales up a nodegroup each run. Meaning that if you have multiple unschedulable pods targeting different groups each of them (e.g. via nodeSelector), autoscaler would only be able to find a place for them sequentially, therefore having a penalty time (that keeps degrading as the number of pods/nodegroups increases) for the latest pods in "the queue". When having ~90 scalable resources with capi provider this results in long times to provision nodes for latests pods in the "queue" (>40min).

The func (p *provider) NodeGroups() []cloudprovider.NodeGroup { is called around 16 times every runOnce loop. Currently it takes ~20 seconds with ~90 MachineSets. This avoids expensive loop and copy pointers resulting in ~5 seconds each NodeGroups call, and so in ~4min saving each run.

Which issue(s) this PR fixes:

Mitigates #6784

Special notes for your reviewer:

Does this PR introduce a user-facing change?


Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels May 7, 2024
@k8s-ci-robot k8s-ci-robot requested a review from detiber May 7, 2024 11:46
@k8s-ci-robot k8s-ci-robot requested a review from hardikdr May 7, 2024 11:46
@k8s-ci-robot k8s-ci-robot added area/provider/cluster-api Issues or PRs related to Cluster API provider approved Indicates a PR has been approved by an approver from all required OWNERS files. labels May 7, 2024
@enxebre
Copy link
Member Author

enxebre commented May 7, 2024

/hold
a bit

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 7, 2024
Copy link
Contributor

@elmiko elmiko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice optimization @enxebre , i hesitated a little thinking about the log line but reviewing my old log files i think that line is not as useful as we might have thought originally.

/approve
/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 7, 2024
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: elmiko, enxebre

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@enxebre enxebre force-pushed the capi-nodegroups branch from c6a2cd0 to 04ff5eb Compare May 7, 2024 13:43
@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed lgtm "Looks good to me", indicates that a PR is ready to be merged. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels May 7, 2024
@enxebre enxebre force-pushed the capi-nodegroups branch from 04ff5eb to 31fdc39 Compare May 7, 2024 13:47
@elmiko
Copy link
Contributor

elmiko commented May 7, 2024

/lgtm

@k8s-ci-robot k8s-ci-robot added lgtm "Looks good to me", indicates that a PR is ready to be merged. and removed lgtm "Looks good to me", indicates that a PR is ready to be merged. labels May 7, 2024
@enxebre
Copy link
Member Author

enxebre commented May 9, 2024

some benchMarks
before

➜  cluster-autoscaler git:(release-4.16) ✗ go test -v -bench=. -count 5 -run=NodeGroupsBenchmark ./cloudprovider/clusterapi/...
goos: darwin
goarch: amd64
pkg: k8s.io/autoscaler/cluster-autoscaler/cloudprovider/clusterapi
cpu: Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
BenchmarkNodeGroups
I0509 16:35:53.758484   50802 clusterapi_controller.go:426] Using version "v1alpha3" for API group "cluster.x-k8s.io"
I0509 16:35:53.758676   50802 clusterapi_controller.go:541] Resource "machinedeployments" available
I0509 16:35:53.758740   50802 clusterapi_controller.go:541] Resource "machinedeployments" available
I0509 16:35:53.758749   50802 clusterapi_controller.go:541] Resource "machinesets" available
I0509 16:35:53.758754   50802 clusterapi_controller.go:541] Resource "machines" available
I0509 16:35:53.758760   50802 clusterapi_controller.go:541] Resource "machinepools" available
BenchmarkNodeGroups/NodeGroups
BenchmarkNodeGroups/NodeGroups-12         	     798	   1417817 ns/op
BenchmarkNodeGroups/NodeGroups-12         	     814	   1447729 ns/op
BenchmarkNodeGroups/NodeGroups-12         	     711	   1454166 ns/op
BenchmarkNodeGroups/NodeGroups-12         	     777	   1422128 ns/op
BenchmarkNodeGroups/NodeGroups-12         	     788	   1486356 ns/op
PASS
ok  	k8s.io/autoscaler/cluster-autoscaler/cloudprovider/clusterapi	7.960s

After

➜  cluster-autoscaler git:(capi-nodegroups) ✗ go test -v -bench=. -count 5 -run=NodeGroupsBenchmark ./cloudprovider/clusterapi/...
goos: darwin
goarch: amd64
pkg: k8s.io/autoscaler/cluster-autoscaler/cloudprovider/clusterapi
cpu: Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
BenchmarkNodeGroups
I0509 16:32:44.577941   50440 clusterapi_controller.go:426] Using version "v1alpha3" for API group "cluster.x-k8s.io"
I0509 16:32:44.578106   50440 clusterapi_controller.go:543] Resource "machinedeployments" available
I0509 16:32:44.578137   50440 clusterapi_controller.go:543] Resource "machinedeployments" available
I0509 16:32:44.578142   50440 clusterapi_controller.go:543] Resource "machinesets" available
I0509 16:32:44.578145   50440 clusterapi_controller.go:543] Resource "machines" available
I0509 16:32:44.578147   50440 clusterapi_controller.go:543] Resource "machinepools" available
BenchmarkNodeGroups/NodeGroups
BenchmarkNodeGroups/NodeGroups-12         	    1215	    948669 ns/op
BenchmarkNodeGroups/NodeGroups-12         	    1120	   1011799 ns/op
BenchmarkNodeGroups/NodeGroups-12         	    1159	   1052832 ns/op
BenchmarkNodeGroups/NodeGroups-12         	     999	   1023880 ns/op
BenchmarkNodeGroups/NodeGroups-12         	    1122	   1020194 ns/op
PASS
ok  	k8s.io/autoscaler/cluster-autoscaler/cloudprovider/clusterapi	7.701s

@enxebre enxebre force-pushed the capi-nodegroups branch from 761fd1e to 8cfe11c Compare May 9, 2024 14:42
@enxebre
Copy link
Member Author

enxebre commented May 9, 2024

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 9, 2024
Copy link
Contributor

@elmiko elmiko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

very cool addition of the benchmarks, thanks!

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 9, 2024
@k8s-ci-robot k8s-ci-robot merged commit 423e6c5 into kubernetes:master May 9, 2024
6 checks passed
enxebre pushed a commit to enxebre/autoscaler that referenced this pull request May 9, 2024
Avoid expesive pointer copy in capi nodegroup
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/cluster-autoscaler area/provider/cluster-api Issues or PRs related to Cluster API provider cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. lgtm "Looks good to me", indicates that a PR is ready to be merged. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants