
When creating AKS clusters with the autoscaler enabled, do not make an update API call to the agent pool service based on a difference in node count #2444

Merged

Conversation

@LochanRn (Member) commented Jul 1, 2022

What type of PR is this?
/kind bug

What this PR does / why we need it:
Resolves #2443

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #2443

Special notes for your reviewer:

Please confirm that if this PR changes any image versions, then that's the sole change this PR makes.

TODOs:

  • squashed commits
  • includes documentation
  • adds unit tests

Release note:

None

@k8s-ci-robot added labels on Jul 1, 2022: release-note-none (Denotes a PR that doesn't merit a release note), kind/bug (Categorizes issue or PR as related to a bug), cncf-cla: yes (Indicates the PR's author has signed the CNCF CLA), size/XS (Denotes a PR that changes 0-9 lines, ignoring generated files).
@devigned (Contributor) left a comment:

Great catch and solution! Just a bit of feedback on how to surface the behavior to the operator.

// When autoscaling is set, the count of the nodes differs based on the autoscaler and should not depend on the
// count present in machinepool, azuremanagedmachinepool, hence we should not make an update api call based
// on difference in count.
if profile.EnableAutoScaling != nil && existingProfile.Count != nil {
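
For context, a minimal sketch of the technique quoted above (skip the count-driven update when the autoscaler owns the count), assuming an illustrative local struct and helper; this is not the PR's actual diff:

package agentpools

import "reflect"

// agentPoolProfile is an illustrative stand-in for the AKS agent pool profile type.
type agentPoolProfile struct {
	EnableAutoScaling *bool
	Count             *int32
	// other fields elided
}

// needsUpdate decides whether an agent pool update API call is required. When
// autoscaling is enabled, the live count is mirrored into the desired profile so
// that a count difference alone never triggers an update.
func needsUpdate(desired, existing agentPoolProfile) bool {
	if desired.EnableAutoScaling != nil && *desired.EnableAutoScaling && existing.Count != nil {
		desired.Count = existing.Count
	}
	return !reflect.DeepEqual(desired, existing)
}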
Contributor:

This seems completely reasonable to me.

However, I wonder if we could add a condition or something to let the user know what is going on. There are sure to be questions like, "Why is my cluster scaled up so high when I gave it the desired state of X?". I could imagine a user getting quite irritated by CAPZ ignoring their desired state.

Contributor:

+1

I agree we should try to surface something observable to the user to let them know that the actual count is different because autoscaling is enabled. I think this is the same thing that kubernetes-sigs/cluster-api#5685 is trying to solve.

Contributor:

also related to kubernetes-sigs/cluster-api-provider-aws#2022

we should try to solve the issue consistently across providers as much as possible to ensure a consistent user experience across CAPI providers

Member Author:

After reading the above two threads, I feel that in addition to the existing changes and the suggestions above to surface to the user that the nodes have scaled up or down because of the autoscaler, we should also adjust the MachinePool's desired replica count to match the node count present in the cluster.
That way even the MachinePool's desired count and phase/spec will be accurate.

@Jont828 (Contributor) commented Jul 22, 2022

Hey thanks for making this change! I'm working on an agent pool service refactor in #2479. Are you planning on merging this PR soon or do you want to hold it and add additional changes to surface info to the user? If you plan on merging soon, I can hold my PR and rebase after it merges.

@LochanRn (Member Author):

@Jont828 Sorry for the delay, will try to update the PR by tomorrow.

@LochanRn force-pushed the fix-autoscalar-issue branch from 8674677 to 084bd5c on July 26, 2022 20:18
@k8s-ci-robot added the size/M label (Denotes a PR that changes 30-99 lines, ignoring generated files) and removed the size/XS label on Jul 26, 2022
@LochanRn force-pushed the fix-autoscalar-issue branch from 084bd5c to 8ad08fd on July 26, 2022 20:45
@LochanRn (Member Author):

@devigned @CecileRobertMichon can you please take a look at this PR again?

@jackfrancis added the area/managedclusters label (Issues related to managed AKS clusters created through the CAPZ ManagedCluster Type) on Jul 27, 2022
@jackfrancis (Contributor):

Overall lgtm, added some minor comments to address before final merging.

Thanks!

@LochanRn force-pushed the fix-autoscalar-issue branch from 8ad08fd to 75fa84f on July 27, 2022 11:07
@@ -219,6 +219,16 @@ func (s *ManagedMachinePoolScope) DeleteLongRunningOperationState(name, service
futures.Delete(s.ControlPlane, name, service)
}

// UpdateMachinePoolReplicas updates capi machinepool replica count.
func (s *ManagedMachinePoolScope) UpdateMachinePoolReplicas(replicas *int32) error {
patch := client.MergeFrom(s.MachinePool.DeepCopy())
Contributor:

I'm unable to comment on whether or not this patch implementation is what we want; everything else lgtm in this PR.

Thanks again!

Maybe @CecileRobertMichon can give this a final thumbs up or offer feedback for improvement.

@@ -82,6 +82,8 @@ const (
ManagedClusterRunningCondition clusterv1.ConditionType = "ManagedClusterRunning"
// AgentPoolsReadyCondition means the AKS agent pools exist and are ready to be used.
AgentPoolsReadyCondition clusterv1.ConditionType = "AgentPoolsReady"
// AutoScalerUpdatedMachinePoolReplicas means the Azure autoscaler scaled the machine pool replicas up or down.
AutoScalerUpdatedMachinePoolReplicas clusterv1.ConditionType = "AutoScalerUpdatedMachinePoolReplicas"
Contributor:

I'm not sure if a Condition is the best way to signal this to the user. Cluster API condition conventions state:

  • Condition types MUST have a consistent polarity (i.e. "True = good");

  • Condition types SHOULD have one of the following suffix:

    • Ready, for resources which represent an ongoing status, like ControlplaneReady or MachineDeploymentsReady.
    • Succeeded, for resources which run-to-completion, e.g. CreateVPCSucceeded

When the above suffix are not adequate for a specific condition type, other suffix with positive meaning COULD be used (e.g. Completed, Healthy); however, it is recommended to balance this flexibility with the objective to provide a consistent condition naming across all the Cluster API objects

In this case AutoScalerUpdatedMachinePoolReplicas isn't really a good/bad type of condition. In fact, we're never setting the condition to False.

How about adding an annotation to the MachinePool directly (as opposed to a condition on the AzureManagedMachinePool object) that indicates to the user that the MachinePool replica count is being managed by cluster-autoscaler (similar idea to kubernetes-sigs/cluster-api#5685). That way it's directly on the object that the user might go modify to edit the replica count which is being ignored.

@Jont828 @jackfrancis thoughts?

Contributor:

I like the idea of an annotation on the MachinePool (sort of mimics the comments masthead on generated code: // DO NOT MODIFY, WILL BE OVERWRITTEN!!!)

So the above would be something we'd continually reconcile so long as autoscaling was set to true on the underlying AKS node pool.

And then, additionally, I think a v2 info logging statement would be apropos for every time we inherit a different value from AKS and then update capi/capz accordingly. Is that log statement already happening somewhere in the flow as a result of the update, or do we need to add that additional verbosity?
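
For illustration, a V(2) statement along these lines could accompany the replica update (a sketch only; the logger handle and the names agentPoolName and nodeCount are assumptions, not existing CAPZ identifiers):

// Sketch: log at verbosity 2 whenever the replica count is inherited from AKS.
logger.V(2).Info("inheriting replica count from AKS because autoscaling is enabled",
	"agentPool", agentPoolName,
	"replicas", nodeCount)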

Member Author:

I need some help naming the annotation. Would this work?
sigs.k8s.io/autoscaler-is-managing-machinepool-replica: true

Contributor:

An annotation prefix that comes up a lot in the kubernetes/autoscaler project is cluster-autoscaler.kubernetes.io/

How about cluster-api.cluster-autoscaler.kubernetes.io/replicas: true?

Contributor:

my 2 cents: replicas is typically a number and it seems counter-intuitive to use it as a boolean. We should also not be using the autoscaler annotation prefix since this annotation is not actually set by the kubernetes/autoscaler controllers.

How about

cluster.x-k8s.io/replicas-managed-by-autoscaler: true

(cluster.x-k8s.io is the same prefix used by all CAPI annotations)

Contributor:

also we should probably align with whatever annotation CAPI decides to go with in kubernetes-sigs/cluster-api#6991

One POC used cluster.x-k8s.io/externally-managed-replicas
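
For illustration, reconciling such an annotation onto the MachinePool could look roughly like the sketch below. The key follows the cluster.x-k8s.io/replicas-managed-by-autoscaler suggestion above, and the helper name and placement are assumptions; the naming had not been finalized at this point.

package scope

import (
	"context"

	expv1 "sigs.k8s.io/cluster-api/exp/api/v1beta1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// replicasManagedByAutoscalerAnnotation is a sketch of the proposed key, not a merged constant.
const replicasManagedByAutoscalerAnnotation = "cluster.x-k8s.io/replicas-managed-by-autoscaler"

// annotateReplicasManagedByAutoscaler marks the MachinePool so users can see that
// edits to .spec.replicas will be overridden while AKS autoscaling is enabled.
func annotateReplicasManagedByAutoscaler(ctx context.Context, c client.Client, mp *expv1.MachinePool) error {
	patch := client.MergeFrom(mp.DeepCopy())
	if mp.Annotations == nil {
		mp.Annotations = map[string]string{}
	}
	mp.Annotations[replicasManagedByAutoscalerAnnotation] = "true"
	return c.Patch(ctx, mp, patch)
}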

func (s *ManagedMachinePoolScope) UpdateMachinePoolReplicas(replicas *int32) error {
patch := client.MergeFrom(s.MachinePool.DeepCopy())
s.MachinePool.Spec.Replicas = replicas
if err := s.Client.Patch(context.Background(), s.MachinePool, patch); err != nil {
Contributor:

we shouldn't use context.Background() here, we have an actual context we can pass in from the parent func

Contributor:

I assume you mean that the calling func (func (s *Service) Reconcile) has its own context object, and we should pass it along to UpdateMachinePoolReplicas here. I agree.

So we'd update the interface accordingly:

UpdateMachinePoolReplicas(ctx context.Context, replicas *int32) error

And then the actual method definition and its invocation in the flow of Reconcile() w/ the updated signature.
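
Putting the two comments together, a sketch of the ctx-threaded method assembled from the fragments quoted above (the error wrapping is an assumption, and this may differ from what was merged; it assumes context, client, and errors are already imported in the file):

// UpdateMachinePoolReplicas updates the CAPI MachinePool replica count (sketch).
func (s *ManagedMachinePoolScope) UpdateMachinePoolReplicas(ctx context.Context, replicas *int32) error {
	patch := client.MergeFrom(s.MachinePool.DeepCopy())
	s.MachinePool.Spec.Replicas = replicas
	if err := s.Client.Patch(ctx, s.MachinePool, patch); err != nil {
		return errors.Wrap(err, "failed to update MachinePool replicas")
	}
	return nil
}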


@LochanRn (Member Author):

@CecileRobertMichon @jackfrancis I've updated the PR with the new changes; could you please review it? I still need to test it, though.

@Jont828 I'm not sure if this PR can be merged soon; if it is blocking or keeping your PR waiting, please go ahead and merge yours and I will rebase :)

@LochanRn force-pushed the fix-autoscalar-issue branch from 8e6ff7e to f01bb87 on July 30, 2022 06:45
@k8s-ci-robot added the needs-rebase label (Indicates a PR cannot be merged because it has merge conflicts with HEAD) on Aug 4, 2022
@CecileRobertMichon (Contributor):

@LochanRn sorry for the delay on this, would you be able to rebase so we can merge it?

@LochanRn force-pushed the fix-autoscalar-issue branch 2 times, most recently from 8ecdb8b to 9d1b58d on August 21, 2022 07:33
@LochanRn (Member Author):

@CecileRobertMichon @jackfrancis can you please do a final review of this :)

@CecileRobertMichon (Contributor) left a comment:

code lgtm, will let @jackfrancis confirm he was able to test this and it's ready to go

/lgtm

@k8s-ci-robot added the lgtm label ("Looks good to me", indicates that a PR is ready to be merged) on Aug 22, 2022
@LochanRn force-pushed the fix-autoscalar-issue branch from 9d1b58d to 38acb3f on August 22, 2022 20:53
@k8s-ci-robot removed the lgtm label on Aug 22, 2022
@LochanRn force-pushed the fix-autoscalar-issue branch from d3b7980 to a8e8b31 on August 23, 2022 09:33
@jackfrancis (Contributor) left a comment:

/lgtm
/approve

thank you @LochanRn!

@jackfrancis (Contributor):

/cherry-pick release-1.4

@k8s-ci-robot added the lgtm label ("Looks good to me", indicates that a PR is ready to be merged) on Aug 23, 2022
@k8s-infra-cherrypick-robot

@jackfrancis: once the present PR merges, I will cherry-pick it on top of release-1.4 in a new PR and assign it to you.

In response to this:

/cherry-pick release-1.4

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot (Contributor):

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jackfrancis

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the approved label (Indicates a PR has been approved by an approver from all required OWNERS files) on Aug 23, 2022
@LochanRn (Member Author):

/lgtm /approve

thank you @LochanRn!

yay thank you as well :)

@k8s-ci-robot merged commit 1e9a1b1 into kubernetes-sigs:main on Aug 23, 2022
@k8s-ci-robot added this to the v1.5 milestone on Aug 23, 2022
@k8s-infra-cherrypick-robot

@jackfrancis: new pull request created: #2594

In response to this:

/cherry-pick release-1.4

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Labels
  • approved - Indicates a PR has been approved by an approver from all required OWNERS files.
  • area/managedclusters - Issues related to managed AKS clusters created through the CAPZ ManagedCluster Type.
  • cncf-cla: yes - Indicates the PR's author has signed the CNCF CLA.
  • kind/bug - Categorizes issue or PR as related to a bug.
  • lgtm - "Looks good to me", indicates that a PR is ready to be merged.
  • release-note-none - Denotes a PR that doesn't merit a release note.
  • size/M - Denotes a PR that changes 30-99 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

When creating AKS clusters with autoscaler enabled, the update API is making too many requests
7 participants