When creating AKS clusters with autoscaler enabled, do not make an update API call to the agent pool service based on a difference in node count #2444
Conversation
Great catch and solution! Just a bit of feedback on how to surface the behavior to the operator.
// When autoscaling is enabled, the node count is driven by the autoscaler and should not depend on the
// count present in the MachinePool/AzureManagedMachinePool, so we should not make an update API call based
// on a difference in count.
if profile.EnableAutoScaling != nil && existingProfile.Count != nil {
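The guard above can be sketched as pure decision logic; the function and parameter names below are illustrative stand-ins, not CAPZ's actual helpers:

```go
package main

import "fmt"

// needsCountUpdate sketches the decision behind the quoted guard: when
// autoscaling is enabled, the cluster-autoscaler owns the node count, so
// count drift alone must not trigger an agent pool update API call.
// This is an illustrative stand-in, not the PR's real code path.
func needsCountUpdate(autoscalingEnabled bool, desired, observed int32) bool {
	if autoscalingEnabled {
		return false // the autoscaler manages the count; ignore spec drift
	}
	return desired != observed
}

func main() {
	fmt.Println(needsCountUpdate(true, 3, 7))  // autoscaler scaled up: skip the update
	fmt.Println(needsCountUpdate(false, 3, 7)) // real drift: update is needed
}
```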
This seems completely reasonable to me.
However, I wonder if we could add a condition or something to let the user know what is going on. There are sure to be questions like, "Why is my cluster scaled up so high when I gave it the desired state of X?". I could imagine a user getting quite irritated by CAPZ ignoring their desired state.
+1
I agree we should try to surface something observable to the user to let them know that the actual count is different as autoscaling is enabled. I think this is the same thing that kubernetes-sigs/cluster-api#5685 is trying to solve.
also related to kubernetes-sigs/cluster-api-provider-aws#2022
we should try to solve the issue consistently across providers as much as possible to ensure a consistent user experience across CAPI providers
After reading the above two threads, I feel that along with the existing changes and the suggestions above to surface something to the user indicating that the nodes have scaled up or down because of the autoscaler, we should also adjust the MachinePool's desired replica count to match the node count present in the cluster.
That way even the MachinePool's desired count and the phase in the spec will be accurate.
Hey, thanks for making this change! I'm working on an agent pool service refactor in #2479. Are you planning on merging this PR soon, or do you want to hold it and add additional changes to surface info to the user? If you plan on merging soon, I can hold my PR and rebase after it merges.
@Jont828 Sorry for the delay, will try to update the PR by tomorrow.
@devigned @CecileRobertMichon can you please take a look at this PR again?
Overall lgtm, added some minor comments to address before final merging. Thanks!
azure/scope/managedmachinepool.go
Outdated
@@ -219,6 +219,16 @@ func (s *ManagedMachinePoolScope) DeleteLongRunningOperationState(name, service
	futures.Delete(s.ControlPlane, name, service)
}

// UpdateMachinePoolReplicas updates the CAPI MachinePool replica count.
func (s *ManagedMachinePoolScope) UpdateMachinePoolReplicas(replicas *int32) error {
	patch := client.MergeFrom(s.MachinePool.DeepCopy())
I'm unable to comment on whether or not this patch implementation is what we want, everything else lgtm in this PR.
Thanks again!
Maybe @CecileRobertMichon can give this a final thumbs up or offer feedback for improvement.
api/v1beta1/conditions_consts.go
Outdated
@@ -82,6 +82,8 @@ const (
	ManagedClusterRunningCondition clusterv1.ConditionType = "ManagedClusterRunning"
	// AgentPoolsReadyCondition means the AKS agent pools exist and are ready to be used.
	AgentPoolsReadyCondition clusterv1.ConditionType = "AgentPoolsReady"
	// AutoScalerUpdatedMachinePoolReplicas means the Azure autoscaler scaled the machine pools up or down.
	AutoScalerUpdatedMachinePoolReplicas clusterv1.ConditionType = "AutoScalerUpdatedMachinePoolReplicas"
I'm not sure if a Condition is the best way to signal this to the user. Cluster API condition conventions state:
Condition types MUST have a consistent polarity (i.e. "True = good");
Condition types SHOULD have one of the following suffix:
- Ready, for resources which represent an ongoing status, like ControlplaneReady or MachineDeploymentsReady.
- Succeeded, for resources which run-to-completion, e.g. CreateVPCSucceeded
When the above suffix are not adequate for a specific condition type, other suffix with positive meaning COULD be used (e.g. Completed, Healthy); however, it is recommended to balance this flexibility with the objective to provide a consistent condition naming across all the Cluster API objects
In this case AutoScalerUpdatedMachinePoolReplicas isn't really a good/bad type of condition. In fact, we're never setting the condition to False.
How about adding an annotation to the MachinePool directly (as opposed to a condition on the AzureManagedMachinePool object) that indicates to the user that the MachinePool replica count is being managed by cluster-autoscaler (similar idea to kubernetes-sigs/cluster-api#5685). That way it's directly on the object that the user might go modify to edit the replica count which is being ignored.
@Jont828 @jackfrancis thoughts?
I like the idea of an annotation on the MachinePool
(sort of mimics the comments masthead on generated code: // DO NOT MODIFY, WILL BE OVERWRITTEN!!!)
So the above would be something we'd continually reconcile so long as autoscaling was set to true on the underlying AKS node pool.
And then, additionally, I think a v2 info logging statement would be apropos for every time we inherit a different value from AKS and then update capi/capz accordingly. Is that log statement already happening somewhere in the flow as a result of the update, or do we need to add that additional verbosity?
Need some help naming the annotation.
Would this work?
sigs.k8s.io/autoscaler-is-managing-machinepool-replica: true
An annotation prefix that comes up a lot in the kubernetes/autoscaler project is cluster-autoscaler.kubernetes.io/.
How about cluster-api.cluster-autoscaler.kubernetes.io/replicas: true?
my 2 cents: replicas is typically a number and it seems counter-intuitive to use it as a boolean. We should also not be using the autoscaler annotation prefix since this annotation is not actually set by the kubernetes/autoscaler controllers.
How about cluster.x-k8s.io/replicas-managed-by-autoscaler: true? (cluster.x-k8s.io is the same prefix used by all CAPI annotations.)
also we should probably align with whatever annotation CAPI decides to go with in kubernetes-sigs/cluster-api#6991
One POC used cluster.x-k8s.io/externally-managed-replicas
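A minimal sketch of the annotation idea, assuming the POC key name from cluster-api#6991 (the final key is whatever CAPI settles on); a plain map stands in for ObjectMeta.Annotations:

```go
package main

import "fmt"

// Assumed key, taken from the POC mentioned above; hypothetical until
// CAPI finalizes the name.
const externallyManagedReplicasAnnotation = "cluster.x-k8s.io/externally-managed-replicas"

// markReplicasExternallyManaged sets the annotation so a user editing
// the replica count can see it is owned by the cluster-autoscaler.
// A plain map stands in for the object's ObjectMeta.Annotations.
func markReplicasExternallyManaged(annotations map[string]string) map[string]string {
	if annotations == nil {
		annotations = map[string]string{}
	}
	annotations[externallyManagedReplicasAnnotation] = "true"
	return annotations
}

func main() {
	anns := markReplicasExternallyManaged(nil)
	fmt.Println(anns[externallyManagedReplicasAnnotation]) // prints "true"
}
```

This would be reconciled continually while autoscaling is enabled on the underlying AKS node pool, and removed (or ignored) once it is disabled.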
azure/scope/managedmachinepool.go
Outdated
func (s *ManagedMachinePoolScope) UpdateMachinePoolReplicas(replicas *int32) error {
	patch := client.MergeFrom(s.MachinePool.DeepCopy())
	s.MachinePool.Spec.Replicas = replicas
	if err := s.Client.Patch(context.Background(), s.MachinePool, patch); err != nil {
we shouldn't use context.Background() here; we have an actual context we can pass in from the parent func
I assume you mean that the calling func (func (s *Service) Reconcile) has its own context object, and we should pass it along to UpdateMachinePoolReplicas here. I agree.
So we'd update the interface accordingly:
UpdateMachinePoolReplicas(ctx context.Context, replicas *int32) error
And then the actual method definition and its invocation in the flow of Reconcile() with the updated signature.
@CecileRobertMichon @jackfrancis updated the PR with the new changes; could you please review it? I still need to test it, though. @Jont828 not sure if this PR can be merged soon; if it is blocking or keeping your PR waiting, please go ahead and merge your PR and I will rebase :)
@LochanRn sorry for the delay on this, would you be able to rebase so we can merge it?
@CecileRobertMichon @jackfrancis can you please make a final review on this :)
code lgtm, will let @jackfrancis confirm he was able to test this and it's ready to go
/lgtm
/lgtm
/approve
thank you @LochanRn!
/cherry-pick release-1.4
@jackfrancis: once the present PR merges, I will cherry-pick it on top of release-1.4 in a new PR and assign it to you.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: jackfrancis. The full list of commands accepted by this bot can be found here; the pull request process is described here.
yay, thank you as well :)
@jackfrancis: new pull request created: #2594
What type of PR is this?
/kind bug
What this PR does / why we need it:
Resolves #2443
Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #2443
Special notes for your reviewer:
Please confirm that if this PR changes any image versions, then that's the sole change this PR makes.
TODOs:
Release note: