
r/kubernetes_cluster: autoscaling-related improvements #4256

Closed
wants to merge 2 commits into from

Conversation

invidian
Contributor

@invidian invidian commented Sep 5, 2019

Currently, on every update, agentPoolProfiles are built from scratch
based on the pool definition from Terraform. With autoscaled clusters,
this attempts to remove the 'Count' property, which triggers a manual
scaling action, which is forbidden for clusters with autoscaling
enabled. That breaks any updates to the cluster, including adding tags
or updating the Kubernetes version.

With this PR, we always try to fetch the definition of the cluster from
the API and, on update, we only modify profile parameters using the
Terraform configuration, rather than building the profiles from scratch.
This makes sure that all parameters set by the API are preserved.

To enforce this behavior, we now fail when updating the resource if we
are unable to fetch it from the API.

The 'expandKubernetesClusterAgentPoolProfiles' function now accepts a
profiles slice to work on and returns a modified copy, rather than a
newly built profiles slice.
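The fetch-and-overlay pattern described above can be sketched as follows. This is a hypothetical, simplified Go illustration only: the `agentPoolProfile` type and the `expandProfiles` signature here stand in for the real SDK types and are not the actual provider code.

```go
package main

import "fmt"

// agentPoolProfile is a hypothetical, simplified stand-in for the SDK's
// agent pool profile type used by the provider.
type agentPoolProfile struct {
	Name              string
	Count             *int32
	VMSize            string
	EnableAutoScaling *bool
}

// expandProfiles illustrates the pattern: it receives the profiles previously
// fetched from the API, overlays only the fields managed by the Terraform
// configuration, and returns the modified copy. Server-set values (such as
// Count on autoscaled pools) are preserved instead of being rebuilt.
func expandProfiles(existing []agentPoolProfile, configVMSize string) []agentPoolProfile {
	updated := make([]agentPoolProfile, len(existing))
	copy(updated, existing)
	for i := range updated {
		// Overlay only the Terraform-managed field; leave Count untouched.
		updated[i].VMSize = configVMSize
	}
	return updated
}

func main() {
	count := int32(3)
	existing := []agentPoolProfile{{Name: "default", Count: &count, VMSize: "Standard_D2_v2"}}
	out := expandProfiles(existing, "Standard_D4_v2")
	fmt.Println(*out[0].Count, out[0].VMSize)
}
```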

Also, currently autoscaling cannot be disabled gracefully on the
cluster, and the cluster cannot be scaled when disabling autoscaling
either.

This PR adds validation that when autoscaling is disabled, the
min_count and max_count parameters must also be unset.
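A minimal Go sketch of such a validation rule, using nil pointers to model unset HCL attributes. The function name and signature are illustrative, not the provider's actual code.

```go
package main

import (
	"errors"
	"fmt"
)

// validateAutoScaleConfig is a hypothetical sketch of the validation the PR
// describes: when enable_auto_scaling is false, min_count and max_count
// must not be set (nil pointers model "unset" attributes).
func validateAutoScaleConfig(enableAutoScaling bool, minCount, maxCount *int) error {
	if !enableAutoScaling && (minCount != nil || maxCount != nil) {
		return errors.New("`min_count` and `max_count` must not be set when `enable_auto_scaling` is false")
	}
	return nil
}

func main() {
	min := 1
	// Setting min_count while autoscaling is disabled is rejected.
	fmt.Println(validateAutoScaleConfig(false, &min, nil) != nil)
	// With autoscaling enabled, min_count is allowed.
	fmt.Println(validateAutoScaleConfig(true, &min, nil) == nil)
}
```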

When Terraform disables autoscaling (enable_auto_scaling = false), it
also unsets the MinCount and MaxCount fields in the profile obtained
from the API.

This PR also adds a documentation note that the cluster cannot be
manually scaled when autoscaling is enabled, and suggests how to ignore
changes to the 'count' parameter, which will be dynamically changed by
the cluster autoscaler.
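For illustration, the suggested lifecycle workaround might look roughly like this; the exact attribute path is an assumption here and depends on the provider and Terraform versions in use:

```hcl
resource "azurerm_kubernetes_cluster" "example" {
  # ... other configuration ...

  lifecycle {
    # Let the cluster autoscaler manage the node count without Terraform
    # reporting drift on every plan.
    ignore_changes = ["agent_pool_profile.0.count"]
  }
}
```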

Closes #4075

@invidian
Contributor Author

Any updates on that?

@invidian
Contributor Author

invidian commented Oct 4, 2019

Ugh, there've been some big changes to the way the cluster is created and updated... Will rebase.

@invidian
Contributor Author

invidian commented Oct 4, 2019

Rebased.

@invidian
Contributor Author

invidian commented Oct 4, 2019

@tombuildsstuff any chance you could have a look? 🤞

@invidian
Contributor Author

invidian commented Oct 8, 2019

Resolved conflicts again.

@WodansSon WodansSon requested a review from katbyte October 9, 2019 21:53
@WodansSon WodansSon added this to the v1.36.0 milestone Oct 9, 2019
@WodansSon WodansSon requested a review from mbfrahry October 9, 2019 22:10
@tombuildsstuff tombuildsstuff modified the milestones: v1.36.0, v1.37.0 Oct 24, 2019
@wagnst
Contributor

wagnst commented Oct 24, 2019

Any update on this one?

@brianxieseattle

Can someone review this change and help get it merged into master as soon as possible?

@invidian
Contributor Author

invidian commented Oct 31, 2019

Rebased again. Can someone review?

@kim0

kim0 commented Nov 3, 2019

Desperately need this one! Hope @tombuildsstuff can get to it soon.

@invidian
Contributor Author

invidian commented Nov 3, 2019

To everyone interested in the patch, can you please try it out for your scenarios to make sure it's working as expected and report back here?

@invidian
Contributor Author

Any updates on that?

@tombuildsstuff
Contributor

hi @invidian

Thanks for this PR - apologies for the delay reviewing this!

I've spent some time over the last couple of weeks trying to work out how to consolidate this PR, #4543, #4676, #4046 and #4472, since they're all attempting to solve the same problem (a breaking behavioural change in the AKS API, where non-default node pools now have to be managed via a separate API) in different ways.

After spending some time investigating/experimenting with this I believe the best approach going forward is to introduce a replacement default_node_pool block which is limited to a single element and then deprecate the existing agent_pool_profiles block which can then be removed in 2.0.

This allows existing users to continue to use the agent_pool_profiles field if they need to and migrate across to the default_node_pool object on their own timeline. In addition this allows for Azure regions which haven't rolled these changes out (e.g. China/Germany/Government) to continue to use the existing functionality if necessary. The default_node_pool block can then become Required in 2.0 (at which point the existing agent_pool_profiles block will be removed). At the same time we can handle the other breaking behavioural change mentioned in #4465 by switching the default Node Pool type to VirtualMachineScaleSets.

Whilst this isn't ideal since users will need to migrate at some point - it seems preferable from a UX perspective to manage these as separate resources, rather than inline (which also allows users to order node pool creation/destruction if necessary).

As such, whilst we'd like to thank you for this contribution (and apologise again for the delay reviewing it!), ultimately we're going to take a different direction here, and thus I'm going to close this PR in favour of #4898, which introduces the new default_node_pool block mentioned above. Once #4898 is merged we should be able to push/rebase #4046, which introduces the new Node Pool resource; at which point we can then rebase #4472 - collectively allowing us to add support for all of this functionality.

Thanks!

@invidian
Contributor Author

Hi @tombuildsstuff, thanks for your feedback :) I understand the decision to close this PR. I hope the original issue I was addressing, #4075, will still be fixed by the changes you proposed. I'll try to review #4899.

In the future, it would be great to receive some feedback earlier. Even if not about the changes themselves, simply knowing that there are things happening in parallel which prevent such PRs from being addressed would be great.

@invidian invidian deleted the invidian/fix-updating-autoscaled-aks-cluster branch November 17, 2019 16:05
@tombuildsstuff
Contributor

@invidian

In the future, it would be great to receive some feedback earlier. Even if not about the changes themselves, simply knowing that there are things happening in parallel which prevent such PRs from being addressed would be great.

Agreed - apologies, that's my bad here. We've been mostly heads-down on the 2.0 work and so haven't had as much time as we'd like for GitHub notifications. From our side we're coming to the end of this work, so I'm hopeful we'll be able to do this for all PRs going forward.

@ghost

ghost commented Nov 26, 2019

This has been released in version 1.37.0 of the provider. Please see the Terraform documentation on provider versioning or reach out if you need any assistance upgrading. As an example:

provider "azurerm" {
  version = "~> 1.37.0"
}

# ... other configuration ...

@ghost

ghost commented Mar 29, 2020

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.

If you feel this issue should be reopened, we encourage creating a new issue linking back to this one for added context. If you feel I made an error 🤖 🙉 , please reach out to my human friends 👉 [email protected]. Thanks!

@ghost ghost locked and limited conversation to collaborators Mar 29, 2020
Successfully merging this pull request may close these issues.

Autoscaling enabled AKS Cluster leads to error on terraform apply, though no changes on aks planned
6 participants