AKS Vertical down scaling deletes all AKS infrastructure #1939

Closed
markbangert opened this issue Nov 3, 2020 · 9 comments
Labels
resolution/answer-provided Provided answer to issue, question or feedback. Terraform

Comments

@markbangert

What happened:
I just tried to scale down my AKS cluster from two DS2v2 machines to two DS1v2 machines via a Terraform infrastructure-as-code deployment. The deployment failed with the following error message:

"Message="System node pool must use VM sku with more than 2 cores and 4GB memory. Nodepool name: default."

So far so good... Apparently it is not possible to use AKS with the DS1v2 machines. However, what happened to the existing AKS infrastructure is alarming: all resources, i.e., the AKS instance and the associated VM scale set, were automatically deleted. This should definitely not be possible.

What you expected to happen:
Either the vertical scaling should be performed as desired, or only the error message should pop up without any changes to the infrastructure.

How to reproduce it (as minimally and precisely as possible):
Run a Terraform AKS update on an existing cluster that changes the VM size to DS1v2.
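
For reference, a minimal sketch of the kind of configuration involved (resource names and surrounding arguments are illustrative, not taken from the actual project):

```hcl
resource "azurerm_resource_group" "example" {
  name     = "example-rg"
  location = "westeurope"
}

# Minimal AKS definition; editing vm_size in the default node pool is
# the change that triggers the behaviour described in this issue.
resource "azurerm_kubernetes_cluster" "example" {
  name                = "example-aks"
  location            = azurerm_resource_group.example.location
  resource_group_name = azurerm_resource_group.example.name
  dns_prefix          = "example-aks"

  default_node_pool {
    name       = "default"
    node_count = 2
    vm_size    = "Standard_DS2_v2" # changing this to "Standard_DS1_v2" fails
  }

  identity {
    type = "SystemAssigned"
  }
}
```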

Environment:

  • Kubernetes version: 1.12.0
  • Terraform version: 0.13.2
  • Size of cluster: 2 worker nodes
  • General description of workloads in the cluster: HTTP microservices
@ghost ghost added the triage label Nov 3, 2020
@ghost

ghost commented Nov 3, 2020

Hi markbangert, AKS bot here 👋
Thank you for posting on the AKS Repo, I'll do my best to get a kind human from the AKS team to assist you.

I might be just a bot, but I'm told my suggestions are normally quite good, as such:

  1. If this case is urgent, please open a Support Request so that our 24/7 support team may help you faster.
  2. Please abide by the AKS repo Guidelines and Code of Conduct.
  3. If you're having an issue, is it covered in the AKS Troubleshooting guides or AKS Diagnostics?
  4. Make sure you're subscribed to the AKS Release Notes to keep up to date with all that's new on AKS.
  5. Make sure there isn't a duplicate of this issue already reported. If there is, feel free to close this one and '+1' the existing issue.
  6. If you have a question, do take a look at our AKS FAQ. We place the most common ones there!

@yangl900

yangl900 commented Nov 5, 2020

hi @markbangert ,

I looked at the history, and based on your Terraform version and the error message I can see the operations performed on your cluster.

At 2020-11-03T10:42:29Z UTC there was a delete-cluster operation from Terraform, which finished at 2020-11-03T10:49:36Z.

The failed Terraform deployment you mention then happened at 2020-11-03T10:50:36Z; at that point the cluster had already been deleted by the previous operation. So it was in fact a create operation, and it was rejected because of the VM size.

As you expected, a failed operation won't change the state of your cluster.

@yangl900 yangl900 added the resolution/answer-provided Provided answer to issue, question or feedback. label Nov 5, 2020
@ghost ghost removed the triage label Nov 5, 2020
@markbangert
Author

Thank you for looking into this. I did some other runs as well, and I now understand that any change of the VM size will delete and recreate the node pool. This in turn means that a Terraform deployment with a changed VM size will not just upgrade the cluster and keep the existing workloads (as I expected) but will recreate the entire cluster from scratch, which leaves me puzzled about how to deal with this.

If we want to prevent the cluster from being deleted (especially in the production environment), we should not allow dynamic adjustments of the VM size in our deployment pipelines. However, we want to run different (smaller and cheaper) setups in development environments, so it would be great to simply control this via a pipeline environment variable. Any ideas?
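
For illustration, a sketch of the kind of setup meant here, assuming the size is driven by a single variable (all names are hypothetical; resource group as in the earlier sketch):

```hcl
# Hypothetical: let each pipeline choose the node size, e.g. by
# exporting TF_VAR_node_vm_size in the environment before `terraform apply`.
variable "node_vm_size" {
  type        = string
  description = "VM SKU for the default node pool"
  default     = "Standard_DS2_v2"
}

resource "azurerm_kubernetes_cluster" "example" {
  name                = "example-aks"
  location            = azurerm_resource_group.example.location
  resource_group_name = azurerm_resource_group.example.name
  dns_prefix          = "example-aks"

  default_node_pool {
    name       = "default"
    node_count = 2
    vm_size    = var.node_vm_size # ForceNew: changing it replaces the cluster
  }

  identity {
    type = "SystemAssigned"
  }

  # Guard rail for production: any plan that would destroy (and thus
  # replace) the cluster fails instead of being applied.
  lifecycle {
    prevent_destroy = true
  }
}
```

With prevent_destroy set, a changed vm_size in production produces a plan error rather than silently recreating the cluster.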

@yangl900
Copy link

yangl900 commented Nov 5, 2020

The behavior you described sounds like Terraform behavior on the client side. That doesn't sound right to me either. @palma21 do you have any idea?

@markbangert typically you just need to create a new node pool with a different VM size, then delete the old node pool, without needing to re-create the cluster. Not all VM sizes can be switched in place, e.g. they may run on different hardware; this is an Azure API limitation. That's the reason you cannot do vertical scaling in place.
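
As a rough sketch of that workflow in Terraform (resource and pool names are illustrative; this applies to additional user pools, not the default/system pool):

```hcl
# Step 1: add a new user node pool with the desired VM size.
# Step 2: cordon/drain the old pool's nodes so workloads move over.
# Step 3: remove the old pool's resource block and apply again.
resource "azurerm_kubernetes_cluster_node_pool" "bigger" {
  name                  = "bigger"
  kubernetes_cluster_id = azurerm_kubernetes_cluster.example.id
  vm_size               = "Standard_DS3_v2"
  node_count            = 2
}
```

Deleting the old azurerm_kubernetes_cluster_node_pool resource in a later apply removes just that pool; the cluster itself is untouched.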

@palma21
Member

palma21 commented Nov 6, 2020

I believe that's TF behavior; @tombuildsstuff @grayzu could confirm.

@tombuildsstuff

@markbangert due to historical limitations within AKS, Terraform doesn't support cycling the default node pool at this time, but it does allow updating external node pools (for some fields) via the separate resource.

At this point in time the VM SKU can't be updated in place for external node pools either; if that becomes possible, we can look to support it in the future. Do you have the Terraform plan showing the changes which'd be applied here?

FWIW, ultimately we'd like to remove the inline/default node pool altogether, so that all of these fields can be updated/cycled on the node pool without destroying the cluster. However, my understanding is that the service still requires one at initial provisioning time, so unfortunately this isn't possible to model at this time. As such, whilst we may look to support cycling the default node pool in the future (we have a lot of questions to answer to be able to do so in practice), this'd allow users to lean on the default behaviour of the AKS service itself.

@markbangert
Author

Thank you all for your feedback. I really appreciate your time!

@tombuildsstuff Just to confirm that I got you right: you are saying that the vertical scaling operation would work on a non-default node pool? Is this the case because the default node pool would jump in to take over the workloads while the non-default node pool is deleted and recreated afterwards with a modified VM SKU, or is there a fundamentally different update mechanism at work for the non-default pools?

And regarding the Terraform plan: is there any way I can send you this privately? It is from a customer project and I am not 100% happy sharing it right here.

@tombuildsstuff

@markbangert at this point the vm_size field is ForceNew in both the inline/default node pool and the separate resource. What I'm suggesting is that, whilst we can't fix this for the inline/default node pool (for various reasons), we should be able to fix it for the separate resource by updating the size of those virtual machines in place, but that's not supported today.
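
For context, a ForceNew attribute shows up in a plan roughly like this (illustrative output, not from the reporter's actual plan):

```
  # azurerm_kubernetes_cluster.example must be replaced
-/+ resource "azurerm_kubernetes_cluster" "example" {
      ~ default_node_pool {
          ~ vm_size = "Standard_DS2_v2" -> "Standard_DS1_v2" # forces replacement
        }
    }
```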

Taking a look through, hashicorp/terraform-provider-azurerm#7093 appears to be tracking this already; as such, would you mind subscribing to that one for updates?

With regards to the plan, in retrospect I don't think it's necessary, since both of these fields are ForceNew. I think we can infer that's the only reason the resource is being replaced/cycled at this point, so we can ignore that for now 👍

@markbangert
Author

Thank you all again. I will subscribe to hashicorp/terraform-provider-azurerm#7093 and close this issue for the time being.

@ghost ghost locked as resolved and limited conversation to collaborators Dec 6, 2020