
Changing "azurerm_kubernetes_cluster" default_node_pool.vm_size forces replacement of the whole kubernetes cluster #7093

Closed
thomas-riccardi opened this issue May 26, 2020 · 26 comments · Fixed by #20628
Labels
enhancement · sdk/requires-newer-api-version (requires upgrading the API version in use) · service/kubernetes-cluster · upstream/microsoft (an upstream issue is blocking this issue/PR)
Comments

@thomas-riccardi

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Terraform (and AzureRM Provider) Version

Terraform v0.12.25
+ provider.azuread v0.9.0
+ provider.azurerm v2.10.0

Affected Resource(s)

  • azurerm_kubernetes_cluster

Terraform Configuration Files

resource "azurerm_kubernetes_cluster" "k8s" {
  name                = var.aks_name
  location            = var.location
  dns_prefix          = var.aks_dns_prefix
  resource_group_name = var.resource_group_app
  kubernetes_version  = "1.16.7"

  linux_profile {
    admin_username = var.vm_user_name
    ssh_key {
      key_data = file(var.public_ssh_key_path)
    }
  }

  addon_profile {
    http_application_routing {
      enabled = false
    }
  }

  default_node_pool {
    name           = "default"
    vm_size        = "Standard_D3_v2"
    vnet_subnet_id = data.azurerm_subnet.app_aks.id

    type = "VirtualMachineScaleSets"

    enable_auto_scaling = true
    min_count           = 2
    max_count           = 4
  }

  network_profile {
    network_plugin    = "azure"
    load_balancer_sku = "standard"
  }

  service_principal {
    client_id     = data.azuread_service_principal.aks.application_id
    client_secret = var.aks_service_principal_client_secret
  }
}

Expected Behavior

Changing the default node pool's vm_size should update or replace only that node pool with the new vm_size, ideally with a seamless rolling replacement of its nodes.

Actual Behavior

Changing the default node pool's vm_size actually forces replacement of the whole resource, i.e. the entire cluster.

Steps to Reproduce

  1. terraform apply
  2. change default_node_pool.vm_size: I want bigger machines on my default node pool
  3. terraform apply

result:

# azurerm_kubernetes_cluster.k8s must be replaced
-/+ resource "azurerm_kubernetes_cluster" "k8s" {
...
      ~ default_node_pool {
...
          ~ vm_size               = "Standard_D3_v2" -> "Standard_D4_v2" # forces replacement
...

Important Factoids

  • I initially created my cluster with the 2.10.0 azurerm Terraform provider, with the new default_node_pool property, so it's not the old transition/breaking-change issue from agent pools.
  • the AKS documentation explicitly says we can delete the system node pool as long as we already have a second one. https://docs.microsoft.com/en-us/azure/aks/use-multiple-node-pools
  • I tried to manually do the operation:
    • create a new node pool: OK
    • delete the old default: the portal says I must create another "system node pool" first, but the portal doesn't provide a way to specify that.
    • az aks nodepool list doesn't mention "system" for any node pool of the cluster; all have "mode": "User", even the default... there may be a strange edge case here.
    • az aks nodepool update -g myResourceGroup --cluster-name myAKSCluster -n default2 --mode system sets my new node pool as system. Result: now I have "mode": "System", on that node pool
    • finally deleting the old default node pool worked!
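
For reference, the manual steps above could also be expressed in Terraform as a separate system node pool. This is only a sketch (the pool name and VM size are hypothetical, and it assumes the mode argument is available on azurerm_kubernetes_cluster_node_pool in your provider version); the old default pool still has to be removed outside of Terraform:

resource "azurerm_kubernetes_cluster_node_pool" "default2" {
  # Hypothetical replacement system pool, mirroring the
  # "az aks nodepool add" + "--mode system" steps above.
  name                  = "default2"
  kubernetes_cluster_id = azurerm_kubernetes_cluster.k8s.id
  mode                  = "System"
  vm_size               = "Standard_D4_v2"
  vnet_subnet_id        = data.azurerm_subnet.app_aks.id

  enable_auto_scaling = true
  min_count           = 2
  max_count           = 4
}
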
@thomas-riccardi
Author

thomas-riccardi commented May 28, 2020 via email

@thomas-riccardi
Author

So this is in fact probably a duplicate of #6058, which has been fixed in #7233, which should be shipped in v2.14.0 if I understand correctly.

If my analysis is correct, don't hesitate to close the issue in favor of #6058.

Maybe let's wait for v2.14.0 to test whether it works.

@tombuildsstuff
Contributor

@thomas-riccardi this isn’t a duplicate of #6058 (although that allows for new node pools, the default one isn’t cycled) - but is something we’re looking into.

@jochenrichter

It would be great if we could do the same as on GKE. There you have the option to remove the default node pool during cluster creation.

From the terraform GKE docs:
remove_default_node_pool - (Optional) If true, deletes the default node pool upon cluster creation. If you're using google_container_node_pool resources with no default node pool, this should be set to true, alongside setting initial_node_count to at least 1.
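
For comparison, a minimal sketch of that GKE pattern (the resource names, location, and machine sizes here are illustrative, not taken from this issue):

resource "google_container_cluster" "primary" {
  name     = "example"
  location = "europe-west1"

  # Delete the bootstrap default node pool right after cluster creation
  # and manage node pools exclusively as separate resources.
  remove_default_node_pool = true
  initial_node_count       = 1
}

resource "google_container_node_pool" "primary_nodes" {
  name       = "primary"
  cluster    = google_container_cluster.primary.name
  location   = google_container_cluster.primary.location
  node_count = 2

  node_config {
    machine_type = "e2-standard-4"
  }
}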

@jemag

jemag commented Aug 12, 2020

Just to be sure, right now this is only doable through the Azure CLI? So if I were to create a new node pool with Terraform, set it as mode = "System", and then try to change the VM size or OS disk size of the default_node_pool, would this still be detected as requiring recreation of the cluster?

@PPACI

PPACI commented Sep 25, 2020

Report from my experience.

I was in the following situation:

  • One AKS created with a default node pool called default
  • One additional node pool called foo in system mode
  • One additional node pool called bar in system mode

Those three objects were created with terraform.

I've manually deleted (using Azure portal) default and foo. Only bar remained.

What I've done in terraform:

  • terraform state rm azurerm_kubernetes_cluster_node_pool.foo.
  • terraform state rm azurerm_kubernetes_cluster_node_pool.bar.
    • We must also remove this node pool from the state. Otherwise, Terraform will want to destroy it when we move it into the default node pool of the cluster object.
  • Update the config of aks_we.azurerm_kubernetes_cluster.main to have the default_node_pool block matching the configuration of bar.

In the end, Terraform is happy with my manual migration and does not report any further changes needed 👍
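
For illustration, after that state surgery the cluster's default_node_pool block simply describes bar; the values below are hypothetical:

default_node_pool {
  name           = "bar"
  vm_size        = "Standard_D4_v2"
  vnet_subnet_id = data.azurerm_subnet.app_aks.id

  enable_auto_scaling = true
  min_count           = 2
  max_count           = 4
}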

@brannondorsey

This sounds like a great temporary solution. We need to make a few changes to our default node pool but can't afford to have the cluster recreated in the process. Are there any adverse consequences to performing this manual migration via terraform state that could come back to bite us in the future, that anyone knows of? For instance, if we made a cluster or node pool k8s version upgrade later on, or something like that.

@PPACI

PPACI commented Oct 7, 2020

If you mean using terraform state rm as I did, then it should be OK.
Terraform will not distinguish a "migrated" cluster from a native one. Updating or scaling the node pool will be fine.

@brannondorsey

Gotcha, thanks for that! I was mostly wondering whether whatever design decision led to the azurerm_kubernetes_cluster resource behaving the way it currently does, where recreating the default_node_pool triggers a cluster recreation, might somehow be undermined by this manual state change via terraform state rm, or whether this is simply an implementation detail that can largely be ignored.

Perhaps someone from the azurerm provider team or the AKS team could shed some light on whether this is safe to do to a production cluster (after testing in a staging cluster, of course) and confirm there won't be long-term negative effects or incompatibilities with the cluster, or with how the azurerm provider treats it in the future.

@evenh
Contributor

evenh commented Sep 27, 2021

Are there any changes planned here for the near future? I would also like (if possible) for someone with low-level expertise to say something about the safety of doing this to a production cluster.

@marcoboffi

We deleted a production cluster by changing os_disk_type on the default node pool.
I think the default node pool concept should be removed, because it is no longer present in AKS.

@adityasundaramurthy-maersk

Are there any plans to change this? AKS (via CLI) only expects one of the nodepools to be a System nodepool. Why does Terraform not have parity with the CLI/ARM API?

@alicyn

alicyn commented May 25, 2022

We needed to increase VM size due to higher load on our AKS cluster. As others have mentioned, AKS documentation clearly walks through the procedure to create a new System nodepool, drain the old node pool, and then simply delete it. Now our terraform state can no longer be used, because it demands to destroy the entire cluster. In a production environment, it seems like a bizarre workaround to be forced to create a new nodepool that simply has the correct "default" nodepool name just so that Terraform will accept it.

@zerodayyy

At the moment we have to manually create a second system node pool, delete the "default" one, and let Terraform recreate it in order to change its settings without recreating the whole cluster. Since not having a "default" node pool is legal in AKS, and this concept no longer even applies, why does this provider still enforce it?

@tombuildsstuff
Contributor

@zerodayyy because unfortunately the API still requires one to create the cluster at this point in time, and the AKS Team have stated they don't plan to change that behaviour/requirement, meaning the default_node_pool is required.

We should look to support cycling the default node pool in the future, but unfortunately that's currently blocked on this issue, where the API doesn't provide a means of reconciling clusters/node pools: Azure/AKS#1972

@dwrusse

dwrusse commented Oct 18, 2022

Our team mistakenly replaced one of our AKS clusters due to this as well. The most damaging part for us is that this also replaces the node pool's resource group, which contains other Azure resources provisioned by AKS, most notably the Azure Storage Account that our PVs were provisioned in. So not only is your AKS cluster at risk, but external resources that you never intended to be ephemeral may be as well.

@melzayet

Hi @tombuildsstuff, will the cycling of the default node pool you mentioned, currently blocked by Azure/AKS#1972, fix the issue in this thread?

In other words, will cycling help with changing the node size or disk size in a node pool? Is it part of the fix? My understanding is that it will help Terraform capture the correct state, but not change the default node pool configuration.

@alex0ptr

Maybe I'm missing something, but I think we can keep the default node pool and just need to allow overriding/implementing lifecycle create_before_destroy for this sub-resource.

The code then needs to handle this: if it is create-before-destroy, do exactly that. This way we satisfy the current API and support changing e.g. the node size.

@alex0ptr

We looked into this a bit more deeply. I think we can implement this without API changes. We would consider changing the state to reference the node pool ID, and then have the implementation check which properties would change and either run createOrUpgrade or create a new node pool with a hash of the config as a suffix and then delete the old node pool.

Of course, the schema of the default_node_pool would no longer be the same as that of the node pool resource, since the properties that force recreation would not be the same anymore. This could lead to slightly increased maintenance effort.

If we provide a PR, is it likely you would approve the solution? @tombuildsstuff

@tombuildsstuff
Contributor

@alex0ptr unfortunately we can't do that until this issue is resolved, since at this point in time it's not possible to know when the new node pool is reliably available in an automated manner (without polling the cluster itself, which the provider may not have access to).

Once the above upstream issue is resolved we can look into solving this via introducing a temporary node pool, deleting the old default node pool and then reintroducing a new default node pool. At this time, given the bug above, we can't reliably know when the new node pool becomes available such that we could tear down the old one.

The alternative here would be for the AKS API to not require using a default node pool. Unfortunately the AKS Team have previously said that's not something that's planned, as such we're kind of in limbo until the above issue is resolved (which will also let us poll more reliably during creation, until the cluster becomes available).

Apologies that there's not much more we can do to support this until the upstream issue is resolved. As such unfortunately we wouldn't accept a PR for this at this time due to the limitations described above which could lead to larger issues. Once the upstream issue's resolved we can take a look into this - so I'd recommend reaching out to the AKS Team about this if you'd like to see this functionality.

Thanks!

@tombuildsstuff added the upstream/microsoft and sdk/requires-newer-api-version labels on Nov 18, 2022
@alex0ptr

via introducing a temporary node pool, deleting the old default node pool and then reintroducing a new default node pool.

I think we can go with just a new pool? To state my proposal more clearly: we string-append all values that would normally force a new node pool, hash them, and add the hash as a suffix to the node pool name. That way we have no name/ID conflict and can simply use the new pool as the default node pool.
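
For illustration, that suffixing scheme could be expressed with Terraform's built-in functions roughly like this (the set of hashed attributes and the name prefix are assumptions; note that AKS limits Linux node pool names to 12 lowercase alphanumeric characters):

locals {
  # Hash the attributes that would normally force a new node pool and
  # keep a short suffix, so every distinct configuration gets a new name.
  pool_hash = substr(sha1(jsonencode({
    vm_size         = "Standard_D4_v2"
    os_disk_size_gb = 128
  })), 0, 6)

  default_pool_name = "sys${local.pool_hash}"
}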

unfortunately we can't do that until Azure/AKS#1972 is resolved

AHA! Yes, we see this regularly too. I'll try to get some attention to this...

@tombuildsstuff
Contributor

@alex0ptr that would still encounter the issues outlined above (and would then be subject to other issues, like failing deployments where Azure Policy is configured for various naming conventions, etc.) - it's unfortunate, but ultimately there's not much we can do here until Azure/AKS#1972 is resolved.

@marcindulak

@stephybun the issue is closed now, should Azure/AKS#1972 be looked at again, and closed?

@stephybun
Member

@marcindulak updates on that issue suggest that we're still waiting for the manual reconciliation feature to go GA. I think that issue should remain open until it does and we've had a chance to take a look and see how/if this can be incorporated into the resource to prevent a state mismatch when something goes wrong.

@github-actions

This functionality has been released in v3.47.0 of the Terraform Provider. Please see the Terraform documentation on provider versioning or reach out if you need any assistance upgrading.

For further feature requests or bug reports with this functionality, please create a new GitHub issue following the template. Thank you!
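
For anyone landing here: if the released change works as described, the default node pool can now be cycled in place by giving the provider a temporary pool name to rotate through. A minimal sketch, assuming the temporary_name_for_rotation argument introduced for this (verify the exact argument name and the properties it covers against the provider docs for your version):

default_node_pool {
  name                        = "default"
  vm_size                     = "Standard_D4_v2" # changing the size no longer replaces the cluster
  temporary_name_for_rotation = "defaulttemp"

  enable_auto_scaling = true
  min_count           = 2
  max_count           = 4
}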

@github-actions

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions bot locked as resolved and limited conversation to collaborators Apr 10, 2023