Changing "azurerm_kubernetes_cluster" default_node_pool.vm_size forces replacement of the whole kubernetes cluster #7093
Comments
I have tested the manual operation on a test cluster, to be sure it's
technically doable. I have no idea if it can be resynced with the tfstate,
and I don't recommend it on a production tfstate.
On Wed, May 27, 2020 at 4:40 AM, Dennis Xu wrote:
I have the same problem, and after I manually do the operation, I found a drift in Terraform which cannot be fixed when I run terraform plan:
Error: flattening default_node_pool: The Default Agent Pool "default" was
not found
Do you have the same drift?
|
@thomas-riccardi this isn’t a duplicate of #6058 (although that allows for new node pools, the default one isn’t cycled) - but is something we’re looking into. |
It would be great if we could do the same as on GKE. There you have the option to remove the default node pool during cluster creation. From the terraform GKE docs: |
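For reference, the GKE pattern being referred to looks roughly like the sketch below (a minimal example with placeholder names and sizes, not quoted verbatim from the GKE docs):

```hcl
resource "google_container_cluster" "example" {
  name     = "example-cluster"
  location = "us-central1"

  # GKE allows discarding the automatically created default node pool
  # right after cluster creation, so every node pool can be managed as
  # its own resource and replaced independently of the cluster.
  remove_default_node_pool = true
  initial_node_count       = 1
}

resource "google_container_node_pool" "primary" {
  name       = "primary"
  location   = google_container_cluster.example.location
  cluster    = google_container_cluster.example.name
  node_count = 3

  node_config {
    machine_type = "e2-standard-4"
  }
}
```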
just to be sure, right now this is only doable through the Azure CLI? So if I were to create a new node pool with terraform, set it as mode = "System" and then try to change the VM size or OS disk size of the default_node_pool, this situation would still be detected as needing to recreate the cluster? |
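For context, an additional system-mode node pool can be declared in Terraform roughly as sketched below (placeholder names and sizes); as the maintainers explain further down, this alone does not stop the provider from recreating the cluster when default_node_pool changes:

```hcl
resource "azurerm_kubernetes_cluster_node_pool" "system2" {
  name                  = "system2"
  kubernetes_cluster_id = azurerm_kubernetes_cluster.example.id
  vm_size               = "Standard_D4s_v3"
  node_count            = 3

  # "System" marks this pool as able to host critical system pods,
  # alongside (not instead of) the cluster's default_node_pool.
  mode = "System"
}
```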
Report from my experience. I was in the following situation:
Those three objects were created with terraform. I've manually deleted (using Azure portal) default and foo. Only bar remained. What I've done in terraform:
In the end, terraform is happy about my manual migration and does not report any further needed changes 👍 |
This sounds like a great temporary solution. We need to make a few changes to our default node pool but can't afford to have the cluster recreated in the process. Are there any adverse consequences to performing this manual migration via |
If you mean using |
Gotcha, thanks for that! Yeah, I was mostly wondering whether whatever design decision led to this behaviour might have other implications. Perhaps someone from the AzureRM provider team or the AKS team could shed some light on whether this is safe to do to a production cluster (after testing in a staging cluster, of course) and whether there will be any long-term negative effects or incompatibilities with the cluster, or with how the azurerm provider treats it in the future. |
Are there any changes planned here for the near future? I would also like (if possible) to get someone with low-level expertise to say something about the safety of doing this to a production cluster. |
We deleted a production cluster by changing os_disk_type on the default node pool. |
Are there any plans to change this? AKS (via CLI) only expects one of the nodepools to be a System nodepool. Why does Terraform not have parity with the CLI/ARM API? |
We needed to increase VM size due to higher load on our AKS cluster. As others have mentioned, AKS documentation clearly walks through the procedure to create a new System nodepool, drain the old node pool, and then simply delete it. Now our terraform state can no longer be used, because it demands to destroy the entire cluster. In a production environment, it seems like a bizarre workaround to be forced to create a new nodepool that simply has the correct "default" nodepool name just so that Terraform will accept it. |
At the moment we have to manually create a second system node pool, delete the "default" one and let Terraform recreate it to change its settings without re-creating the whole cluster. Since not having a "default" node pool is legal in AKS, and furthermore this concept does not even apply any longer, why does this provider still enforce it? |
@zerodayyy because unfortunately the API still requires one to create the cluster at this point in time, and the AKS Team have stated they don't plan to change that behaviour/requirement, meaning the default node pool block remains required. We should look to support cycling the default node pool in the future, but unfortunately that's currently blocked on this issue, where the API doesn't provide a means of reconciling clusters/node pools: Azure/AKS#1972 |
Our team mistakenly replaced one of our AKS clusters due to this as well. The most damaging part of this for us is that this also replaces the node pool's resource group, which contains other Azure resources provisioned by AKS, most notably the Azure Storage Account that our PVs were provisioned in. So, not only is your AKS cluster at risk, but external resources that you never intended to be ephemeral may be as well. |
Hi @tombuildsstuff, will the cycling of the default node pool you mentioned - currently blocked by Azure/AKS#1972 - fix the issue in this thread? In other words, will cycling help in changing the node size or disk size in a node pool? Is it part of the fix? My understanding is that it will help Terraform catch the correct state, but not change the default node pool configuration. |
Maybe I'm missing something, but I think we can keep the default node pool and just need to allow overriding/implementing lifecycle: create_before_destroy for this sub-resource. The code then needs to handle this: if it is create-before-destroy, then do exactly that. This way we satisfy the current API and support changing e.g. the node size. |
We looked into this slightly more deeply. I think we can implement this without API changes. We would consider changing the state to reference the node pool id and then have the implementation check which properties would be changed, and either update the pool in place or replace it. Of course the schema of the resource would need to change accordingly. If we provide a PR, is it likely you would approve the solution? @tombuildsstuff |
@alex0ptr unfortunately we can't do that until this issue is resolved, since at this point in time it's not possible to know when the new node pool is reliably available in an automated manner (without polling the cluster itself, which the provider may not have access to). Once the above upstream issue is resolved we can look into solving this via introducing a temporary node pool, deleting the old default node pool and then reintroducing a new default node pool. At this time, given the bug above, we can't reliably know when the new node pool becomes available such that we could tear down the old one. The alternative here would be for the AKS API to not require using a default node pool. Unfortunately the AKS Team have previously said that's not something that's planned, as such we're kind of in limbo until the above issue is resolved (which will also let us poll more reliably during creation, until the cluster becomes available). Apologies that there's not much more we can do to support this until the upstream issue is resolved. As such unfortunately we wouldn't accept a PR for this at this time due to the limitations described above which could lead to larger issues. Once the upstream issue's resolved we can take a look into this - so I'd recommend reaching out to the AKS Team about this if you'd like to see this functionality. Thanks! |
I think we can just go with a new pool? My proposal, more clearly: we string-append all values that would normally create a new node pool, hash it and add it as a suffix to the name of the node pool. That way we have no name/id conflict and can just use this one as the default node pool.
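At the configuration level, the ideas in the two proposals above can be approximated today for separately-managed (non-default) node pools, by deriving the pool name from the properties that would otherwise force a new pool and using create_before_destroy. A hypothetical sketch (AKS Linux node pool names are limited to 12 lowercase alphanumeric characters, hence the short prefix):

```hcl
locals {
  # Hash the properties that force a new pool so any change yields a new name.
  pool_hash = substr(sha1(join(",", [var.vm_size, tostring(var.os_disk_size_gb)])), 0, 5)
}

resource "azurerm_kubernetes_cluster_node_pool" "workers" {
  name                  = "wrk${local.pool_hash}"
  kubernetes_cluster_id = azurerm_kubernetes_cluster.example.id
  vm_size               = var.vm_size
  os_disk_size_gb       = var.os_disk_size_gb
  node_count            = 3

  lifecycle {
    # A changed name means the replacement pool can be created before
    # the old one is destroyed, avoiding a name/id conflict.
    create_before_destroy = true
  }
}
```

The default_node_pool block is nested inside azurerm_kubernetes_cluster and cannot carry its own lifecycle settings, which is why the provider would have to implement this internally, as the reply below discusses.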
AHA! Yes, we see this regularly too. I'll try to get some attention to this... |
@alex0ptr that would still encounter the issues outlined above (and then be subject to other issues, like failing deployments where Azure Policy is configured for various naming conventions, etc.) - it's unfortunate, but ultimately there's not much we can do with this until Azure/AKS#1972 is resolved |
@stephybun the issue is closed now, should Azure/AKS#1972 be looked at again, and closed? |
@marcindulak updates on that issue suggest that we're still waiting for the manual reconciliation feature to go GA. I think that issue should remain open until it does and we've had a chance to take a look and see how/if this can be incorporated into the resource to prevent a state mismatch when something goes wrong. |
This functionality has been released in v3.47.0 of the Terraform Provider. Please see the Terraform documentation on provider versioning or reach out if you need any assistance upgrading. For further feature requests or bug reports with this functionality, please create a new GitHub issue following the template. Thank you! |
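If you're picking up the new behaviour, pinning the provider to at least that release looks roughly like this (a sketch; adjust the constraint to your own upgrade policy):

```hcl
terraform {
  required_providers {
    azurerm = {
      source = "hashicorp/azurerm"
      # v3.47.0 is the release referenced above.
      version = ">= 3.47.0"
    }
  }
}
```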
I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues. |
Community Note
Terraform (and AzureRM Provider) Version
Affected Resource(s)
azurerm_kubernetes_cluster
Terraform Configuration Files
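(The original configuration files were not captured in this copy of the issue; the sketch below is a minimal, hypothetical stand-in that shows the relevant shape only.)

```hcl
resource "azurerm_resource_group" "example" {
  name     = "example-aks-rg"
  location = "West Europe"
}

resource "azurerm_kubernetes_cluster" "example" {
  name                = "example-aks"
  location            = azurerm_resource_group.example.location
  resource_group_name = azurerm_resource_group.example.name
  dns_prefix          = "exampleaks"

  default_node_pool {
    name       = "default"
    node_count = 3
    # Changing this value is what triggers the forced replacement described below.
    vm_size = "Standard_D2s_v3"
  }

  identity {
    type = "SystemAssigned"
  }
}
```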
Expected Behavior
Replacing the default node-pool vm_size somehow updates/replaces the default node-pool with the new vm_size, ideally with some seamless rolling replacement of nodes.
Actual Behavior
Replacing the default node-pool vm_size actually forces the replacement of the whole resource, i.e. the whole cluster.
Steps to Reproduce
1. terraform apply
2. Change default_node_pool.vm_size: I want bigger machines on my default node pool
3. terraform apply
Result: Terraform plans to replace the whole cluster.
Important Factoids
- The cluster uses the default_node_pool property, so it's not the old transition/breaking change issue from agent pools.
- az aks nodepool list doesn't report "system" for any node pool of the cluster; all have "mode": "User", even the default... there may be a strange edge case here.
- az aks nodepool update -g myResourceGroup --cluster-name myAKSCluster -n default2 --mode system sets my new node pool as system. Result: now I have "mode": "System" on that node pool.
- Deleting the default node pool then worked!