Deal with azurerm updates #12

Open · 1 of 4 tasks
batpad opened this issue Apr 29, 2022 · 6 comments

batpad commented Apr 29, 2022

Recently, when deploying an unrelated change, the deploy process failed because of updates to the azurerm provider that caused things to break. One was the deprecation of the addon_profile property on the azurerm kubernetes_cluster resource, and the other was a change in the default properties of the Public IP resource.

Azure does not deal very well with updating properties in place, and in both these cases it would try to delete and re-create the resource for ANY change to its properties. For now, we have set the lifecycle property for the cluster as well as the IP address to ignore any changes: https://github.com/developmentseed/pearl-backend/blob/develop/deployment/terraform/resources/aks.tf#L2

For the Public IP address, I feel setting the lifecycle property to always ignore changes is fine - I'm okay treating the Public IP address as immutable after creation. We can add a comment in the file saying that if one needs to change properties of the public IP address, one should do it in the Azure console and replicate the same changes in the tf file.
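
To make that concrete, here is a minimal sketch of the ignore-changes approach; the resource and attribute values below are illustrative, not the exact ones in deployment/terraform/resources:

```hcl
# Sketch only: names and values here are hypothetical, not the repo's actual config.
resource "azurerm_public_ip" "ingress" {
  name                = "example-ingress-ip"
  resource_group_name = azurerm_resource_group.main.name
  location            = azurerm_resource_group.main.location
  allocation_method   = "Static"
  sku                 = "Standard"

  # Treat this IP as immutable after creation: if a property must change,
  # change it in the Azure console and mirror the change in this file.
  lifecycle {
    ignore_changes = all
  }
}
```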

For the AKS resource, this is not fine - not managing the cluster via terraform after initial creation sounds like it would cause a lot of headaches down the road. The current problems with terraform re-creating the cluster are:

  • Fear of the unknown: we're not fully sure what might break if the entire cluster is re-created. This should be fine, but is something we should perhaps test separately.
  • Practically, terraform currently fails because of the way we are passing the cluster's credentials to the Helm terraform provider. Essentially, when using a kubernetes or helm provider, those terraform resources should not be applied in the same terraform module that creates the kubernetes_cluster resource. This is the Warning on this page: https://registry.terraform.io/providers/hashicorp/kubernetes/latest/docs . The suggested fix is to perform the terraform apply in two steps - the first applies just the cluster resource, and the second applies the kubernetes / helm resources that depend on values output by the cluster resource for auth. So we should test splitting our terraform apply into two steps, like in the example here: https://github.com/hashicorp/terraform-provider-kubernetes/blob/main/_examples/aks/README.md (see the sketch after this list).
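
As a rough illustration of that two-step layout, modeled on the HashiCorp AKS example - module names, paths, and variables here are assumptions, not our current setup:

```hcl
# Root module sketch: the cluster lives in its own sub-module so it can be
# applied on its own, before anything that needs its credentials.
module "aks-cluster" {
  source       = "./aks-cluster"        # hypothetical path
  cluster_name = var.cluster_name
  location     = var.location
}

# Applied in a second pass, once the cluster exists and its kubeconfig
# outputs are available to configure the kubernetes / helm providers.
module "kubernetes-config" {
  source       = "./kubernetes-config"  # hypothetical path
  cluster_name = module.aks-cluster.cluster_name
}

# The two-step apply would then look something like:
#   terraform apply -target=module.aks-cluster
#   terraform apply
```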

So, there are a few next actions here:

  • Pin the version of azurerm we are using so that these issues don't bite us randomly and we're a bit more in control of upgrade paths (see the provider-pin sketch after this list).
  • Add a comment to the Public IP resource indicating it is immutable after creation.
  • Split up the terraform apply step as per above.
  • Remove the lifecycle ignore_changes setting from the azurerm_kubernetes_cluster resource and test that it all works.
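
For the first item, pinning would look something like this; the exact version constraint below is illustrative, not the one we'd actually pick:

```hcl
terraform {
  required_providers {
    azurerm = {
      source  = "hashicorp/azurerm"
      # Pin (or at least upper-bound) the provider so upgrades are deliberate
      # rather than picked up implicitly on the next terraform init.
      version = "~> 2.99"  # illustrative constraint only
    }
  }
}
```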

This is all a bit non-ideal - I think we missed this because, while the big yellow Warning exists in the docs for the kubernetes provider, it does not exist in the docs for the helm provider here: https://registry.terraform.io/providers/hashicorp/helm/latest/docs#in-cluster-config - and this behaviour is definitely non-intuitive.

@geohacker I can probably work through these.

batpad self-assigned this Apr 29, 2022
batpad commented Apr 29, 2022

@geohacker created a PR to lock the azurerm version: #13

Splitting up the terraform apply into two steps is likely a bit more work. We'd need the aks-cluster to be a separate "module" in the terraform setup so that we can execute that module separately. I don't think I understand how dependencies and passing variables work across modules well enough to make this change confidently. You can see how it's done in the example repo: https://github.com/hashicorp/terraform-provider-kubernetes/blob/main/_examples/aks/ - it defines the aks-cluster as a separate sub-module, so you can run terraform plan / apply in two steps. A sketch of the credential hand-off between modules follows below.
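
For reference, the part I'm least sure about is that credential hand-off; in the upstream example it looks roughly like this (names and attributes below are a sketch based on that example, not our code):

```hcl
# aks-cluster/outputs.tf (sketch): expose what the helm / kubernetes providers need.
output "host" {
  value     = azurerm_kubernetes_cluster.main.kube_config[0].host
  sensitive = true
}

output "cluster_ca_certificate" {
  value     = azurerm_kubernetes_cluster.main.kube_config[0].cluster_ca_certificate
  sensitive = true
}

# Root module (sketch): wire the sub-module outputs into the helm provider.
# Client certificate / key (or exec auth) omitted here for brevity.
provider "helm" {
  kubernetes {
    host                   = module.aks-cluster.host
    cluster_ca_certificate = base64decode(module.aks-cluster.cluster_ca_certificate)
  }
}
```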

Let's chat about this - I can probably continue this on Monday, or maybe there's some way to short-circuit the complexity here.

geohacker commented

> we're not fully sure what might break if the entire cluster is re-created. This should be fine, but is something we should perhaps test separately.

I think stack re-creation is very dangerous, as we would lose all models, checkpoints, and AOIs that are stored in the storage container.

geohacker commented

Thank you for the detailed ticket @batpad. Yeah, agreed that this is non-ideal; hopefully splitting the apply phase won't be too complex. We still run the dev stack, so we can use that for further tests if you like.


batpad commented May 3, 2022

> I think stack re-creation is very dangerous, as we would lose all models, checkpoints, and AOIs that are stored in the storage container.

So this wouldn't be stack re-creation, just re-creation of the AKS cluster. I think this should be fine - the storage containers, etc. should still exist, and the cluster should just mount them correctly when it is re-created.

geohacker commented

Ah right ok that makes sense!
