Deal with azurerm updates #12

Open · 1 of 4 tasks
batpad opened this issue Apr 29, 2022 · 6 comments

batpad commented Apr 29, 2022

Recently, when deploying an unrelated change, the deploy process failed because of updates to the azurerm provider that caused things to break. One was the deprecation of the addon_profile property on the azurerm kubernetes_cluster resource, and the other was a change in the default properties of the Public IP resource.

Azure does not deal very well with updating properties in place, and in both these cases it would try to delete and re-create the resource for ANY change to its properties. For now, we have set the lifecycle property for the cluster as well as the IP address to ignore any changes: https://github.com/developmentseed/pearl-backend/blob/develop/deployment/terraform/resources/aks.tf#L2

For the Public IP address, I feel setting the lifecycle property to always ignore changes is fine - I'm okay treating the Public IP address as immutable after creation. We can add a comment in the file saying that if one needs to change properties of the public IP address, one should do it in the Azure console and replicate the same changes in the tf file.
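
To make that concrete, here is a minimal sketch of the ignore-changes approach; the resource and attribute values below are illustrative, not the exact ones in deployment/terraform/resources:

```hcl
# Sketch only: names and values here are hypothetical, not the repo's actual config.
resource "azurerm_public_ip" "ingress" {
  name                = "example-ingress-ip"
  resource_group_name = azurerm_resource_group.main.name
  location            = azurerm_resource_group.main.location
  allocation_method   = "Static"
  sku                 = "Standard"

  # Treat this IP as immutable after creation: if a property must change,
  # change it in the Azure console and mirror the change in this file.
  lifecycle {
    ignore_changes = all
  }
}
```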

For the AKS resource, this is not fine - not managing the cluster via terraform after initial creation sounds like it would cause a lot of headaches down the road. The current problems with terraform re-creating the cluster are:

  • Fear of the unknown: we're not fully sure what might break if the entire cluster is re-created. This should be fine, but is something we should perhaps test separately.
  • Practically, terraform currently fails because of the way we are passing the cluster's credentials to the Helm terraform provider. Essentially, when using a kubernetes or helm provider, those terraform resources should not be applied in the same terraform module that creates the kubernetes_cluster resource. This is the Warning on this page: https://registry.terraform.io/providers/hashicorp/kubernetes/latest/docs . The suggested fix is to perform the terraform apply in two steps - the first applies just the cluster resource, and the second applies the kubernetes / helm resources that depend on values output by the cluster resource for auth. So we should test splitting our terraform apply into two steps, like in the example here: https://github.com/hashicorp/terraform-provider-kubernetes/blob/main/_examples/aks/README.md (see the sketch after this list).
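
As a rough illustration of that two-step layout, modeled on the HashiCorp AKS example - module names, paths, and variables here are assumptions, not our current setup:

```hcl
# Root module sketch: the cluster lives in its own sub-module so it can be
# applied on its own, before anything that needs its credentials.
module "aks-cluster" {
  source       = "./aks-cluster"        # hypothetical path
  cluster_name = var.cluster_name
  location     = var.location
}

# Applied in a second pass, once the cluster exists and its kubeconfig
# outputs are available to configure the kubernetes / helm providers.
module "kubernetes-config" {
  source       = "./kubernetes-config"  # hypothetical path
  cluster_name = module.aks-cluster.cluster_name
}

# The two-step apply would then look something like:
#   terraform apply -target=module.aks-cluster
#   terraform apply
```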

So, there are a few next actions here:

  • Pin the version of azurerm we are using so that these issues don't bite us randomly and we're a bit more in control of upgrade paths (see the provider-pin sketch after this list).
  • Add a comment to the Public IP resource indicating it is immutable after creation.
  • Split up the terraform apply step as per above.
  • Remove the lifecycle ignore_changes setting from the azurerm_kubernetes_cluster resource and test that it all works.
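
For the first item, pinning would look something like this; the exact version constraint below is illustrative, not the one we'd actually pick:

```hcl
terraform {
  required_providers {
    azurerm = {
      source  = "hashicorp/azurerm"
      # Pin (or at least upper-bound) the provider so upgrades are deliberate
      # rather than picked up implicitly on the next terraform init.
      version = "~> 2.99"  # illustrative constraint only
    }
  }
}
```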

This is all a bit non-ideal - I think we missed this because, while the big yellow Warning exists in the docs for the kubernetes provider, it does not exist in the docs for the helm provider here: https://registry.terraform.io/providers/hashicorp/helm/latest/docs#in-cluster-config - and this behaviour is definitely non-intuitive.

@geohacker I can probably work through these.

batpad self-assigned this Apr 29, 2022
batpad commented Apr 29, 2022

@geohacker created a PR to lock the azurerm version: #13

Splitting up the terraform apply into two steps is likely a bit more work. We'd need the aks-cluster to be a separate "module" in the terraform setup so that we can execute that module separately. I don't think I understand how dependencies and passing variables work across modules well enough to make this change confidently. You can see how it's done in the example repo: https://github.com/hashicorp/terraform-provider-kubernetes/blob/main/_examples/aks/ - it defines the aks-cluster as a separate sub-module, so you can run terraform plan / apply in two steps. A sketch of the credential hand-off between modules follows below.
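
For reference, the part I'm least sure about is that credential hand-off; in the upstream example it looks roughly like this (names and attributes below are a sketch based on that example, not our code):

```hcl
# aks-cluster/outputs.tf (sketch): expose what the helm / kubernetes providers need.
output "host" {
  value     = azurerm_kubernetes_cluster.main.kube_config[0].host
  sensitive = true
}

output "cluster_ca_certificate" {
  value     = azurerm_kubernetes_cluster.main.kube_config[0].cluster_ca_certificate
  sensitive = true
}

# Root module (sketch): wire the sub-module outputs into the helm provider.
# Client certificate / key (or exec auth) omitted here for brevity.
provider "helm" {
  kubernetes {
    host                   = module.aks-cluster.host
    cluster_ca_certificate = base64decode(module.aks-cluster.cluster_ca_certificate)
  }
}
```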

Let's chat about this - I can probably continue this on Monday, or maybe there's some way to short-circuit the complexity here.

geohacker commented

> we're not fully sure what might break if the entire cluster is re-created. This should be fine, but is something we should perhaps test separately.

I think stack re-creation is very dangerous, as we would lose all models, checkpoints, and AOIs that are stored in the storage container.

geohacker commented

Thank you for the detailed ticket @batpad. Yeah, agreed that this is non-ideal; hopefully splitting the apply phase won't be too complex. We still run the dev stack, so we can use that for further tests if you like.


batpad commented May 3, 2022

> I think stack re-creation is very dangerous, as we would lose all models, checkpoints, and AOIs that are stored in the storage container.

So this wouldn't be stack re-creation, just re-creation of the AKS cluster. I think this should be fine - the storage containers, etc. should still exist, and the cluster should just mount them correctly when it is re-created.

geohacker commented

Ah right ok that makes sense!
