
Recreating google_container_node_pool fails to delete instance_template when in use by google_compute_backend_service #3838

Closed
andyshinn opened this issue Jun 11, 2019 · 8 comments

andyshinn commented Jun 11, 2019

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment
  • If an issue is assigned to the "modular-magician" user, it is either in the process of being autogenerated, or is planned to be autogenerated soon. If an issue is assigned to a user, that user is claiming responsibility for the issue. If an issue is assigned to "hashibot", a community member has claimed the issue already.

Terraform Version

Terraform v0.12.1
+ provider.datadog v1.9.0
+ provider.google v2.8.0
+ provider.google-beta v2.8.0
+ provider.kubernetes v1.7.0
+ provider.ns1 v1.4.0
+ provider.random v2.1.2

Affected Resource(s)

  • google_container_node_pool
  • google_container_cluster
  • google_compute_backend_service

Terraform Configuration Files

I can provide additional config if this doesn't appear relevant enough.

resource "google_container_cluster" "application" {
  name               = "application"
  location           = "us-east1"
  min_master_version = "1.13.6-gke.6"

  # We can't create a cluster with no node pool defined, but we want to only use
  # separately managed node pools. So we create the smallest possible default
  # node pool and immediately delete it.
  remove_default_node_pool = true
  initial_node_count       = 1

  ip_allocation_policy {
    use_ip_aliases           = true
    cluster_ipv4_cidr_block  = "10.0.0.0/14"
    services_ipv4_cidr_block = "10.8.0.0/20"
  }
}

resource "google_container_node_pool" "api" {
  name       = "api"
  location   = "us-east1"
  cluster    = google_container_cluster.application.name
  node_count = 1
  version    = "1.13.6-gke.6"

  node_config {
    machine_type = "n1-standard-1"
    oauth_scopes = [
      "https://www.googleapis.com/auth/logging.write",
      "https://www.googleapis.com/auth/monitoring",
      "https://www.googleapis.com/auth/devstorage.read_only",
      "https://www.googleapis.com/auth/cloud-platform",
    ]
  }
}

resource "google_compute_https_health_check" "nginx-ingress" {
  name                = "nginx-ingress"
  request_path        = "/healthz"
  check_interval_sec  = 5
  timeout_sec         = 5
  healthy_threshold   = 2
  unhealthy_threshold = 2
}

resource "google_compute_backend_service" "api" {
  name          = "api-backend"
  port_name     = "https"
  protocol      = "HTTPS"
  timeout_sec   = 40
  health_checks = [google_compute_https_health_check.nginx-ingress.self_link]

  dynamic "backend" {
    for_each = google_container_node_pool.api.instance_group_urls

    content {
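      # backend.value is an instance group *manager* URL; stripping "Manager"
      # yields the instance group URL that the group argument expects.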
      group = replace(backend.value, "Manager", "")
    }
  }
}

Debug Output

https://gist.github.com/andyshinn/25d4cb0a37b9c0a5788cbfd09d58401d

Expected Behavior

When changing a google_container_node_pool that forces recreation (such as adding new scopes), the node pool should be recreated without error (possibly forcing recreation of google_container_cluster and google_compute_backend_service).

Actual Behavior

The google_container_node_pool fails with the following error when adding a new auth scope:

Error: Error waiting for deleting GKE NodePool: 
	(1) Google Compute Engine: The instance_template resource 'projects/default-3aef9459/global/instanceTemplates/gke-application-api-1bec71ec' is already being used by 'projects/default-3aef9459/zones/us-east1-c/instanceGroupManagers/gke-application-api-1bec71ec-grp'
	(2) Google Compute Engine: The instance_template resource 'projects/default-3aef9459/global/instanceTemplates/gke-application-api-dd23bfc9' is already being used by 'projects/default-3aef9459/zones/us-east1-d/instanceGroupManagers/gke-application-api-dd23bfc9-grp'
	(3) Google Compute Engine: The instance_template resource 'projects/default-3aef9459/global/instanceTemplates/gke-application-api-e2ca978c' is already being used by 'projects/default-3aef9459/zones/us-east1-b/instanceGroupManagers/gke-application-api-e2ca978c-grp'.

This appears to be because the google_compute_backend_service is using the instance groups.

Steps to Reproduce

  1. terraform apply
  2. Add a new auth scope to google_container_node_pool resource.
  3. terraform apply

References

I think this is the same as or similar to #1000. But I didn't see any headway on that issue.

ghost added the bug label Jun 11, 2019
andyshinn changed the title from "Recreating google_container_node_pool fails to delete instance_template" to "Recreating google_container_node_pool fails to delete instance_template when in use by google_compute_backend_service" Jun 11, 2019
emilymye (Contributor) commented Jun 12, 2019

Is there a reason you're trying to specifically create LB resources with Terraform and not with the LoadBalancer K8s service?

Otherwise, I'm not sure I have a good solution. It would require some knowledge of the link between node pool and backend services that is exclusive to this situation, and we can't force replacement of the backend service when updating its list of backends. Even if we managed to find a workaround, this is bound to cause issues at some point because GKE/k8s assumes it will be managing any linked LB resources, and this breaks that pattern.

@rileykarson mentioned that create_before_destroy might solve your issue, but I'm not sure whether it will update the backend URLs before destroying the final node pool.
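
For reference, a minimal sketch of that lifecycle setting applied to the node pool from this issue (untested; note the replacement pool would likely need a distinct name, for example via a random suffix, since node pool names must be unique):

resource "google_container_node_pool" "api" {
  # ... same arguments as in the configuration above ...

  lifecycle {
    # Create the replacement pool before destroying the old one.
    create_before_destroy = true
  }
}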

andyshinn (Author) commented:

Is there a reason you're trying to specifically create LB resources with Terraform and not with the LoadBalancer K8s service?

Mostly that I am migrating / splitting some traffic to a new service that will run on GKE through an existing HTTP load balancer managed with Terraform.

Is there another way to use an HTTP load balancer when not all your services are in GKE?

emilymye (Contributor) commented:

I'm not sure if you'll be able to manage an HTTP load balancer properly in Terraform if it is also going to be used for k8s/GKE. Are the services (overloaded term) still k8s services? I think you could use Ingress to create the GCE HTTP(S) LB and configure it to handle traffic, though I can't say I know exactly what this looks like for your setup.

andyshinn (Author) commented:

Are the services (overloaded term) still k8s services?

No, sorry. I meant that we are expanding into Kubernetes and GKE. But our existing services are applications that run on instances managed in instance groups. We have existing HTTP load balancers that we use to direct traffic to these instance groups. Our hope was to be able to treat GKE in a similar way by adding the GKE instance groups to a load balancer backend.

I am actually using the nginx Ingress in this scenario, but I am ignoring the LoadBalancer service and essentially using it the same way it would be used in a bare-metal deployment. This mostly works well. The other major issue with this approach (in addition to this one) is #1480.

I'm happy to do something else. But I am struggling to understand how someone with existing endpoints can start migrating to GKE / Kubernetes without serious traffic shuffling with intermediate load balancers.

andyshinn (Author) commented:

The more I think about this, the more it seems like a feature request for GKE (probably under https://github.com/kubernetes/ingress-gce). I just tried a similar pattern with NEGs, but the same issue exists: there is no way for Terraform to know the NEGs in order to add them to the backend service. A data source for NEGs wouldn't work either, because the NEGs can change and are created at Kubernetes runtime, which isn't available when Terraform runs.

The closest I could find is kubernetes/ingress-gce#33. My idea would be a controller, similar to neg-controller, that adds the NEGs to an existing backend service defined in Terraform. I'm thinking of a flow something like this:

  • Terraform would create an empty backend service named backend (is this even possible?). Its lifecycle would have to ignore_changes on the group (see the sketch after this list).
  • The service would get deployed (kubectl or Terraform) with a proper annotation (imagine something like cloud.google.com/neg-backend: backend).
  • The NEGs get added to backend by the controller.
  • 🤷‍♂
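
A rough sketch of that first step, reusing the health check from the config above; the resource name is illustrative, and ignore_changes targets the whole backend block list since individual group arguments can't easily be singled out:

resource "google_compute_backend_service" "backend" {
  name          = "backend"
  port_name     = "https"
  protocol      = "HTTPS"
  health_checks = [google_compute_https_health_check.nginx-ingress.self_link]

  # No backend blocks defined here; the hypothetical controller would attach
  # the NEGs out of band.
  lifecycle {
    # Keep Terraform from removing whatever backends the controller adds.
    ignore_changes = [backend]
  }
}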

emilymye (Contributor) commented:

But I am struggling to understand how someone with existing endpoints can start migrating to GKE / Kubernetes without serious traffic shuffling with intermediate load balancers.

Yeah, intermediate load balancers are what I was thinking of.

The more I think about this, the more it seems like a feature request for GKE (probably under https://github.com/kubernetes/ingress-gce)

Yeah, we're pretty limited by what is exposed by the GKE APIs and resources - if GKE decides to add dependencies or generate new resources that we can't 'import' into Terraform, the provider is not going to handle it well, since it's essentially two infrastructure managers trying to manage the same things.

If you want to file an issue against the k8s team, that would be great, since they would probably be able to provide more k8s/GKE-specific advice.

andyshinn (Author) commented:

I'm closing this as I think it is ultimately encompassed by kubernetes/ingress-gce#33. It is a broad ask but is basically the same as "allow Ingress to use an existing load balancer that has other backends and buckets".


ghost commented Jul 15, 2019

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.

If you feel this issue should be reopened, we encourage creating a new issue linking back to this one for added context. If you feel I made an error 🤖 🙉 , please reach out to my human friends 👉 [email protected]. Thanks!

ghost locked and limited conversation to collaborators Jul 15, 2019