
Recreating google_container_node_pool fails to delete instance_template when in use by google_compute_backend_service #3838

Closed
andyshinn opened this issue Jun 11, 2019 · 8 comments

andyshinn commented Jun 11, 2019

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment
  • If an issue is assigned to the "modular-magician" user, it is either in the process of being autogenerated, or is planned to be autogenerated soon. If an issue is assigned to a user, that user is claiming responsibility for the issue. If an issue is assigned to "hashibot", a community member has claimed the issue already.

Terraform Version

Terraform v0.12.1
+ provider.datadog v1.9.0
+ provider.google v2.8.0
+ provider.google-beta v2.8.0
+ provider.kubernetes v1.7.0
+ provider.ns1 v1.4.0
+ provider.random v2.1.2

Affected Resource(s)

  • google_container_node_pool
  • google_container_cluster
  • google_compute_backend_service

Terraform Configuration Files

I can provide additional config if this doesn't appear relevant enough.

resource "google_container_cluster" "application" {
  name               = "application"
  location           = "us-east1"
  min_master_version = "1.13.6-gke.6"

  # We can't create a cluster with no node pool defined, but we want to only use
  # separately managed node pools. So we create the smallest possible default
  # node pool and immediately delete it.
  remove_default_node_pool = true
  initial_node_count       = 1

  ip_allocation_policy {
    use_ip_aliases           = true
    cluster_ipv4_cidr_block  = "10.0.0.0/14"
    services_ipv4_cidr_block = "10.8.0.0/20"
  }
}

resource "google_container_node_pool" "api" {
  name       = "api"
  location   = "us-east1"
  cluster    = google_container_cluster.application.name
  node_count = 1
  version    = "1.13.6-gke.6"

  node_config {
    machine_type = "n1-standard-1"
    oauth_scopes = [
      "https://www.googleapis.com/auth/logging.write",
      "https://www.googleapis.com/auth/monitoring",
      "https://www.googleapis.com/auth/devstorage.read_only",
      "https://www.googleapis.com/auth/cloud-platform",
    ]
  }
}

resource "google_compute_https_health_check" "nginx-ingress" {
  name                = "nginx-ingress"
  request_path        = "/healthz"
  check_interval_sec  = 5
  timeout_sec         = 5
  healthy_threshold   = 2
  unhealthy_threshold = 2
}

resource "google_compute_backend_service" "api" {
  name          = "api-backend"
  port_name     = "https"
  protocol      = "HTTPS"
  timeout_sec   = 40
  health_checks = [google_compute_https_health_check.nginx-ingress.self_link]

  dynamic "backend" {
    for_each = google_container_node_pool.api.instance_group_urls

    content {
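      # backend.value is an instance group *manager* URL; stripping "Manager"
      # yields the instance group URL that the group argument expects.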
      group = replace(backend.value, "Manager", "")
    }
  }
}

Debug Output

https://gist.github.com/andyshinn/25d4cb0a37b9c0a5788cbfd09d58401d

Expected Behavior

When changing a google_container_node_pool that forces recreation (such as adding new scopes), the node pool should be recreated without error (possibly forcing recreation of google_container_cluster and google_compute_backend_service).

Actual Behavior

The google_container_node_pool fails with the following error when adding a new auth scope:

Error: Error waiting for deleting GKE NodePool: 
	(1) Google Compute Engine: The instance_template resource 'projects/default-3aef9459/global/instanceTemplates/gke-application-api-1bec71ec' is already being used by 'projects/default-3aef9459/zones/us-east1-c/instanceGroupManagers/gke-application-api-1bec71ec-grp'
	(2) Google Compute Engine: The instance_template resource 'projects/default-3aef9459/global/instanceTemplates/gke-application-api-dd23bfc9' is already being used by 'projects/default-3aef9459/zones/us-east1-d/instanceGroupManagers/gke-application-api-dd23bfc9-grp'
	(3) Google Compute Engine: The instance_template resource 'projects/default-3aef9459/global/instanceTemplates/gke-application-api-e2ca978c' is already being used by 'projects/default-3aef9459/zones/us-east1-b/instanceGroupManagers/gke-application-api-e2ca978c-grp'.

This appears to be because the google_compute_backend_service is using the instance groups.

Steps to Reproduce

  1. terraform apply
  2. Add a new auth scope to google_container_node_pool resource.
  3. terraform apply

References

I think this is the same as or similar to #1000. But I didn't see any headway on that issue.

ghost added the bug label Jun 11, 2019
andyshinn changed the title from "Recreating google_container_node_pool fails to delete instance_template" to "Recreating google_container_node_pool fails to delete instance_template when in use by google_compute_backend_service" Jun 11, 2019
emilymye (Contributor) commented Jun 12, 2019

Is there a reason you're trying to specifically create LB resources with Terraform and not with the LoadBalancer K8s service?

Otherwise, I'm not sure I have a good solution. It would require some knowledge of the link between node pool and backend services that is exclusive to this situation, and we can't force replacement of the backend service when updating its list of backends. Even if we managed to find a workaround, this is bound to cause issues at some point because GKE/k8s assumes it will be managing any linked LB resources, and this breaks that pattern.

@rileykarson mentioned that create_before_destroy might solve your issue, but I'm not sure whether it will update the backend URLs before destroying the final node pool.
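
For reference, a minimal sketch of that lifecycle setting applied to the node pool from this issue (untested; note the replacement pool would likely need a distinct name, for example via a random suffix, since node pool names must be unique):

resource "google_container_node_pool" "api" {
  # ... same arguments as in the configuration above ...

  lifecycle {
    # Create the replacement pool before destroying the old one.
    create_before_destroy = true
  }
}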

andyshinn (Author) commented:

Is there a reason you're trying to specifically create LB resources with Terraform and not with the LoadBalancer K8s service?

Mostly that I am migrating / splitting some traffic to a new service that will run on GKE through an existing HTTP load balancer managed with Terraform.

Is there another way to use an HTTP load balancer when not all your services are in GKE?

emilymye (Contributor) commented:

I'm not sure if you'll be able to manage an HTTP load balancer properly in Terraform if it is also going to be used for k8s/GKE. Are the services (overloaded term) still k8s services? I think you could use Ingress to create the GCE HTTP(S) LB and configure it to handle traffic, though I can't say I know exactly what this looks like for your setup.

andyshinn (Author) commented:

Are the services (overloaded term) still k8s services?

No, sorry. I meant that we are expanding into Kubernetes and GKE. But our existing services are applications that run on instances managed in instance groups. We have existing HTTP load balancers that we use to direct traffic to these instance groups. Our hope was to be able to treat GKE in a similar way by adding the GKE instance groups to a load balancer backend.

I am actually using the nginx Ingress in this scenario, but I am ignoring the LoadBalancer service and essentially using it the same way it would be used in a bare-metal deployment. This mostly works well. The other major issue with this approach (in addition to this one) is #1480.

I'm happy to do something else. But I am struggling to understand how someone with existing endpoints can start migrating to GKE / Kubernetes without serious traffic shuffling with intermediate load balancers.

andyshinn (Author) commented:

The more I think about this, the more it seems like a feature request for GKE (probably under https://github.com/kubernetes/ingress-gce). I just tried a similar pattern with NEGs, but the same issue exists: there is no way for Terraform to know the NEGs in order to add them to the backend service. A data source for NEGs wouldn't work either, because the NEGs can change and are created at Kubernetes runtime, which isn't available when Terraform runs.

The closest I could find is kubernetes/ingress-gce#33. My idea would be a controller, similar to neg-controller, that adds the NEGs to an existing backend service defined in Terraform. I'm thinking of a flow something like this:

  • Terraform would create an empty backend service named backend (is this even possible?). Its lifecycle would have to ignore_changes on the group (see the sketch after this list).
  • The service would get deployed (kubectl or Terraform) with a proper annotation (imagine something like cloud.google.com/neg-backend: backend).
  • The NEGs get added to backend by the controller.
  • 🤷‍♂
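
A rough sketch of that first step, reusing the health check from the config above; the resource name is illustrative, and ignore_changes targets the whole backend block list since individual group arguments can't easily be singled out:

resource "google_compute_backend_service" "backend" {
  name          = "backend"
  port_name     = "https"
  protocol      = "HTTPS"
  health_checks = [google_compute_https_health_check.nginx-ingress.self_link]

  # No backend blocks defined here; the hypothetical controller would attach
  # the NEGs out of band.
  lifecycle {
    # Keep Terraform from removing whatever backends the controller adds.
    ignore_changes = [backend]
  }
}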

emilymye (Contributor) commented:

But I am struggling to understand how someone with existing endpoints can start migrating to GKE / Kubernetes without serious traffic shuffling with intermediate load balancers.

Yeah, intermediate load balancers are what I was thinking of.

The more I think about this, the more it seems like a feature request for GKE (probably under https://github.com/kubernetes/ingress-gce)

Yeah, we're pretty limited by what is exposed by the GKE APIs and resources - if GKE decides to add dependencies or generate new resources that we can't 'import' into Terraform, the provider is not going to handle it well, since it's essentially two infrastructure managers trying to manage the same things.

If you want to file an issue against the k8s team, that would be great, since they would probably be able to provide more k8s/GKE-specific advice.

andyshinn (Author) commented:

I'm closing this as I think it is ultimately encompassed by kubernetes/ingress-gce#33. It is a broad ask but is basically the same as "allow Ingress to use an existing load balancer that has other backends and buckets".


ghost commented Jul 15, 2019

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.

If you feel this issue should be reopened, we encourage creating a new issue linking back to this one for added context. If you feel I made an error 🤖 🙉 , please reach out to my human friends 👉 [email protected]. Thanks!

ghost locked and limited conversation to collaborators Jul 15, 2019