Cannot delete instance group because it's being used by a backend service #6376

Open
kustodian opened this issue May 14, 2020 · 26 comments
Labels
forward/linked new-resource persistent-bug Hard to diagnose or long lived bugs for which resolutions are more like feature work than bug work service/compute-l7-load-balancer service/compute-networking-ig size/m
@kustodian

kustodian commented May 14, 2020

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request.
  • Please do not leave +1 or me too comments, they generate extra noise for issue followers and do not help prioritize the request.
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment.
  • If an issue is assigned to the modular-magician user, it is either in the process of being autogenerated, or is planned to be autogenerated soon. If an issue is assigned to a user, that user is claiming responsibility for the issue. If an issue is assigned to hashibot, a community member has claimed the issue already.

Terraform Version

Terraform v0.12.24

  • provider.google v3.21.0
  • provider.google-beta v3.21.0

Affected Resource(s)

  • google_compute_region_backend_service
  • google_compute_instance_group

Terraform Configuration Files

locals {
  project         = "<project-id>"
  network         = "<vpc-name>"
  network_project = "<vpc-project>"
  zones           = ["europe-west1-b", "europe-west1-c", "europe-west1-d"]
  s1_count        = 3
}

provider "google" {
  project = local.project
  version = "~> 3.0"
}

data "google_compute_network" "network" {
  name    = local.network
  project = local.network_project
}

resource "google_compute_region_backend_service" "s1" {
  name = "s1"

  dynamic "backend" {
    for_each = google_compute_instance_group.s1
    content {
      group = backend.value.self_link
    }
  }
  health_checks = [
    google_compute_health_check.default.self_link,
  ]
}

resource "google_compute_health_check" "default" {
  name = "s1"
  tcp_health_check {
    port = "80"
  }
}

resource "google_compute_instance_group" "s1" {
  count   = local.s1_count
  name    = format("s1-%02d", count.index + 1)
  zone    = element(local.zones, count.index)
  network = data.google_compute_network.network.self_link
}

I'm not sure whether this is a general TF problem or a Google provider problem, but here it goes.
Currently it's not possible to lower the number of google_compute_instance_group resources that are used in a google_compute_region_backend_service. In the code above, if we lower the number of google_compute_instance_group resources and try to apply the configuration, TF will first try to delete the surplus instance groups and then update the backend configuration. That order doesn't work because you cannot delete an instance group that is still used by the backend service; the order should be the other way around.

So to sum it up, when I lower the number of the instance group resources TF does this:

  1. delete surplus google_compute_instance_group -> this fails
  2. update google_compute_region_backend_service

It should do this the other way around:

  1. update google_compute_region_backend_service
  2. delete surplus google_compute_instance_group

Here is the output it generates:

google_compute_instance_group.s1[2]: Destroying... [id=projects/<project-id>/zones/europe-west1-d/instanceGroups/s1-03]

Error: Error deleting InstanceGroup: googleapi: Error 400: The instance_group resource 'projects/<project-id>/zones/europe-west1-d/instanceGroups/s1-03' is already being used by 'projects/<project-id>/regions/europe-west1/backendServices/s1', resourceInUseByAnotherResource

Expected Behavior

TF should first update the google_compute_region_backend_service, then delete the instance group.

Actual Behavior

TF tried to delete the instance group first, which resulted in an error.

Steps to Reproduce

  1. terraform apply
  2. Set s1_count = 2
  3. terraform apply

Important Factoids

It's not a simple task to fix this. One "workaround" is to change the dynamic for_each to have a slice() function like this:

  dynamic "backend" {
    for_each = slice(google_compute_instance_group.s1, 0, 2)
    content {
      group = backend.value.self_link
    }
  }

So you first set the end index of slice() to the new number of instance groups and run apply, then lower s1_count to that same number and run apply again, but that's just too complicated for a simple task like this.
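
The two-step workaround described above might be slightly easier to manage if the slice() bound comes from its own local, so that each step is a one-line change. This is only a sketch: s1_backend_count is a hypothetical name that does not appear in the original configuration, and the remaining resources are assumed to match the configuration at the top of this issue. It still needs two applies.

```hcl
locals {
  # Hypothetical local: lower this first and apply, then lower s1_count
  # to the same value and apply again.
  s1_backend_count = 3
}

resource "google_compute_region_backend_service" "s1" {
  name = "s1"

  dynamic "backend" {
    # min() keeps slice() in bounds if s1_backend_count is left above s1_count.
    for_each = slice(
      google_compute_instance_group.s1,
      0,
      min(local.s1_backend_count, length(google_compute_instance_group.s1)),
    )
    content {
      group = backend.value.self_link
    }
  }

  health_checks = [
    google_compute_health_check.default.self_link,
  ]
}
```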

b/308569276

@c2thorn
Collaborator

c2thorn commented May 19, 2020

Unfortunately, this is an upstream Terraform issue. The provider doesn't have access to the update/destroy order. This is similar to the scenario outlined here: #3008
I believe multiple applies is the only way to go for this case.

@c2thorn c2thorn closed this as completed May 19, 2020
@kustodian
Author

kustodian commented May 19, 2020 via email

@c2thorn
Collaborator

c2thorn commented May 19, 2020

Sorry, that's what I meant. We don't have access to enable a solution for just one apply.

@pdecat
Contributor

pdecat commented May 20, 2020

Hi, here's something of a workaround for this specific use case using an intermediate data source (needs two applies):

provider google {
  version = "3.22.0"
  region  = "europe-west1"
  project = "myproject"
}

locals {
  #zones = []
  zones = ["europe-west1-b"]
}

data "google_compute_network" "network" {
  name = "default"
}

data "google_compute_instance_group" "s1" {
  for_each = toset(local.zones)
  name     = format("s1-%s", each.key)
  zone     = each.key
}

resource "google_compute_region_backend_service" "s1" {
  name = "s1"

  dynamic "backend" {
    for_each = [for group in data.google_compute_instance_group.s1 : group.self_link if group.self_link != null]
    content {
      group = backend.value
    }
  }
  health_checks = [
    google_compute_health_check.default.self_link,
  ]
}

resource "google_compute_health_check" "default" {
  name = "s1"
  tcp_health_check {
    port = "80"
  }
}

resource "google_compute_instance_group" "s1" {
  for_each = toset(local.zones)
  name     = format("s1-%s", each.key)
  zone     = each.key
  network  = data.google_compute_network.network.self_link
}

@kustodian
Author

@pdecat your suggestion removes the dependency between google_compute_region_backend_service and google_compute_instance_group so this will probably always require two applies, even when starting from scratch.

@pdecat
Contributor

pdecat commented May 20, 2020

so this will probably always require two applies, even when starting from scratch.

I can confirm it does.

But at least it does not need manual intervention out of band to fix the situation.

@pdecat
Contributor

pdecat commented May 20, 2020

Maybe something the google provider could do to fix this situation would be to manage backends of a google_compute_region_backend_service as a separate resource:

# NOT A WORKING EXAMPLE
locals {
  project         = "<project-id>"
  network         = "<vpc-name>"
  network_project = "<vpc-project>"
  zones           = ["europe-west1-b", "europe-west1-c", "europe-west1-d"]
  s1_count        = 3
}

provider "google" {
  project = local.project
  version = "~> 3.0"
}

data "google_compute_network" "network" {
  name    = local.network
  project = local.network_project
}

resource "google_compute_region_backend_service" "s1" {
  name = "s1"

  health_checks = [
    google_compute_health_check.default.self_link,
  ]
}

# WARNING: this resource type does not exist
resource "google_compute_region_backend_service_backend" "s1" {
  for_each = google_compute_instance_group.s1

  backend_service = google_compute_region_backend_service.s1.self_link
  group           = each.value.self_link
}

resource "google_compute_health_check" "default" {
  name = "s1"
  tcp_health_check {
    port = "80"
  }
}

resource "google_compute_instance_group" "s1" {
  count   = local.s1_count
  name    = format("s1-%02d", count.index + 1)
  zone    = element(local.zones, count.index)
  network = data.google_compute_network.network.self_link
}

As a side note, I feel like hashicorp/terraform#8099 is not really about the same issue. It is about replacing or updating a resource when another resource it depends on changes (and not being destroyed).

@StephenWithPH

I added a comment on the Terraform core issue (hashicorp/terraform#25010 (comment))

Based on that comment (terraform taint up the dependency chain until a single-pass apply works), I think there's a provider-specific fix.

If ForceNew was part of the schema here ...

https://github.com/terraform-providers/terraform-provider-google/blob/c87e414b028becc33f64183a9bd52c92c9b49737/google/resource_compute_region_backend_service.go#L173-L179

... wouldn't that have the same effect as my manual terraform taint?

@c2thorn c2thorn added persistent-bug Hard to diagnose or long lived bugs for which resolutions are more like feature work than bug work and removed upstream-terraform bug labels May 26, 2020
@c2thorn
Collaborator

c2thorn commented May 26, 2020

@pdecat that should work, and requires implementing a new fine-grained resource google_compute_region_backend_service_backend.

Reopening the issue since a solution is possible, and this will be tracked similarly to other feature-requests.

@c2thorn c2thorn reopened this May 26, 2020
@c2thorn
Collaborator

c2thorn commented May 26, 2020

@StephenWithPH ForceNew would have the same effect, but make every change (addition as well as removal) to the backend set destructive. Providing a new fine-grained resource is the cleaner option here.

@freeseacher

The lack of pretty essential features, and bugs like this, make me very disappointed with Terraform and GCP.

@cagataygurturk
Contributor

Providing a new fine-grained resource is the cleaner option here.

The question is when :)

@derhally

This issue is actually quite problematic

I get these errors trying to destroy the whole module. It requires multiple targeted terraform destroys to complete


Error: Error when reading or editing HealthCheck: googleapi: Error 400: The health_check resource 'projects/test-proj/global/healthChecks/atlantis-healthcheck' is already being used by 'projects/test-proj/global/backendServices/atlantis-backend-service', resourceInUseByAnotherResource

Error: Error waiting for Deleting SecurityPolicy: The security_policy resource 'projects/test-proj/global/securityPolicies/atlantis-security-policy' is already being used by 'projects/test-proj/global/backendServices/atlantis-backend-service'

Error: Error deleting InstanceGroup: googleapi: Error 400: The instance_group resource 'projects/test-proj/zones/us-central1-a/instanceGroups/instance-group-all' is already being used by 'projects/test-proj/global/backendServices/atlantis-backend-service', resourceInUseByAnotherResource

@konturn

konturn commented Jul 11, 2021

I actually just ran into this issue a couple of days ago, and I was able to resolve it by appending a random string to the end of the group manager's name and using the create_before_destroy lifecycle policy for the instance group manager resource. For whatever reason, doing so leads Terraform to modify the backend service before destroying the original instance group. Still not the prettiest hack in the world, but better than having to issue multiple applies.

@husseyd

husseyd commented Oct 5, 2021

This has been driving me nuts for months.
Using Cloud Run behind external GCLB. Backend services for the Serverless NEGs are in use by the URL map.

Once all this config/infra is in place, the service / backend service cannot be deleted, even if the URL map is removed in the same change. It becomes a two-step process: removing the URL map, then removing the service and backend service.

In an enterprise setting with ~10 environments each receiving different releases at different schedules, having repeat CI pipelines is not okay and is basically unmanageable.

@bluemalkin

I can relate to this, GCP doesn't update the URL map before destroying backend services. Very frustrating.

@c2thorn c2thorn removed their assignment Apr 19, 2022
modular-magician added a commit to modular-magician/terraform-provider-google that referenced this issue Aug 4, 2022
modular-magician added a commit that referenced this issue Aug 4, 2022
@PranavSathy

Can confirm that this is the case with a manual global load balancing setup on the Google provider as well. It's definitely super annoying that we need to manually:

  1. Update our terraform config to remove a desired deployment region (e.g. us-central1).
  2. Run the following command manually:
$ gcloud beta compute backend-services remove-backend --global revere-backend \
    --network-endpoint-group-region=<region> \
    --network-endpoint-group=revere-neg-<region>
  3. terraform apply to achieve desired state.

This means that any time we turn down a region, some administrator is going to have to do this instead of simply relying on CI/CD. What's worse is that it makes proving certain security/compliance certifications harder: our CI/CD + pull request process is audited and logged, but random CLI commands from an administrator's shell environment are harder to track (i.e. we need to involve GCP Audit Logging in the business justifications).

Looking forward to an elegant solution by the provider here.

@pedromiranda-telus

pedromiranda-telus commented Oct 21, 2022

I can relate to this, GCP doesn't update the URL map before destroying backend services. Very frustrating.

I had the same problem. My workaround was to run the following command (IT CAUSES UNAVAILABILITY):

# This will delete the URL map, then the backend service and finally create them again
terraform apply -replace="google_compute_region_url_map.name_of_your_url_map"

Hope it helps.

@Unichron

I think it's fundamentally a terraform core issue, but it could be fixed in the provider if there was a standalone resource to manage a backend of a backend service. In this case the deletion of the instance group/neg/whatever would naturally involve the deletion of the backend resource, and deletes in this case would be properly ordered. Of course the same then should be done for all analogous cases, which is a hassle and spans across most terraform providers (and maybe even impossible in some cases), but these would provide extra flexibility as well on top of being a workaround for this issue.

@m00lecule

m00lecule commented Feb 5, 2023

Keeping my fingers crossed that somebody solves this issue. I ran into it today when trying to add a zone to the google_compute_region_instance_group_manager.distribution_policy_zones field. I have learned that common operations are not possible in GCP.

@luismendezescobar

I actually just ran into this issue a couple of days ago, and I was able to resolve it by appending a random string to the end of the group manager's name and using the create_before_destroy lifecycle policy for the instance group manager resource. For whatever reason, doing so leads Terraform to modify the backend service before destroying the original instance group. Still not the prettiest hack in the world, but better than having to issue multiple applies.

Hi, could you paste an example of what you did with create_before_destroy?

@djsmiley2k

Disappointing that this has existed for 2+ years and there is still no fix.

How come terraform doesn't understand it can't delete a managed instance group without first removing the load balancer (i.e. backend) depending on it? Seems a pretty simple idea, which for some reason isn't implemented?

@levid0s

levid0s commented Jul 21, 2023

I'm having the same issue.

I tried fixing it by adding a manual dependency using lifecycle.replace_triggered_by, but you have to do this on every single dependent resource, otherwise I keep getting the 'resource already used by' error.
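
For readers unfamiliar with it, a lifecycle.replace_triggered_by dependency (available in Terraform 1.2 and later) looks roughly like the sketch below. All resource names here are hypothetical, and nothing in this thread confirms that this resolves the ordering problem in a given configuration. As noted above, each dependent resource needs its own entry, and replacing a URL map implies downtime while it is recreated.

```hcl
resource "google_compute_health_check" "app" {
  name = "app"
  tcp_health_check {
    port = 80
  }
}

resource "google_compute_backend_service" "app" {
  name          = "app"
  health_checks = [google_compute_health_check.app.self_link]
}

resource "google_compute_url_map" "app" {
  name            = "app"
  default_service = google_compute_backend_service.app.self_link

  lifecycle {
    # Replace the URL map whenever the backend service is planned for
    # an update or a replacement.
    replace_triggered_by = [google_compute_backend_service.app]
  }
}
```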

@github-actions github-actions bot added forward/review In review; remove label to forward service/compute-l7-load-balancer labels Oct 25, 2023
@roaks3 roaks3 removed the forward/review In review; remove label to forward label Oct 27, 2023
@cen1

cen1 commented May 14, 2024

So.. this is a top 11 issue by likes, 4 years later we still have to do painful workarounds. create_before_destroy is not always feasible if you run a singleton..

@plexus

plexus commented Jun 4, 2024

This seems to work reasonably well as a workaround:

resource "random_id" "group-manager-suffix" {
  byte_length = 4
}

resource "google_compute_instance_group_manager" "my-group" {
  name = "my-instance-group-manager-${random_id.group-manager-suffix.hex}"

  ...

  lifecycle {
    create_before_destroy = true
  }
}

resource "google_compute_backend_service" "my-backend" {
  ...
  backend {
    group = google_compute_instance_group_manager.my-group.instance_group
    ...
  }
}

By randomizing the name it's possible to use create_before_destroy, so this will first create a second instance_group_manager, update the backend, then destroy the first instance_group_manager. Single-pass apply and no intermediate downtime.
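
One caveat that may be worth spelling out: a random_id without keepers holds the same value for its whole lifetime, so the generated name only changes if the random_id itself is recreated. A possible refinement, sketched here under assumptions, uses keepers to tie the suffix to an input that forces replacement of the group manager. The variable names and the choice of zone as the trigger are hypothetical and not part of the original comment.

```hcl
variable "zone" {
  type = string
}

variable "instance_template" {
  type = string # self link or ID of an existing instance template
}

resource "random_id" "group_manager_suffix" {
  byte_length = 4

  # A new suffix is generated whenever any keeper value changes.
  keepers = {
    zone = var.zone
  }
}

resource "google_compute_instance_group_manager" "my_group" {
  name               = "my-instance-group-manager-${random_id.group_manager_suffix.hex}"
  base_instance_name = "my-instance"

  # Read the zone through the keeper so the name and the zone change together.
  zone = random_id.group_manager_suffix.keepers.zone

  version {
    instance_template = var.instance_template
  }

  lifecycle {
    create_before_destroy = true
  }
}
```

Any other argument that forces replacement would need to be listed in keepers in the same way; otherwise the replacement would presumably try to reuse the existing name.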

@maxi-cit

maxi-cit commented Jun 5, 2024

Hello folks, I started working on adding this new resource, google_compute_region_backend_service_backend. Hopefully this should be enough to close this issue. I will open a PR in a few days (just making sure the tests work fine).
