
Google container node pool, clobber duplicated resource and destroying wrong resource #9402


@lucasteligioridis

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request.
  • Please do not leave +1 or me too comments, they generate extra noise for issue followers and do not help prioritize the request.
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment.
  • If an issue is assigned to the modular-magician user, it is either in the process of being autogenerated, or is planned to be autogenerated soon. If an issue is assigned to a user, that user is claiming responsibility for the issue. If an issue is assigned to hashibot, a community member has claimed the issue already.

Terraform Version

Terraform v0.14.6

Affected Resource(s)

  • google_container_node_pool

Terraform Configuration Files

resource "google_container_node_pool" "pool_1" {
  name = "pool_1"
  project = "my-project"
...
}

# Exact duplicate of the 'pool_1' configuration, except the Terraform resource name
# has a `temp_` prefix. Note the node pool `name` was mistakenly left the same.
resource "google_container_node_pool" "temp_pool_1" {
  name = "pool_1"
  project = "my-project"
...
}

Expected Behavior

When creating the new "temp" node pool with terraform apply, I expected a conflict on the name, since the original pool_1 resource was already created and running.

That conflict error did occur, and looked exactly like this:

Error: error creating NodePool: googleapi: Error 409: Already exists: projects/my-project/zones/my-zone/clusters/my-cluster/nodePools/pool_1., alreadyExists

  on temp_pool_1.tf line 1, in resource "google_container_node_pool" "temp_pool_1":
   1: resource "google_container_node_pool" "temp_pool_1" {

After this, I changed the name inside the resource to temp_pool_1, so the resource looked like this:

resource "google_container_node_pool" "temp_pool_1" {
  name = "temp_pool_1"
  project = "my-project"
...
}

After this, I expected a new node pool to be created with the above settings and nothing else to change.
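
Roughly, the plan I expected after the rename (paraphrased here for illustration, not actual output) would have been a plain create, with pool_1 left untouched:

  # google_container_node_pool.temp_pool_1 will be created
  + resource "google_container_node_pool" "temp_pool_1" {
      + name    = "temp_pool_1"
      + project = "my-project"
      ...
    }

Plan: 1 to add, 0 to change, 0 to destroy.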

Actual Behavior

When I ran terraform apply again, after the initial 409 error and the rename, the plan showed a destroy-and-replace diff instead of the new resource I expected:

An execution plan has been generated and is shown below.
Resource actions are indicated with the following symbols:
-/+ destroy and then create replacement

Terraform will perform the following actions:

  # google_container_node_pool.temp_pool_1 is tainted, so must be replaced
-/+ resource "google_container_node_pool" "temp_pool_1" {
      ~ id                  = "projects/my-project/locations/my-location/clusters/my-cluster/nodePools/pool_1" -> (known after apply)
      ~ name                = "pool_1" -> "temp_pool_1"

The id is the important part here. It looks like state was written during the terraform apply that failed with the 409 error, clobbering the entry for my existing resource. Terraform now believes I have renamed that resource, so it will destroy pool_1 and replace it with temp_pool_1 instead of creating the new pool alongside it.
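
(For reference, and not something I ran at the time: the clobbered mapping can be confirmed by inspecting the state directly, since the id field shows which real node pool the address is tracking.)

  # Show what Terraform currently records under the temp_pool_1 address;
  # an id ending in .../nodePools/pool_1 reveals it is tracking the pre-existing pool.
  terraform state show google_container_node_pool.temp_pool_1

  # List every node pool address held in state
  terraform state list | grep google_container_node_pool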

Since I was in a development environment, I went ahead with the apply, and that is exactly what happened.

After the apply completed and did exactly what I predicted, my original node pool, pool_1, was destroyed. Running terraform apply again immediately afterwards now wants to create the pool_1 resource from scratch.

I suspect there is a race condition, or state is being clobbered where it should not be in this specific case. I am just glad I caught this in development, and the diff did clearly show the problem, but this could definitely trip someone up in production and cause the wrong node pool to be destroyed unintentionally.
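
(A possible manual workaround, which I have not verified with the maintainers: if the bad destroy-and-replace diff is caught before applying, dropping the clobbered state entry should stop Terraform from associating temp_pool_1 with the existing pool, without touching the real infrastructure.)

  # Forget the clobbered state entry; this only edits state, not the real node pool.
  terraform state rm google_container_node_pool.temp_pool_1

  # The next plan should now show a plain create of temp_pool_1
  # and no changes to pool_1.
  terraform plan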

Look forward to hearing back :)

@edwardmedia self-assigned this Jun 21, 2021
@edwardmedia
Contributor

@lucasteligioridis this is an interesting use case. Because of the way the GKE and node pool resources are designed, it takes a series of calls to fix the problem caused by the initial apply in your case; you simply stopped in the middle of those steps. If you continue running terraform apply, it will fix the problem. I am not sure whether there is a good way at the provider level to shorten those steps.
@ScottSuarez what do you think?

@ScottSuarez
Collaborator

I think we can do a prefetch on the resource during create, to make sure it doesn't already exist, before tainting the resource. @edwardmedia

https://github.com/GoogleCloudPlatform/magic-modules/pull/4904/files

@edwardmedia
Contributor

@ScottSuarez good idea

@lucasteligioridis
Author

Thanks for the quick turnaround, team ❤️ 👏🏼
