Concurrent destroy of node pools - one fails within configured timeout, with retryable error #6334

Closed
mancaus opened this issue May 11, 2020 · 4 comments

@mancaus
Contributor

mancaus commented May 11, 2020

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request.
  • Please do not leave +1 or me too comments, they generate extra noise for issue followers and do not help prioritize the request.
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment.
  • If an issue is assigned to the modular-magician user, it is either in the process of being autogenerated, or is planned to be autogenerated soon. If an issue is assigned to a user, that user is claiming responsibility for the issue. If an issue is assigned to hashibot, a community member has claimed the issue already.

Terraform Version

Terraform v0.12.24

  • provider.google v3.20.0
  • provider.random v2.2.1

Affected Resource(s)

  • google_container_node_pool

Terraform Configuration Files

# Create and apply two deployments, then destroy them concurrently 
provider "google" {

}

resource "random_id" "id" {
  byte_length = 4
}

resource "google_container_node_pool" "test" {
  name       = "test-${random_id.id.hex}"
  project    = "[redacted]"
  location   = "us-east1-d"
  cluster    = "[existing cluster]"
  node_count = 1

  node_config {
    machine_type = "n1-standard-1"
  }
}
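
For reference, the delete timeout referred to below can be set on the resource itself via the standard timeouts block. The variant below is only a sketch with assumed values (e.g. 30m), not the configuration used when reproducing the issue.

# Hypothetical variant of the node pool above with explicit operation timeouts.
# The timeouts block is Terraform's standard mechanism; the values shown here
# are assumptions for illustration.
resource "google_container_node_pool" "test_with_timeouts" {
  name       = "test-${random_id.id.hex}"
  project    = "[redacted]"
  location   = "us-east1-d"
  cluster    = "[existing cluster]"
  node_count = 1

  node_config {
    machine_type = "n1-standard-1"
  }

  timeouts {
    create = "30m"
    delete = "30m"
  }
}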

Debug Output

Trace output from failing destroy:
https://gist.github.com/mancaus/014b4d730b442ba17adea09da41a5f42

Expected Behavior

Destroy either:

  • completes (the node pool is destroyed after the first destroy completes), or
  • fails only after the configured delete timeout has elapsed

Actual Behavior

Fails with a retryable error after 30 seconds.

Note: concurrent apply works; node pools are created serially within the configured timeout.

Steps to Reproduce

Create two deployments, then run each of the following steps concurrently for both deployments:

  1. terraform init
  2. terraform apply
  3. terraform destroy

Important Factoids

N/A

References

None

ghost added the bug label May 11, 2020
mancaus added a commit to mancaus/terraform-provider-google that referenced this issue May 11, 2020
@emilymye
Contributor

I'm not against adding this retry and will review the PR because we probably also can't delete node pools if the cluster is being updated in another way (e.g. automatic updates).

However, I'm not adding this to fix deletion across different Terraform runs, and I would strongly advise against relying on it for that. May I ask why you're editing one GKE cluster with two different Terraform instances?

Almost all of our resources that must be edited synchronously have built-in mutex support to handle this, but mutex state isn't shared across different Terraform runs. This retry behavior won't be supported for any other resource, and retries are more costly (quota), more error prone, and usually slower.

@mancaus
Contributor Author

mancaus commented May 18, 2020

Hi, Emily - thanks for the update.

We have two use cases that require managing Kubernetes resources from different Terraform instances.

A) Typically our clusters host multiple distinct sets of services with different lifetimes. Some of these manage their own node pool as they have their own specific requirements. These services are deployed independently (different teams, to different logical tenants) and therefore it's possible to have concurrent instances of Terraform running, including in production.

B) We test various configurations in CI and, to save cost and management overhead, these tests use a shared cluster. We won't serialize operations in CI, as doing so would be onerous and would slow our CI. We also can't split the tests across clusters, as that would require many clusters, making them expensive and difficult to manage. This error is most prevalent in CI for us.
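
As a minimal sketch of the pattern behind (A) and (B): each independent Terraform configuration owns only its own node pool and points at the shared, pre-existing cluster. The names and the data-source lookup below are illustrative assumptions, not our actual modules.

# Illustrative only: one of several independent Terraform configurations,
# each managing its own node pool on a shared, pre-existing cluster.
data "google_container_cluster" "shared" {
  name     = "[existing cluster]"
  project  = "[redacted]"
  location = "us-east1-d"
}

resource "google_container_node_pool" "service_pool" {
  name       = "service-a-pool"
  project    = "[redacted]"
  location   = "us-east1-d"
  cluster    = data.google_container_cluster.shared.name
  node_count = 1

  node_config {
    machine_type = "n1-standard-1"
  }
}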

I would propose that any error the APIs advertise as retryable should be retried up to the configured timeout, which I think covers this case. Per the PR, the cluster create logic already does this, but delete does not, and that also appears to be the intent of the code for this resource and elsewhere.

Thanks again

@emilymye
Contributor

Fixed by #6334

@ghost

ghost commented Jun 27, 2020

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.

If you feel this issue should be reopened, we encourage creating a new issue linking back to this one for added context. If you feel I made an error 🤖 🙉 , please reach out to my human friends 👉 [email protected]. Thanks!

ghost locked and limited conversation to collaborators Jun 27, 2020