Concurrent destroy of node pools - one fails within configured timeout, with retryable error #6334

Closed
mancaus opened this issue May 11, 2020 · 4 comments

@mancaus
Contributor

mancaus commented May 11, 2020

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request.
  • Please do not leave +1 or me too comments, they generate extra noise for issue followers and do not help prioritize the request.
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment.
  • If an issue is assigned to the modular-magician user, it is either in the process of being autogenerated, or is planned to be autogenerated soon. If an issue is assigned to a user, that user is claiming responsibility for the issue. If an issue is assigned to hashibot, a community member has claimed the issue already.

Terraform Version

Terraform v0.12.24

  • provider.google v3.20.0
  • provider.random v2.2.1

Affected Resource(s)

  • google_container_node_pool

Terraform Configuration Files

# Create and apply two deployments, then destroy them concurrently 
provider "google" {

}

resource "random_id" "id" {
  byte_length = 4
}

resource "google_container_node_pool" "test" {
  name       = "test-${random_id.id.hex}"
  project    = "[redacted]"
  location   = "us-east1-d"
  cluster    = "[existing cluster]"
  node_count = 1

  node_config {
    machine_type = "n1-standard-1"
  }
}
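
For reference, the delete timeout referred to below can be set on the resource itself via the standard timeouts block. The variant below is only a sketch with assumed values (e.g. 30m), not the configuration used when reproducing the issue.

# Hypothetical variant of the node pool above with explicit operation timeouts.
# The timeouts block is Terraform's standard mechanism; the values shown here
# are assumptions for illustration.
resource "google_container_node_pool" "test_with_timeouts" {
  name       = "test-${random_id.id.hex}"
  project    = "[redacted]"
  location   = "us-east1-d"
  cluster    = "[existing cluster]"
  node_count = 1

  node_config {
    machine_type = "n1-standard-1"
  }

  timeouts {
    create = "30m"
    delete = "30m"
  }
}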

Debug Output

Trace output from failing destroy:
https://gist.github.com/mancaus/014b4d730b442ba17adea09da41a5f42

Expected Behavior

Destroy either:

  • completes (the node pool is destroyed after the first destroy completes), or
  • fails only after the configured delete timeout has elapsed

Actual Behavior

Fails with a retryable error after 30 seconds.

Note: concurrent apply works; node pools are created serially within the configured timeout.

Steps to Reproduce

Create two deployments, then run each of the following steps concurrently for both deployments:

  1. terraform init
  2. terraform apply
  3. terraform destroy

Important Factoids

N/A

References

None

ghost added the bug label May 11, 2020
mancaus added a commit to mancaus/terraform-provider-google that referenced this issue May 11, 2020
@emilymye
Contributor

I'm not against adding this retry and will review the PR because we probably also can't delete node pools if the cluster is being updated in another way (e.g. automatic updates).

However, I'm not adding this to fix deletion across different Terraform runs, and I would strongly advise against relying on it for that. May I ask why you're editing one GKE cluster with two different Terraform instances?

Almost all of our resources that must be edited synchronously have built-in mutex support to handle this, but mutex state isn't shared across different Terraform runs. This retry behavior won't be supported for any other resource, and retries are more costly (quota), more error prone, and usually slower.

@mancaus
Contributor Author

mancaus commented May 18, 2020

Hi, Emily - thanks for the update.

We have two use cases that require managing Kubernetes resources from different Terraform instances.

A) Typically our clusters host multiple distinct sets of services with different lifetimes. Some of these manage their own node pool as they have their own specific requirements. These services are deployed independently (different teams, to different logical tenants) and therefore it's possible to have concurrent instances of Terraform running, including in production.

B) We test various configurations in CI and, to save cost and management overhead, these tests use a shared cluster. We won't serialize operations in CI, as doing so would be onerous and would slow our CI. We also can't split the tests across clusters, as that would require many clusters, making them expensive and difficult to manage. This error is most prevalent in CI for us.
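
As a minimal sketch of the pattern behind (A) and (B): each independent Terraform configuration owns only its own node pool and points at the shared, pre-existing cluster. The names and the data-source lookup below are illustrative assumptions, not our actual modules.

# Illustrative only: one of several independent Terraform configurations,
# each managing its own node pool on a shared, pre-existing cluster.
data "google_container_cluster" "shared" {
  name     = "[existing cluster]"
  project  = "[redacted]"
  location = "us-east1-d"
}

resource "google_container_node_pool" "service_pool" {
  name       = "service-a-pool"
  project    = "[redacted]"
  location   = "us-east1-d"
  cluster    = data.google_container_cluster.shared.name
  node_count = 1

  node_config {
    machine_type = "n1-standard-1"
  }
}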

I would propose that any error the APIs advertise as retryable should be retried up to the configured timeout, which I think covers this case. Per the PR, the cluster create logic already does this, but delete does not, and that also appears to be the intent of the code for this resource and elsewhere.

Thanks again

@emilymye
Contributor

Fixed by #6334

@ghost

ghost commented Jun 27, 2020

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.

If you feel this issue should be reopened, we encourage creating a new issue linking back to this one for added context. If you feel I made an error 🤖 🙉 , please reach out to my human friends 👉 [email protected]. Thanks!

ghost locked and limited conversation to collaborators Jun 27, 2020