Concurrent destroy of node pools - one fails within configured timeout, with retryable error #6334
Comments
I'm not against adding this retry and will review the PR, because we probably also can't delete node pools while the cluster is being updated in another way (e.g. automatic updates). However, I'm not doing this to fix the issue of deleting across different Terraform runs, and I would strongly advise you not to do that. May I ask why you're editing one GKE cluster with two different Terraform instances? Almost all of our resources that must be edited synchronously have built-in mutex support to handle this, but mutex state isn't shared across different Terraform runs. This retry behavior won't be supported for any other resource, and retries are more costly (quota), more error prone, and usually take longer.
Hi, Emily - thanks for the update. We have two use cases that require managing Kubernetes resources with different Terraform instances.

A) Typically our clusters host multiple distinct sets of services with different lifetimes. Some of these manage their own node pool, as they have their own specific requirements. These services are deployed independently (by different teams, to different logical tenants), so it's possible to have concurrent instances of Terraform running, including in production.

B) We test various configurations in CI and, to save cost and management overhead, these use a shared cluster. We won't serialize operations in CI, as doing so would be quite onerous and would slow our CI. We also can't split these across clusters, as that would require many clusters, making it expensive and difficult to manage. This error is most prevalent in CI for us.

I would propose that anything advertised as retryable by the APIs should be retried within the limit of the configured timeout, which I think covers this case. Per the PR, the logic for cluster create does this, but delete does not, and that also looks like the intent in the code for this resource and elsewhere. Thanks again
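For readers outside the thread: the "configured timeout" referred to above is the per-resource timeouts block on google_container_node_pool. As a hedged sketch only (not something proposed in this issue), raising the delete timeout gives any retry of a retryable error more room when several destroys contend for the same cluster. The resource name, cluster name, location, and 30m values below are assumptions, not taken from the issue.

```hcl
# Sketch only: raise the node pool's delete timeout so that, if deletion is
# retried on a retryable error, it can keep retrying for longer before the
# operation is abandoned. All names and values here are illustrative.
resource "google_container_node_pool" "service_pool" {
  name       = "service-pool"        # assumed name
  cluster    = "shared-ci-cluster"   # assumed shared cluster
  location   = "us-central1"         # assumed location
  node_count = 1

  timeouts {
    create = "30m"
    delete = "30m"   # the "configured timeout" discussed in this issue
  }
}
```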
Fixed by #6334
I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues. If you feel this issue should be reopened, we encourage creating a new issue linking back to this one for added context. If you feel I made an error 🤖 🙉 , please reach out to my human friends 👉 [email protected]. Thanks!
Community Note
If an issue is assigned to the modular-magician user, it is either in the process of being autogenerated, or is planned to be autogenerated soon. If an issue is assigned to a user, that user is claiming responsibility for the issue. If an issue is assigned to hashibot, a community member has claimed the issue already.
Terraform Version
Terraform v0.12.24
Affected Resource(s)
google_container_node_pool
Terraform Configuration Files
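No configuration files were attached to the report. The following is a minimal sketch of the kind of setup described, where a shared cluster is referenced and each deployment manages its own node pool from a separate Terraform run; the project, location, names, and machine type are all assumptions, not taken from the issue.

```hcl
# Deployment A. A second deployment (B) would be identical apart from names
# and would run from its own Terraform state against the same shared cluster.

provider "google" {
  project = "my-project"    # assumed
  region  = "us-central1"   # assumed
}

# The shared cluster is created elsewhere; each deployment only reads it.
data "google_container_cluster" "shared" {
  name     = "shared-ci-cluster"   # assumed
  location = "us-central1"
}

resource "google_container_node_pool" "service_a" {
  name       = "service-a-pool"
  cluster    = data.google_container_cluster.shared.name
  location   = data.google_container_cluster.shared.location
  node_count = 1

  node_config {
    machine_type = "e2-standard-2"
  }
}
```

Running terraform destroy for deployments A and B concurrently then deletes two node pools from the same cluster, which is where the concurrent-operation error described in this issue appears.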
Debug Output
Trace output from failing destroy:
https://gist.github.com/mancaus/014b4d730b442ba17adea09da41a5f42
Expected Behavior
Destroy either:
- succeeds, with node pool deletion retried on the retryable error within the configured timeout (as create already does), or
- fails only after the configured timeout has elapsed.
Actual Behavior
Fails with retryable error after 30 seconds.
Note: concurrent apply works; node pools are created serially within the configured timeout.
Steps to Reproduce
Create two deployments and run each step concurrently for each deployment:
terraform init
terraform apply
terraform destroy
Important Factoids
N/A
References
None