Setting IAM conditions on a bucket after creating it in the same request deadlocks #6833
Comments
@preston-hf I have tried your code multiple times (all in the same apply) and none yielded the deadlock.
@preston-hf How long are you seeing terraform deadlock on creating the IAM member for? It should be constrained by the timeout for creating the resource, which defaults to 6 or 10 minutes, I believe.
Yeah, I am dealing with a pretty complicated config; I tried to provide one that would trigger it, but maybe there's something further going on. I started out with a local-exec sleep on the bucket; it didn't work 100% of the time, but it didn't deadlock that I can remember (it's been a couple of months now, IIRC). It will run until I kill it, easily 15m+. It will not terminate gracefully either; it must be force-killed (via the TF Cloud UI, but also Ctrl-C twice locally).
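For context, the local-exec sleep workaround was roughly along these lines (bucket name and sleep duration are placeholders, not the actual config):

```hcl
resource "google_storage_bucket" "example" {
  name     = "example-bucket" # placeholder
  location = "US"

  # Crude delay so dependent IAM resources start a bit after bucket creation.
  provisioner "local-exec" {
    command = "sleep 30"
  }
}
```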
The logs are accurate; I just sanitized them. It will just loop doing that until you kill it.
I got debug logging turned on and caught this again. Is it possible that it's actually two different resources fighting over the same IAM policy? One of them changes it in parallel after the other resource has already fetched the old policy as part of the previous apply or plan, and it looks like it's making the same request over and over with the same etag. Is there some way to resolve this? Make one resource dependent on the other one manually? I think there's still a bug here even if I am able to work around the race like that. If the PUT request hits an endpoint that returns a 412, ideally it should replan that resource or fail outright, not loop doing the same request forever.
I think the fix in #6235 actually caused this bug. It now retries the request when it gets a 412, but it retries without refreshing the etag. I think it should actually fail in this case, and we should serialize changes to this resource in Terraform. I'm not sure whether that's something the provider needs to handle, or whether we can specify it in our tf config. I don't really want to make various iam_members arbitrarily dependent on each other just to serialize, and I'm not running into this issue with any other resources.
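For illustration, the kind of artificial dependency between iam_members being discussed would look something like this (roles, members, and the bucket reference are made up):

```hcl
resource "google_storage_bucket_iam_member" "reader" {
  bucket = google_storage_bucket.example.name
  role   = "roles/storage.objectViewer"
  member = "serviceAccount:[email protected]"
}

resource "google_storage_bucket_iam_member" "admin" {
  bucket = google_storage_bucket.example.name
  role   = "roles/storage.objectAdmin"
  member = "serviceAccount:[email protected]"

  # Forces Terraform to apply this binding only after the first one,
  # so the two never race on the same bucket IAM policy etag.
  depends_on = [google_storage_bucket_iam_member.reader]
}
```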
It makes sense that modifying the IAM policy changes the etag, which causes it to keep failing to apply with the old etag. Generally, two IAM resources on the same parent should use a mutex to prevent simultaneous updates, but with somewhat asynchronous APIs I could see something going wrong there. Maybe we should revert that PR and no longer poll on 412, as it seems to be an unretryable error in some cases. I'm concerned that this ends up deadlocking, though; that shouldn't be possible due to the timeouts used with the retry functionality.
Yes, I would support rolling that back; it's better to have an error. We are considering just polling with curl before moving on, but it's not clear that would resolve this issue. Based on my reading of the docs, we should be getting a 409 Conflict, not a 412. Is there any way you could work with GCP to find out exactly why this request is getting a 412? I can send you logs via sharing with gdocs or something.
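A sketch of that kind of curl-based poll, wrapped in a local-exec so it runs before anything downstream (the bucket reference and the gcloud-based auth are just illustrative; the endpoint is the GCS JSON API getIamPolicy):

```hcl
resource "null_resource" "wait_for_bucket_iam" {
  # Poll the bucket's IAM policy endpoint until it responds successfully.
  provisioner "local-exec" {
    command = <<-EOT
      until curl -sf \
        -H "Authorization: Bearer $(gcloud auth print-access-token)" \
        "https://storage.googleapis.com/storage/v1/b/${google_storage_bucket.example.name}/iam" > /dev/null
      do
        sleep 5
      done
    EOT
  }
}
```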
We also see this quite frequently. I think it occurs both when applying multiple resource bindings to the same policy and when running concurrent deployments that apply bindings to the same policy. The latter is a common use case in projects that host many deployments, especially when applying project-level policy bindings, but also when applying bindings to the artifacts bucket that backs gcr.io access. Our CI/CD pipelines frequently fail with this error. Per this, I believe the 412 is the correct response from the API when an etag condition is not met. That can occur during concurrent updates, but per your note above, the retry logic must refresh the etag to retry, which it currently does not. Does that make sense? And is this issue the correct one to track that, or should we raise a new one?
I think we tracked ours down to actually being a race with the storage notification.
I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues. If you feel this issue should be reopened, we encourage creating a new issue linking back to this one for added context. If you feel I made an error 🤖 🙉 , please reach out to my human friends 👉 [email protected]. Thanks!
Community Note
If an issue is assigned to the modular-magician user, it is either in the process of being autogenerated, or is planned to be autogenerated soon. If an issue is assigned to a user, that user is claiming responsibility for the issue. If an issue is assigned to hashibot, a community member has claimed the issue already.
Terraform Version
0.12.28
Affected Resource(s)
google_storage_bucket_iam_member
Terraform Configuration Files
Debug Output
https://gist.github.com/preston-hf/d3872d1a8bdf7914bceab2cc83fa3b42
Expected Behavior
Should have created the bucket and added access as specified
Actual Behavior
It creates the bucket, then is never able to successfully add the google_storage_bucket_iam_member. It gets stuck in a loop where it gets a 412, retries, then gets a 412 again. From the GCP Console, it seems like it's working: the activity log shows it setting bucket IAM, and the logs show it setting it as well. The 412 in the debug logs seems to indicate that it has an issue with the If-Match header, but the request that I see doesn't show it being sent.
Steps to Reproduce
terraform apply
Important Factoids
I believe the fix to #6212 is causing this. It would be better for it to fail due to eventual consistency than the state it is in now, where it totally deadlocks. If you kill terraform (it will never gracefully terminate) and apply again, it works just fine. This issue only occurs when you're trying to create a bucket and grant access to it in the same apply.
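A minimal configuration that exercises this pattern (not the actual sanitized config from the gist; names and member are placeholders) would be roughly:

```hcl
resource "google_storage_bucket" "repro" {
  name     = "repro-deadlock-bucket" # placeholder
  location = "US"
}

# Created in the same apply as the bucket above; this is the resource
# that ends up looping on 412 responses.
resource "google_storage_bucket_iam_member" "repro" {
  bucket = google_storage_bucket.repro.name
  role   = "roles/storage.objectViewer"
  member = "serviceAccount:[email protected]"
}
```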
References