
Setting IAM conditions on a bucket after creating it in the same request deadlocks #6833

@preston-hf

preston-hf commented Jul 22, 2020

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request.
  • Please do not leave +1 or me too comments, they generate extra noise for issue followers and do not help prioritize the request.
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment.
  • If an issue is assigned to the modular-magician user, it is either in the process of being autogenerated, or is planned to be autogenerated soon. If an issue is assigned to a user, that user is claiming responsibility for the issue. If an issue is assigned to hashibot, a community member has claimed the issue already.

Terraform Version

0.12.28

Affected Resource(s)

  • google_storage_bucket

Terraform Configuration Files

resource "google_storage_bucket" "my_bucket" {
  name = "mybucket"
  labels = {
    controlled-by-terraform = true
  }
  project = var.project_id
  force_destroy = true
}

resource "google_storage_bucket_iam_member" "bkt_access" {
  bucket = google_storage_bucket.my_bucket.name
  member = "serviceAccount:[email protected]"
  role = "roles/storage.objectAdmin"
}

Debug Output

https://gist.github.com/preston-hf/d3872d1a8bdf7914bceab2cc83fa3b42

Expected Behavior

Terraform should have created the bucket and added the IAM access as specified.

Actual Behavior

It creates the bucket, but it is never able to add the google_storage_bucket_iam_member. It gets stuck in a loop: it receives a 412, retries, and receives a 412 again. From the GCP Console it looks like it is working; the activity log shows the bucket IAM being set, and the logs show it being set as well. The 412 in the debug logs suggests a problem with the If-Match header, but the request I can see does not include that header.

Steps to Reproduce

  1. terraform apply

Important Factoids

I believe the fix to #6212 is causing this. It would be better to fail due to eventual consistency than to deadlock completely as it does now. If I kill Terraform (it never terminates gracefully) and apply again, it works just fine. This issue only occurs when creating a bucket and granting access to it in the same apply.

References

@edwardmedia
Contributor

@preston-hf I have tried your config multiple times (all in the same apply) and none of the runs yielded a 412 error code. I think adding a retry should be fine. As a workaround, users could add a sleep between the two resources (see the sketch below).
@slevenick what do you think of this issue?
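
A minimal sketch of the sleep workaround mentioned above, assuming the hashicorp/time provider's time_sleep resource; the resource name and the 30-second duration are illustrative, not from the original comment:

# Delay the IAM grant until some time after the bucket exists.
resource "time_sleep" "wait_for_bucket" {
  depends_on      = [google_storage_bucket.my_bucket]
  create_duration = "30s" # illustrative duration, not from the thread
}

# Same bkt_access resource as in the config above, with an explicit
# dependency on the sleep so it applies only after the delay.
resource "google_storage_bucket_iam_member" "bkt_access" {
  bucket     = google_storage_bucket.my_bucket.name
  member     = "serviceAccount:[email protected]"
  role       = "roles/storage.objectAdmin"
  depends_on = [time_sleep.wait_for_bucket]
}

A fixed delay only papers over eventual consistency; it does not remove the underlying etag race discussed later in this thread.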

@slevenick
Collaborator

@preston-hf How long do you see Terraform hang while creating the IAM member? It should be constrained by the create timeout for the resource, which I believe defaults to 6 or 10 minutes.

@preston-hf
Author

Yeah, I am dealing with a pretty complicated config; I tried to provide one that would trigger it, but maybe there is something further going on. I started out with a local-exec sleep on the bucket, and while it didn't work 100% of the time, it didn't deadlock that I can remember (it's been a couple of months now, IIRC). It will run until I kill it, easily 15+ minutes. It will not terminate gracefully either; it must be force-killed (via the TF Cloud UI, but also Ctrl-C twice locally).

@preston-hf
Author

The logs are accurate; I just sanitized them. It will just loop like that until you kill it.

@preston-hf
Author

I got debug logging turned on and caught this again. Is it possible that two different resources are actually fighting over the same IAM policy? One of them changes it in parallel after the other has already fetched the old policy as part of the previous apply or plan, and it looks like Terraform is making the same request over and over with the same etag. Is there some way to resolve this, such as making one resource manually dependent on the other? Even if I can work around the race that way, I think there is still a bug here: if a PUT request returns a 412, the provider should re-plan that resource or fail outright, not loop on the same request forever.
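
A hedged sketch of the manual-dependency idea mentioned above, using depends_on to force one grant to apply after the other; the second member and its role are hypothetical, not from the original report:

# Hypothetical second grant on the same bucket, forced to apply only after
# the existing bkt_access member above, so the two never update the bucket
# policy (and its etag) at the same time.
resource "google_storage_bucket_iam_member" "bkt_viewer" {
  bucket     = google_storage_bucket.my_bucket.name
  member     = "serviceAccount:[email protected]" # hypothetical member
  role       = "roles/storage.objectViewer"
  depends_on = [google_storage_bucket_iam_member.bkt_access]
}

This only serializes grants within a single configuration; it does not help when separate applies or pipelines touch the same policy.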

@preston-hf
Author

I think the fix in #6235 actually caused this bug. It now retries the request when it gets a 412, but it retries without refreshing the etag. I think it should actually fail in this case, and changes to this resource should be serialized in Terraform. I'm not sure whether that is something the provider needs to handle or whether we can specify it in our Terraform config; I don't really want to make the various iam_member resources arbitrarily dependent on each other just to serialize them. I'm not running into this issue with any other resources, either.

@slevenick
Collaborator

It makes sense that modifying the IAM policy changes the etag, which causes Terraform to keep failing as it retries the apply with the old etag. Generally, two IAM resources on the same parent should use a mutex to prevent simultaneous updates, but with a somewhat asynchronous API I could see something going wrong there.

Maybe we should revert that PR and no longer poll on 412, since it appears to be an unretryable error in some cases.

I'm concerned that this ends up deadlocking, though; that shouldn't be possible given the timeouts used with the retry functionality.

@preston-hf
Author

Yes, I would support rolling that back; it's better to get an error. We are considering just polling with curl before moving on, but it's not clear that would resolve this issue. Based on my reading of the docs, we should be getting a 409 Conflict rather than a 412. Is there any way you could work with GCP to find out exactly why this request returns a 412? I can share logs via Google Docs or something.
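
A rough sketch of the "poll with curl before moving on" idea, assuming the null provider's null_resource with a local-exec provisioner and the GCS JSON API getIamPolicy endpoint; the loop bounds, member email, and overall approach are illustrative rather than what the commenter actually ran:

# Block the rest of the apply until the expected member shows up in the
# bucket's IAM policy (illustrative polling loop).
resource "null_resource" "wait_for_iam" {
  depends_on = [google_storage_bucket_iam_member.bkt_access]

  provisioner "local-exec" {
    command = <<-EOT
      for i in $(seq 1 30); do
        curl -s -H "Authorization: Bearer $(gcloud auth print-access-token)" \
          "https://storage.googleapis.com/storage/v1/b/${google_storage_bucket.my_bucket.name}/iam" \
          | grep -q "[email protected]" && exit 0
        sleep 10
      done
      exit 1
    EOT
  }
}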

@mancaus
Contributor

mancaus commented Aug 13, 2020

We also see this quite frequently. I think it occurs both when applying multiple resource bindings to the same policy and when running concurrent deployments that apply bindings to the same policy. The latter is a common use case in projects that host many deployments, especially when applying project-level policy bindings, but also when applying bindings to the artifacts bucket that backs gcr.io access. Our CI/CD pipelines frequently fail with this error.

Per this, I believe the 412 is the correct response from the API when an etag precondition is not met. That can occur during concurrent updates, but per your note above, the retry logic would need to refresh the etag before retrying, which it currently does not.

Does that make sense? And, is this issue the correct one to track that, or should we raise a new one?

@preston-hf
Author

I think we tracked ours down to a race with the storage notification.

@ghost

ghost commented Jan 23, 2021

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.

If you feel this issue should be reopened, we encourage creating a new issue linking back to this one for added context. If you feel I made an error 🤖 🙉 , please reach out to my human friends 👉 [email protected]. Thanks!

@ghost ghost locked as resolved and limited conversation to collaborators Jan 23, 2021