
Setting IAM conditions on a bucket after creating it in the same request deadlocks #6833

@preston-hf

preston-hf commented Jul 22, 2020

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request.
  • Please do not leave +1 or me too comments, they generate extra noise for issue followers and do not help prioritize the request.
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment.
  • If an issue is assigned to the modular-magician user, it is either in the process of being autogenerated, or is planned to be autogenerated soon. If an issue is assigned to a user, that user is claiming responsibility for the issue. If an issue is assigned to hashibot, a community member has claimed the issue already.

Terraform Version

0.12.28

Affected Resource(s)

  • google_storage_bucket

Terraform Configuration Files

resource "google_storage_bucket" "my_bucket" {
  name = "mybucket"
  labels = {
    controlled-by-terraform = true
  }
  project = var.project_id
  force_destroy = true
}

resource "google_storage_bucket_iam_member" "bkt_access" {
  bucket = google_storage_bucket.my_bucket.name
  member = "serviceAccount:[email protected]"
  role = "roles/storage.objectAdmin"
}

Debug Output

https://gist.github.com/preston-hf/d3872d1a8bdf7914bceab2cc83fa3b42

Expected Behavior

Terraform should have created the bucket and added the IAM access as specified.

Actual Behavior

It creates the bucket, but it is never able to add the google_storage_bucket_iam_member. It gets stuck in a loop: it receives a 412, retries, and receives a 412 again. From the GCP Console it looks like it is working; the activity log shows the bucket IAM being set, and the logs show it being set as well. The 412 in the debug logs suggests a problem with the If-Match header, but the request I can see does not include that header.

Steps to Reproduce

  1. terraform apply

Important Factoids

I believe the fix to #6212 is causing this. It would be better to fail due to eventual consistency than to deadlock completely as it does now. If I kill Terraform (it never terminates gracefully) and apply again, it works just fine. This issue only occurs when creating a bucket and granting access to it in the same apply.

References

@edwardmedia
Contributor

@preston-hf I have tried your config multiple times (all in the same apply) and none of the runs yielded a 412 error code. I think adding a retry should be fine. As a workaround, users could add a sleep between the two resources (see the sketch below).
@slevenick what do you think of this issue?
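
A minimal sketch of the sleep workaround mentioned above, assuming the hashicorp/time provider's time_sleep resource; the resource name and the 30-second duration are illustrative, not from the original comment:

# Delay the IAM grant until some time after the bucket exists.
resource "time_sleep" "wait_for_bucket" {
  depends_on      = [google_storage_bucket.my_bucket]
  create_duration = "30s" # illustrative duration, not from the thread
}

# Same bkt_access resource as in the config above, with an explicit
# dependency on the sleep so it applies only after the delay.
resource "google_storage_bucket_iam_member" "bkt_access" {
  bucket     = google_storage_bucket.my_bucket.name
  member     = "serviceAccount:[email protected]"
  role       = "roles/storage.objectAdmin"
  depends_on = [time_sleep.wait_for_bucket]
}

A fixed delay only papers over eventual consistency; it does not remove the underlying etag race discussed later in this thread.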

@slevenick
Collaborator

@preston-hf How long do you see Terraform hang while creating the IAM member? It should be constrained by the create timeout for the resource, which I believe defaults to 6 or 10 minutes.

@preston-hf
Author

Yeah, I am dealing with a pretty complicated config; I tried to provide one that would trigger it, but maybe there is something further going on. I started out with a local-exec sleep on the bucket, and while it didn't work 100% of the time, it didn't deadlock that I can remember (it's been a couple of months now, IIRC). It will run until I kill it, easily 15+ minutes. It will not terminate gracefully either; it must be force-killed (via the TF Cloud UI, but also Ctrl-C twice locally).

@preston-hf
Author

The logs are accurate; I just sanitized them. It will just loop like that until you kill it.

@preston-hf
Author

I got debug logging turned on and caught this again. Is it possible that two different resources are actually fighting over the same IAM policy? One of them changes it in parallel after the other has already fetched the old policy as part of the previous apply or plan, and it looks like Terraform is making the same request over and over with the same etag. Is there some way to resolve this, such as making one resource manually dependent on the other? Even if I can work around the race that way, I think there is still a bug here: if a PUT request returns a 412, the provider should re-plan that resource or fail outright, not loop on the same request forever.
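
A hedged sketch of the manual-dependency idea mentioned above, using depends_on to force one grant to apply after the other; the second member and its role are hypothetical, not from the original report:

# Hypothetical second grant on the same bucket, forced to apply only after
# the existing bkt_access member above, so the two never update the bucket
# policy (and its etag) at the same time.
resource "google_storage_bucket_iam_member" "bkt_viewer" {
  bucket     = google_storage_bucket.my_bucket.name
  member     = "serviceAccount:[email protected]" # hypothetical member
  role       = "roles/storage.objectViewer"
  depends_on = [google_storage_bucket_iam_member.bkt_access]
}

This only serializes grants within a single configuration; it does not help when separate applies or pipelines touch the same policy.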

@preston-hf
Author

I think the fix in #6235 actually caused this bug. It now retries the request when it gets a 412, but it retries without refreshing the etag. I think it should actually fail in this case, and changes to this resource should be serialized in Terraform. I'm not sure whether that is something the provider needs to handle or whether we can specify it in our Terraform config; I don't really want to make the various iam_member resources arbitrarily dependent on each other just to serialize them. I'm not running into this issue with any other resources, either.

@slevenick
Collaborator

It makes sense that modifying the IAM policy changes the etag, which causes Terraform to keep failing as it retries the apply with the old etag. Generally, two IAM resources on the same parent should use a mutex to prevent simultaneous updates, but with a somewhat asynchronous API I could see something going wrong there.

Maybe we should revert that PR and no longer poll on 412, since it appears to be an unretryable error in some cases.

I'm concerned that this ends up deadlocking, though; that shouldn't be possible given the timeouts used with the retry functionality.

@preston-hf
Author

Yes, I would support rolling that back; it's better to get an error. We are considering just polling with curl before moving on, but it's not clear that would resolve this issue. Based on my reading of the docs, we should be getting a 409 Conflict rather than a 412. Is there any way you could work with GCP to find out exactly why this request returns a 412? I can share logs via Google Docs or something.
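
A rough sketch of the "poll with curl before moving on" idea, assuming the null provider's null_resource with a local-exec provisioner and the GCS JSON API getIamPolicy endpoint; the loop bounds, member email, and overall approach are illustrative rather than what the commenter actually ran:

# Block the rest of the apply until the expected member shows up in the
# bucket's IAM policy (illustrative polling loop).
resource "null_resource" "wait_for_iam" {
  depends_on = [google_storage_bucket_iam_member.bkt_access]

  provisioner "local-exec" {
    command = <<-EOT
      for i in $(seq 1 30); do
        curl -s -H "Authorization: Bearer $(gcloud auth print-access-token)" \
          "https://storage.googleapis.com/storage/v1/b/${google_storage_bucket.my_bucket.name}/iam" \
          | grep -q "[email protected]" && exit 0
        sleep 10
      done
      exit 1
    EOT
  }
}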

@mancaus
Contributor

mancaus commented Aug 13, 2020

We also see this quite frequently. I think it occurs both when applying multiple resource bindings to the same policy and when running concurrent deployments that apply bindings to the same policy. The latter is a common use case in projects that host many deployments, especially when applying project-level policy bindings, but also when applying bindings to the artifacts bucket that backs gcr.io access. Our CI/CD pipelines frequently fail with this error.

Per this, I believe the 412 is the correct response from the API when an etag precondition is not met. That can occur during concurrent updates, but per your note above, the retry logic would need to refresh the etag before retrying, which it currently does not.

Does that make sense? And, is this issue the correct one to track that, or should we raise a new one?

@preston-hf
Author

I think we tracked ours down to a race with the storage notification.

@ghost

ghost commented Jan 23, 2021

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.

If you feel this issue should be reopened, we encourage creating a new issue linking back to this one for added context. If you feel I made an error 🤖 🙉 , please reach out to my human friends 👉 [email protected]. Thanks!

@ghost ghost locked as resolved and limited conversation to collaborators Jan 23, 2021