
concurrent google_storage_bucket_iam_member calls result in googleapi: Error 412: Precondition Failed, conditionNotMet #7853

Labels: persistent-bug, size/s

@schaffino

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request.
  • Please do not leave +1 or me too comments, they generate extra noise for issue followers and do not help prioritize the request.
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment.
  • If an issue is assigned to the modular-magician user, it is either in the process of being autogenerated, or is planned to be autogenerated soon. If an issue is assigned to a user, that user is claiming responsibility for the issue. If an issue is assigned to hashibot, a community member has claimed the issue already.

Terraform Version

terraform version v0.12.29
provider version v3.44.0

Affected Resource(s)

  • google_storage_bucket_iam_member

Terraform Configuration Files

Create a shared bucket

provider "google" {
  version = "3.44.0"
  project = var.project
  region  = var.region
}

resource "google_storage_bucket" "concurrent-iam-test-bucket" {
  name          = "concurrent-iam-test-bucket"
  location      = var.location
  project       = var.project
}

Terraform deployment creating a service account and granting it permissions on the shared bucket

provider "google" {
  version = "3.44.0"
  project = var.project
  region  = var.region
}

resource "google_service_account" "test1" {
  account_id   = "concurrent-iam-test1"
  project      = var.project
}

resource "google_storage_bucket_iam_member" "concurrent_iam" {
  bucket = "concurrent-iam-test-bucket"
  role   = "roles/storage.objectAdmin"
  member = "serviceAccount:${google_service_account.test1.email}"
}

Terraform deployment creating a different service account and granting it permissions on the shared bucket

provider "google" {
  version = "3.44.0"
  project = var.project
  region  = var.region
}

resource "google_service_account" "test2" {
  account_id   = "concurrent-iam-test2"
  project      = var.project
}

resource "google_storage_bucket_iam_member" "concurrent_iam" {
  bucket = "concurrent-iam-test-bucket"
  role   = "roles/storage.objectAdmin"
  member = "serviceAccount:${google_service_account.test2.email}"
}

Debug Output

This is displayed in a loop until an eventual timeout, at which point Error applying IAM policy for storage bucket "b/concurrent-iam-test-bucket": Error setting IAM policy for storage bucket "b/concurrent-iam-test-bucket": googleapi: Error 412: Precondition Failed, conditionNotMet is returned.

2020-11-19T16:06:02.231Z [DEBUG] plugin.terraform-provider-google_v3.44.0_x5: {
2020-11-19T16:06:02.231Z [DEBUG] plugin.terraform-provider-google_v3.44.0_x5:   "error": {
2020-11-19T16:06:02.231Z [DEBUG] plugin.terraform-provider-google_v3.44.0_x5:     "code": 412,
2020-11-19T16:06:02.231Z [DEBUG] plugin.terraform-provider-google_v3.44.0_x5:     "message": "Precondition Failed",
2020-11-19T16:06:02.231Z [DEBUG] plugin.terraform-provider-google_v3.44.0_x5:     "errors": [
2020-11-19T16:06:02.231Z [DEBUG] plugin.terraform-provider-google_v3.44.0_x5:       {
2020-11-19T16:06:02.231Z [DEBUG] plugin.terraform-provider-google_v3.44.0_x5:         "message": "Precondition Failed",
2020-11-19T16:06:02.231Z [DEBUG] plugin.terraform-provider-google_v3.44.0_x5:         "domain": "global",
2020-11-19T16:06:02.231Z [DEBUG] plugin.terraform-provider-google_v3.44.0_x5:         "reason": "conditionNotMet",
2020-11-19T16:06:02.231Z [DEBUG] plugin.terraform-provider-google_v3.44.0_x5:         "locationType": "header",
2020-11-19T16:06:02.231Z [DEBUG] plugin.terraform-provider-google_v3.44.0_x5:         "location": "If-Match"
2020-11-19T16:06:02.231Z [DEBUG] plugin.terraform-provider-google_v3.44.0_x5:       }
2020-11-19T16:06:02.231Z [DEBUG] plugin.terraform-provider-google_v3.44.0_x5:     ]
2020-11-19T16:06:02.231Z [DEBUG] plugin.terraform-provider-google_v3.44.0_x5:   }
2020-11-19T16:06:02.231Z [DEBUG] plugin.terraform-provider-google_v3.44.0_x5: }
2020-11-19T16:06:02.232Z [DEBUG] plugin.terraform-provider-google_v3.44.0_x5: 2020/11/19 16:06:02 [DEBUG] Retry Transport: Returning after 1 attempts
2020-11-19T16:06:02.232Z [DEBUG] plugin.terraform-provider-google_v3.44.0_x5: 2020/11/19 16:06:02 [DEBUG] Dismissed an error as retryable. Retry on storage precondition not met - googleapi: Error 412: Precondition Failed, conditionNotMet
2020-11-19T16:06:02.232Z [DEBUG] plugin.terraform-provider-google_v3.44.0_x5: 2020/11/19 16:06:02 [TRACE] Waiting 10s before next try

Panic Output

Expected Behavior

We have a number of situations where many discrete Terraform deployments need to grant specific permissions to a shared bucket that is created and managed in another deployment. We use the resource google_storage_bucket_iam_member as it is only concerned with the addition or removal of the service accounts referenced in the resource definition and does not need to be aware of all permissions on the resource. We expect to be able to run these deployments concurrently and have the accounts defined in each configuration added or removed depending on the Terraform operation taking place.

Actual Behavior

What we are seeing is that when multiple operations attempt to either create or delete IAM permissions on the shared bucket concurrently, only one of these operations will succeed; all other operations will hang until a timeout occurs and we receive the following error:

Error applying IAM policy for storage bucket "b/concurrent-iam-test-bucket": Error setting IAM policy for storage bucket "b/concurrent-iam-test-bucket": googleapi: Error 412: Precondition Failed, conditionNotMet

Steps to Reproduce

  1. Create a shared GCS bucket as in the example tf definition
  2. Run two other terraform deployments concurrently that create service accounts and apply additive permissions to the bucket.
  3. One of these deployments will succeed whilst the other will hang for 20m until a timeout occurs and the precondition error is returned.

Important Factoids

This occurs for both create and delete operations. I assume it also occurs if IAM permissions are being modified by non-Terraform tools.

References

@ghost ghost added the bug label Nov 19, 2020
@venkykuberan venkykuberan self-assigned this Nov 19, 2020
@venkykuberan (Contributor)

@schaffino what's the behavior if you re-run the failed config? Does it succeed? This looks to me more like an API concurrency issue.

@schaffino (Author)

It will succeed the next time, as there is only one permission being applied on the second run. If you remove both permissions and run it again, it will hang again. There are no API rate limits being hit that I can tell.

I have done a quick test with gsutil doing the same thing

With two terminals

while true; do 
  gsutil iam ch -d serviceAccount:some-service-account1 gs://some-bucket;  
  gsutil iam ch  serviceAccount:some-service-account1 gs://some-bucket;
done

and

while true; do 
  gsutil iam ch -d serviceAccount:some-service-account2 gs://some-bucket;  
  gsutil iam ch  serviceAccount:some-service-account2 gs://some-bucket;
done

This does surface periodic errors of the form

gslib.cloud_api.PreconditionException: PreconditionException: 412 Precondition Failed
ERROR 1120 15:13:32.716968 retry_decorator.py] Retrying in x seconds ...

But the gsutil command then succeeds. It seems the retry logic in the provider is just continuously rerunning the same failing API call, whereas gsutil eventually applies the change successfully.

@ghost ghost removed waiting-response labels Nov 20, 2020
@venkykuberan (Contributor)

@schaffino can you attach the full debug log? We want to see the etag value being passed on the retry requests.

@mancaus (Contributor)

mancaus commented Nov 22, 2020

Hi - we have also come across this, per this comment on a potentially related issue.

From further investigation, we have a theory: in iam_storage_bucket.go, updates to bucket policy are retried on 412, per the check in isStoragePreconditionError. In this case the 412 indicates an invalid etag due to the concurrent update (per https://cloud.google.com/storage/docs/generations-preconditions#_Preconditions). However, the retry is performed without updating the etag, and thus fails with the same error.

In the calling code, in iam.go, there is handling for the similar/related "409 - Conflict" return from the request. This causes a retry that AFAICT would update the etag.

We plan to test this theory by removing the predicate from the call to sendRequestWithTimeout(...) and adding a check for 412 in addition to isConflictError(...) in iamPolicyReadModifyWrite. We'd be interested in your thoughts on the theory.
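
For illustration, here is a minimal sketch of the read-modify-write pattern being described, written directly against the GCS JSON API Go client (google.golang.org/api/storage/v1) rather than the provider's own helpers; the function name, retry count, and backoff are illustrative assumptions, not the provider's actual code. The key point is that the policy, and therefore the etag, is re-read before every retry, so a 409 or 412 caused by a concurrent writer only costs one extra round trip.

package gcsiam

import (
	"context"
	"fmt"
	"time"

	"google.golang.org/api/googleapi"
	storage "google.golang.org/api/storage/v1"
)

// addBucketMember grants role to member on bucket using optimistic concurrency:
// the policy is re-read (refreshing the etag) before every attempt instead of
// replaying a request that carries a stale etag.
func addBucketMember(ctx context.Context, svc *storage.Service, bucket, role, member string) error {
	for attempt := 0; attempt < 5; attempt++ {
		// Read the current policy; the returned etag reflects the latest version.
		policy, err := svc.Buckets.GetIamPolicy(bucket).Context(ctx).Do()
		if err != nil {
			return err
		}

		// Modify: append the member to the matching binding, or add a new binding.
		found := false
		for _, b := range policy.Bindings {
			if b.Role == role {
				b.Members = append(b.Members, member)
				found = true
			}
		}
		if !found {
			policy.Bindings = append(policy.Bindings, &storage.PolicyBindings{Role: role, Members: []string{member}})
		}

		// Write: the etag is sent as a precondition. A concurrent update makes the
		// server answer 409 or 412, in which case we loop and re-read instead of
		// resending the same stale etag.
		if _, err = svc.Buckets.SetIamPolicy(bucket, policy).Context(ctx).Do(); err == nil {
			return nil
		}
		if gerr, ok := err.(*googleapi.Error); ok && (gerr.Code == 409 || gerr.Code == 412) {
			time.Sleep(time.Duration(attempt+1) * time.Second) // simple linear backoff
			continue
		}
		return err
	}
	return fmt.Errorf("could not update IAM policy for bucket %q after repeated conflicts", bucket)
}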

@nat-henderson (Contributor)

Thank you, that makes sense! This is very difficult to reproduce - do you have a repro case? I can safely make that change - no harm to it - but it will be hard to know if it works.

@schaffino (Author)

@ndmckinley @venkykuberan

Apologies for the slow response. I noticed that a fix has been put in for this, so I've tested the test case from this issue against current master.

TRACE level logs below

https://gist.github.com/schaffino/1b969e148baa6f6d285104c68986980f

What we see now is that the operation is uninterruptible and continues retrying the failed call for 20 minutes before eventually succeeding, which adds a long delay to any Terraform job encountering the problem.

The code given at the top of this issue will reliably encounter the problem as long as the concurrent applies of both configurations are executed within a second or so of each other.

@nat-henderson (Contributor)

Understood. I think the 412s are probably getting caught in a retry loop that doesn't use the isConflictError check, which would explain why it takes so long. I'll put this back on the pile!
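
For what it's worth, a minimal sketch of the kind of predicate being discussed, using a hypothetical helper name (the provider's real functions are isConflictError and isStoragePreconditionError; this is not their actual signature or location): treating a 412 the same way as a 409 lets the outer read-modify-write loop re-fetch the policy and pick up a fresh etag, instead of a lower-level retry loop replaying the identical request.

package gcsiam

import "google.golang.org/api/googleapi"

// isIamRefreshableError reports whether a SetIamPolicy failure should trigger a
// re-read of the policy (and hence a fresh etag) before retrying. Both 409 and
// 412 mean "someone else changed the policy since we last read it".
func isIamRefreshableError(err error) bool {
	gerr, ok := err.(*googleapi.Error)
	if !ok {
		return false
	}
	return gerr.Code == 409 || gerr.Code == 412
}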

@ghost

ghost commented Jan 23, 2021

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.

If you feel this issue should be reopened, we encourage creating a new issue linking back to this one for added context. If you feel I made an error 🤖 🙉 , please reach out to my human friends 👉 [email protected]. Thanks!

@ghost ghost locked as resolved and limited conversation to collaborators Jan 23, 2021