
Self-referential error enabling the Service Usage API #9489

Closed

@PatrickDale

PatrickDale commented Jul 1, 2021

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request.
  • Please do not leave +1 or me too comments, they generate extra noise for issue followers and do not help prioritize the request.
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment.
  • If an issue is assigned to the modular-magician user, it is either in the process of being autogenerated, or is planned to be autogenerated soon. If an issue is assigned to a user, that user is claiming responsibility for the issue. If an issue is assigned to hashibot, a community member has claimed the issue already.

Terraform Version

Terraform v1.0.1
on darwin_amd64
+ provider registry.terraform.io/hashicorp/google v3.66.1
+ provider registry.terraform.io/hashicorp/random v3.1.0
+ provider registry.terraform.io/hashicorp/tfe v0.25.2

Affected Resource(s)

  • google_project_service

Terraform Configuration Files

resource "google_project_service" "prerequisites" {
  for_each = toset([
    "serviceusage.googleapis.com",
    "cloudresourcemanager.googleapis.com",
  ])

  project = google_project.my_project.project_id
  service = each.value

  disable_on_destroy         = false
  disable_dependent_services = false
}

Debug Output

https://gist.github.com/PatrickDale/321a68360b28632d19a89144199058f3

Expected Behavior

Running terraform apply should have enabled the Service Usage API (serviceusage.googleapis.com).

Actual Behavior

terraform apply failed with the error:

Error waiting for Enable Project "<project-id>" Services: [serviceusage.googleapis.com]: error while retrieving operation: googleapi: Error 403: Service Usage API has not been used in project <project-number> before or it is disabled.

Steps to Reproduce

  1. terraform apply

Important Factoids

This operation was run in Terraform Cloud using a service account that has permissions to enable services on this GCP project.

It seems like this error is self-referential -- it states that it cannot enable serviceusage.googleapis.com because the Service Usage API is disabled. From the Terraform logs, it looks like the error comes from:

waitErr := serviceUsageOperationWait(config, op, project, fmt.Sprintf("Enable Project %q Services: %+v", project, services), userAgent, timeout)
if waitErr != nil {
	return waitErr
}

After receiving the error, I checked the GCP console and the Service Usage API was enabled. I then ran terraform apply again which passed.

@PatrickDale PatrickDale added the bug label Jul 1, 2021
@leighmhart

Confirmed, seeing the same behaviour with the latest provider.

@leighmhart

@PatrickDale I just tried again on a fresh project and am no longer getting the same issue - I wonder if GCP APIs have fixed this issue quietly?

@ericnorris

Hey @leighmhart, it looks like this is still happening; we just ran into the issue again today:

Error: Request "Enable Project Service \"serviceusage.googleapis.com\" for project \"<a-different-project-id>"" returned error: Error waiting for Enable Project "<a-different-project-id>" Services: [serviceusage.googleapis.com]: error while retrieving operation: googleapi: Error 403: Service Usage API has not been used in project <a-different-project-number> before or it is disabled. Enable it by visiting https://console.developers.google.com/apis/api/serviceusage.googleapis.com/overview?project=<a-different-project-number> then retry. If you enabled this API recently, wait a few minutes for the action to propagate to our systems and retry. Details: [ { "@type": "type.googleapis.com/google.rpc.Help", "links": [ { "description": "Google developers console API activation", "url": "https://console.developers.google.com/apis/api/serviceusage.googleapis.com/overview?project=a-different-project-number>" } ] }, { "@type": "type.googleapis.com/google.rpc.ErrorInfo", "domain": "googleapis.com", "metadata": { "consumer": "projects/<a-different-project-number>", "service": "serviceusage.googleapis.com" }, "reason": "SERVICE_DISABLED" }

Unfortunately we've noticed that this doesn't happen consistently, so it's hard to reproduce.

@nat-henderson
Contributor

Ah, that makes some sense: of course the first request to fetch the operation would hit the Service Usage API endpoint... which was just enabled... and which therefore might not be ready to receive requests yet. To fix this, I wonder if we might simply retry a few times when we get 403s on operation waits for this resource.
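
For illustration, a minimal sketch of that retry idea in Go. Every name here is invented for the example (apiError stands in for googleapi.Error; none of this is the provider's real code):

package main

import (
	"errors"
	"fmt"
	"time"
)

// apiError stands in for googleapi.Error; only the status code matters here.
type apiError struct{ Code int }

func (e *apiError) Error() string { return fmt.Sprintf("googleapi: Error %d", e.Code) }

// waitWithRetries polls an operation, retrying on 403s for a short window:
// a freshly enabled Service Usage API may reject the first few polls before
// the enablement propagates.
func waitWithRetries(poll func() error, maxRetries int, interval time.Duration) error {
	var err error
	for attempt := 0; attempt <= maxRetries; attempt++ {
		if err = poll(); err == nil {
			return nil
		}
		var apiErr *apiError
		// Only 403s are presumed transient; any other error is fatal.
		if !errors.As(err, &apiErr) || apiErr.Code != 403 {
			return err
		}
		time.Sleep(interval)
	}
	return fmt.Errorf("operation still returning 403 after %d retries: %w", maxRetries, err)
}

func main() {
	calls := 0
	poll := func() error {
		calls++
		if calls < 3 {
			return &apiError{Code: 403} // simulate the propagation delay
		}
		return nil
	}
	fmt.Println(waitWithRetries(poll, 5, 10*time.Millisecond)) // <nil>
}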

@ScottSuarez ScottSuarez self-assigned this Sep 15, 2021
@ScottSuarez
Collaborator

ScottSuarez commented Sep 15, 2021

Given the dependent relationship between these two resources, it would make sense to establish an explicit dependency between them. As written, the dependency is implicit and subject to a race condition, so the inconsistency isn't surprising.

resource "google_project_service" "prerequisite0" {
  project = google_project.my_project.project_id
  service = "serviceusage.googleapis.com"
  disable_on_destroy         = false
  disable_dependent_services = false
}

resource "google_project_service" "prerequisite1" {
  for_each = toset([
    "cloudresourcemanager.googleapis.com",
  ])

  project = google_project.my_project.project_id
  service = each.value
  depends_on = [google_project_service.prerequisite0]
  disable_on_destroy         = false
  disable_dependent_services = false
}

If you try the above configuration you should see more stability.
@ndmckinley mentioned we can retry on 403s, which is also a potential fix on our end to brute-force these dependent relationships without explicit declarations.

@rileykarson
Collaborator

@ScottSuarez: Won't the GET request against the operation still fail if there are eventual consistency issues, regardless of the set of services being enabled/disabled in a single batchEnable call?

@ScottSuarez
Collaborator

After speaking with @rileykarson, I am revising my opinion: the dependency chain I mentioned above shouldn't be applicable, since we batch these services into a single request.
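
For context, a hedged sketch of the two HTTP calls involved, with paths taken from the public Service Usage REST docs (auth and error handling omitted; the operation ID is a placeholder). The batch enable itself can succeed while the follow-up poll, which is served by serviceusage.googleapis.com itself, still returns 403:

package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

func main() {
	project := "my-project"

	// 1. Enable both services in one batch request.
	body, _ := json.Marshal(map[string][]string{
		"serviceIds": {"serviceusage.googleapis.com", "cloudresourcemanager.googleapis.com"},
	})
	enableURL := fmt.Sprintf("https://serviceusage.googleapis.com/v1/projects/%s/services:batchEnable", project)
	req, _ := http.NewRequest("POST", enableURL, bytes.NewReader(body))
	req.Header.Set("Content-Type", "application/json")

	// 2. Poll the long-running operation the batch call returns. This GET is
	// answered by the Service Usage API itself, which is why it can 403 right
	// after that same API was enabled.
	pollURL := "https://serviceusage.googleapis.com/v1/operations/<operation-id>"

	fmt.Println(req.Method, req.URL)
	fmt.Println("GET", pollURL)
}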

@ScottSuarez
Collaborator

ScottSuarez commented Sep 15, 2021

@PatrickDale, could you provide more information on your setup? Which service account are you using to configure google_project_service? Is it a service account from a project that already has this API enabled, or a service account within this project, where the API has not yet been enabled?

This is working locally/consistently for me, but I am using a service account located within a project where this API is already enabled. I suspect this is a weird timing issue with the API, since we are polling the operation, which is to say the API accepted our original request. Retrying on 403 is a good option, but I want to see if I can reproduce this first.

@ericnorris

ericnorris commented Sep 15, 2021

@ScottSuarez we actually used to have the configuration set up like your example above, but if I recall correctly we ran into a Google API error saying that serviceusage.googleapis.com depended on cloudresourcemanager.googleapis.com, so our code was changed to enable them both at the same time.

I can speak on behalf of @PatrickDale and say that our setup is a service account in project-a (our "host" project for Terraform Cloud), which is enabling services (and creating other things) in project-b. The host project has these APIs enabled.

@ScottSuarez
Collaborator

ScottSuarez commented Sep 15, 2021

@ericnorris I have an identical setup but wasn't able to replicate. Hmm, could you provide a more detailed DEBUG log of this request, potentially scrubbing any sensitive bits?

You can set TF_LOG=DEBUG and TF_LOG_PATH=./wherever-you-want.log.

I would like to reproduce this if possible.

@ScottSuarez
Collaborator

Also, when scrubbing the logs, please retain the project-a and project-b aliases.

@ericnorris

@ScottSuarez unfortunately this issue is intermittent for us. The same operation (on different projects) has different results, so we don't have debug logs from previous occurrences, and I wouldn't know when to turn them on for future ones.

It's worth noting that we have user_project_override = true and billing_project = project-a in the configuration that is being applied, but since we're running on the 3.74.0 provider version I'm not sure whether these are actually taking effect.

That said, I'm fairly convinced that what @ndmckinley said is correct. Since the code enables the API and then immediately calls the same API to check if it's enabled, it seems likely that there would be a race condition.

@rileykarson
Collaborator

rileykarson commented Sep 16, 2021

I think the quota project will be project-b regardless of version. The project service resource doesn't know to check for a billing project, but it does know to use the resource project when user_project_override is set. So even in later versions where billing_project is applied at the provider level, the resource project will override it, because there's no way to trigger only the provider-level billing_project field. (And we should probably fix that so it knows to check for a billing project.)
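
A toy sketch of that precedence, with every name invented for illustration (this is not the provider's actual resolution code):

package main

import "fmt"

// quotaProject mirrors the precedence described above: when
// user_project_override is set, this resource bills quota to its own
// project, and the provider-level billing_project is never consulted.
func quotaProject(resourceProject, providerBillingProject string, userProjectOverride bool) string {
	if !userProjectOverride {
		return "" // no X-Goog-User-Project header is sent at all
	}
	return resourceProject // billing_project loses to the resource project
}

func main() {
	// Even with billing_project = "project-a", polls are billed to project-b.
	fmt.Println(quotaProject("project-b", "project-a", true)) // project-b
}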

@ScottSuarez: I may have pointed you on a bit of a wild goose chase given that, sorry about that! We're almost certainly dealing with operations needing the service enabled, but enabling the service not needing it, and we'll likely want to retry 403s on the operation here.

@ScottSuarez
Collaborator

Okay, I'll make that change!

@ScottSuarez
Collaborator

We pushed a change just now that should release next Monday. Please let us know if you continue to experience this issue after the change lands. We were unable to reproduce the issue due to its transient nature, but we are confident in the fix.

@ericnorris

Thanks @ScottSuarez! It's been happening to us once every couple days, so we'll be able to confirm whether or not this was fixed for us about a week after the release.

We're also interested in seeing increased support for billing_project, so thank you for the additional PR to address that. It may also solve the issue for us, since we have the serviceusage.googleapis.com API enabled on our "host" project.

@diegosucariaG

Hello there! We are facing the same issue.
It only happens when we try to create multiple projects while enabling multiple APIs on each.

I'll be happy to confirm the fix next week too.

Thanks!

@ScottSuarez
Collaborator

The new version should now be released. Once you are consuming 3.87.0, please let us know if this continues to occur.

@PatrickDale
Author

PatrickDale commented Oct 6, 2021

Hey @ScottSuarez, thanks for the help with this issue! We ran into this issue again today after upgrading to provider version 3.87.0.

Here are some info logs from the run that failed in case they are helpful:

{"@level":"info","@message":"module.module-name[\"project-name\"].google_project_service.prerequisites[\"serviceusage.googleapis.com\"]: Creating...","@module":"terraform.ui","@timestamp":"2021-10-06T17:54:56.014040Z","hook":{"resource":{"addr":"module.module-name[\"project-name\"].google_project_service.prerequisites[\"serviceusage.googleapis.com\"]","module":"module.module-name[\"project-name\"]","resource":"google_project_service.prerequisites[\"serviceusage.googleapis.com\"]","implied_provider":"google","resource_type":"google_project_service","resource_name":"prerequisites","resource_key":"serviceusage.googleapis.com"},"action":"create"},"type":"apply_start"}


{"@level":"info","@message":"module.module-name[\"project-name\"].google_project_service.prerequisites[\"serviceusage.googleapis.com\"]: Creation errored after 7s","@module":"terraform.ui","@timestamp":"2021-10-06T17:55:02.857985Z","hook":{"resource":{"addr":"module.module-name[\"project-name\"].google_project_service.prerequisites[\"serviceusage.googleapis.com\"]","module":"module.module-name[\"project-name\"]","resource":"google_project_service.prerequisites[\"serviceusage.googleapis.com\"]","implied_provider":"google","resource_type":"google_project_service","resource_name":"prerequisites","resource_key":"serviceusage.googleapis.com"},"action":"create","elapsed_seconds":7},"type":"apply_errored"}


{"@level":"error","@message":"Error: Request \"Enable Project Service \\\"serviceusage.googleapis.com\\\" for project \\\"project-name\\\"\" returned error: Error waiting for Enable Project \"project-name\" Services: [serviceusage.googleapis.com]: error while retrieving operation: googleapi: Error 403: Service Usage API has not been used in project project-number before or it is disabled. Enable it by visiting https://console.developers.google.com/apis/api/serviceusage.googleapis.com/overview?project=project-number then retry ...

@ScottSuarez
Collaborator

Do you perhaps have more information here? We added retries on the polling of the operation, and I want to confirm that we did poll that operation several times. This is tricky to fix: since I can't repro it, I can't ensure it's fixed before I check it in.

@ScottSuarez
Collaborator

The only other fix I can think of would be to poll at a larger interval.
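
As a rough sketch of what that could look like (illustrative only, not the provider's actual wait loop):

package main

import (
	"fmt"
	"time"
)

// pollWithBackoff keeps polling until success or timeout, doubling the wait
// between attempts so a slow-propagating enablement gets more breathing room
// than a fixed short interval would.
func pollWithBackoff(poll func() error, initial, max, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	interval := initial
	for {
		err := poll()
		if err == nil {
			return nil
		}
		if time.Now().After(deadline) {
			return fmt.Errorf("timed out after %s: %w", timeout, err)
		}
		time.Sleep(interval)
		interval *= 2
		if interval > max {
			interval = max
		}
	}
}

func main() {
	start := time.Now()
	err := pollWithBackoff(func() error {
		if time.Since(start) < 50*time.Millisecond {
			return fmt.Errorf("not ready yet")
		}
		return nil
	}, 10*time.Millisecond, 80*time.Millisecond, time.Second)
	fmt.Println(err) // <nil>
}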

@ScottSuarez ScottSuarez reopened this Oct 6, 2021
@PatrickDale
Author

We can enable debug output to get more information for the next run that fails. Unfortunately, that is all the output we received from this failed run. And I definitely understand how tricky this is to test, since it is so inconsistent!

I did notice in the info message that enabling the service failed after 7 seconds:

"action":"create","elapsed_seconds":7}

Does that window line up with the amount of time you'd expect to be polling?

@ScottSuarez
Collaborator

Possibly, I'd have to check. Is this latency longer than the previous attempts? If you have those old logs it would be easy to see.

@PatrickDale
Author

It looks to be the same -- here is a gist of the first run we noticed failing, and it also failed after 7 seconds: https://gist.githubusercontent.com/PatrickDale/321a68360b28632d19a89144199058f3/raw/7cab66872dfe5c03a149e406f9bc141a7ce6bb01/gistfile1.txt

@ScottSuarez
Collaborator

From what I can see, everything should be working as intended... This is tricky. If you can get those debug logs, it would be a major help. In the meantime I'll see what I can do.

I might be able to spoof these operation responses with a proxy setup, but that would be expensive time-wise and I'm not sure it's the best approach. I'm going to talk with my colleagues. If you can get those detailed logs, I might gain a bit more insight into what is happening.

@ScottSuarez
Collaborator

To parallelize this effort, I've filed a bug with the internal API team that owns this API. Hopefully they can also provide some sort of workaround on their end to clear up this inconsistency.

googler ref: b/202310048

@ScottSuarez
Collaborator

Unfortunately, this seems infeasible based on the API team's response:

Marking as infeasible. The inconsistencies in the platform aren't exactly intended, even if an expected consequence of the current design, but we hope to make this better for customers as we move forward.

@ScottSuarez
Collaborator

After speaking with their team, I now have a reliable means of reproduction and some better ways to resolve this issue. The billing change could resolve this inconsistency.

Yay... more details to come, sorry for the noise!

@PatrickDale
Author

Nice! Thanks for the info @ScottSuarez!

We will post debug logs here as well if we get another failure with that information.

@c4po

c4po commented Oct 18, 2021

Hi @ScottSuarez

I was able to catch this exception in a test Terraform workspace.
Here is a trace log.
By the way, here is the Terraform code for the test.

@ScottSuarez
Collaborator

ScottSuarez commented Oct 25, 2021

Hi @c4po, I've been pulled in many different directions as of late, so this particular issue snuck away from me. I've set aside some time to play with it this week and hopefully address it.

The permission fix I've checked in might already resolve this issue in your scenario, but I still want to see if I can get the retry logic going.

@ScottSuarez
Collaborator

Note that releases are paused until the 4.0 version of the provider, which should release sometime next week.

@ScottSuarez
Collaborator

Hiya ~ I was wondering if the 4.0 release ended up fixing this for you?

@c4po

c4po commented Nov 17, 2021

So far we have not seen any issues after upgrading to 4.0. Thanks!

@ScottSuarez
Collaborator

Nice ~ I'll go ahead and close this for now. If it happens again, feel free to comment.

@github-actions

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Dec 18, 2021