intermittent failure to create service principal #581

Closed
dimbleby opened this issue Sep 19, 2021 · 10 comments · Fixed by #608

Comments

@dimbleby

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritise this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritise the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Terraform (and AzureAD Provider) Version

terraform v1.0.5, azuread v2.3.0

Affected Resource(s)

  • azuread_service_principal

Terraform Configuration Files

terraform {
  required_providers {
    azuread = {
      source  = "hashicorp/azuread"
      version = ">= 2.3.0"
    }
  }
}

data "azuread_client_config" "current" {}

resource "azuread_application" "foo" {
  count        = 10
  display_name = "foo"
  owners       = [data.azuread_client_config.current.object_id]
}

resource "azuread_service_principal" "foo" {
  count                        = 10
  application_id               = azuread_application.foo[count.index].application_id
  app_role_assignment_required = true
  owners                       = [data.azuread_client_config.current.object_id]
}

Debug Output

Panic Output

Expected Behavior

No error

Actual Behavior

╷
│ Error: Could not create service principal
│
│   with azuread_service_principal.foo[1],
│   on main.tf line 21, in resource "azuread_service_principal" "foo":
│   21: resource "azuread_service_principal" "foo" {
│
│ ServicePrincipalsClient.BaseClient.Post(): unexpected status 403 with OData error: Authorization_RequestDenied: When
│ using this permission, the backing application of the service principal being created must in the local tenant
╵

Steps to Reproduce

I ran a little shell script which keeps going until something fails:

#!/bin/bash

set -euo pipefail

while :
do
  terraform apply -auto-approve
  terraform destroy -auto-approve
done

Important Factoids

References

This is basically a re-opening of #535.

The first time I ran this it failed very quickly, but I didn't have debug logging turned on. I have now turned debug logging on and, as if to spite me, am not seeing failures. But I expect I'll get one sooner or later, and I will upload the logs when I do.
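A minimal sketch of the same apply/destroy loop with debug logging switched on, keeping only the log of the iteration that fails. TF_LOG and TF_LOG_PATH are Terraform's logging environment variables; the function name `repro_loop`, the log directory and the file naming are assumptions made for illustration.

```shell
#!/bin/bash
# Sketch: apply/destroy loop with one debug log per iteration.
# repro_loop and the log layout are hypothetical, not part of the issue.

repro_loop() {
  local log_dir="${1:-tf-debug-logs}"   # assumed directory name
  local i=0
  mkdir -p "$log_dir"

  while :
  do
    i=$((i + 1))
    # Terraform appends debug output to TF_LOG_PATH when TF_LOG is set.
    export TF_LOG=DEBUG
    export TF_LOG_PATH="$log_dir/run-$i.log"

    if ! terraform apply -auto-approve; then
      echo "apply failed on iteration $i; debug log: $TF_LOG_PATH" >&2
      return 1
    fi
    if ! terraform destroy -auto-approve; then
      echo "destroy failed on iteration $i; debug log: $TF_LOG_PATH" >&2
      return 1
    fi

    # This iteration succeeded, so its log is discarded.
    rm -f "$TF_LOG_PATH"
  done
}
```

Calling `repro_loop` (optionally with a directory name) in the Terraform working directory runs until the first failure and leaves that iteration's log in place.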

@dimbleby
Author

dimbleby commented Sep 19, 2021

I've actually run into a different problem before I could reproduce with debug logs...

│ Error: Could not create application
│
│   with azuread_application.foo[4],
│   on main.tf line 15, in resource "azuread_application" "foo":
│   15: resource "azuread_application" "foo" {
│
│ ApplicationsClient.BaseClient.Post(): unexpected status 403 with OData error: Directory_QuotaExceeded: The directory
│ object quota limit for the Principal has been exceeded. Please ask your administrator to increase the quota limit or
│ delete objects to reduce the used quota.
╵

This seems surprising, since I am destroying everything that I create. Not sure whether this is also an azuread bug, or is something funny on Azure where deleted applications aren't "really" deleted for a while...

@dimbleby
Author

Re the above: it turns out that deleted applications aren't permanently deleted but hang around for thirty days. I can't find a programmatic way to delete them permanently; one seems to have to do it one by one through the portal, which is extremely slow and tedious work.

Which is to say that I may not have a repro that includes the debug logs so soon after all.

@manicminer
Contributor

manicminer commented Sep 20, 2021

Hi @dimbleby, you are correct that deleted applications (also users and groups) are only soft deleted. This has been the case for a long time but recently API support has been added to list/purge/restore these deleted object types. We're hoping to add support for this in the provider in the near future.
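As a rough illustration of the list/purge API mentioned above, the following sketch calls the Microsoft Graph `directory/deletedItems` endpoints through `az rest`. It assumes the Azure CLI is already logged in with sufficient directory permissions and that the directory exposes deleted items of the requested type; the function names are hypothetical, and only the first page of results is handled.

```shell
#!/bin/bash
# Sketch: list and permanently delete soft-deleted directory objects.
# Function names are hypothetical; paging of results is not handled.

GRAPH_DELETED_ITEMS="https://graph.microsoft.com/v1.0/directory/deletedItems"

# Print the object IDs of soft-deleted objects of one type
# (default: application), one per line.
list_deleted_ids() {
  local type="${1:-application}"
  az rest --method GET \
    --url "$GRAPH_DELETED_ITEMS/microsoft.graph.$type" \
    --query "value[].id" --output tsv
}

# Permanently delete a single soft-deleted object by its object ID.
purge_deleted() {
  az rest --method DELETE --url "$GRAPH_DELETED_ITEMS/$1" </dev/null
}

# Permanently delete every listed soft-deleted object of the given type.
purge_all_deleted() {
  local id
  list_deleted_ids "${1:-application}" | while read -r id
  do
    if [ -n "$id" ]; then
      purge_deleted "$id"
    fi
  done
}
```

For example, `list_deleted_ids` shows what is still counted as soft-deleted, and `purge_all_deleted application` removes those objects for good, so it should only be run against a tenant used for testing.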

For the original error, please post a debug log when you next encounter it. The error you pasted occurs in various misleading circumstances, including when the application can't be found - this most commonly occurs during replication delays and we're trying to work around these wherever feasible to avoid erroring out in cases where we can safely retry. A debug log will be extremely helpful as many of these corner cases only occur in relatively rare circumstances (as you found out when trying to reproduce!). Thanks!

@dimbleby
Author

dimbleby commented Sep 20, 2021

(I've learned how to delete the soft-deleted applications. But even though I now have none of those left, I still have no quota! I suspect that I also have a pile of soft-deleted service principals, but I can't find a way to delete, or even see, those. So I'm kinda stuck for reproducing this for now.)

@manicminer
Contributor

If you're keen to keep going, you can always create a new tenant for testing, and delete it once you have exhausted the quota :D

(Portal > Create a Resource > Azure Active Directory)

@dimbleby
Author

So although I was failing to reproduce this with the small clean configuration above, I started saving off the debug logs from our pipelines - and this morning we have a repro.

Unfortunately the repro happening in our pipeline means that I can't share our full terraform definitions with you, and also that the logs are noisier than they would have been from the minimal repro. But the application / principal that failed are configured very much like the ones in the fragment above. If there are specific things about the configuration that it would be useful to know please ask.

Logs are at https://gist.github.com/dimbleby/c4a453c26e83d7e535a76a6fb3e25d74.

  • failure is reported at the end of the output, at 11:26:40
  • the only 403 in the logs actually happened a long time earlier, at 11:20:56 (line 4924)
    • is this gap normal?

Although our whole configuration and therefore the log file also has a bunch of AzureRM stuff going on, the section from lines 4000-5000 or thereabouts looks to be mostly AzureAD activity only, so hopefully you'll be able to make sense of it...

@manicminer
Contributor

manicminer commented Sep 29, 2021

@dimbleby That's great, many thanks for posting the log. I can see from the requests that the application was created in the preceding two seconds (:54), and then returned by the API (:55), before the request to create the service principal was sent and rejected (:56). So it looks very much like an eventual consistency issue.

Usually in this scenario we'll see an error response that contains a message like "The appId '00000000...' of the service principal does not reference a valid application object". Perhaps in this case the application ID is colliding with some other app ID out there in the same cloud (app IDs can ultimately come from any other tenant belonging to anyone)? Whatever the cause, we can probably mitigate this with a few retries similarly to how we retry the other error. I'll look at adding this.
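Until a fix is available in the provider, a user-side workaround along the same lines could look like the sketch below: it re-runs a command a limited number of times, but only when the output contains the OData error code seen in this issue. This is not the provider's retry logic. The function name, the `RETRY_PATTERN` variable and the defaults are hypothetical, and matching on the error code (rather than the message text, which may be wrapped across lines) is an assumption.

```shell
#!/bin/bash
# Sketch: bounded retry of a command on a specific error code.
# retry_on_consistency_error and RETRY_PATTERN are hypothetical names.

retry_on_consistency_error() {
  local max_attempts="$1" delay="$2"
  shift 2
  local pattern="${RETRY_PATTERN:-Authorization_RequestDenied}"
  local attempt=1 output status

  while :
  do
    # Capture stdout and stderr together so the error text can be inspected.
    # Output is therefore printed after each attempt, not streamed.
    status=0
    output="$("$@" 2>&1)" || status=$?
    printf '%s\n' "$output"

    if [ "$status" -eq 0 ]; then
      return 0
    fi

    # Any other failure is returned immediately without retrying.
    if ! printf '%s\n' "$output" | grep -Eq "$pattern"; then
      return "$status"
    fi

    if [ "$attempt" -ge "$max_attempts" ]; then
      return "$status"
    fi
    attempt=$((attempt + 1))
    sleep "$delay"
  done
}
```

A call such as `retry_on_consistency_error 3 10 terraform apply -auto-approve` would attempt the apply up to three times, ten seconds apart. Because Terraform records the applications it already created, a repeated apply only has to create the service principals that failed.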

In terms of the time delay between the error response and the apply operation failing, I'd say this is expected. Terraform is doing a lot of things in parallel and it will try to complete any operations in progress at the same time before aborting, to avoid state corruption.

@dimbleby
Author

Super, thanks!

@github-actions

This functionality has been released in v2.5.0 of the Terraform Provider. Please see the Terraform documentation on provider versioning or reach out if you need any assistance upgrading.

For further feature requests or bug reports with this functionality, please create a new GitHub issue following the template. Thank you!

@github-actions

github-actions bot commented Nov 1, 2021

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 1, 2021