azuread_service_principal delay before being usable #4

Closed
TechyMatt opened this issue Jul 23, 2018 · 25 comments · Fixed by #86

Comments

@TechyMatt

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Terraform Version

terraform -v
Terraform v0.11.7

  • provider.azurerm v1.10.0

Affected Resource(s)

  • terraform-provider-azurerm_v1.10.0_x4

Terraform Configuration Files

provider "azurerm" {
  version = "1.10.0"
}

resource "azurerm_azuread_application" "test" {
  name                       = "exampleTFapplication"
  available_to_other_tenants = false
  oauth2_allow_implicit_flow = false
}

resource "azurerm_azuread_service_principal" "test" {
  application_id = "${azurerm_azuread_application.test.application_id}"
}

resource "azurerm_azuread_service_principal_password" "test" {
  service_principal_id = "${azurerm_azuread_service_principal.test.id}"
  value                = "BVcKK237/&&)hyz@%nsadasdsa(*&^CC#Nd3"
  end_date             = "2020-01-01T01:02:03Z"
}

resource "azurerm_resource_group" "test" {
  name     = "testResourceGroup1"
  location = "West US"
}

resource "azurerm_role_assignment" "test" {
  depends_on           = ["azurerm_azuread_service_principal.test"]
  scope                = "${azurerm_resource_group.test.id}"
  role_definition_name = "Reader"
  principal_id         = "${azurerm_azuread_service_principal.test.id}"
}

Panic Output

Error: Error applying plan:

1 error(s) occurred:

  • azurerm_role_assignment.test: 1 error(s) occurred:

  • azurerm_role_assignment.test: authorization.RoleAssignmentsClient#Create: Failure responding to request: StatusCode=400 -- Original Error: autorest/azure: Service returned an error. Status=400 Code="PrincipalNotFound" Message="Principal ######################## does not exist in the directory #######-#####-######-#########."

(sensitive details have been hashed out).

Expected Behavior

Here is a config that first creates the Azure AD application and the service principal. It then creates a resource group followed by a role assignment. The idea is that we could have a single Terraform module that lets us onboard new groups into an Azure subscription and generate a dedicated SP for each of them.

Actual Behavior

When the azurerm_azuread_service_principal.test resource is created, there appears to be a delay between creation and the ability to assign it to a role. The depends_on that I've included in the sample code above doesn't help. When I re-run apply a second time it always succeeds, as all other resources already exist.

Steps to Reproduce

  1. terraform apply
@schoren

schoren commented Jul 23, 2018

I had the same issue today. In my case, I fixed it by using the azurerm_azuread_application id instead of the azurerm_azuread_service_principal id. Something like this:

resource "azurerm_azuread_application" "test" {
  name                       = "exampleTFapplication"
  available_to_other_tenants = false
  oauth2_allow_implicit_flow = false
}

resource "azurerm_azuread_service_principal" "test" {
  application_id = "${azurerm_azuread_application.test.application_id}"
}

resource "azurerm_azuread_service_principal_password" "test" {
  service_principal_id = "${azurerm_azuread_service_principal.test.id}"
  value                = "BVcKK237/&&)hyz@%nsadasdsa(*&^CC#Nd3"
  end_date             = "2020-01-01T01:02:03Z"
}

resource "azurerm_resource_group" "test" {
  name     = "testResourceGroup1"
  location = "West US"
}

resource "azurerm_role_assignment" "test" {
  scope                = "${azurerm_resource_group.test.id}"
  role_definition_name = "Reader"
  principal_id         = "${azurerm_azuread_application.test.application_id}"
}

It's a weird behavior, but I got that from the az ad sp create-for-rbac command. When comparing to the Azure Portal, the actual ID used was the application ID.

Hope it helps!

@TechyMatt
Author

@schoren thanks for replying. I just tested this, and when I tried the update I got the response:

Error: Error applying plan:

1 error(s) occurred:

  • azurerm_role_assignment.test: 1 error(s) occurred:

  • azurerm_role_assignment.test: authorization.RoleAssignmentsClient#Create: Failure responding to request: StatusCode=400 -- Original Error: autorest/azure: Service returned an error. Status=400 Code="PrincipalNotFound" Message="Principal 408b56eeXXXXXXXXXXX does not exist in the directory #######-#####-######-#########."

I then confirmed the outputs and they are both different values:

Outputs:

azurerm_azuread_application_id = 408b56eeXXXXXXXXXXX
azurerm_azuread_service_principal_id = b711cba7XXXXXXXXXXX

I looked at the azurerm_role_assignment documentation and it does specifically call out the principal ID is required.

Am I missing something obvious?

@schoren

schoren commented Jul 24, 2018

Yes, yesterday I had a similar issue. I'm checking now to see if it is still happening. In another env, I had successfully deployed and assigned roles to service principals using that method.

@schoren

schoren commented Jul 24, 2018

Ok, now it's working with the original solution, using the azurerm_azuread_service_principal id. Not sure why it worked differently before, but it's working as expected now. Is it working for you?

@joakimhellum
Contributor

joakimhellum commented Jul 24, 2018

@mb290 @schoren this is also the behavior of Azure CLI when using the command az ad sp create-for-rbac, as it pauses execution for 5 seconds and retries role assignment creation up to 36 times, waiting for server replication.

References:
https://github.com/Azure/azure-cli/blob/master/src/command_modules/azure-cli-role/azure/cli/command_modules/role/custom.py#L959
https://github.com/Azure/azure-cli/blob/master/src/azure-cli-core/azure/cli/core/commands/arm.py#L995
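
For illustration, the retry pattern described above (a fixed pause, a capped number of attempts) can be sketched in plain POSIX shell. This is only a sketch of the pattern, not the Azure CLI's actual code. The function name retry_fixed and the variables SP_OBJECT_ID and SCOPE are made up for this example.

```shell
#!/bin/sh
# retry_fixed MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or MAX_ATTEMPTS is reached.
# Returns 0 on success, 1 if every attempt failed.
# (retry_fixed is a hypothetical helper name, not part of any tool.)
retry_fixed() {
  max_attempts=$1
  delay=$2
  shift 2
  attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    if "$@"; then
      return 0
    fi
    echo "Attempt $attempt of $max_attempts failed." >&2
    # No need to sleep after the final attempt.
    if [ "$attempt" -lt "$max_attempts" ]; then
      sleep "$delay"
    fi
    attempt=$((attempt + 1))
  done
  return 1
}

# Hypothetical usage with the numbers quoted above (36 attempts, 5 seconds apart);
# SP_OBJECT_ID and SCOPE are placeholders you would have to set yourself:
# retry_fixed 36 5 az role assignment create --role Reader \
#   --assignee-object-id "$SP_OBJECT_ID" --scope "$SCOPE"
```

With 36 attempts and a 5 second pause, the worst case is roughly three minutes of waiting before giving up.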

To be clear, the Terraform configuration below works most of the time because it waits 30s for server replication using a hack (but sometimes replication takes longer than 30s, and then it fails with the same error you describe above):

provider "azurerm" {
  version = "~> 1.10.0"
}

data "azurerm_subscription" "current" {}

resource "random_string" "password" {
  length = 32
}

resource "random_id" "name" {
  byte_length = 16
}

variable "role" {
  default = "Contributor"
}

variable "end_date" {
  default = "2020-01-01T01:02:03Z"
}

resource "azurerm_azuread_application" "service_principal" {
  name = "${random_id.name.hex}"
}

resource "azurerm_azuread_service_principal" "service_principal" {
  application_id = "${azurerm_azuread_application.service_principal.application_id}"
}

resource "azurerm_azuread_service_principal_password" "service_principal" {
  service_principal_id = "${azurerm_azuread_service_principal.service_principal.id}"
  value                = "${random_string.password.result}"
  end_date             = "${var.end_date}"

  # wait 30s for server replication before attempting role assignment creation
  provisioner "local-exec" {
    command = "sleep 30"
  }
}

resource "azurerm_role_assignment" "service_principal" {
  scope                = "${data.azurerm_subscription.current.id}"
  role_definition_name = "${var.role}"
  principal_id         = "${azurerm_azuread_service_principal.service_principal.id}"
  depends_on           = ["azurerm_azuread_service_principal_password.service_principal"]
}

output "display_name" {
  description = "The Display Name of the Azure Active Directory Application associated with this Service Principal."
  value       = "${azurerm_azuread_service_principal.service_principal.display_name}"
}

output "application_id" {
  description = "The Application ID."
  value       = "${azurerm_azuread_application.service_principal.application_id}"
}

output "object_id" {
  description = "The Object ID for the Service Principal."
  value       = "${azurerm_azuread_service_principal.service_principal.id}"
}

output "password" {
  description = "The Password for this Service Principal."
  value       = "${azurerm_azuread_service_principal_password.service_principal.value}"
}

By contrast, this Terraform configuration doesn't wait for server replication using the above hack, and always fails:

provider "azurerm" {
  version = "~> 1.10.0"
}

data "azurerm_subscription" "current" {}

resource "random_string" "password" {
  length = 32
}

resource "random_id" "name" {
  byte_length = 16
}

variable "role" {
  default = "Contributor"
}

variable "end_date" {
  default = "2020-01-01T01:02:03Z"
}

resource "azurerm_azuread_application" "service_principal" {
  name = "${random_id.name.hex}"
}

resource "azurerm_azuread_service_principal" "service_principal" {
  application_id = "${azurerm_azuread_application.service_principal.application_id}"
}

resource "azurerm_azuread_service_principal_password" "service_principal" {
  service_principal_id = "${azurerm_azuread_service_principal.service_principal.id}"
  value                = "${random_string.password.result}"
  end_date             = "${var.end_date}"
}

resource "azurerm_role_assignment" "service_principal" {
  scope                = "${data.azurerm_subscription.current.id}"
  role_definition_name = "${var.role}"
  principal_id         = "${azurerm_azuread_service_principal.service_principal.id}"
  depends_on           = ["azurerm_azuread_service_principal_password.service_principal"]
}

output "display_name" {
  description = "The Display Name of the Azure Active Directory Application associated with this Service Principal."
  value       = "${azurerm_azuread_service_principal.service_principal.display_name}"
}

output "application_id" {
  description = "The Application ID."
  value       = "${azurerm_azuread_application.service_principal.application_id}"
}

output "object_id" {
  description = "The Object ID for the Service Principal."
  value       = "${azurerm_azuread_service_principal.service_principal.id}"
}

output "password" {
  description = "The Password for this Service Principal."
  value       = "${azurerm_azuread_service_principal_password.service_principal.value}"
}

with the error:

Error: Error applying plan:

1 error(s) occurred:

* azurerm_role_assignment.service_principal: 1 error(s) occurred:

* azurerm_role_assignment.service_principal: authorization.RoleAssignmentsClient#Create: Failure responding to request: StatusCode=400 -- Original Error: autorest/azure: Service returned an error. Status=400 Code="PrincipalNotFound" Message="Principal 12eab7225e744ca7b617876179b68b95 does not exist in the directory ssssssss-ssss-ssss-ssss-ssssssssssss."

Does anyone have a suggestion for a workaround in Terraform? I don't yet understand how a fix for this would be implemented in any of these resources.

I really don't want to use this very ugly hack:

...

resource "azurerm_azuread_service_principal" "service_principal" {
  application_id = "${azurerm_azuread_application.service_principal.application_id}"
}

resource "azurerm_azuread_service_principal_password" "service_principal" {
  service_principal_id = "${azurerm_azuread_service_principal.service_principal.id}"
  value                = "${random_string.password.result}"
  end_date             = "${var.end_date}"

  # wait 30s for server replication before attempting role assignment creation
  provisioner "local-exec" {
    command = "sleep 30"
  }
}

resource "azurerm_role_assignment" "service_principal" {
  scope                = "${data.azurerm_subscription.current.id}"
  role_definition_name = "${var.role}"
  principal_id         = "${azurerm_azuread_service_principal.service_principal.id}"
  depends_on           = ["azurerm_azuread_service_principal_password.service_principal"]
}

...

Many thanks,

@schoren

schoren commented Jul 25, 2018

@joakimhellum-in Thanks for that clarification. It is an ugly workaround, but maybe that's the best we can get. I don't have a very deep understanding of terraform and this provider's inner workings, so I cannot tell if there's a cleaner solution.

For the time being, I think I'll implement what you suggested.

@joakimhellum
Contributor

joakimhellum commented Jul 25, 2018

We really want to avoid using the local-exec provisioner and the sleep command as a workaround, since we'd have to pause execution for approx. 180 seconds to be really sure server replication is done (sometimes server replication takes a long time). Also, we run Terraform on multiple OS/build agents where sleep is not always available. So it would be a really ugly hack. Using az ad sp create-for-rbac would currently be a better alternative for us than using the Terraform resources.

Any suggestions on how to implement a fix for this in terraform is highly appreciated.

Update 1: yes, I really have no idea how to approach fixing this in Terraform other than retrying multiple times on failure like the az CLI does, as the error returned from the API is very generic. Maybe @tombuildsstuff could help with what direction to take here.

Update 2:
FYI, there is another issue, #841, that seems to have the same kind of problem, where retrying was implemented in the resource; ref. https://github.com/terraform-providers/terraform-provider-azurerm/blob/master/azurerm/resource_arm_storage_container.go#L111

Update 3:
#1644 is a bad example of a workaround; I would just like to start this discussion. Any advice appreciated.

Thanks again,
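
As a rough sketch of the "retry on failure" idea from Update 1: because the API reports the problem as a generic 400, one option is to retry only when the output contains the PrincipalNotFound code and to fail fast on anything else. The helper name retry_on_match and its argument order are invented for this example. This is not how the provider or the az CLI is implemented.

```shell
#!/bin/sh
# retry_on_match PATTERN MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Re-runs COMMAND while it fails AND its combined output contains PATTERN.
# Any other failure is reported immediately with the command's exit code.
# (retry_on_match is a hypothetical helper, named here for illustration.)
retry_on_match() {
  pattern=$1
  max_attempts=$2
  delay=$3
  shift 3
  attempt=1
  while :; do
    # Capture stdout and stderr together and remember the exit code.
    output=$("$@" 2>&1) && exit_code=0 || exit_code=$?
    if [ "$exit_code" -eq 0 ]; then
      printf '%s\n' "$output"
      return 0
    fi
    case $output in
      *"$pattern"*) ;;  # looks like replication lag, so it is worth retrying
      *) printf '%s\n' "$output" >&2; return "$exit_code" ;;
    esac
    if [ "$attempt" -ge "$max_attempts" ]; then
      printf '%s\n' "$output" >&2
      return "$exit_code"
    fi
    attempt=$((attempt + 1))
    sleep "$delay"
  done
}

# Hypothetical usage; SP_OBJECT_ID and SCOPE are placeholders:
# retry_on_match PrincipalNotFound 36 5 az role assignment create \
#   --role Contributor --assignee-object-id "$SP_OBJECT_ID" --scope "$SCOPE"
```

Matching on the error text is brittle, but it avoids retrying errors that will never succeed, such as a missing permission.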

@tombuildsstuff tombuildsstuff self-assigned this Jul 25, 2018
@kjhosein

@tombuildsstuff and/or anyone - would you clarify something for me?

It appears (to me at least) that the solution to the various StatusCode=404, ErrorCode=ResourceNotFound issues in the AzureRM provider is to code a fix/retry into the particular resource component. I've noticed multiple such issues here.

Does this mean that you couldn't do something similar to the max_retries option in the AWS provider?

Thanks for any insight!

@LaurentLesle
Contributor

I can confirm I have the same behaviour. This is related to the time to replicate the SP through the Azure AD servers.

My scenario is:

  • Create the azurerm_azuread_application,
  • Create the azurerm_azuread_service_principal
  • Create the azurerm_azuread_service_principal_password
  • Create a Keyvault
  • Assign a policy to that SP in KeyVault
  • Connect to Azure RM provider using that SP to create a secret key

I get the error: AADSTS70001: Application with identifier 'app guid here' was not found in the directory

If I retry terraform apply 1 min later, everything goes through.

@kvolkovich-sc

Have the same issue.

@andresguisado

I have tried with 30s, 60s, 180s and 200s and I am still getting the same issue...

Using az-cli directly is what worked for me, as @joakimhellum-in mentioned previously:

resource "azurerm_azuread_service_principal_password" "app_spn_password" {
  service_principal_id = "${azurerm_azuread_service_principal.app_spn_id.id}"
  value                = "${random_string.password.result}"
  end_date             = "${var.spn_end_date}" #2020-01-01T01:02:03Z  

  provisioner "local-exec" {
    command = "az role assignment create --role ${var.spn_role_definition_name} --assignee-object-id ${azurerm_azuread_service_principal.app_spn_id.id} --scope ${var.spn_scope}"
  }

}

@andresguisado

Did anybody think of querying the AD servers with PowerShell to see whether the SPN has been replicated, and only then carrying on?

http://community.idera.com/database-tools/powershell/ask_the_experts/f/active_directory__powershell_remoting-9/21621/check-if-user-exist-and-is-active-in-ad1-or-ad2

I am not sure if you can do this on Azure AD though...

@clstokes

I'm getting ServicePrincipalNotFound errors for azurerm_kubernetes_cluster resources as well and a subsequent apply works. @tombuildsstuff, should I open a different issue than this one?

@tombuildsstuff
Contributor

@clstokes that sounds like the same underlying issue as this, so we can track that here. Thanks!

@logankp

logankp commented Dec 11, 2018

I'm getting the same issue but I'm not using depends_on. I created the cluster first then added the configuration to create the role assignment. No matter how many times I try to apply it fails.

@katbyte katbyte transferred this issue from hashicorp/terraform-provider-azurerm Jan 10, 2019
@katbyte
Collaborator

katbyte commented Jan 10, 2019

Hi @mb290,

As we are deprecating all Azure AD resources and data sources in the Azure RM provider in 2.0 in favour of this new provider, I have moved the issue here.

@tombuildsstuff tombuildsstuff changed the title azurerm_azuread_service_principal delay before being usable azuread_service_principal delay before being usable Jan 10, 2019
@tombuildsstuff tombuildsstuff removed their assignment Jan 10, 2019
@R0quef0rt

I can confirm that this issue still exists with the new AzureAD provider.

@liamfoneill

I also cannot do role assignments with Terraform for service principals. It works fine for AAD groups, but I get the Status=400 Code="PrincipalNotFound" too. The service principal was created days ago, so I don't think it is the race condition that others seem to be experiencing. If this is being tracked in another issue, @tombuildsstuff can you please post the link here, as I cannot find it.

@stevenicholl

stevenicholl commented Feb 5, 2019

I am also encountering

Original Error: autorest/azure: Service returned an error. Status=400 Code="PrincipalNotFound" Message="Principal 6b3xxxxxxxxxxxx58755xxxx does not exist in the directory xxxxx-xxxx-xxxx-xxxx-xxxxxxxx."

In my scenario the service principal is pre-existing, so it cannot be a timing issue. I am attempting to give an AKS SP permission to act as "Managed Identity Operator" over a User Managed Identity.

When using the respective AZ CLI command as the same user running Terraform, I have no issues.

az role assignment create --role "Managed Identity Operator" --assignee [SP ID] --scope "/subscriptions/[SUBSCRIPTIONID]/resourcegroups/sandbox/providers/Microsoft.ManagedIdentity/userAssignedIdentities/sandbox-mid"

In this example it looks like (as @liamfoneill above) the issue may lie with the azurerm_role_assignment resource.

Resolved for now by running the az cli command via a local-exec. It works for now, but I would much prefer to use the native resource.

@adamrbennett

I've barely tested this, so it's probably flawed, but it worked the first time I tried it:

resource "azuread_service_principal_password" "main" {
  service_principal_id = "${azuread_service_principal.main.id}"
  value = "${var.password}"
  end_date = "${var.end_date}"

  provisioner "local-exec" {
    command = <<EOF
until az ad sp show --id ${azuread_service_principal.main.application_id}
do
  echo "Waiting for service principal..."
  sleep 3
done
EOF
  }
}

At least it's an idea, and someone can probably identify the flaws and improve on it.

@katbyte katbyte modified the milestone: 0.2.0 Feb 10, 2019
@antoinne85

If you happen to be running on Windows (where until is not available), here's another potential workaround:
Drop wait-for-service-principal.ps1 in your working directory and use a local-exec provisioner (similar to the previous option).

wait-for-service-principal.ps1

param(
    [string]$ApplicationId
)

$elapsed = 0;
$delay = 3;
$limit = 5 * 60;

$checkMsg = "Checking for service principal with Application ID $ApplicationId"
Write-Host $checkMsg
$cmd = "az ad sp show --id $ApplicationId";
Invoke-Expression $cmd
while($lastExitCode -ne 0 -and $elapsed -le $limit) {
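    # Build the label by string interpolation; adding an int to "s" would fail to convert "s" to a number.
    $elapsedSeconds = "${elapsed}s";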
    Write-Host "Service principal is not yet available. Retrying in $delay seconds... ($elapsedSeconds elapsed)"
    Start-Sleep -Seconds $delay;
    $elapsed += $delay;

    Write-Host $checkMsg
    Invoke-Expression $cmd;
}

if($lastExitCode -eq 0) {
    Write-Host "Service principal is ready."
    exit 0
}

Write-Host "Service principal did not become ready within the allotted time."
exit 1

resource "azuread_service_principal_password" "ad_principal_pw" {
  service_principal_id = "${azuread_service_principal.ad_principal.id}"
  value = "${var.password}"
  end_date = "${var.end_date}"

  provisioner "local-exec" {
    command    = ".\\wait-for-service-principal.ps1 -ApplicationId \"${azuread_application.ad_app.application_id}\""
    interpreter = ["PowerShell"]
  }
}

@boeboe

boeboe commented Apr 13, 2019

I am having the same issue. Is there a permanent solution on the roadmap? I see this issue was removed from the 0.3.0 milestone.

The workaround with local-exec that waits for "az ad sp show --id ${azuread_service_principal.main.application_id}" does not work either. The exec returns OK and displays the service principal, but it is not yet ready to be consumed by AKS. I guess it is a timing/eventual consistency issue between several Azure APIs.

Sleep 30 was the only way forward for me.
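
One untested idea for the case described here, where a single successful lookup does not mean the principal is usable everywhere: require several successful checks in a row before continuing, and reset the streak on any failure. Nothing in this thread confirms that this is sufficient for AKS. The helper name wait_stable and its parameters are assumptions made for the sketch.

```shell
#!/bin/sh
# wait_stable REQUIRED MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Succeeds once COMMAND has succeeded REQUIRED times in a row.
# Returns 1 if that does not happen within MAX_ATTEMPTS checks.
# (wait_stable is a hypothetical helper; the heuristic itself is unproven.)
wait_stable() {
  required=$1
  max_attempts=$2
  delay=$3
  shift 3
  streak=0
  attempt=0
  while [ "$attempt" -lt "$max_attempts" ]; do
    attempt=$((attempt + 1))
    if "$@" >/dev/null 2>&1; then
      streak=$((streak + 1))
      if [ "$streak" -ge "$required" ]; then
        return 0
      fi
    else
      # A failed check may mean another replica has not caught up yet.
      streak=0
    fi
    sleep "$delay"
  done
  return 1
}

# Hypothetical usage; APPLICATION_ID is a placeholder:
# wait_stable 5 60 3 az ad sp show --id "$APPLICATION_ID"
```

Even with a streak requirement, a plain sleep or a retry around the consuming resource may still be needed, since the check and the consumer can talk to different backends.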

@jlpedrosa
Contributor

Hi!

This also affects AKS clusters, as the SP (or the password) is not ready.

@lukasmrtvy

lukasmrtvy commented May 29, 2019

@adamrbennett

Maybe something like this can replace a resource timeout block.
Also, there is no need to query the API when destroying that resource. (I am not familiar with what local-exec does at destroy time.) It's just another guess.

resource "null_resource" "wait" {

  provisioner "local-exec" {
    command = <<EOF
        COUNTER=$RETRIES
        until [ $COUNTER -eq 0 ] || az ad sp show --id ${azuread_application.application.application_id} -o none
        do
            echo "Waiting for service principal..."
            COUNTER=$((COUNTER - 1))
            sleep $TIMEOUT
        done
    EOF

    environment = {
      TIMEOUT = "5"
      RETRIES = "20"
    }

  }

  provisioner "local-exec" {
    when = "destroy"
    command = "echo 'Wait hook'"
  }

}
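
A caveat on counter loops of this shape: an until loop reports the exit status of the last command in its body (here, sleep), so the provisioner would still succeed when the retries run out. A minimal sketch of the same loop with an explicit result follows. The function name wait_with_counter is made up for this example, and the az call in the usage comment is taken from the snippet above.

```shell
#!/bin/sh
# wait_with_counter RETRIES TIMEOUT_SECONDS COMMAND [ARGS...]
# Same until-loop shape as above, but the final status reflects whether
# COMMAND ever succeeded. (Hypothetical helper name, for illustration only.)
wait_with_counter() {
  counter=$1
  timeout=$2
  shift 2
  until [ "$counter" -eq 0 ] || "$@"; do
    echo "Waiting for service principal..."
    counter=$((counter - 1))
    sleep "$timeout"
  done
  # The loop only ends with counter > 0 when COMMAND succeeded, so use the
  # counter (not the loop's own status) to decide the result.
  [ "$counter" -gt 0 ]
}

# Hypothetical usage; APPLICATION_ID is a placeholder:
# wait_with_counter 20 5 az ad sp show --id "$APPLICATION_ID" -o none || exit 1
```

Returning non-zero lets Terraform mark the provisioner as failed instead of moving on to a role assignment that is likely to fail.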

@ghost

ghost commented Jun 28, 2019

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.

If you feel this issue should be reopened, we encourage creating a new issue linking back to this one for added context. If you feel I made an error 🤖 🙉 , please reach out to my human friends 👉 [email protected]. Thanks!

@ghost ghost locked and limited conversation to collaborators Jun 28, 2019