Bug Report: PowerShell deployment runs forever or fails #1793

Open
jdrepo opened this issue Oct 9, 2024 · 17 comments

@jdrepo

jdrepo commented Oct 9, 2024

Describe the bug

Deploying the ALZ reference implementation with PowerShell doesn't work.

Steps to reproduce

  1. New-AzTenantDeployment -Name "ALZ-Deployment-$(Get-Date -Format 'yyyyMMddTHHMM')" -Location "westeurope" -TemplateUri "https://raw.githubusercontent.com/Azure/Enterprise-Scale/2024-09-03/eslzArm/eslzArm.json" -TemplateParameterFile ".\ALZ-Portal-parametersFile.json" -WhatIf
  2. The cmdlet prints "Getting the latest status of all resources..." and never returns.

It seems to me that the same error/behaviour occurs in the "Test Portal experience" workflow in this repo?

Screenshots

Image

Is this a general problem with PowerShell deployment at the moment?

jdrepo added the bug (Something isn't working) label Oct 9, 2024
@Springstone
Member

@jdrepo I'm busy testing the release of Policy Refresh and have had many issues today. I don't think it's a PowerShell issue, looks like the ARM engine is throttling or having issues. Can I ask you to try again after a bit?

There is also another issue currently impacting testing that we're investigating; it's authentication-related.

@jdrepo
Author

jdrepo commented Oct 9, 2024

@Springstone Thanks for your reply. Yes, I can test it again later. I also had the problem yesterday, and I also think it has nothing to do with PowerShell, because the problem also occurs with Azure CLI deployments. So it is probably a fundamental problem with the ARM backend.

Springstone removed the bug (Something isn't working) label Oct 10, 2024
@Springstone
Member

@jdrepo are you still experiencing issues or can we close the issue?

@jdrepo
Author

jdrepo commented Oct 11, 2024

Hi @Springstone
I haven't tried it again yet. I'm currently out of office; I will try it tomorrow and then give you feedback.
Please keep the issue open. Thanks

@jdrepo
Author

jdrepo commented Oct 12, 2024

Hi @Springstone, I tried again today and the issue still occurs. It seems to me that nothing has changed: the deployment runs forever when deploying with PowerShell.

@Springstone
Member

Springstone commented Oct 14, 2024

@jdrepo could you try running the same deployment without -WhatIf (we suspect recent changes in pre-flight processes are causing a problem), or try running it again with the latest release of ALZ?

New-AzTenantDeployment -Name "ALZ-Deployment-$(Get-Date -Format 'yyyyMMddTHHMM')" -Location "swedencentral" -TemplateUri "https://raw.githubusercontent.com/Azure/Enterprise-Scale/2024-10-09/eslzArm/eslzArm.json" -TemplateParameterFile ".\parameters.json" -WhatIf -Debug

If I run the above it succeeds, takes about 60 seconds.

@jdrepo
Author

jdrepo commented Oct 14, 2024

@Springstone I will give it another try. It seems to me that it depends on the template parameter file: I just tried it with a "small" parameter file containing only the enterpriseScaleCompanyPrefix parameter, and the -WhatIf run succeeds. But with a full-blown parameter file (all parameters set, more than 700 lines) the deployment times out again.
What did you set in your parameters.json in your last test?

@jdrepo
Author

jdrepo commented Oct 14, 2024

@Springstone I did some more tests and I think I found the cause: if I change the deployment location to a region other than westeurope, the what-if deployment runs without any problems.

I tested with swedencentral, eastus and northeurope, like you did, and had no problems (sometimes it took a few minutes, sometimes it ran in under 60 seconds). So it seems to me that there are indeed some issues with the ARM backend in my preferred region, westeurope.

Can you try to run your last deployment against the westeurope region?

@Springstone
Member

@jdrepo yes, looks like it is westeurope that is having issues with whatif. It actually looks like PowerShell hangs, as I can't terminate/break the deployment either, and running with -debug indicates it hangs pretty quickly. Might be related to restrictions in that region?
You may want to open a support ticket for this, and we'll see on our end if we can get someone from engineering to investigate.

@jdrepo
Author

jdrepo commented Oct 15, 2024

@Springstone yes, I can observe the same behaviour. PowerShell hangs and can't be terminated. I left the task running, and after approx. 1 hour it came back with a lot of error messages.

Image

I have no idea why the germanywestcentral region is involved when I deploy the template against the westeurope region.
Maybe the ARM request is rerouted to this region because I'm located in Germany.

Earlier this morning I tried it against the germanywestcentral region, and that ran without any problem.
Now I tried it again and it hangs again, which is very strange.

I can try to open a support ticket, but I don't know if this environment is covered by a support plan...

`DEBUG: ============================ HTTP RESPONSE ============================

Status Code:
OK

Headers:
Cache-Control : no-cache
Pragma : no-cache
x-ms-ratelimit-remaining-tenant-reads: 249
x-ms-request-id : 502dc99b-81a6-4865-89d9-5519e99ef8c1
x-ms-correlation-request-id : 502dc99b-81a6-4865-89d9-5519e99ef8c1
x-ms-routing-request-id : GERMANYWESTCENTRAL:20241015T093338Z:502dc99b-81a6-4865-89d9-5519e99ef8c1
Strict-Transport-Security : max-age=31536000; includeSubDomains
X-Content-Type-Options : nosniff
X-Cache : CONFIG_NOCACHE
X-MSEdge-Ref : Ref A: 7F43E0C7BFB649B7A1D874C76F40E3E0 Ref B: FRA231050415021 Ref C: 2024-10-15T09:33:38Z
Date : Tue, 15 Oct 2024 09:33:38 GMT

Body:
{
"status": "Failed",
"error": {
"code": "DeploymentWhatIfTimeout",
"message": "The request to predict template deployment changes to scope '/' has timed out. Diagnostic information: timestamp '20241015T093337Z', tracking id '57bceef4-2bc1-4523-9dd3-350f66db0e2b', request correlation id 'd69a9d3c-6b34-4fdc-8c59-7f02c3f48e86', location 'germanywestcentral'."
}
}

DEBUG: 11:33:42 - [ResourceManagerCmdletBase.ExecuteCmdlet] Caught unhandled exception: Microsoft.Rest.Azure.CloudException:
DeploymentWhatIfTimeout - Long running operation failed with status 'Failed'. Additional Info:'The request to predict template deployment changes to scope '/' has timed out. Diagnostic information: timestamp '20241015T093337Z', tracking id '57bceef4-2bc1-4523-9dd3-350f66db0e2b', request correlation id 'd69a9d3c-6b34-4fdc-8c59-7f02c3f48e86', location 'germanywestcentral'.'
at Microsoft.Azure.Commands.ResourceManager.Cmdlets.SdkClient.NewResourceManagerSdkClient.ExecuteDeploymentWhatIf(PSDeploymentWhatIfCmdletParameters parameters)
at Microsoft.Azure.Commands.ResourceManager.Cmdlets.Implementation.CmdletBase.DeploymentWhatIfCmdlet.ExecuteWhatIf()
at Microsoft.Azure.Commands.ResourceManager.Cmdlets.Implementation.CmdletBase.DeploymentCreateCmdlet.OnProcessRecord()
at Microsoft.Azure.Commands.ResourceManager.Cmdlets.Implementation.ResourceManagerCmdletBase.ExecuteCmdlet()
DEBUG: 11:33:42 - [ConfigManager] Got nothing from [EnableErrorRecordsPersistence], Module = [], Cmdlet = []. Returning default value [False].
New-AzTenantDeployment:
DeploymentWhatIfTimeout - Long running operation failed with status 'Failed'. Additional Info:'The request to predict template deployment changes to scope '/' has timed out. Diagnostic information: timestamp '20241015T093337Z', tracking id '57bceef4-2bc1-4523-9dd3-350f66db0e2b', request correlation id 'd69a9d3c-6b34-4fdc-8c59-7f02c3f48e86', location 'germanywestcentral'.'
DEBUG: 11:33:42 - [ConfigManager] Got nothing from [DisplayBreakingChangeWarning], Module = [], Cmdlet = []. Returning default value [True].
DEBUG: 11:33:42 - [ConfigManager] Got nothing from [DisplayRegionIdentified], Module = [], Cmdlet = []. Returning default value [True].
DEBUG: 11:33:42 - [ConfigManager] Got nothing from [CheckForUpgrade], Module = [], Cmdlet = []. Returning default value [True].
DEBUG: AzureQoSEvent: Module: Az.Resources:7.5.0; CommandName: New-AzTenantDeployment; PSVersion: 7.4.5; IsSuccess: False; Duration: 01:02:29.6872120; SanitizeDuration: 00:00:00; Exception:
DeploymentWhatIfTimeout - Long running operation failed with status 'Failed'. Additional Info:'The request to predict template deployment changes to scope '/' has timed out. Diagnostic information: timestamp '20241015T093337Z', tracking id '57bceef4-2bc1-4523-9dd3-350f66db0e2b', request correlation id 'd69a9d3c-6b34-4fdc-8c59-7f02c3f48e86', location 'germanywestcentral'.';
DEBUG: 11:33:42 - [ConfigManager] Got [True] from [EnableDataCollection], Module = [], Cmdlet = [].
DEBUG: 11:33:42 - NewAzureTenantDeploymentCmdlet end processing.`
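
To see at a glance which region actually served a request, the x-ms-routing-request-id header from the debug output can be split into its parts. This is only a minimal sketch: ConvertFrom-RoutingRequestId is a hypothetical helper (not part of the Az modules), and it assumes the header always has the REGION:TIMESTAMP:REQUEST-ID shape seen in the log above.

```powershell
# Hypothetical helper, not part of Az. Assumes the header value has the shape
# REGION:TIMESTAMP:REQUEST-ID, as in the debug output in this thread.
function ConvertFrom-RoutingRequestId {
    param(
        [Parameter(Mandatory)][string]$Value
    )

    # Split into at most three parts; the timestamp itself contains no colons.
    $parts = $Value.Trim() -split ':', 3
    if ($parts.Count -ne 3) {
        throw "Unexpected routing request id format: '$Value'"
    }

    [pscustomobject]@{
        Region    = $parts[0].ToLowerInvariant()
        Timestamp = $parts[1]
        RequestId = $parts[2]
    }
}

# Header value copied from the debug output above.
$routing = ConvertFrom-RoutingRequestId 'GERMANYWESTCENTRAL:20241015T093338Z:502dc99b-81a6-4865-89d9-5519e99ef8c1'
$routing.Region
```

Comparing the Region value with the -Location passed to New-AzTenantDeployment makes it obvious when the request was handled somewhere other than the requested region.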

@Springstone
Member

@jdrepo not to worry, I've opened an internal engineering ticket to investigate the WHATIF issue. However, I do believe that if you remove the WHATIF flag the deployment will proceed/succeed. If you find otherwise, please do let me know.

@jdrepo
Author

jdrepo commented Oct 16, 2024

@Springstone yes, as assumed, the "real" deployment does start, so it's indeed only an issue with -WhatIf deployments.
But during the deployment I now encountered some deployment errors. It seems to me that there is a problem with the management group hierarchy synchronization during the deployment flow. Is that a known issue?

Image

{"code":"DeploymentFailed","message":"At least one resource deployment operation failed. Please list deployment operations for details. Please see https://aka.ms/arm-deployment-operations for usage details.","details":[{"code":"InvalidCreatePolicyAssignmentRequest","message":"The policy definition specified in policy assignment 'Deny-MgmtPorts-Internet' is out of scope. Policy definitions should be specified only at or above the policy assignment scope. If the management groups hierarchy changed recently or if assigning a management group policy to new subscription, please allow up to 30 minutes for the hierarchy changes to apply and try again."}]}
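
For anyone triaging a similar failure, the error payload can be parsed to list the inner error codes and flag the ones that carry the "allow up to 30 minutes" hint. This is a rough sketch only: Get-DeploymentErrorDetail is a hypothetical helper name, and it assumes the payload has the code/message/details shape shown above.

```powershell
# Hypothetical helper, not part of Az. Assumes the error JSON has a top-level
# "details" array whose items carry "code" and "message", as in the error above.
function Get-DeploymentErrorDetail {
    param(
        [Parameter(Mandatory)][string]$Json
    )

    $err = $Json | ConvertFrom-Json
    foreach ($detail in $err.details) {
        [pscustomobject]@{
            Code             = $detail.code
            # True when the message carries the hierarchy propagation hint.
            IsHierarchyDelay = $detail.message -like '*please allow up to 30 minutes*'
            Message          = $detail.message
        }
    }
}

# Error payload copied from the failed deployment above.
$errorJson = @'
{"code":"DeploymentFailed","message":"At least one resource deployment operation failed. Please list deployment operations for details. Please see https://aka.ms/arm-deployment-operations for usage details.","details":[{"code":"InvalidCreatePolicyAssignmentRequest","message":"The policy definition specified in policy assignment 'Deny-MgmtPorts-Internet' is out of scope. Policy definitions should be specified only at or above the policy assignment scope. If the management groups hierarchy changed recently or if assigning a management group policy to new subscription, please allow up to 30 minutes for the hierarchy changes to apply and try again."}]}
'@

Get-DeploymentErrorDetail -Json $errorJson | Select-Object Code, IsHierarchyDelay
```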

@Springstone
Member

@jdrepo any errors that include the text "please allow up to 30 minutes for the hierarchy changes to apply and try again" are related to delays in the policy engine registering the policy. We had a mitigation in place to minimize this, but it seems the ARM/Policy engines are struggling to keep up, so we will be extending the deployment wait time to help minimize this issue.

TL;DR: the policy isn't available for assignment yet. If you re-run the deployment with the same parameters, it will succeed.
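
If re-running by hand gets tedious, the re-run can be wrapped in a small retry loop. This is a sketch, not an official ALZ mechanism: Invoke-WithRetry is a hypothetical helper, the attempt count and delay are arbitrary, and it assumes the terminating error text contains the "please allow up to 30 minutes" phrase.

```powershell
# Hypothetical helper, not part of Az or ALZ. Retries only when the error text
# matches $RetryPattern; any other error is rethrown immediately.
function Invoke-WithRetry {
    param(
        [Parameter(Mandatory)][scriptblock]$Action,
        [int]$MaxAttempts = 3,          # arbitrary default
        [int]$DelaySeconds = 300,       # arbitrary default
        [string]$RetryPattern = '*please allow up to 30 minutes*'
    )

    for ($attempt = 1; $attempt -le $MaxAttempts; $attempt++) {
        try {
            return & $Action
        }
        catch {
            $message = $_.Exception.Message
            if ($attempt -eq $MaxAttempts -or $message -notlike $RetryPattern) {
                throw   # out of attempts, or not the hierarchy-delay error
            }
            Write-Warning "Attempt $attempt hit a hierarchy-delay error; retrying in $DelaySeconds seconds."
            Start-Sleep -Seconds $DelaySeconds
        }
    }
}

# Usage sketch (commented out because it needs an authenticated Az session).
# -ErrorAction Stop turns cmdlet errors into terminating errors so the catch block sees them.
# Invoke-WithRetry -Action {
#     New-AzTenantDeployment -Name "ALZ-Deployment-$(Get-Date -Format 'yyyyMMddTHHmm')" `
#         -Location "westeurope" `
#         -TemplateUri "https://raw.githubusercontent.com/Azure/Enterprise-Scale/2024-09-03/eslzArm/eslzArm.json" `
#         -TemplateParameterFile ".\ALZ-Portal-parametersFile.json" `
#         -ErrorAction Stop
# }
```

Restricting retries to the one known transient message keeps genuine template or permission errors from being retried silently.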

@jdrepo
Author

jdrepo commented Oct 17, 2024

@Springstone can you briefly tell me whether and how I can extend the deployment wait time?

For the what-if issue:
Did you get a response from internal engineering? I've opened a support call; would it be helpful to link these calls?

@Springstone
Member

@jdrepo you can't change the wait time yourself, but we've pushed through a patch that increases the wait time by an additional couple of minutes. Please use the latest release (https://github.com/Azure/Enterprise-Scale/tree/2024-10-14).

I am working with engineering on the WHATIF issue, but it seems something has changed, as it suddenly started working this morning. Could you confirm that it is working for you?

@jdrepo
Author

jdrepo commented Oct 21, 2024

@Springstone isn't it possible to change the wait time if I modify the parameter "delayCount" in the template parameter file?

I did another WHATIF deployment against the "westeurope" region and the issue still occurs.
What I can't understand: if I start the deployment against the "westeurope" region, why do I get an error message mentioning the "switzerlandnorth" region?

Image

Image

@Springstone
Member

@jdrepo yes, you can increase the delayCount. I did see it in the portal deployment parameters file; for testing I've been working with other template param files :) that don't include those parameters.
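
As a sketch of how delayCount could be changed without hand-editing a 700-line file: the helper below rewrites one value in a standard ARM parameter file (parameters.<name>.value). Set-TemplateParameterValue is a hypothetical helper, and the value 30 in the usage comment is only a placeholder; this thread does not say what a sensible delayCount is.

```powershell
# Hypothetical helper, not part of Az or ALZ. Assumes the standard ARM
# parameter-file layout: { "parameters": { "<name>": { "value": ... } } }.
function Set-TemplateParameterValue {
    param(
        [Parameter(Mandatory)][string]$Path,
        [Parameter(Mandatory)][string]$Name,
        [Parameter(Mandatory)]$Value
    )

    $doc = Get-Content -Path $Path -Raw | ConvertFrom-Json

    # Refuse to add parameters that are not already in the file.
    if ($null -eq $doc.parameters -or $null -eq $doc.parameters.PSObject.Properties[$Name]) {
        throw "Parameter '$Name' not found in '$Path'."
    }

    $doc.parameters.$Name.value = $Value

    # -Depth 100 keeps nested objects and arrays from being truncated on write.
    $doc | ConvertTo-Json -Depth 100 | Set-Content -Path $Path -Encoding UTF8
}

# Usage sketch (30 is a placeholder, not a recommended value):
# Set-TemplateParameterValue -Path '.\ALZ-Portal-parametersFile.json' -Name 'delayCount' -Value 30
```

Note that ConvertTo-Json reformats the whole file, so expect a noisy diff if the parameter file is under version control.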

I've confirmed with PG this is a transient issue, which is why it sometimes works and sometimes doesn't (very inconsistent). I ran this on the weekend and 9/10 worked fine, one time it failed with a similar error.

I wouldn't worry about the SwitzerlandNorth, as it could be that an RP or part of ARM is running from there - the error message doesn't indicate an actual issue with deployment, just that it's a long running operation (basically it's timed out).
