Add "wait" and "retry" deployment options #1013
Comments
Understood. This is something we have been considering, but we haven't scheduled the work yet. If you (or others) have other examples that you have run into, it would be great to capture those here. I know RBAC replication (and replication delays in general) is another place where something like this would be helpful. |
@alex-frankel I'm assuming this is something we're planning on also addressing in the underlying platform? This feels like a leaky abstraction, not something that the end-user should have to deal with by adding delays. |
Agreed. @bmoore-msft and I were also discussing this yesterday. Ideally, ARM will co-locate all the calls end-to-end so a user never has to think about this. Not sure if/when that will be possible, and this may be a necessary evil in the meantime. |
The OP doesn't sound like replication (feels like concurrency), though I could see that you could potentially address both with something like retry. The problem in this case (or really either case) is indefinite postponement. This feels like a problem with the RP: common operations returning frequent 400s instead of maybe 429. The challenge with this workaround is that not only does the user have to fail and then implement a non-deterministic workaround (that's expensive on the service), it will also mask problems across ARM, RPs and user code. @rshariy - have you raised this issue with the RSV team? It doesn't appear to be an uncommon problem and seems like it should be addressed by the RSV... either it shouldn't happen or we're not helping customers figure out how to use RSV effectively. |
@bmoore-msft I raised a similar issue with the Azure Firewall product team about a year ago - the only solution we found is to use a PowerShell function to check Azure FW status (make sure it is not "updating") before kicking off a new ARM deployment to the FW. Just logged ticket 120120226003381 about the RSV issue - let's see what MS support will come up with. |
this point is what gives us caution on implementing something like this. We have some potential solutions to deal with the replication delay in particular that we will explore before introducing a wait. @rshariy - please let us know the resolution of the case. |
I have a main template that looks like this:
module kv 'keyvault.bicep' = {
name: 'kvSmoketestDeploy'
scope: rg
params: {
keyVaultName: keyVaultName
enableSoftDelete: false
}
}
module kvaccpol 'keyvaultaccesspolicy.bicep' = {
name: 'kvAccPolSmoketestDeploy'
scope: rg
params: {
keyVaultName: keyVaultName
action: 'add'
objectId: objectId
access: keyVaultAccessPolicyAccess
}
}
When that runs, the deployment breaks with:
{
"error": {
"code": "ParentResourceNotFound",
"message": "Can not perform requested operation on nested resource. Parent resource 'kv-kvaccpoltest' not found."
}
}
(Code:NotFound)
Running the deployment again deploys the policy. |
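In the template above, the two modules share only the `keyVaultName` parameter, so nothing tells the deployment engine that the access policy must wait for the vault. If that is indeed the cause of the `ParentResourceNotFound` error (an assumption, since the module contents are not shown), an explicit `dependsOn` on the second module is one way to order them. A minimal sketch:

```bicep
// Sketch only: the same module as above with an explicit dependency added.
// Assumption: nothing else links kvaccpol to kv, so both may otherwise run in parallel.
module kvaccpol 'keyvaultaccesspolicy.bicep' = {
  name: 'kvAccPolSmoketestDeploy'
  scope: rg
  params: {
    keyVaultName: keyVaultName
    action: 'add'
    objectId: objectId
    access: keyVaultAccessPolicyAccess
  }
  // Start the access-policy deployment only after the vault module has finished.
  dependsOn: [
    kv
  ]
}
```

Passing a module output (for example a hypothetical `kv.outputs.name`) instead of the shared parameter would create the same ordering implicitly.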
I ran into a scenario where I'd like a wait. There's not much code to show: I'm basically deploying a FunctionApp and then want to output the default key for use in API Management. The problem is that the function app takes some time to spin up before the app keys are present...
resource functionApp 'Microsoft.Web/sites@2020-06-01' = {
name: functionAppName
location: location
kind: 'functionapp'
...
output functionappdefaultkey string = listKeys('${functionApp.id}/host/default', functionApp.apiVersion).functionKeys.default
The workaround is to run the initial deployment of the function app twice. |
@eja-git this isn't a "wait" scenario, it's a bug in the deployment engine job scheduling... the listKeys job is scheduled too early... so that's the fix for your particular scenario. |
Hi, I've logged the following issue projectkudu/kudu#3312 (comment) that could also benefit from the wait option during a deployment. Best Regards |
I am trying to simplify firewall rule collection deploying by using
here is the error I get
I am sure that a short delay between deployments would help us to loop through all array |
Only one Rule Collection Group can be updated at a time with Azure Firewall Policy. Since the update refreshes all of the connected Azure Firewall instances, the amount of time it takes to update is non-deterministic. Therefore you will need to serialize the deployment using the Can you try:
|
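The exact suggestion in the comment above is missing from this copy of the thread. One Bicep feature that serializes a resource loop is the `@batchSize(1)` decorator. The sketch below is an illustration with assumed parameter names and an assumed API version, not a reconstruction of the original advice:

```bicep
// Hypothetical parameters; the original template is not shown in the thread.
param firewallPolicyName string
// Each item is assumed to have: name, priority, ruleCollections.
param ruleCollectionGroups array

resource policy 'Microsoft.Network/firewallPolicies@2023-04-01' existing = {
  name: firewallPolicyName
}

// batchSize(1) deploys the loop items one at a time instead of in parallel.
@batchSize(1)
resource groups 'Microsoft.Network/firewallPolicies/ruleCollectionGroups@2023-04-01' = [for g in ruleCollectionGroups: {
  parent: policy
  name: g.name
  properties: {
    priority: g.priority
    ruleCollections: g.ruleCollections
  }
}]
```

Serializing the loop this way avoids concurrent updates to the same policy, but it does not add any delay between iterations.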
I have two scenarios that come to mind from recent experience. Overarching enterprise management level policy being applied to a resource that has been created which I reference in next resource/module causing the Another Operation error. A retry would be useful here as I have no control or influence over the Policies. I have also faced situations where a newly created resource is not available when referenced immediately afterwards which I assume is a replication/caching issue as the next run works flawlessly. |
My scenario includes creating a Cosmos Account, this typically takes a few minutes and sometimes up to 10 minutes. In this case I am unable to use the resource output to set the connection string for use in subsequent modules e.g. passing into keyVault and functionAppSettings |
@markjbrown - do you mind taking a look at this one? I'd expect the Cosmos Account not to report complete until it is fully provisioned. @zapadoody -- do you happen to have the code sample of the repro and a correlation ID from when the error occurred? |
For run-time deployment errors you should raise a support ticket, as support is best equipped to diagnose specific errors with an activity id. However, I am happy to look at an existing bicep file to see if there are any issues. I do have a sample on how to output the endpoint and key from a Cosmos account and input them into appSettings for an App Service here if that helps. |
here's my cosmosAccount.bicep
param location string
param cosmosAccountName string
param cosmosDefaultConsistencyPolicy string
param cosmosPrimaryRegion string
param cosmosSecondaryRegion string
var lowerCosmosAcctName = toLower(cosmosAccountName)
var locations = [
{
locationName: cosmosPrimaryRegion
failoverPriority: 0
isZoneRedundant: false
}
{
locationName: cosmosSecondaryRegion
failoverPriority: 1
isZoneRedundant: false
}
]
resource cosmosAccountResource 'Microsoft.DocumentDB/databaseAccounts@2021-06-15' = {
name: lowerCosmosAcctName
kind: 'GlobalDocumentDB'
location: location
properties: {
locations: locations
databaseAccountOfferType: 'Standard'
enableAutomaticFailover: true
consistencyPolicy: {
defaultConsistencyLevel: cosmosDefaultConsistencyPolicy
}
}
}
output cosmosAccountResourceName string = cosmosAccountResource.name
here's the KeyVault.bicep
param location string
param keyVaultName string
param productionPrincipalId string
param productionTenantId string
param stagingPrincipalId string
param stagingTenantId string
@secure()
param cosmosPrimaryConnectionString string
@secure()
param cosmosSecondaryConnectionString string
@secure()
param serviceStorageConnectionString string
@secure()
param appStorageConnectionString string
resource keyVault 'Microsoft.KeyVault/vaults@2019-09-01' = {
name: keyVaultName
location: location
properties: {
enabledForDeployment: true
enabledForTemplateDeployment: true
enabledForDiskEncryption: true
tenantId: productionTenantId
accessPolicies: [
{
tenantId: productionTenantId
objectId: productionPrincipalId
permissions: {
secrets: [
'get'
'list'
]
}
}
{
tenantId: stagingTenantId
objectId: stagingPrincipalId
permissions: {
secrets: [
'get'
'list'
]
}
}
]
sku: {
name: 'standard'
family: 'A'
}
}
}
resource cosmosPrimaryConnectionStringSecret 'Microsoft.KeyVault/vaults/secrets@2019-09-01' = {
name: '${keyVaultName}/cosmosPrimaryConnectionString'
properties: {
value: cosmosPrimaryConnectionString
}
dependsOn:[
keyVault
]
}
resource cosmosSecondaryConnectionStringSecret 'Microsoft.KeyVault/vaults/secrets@2019-09-01' = {
name: '${keyVaultName}/cosmosSecondaryConnectionString'
properties: {
value: cosmosSecondaryConnectionString
}
dependsOn:[
keyVault
]
}
resource serviceStorageConnectionStringSecret 'Microsoft.KeyVault/vaults/secrets@2019-09-01' = {
name: '${keyVaultName}/dbConnectionString'
properties: {
value: serviceStorageConnectionString
}
dependsOn:[
keyVault
]
}
resource appStorageConnectionStringSecret 'Microsoft.KeyVault/vaults/secrets@2019-09-01' = {
name: '${keyVaultName}/appStorageConnectionString'
properties: {
value: appStorageConnectionString
}
dependsOn:[
keyVault
]
}
output appStorageConnectionStringUri string = appStorageConnectionStringSecret.properties.secretUri
output serviceStorageConnectionStringUri string = serviceStorageConnectionStringSecret.properties.secretUri
output cosmosPrimaryConnectionStringUri string = cosmosPrimaryConnectionStringSecret.properties.secretUri
output cosmosSecondaryConnectionStringUri string = cosmosSecondaryConnectionStringSecret.properties.secretUri
and here's the main.bicep
/// cosmos db account, database and container module
module cosmosAccountMod '../cosmosAccount.bicep' = {
name: 'cosmosAccount-${environmentName}-${buildNumber}'
params: {
cosmosAccountName: cosmosAccountName
cosmosDefaultConsistencyPolicy: cosmosDefaultConsistencyPolicy
cosmosPrimaryRegion: cosmosPrimaryRegion
cosmosSecondaryRegion: cosmosSecondaryRegion
location: location
}
}
module cosmosDatabaseMod '../cosmosDbContainer.bicep' = {
name: 'cosmosDBContainer-${environmentName}-${buildNumber}'
params: {
cosmosAccountName: cosmosAccountMod.outputs.cosmosAccountResourceName
cosmosContainerName: cosmosContainerName
cosmosDatabaseName: cosmosDatabaseName
cosmosThroughput: cosmosThroughput
}
dependsOn: [
cosmosAccountMod
]
}
// storage account module - storage for the tenants application
module appStorageAccountMod '../storageAccount.bicep' = {
name: 'appStorageAcctName-${environmentName}-${buildNumber}'
params: {
storageAcctName: appStorageAcctName
storageSkuName: appStorageAcctSku
location: location
}
}
// app insights module
module appInsightsMod '../appInsights.bicep' = {
name: 'appInsightsName-${environmentName}-${buildNumber}'
params: {
name: appInsightsName
resourceGroupLocation: location
}
}
// app service plan module
module appServicePlanMod '../appServicePlan.bicep' = {
name: 'appServicePlan-${environmentName}-${buildNumber}'
params: {
appSvcPlanSku: appSvcPlanSku
appSvcPlanTier: appSvcPlanTier
appSvcPlanName: appSvcPlanName
appPlanLocation: location
}
}
// function app module
module functionAppMod '../functionApp.bicep' = {
name: 'functionApp-${environmentName}-${buildNumber}'
params: {
appSvcPlanName: appSvcPlanName
functionAppName: functionAppName
location: location
}
dependsOn: [
appStorageAccountMod
appServicePlanMod
cosmosAccountMod
]
}
// service storage account module - storage for the function app
module serviceStorageAccountMod '../storageAccount.bicep' = {
name: 'serviceStorageAcctName-${environmentName}-${buildNumber}'
params: {
storageAcctName: serviceStorageAcctName
storageSkuName: serviceStorageAcctSku
location: location
}
}
// key vault module
module keyVaultMod '../keyVault.bicep' = {
name: 'keyVaultName-${environmentName}-${buildNumber}'
params: {
keyVaultName: keyVaultName
location: location
cosmosPrimaryConnectionString: listConnectionStrings(resourceId('Microsoft.DocumentDB/databaseAccounts', cosmosAccountName), '2020-04-01').connectionStrings[0].connectionString
cosmosSecondaryConnectionString: listConnectionStrings(resourceId('Microsoft.DocumentDB/databaseAccounts', cosmosAccountName), '2020-04-01').connectionStrings[1].connectionString
productionPrincipalId: functionAppMod.outputs.productionPrincipalId
productionTenantId: functionAppMod.outputs.productionTenantId
stagingPrincipalId: functionAppMod.outputs.stagingPrincipalId
stagingTenantId: functionAppMod.outputs.stagingTenantId
serviceStorageConnectionString: serviceStorageAccountMod.outputs.storageAccountConnectionString
appStorageConnectionString: appStorageAccountMod.outputs.storageAccountConnectionString
}
dependsOn:[
functionAppMod
cosmosAccountMod
cosmosDatabaseMod
]
}
// function app settings module
module functionAppSettingMod '../functionAppSettings.bicep' = {
name: 'functionAppSettings-${environmentName}-${buildNumber}'
params: {
appInsightsKey: appInsightsMod.outputs.appInsightsKey
cosmosConnectionStringUri: keyVaultMod.outputs.cosmosPrimaryConnectionStringUri
appStorageConnectionStringUri: keyVaultMod.outputs.appStorageConnectionStringUri
serviceStorageConnectionStringUri: keyVaultMod.outputs.serviceStorageConnectionStringUri
functionAppName: functionAppMod.outputs.prodSlotFunctionAppName
functionAppStagingName: functionAppMod.outputs.stagingSlotFunctionAppName
}
dependsOn:[
functionAppMod
appInsightsMod
cosmosAccountMod
keyVaultMod
]
} |
Also, to clarify: previously I was using the output in the cosmosAccount.bicep but changed to the query approach to try and get away from the error. Thanks for the tip on raising the support ticket. |
@alex-frankel Can you take a look at that? It seems the dependsOn is being fulfilled with the ack of the started and/or accepted responses rather than succeeded |
@alex-frankel any thoughts on the bicep here? Also I have opened a support case for this if you need that ref # let me know and I can send direct. |
The problem is this listConnectionStrings function. I've never seen it before. I tried testing in an ARM template and it doesn't work (not sure why the template didn't fail validation). If you want to output the endpoint and keys use this syntax below. To make it as a connection string just concat them together with "AccountEndpoint=" and ";AccountKey=" "[reference(resourceId('Microsoft.DocumentDB/databaseAccounts', variables('cosmosAccountName'))).documentEndpoint]" |
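In Bicep, the endpoint-plus-key approach described above could look roughly like the sketch below. It assumes the Cosmos account and the key vault already exist in the current resource group when this module runs. The symbolic names are illustrative, and the secret name is borrowed from the KeyVault.bicep posted earlier:

```bicep
// Assumption: both the account and the vault exist before this module is deployed.
param cosmosAccountName string
param keyVaultName string

resource cosmosAccount 'Microsoft.DocumentDB/databaseAccounts@2021-06-15' existing = {
  name: toLower(cosmosAccountName)
}

resource cosmosPrimaryConnectionStringSecret 'Microsoft.KeyVault/vaults/secrets@2019-09-01' = {
  name: '${keyVaultName}/cosmosPrimaryConnectionString'
  properties: {
    // Endpoint comes from the account properties, the key from the listKeys operation.
    value: 'AccountEndpoint=${cosmosAccount.properties.documentEndpoint};AccountKey=${cosmosAccount.listKeys().primaryMasterKey}'
  }
}
```

Writing the value straight into the secret keeps the key out of deployment outputs.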
@markjbrown apologies thank you for the assistance!!! |
We're hitting similar problems when deploying Azure SQL. We have a template that deploys a logical Azure SQL server and then performs a number of additional configuration steps such as enabling audit, adding an AD Admin user, setting the connection policy, configuring firewall rules and adding elastic pools. All of these child resources use 'dependsOn' to ensure that they run one after the other in series rather than in parallel. Most of the time this works, but occasionally the template deployment fails with an 'Internal Server Error'. When we raise this with the Microsoft support team they just tell us "The server is currently busy. Please wait a few minutes and try again." Retrying the template deployment doesn't always work, and there is no built-in mechanism to add this delay. In this particular case I'd have thought a better response would be a 429 rather than a 500, so that the deployment of each child resource can be automatically tried again with an exponential backoff between each retry. It's little issues like this that make working with ARM such a frustrating experience. Just because something deployed OK once, there's no guarantee that it will deploy successfully the next time. |
When Entra Domain Services (previously Azure AD Domain Services / AADDS) is deployed via Bicep the deployment completes within Bicep, but the actual resource remains in the "Deploying" state in Azure for at least ~20 minutes longer. A wait/retry mechanism would help ensure the service is fully provisioned before further deployments kick off that depend on it, or at least allow them to retry. |
Yet another case when you're trying to assign >1 federated identity to an uami within the same module:
PS: |
Hello, I would like to know whether this very necessary feature is still being worked on. Here is another example of what is happening: I have to create a vnet and multiple subnets, and I have one module for vnets and another module for subnets. In main, I call each module as follows: the vnet module with its parameters, then the subnet module with its parameters, a dependsOn on the vnet module name, and a for loop that reads the object describing the subnets to create. What happens is that sometimes, when subnet 0 is created, Azure Deployment has not finished that operation by the time subnet 1 is sent for creation, and an error appears saying that a previous creation is still in progress and the next subnet cannot be created, which breaks the deployment. Does anyone have an idea how else I can solve this problem? Or maybe MS can help us with this valuable feature of adding wait times to modules. |
This is a very different issue. If you're expecting to be able to redeploy the module for your virtual network, you'll need to make sure you create your subnets with the virtual network, not separately (that's an anti-pattern). If you try to redeploy your virtual network only (no subnets) once you have created subnets and deployed resources in them, the deployment of the virtual network will attempt to delete your subnets, which is neither desired nor possible and will thus cause your virtual network deployment to fail. If you are looking to deploy additional subnets in an existing virtual network (and will then never again deploy the virtual network unless you pull the full subnet configuration again), then you need to use the |
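The first recommendation in the reply above, creating the subnets together with the virtual network, can be sketched as follows. The parameter shape, resource name, address ranges and API version are all assumptions for illustration:

```bicep
param location string = resourceGroup().location
// Hypothetical shape: [{ name: 'snet-app', addressPrefix: '10.0.0.0/24' }, ...]
param subnets array

resource vnet 'Microsoft.Network/virtualNetworks@2023-04-01' = {
  name: 'vnet-example'
  location: location
  properties: {
    addressSpace: {
      addressPrefixes: [
        '10.0.0.0/16'
      ]
    }
    // Subnets are declared as part of the virtual network, so a single
    // request creates them all and no two subnet operations overlap.
    subnets: [for s in subnets: {
      name: s.name
      properties: {
        addressPrefix: s.addressPrefix
      }
    }]
  }
}
```

Because the subnet list is part of the virtual network definition, redeploying this template reapplies the full subnet configuration rather than attempting to remove it.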
what is the status? |
I have another, similar issue deploying a Front Door profile and a metricAlert in the same deployment. 'Microsoft.Cdn/profiles@2023-05-01 The error is "Couldn't find a metric named OriginHealthPercentage" |
Just to be clear here: Isn't that contradicting the statement from the documentation?
|
Whatever solution is planned for this, will it be Bicep-specific or will it be available in ARM-templates as well? I encountered an issue with Azure Policy where I use a policy-set containing a number of policies that each enables a given Defender for Cloud plan (Storage, CosmosDB, ARM, etc) if it is not enabled for a given subscription (each policy uses the |
@mattias-fjellstrom Likely ARM-level given that Bicep is generating ARM under the hood for deployments (as evidenced by the artifacts in Azure following such a deployment). |
@WhitWaldo is correct! |
@WhitWaldo Very true, that makes sense 👍🏻 |
Has this been assigned or further discussed, @alex-frankel? We're an ISV with an Azure managed application in the marketplace, so IaC-based environments are part of our CI/CD. There are a few classes of errors here where this would be helpful. To highlight one: in the past few years alone we have regularly seen the metric alerts issue that's been discussed here, where metrics aren't "ready," and once or twice a year it results in multi-day disruptions to our customer updates and development cycle when the wait time needed is beyond anything we can orchestrate by manually pushing the alerts module down the deployment chain. I'm sure this proposal is extensive work and cuts against the spirit of a declarative DSL, but the practical effect for our org is this: we're essentially at the point where we are going to have to extend our entire deployment approach to include a packaged C#-based runner, and/or network-connected DevOps pipelines into customer tenants, exclusively in order to achieve wait/retry functionality (and graceful failure, if I had a wish list). Unfortunately the Resource Providers simply aren't reliable enough to depend on here and we need appropriate tools to account for that reality. |
Just throwing this in as some potential inspiration Azure/terraform-provider-azapi#392 |
Chiming in with a workaround for wait/sleep during deployment using deployment scripts. Our use case is that we're deploying a log analytics workspace and data collection rules using the The workaround for us was to add a
Warning: long code-block
targetScope = 'resourceGroup'
param location string = resourceGroup().location
// Log analytics workspace that needs a bit more time after deployment to become ready
resource logAnalyticsWorkspace 'Microsoft.OperationalInsights/workspaces@2022-10-01' = {
location: location
name: 'log-workspace-with-wait'
properties: {
sku: {
name: 'PerGB2018'
}
retentionInDays: 30
}
}
// Deployment script that just waits for 10 seconds
resource deploymentScript 'Microsoft.Resources/deploymentScripts@2023-08-01' = {
name: 'wait-for-log-tables'
location: location
kind: 'AzureCLI'
properties: {
azCliVersion: '2.52.0'
scriptContent: 'sleep 10'
retentionInterval: 'PT1H'
}
dependsOn: [
logAnalyticsWorkspace
]
}
// DCRs that depends upon the log analytics workspace to be completely ready before deploying
resource dataCollectionRules 'Microsoft.Insights/dataCollectionRules@2022-06-01' = {
name: 'dcrs-with-wait'
location: location
kind: 'Windows'
identity: {
type: 'systemassigned'
}
properties: {
dataFlows: [
{
streams: [
'Microsoft-Perf'
'Microsoft-Event'
]
destinations: [
logAnalyticsWorkspace.name
]
}
]
dataSources: {
performanceCounters: [
{
streams: [
'Microsoft-Perf'
]
samplingFrequencyInSeconds: 30
counterSpecifiers: [
'\\LogicalDisk(C:)\\Avg. Disk Queue Length'
'\\LogicalDisk(C:)\\Current Disk Queue Length'
'\\Memory\\Available Mbytes'
'\\Memory\\Page Faults/sec'
'\\Memory\\Pages/sec'
'\\Memory\\% Committed Bytes In Use'
'\\PhysicalDisk(*)\\Avg. Disk Queue Length'
'\\PhysicalDisk(*)\\Avg. Disk sec/Read'
'\\PhysicalDisk(*)\\Avg. Disk sec/Transfer'
'\\PhysicalDisk(*)\\Avg. Disk sec/Write'
'\\Processor Information(_Total)\\% Processor Time'
'\\User Input Delay per Process(*)\\Max Input Delay'
'\\User Input Delay per Session(*)\\Max Input Delay'
'\\RemoteFX Network(*)\\Current TCP RTT'
'\\RemoteFX Network(*)\\Current UDP Bandwidth'
]
name: 'perfCounterDataSource10'
}
{
streams: [
'Microsoft-Perf'
]
samplingFrequencyInSeconds: 60
counterSpecifiers: [
'\\LogicalDisk(C:)\\% Free Space'
'\\LogicalDisk(C:)\\Avg. Disk sec/Transfer'
'\\Terminal Services(*)\\Active Sessions'
'\\Terminal Services(*)\\Inactive Sessions'
'\\Terminal Services(*)\\Total Sessions'
]
name: 'perfCounterDataSource30'
}
]
windowsEventLogs: [
{
streams: [
'Microsoft-Event'
]
xPathQueries: [
'Microsoft-Windows-TerminalServices-RemoteConnectionManager/Admin!*[System[(Level=2 or Level=3 or Level=4 or Level=0) ]]'
'Microsoft-Windows-TerminalServices-LocalSessionManager/Operational!*[System[(Level=2 or Level=3 or Level=4 or Level=0)]]'
'System!*'
'Microsoft-FSLogix-Apps/Operational!*[System[(Level=2 or Level=3 or Level=4 or Level=0)]]'
'Application!*[System[(Level=2 or Level=3)]]'
'Microsoft-FSLogix-Apps/Admin!*[System[(Level=2 or Level=3 or Level=4 or Level=0)]]'
]
name: 'eventLogsDataSource'
}
]
}
description: 'AVD Insights settings'
destinations: {
logAnalytics: [
{
name: logAnalyticsWorkspace.name
workspaceResourceId: logAnalyticsWorkspace.id
}
]
}
streamDeclarations: {}
}
dependsOn: [
deploymentScript
]
} |
ARM template deployment often fails with errors like:
"Another operation is in progress on the selected item. If there is an in-progress operation, please retry after it has finished."
"BMSUserErrorObjectLocked","message":"Another operation is in progress on the selected item."
Just to clarify - this is not a dependency issue. An ARM deployment may fail if, for example, you try to add a VM to an RSV while another VM is being added at the same time: for a few seconds the RSV will not accept new clients and as a result your deployment will fail.
I would like to have an option to pause a deployment and/or retry it - maybe introduce the "wait" and "retry" deployment conditions, i.e: