EES-5446: Prevent Function App slot swap until orchestrations complete #5425
Conversation
Force-pushed from 5a8a070 to 4b14f01
Force-pushed from 217f0e5 to e2e4a38
@@ -39,6 +39,9 @@ param privateEndpointSubnetId string?
@description('Specifies whether this Function App is accessible from the public internet')
param publicNetworkAccessEnabled bool = false

@description('IP address ranges that are allowed to access the Function App endpoints. Dependent on "publicNetworkAccessEnabled" being true.')
param functionAppEndpointFirewallRules FirewallRule[] = []
Could this name be simplified? Perhaps just functionAppFirewallRules? Seems like we've called it this in publicApiDataProcessor.bicep.
Yup good idea - updated ta.
@@ -195,9 +207,20 @@ var commonSiteProperties = {
  netFrameworkVersion: '8.0'
  linuxFxVersion: appServicePlanOS == 'Linux' ? 'DOTNET-ISOLATED|8.0' : null
  keyVaultReferenceIdentity: keyVaultReferenceIdentity
  publicNetworkAccess: publicNetworkAccessEnabled ? 'Enabled' : 'Disabled'
  ipSecurityRestrictions: publicNetworkAccessEnabled && length(firewallRules) > 0 ? firewallRules : null
  ipSecurityRestrictionsDefaultAction: publicNetworkAccessEnabled && length(firewallRules) > 0 ? 'Deny' : 'Allow'
Is this a potential bug? If publicNetworkAccessEnabled is true, but there are no firewallRules, this will evaluate to Allow (which doesn't seem right).
I think that you're right conceptually, although I'm not sure it makes much difference either way in reality. If publicNetworkAccessEnabled is true and there are no firewall rules, that means it's open to the world anyway regardless of whether this is set to 'Allow' or 'Deny'.
But thinking about it now, I don't think there's any reason to set this to anything but 'Deny' here, as we're already explicitly saying that each of our firewall rules is an Allow rule in this same file:
var firewallRules = [for (firewallRule, index) in functionAppFirewallRules: {
  name: firewallRule.name
  ipAddress: firewallRule.cidr
  action: 'Allow'
  tag: firewallRule.tag != null ? firewallRule.tag : 'Default'
  priority: firewallRule.priority != null ? firewallRule.priority : 100 + index
}]
So I have simply set it to 'Deny' to be on the safe side.
Nice, yep that makes sense
scmIpSecurityRestrictions: [
  {
    ipAddress: 'Any'
    action: 'Allow'
    priority: 2147483647
    name: 'Allow all'
    description: 'Allow all access'
  }
]
What is this actually for? Tried looking for what this is exactly, but came up a bit short. Might want to add a comment to clarify its purpose?
I believe it's IP restrictions to determine who's able to see the part of the stack that's responsible for performing deploys to the Function App / App Service. So ideally, when we have static IPs for our runners, we'd be able to whitelist them so that only the pipeline would be able to deploy code to these resources. It might also control who's able to access the Kudu site that accompanies each Function App / App Service host (the one you can access through Advanced tools in Azure Portal).
I've stuck a comment on:
// TODO EES-5446 - this setting controls access to the deploy site for the Function App.
// This is currently the default value, but ideally we would lock this down to only be accessible
// by our runners and certain other whitelisted IP address ranges (e.g. trusted VPNs).
ILogger<LongRunningTriggerFunction> logger)
{
    [Function(nameof(TriggerLongRunningOrchestration))]
    public async Task<IActionResult> TriggerLongRunningOrchestration(
Given that it looks like we've proven that the status check works, is there any value anymore in having this code?
Would probably argue that we could just delete all this long-running function code as it's just cluttering up the project.
If we think there's a reason to keep it, would suggest that we at least disable the trigger to prevent it from being triggered accidentally when we don't need it.
Good idea - I have disabled it by adding the following to deploy-data-processor.yml:
"AzureWebJobs.TriggerLongRunningOrchestration.Disabled=true" \
I'd suggest we keep it to allow testing of this ticket, at which point it can be enabled in Azure Portal and triggered when needed.
I've also moved the App__MetaInsertBatchSize out of Bicep and into deploy-data-processor.yml as well, so now all app-specific settings are controlled in the yml.
Statuses = new List<OrchestrationRuntimeStatus>
{
    OrchestrationRuntimeStatus.Pending,
    OrchestrationRuntimeStatus.Running
}
Nitpick, but could use a collection expression?
All nitpicks welcome here! Done - I'm almost surprised that Rider didn't suggest it also :)
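For illustration, the same statuses written as a C# 12 collection expression - a sketch of the change being discussed rather than the exact merged code:

// Collection expression in place of the explicit List<OrchestrationRuntimeStatus> initialiser shown above.
Statuses =
[
    OrchestrationRuntimeStatus.Pending,
    OrchestrationRuntimeStatus.Running
]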
    }
};

[Function("StatusCheck")]
Could use nameof
Good spot - done ta
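For illustration, the change being suggested here (assuming the method itself is named StatusCheck, matching the snippet above):

// Before: the function name is a string literal that can silently drift from the method name.
[Function("StatusCheck")]

// After: nameof keeps the registered function name in sync with the method name.
[Function(nameof(StatusCheck))]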
{
    var activeOrchestrations = await client
        .GetAllInstancesAsync(filter: ActiveOrchestrationsQuery)
        .ToListAsync();
Could CountAsync?
Ooh good suggestion, done ta
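A sketch of what that change might look like, assuming a CountAsync extension for IAsyncEnumerable (e.g. from System.Linq.Async) is available alongside the ToListAsync already used here:

// Count matching instances without materialising the full list first.
var activeOrchestrations = await client
    .GetAllInstancesAsync(filter: ActiveOrchestrationsQuery)
    .CountAsync();

return new OkObjectResult(new { ActiveOrchestrations = activeOrchestrations });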
var activeOrchestrations = await client
    .GetAllInstancesAsync(filter: ActiveOrchestrationsQuery)
    .ToListAsync();
return new OkObjectResult(new { ActiveOrchestrations = activeOrchestrations.Count });
Would it be better to return a 5xx status code if there are active orchestrations? The main benefit is that we could reuse the wait-for-endpoint-success.yml task rather than needing the additional wait-for-orchestrations-to-complete.yml task (which basically does a similar thing). We'd probably also want to rename the function to something more intuitive for these semantics, e.g. DeployStatusCheck (checks if the function is ready for a new deployment).
I see what you're saying, but I'm slightly hesitant to do this because I feel that the details you get from this endpoint could be useful, and also there's no easy way to distinguish a planned 500 from an internal server error 500 if one should arise for whatever reason.
The other reason I think having the count (and a specific pipeline job asserting on it) is useful is that you can see in the DevOps logs during a deploy how many active orchestrations are currently holding it up, which gives you some indication of how long it'll take to get through.
It's not necessarily so desperately urgent at the mo because we're expecting orchestrations to be relatively quick to run, but if we end up reusing this for another durable Function App, or we include longer orchestrations here in the future, I think it would be beneficial to have that visibility.
What do you reckon?
(See the screenshot under Testing pipeline changes with long-running orchestration in the PR description for an idea of what I mean.)
Hmm, there are potentially some alterations we could make to get this to work, i.e. we could change wait-for-endpoint-success.yml to output the body of non-200 responses, which would surface the number of active orchestrations in the response JSON. I'd also probably lean towards using 503 instead of 500, as this might better communicate that the service isn't available to be deployed.
Regardless, it's not a big deal either way. If you're happy enough with the current approach, happy for this to go in as-is!
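For illustration only, a rough sketch of the alternative floated above (not what was merged): a renamed DeployStatusCheck endpoint that returns 503 while orchestrations are still active, so a generic wait-for-endpoint-success task could poll it. The attribute shapes, auth level and names here are assumptions.

[Function(nameof(DeployStatusCheck))]
public async Task<IActionResult> DeployStatusCheck(
    [HttpTrigger(AuthorizationLevel.Function, "get")] HttpRequest request,
    [DurableClient] DurableTaskClient client)
{
    var activeOrchestrations = await client
        .GetAllInstancesAsync(filter: ActiveOrchestrationsQuery)
        .CountAsync();

    // 503 communicates "not ready for a new deployment", while the body still surfaces
    // the active orchestration count so the pipeline logs keep that visibility.
    return activeOrchestrations > 0
        ? new ObjectResult(new { ActiveOrchestrations = activeOrchestrations })
          {
              StatusCode = StatusCodes.Status503ServiceUnavailable
          }
        : new OkObjectResult(new { ActiveOrchestrations = 0 });
}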
Seems like there's a bunch of merge conflicts from the alerts work going in!
…o reduce noise during deployments. Prior to this we would see lots of FileNotFoundExceptions in the staging slot only during deploys.
…estrations via HTTP call. Added script to Data Processor deploy tasks to poll until active orchestrations are all complete prior to initiating slot swap
…access to the Data Processor app. Temporarily allowing anonymous access and public internet access while trialling active orchestrations endpoint
…est the deploy delay of the Data Processor.
…th and granting of an access token
…wn to Admin, Azure Pipeline runners (temporarily disabled in favour of whitelisting AzureCloud service tag) and general maintenance firewall rules.
…ate DNS zone deployment to save lots of time during a standard deploy
Force-pushed from e2e4a38 to 782043f
Yup, it was unpleasant but all resolved now thanks!
…rchestration trigger by default. Code improvements and variable renaming. Tweaks to Bicep templates.
…t-function-app-slot-swap-until-orchestrations-complete Ees 5446 prevent function app slot swap until orchestrations complete
Overview
This PR:
Additionally, this PR:

- Adds a new deploySharedPrivateDnsZones parameter to the pipeline in order to allow for Bicep template deployment time savings of ~2 minutes for a rarely changing common resource.
- Adds deployContainerApp and deployDataProcessor parameters to provide further time saving opportunities if performing focused deployment activities on specific resources (and countless hours of savings if doing multiple deploys as part of infra work).
- Adds a new IpRange Bicep type that is differentiated from the existing FirewallRule type, as many Azure resources only support very simple properties (e.g. name, rangeStart, rangeEnd) for their restriction configuration, whereas others support richer details (e.g. description, priority, action).
- Addresses the FileNotFoundException bulletpoint on https://dfedigital.atlassian.net/browse/EES-5232.

Summary of the work carried out in this PR
Adding in a waiting step to allow orchestrations to complete
The following diagram illustrates the step that we were missing:
(from https://learn.microsoft.com/en-us/azure/azure-functions/durable/durable-functions-zero-downtime-deployment)
Adding a "StatusCheck" Function to report active orchestrations
I added a new HTTP-triggered Function that reports the current number of active / pending orchestrations on the instance being called, e.g. the Data Processor in the production slot. The implementation was mostly borrowed from https://learn.microsoft.com/en-us/azure/azure-functions/durable/durable-functions-zero-downtime-deployment. I kept the name of the Function the same as in the docs ("StatusCheck"), as I saw other mentions of it elsewhere around the internet and thought it'd probably be worth keeping the same for future Googling purposes!
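For reference, a minimal sketch of a function along those lines in the isolated-worker model, pieced together from the snippets in this PR's diff; the exact namespaces, auth level and class layout in the real code are assumptions here:

using System.Linq;
using System.Threading.Tasks;
using Microsoft.AspNetCore.Http;
using Microsoft.AspNetCore.Mvc;
using Microsoft.Azure.Functions.Worker;
using Microsoft.Azure.Functions.Worker.Http;
using Microsoft.DurableTask.Client;

public class StatusCheckFunctions
{
    // Orchestrations that are still pending or running count as "active".
    private static readonly OrchestrationQuery ActiveOrchestrationsQuery = new()
    {
        Statuses = [OrchestrationRuntimeStatus.Pending, OrchestrationRuntimeStatus.Running]
    };

    [Function(nameof(StatusCheck))]
    public async Task<IActionResult> StatusCheck(
        [HttpTrigger(AuthorizationLevel.Function, "get")] HttpRequest request,
        [DurableClient] DurableTaskClient client)
    {
        // Report how many orchestrations are currently pending or running so the
        // deploy pipeline can poll this endpoint until the count reaches zero.
        var activeOrchestrations = await client
            .GetAllInstancesAsync(filter: ActiveOrchestrationsQuery)
            .CountAsync();

        return new OkObjectResult(new { ActiveOrchestrations = activeOrchestrations });
    }
}

With this shape, the wait-for-orchestrations-to-complete.yml task only needs to read ActiveOrchestrations out of the JSON body on each poll.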
Adding active orchestrations call to pipeline
New endpoint calls
I added 2 new task types to the pipeline. One is a generic call to an endpoint waiting for a success response (200 to 204). The other is a specific call to an "active orchestrations" endpoint, which waits for an active orchestration count of zero to be reached. Both of these tasks poll their endpoints for a certain number of attempts before timing out or reaching their success state.
After we have deployed new code and app settings to the staging slot, we firstly wait until it is back up and healthy again, polling the /api/HealthCheck endpoint until it returns 200. We then proceed to call the active orchestrations endpoint on the production slot until it returns zero active orchestrations.
We are then free to initiate the slot swap.
Finally, for our own visibility and assurance, we poll the /api/HealthCheck endpoint in the production slot until it returns 200 following the slot swap. This should be immediate, but we will see!

Network changes to allow CI to call Data Processor endpoints
The Data Processor was previously locked down to VNet traffic only. This meant that the DevOps pipeline runners were unable to call it due to network restrictions.
In order to allow calls through without VNet peering (which, given that we're in s101, is infinitely harder!), we instead enable public network access but restrict it to specific IPs. As the work of assigning the DevOps runners a static IP address is ongoing, for the purposes of this PR we are whitelisting the "AzureCloud" service tag for now. This, in conjunction with the EasyAuth work (detailed below), gives enough security while we wait for the runners to have a whitelisted IP address that we can rely on.
The Admin App's internal subnet range is whitelisted also; the Admin App uses the internal range because the private endpoint connecting the Data Processor to the VNet allows VNet-connected resources to communicate without needing to go out over the public internet.
We also whitelist the generic maintenance IP address ranges so that we can more easily support the Function App when in production.
Because of these changes, we no longer need the steps in the pipeline that made the staging slot temporarily publicly visible and then invisible again, giving us a bonus speed boost.
EasyAuth changes to whitelist the DevOps SPN
In order for the pipeline to make authenticated and authorised calls to the Data Processor, we need to add the SPN's client ID (application ID) to the whitelist of applications that are allowed to call the Data Processor. The Admin App Service is already whitelisted, and we add the SPN to this same list of allowed applications. I also realised that the Admin App Service's managed identity was whitelisted as a user in addition to the application itself being whitelisted, which wasn't required.
We get the application ID from the pipeline run by using the addSpnToEnvironment option, which then exposes a number of environment variables for pipelines to consume.

To make use of this new whitelisting, the pipeline needs to acquire an access token for the default scope of the Data Processor's app registration, which it does in both of the new endpoint-polling tasks by using az account get-access-token. It then uses this as a bearer token to call the protected endpoints.

New "awaitActiveOrchestrations" pipeline parameter
Because the first time this code is deployed there will be no /api/StatusCheck endpoint in the production slot, we need to be able to bypass this step in the pipeline for its first successful deploy. This will need to be done manually, with this setting set to false. By default it will be true.

Testing pipeline changes with long-running orchestration
In order to test the additional step in the pipeline, I added a dummy LongRunningOrchestration to the Data Processor. This can be used to create - you guessed it - a long-running orchestration. We can then see the effects of triggering one of these in the pipeline:

The task polls until the orchestration is complete, and then the deploy continues successfully.
I've chosen to leave this orchestration in the codebase for the purposes of testing this feature, as our current real orchestrations are just too fast!
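For context, a dummy orchestration like this can be as simple as a durable timer. A rough sketch under the isolated-worker model; the actual duration, names and the HTTP trigger wiring used in the PR are not shown and are assumptions here:

using System;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Azure.Functions.Worker;
using Microsoft.DurableTask;

public static class LongRunningOrchestration
{
    [Function(nameof(LongRunningOrchestration))]
    public static async Task Run([OrchestrationTrigger] TaskOrchestrationContext context)
    {
        // A durable timer keeps the orchestration in the Running state for long enough
        // to hold up the slot swap and exercise the new pipeline wait step.
        await context.CreateTimer(TimeSpan.FromMinutes(10), CancellationToken.None);
    }
}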
Future work suggestions
- Remove the LongRunningOrchestration code.
- Rename functionApp.bicep to durableFunctionApp.bicep, as there is so much in there that is specific to Durable Function Apps only.
- Extract a deploy-durable-function-app.yml template from deploy-data-processor.yml for use in other Durable Function Apps in the future, as it is all steps that we would wish to carry out on others too.
- I added an explicit scmIpSecurityRestrictions element to functionApp.bicep, which is what currently exists on our Data Processor. It would be good to lock this down as well to our maintenance IP ranges and to DevOps runners.