EES-5446: Prevent Function App slot swap until orchestrations complete #5425
Conversation
Force-pushed from 5a8a070 to 4b14f01
Force-pushed from 217f0e5 to e2e4a38
@@ -39,6 +39,9 @@ param privateEndpointSubnetId string?
@description('Specifies whether this Function App is accessible from the public internet')
param publicNetworkAccessEnabled bool = false

@description('IP address ranges that are allowed to access the Function App endpoints. Dependent on "publicNetworkAccessEnabled" being true.')
param functionAppEndpointFirewallRules FirewallRule[] = []
Could this name be simplified? Perhaps just functionAppFirewallRules? Seems like we've called it this in publicApiDataProcessor.bicep.
Yup good idea - updated ta.
@@ -195,9 +207,20 @@ var commonSiteProperties = {
  netFrameworkVersion: '8.0'
  linuxFxVersion: appServicePlanOS == 'Linux' ? 'DOTNET-ISOLATED|8.0' : null
  keyVaultReferenceIdentity: keyVaultReferenceIdentity
  publicNetworkAccess: publicNetworkAccessEnabled ? 'Enabled' : 'Disabled'
  ipSecurityRestrictions: publicNetworkAccessEnabled && length(firewallRules) > 0 ? firewallRules : null
  ipSecurityRestrictionsDefaultAction: publicNetworkAccessEnabled && length(firewallRules) > 0 ? 'Deny' : 'Allow'
Is this a potential bug? If publicNetworkAccessEnabled is true, but there are no firewallRules, this will evaluate to Allow (which doesn't seem right).
I think that you're right conceptually, although I'm not sure it makes much difference either way in reality. If publicNetworkAccessEnabled is true and there are no firewall rules, that means it's open to the world anyway regardless of whether this is set to 'Allow' or 'Deny'.
But thinking about it now, I don't think there's any reason to set this to anything but 'Deny' here, as we're already explicitly saying that each of our firewall rules is an Allow rule in this same file:
var firewallRules = [for (firewallRule, index) in functionAppFirewallRules: {
  name: firewallRule.name
  ipAddress: firewallRule.cidr
  action: 'Allow'
  tag: firewallRule.tag != null ? firewallRule.tag : 'Default'
  priority: firewallRule.priority != null ? firewallRule.priority : 100 + index
}]
So I have simply set it to 'Deny' to be on the safe side.
Nice, yep that makes sense
scmIpSecurityRestrictions: [
  {
    ipAddress: 'Any'
    action: 'Allow'
    priority: 2147483647
    name: 'Allow all'
    description: 'Allow all access'
  }
]
What is this actually for? Tried looking for what this is exactly, but came up a bit short. Might want to add a comment to clarify its purpose?
I believe it's IP restrictions to determine who's able to see the part of the stack that's responsible for performing deploys to the Function App / App Service. So ideally, when we have static IPs for our runners, we'd be able to whitelist them so that only the pipeline would be able to deploy code to these resources. It might also control who's able to access the Kudu site that accompanies each Function App / App Service host (the one you can access through Advanced tools in Azure Portal).
I've stuck a comment on:
// TODO EES-5446 - this setting controls access to the deploy site for the Function App.
// This is currently the default value, but ideally we would lock this down to only be accessible
// by our runners and certain other whitelisted IP address ranges (e.g. trusted VPNs).
ILogger<LongRunningTriggerFunction> logger)
{
    [Function(nameof(TriggerLongRunningOrchestration))]
    public async Task<IActionResult> TriggerLongRunningOrchestration(
Given that it looks like we've proven that the status check works, is there any value anymore in having this code?
Would probably argue that we could just delete all this long-running function code as it's just cluttering up the project.
If we think there's a reason to keep it, would suggest that we at least disable the trigger to prevent it from being triggered accidentally when we don't need it.
Good idea - I have disabled it by adding the following to deploy-data-processor.yml:
"AzureWebJobs.TriggerLongRunningOrchestration.Disabled=true" \
I'd suggest we keep it to allow testing of this ticket, at which point it can be enabled in Azure Portal and triggered when needed.
I've also moved the App__MetaInsertBatchSize out of Bicep and into deploy-data-processor.yml as well, so now all app-specific settings are controlled in the yml.
Statuses = new List<OrchestrationRuntimeStatus>
{
    OrchestrationRuntimeStatus.Pending,
    OrchestrationRuntimeStatus.Running
}
Nitpick, but could use a collection expression?
All nitpicks welcome here! Done - I'm almost surprised that Rider didn't suggest it also :)
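For illustration, the same statuses written as a C# 12 collection expression - a sketch of the change being discussed rather than the exact merged code:

// Collection expression in place of the explicit List<OrchestrationRuntimeStatus> initialiser shown above.
Statuses =
[
    OrchestrationRuntimeStatus.Pending,
    OrchestrationRuntimeStatus.Running
]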
    }
};

[Function("StatusCheck")]
Could use nameof
Good spot - done ta
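For illustration, the change being suggested here (assuming the method itself is named StatusCheck, matching the snippet above):

// Before: the function name is a string literal that can silently drift from the method name.
[Function("StatusCheck")]

// After: nameof keeps the registered function name in sync with the method name.
[Function(nameof(StatusCheck))]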
{
    var activeOrchestrations = await client
        .GetAllInstancesAsync(filter: ActiveOrchestrationsQuery)
        .ToListAsync();
Could CountAsync?
Ooh good suggestion, done ta
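A sketch of what that change might look like, assuming a CountAsync extension for IAsyncEnumerable (e.g. from System.Linq.Async) is available alongside the ToListAsync already used here:

// Count matching instances without materialising the full list first.
var activeOrchestrations = await client
    .GetAllInstancesAsync(filter: ActiveOrchestrationsQuery)
    .CountAsync();

return new OkObjectResult(new { ActiveOrchestrations = activeOrchestrations });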
var activeOrchestrations = await client
    .GetAllInstancesAsync(filter: ActiveOrchestrationsQuery)
    .ToListAsync();
return new OkObjectResult(new { ActiveOrchestrations = activeOrchestrations.Count });
Would it be better to return a 5xx status code if there are active orchestrations? The main benefit is that we could reuse the wait-for-endpoint-success.yml task rather than needing the additional wait-for-orchestrations-to-complete.yml task (which basically does a similar thing). We'd probably also want to rename the function to something more intuitive for these semantics, e.g. DeployStatusCheck (checks if the function is ready for a new deployment).
I see what you're saying, but I'm slightly hesitant to do this because I feel that the details you get from this endpoint could be useful, and also there's no easy way to distinguish a planned 500 from an internal server error 500 if one should arise for whatever reason.
The other reason I think having the count (and a specific pipeline job asserting on it) is useful is that you can see in the DevOps logs during a deploy how many active orchestrations are currently holding it up, which gives you some indication of how long it'll take to get through.
It's not necessarily so desperately urgent at the mo because we're expecting orchestrations to be relatively quick to run, but if we end up reusing this for another durable Function App, or we include longer orchestrations here in the future, I think it would be beneficial to have that visibility.
What do you reckon?
(See the screenshot under Testing pipeline changes with long-running orchestration in the PR description for an idea of what I mean.)
Hmm, there are potentially some alterations we could make to get this to work, i.e. we could change wait-for-endpoint-success.yml to output the body of non-200 responses, which would surface the number of active orchestrations in the response JSON. I'd also probably lean towards using 503 instead of 500, as this might better communicate that the service isn't available to be deployed.
Regardless, it's not a big deal either way. If you're happy enough with the current approach, happy for this to go in as-is!
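For illustration only, a rough sketch of the alternative floated above (not what was merged): a renamed DeployStatusCheck endpoint that returns 503 while orchestrations are still active, so a generic wait-for-endpoint-success task could poll it. The attribute shapes, auth level and names here are assumptions.

[Function(nameof(DeployStatusCheck))]
public async Task<IActionResult> DeployStatusCheck(
    [HttpTrigger(AuthorizationLevel.Function, "get")] HttpRequest request,
    [DurableClient] DurableTaskClient client)
{
    var activeOrchestrations = await client
        .GetAllInstancesAsync(filter: ActiveOrchestrationsQuery)
        .CountAsync();

    // 503 communicates "not ready for a new deployment", while the body still surfaces
    // the active orchestration count so the pipeline logs keep that visibility.
    return activeOrchestrations > 0
        ? new ObjectResult(new { ActiveOrchestrations = activeOrchestrations })
          {
              StatusCode = StatusCodes.Status503ServiceUnavailable
          }
        : new OkObjectResult(new { ActiveOrchestrations = 0 });
}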
Seems like there's a bunch of merge conflicts from the alerts work going in!
…o reduce noise during deployments. Prior to this we would see lots of FileNotFoundExceptions in the staging slot only during deploys.
…estrations via HTTP call. Added script to Data Processor deploy tasks to poll until active orchestrations are all complete prior to initiating slot swap
…access to the Data Processor app. Temporarily allowing anonymous access and public internet access while trialling active orchestrations endpoint
…est the deploy delay of the Data Processor.
…th and granting of an access token
…wn to Admin, Azure Pipeline runners (temporarily disabled in favour of whitelisting AzureCloud service tag) and general maintenance firewall rules.
…ate DNS zone deployment to save lots of time during a standard deploy
Force-pushed from e2e4a38 to 782043f
Yup, it was unpleasant but all resolved now thanks!
…rchestration trigger by default. Code improvements and variable renaming. Tweaks to Bicep templates.
…t-function-app-slot-swap-until-orchestrations-complete Ees 5446 prevent function app slot swap until orchestrations complete
Overview
This PR:
Additionally, this PR:

- Adds a new deploySharedPrivateDnsZones parameter to the pipeline in order to allow for Bicep template deployment time savings of ~2 minutes for a rarely changing common resource.
- Adds deployContainerApp and deployDataProcessor parameters to provide further time saving opportunities if performing focused deployment activities on specific resources (and countless hours of savings if doing multiple deploys as part of infra work).
- Adds a new IpRange Bicep type that is differentiated from the existing FirewallRule type, as many Azure resources only support very simple properties (e.g. name, rangeStart, rangeEnd) for their restriction configuration, whereas others support richer details (e.g. description, priority, action).
- Addresses the FileNotFoundException bulletpoint on https://dfedigital.atlassian.net/browse/EES-5232.

Summary of the work carried out in this PR
Adding in a waiting step to allow orchestrations to complete
The following diagram illustrates the step that we were missing:
(from https://learn.microsoft.com/en-us/azure/azure-functions/durable/durable-functions-zero-downtime-deployment)
Adding a "StatusCheck" Function to report active orchestrations
I added a new HTTP-triggered Function that reports the current number of active / pending orchestrations on the instance being called, e.g. the Data Processor in the production slot. The implementation was mostly borrowed from https://learn.microsoft.com/en-us/azure/azure-functions/durable/durable-functions-zero-downtime-deployment. I kept the name of the Function the same as in the docs ("StatusCheck"), as I saw other mentions of it elsewhere around the internet and thought it'd probably be worth keeping the same for future Googling purposes!
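For reference, a minimal sketch of a function along those lines in the isolated-worker model, pieced together from the snippets in this PR's diff; the exact namespaces, auth level and class layout in the real code are assumptions here:

using System.Linq;
using System.Threading.Tasks;
using Microsoft.AspNetCore.Http;
using Microsoft.AspNetCore.Mvc;
using Microsoft.Azure.Functions.Worker;
using Microsoft.Azure.Functions.Worker.Http;
using Microsoft.DurableTask.Client;

public class StatusCheckFunctions
{
    // Orchestrations that are still pending or running count as "active".
    private static readonly OrchestrationQuery ActiveOrchestrationsQuery = new()
    {
        Statuses = [OrchestrationRuntimeStatus.Pending, OrchestrationRuntimeStatus.Running]
    };

    [Function(nameof(StatusCheck))]
    public async Task<IActionResult> StatusCheck(
        [HttpTrigger(AuthorizationLevel.Function, "get")] HttpRequest request,
        [DurableClient] DurableTaskClient client)
    {
        // Report how many orchestrations are currently pending or running so the
        // deploy pipeline can poll this endpoint until the count reaches zero.
        var activeOrchestrations = await client
            .GetAllInstancesAsync(filter: ActiveOrchestrationsQuery)
            .CountAsync();

        return new OkObjectResult(new { ActiveOrchestrations = activeOrchestrations });
    }
}

With this shape, the wait-for-orchestrations-to-complete.yml task only needs to read ActiveOrchestrations out of the JSON body on each poll.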
Adding active orchestrations call to pipeline
New endpoint calls
I added 2 new task types to the pipeline. One is a generic call to an endpoint waiting for a success response (200 to 204). The other is a specific call to an "active orchestrations" endpoint, which waits for an active orchestration count of zero to be reached. Both of these tasks poll their endpoints for a certain number of attempts before timing out or reaching their success state.
After we have deployed new code and app settings to the staging slot, we firstly wait until it is back up and healthy again, polling the /api/HealthCheck endpoint until it returns 200. We then proceed to call the active orchestrations endpoint on the production slot until it returns zero active orchestrations.
We are then free to initiate the slot swap.
Finally, for our own visibility and assurance, we poll the /api/HealthCheck endpoint in the production slot until it returns 200 following the slot swap. This should be immediate, but we will see!

Network changes to allow CI to call Data Processor endpoints
The Data Processor was previously locked down to VNet traffic only. This meant that the DevOps pipeline runners were unable to call it due to network restrictions.
In order to allow calls through without VNet peering (which, given that we're in s101, is infinitely harder!), we instead enable public network access but restrict it to specific IPs. As the work of assigning the DevOps runners a static IP address is ongoing, for the purposes of this PR we are whitelisting the "AzureCloud" service tag for now. This, in conjunction with the EasyAuth work (detailed below), gives enough security while we wait for the runners to have a whitelisted IP address that we can rely on.
The Admin App's internal subnet range is whitelisted also; the Admin App uses the internal range because the private endpoint connecting the Data Processor to the VNet allows VNet-connected resources to communicate without needing to go out over the public internet.
We also whitelist the generic maintenance IP address ranges so that we can more easily support the Function App when in production.
Because of these changes, we no longer need the steps in the pipeline that made the staging slot temporarily publicly visible and then invisible again, giving us a bonus speed boost.
EasyAuth changes to whitelist the DevOps SPN
In order for the pipeline to make authenticated and authorised calls to the Data Processor, we need to add the SPN's client ID (application ID) to the whitelist of applications that are allowed to call the Data Processor. The Admin App Service is already whitelisted, and we add the SPN to this same list of allowed applications. I also realised that the Admin App Service's managed identity was whitelisted as a user in addition to the application itself being whitelisted, which wasn't required.
We get the application ID from the pipeline run by using the addSpnToEnvironment option, which then exposes a number of environment variables for pipelines to consume.

To make use of this new whitelisting, the pipeline needs to acquire an access token for the default scope of the Data Processor's app registration, which it does in both of the new endpoint-polling tasks by using az account get-access-token. It then uses this as a bearer token to call the protected endpoints.

New "awaitActiveOrchestrations" pipeline parameter
Because the first time this code is deployed there will be no /api/StatusCheck endpoint in the production slot, we need to be able to bypass this step in the pipeline for its first successful deploy. This will need to be done manually, with this setting set to false. By default it will be true.

Testing pipeline changes with long-running orchestration
In order to test the additional step in the pipeline, I added a dummy LongRunningOrchestration to the Data Processor. This can be used to create - you guessed it - a long-running orchestration. We can then see the effects of triggering one of these in the pipeline:

The task polls until the orchestration is complete, and then the deploy continues successfully.
I've chosen to leave this orchestration in the codebase for the purposes of testing this feature, as our current real orchestrations are just too fast!
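For context, a dummy orchestration like this can be as simple as a durable timer. A rough sketch under the isolated-worker model; the actual duration, names and the HTTP trigger wiring used in the PR are not shown and are assumptions here:

using System;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Azure.Functions.Worker;
using Microsoft.DurableTask;

public static class LongRunningOrchestration
{
    [Function(nameof(LongRunningOrchestration))]
    public static async Task Run([OrchestrationTrigger] TaskOrchestrationContext context)
    {
        // A durable timer keeps the orchestration in the Running state for long enough
        // to hold up the slot swap and exercise the new pipeline wait step.
        await context.CreateTimer(TimeSpan.FromMinutes(10), CancellationToken.None);
    }
}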
Future work suggestions
- Remove the LongRunningOrchestration code.
- Rename functionApp.bicep to durableFunctionApp.bicep, as there is so much in there that is specific to Durable Function Apps only.
- Extract a deploy-durable-function-app.yml template from deploy-data-processor.yml for use in other Durable Function Apps in the future, as it is all steps that we would wish to carry out on others too.
- I added an explicit scmIpSecurityRestrictions element to functionApp.bicep, which is what currently exists on our Data Processor. It would be good to lock this down as well to our maintenance IP ranges and to DevOps runners.