The Azure Mission-Critical reference implementation follows a layered and modular approach. This approach achieves the following goals:
- A cleaner and more manageable deployment design
- The ability to swap individual services for other services that provide similar capabilities, depending on requirements
- Separation between layers, which makes it easier to implement RBAC when multiple teams are responsible for different aspects of the Azure Mission-Critical application deployment and operations
The Azure Mission-Critical reference implementations are composed of three distinct layers:
- Infrastructure
- Configuration
- Application
The infrastructure layer contains all infrastructure components and underlying foundational services required for the Azure Mission-Critical reference implementation. It is deployed using Terraform.
The configuration layer applies the initial configuration and additional services on top of the components deployed as part of the infrastructure layer.
The application layer contains all components and dependencies related to the application workload itself.
Every stamp - which usually corresponds to a deployment to one Azure Region - is considered independent. Stamps are designed to work without relying on components in other regions (i.e. "share nothing").
The main shared component between stamps which requires synchronization at runtime is the database layer. For this, Azure Cosmos DB was chosen as it provides the crucial capability of multi-region writes: each stamp can write locally, with Cosmos DB handling data replication and synchronization between the stamps.
Aside from the database, a geo-replicated Azure Container Registry (ACR) is shared between the stamps. The ACR is replicated to every region which hosts a stamp to ensure fast and resilient access to the images at runtime.
Stamps can be added and removed dynamically as needed to provide more resiliency, scale and proximity to users.
A global load balancer is used to distribute and load balance incoming traffic to the stamps (see Networking for details).
As much as possible, no state should be stored on the compute clusters, with all state externalized to the database. This allows users to start a user journey in one stamp and continue it in another.
In addition to stamp independence and stateless compute clusters, each "stamp" is considered to be a Scale Unit (SU) following the Deployment stamps pattern. All components and services within a given stamp are configured and tested to serve requests in a given range. This includes auto-scaling capabilities for each service as well as proper minimum and maximum values and regular evaluation.
Configuration
| Component | Min | Max |
|---|---|---|
| AKS nodes | 3 | 12 |
| Ingress controller replicas | 3 | 24 |
| CatalogService replicas | 3 | 24 |
| BackgroundProcessor replicas | 3 | 12 |
| Event Hub throughput units | 1 | 10 |
| Cosmos DB RUs | 4000 | 40000 |
Note: Cosmos DB RUs are scaled in all regions simultaneously.
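To illustrate how these scale-unit boundaries could be surfaced in the Terraform configuration, the following sketch declares them as input variables. The variable names and defaults mirror the table above but are assumptions, not necessarily the names used in the repository.

```hcl
# Illustrative only: variable names and defaults mirror the scale-unit table
# above, but are not necessarily the names used in the repository.
variable "aks_node_count_min" {
  description = "Minimum number of AKS nodes per stamp"
  type        = number
  default     = 3
}

variable "aks_node_count_max" {
  description = "Maximum number of AKS nodes per stamp"
  type        = number
  default     = 12
}

variable "cosmosdb_container_max_throughput" {
  description = "Upper RU/s bound for Cosmos DB container autoscaling (applies to all regions simultaneously)"
  type        = number
  default     = 40000
}
```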
Each SU is deployed into an Azure region and therefore primarily handles traffic from that given area (although it can take over traffic from other regions when needed). This geographic spread will likely result in load patterns and business hours that vary from region to region, and as such, every SU is designed to scale in and down when idle.
The reference implementation of Azure Mission-Critical deploys a set of Azure services, and not all of these services are available in every Azure region. In addition, only regions which offer Availability Zones (AZs) are considered for a stamp. AZs are gradually being rolled out and are not yet available in all regions. Due to these constraints, the reference implementation cannot be deployed to every Azure region.
As of May 2022, the following regions have been successfully tested with the reference implementation of Azure Mission-Critical:
Europe/Africa
- northeurope
- westeurope
- germanywestcentral
- francecentral
- uksouth
- norwayeast
- swedencentral
- switzerlandnorth
- southafricanorth
Americas
- westus2
- eastus
- eastus2
- centralus
- southcentralus
- brazilsouth
- canadacentral
Asia Pacific
- australiaeast
- southeastasia
- eastasia
- japaneast
- koreacentral
- centralindia
Note: Depending on which regions you select, you might first need to request additional quota from Azure Support for some of the services (mostly for AKS VMs and Cosmos DB).
It's worth calling out that where an Azure service is not available in a region, an equivalent service may be deployed in its place. Availability Zones are the main limiting factor for the reference implementation of Azure Mission-Critical.
As the regional availability of the services used in the reference implementation and of AZs ramps up, we expect this list to change and the reference implementation to become deployable in additional Azure regions.
Note: If the target availability SLA for your application workload can be achieved without AZs, and/or your workload is not bound by compliance requirements related to data sovereignty, an alternate region where all services/AZs are available can be considered.
- Front Door is used as the only entry point for user traffic. All backend systems are locked down to only allow traffic that comes through the Azure Front Door (AFD) instance.
- Each stamp comes with a pre-provisioned Public IP address resource whose DNS name is used as a backend for Front Door.
- Diagnostic settings are configured to store all log and metric data for 30 days (retention policy) in Log Analytics.
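As a minimal sketch of the pre-provisioned Public IP mentioned above (resource names, the DNS label and the input variables are illustrative assumptions, not the repository's actual values):

```hcl
# Sketch of a stamp's pre-provisioned Public IP (all names are illustrative;
# var.resource_group_name and var.location are assumed inputs).
resource "azurerm_public_ip" "stamp" {
  name                = "aoprod7745-eastus2-pip"     # illustrative
  resource_group_name = var.resource_group_name
  location            = var.location
  sku                 = "Standard"
  allocation_method   = "Static"
  domain_name_label   = "aoprod7745-eastus2-cluster" # illustrative
}
```

The FQDN generated from `domain_name_label` is what each stamp registers as its Front Door backend.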
- The SQL API (Core API) of Cosmos DB is used.
- Multi-region write is enabled: the account is replicated to every region in which a stamp is deployed.
- `zone_redundancy` is enabled for each replicated region.
- Request Unit `autoscaling` is enabled at the container level.
- Each stamp deploys an Azure Private Endpoint to Cosmos DB.
- Network restrictions are enabled to allow access only from Private Endpoints.
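The Cosmos DB settings listed above could be sketched in Terraform roughly as follows. This is an illustration, not the repository's actual module: resource names, variables, the consistency level and the container definition are assumptions, and attribute names follow the azurerm 3.x provider (they differ in other provider versions).

```hcl
# Illustrative sketch of the globally shared Cosmos DB account.
resource "azurerm_cosmosdb_account" "global" {
  name                            = "${var.prefix}-global-cosmos"  # illustrative
  resource_group_name             = var.global_resource_group_name # assumed input
  location                        = var.primary_location           # assumed input
  offer_type                      = "Standard"
  kind                            = "GlobalDocumentDB"             # SQL (Core) API
  enable_multiple_write_locations = true                           # each stamp writes locally

  consistency_policy {
    consistency_level = "Session" # illustrative choice
  }

  # One geo_location block per region that hosts a stamp, each zone-redundant.
  dynamic "geo_location" {
    for_each = var.stamp_locations # assumed list of regions
    content {
      location          = geo_location.value
      failover_priority = geo_location.key
      zone_redundant    = true
    }
  }
}

# Request Unit autoscaling is enabled at the container level.
resource "azurerm_cosmosdb_sql_container" "catalog" {
  name                = "catalogItems"             # illustrative
  resource_group_name = var.global_resource_group_name
  account_name        = azurerm_cosmosdb_account.global.name
  database_name       = var.cosmosdb_database_name # assumed input
  partition_key_path  = "/id"                      # illustrative
  autoscale_settings {
    max_throughput = 40000 # upper bound from the scale-unit table above
  }
}
```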
- `sku` is set to Premium to allow geo-replication.
- `georeplication_locations` is automatically set to reflect all regions that a regional stamp is deployed to.
- `zone_redundancy_enabled` provides resiliency and high availability within a specific region.
- `admin_enabled` is set to false. Admin user access is not used; access to images stored in ACR, for example from AKS, is only possible using Azure AD role assignments.
- Diagnostic settings are configured to store all log and metric data in Log Analytics.
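A hedged sketch of the geo-replicated registry follows. Names and variables are assumptions, and the `georeplications` block shown here is the newer azurerm equivalent of the `georeplication_locations` attribute referenced above.

```hcl
# Illustrative sketch of the geo-replicated Azure Container Registry.
resource "azurerm_container_registry" "global" {
  name                = "${var.prefix}globalcr"        # illustrative; dashes are not allowed
  resource_group_name = var.global_resource_group_name # assumed input
  location            = var.primary_location           # assumed input
  sku                 = "Premium"                      # required for geo-replication
  admin_enabled       = false                          # access only via Azure AD role assignments

  # One replica per additional region that hosts a stamp.
  dynamic "georeplications" {
    for_each = var.additional_stamp_locations # assumed list of regions
    content {
      location                = georeplications.value
      zone_redundancy_enabled = true
    }
  }
}
```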
- Used to collect diagnostic logs of the global resources.
- `daily_quota_gb` is set to prevent overspend, especially on environments that are used for load testing.
- `retention_in_days` is used to prevent overspend by storing data longer than needed in Log Analytics; long-term log and metric retention is supposed to happen in Azure Storage.
A stamp is a regional deployment and can also be considered a scale unit. For now, only one stamp is deployed per Azure region, but this can be extended to allow multiple stamps per region if required.
The current networking setup uses a single Azure Virtual Network per stamp, with one subnet dedicated to Azure Kubernetes Service (AKS) and an additional subnet for the Private Endpoints of the various services.
For connected scenarios (where access to other company resources in other spokes or on-premises is required), the VNets are expected to be pre-provisioned, for instance by a platform team, and made available to the application team. For E2E (dev) environments this might be optional; therefore the reference implementation is prepared to create the VNets itself if needed.
For INT and PROD (or any other environments which do require connectivity), multiple pre-provisioned VNets are expected to be available: due to the blue-green deployment approach, at least two VNets per environment and region are required. The deployment pipeline looks for a file `.ado/pipelines/config/vnets-[environment].json`. If this file is not present, disconnected VNets will be deployed on demand, e.g. for E2E environments.
The file needs to hold the resource IDs of the VNets per region. See `/.ado/pipelines/config/vnets-int.json` for an example. The deployment pipeline will check which VNets are currently not in use by any other deployment and then tag those VNets to mark them as in use. Once an environment gets destroyed, this "earmark" tag is removed again. See `/.ado/pipelines/templates/steps-get-or-create-vnet.yaml` for the pipeline script which implements this logic.
The reference implementation is currently configured to require a VNet with at least a /23 address space for each stamp. This allows for a /24 subnet for AKS nodes and their pods. Change this based on your scaling requirements (number of nodes and number of pods). To change the subnet sizes (and thereby the required input size of /23), modify `/src/infra/workload/releaseunit/modules/stamp/network.tf`.
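For a disconnected (E2E) environment where the reference implementation creates the VNet itself, the /23 layout described above could look roughly like this. Names and address ranges are examples only, and `var.resource_group_name` and `var.location` are assumed inputs.

```hcl
# Illustrative sketch of the per-stamp network layout (/23 VNet with a /24 AKS subnet).
resource "azurerm_virtual_network" "stamp" {
  name                = "${var.prefix}-${var.location}-vnet" # illustrative
  resource_group_name = var.resource_group_name
  location            = var.location
  address_space       = ["10.1.0.0/23"]
}

# /24 for AKS nodes and their pods.
resource "azurerm_subnet" "kubernetes" {
  name                 = "kubernetes-snet"
  resource_group_name  = var.resource_group_name
  virtual_network_name = azurerm_virtual_network.stamp.name
  address_prefixes     = ["10.1.0.0/24"]
}

# Remaining space for the Private Endpoints of the stamp's services.
resource "azurerm_subnet" "private_endpoints" {
  name                 = "private-endpoints-snet"
  resource_group_name  = var.resource_group_name
  virtual_network_name = azurerm_virtual_network.stamp.name
  address_prefixes     = ["10.1.1.0/24"]
}
```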
- Key Vault is used by the application as the sole configuration store, for both secret and non-sensitive values.
- `sku_name` is set to standard.
- Diagnostic settings are configured to store all log and metric data in Log Analytics.
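A minimal sketch of such a Key Vault, assuming illustrative names and input variables:

```hcl
# Minimal sketch of the per-stamp Key Vault (names are illustrative;
# var.resource_group_name and var.location are assumed inputs).
data "azurerm_client_config" "current" {}

resource "azurerm_key_vault" "stamp" {
  name                = "${var.prefix}-${var.location}-kv" # illustrative
  resource_group_name = var.resource_group_name
  location            = var.location
  tenant_id           = data.azurerm_client_config.current.tenant_id
  sku_name            = "standard"
}
```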
Azure Kubernetes Service (AKS) is used as the compute platform because it is the most versatile option and Kubernetes is the de facto standard compute platform for modern applications, both inside and outside of Azure.
The Azure Mission-Critical reference implementation uses Linux-only clusters, as there is no requirement for any Windows-based containers and Linux is the more mature platform for Kubernetes.
- `role_based_access_control` (RBAC) is enabled.
- `sku_tier` is set to Paid (Uptime SLA) to achieve the 99.95% SLA within a single region (with `availability_zones` enabled).
- `http_application_routing` is disabled as it is not recommended for production environments; a separate ingress controller solution is used.
- Managed Identities (SystemAssigned) are used instead of Service Principals.
- `addon_profile` configuration:
  - `azure_policy` is set to `true` to enable the use of Azure Policy in Azure Kubernetes Service. The policy configured in the reference implementation is in "audit-only" mode. It is mostly integrated to demonstrate how to set this up through Terraform.
  - `oms_agent` is configured to enable the Container Insights addon and ship AKS monitoring data to Azure Log Analytics via an in-cluster OMS Agent (DaemonSet).
- Diagnostic settings are configured to store all log and metric data in Log Analytics.
- `default_node_pool` (used as the system node pool) settings:
  - `availability_zones` is set to `3` to leverage all three AZs in a given region.
  - `enable_auto_scaling` is configured to let all node pools automatically scale out if needed.
  - `os_disk_type` is set to `Ephemeral` to leverage ephemeral OS disks for performance reasons.
  - `upgrade_settings` `max_surge` is set to `33%`, which is the recommended value for production workloads.
- A separate "workload" (aka user) node pool with the same settings as the "system" node pool but different VM SKUs and auto-scale settings.
- The user node pool is configured with a taint `workload=true:NoSchedule` to prevent non-workload pods from being scheduled. The `node_label` set to `role=workload` can be used to target this node pool when deploying a workload (see charts/catalogservice for an example).
Individual stamps are considered ephemeral and stateless. Updates to the infrastructure and application follow a zero-downtime update strategy and do not touch existing stamps. Updates to Kubernetes are therefore primarily rolled out by releasing new versions and replacing existing stamps. To update node images between two releases, `automatic_channel_upgrade` in combination with `maintenance_window` is used:
- `automatic_channel_upgrade` is set to `node-image` to automatically upgrade node pools with the most recent AKS node image.
- `maintenance_window` contains the allowed window to run `automatic_channel_upgrade` upgrades. It is currently set to `allowed` on `Sunday` between 0 and 2 am.
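Taken together, the cluster settings described in this section could be sketched in Terraform roughly as follows. This is a trimmed illustration, not the repository's actual module: names, VM SKUs and min/max counts are examples, and the attribute names follow the older azurerm provider syntax referenced above (`addon_profile`, `availability_zones`, `role_based_access_control`), which newer provider versions rename.

```hcl
# Trimmed, illustrative sketch of the AKS cluster and workload node pool.
resource "azurerm_kubernetes_cluster" "stamp" {
  name                = "${var.prefix}-${var.location}-aks" # illustrative
  resource_group_name = var.resource_group_name             # assumed input
  location            = var.location                        # assumed input
  dns_prefix          = "${var.prefix}${var.location}"
  sku_tier            = "Paid"                              # Uptime SLA

  role_based_access_control {
    enabled = true
  }

  identity {
    type = "SystemAssigned" # Managed Identity instead of a Service Principal
  }

  automatic_channel_upgrade = "node-image"

  maintenance_window {
    allowed {
      day   = "Sunday"
      hours = [0, 1] # 0:00 to 2:00 am
    }
  }

  default_node_pool {
    name                = "system"
    vm_size             = "Standard_D4s_v4" # illustrative SKU
    availability_zones  = ["1", "2", "3"]
    enable_auto_scaling = true
    min_count           = 3
    max_count           = 12
    os_disk_type        = "Ephemeral"
    upgrade_settings {
      max_surge = "33%"
    }
  }

  addon_profile {
    azure_policy {
      enabled = true
    }
    oms_agent {
      enabled                    = true
      log_analytics_workspace_id = var.log_analytics_workspace_id # assumed input
    }
    http_application_routing {
      enabled = false
    }
  }
}

# Separate "workload" (user) node pool, tainted and labeled for workload pods.
resource "azurerm_kubernetes_cluster_node_pool" "workload" {
  name                  = "workload"
  kubernetes_cluster_id = azurerm_kubernetes_cluster.stamp.id
  vm_size               = "Standard_F8s_v2" # illustrative SKU
  availability_zones    = ["1", "2", "3"]
  enable_auto_scaling   = true
  min_count             = 3
  max_count             = 12
  node_taints           = ["workload=true:NoSchedule"]
  node_labels           = { "role" = "workload" }
}
```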
Each region has an individual Log Analytics workspace configured to store all log and metric data. As each stamp deployment is considered ephemeral, these workspaces are deployed as part of the global resources and do not share the lifecycle of a stamp. This ensures that when a stamp is deleted (which happens regularly), logs are still available. Log Analytics workspaces reside in a separate resource group `<prefix>-monitoring-rg`.
- `sku` is set to PerGB2018.
- `daily_quota_gb` is set to `30` GB to prevent overspend, especially on environments that are used for load testing.
- `retention_in_days` is set to `30` days to prevent overspend by storing data longer than needed in Log Analytics; long-term log and metric retention is supposed to happen in Azure Storage.
- For the Health Model, a set of Kusto functions needs to be added to Log Analytics. There is a sub-resource type called `SavedSearch`. Because these queries can get quite bulky, they are loaded from files instead of being specified inline in Terraform. They are stored in the subdirectory monitoring/queries in the `/src/infra` directory.
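A sketch of a regional workspace and one file-based `SavedSearch` follows. Names are illustrative and the query name and file name are hypothetical, not the actual queries shipped with the repository.

```hcl
# Sketch of a regional Log Analytics workspace and one file-based SavedSearch.
resource "azurerm_log_analytics_workspace" "stamp" {
  name                = "${var.prefix}-${var.location}-log" # illustrative
  resource_group_name = "${var.prefix}-monitoring-rg"       # shared monitoring resource group
  location            = var.location                        # assumed input
  sku                 = "PerGB2018"
  daily_quota_gb      = 30
  retention_in_days   = 30
}

# Health Model queries are loaded from files and created as SavedSearch sub-resources.
resource "azurerm_log_analytics_saved_search" "health_model_example" {
  name                       = "StampHealthScore" # hypothetical query name
  log_analytics_workspace_id = azurerm_log_analytics_workspace.stamp.id
  category                   = "HealthModel"      # illustrative
  display_name               = "StampHealthScore"
  query                      = file("${path.module}/monitoring/queries/StampHealthScore.kql") # hypothetical file
}
```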
As with Log Analytics, Application Insights is also deployed per region and does not share the lifecycle of a stamp. All Application Insights resources are deployed in the separate resource group `<prefix>-monitoring-rg` as part of the global resources deployment.
- Log Analytics workspace-attached mode is used.
- `daily_data_cap_in_gb` is set to `30` GB to prevent overspend, especially on environments that are used for load testing.
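A sketch of the corresponding workspace-based Application Insights resource (names are illustrative; it references the Log Analytics workspace sketched above):

```hcl
# Sketch of a workspace-attached Application Insights resource.
resource "azurerm_application_insights" "stamp" {
  name                 = "${var.prefix}-${var.location}-appi" # illustrative
  resource_group_name  = "${var.prefix}-monitoring-rg"
  location             = var.location                         # assumed input
  application_type     = "web"
  workspace_id         = azurerm_log_analytics_workspace.stamp.id
  daily_data_cap_in_gb = 30
}
```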
Azure Policy is used to monitor and enforce certain baselines. All policies are assigned on a per-stamp, per-resource-group level. Azure Kubernetes Service is configured to use the `azure_policy` addon to leverage policies configured outside of Kubernetes.
- Each stamp has one `standard` tier, `zone_redundant` Event Hubs namespace.
- Auto-inflate (automatic scale-up) can optionally be enabled via a Terraform variable.
- The namespace holds one Event Hub `backendqueue-eh` with dedicated consumer groups for each consumer (currently only one).
- A Private Endpoint is deployed, which is used to securely access the Event Hub from within the stamp's VNet.
- Network restrictions are enabled to allow access only through Private Endpoints.
- Diagnostic settings are configured to store all log and metric data in Log Analytics.
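The Event Hub resources above could be sketched roughly as follows. Names, partition count and retention are illustrative, and `var.enable_auto_inflate` is an assumed Terraform variable rather than the repository's actual one.

```hcl
# Illustrative sketch of the per-stamp Event Hubs resources.
resource "azurerm_eventhub_namespace" "stamp" {
  name                     = "${var.prefix}-${var.location}-evhns" # illustrative
  resource_group_name      = var.resource_group_name               # assumed input
  location                 = var.location                          # assumed input
  sku                      = "Standard"
  capacity                 = 1                                     # throughput units (scale-unit minimum)
  zone_redundant           = true
  auto_inflate_enabled     = var.enable_auto_inflate
  maximum_throughput_units = var.enable_auto_inflate ? 10 : null   # scale-unit maximum
}

resource "azurerm_eventhub" "backendqueue" {
  name                = "backendqueue-eh"
  namespace_name      = azurerm_eventhub_namespace.stamp.name
  resource_group_name = var.resource_group_name
  partition_count     = 32 # illustrative
  message_retention   = 1  # illustrative, in days
}

resource "azurerm_eventhub_consumer_group" "backendprocessor" {
  name                = "backendprocessor" # illustrative
  namespace_name      = azurerm_eventhub_namespace.stamp.name
  eventhub_name       = azurerm_eventhub.backendqueue.name
  resource_group_name = var.resource_group_name
}
```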
- Two storage accounts are deployed per stamp:
  - A "public" storage account with "static website" enabled. This is used to host the UI single-page application.
  - A "private" storage account which is used for internals such as the health service and Event Hub checkpointing.
- Both accounts are deployed in zone-redundant mode (`ZRS`).
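A sketch of the "public" account, assuming illustrative names; the "private" account would look similar without the `static_website` block.

```hcl
# Sketch of the "public" zone-redundant storage account hosting the UI single-page application.
resource "azurerm_storage_account" "public" {
  name                     = "${var.prefix}${var.location}uist" # illustrative; no dashes allowed
  resource_group_name      = var.resource_group_name            # assumed input
  location                 = var.location                       # assumed input
  account_tier             = "Standard"
  account_replication_type = "ZRS"                              # zone-redundant
  static_website {
    index_document = "index.html"
  }
}
```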
This repository also contains a couple of supporting services for the Azure Mission-Critical project:
All resources used for Azure Mission-Critical follow a pre-defined and consistent naming structure to make them easier to identify and to avoid confusion. Resource abbreviations are based on the Cloud Adoption Framework. These abbreviations are typically attached as a suffix to each resource in Azure.
A prefix is used to uniquely identify "deployments", as some names in Azure must be globally unique. Examples include Storage Accounts, Container Registries and Cosmos DB accounts.
Resource groups
Resource group names begin with the prefix and then indicate whether they contain per-stamp or global resources. In case of per-stamp resource groups, the name also contains the Azure region they are deployed to.
`<prefix><suffix>-<global | stamp>-<region>-rg`
This will, for example, result in `aoprod-global-rg` for global services in prod, or `aoprod7745-stamp-eastus2-rg` for a stamp deployment in `eastus2`.
Resources
`<prefix><suffix>-<region>-<resource>` for resources that support `-` in their names, and `<prefix><region><resource>` for resources such as Storage Accounts, Container Registries and others that do not support `-` in their names.
This will result in, for example, `aoprod7745-eastus2-aks` for an AKS cluster in `eastus2`.
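As an illustration only (the repository builds names inside its own Terraform modules), the convention could be expressed as locals along these lines:

```hcl
# Illustration of the naming convention as Terraform locals; names and
# abbreviations here are examples, not the repository's actual definitions.
locals {
  # e.g. prefix = "aoprod", suffix = "7745", location = "eastus2"
  stamp_resource_group_name = "${var.prefix}${var.suffix}-stamp-${var.location}-rg" # aoprod7745-stamp-eastus2-rg

  # Resources that support dashes in their names:
  aks_cluster_name = "${var.prefix}${var.suffix}-${var.location}-aks" # aoprod7745-eastus2-aks

  # Resources that do not support dashes (e.g. Storage Accounts, Container Registries):
  container_registry_name = "${var.prefix}${var.location}cr" # illustrative abbreviation
}
```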