This repository has been archived by the owner on Oct 24, 2023. It is now read-only.

Windows microsoft-aks VHD images not available #3700

Closed
jackfrancis opened this issue Aug 14, 2020 · 3 comments
Labels
bug Something isn't working

Comments

@jackfrancis
Member

jackfrancis commented Aug 14, 2020

Updated: this issue is mitigated for new cluster creates w/ the following AKS Engine versions:

Also, for historical purposes, note that any version of AKS Engine after v0.54.1 will not be affected.

This issue describes error outcomes due to an Azure incident beginning approximately Thursday August 13 at 10 p.m. PST:

All clusters built with a reference to the microsoft-aks Windows VHD image were unable to scale, and new clusters with Windows nodes referring to that VHD could not be created.

This was the result of all Windows VHD images being deleted. Replacement Windows VHD images were built to enable new clusters w/ Windows node pools.

How do I know if I'm affected?

If you're running a Kubernetes cluster created by any version of AKS Engine with a Windows node pool ("osType": "Windows" agentPoolProfiles configuration in your api model), then your cluster may have been affected by this incident. If you're running vanilla VMs (in other words, VMs in an availability set, and not VMSS), then the guidance is to wait until a future AKS Engine release before performing any scale operations (see above list to determine if a suitable AKS Engine patch version is available for you). If you're running a VMSS Windows node pool, AKS Engine engineers have updated your VMSS model to ensure that future scale operations refer to a working VHD reference suitable for your cluster.

How do I know if my VMSS is affected?

Updated: All existing VMSS clusters have been patched in the backend to ensure that VMSS models point to a working VHD image reference. Scale operations using the VMSS API will work as expected.

The following example commands assume that you have the az CLI, and that you have the open source jq tool to perform JSON queries against the JSON output from az. Also, we assume you have exported the subscription ID and resource group of the cluster as the SUBSCRIPTION_ID and RESOURCE_GROUP environment variables, respectively.
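
Before running the queries, it may help to confirm that those prerequisites are in place. The following is a minimal sketch and not part of the original guidance; the `missing_prereqs` helper name is hypothetical, and it only checks that the `az` and `jq` commands are on the PATH and that the two environment variables are non-empty.

```shell
# Hypothetical helper: print one line per missing prerequisite.
# Prints nothing when az, jq, SUBSCRIPTION_ID and RESOURCE_GROUP are all available.
missing_prereqs() {
  for cmd in az jq; do
    command -v "$cmd" >/dev/null 2>&1 || echo "missing command: $cmd"
  done
  for var in SUBSCRIPTION_ID RESOURCE_GROUP; do
    # Look up the variable by name; treat unset and empty the same way.
    eval "val=\${$var:-}"
    [ -n "$val" ] || echo "unset variable: $var"
  done
}
```

Running `missing_prereqs` with no output suggests the example commands can be run as written.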

On any candidate cluster that you suspect may be affected, you can query all VMSS in the resource group and look for those that are using the affected "aks-windows" image reference:

$ az vmss list --subscription $SUBSCRIPTION_ID --resource-group $RESOURCE_GROUP | jq -r '.[] | select(.virtualMachineProfile.storageProfile.imageReference.offer=="aks-windows").name'

If you get any VMSS names listed from the above command, then you are running Kubernetes nodes affected by the above incident. Future scale out operations will work as a result of a backend update to the VMSS model. For your next cluster deployment, you must use one of the above listed patch versions to create your cluster using aks-engine.

How do I know if my VMAS (availability set or "vanilla" VMs) is affected?

$ az vm list --subscription $SUBSCRIPTION_ID --resource-group $RESOURCE_GROUP | jq -r '.[] | select(.storageProfile.imageReference.offer=="aks-windows").name'

Again, if any VMs are listed by the above command, you were affected.
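
The two checks above can be combined into one script. This is a sketch under assumptions: the `affected_names` and `check_cluster` function names are hypothetical, and the jq filter relies on the "aks-windows" offer appearing under `.virtualMachineProfile.storageProfile.imageReference` for VMSS and under `.storageProfile.imageReference` for VMs, as in the commands above.

```shell
# Hypothetical helper: read a JSON array (az vmss list or az vm list output)
# on stdin and print the name of each entry using the "aks-windows" offer.
affected_names() {
  jq -r '.[]
    | select((.virtualMachineProfile.storageProfile.imageReference.offer
              // .storageProfile.imageReference.offer) == "aks-windows")
    | .name'
}

# Hypothetical wrapper: run both queries for the cluster's resource group.
# Assumes SUBSCRIPTION_ID and RESOURCE_GROUP are exported as described above.
check_cluster() {
  echo "Affected VMSS:"
  az vmss list --subscription "$SUBSCRIPTION_ID" --resource-group "$RESOURCE_GROUP" | affected_names
  echo "Affected VMs:"
  az vm list --subscription "$SUBSCRIPTION_ID" --resource-group "$RESOURCE_GROUP" | affected_names
}
```

Calling `check_cluster` prints any matching names under each heading; an empty list under both suggests the resource group has no nodes built from the affected image reference.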

What's the current guidance?

We are in the process of publishing patch releases for every affected AKS Engine version. Status of patches:

For new cluster create operations, we recommend using the patch version that corresponds with the "known-working" AKS Engine version used in your environment to bootstrap Kubernetes. For example, if you have had success creating Kubernetes clusters using AKS Engine v0.54.0, use v0.54.1 for your next cluster create operation.

For scaling existing clusters, you will want to replace your existing version of the aks-engine binary with its corresponding patch release. You can run aks-engine version to discover which version you're using to run scale operations against your cluster:

$ aks-engine version
Version: v0.54.0
GitCommit: fd2c45db1
GitTreeState: clean
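
If you script your scale operations, the version string can be pulled out of that output. A minimal sketch, assuming the output format shown above; the `parse_aks_engine_version` name is hypothetical.

```shell
# Hypothetical helper: read `aks-engine version` output on stdin and
# print only the value of the "Version:" line (for example v0.54.0).
parse_aks_engine_version() {
  awk '$1 == "Version:" { print $2; exit }'
}

# Usage:
#   aks-engine version | parse_aks_engine_version
```

The result can then be compared against the patch release you intend to use before running a scale operation.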

Again: for clusters running VMSS Windows node pools, those VMSS have been automatically updated to reflect the new, working image references. If you use the Azure VMSS interface (via UI, CLI, or SDK), or cluster-autoscaler, to scale Windows nodes in your cluster, you do not need to take any action. However, if you are using vanilla Windows VMs (VM Availability Sets, or VMAS), then you'll need to get the patched aks-engine binary to continue using it to scale your clusters.

@andyzhangx
Contributor

andyzhangx commented Aug 15, 2020

When will aks-engine v0.54.0 be hotfixed? Our upstream pipelines all depend on this version, thanks.
Update:
Looks like we need to use v0.54.1 to adopt the hotfix.

@jackfrancis
Member Author

Patch releases have been provided.

@xuto2 xuto2 unpinned this issue Sep 25, 2020