K8S cluster provisioning/VMSS restart failed due to VMExtensionProvisioningError on VMAccessForLinux extension #918

kwikwag · 2019-10-16T21:50:38Z

I am using a VMSS-backed westus-located K8S (1.14.6; aksEngineVersion : v0.40.2-aks) cluster.

I followed the guide "Connect with SSH to Azure Kubernetes Service (AKS) cluster nodes for maintenance or troubleshooting" so that I could connect to my K8S nodes via SSH (to solve yet another issue). That went well. However, a few days later I ran a same-version upgrade of my K8S cluster to try to deal with a further issue, and I received multiple (about 60) deployment errors on the corresponding MC_ resource group:

{"code":"DeploymentFailed","message":"At least one resource deployment operation failed. Please list deployment operations for details. Please see https://aka.ms/arm-debug for usage details.","details":[{"code":"Conflict","message":"{\r\n \"status\": \"Failed\",\r\n \"error\": {\r\n \"code\": \"ResourceDeploymentFailure\",\r\n \"message\": \"The resource operation completed with terminal provisioning state 'Failed'.\",\r\n \"details\": [\r\n {\r\n \"code\": \"VMExtensionProvisioningError\",\r\n \"message\": \"Multiple VM extensions failed to be provisioned on the VM. Please see the VM extension instance view for other failures. The first extension failed due to the error: Provisioning of VM extension 'VMAccessForLinux' has timed out. Extension installation may be taking too long, or extension status could not be obtained.\"\r\n }\r\n ]\r\n }\r\n}"}]}
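For anyone trying to read the payload above: the interesting error is buried in a JSON document that is itself serialized into the `message` string of the `Conflict` detail. Below is a minimal Python sketch of unwrapping it. The function name `innermost_errors` is hypothetical, and the sample is an abbreviated stand-in with the same shape as the real payload, not the full text.

```python
import json


def innermost_errors(error: dict) -> list[dict]:
    """Return the leaf errors of an ARM-style error, decoding nested JSON messages."""
    message = error.get("message", "")
    # Some messages are themselves JSON documents serialized into a string.
    try:
        nested = json.loads(message)
    except (TypeError, ValueError):
        nested = None
    if isinstance(nested, dict) and isinstance(nested.get("error"), dict):
        return innermost_errors(nested["error"])
    details = error.get("details") or []
    if not details:
        # No further nesting: this is a leaf error.
        return [{"code": error.get("code"), "message": message}]
    leaves = []
    for detail in details:
        leaves.extend(innermost_errors(detail))
    return leaves


# Abbreviated sample (assumption: same structure as the payload above).
inner = {
    "status": "Failed",
    "error": {
        "code": "ResourceDeploymentFailure",
        "message": "The resource operation completed with terminal provisioning state 'Failed'.",
        "details": [
            {
                "code": "VMExtensionProvisioningError",
                "message": "Provisioning of VM extension 'VMAccessForLinux' has timed out.",
            }
        ],
    },
}
payload = json.dumps({
    "code": "DeploymentFailed",
    "message": "At least one resource deployment operation failed.",
    "details": [{"code": "Conflict", "message": json.dumps(inner, indent=1)}],
})

for leaf in innermost_errors(json.loads(payload)):
    print(leaf["code"], "-", leaf["message"])
```

With the sample above, the only leaf reported is the `VMExtensionProvisioningError` entry, which is the part that actually names the failing extension.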

My K8S cluster ultimately (after about 2.5h) entered a Failed state. During and after this failing deployment loop, I tried to delete the extension with the CLI, reinstall it with different configurations (including an empty one) and different versions, and finally reinstall it as per the guide (az vmss extension set ...) with the same settings I used originally. Each operation failed independently with the extension provisioning error shown above. However, after a delete, even though I got an error message, az vmss extension list showed that the extension had indeed disappeared from the list of extensions on the VMSS, and running two consecutive deletes showed:

$ az vmss extension delete --resource-group $CLUSTER_RESOURCE_GROUP --vmss-name $SCALE_SET_NAME --name VMAccessForLinux
ERROR: Extension VMAccessForLinux not found

However, when restarting the VMSS via the Azure portal (by accessing the MC_ resource group), I still received the above error.

I then tried deleting the extension from the Azure portal, verifying with the CLI that it was deleted, and then retrying a same-version upgrade of the K8S cluster to recover from the Failed state. I got the same errors, even though the extension did not show in the portal's VMSS Extensions page. This time I got 40 failed deployments (with the initial one taking 53 minutes), again failing after 2.5 hours.

Luckily (or not), I had SSH access to the node ( :) ), so I could locate the logs. Surprisingly, the installed version is 1.5.3, even though I used 1.4 when I originally installed the extension following the guide. Perhaps the version changed during my attempts to delete/reset the extension when the cluster first failed?

2019/10/16 17:49:25 [Microsoft.OSTCExtensions.VMAccessForLinux-1.5.3] sequence number is 0
2019/10/16 17:49:25 [Microsoft.OSTCExtensions.VMAccessForLinux-1.5.3] setting file path is/var/lib/waagent/Microsoft.OSTCExtensions.VMAccessForLinux-1.5.3/config/0.settings
2019/10/16 17:49:25 [Microsoft.OSTCExtensions.VMAccessForLinux-1.5.3] JSON config:
2019/10/16 17:49:25 ERROR:[Microsoft.OSTCExtensions.VMAccessForLinux-1.5.3] JSON exception decoding
2019/10/16 17:49:25 ERROR:[Microsoft.OSTCExtensions.VMAccessForLinux-1.5.3] JSON error processing settings file:
2019/10/16 17:49:25 [Microsoft.OSTCExtensions.VMAccessForLinux-1.5.3] Current sequence number, 0, is not greater than the sequnce number of the most recent executed configuration. Exiting...

The times don't coincide with the failing MC_ deployments though, which repeatedly fail every 4 minutes. The file /var/lib/waagent/Microsoft.OSTCExtensions.VMAccessForLinux-1.5.3/config/0.settings is empty, which could explain the error, but when I rewrote it to contain an empty JSON document ({}) and then restarted the VMSS, the file was simply overwritten.
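To illustrate why an empty 0.settings file would plausibly produce a JSON decoding error while {} would not, here is a small Python sketch using only the standard json module. It is not the extension's own code; `check_settings_file` is a hypothetical helper, and the temporary file stands in for the real settings path.

```python
import json
import tempfile
from pathlib import Path


def check_settings_file(path: Path) -> str:
    """Classify a settings file as 'empty', 'invalid', or 'ok'."""
    text = path.read_text(encoding="utf-8")
    if not text.strip():
        # An empty file is not a valid JSON document, so decoding would fail.
        return "empty"
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return "invalid"
    return "ok"


# Stand-in for the real 0.settings file (assumption: a temp file, not the node's path).
with tempfile.TemporaryDirectory() as tmp:
    settings = Path(tmp) / "0.settings"
    settings.write_text("", encoding="utf-8")
    print(check_settings_file(settings))  # prints "empty"
    settings.write_text("{}", encoding="utf-8")
    print(check_settings_file(settings))  # prints "ok"
```

On a node, the same check could be pointed at the settings path from the log to confirm whether the file is empty or merely malformed.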

I'm at a loss and so is my K8S cluster. Help?

kwikwag mentioned this issue Oct 17, 2019

Disk attachment/mounting problems, all pods with PVCs stuck in ContainerCreating Azure/AKS#1278

Closed
