
CAPZ stays stuck in deleting mode when resources are all actually gone #4570

Closed
dtzar opened this issue Feb 13, 2024 · 10 comments
Labels

  • area/managedclusters Issues related to managed AKS clusters created through the CAPZ ManagedCluster Type
  • kind/bug Categorizes issue or PR as related to a bug.
  • priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release.
  • size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

Comments

@dtzar
Contributor

dtzar commented Feb 13, 2024

/kind bug

What steps did you take and what happened:
Ran the command to delete an existing deployed cluster and it stayed permanently stuck in the deleting state. I then manually deleted the cluster resources created by CAPZ and CAPZ still stayed stuck in the deleting state.

What did you expect to happen:
CAPZ would detect that the resources it was trying to delete were already removed and clean up the resource definitions that were stuck trying to delete something that is clearly already gone.
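
For illustration, here is a minimal Go sketch of the behavior I'd expect (a hypothetical helper, not CAPZ's actual code; `deleteAzureResource` and the finalizer name are stand-ins for whatever the controller really uses): if Azure reports 404 for a resource we are deleting, treat that as success, drop the finalizer, and let the object go away.

```go
// Hypothetical sketch only: treat a 404 from Azure as a successful delete so
// the finalizer can be removed and the Kubernetes object can go away.
package example

import (
	"context"
	"errors"
	"net/http"

	"github.com/Azure/azure-sdk-for-go/sdk/azcore"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"
)

// reconcileDelete is a stand-in for a controller's delete path.
// deleteAzureResource represents whatever call actually deletes the Azure
// resource backing obj.
func reconcileDelete(ctx context.Context, c client.Client, obj client.Object, finalizer string,
	deleteAzureResource func(context.Context) error) (ctrl.Result, error) {

	err := deleteAzureResource(ctx)

	// A 404 means the resource is already gone; anything else is a real error
	// and should be retried on the next reconcile.
	var respErr *azcore.ResponseError
	alreadyGone := errors.As(err, &respErr) && respErr.StatusCode == http.StatusNotFound
	if err != nil && !alreadyGone {
		return ctrl.Result{}, err
	}

	// Deleted (or already gone): remove the finalizer so the object is
	// actually removed instead of staying stuck in "Deleting".
	controllerutil.RemoveFinalizer(obj, finalizer)
	if err := c.Update(ctx, obj); err != nil {
		return ctrl.Result{}, err
	}
	return ctrl.Result{}, nil
}
```

The point is just that whichever layer owns the finalizer should treat "not found" during deletion as done rather than retrying forever.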

Environment:
management cluster

  • cluster-api-provider-azure version: 1.13.1
  • Kubernetes version: (use kubectl version): 1.29.1
  • OS (e.g. from /etc/os-release): WSL Ubuntu 22.04, Docker Desktop Kubernetes
@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Feb 13, 2024
@nojnhuh
Contributor

nojnhuh commented Feb 14, 2024

I'd be surprised if this were a general problem since I don't remember seeing this ever in e2e. Can you provide any more details about what the cluster template looked like or if you were doing anything to it around the time it was being deleted? It would also be helpful to know what the capz-controller-manager was logging for the CAPZ resources while they were stuck and what the full YAML of the resources was then.

@dtzar
Contributor Author

dtzar commented Feb 15, 2024

I think the errors looked like this when it was in that state. These are the logs from a state where the AKS cluster could not be created.

ASO logs

I0214 23:48:07.738380       1 generic_reconciler.go:96] "msg"="Encountered error, re-queuing..." "logger"="controllers.ResourceGroupController" "name"="argocluster" "namespace"="default" "result"={"Requeue":false,"RequeueAfter":180000000000}
I0214 23:51:07.737209       1 common.go:58] "msg"="Reconcile invoked" "annotations"={"serviceoperator.azure.com/credential-from":"argocluster-aso-secret","serviceoperator.azure.com/operator-namespace":"capz-system","serviceoperator.azure.com/reconcile-policy":"skip","serviceoperator.azure.com/resource-id":"/subscriptions/#-4518-4d05-9d6d-a29b5cf33a8d/resourceGroups/argocluster"} "conditions"="[Condition [Ready], Status = \"False\", ObservedGeneration = 1, Severity = \"Warning\", Reason = \"Failed\", Message = \"error getting status for resource ID \\\"/subscriptions/#-4518-4d05-9d6d-a29b5cf33a8d/resourceGroups/argocluster\\\": getting resource with ID: \\\"/subscriptions/#-4518-4d05-9d6d-a29b5cf33a8d/resourceGroups/argocluster\\\": ClientSecretCredential: unable to resolve an endpoint: server response error:\\n Get \\\"https://login.microsoftonline.com/#-345f-445e-90b7-31ac0c5cf7ef/v2.0/.well-known/openid-configuration\\\": dial tcp: lookup login.microsoftonline.com on 10.96.0.10:53: server misbehaving\", LastTransitionTime = \"2024-02-14 23:41:18 +0000 UTC\"]" "creationTimestamp"="2024-02-14T23:40:55Z" "deletionTimestamp"=null "finalizers"=["serviceoperator.azure.com/finalizer"] "generation"=1 "kind"={"kind":"ResourceGroup","apiVersion":"resources.azure.com/v1api20200601storage"} "logger"="controllers.ResourceGroupController" "name"="argocluster" "namespace"="default" "owner"=null "ownerReferences"=[{"apiVersion":"infrastructure.cluster.x-k8s.io/v1beta1","kind":"AzureManagedControlPlane","name":"argocluster","uid":"acd970a7-adff-4fb8-af20-98014aee4f3d","controller":true,"blockOwnerDeletion":true}] "resourceVersion"="732785" "uid"="162c81d9-4718-49e8-a5f3-d3a7a305d12e"
I0214 23:51:07.737286       1 generic_reconciler.go:335] "msg"="Skipping creation/update of resource due to policy" "logger"="controllers.ResourceGroupController" "name"="argocluster" "namespace"="default" "serviceoperator.azure.com/reconcile-policy"="skip"
I0214 23:51:07.737464       1 azure_generic_arm_reconciler_instance.go:421] "msg"="Resource successfully created/updated" "azureName"="argocluster" "logger"="controllers.ResourceGroupController" "name"="argocluster" "namespace"="default" "resourceID"="/subscriptions/##-4518-4d05-9d6d-a29b5cf33a8d/resourceGroups/argocluster"
I0214 23:51:07.737700       1 recorder.go:104] "msg"="Using credential from \"default/argocluster-aso-secret\"" "logger"="events" "object"={"kind":"ResourceGroup","namespace":"default","name":"argocluster","uid":"162c81d9-4718-49e8-a5f3-d3a7a305d12e","apiVersion":"resources.azure.com/v1api20200601storage","resourceVersion":"732785"} "reason"="CredentialFrom" "type"="Normal"
E0214 23:51:30.238458       1 generic_reconciler.go:361] "msg"="Encountered error impacting Ready condition" "error"="Reason: Failed, Severity: Warning, RetryClassification: RetrySlow, Cause: error getting status for resource ID \"/subscriptions/#-4518-4d05-9d6d-a29b5cf33a8d/resourceGroups/argocluster\": getting resource with ID: \"/subscriptions/#-4518-4d05-9d6d-a29b5cf33a8d/resourceGroups/argocluster\": ClientSecretCredential: unable to resolve an endpoint: server response error:\n Get \"https://login.microsoftonline.com/#-345f-445e-90b7-31ac0c5cf7ef/v2.0/.well-known/openid-configuration\": dial tcp: lookup login.microsoftonline.com on 10.96.0.10:53: server misbehaving" "logger"="controllers.ResourceGroupController" "name"="argocluster" "namespace"="default"
I0214 23:51:30.280058       1 generic_reconciler.go:96] "msg"="Encountered error, re-queuing..." "logger"="controllers.ResourceGroupController" "name"="argocluster" "namespace"="default" "result"={"Requeue":false,"RequeueAfter":180000000000}

CAPZ logs

I0214 23:58:49.719804       1 azuremanagedcluster_controller.go:169] "Successfully reconciled" logger="controllers.AzureManagedClusterReconciler.Reconcile" controller="azuremanagedcluster" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AzureManagedCluster" AzureManagedCluster="default/argocluster" namespace="default" name="argocluster" reconcileID="4f7de029-f703-4b0b-9c80-1fd98c860b88" kind="AzureManagedCluster" namespace="default" name="argocluster" x-ms-correlation-request-id="821674f9-3b7e-4d53-ad30-05646b3e8a39" cluster="argocluster" controlPlane="argocluster"
I0214 23:58:49.720816       1 azuremanagedmachinepool_controller.go:194] "AzureManagedControlPlane is not initialized" logger="controllers.AzureManagedMachinePoolReconciler.Reconcile" controller="azuremanagedmachinepool" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AzureManagedMachinePool" AzureManagedMachinePool="default/argocluster-pool0" namespace="default" name="argocluster-pool0" reconcileID="ce0d7739-b492-45e9-8e32-e3225ac5b3e5" namespace="default" name="argocluster-pool0" kind="AzureManagedMachinePool" x-ms-correlation-request-id="07d964de-46a9-424a-8caa-a300e0848848" ownerCluster="argocluster"
I0214 23:58:57.004553       1 azuremanagedcontrolplane_controller.go:234] "Reconciling AzureManagedControlPlane" logger="controllers.AzureManagedControlPlaneReconciler.reconcileNormal" controller="azuremanagedcontrolplane" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AzureManagedControlPlane" AzureManagedControlPlane="default/argocluster" namespace="default" name="argocluster" reconcileID="496335b7-c7c5-42d6-b1a7-4e99ef398eca" x-ms-correlation-request-id="cced4b6d-844f-46f0-8d73-a07c07b8d9ff"
E0214 23:58:57.004718       1 managedcontrolplane.go:427] "Unable to determine if ManagedControlPlaneScope VNET is managed by capz, assuming unmanaged" err="VirtualNetwork.network.azure.com \"argocluster\" not found" logger="scope.ManagedControlPlaneScope.IsVnetManaged" x-ms-correlation-request-id="28d3fddd-4e53-4226-ac43-d50c841734ff" AzureManagedCluster="argocluster"

@dtzar
Contributor Author

dtzar commented Feb 27, 2024

Unfortunately, I can't reproduce the problem right now - but will circle back when/if I do.

@nojnhuh
Contributor

nojnhuh commented Feb 27, 2024

It almost seems like there's something wonky in your Workload ID setup or your sub or something. I've never seen this kind of error in e2e or locally for me.

@dtzar
Contributor Author

dtzar commented Mar 14, 2024

I have a live repro of this now. I'm pretty sure you can reproduce this by doing the following:

  1. Let CAPZ create a cluster
  2. Power off the CAPZ management cluster
  3. Manually delete the AKS cluster resource (but leave everything else in the RG)
  4. Bring back online the CAPZ management cluster

Here are the logs I have right now from ASO (also using the 1.14.0 release).

I0314 21:49:12.626378       1 recorder.go:104] "msg"="Reason: ResourceNotFound, Severity: Warning, RetryClassification: RetrySlow, Cause: The Resource 'Microsoft.ContainerService/managedClusters/argocluster' under resource group 'argocluster' was not found. For more details please go to https://aka.ms/ARMResourceNotFoundFix: PUT https://management.azure.com/subscriptions/####/resourceGroups/argocluster/providers/Microsoft.ContainerService/managedClusters/argocluster/agentPools/pool1\n--------------------------------------------------------------------------------\nRESPONSE 404: 404 Not Found\nERROR CODE: ResourceNotFound\n

@nawazkh
Member

nawazkh commented Mar 15, 2024

#4609 may or may not be connected to this issue, but since both issues are around deletion, I am mentioning it here as well.

@dtzar dtzar added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. area/managedclusters Issues related to managed AKS clusters created through the CAPZ ManagedCluster Type labels Mar 25, 2024
@willie-yao
Contributor

Trying to fix this now; I followed the steps to reproduce and I'm getting a slightly different error:

[manager] E0327 00:36:47.016389       1 generic_reconciler.go:361] "msg"="Encountered error impacting Ready condition" "error"="Reason: ReconciliationBlocked, Severity: Warning, RetryClassification: RetrySlow, Cause: Managed cluster \"aks-6629\" is in provisioning state \"Deleting\"" "logger"="controllers.ManagedClustersAgentPoolController" "name"="aks-6629-pool1" "namespace"="default"

I think that might be because I didn't wait for the Azure resource to fully delete before restarting the management cluster. This should be caused by the same issue though. I think the last time we reproduced this together, we also started a delete locally before doing so on Azure.

@dtzar
Contributor Author

dtzar commented Mar 27, 2024

Yes, we did start the delete before powering the cluster off. So between steps 1 and 2, add "delete the CAPZ cluster object" and don't let the deletion finish.

Either case should be handled in some way, IMO. If you don't initiate deletion of the CAPZ cluster object and instead manually delete the cluster while the management cluster is off, CAPZ should re-create the cluster from the definition it still has when it powers back on...

@willie-yao
Contributor

Just confirmed that the cluster does delete after powering the management cluster off and on, even in the case where replicas is set to 0. The cluster deletion just takes a while (~25 minutes). This can be tracked in #4339.

/close

@k8s-ci-robot
Contributor

@willie-yao: Closing this issue.

In response to this:

Just confirmed that the cluster does delete after powering the management cluster off and on, even in the case where replicas is set to 0. The cluster deletion just takes a while (~25 minutes). This can be tracked in #4339.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
