
CAPZ stays stuck in deleting mode when resources are all actually gone #4570

Closed
dtzar opened this issue Feb 13, 2024 · 10 comments
Labels

  • area/managedclusters Issues related to managed AKS clusters created through the CAPZ ManagedCluster Type
  • kind/bug Categorizes issue or PR as related to a bug.
  • priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release.
  • size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

Comments

@dtzar
Contributor

dtzar commented Feb 13, 2024

/kind bug

What steps did you take and what happened:
Ran the command to delete an existing deployed cluster and it stayed permanently stuck in the deleting state. I then manually deleted the cluster resources created by CAPZ and CAPZ still stayed stuck in the deleting state.

What did you expect to happen:
CAPZ would detect that the resources it was trying to delete were already removed and clean up the resource definitions that were stuck trying to delete something that is clearly already gone.
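
For illustration, here is a minimal Go sketch of the behavior I'd expect (a hypothetical helper, not CAPZ's actual code; `deleteAzureResource` and the finalizer name are stand-ins for whatever the controller really uses): if Azure reports 404 for a resource we are deleting, treat that as success, drop the finalizer, and let the object go away.

```go
// Hypothetical sketch only: treat a 404 from Azure as a successful delete so
// the finalizer can be removed and the Kubernetes object can go away.
package example

import (
	"context"
	"errors"
	"net/http"

	"github.com/Azure/azure-sdk-for-go/sdk/azcore"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"
)

// reconcileDelete is a stand-in for a controller's delete path.
// deleteAzureResource represents whatever call actually deletes the Azure
// resource backing obj.
func reconcileDelete(ctx context.Context, c client.Client, obj client.Object, finalizer string,
	deleteAzureResource func(context.Context) error) (ctrl.Result, error) {

	err := deleteAzureResource(ctx)

	// A 404 means the resource is already gone; anything else is a real error
	// and should be retried on the next reconcile.
	var respErr *azcore.ResponseError
	alreadyGone := errors.As(err, &respErr) && respErr.StatusCode == http.StatusNotFound
	if err != nil && !alreadyGone {
		return ctrl.Result{}, err
	}

	// Deleted (or already gone): remove the finalizer so the object is
	// actually removed instead of staying stuck in "Deleting".
	controllerutil.RemoveFinalizer(obj, finalizer)
	if err := c.Update(ctx, obj); err != nil {
		return ctrl.Result{}, err
	}
	return ctrl.Result{}, nil
}
```

The point is just that whichever layer owns the finalizer should treat "not found" during deletion as done rather than retrying forever.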

Environment:
management cluster

  • cluster-api-provider-azure version: 1.13.1
  • Kubernetes version: (use kubectl version): 1.29.1
  • OS (e.g. from /etc/os-release): WSL Ubuntu 22.04, Docker Desktop Kubernetes
@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Feb 13, 2024
@nojnhuh
Contributor

nojnhuh commented Feb 14, 2024

I'd be surprised if this were a general problem since I don't remember seeing this ever in e2e. Can you provide any more details about what the cluster template looked like or if you were doing anything to it around the time it was being deleted? It would also be helpful to know what the capz-controller-manager was logging for the CAPZ resources while they were stuck and what the full YAML of the resources was then.

@dtzar
Contributor Author

dtzar commented Feb 15, 2024

I think the errors looked like this when it was in that state. These are the logs from a state where the AKS cluster could not be created.

ASO logs

I0214 23:48:07.738380       1 generic_reconciler.go:96] "msg"="Encountered error, re-queuing..." "logger"="controllers.ResourceGroupController" "name"="argocluster" "namespace"="default" "result"={"Requeue":false,"RequeueAfter":180000000000}
I0214 23:51:07.737209       1 common.go:58] "msg"="Reconcile invoked" "annotations"={"serviceoperator.azure.com/credential-from":"argocluster-aso-secret","serviceoperator.azure.com/operator-namespace":"capz-system","serviceoperator.azure.com/reconcile-policy":"skip","serviceoperator.azure.com/resource-id":"/subscriptions/#-4518-4d05-9d6d-a29b5cf33a8d/resourceGroups/argocluster"} "conditions"="[Condition [Ready], Status = \"False\", ObservedGeneration = 1, Severity = \"Warning\", Reason = \"Failed\", Message = \"error getting status for resource ID \\\"/subscriptions/#-4518-4d05-9d6d-a29b5cf33a8d/resourceGroups/argocluster\\\": getting resource with ID: \\\"/subscriptions/#-4518-4d05-9d6d-a29b5cf33a8d/resourceGroups/argocluster\\\": ClientSecretCredential: unable to resolve an endpoint: server response error:\\n Get \\\"https://login.microsoftonline.com/#-345f-445e-90b7-31ac0c5cf7ef/v2.0/.well-known/openid-configuration\\\": dial tcp: lookup login.microsoftonline.com on 10.96.0.10:53: server misbehaving\", LastTransitionTime = \"2024-02-14 23:41:18 +0000 UTC\"]" "creationTimestamp"="2024-02-14T23:40:55Z" "deletionTimestamp"=null "finalizers"=["serviceoperator.azure.com/finalizer"] "generation"=1 "kind"={"kind":"ResourceGroup","apiVersion":"resources.azure.com/v1api20200601storage"} "logger"="controllers.ResourceGroupController" "name"="argocluster" "namespace"="default" "owner"=null "ownerReferences"=[{"apiVersion":"infrastructure.cluster.x-k8s.io/v1beta1","kind":"AzureManagedControlPlane","name":"argocluster","uid":"acd970a7-adff-4fb8-af20-98014aee4f3d","controller":true,"blockOwnerDeletion":true}] "resourceVersion"="732785" "uid"="162c81d9-4718-49e8-a5f3-d3a7a305d12e"
I0214 23:51:07.737286       1 generic_reconciler.go:335] "msg"="Skipping creation/update of resource due to policy" "logger"="controllers.ResourceGroupController" "name"="argocluster" "namespace"="default" "serviceoperator.azure.com/reconcile-policy"="skip"
I0214 23:51:07.737464       1 azure_generic_arm_reconciler_instance.go:421] "msg"="Resource successfully created/updated" "azureName"="argocluster" "logger"="controllers.ResourceGroupController" "name"="argocluster" "namespace"="default" "resourceID"="/subscriptions/##-4518-4d05-9d6d-a29b5cf33a8d/resourceGroups/argocluster"
I0214 23:51:07.737700       1 recorder.go:104] "msg"="Using credential from \"default/argocluster-aso-secret\"" "logger"="events" "object"={"kind":"ResourceGroup","namespace":"default","name":"argocluster","uid":"162c81d9-4718-49e8-a5f3-d3a7a305d12e","apiVersion":"resources.azure.com/v1api20200601storage","resourceVersion":"732785"} "reason"="CredentialFrom" "type"="Normal"
E0214 23:51:30.238458       1 generic_reconciler.go:361] "msg"="Encountered error impacting Ready condition" "error"="Reason: Failed, Severity: Warning, RetryClassification: RetrySlow, Cause: error getting status for resource ID \"/subscriptions/#-4518-4d05-9d6d-a29b5cf33a8d/resourceGroups/argocluster\": getting resource with ID: \"/subscriptions/#-4518-4d05-9d6d-a29b5cf33a8d/resourceGroups/argocluster\": ClientSecretCredential: unable to resolve an endpoint: server response error:\n Get \"https://login.microsoftonline.com/#-345f-445e-90b7-31ac0c5cf7ef/v2.0/.well-known/openid-configuration\": dial tcp: lookup login.microsoftonline.com on 10.96.0.10:53: server misbehaving" "logger"="controllers.ResourceGroupController" "name"="argocluster" "namespace"="default"
I0214 23:51:30.280058       1 generic_reconciler.go:96] "msg"="Encountered error, re-queuing..." "logger"="controllers.ResourceGroupController" "name"="argocluster" "namespace"="default" "result"={"Requeue":false,"RequeueAfter":180000000000}

CAPZ logs

I0214 23:58:49.719804       1 azuremanagedcluster_controller.go:169] "Successfully reconciled" logger="controllers.AzureManagedClusterReconciler.Reconcile" controller="azuremanagedcluster" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AzureManagedCluster" AzureManagedCluster="default/argocluster" namespace="default" name="argocluster" reconcileID="4f7de029-f703-4b0b-9c80-1fd98c860b88" kind="AzureManagedCluster" namespace="default" name="argocluster" x-ms-correlation-request-id="821674f9-3b7e-4d53-ad30-05646b3e8a39" cluster="argocluster" controlPlane="argocluster"
I0214 23:58:49.720816       1 azuremanagedmachinepool_controller.go:194] "AzureManagedControlPlane is not initialized" logger="controllers.AzureManagedMachinePoolReconciler.Reconcile" controller="azuremanagedmachinepool" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AzureManagedMachinePool" AzureManagedMachinePool="default/argocluster-pool0" namespace="default" name="argocluster-pool0" reconcileID="ce0d7739-b492-45e9-8e32-e3225ac5b3e5" namespace="default" name="argocluster-pool0" kind="AzureManagedMachinePool" x-ms-correlation-request-id="07d964de-46a9-424a-8caa-a300e0848848" ownerCluster="argocluster"
I0214 23:58:57.004553       1 azuremanagedcontrolplane_controller.go:234] "Reconciling AzureManagedControlPlane" logger="controllers.AzureManagedControlPlaneReconciler.reconcileNormal" controller="azuremanagedcontrolplane" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AzureManagedControlPlane" AzureManagedControlPlane="default/argocluster" namespace="default" name="argocluster" reconcileID="496335b7-c7c5-42d6-b1a7-4e99ef398eca" x-ms-correlation-request-id="cced4b6d-844f-46f0-8d73-a07c07b8d9ff"
E0214 23:58:57.004718       1 managedcontrolplane.go:427] "Unable to determine if ManagedControlPlaneScope VNET is managed by capz, assuming unmanaged" err="VirtualNetwork.network.azure.com \"argocluster\" not found" logger="scope.ManagedControlPlaneScope.IsVnetManaged" x-ms-correlation-request-id="28d3fddd-4e53-4226-ac43-d50c841734ff" AzureManagedCluster="argocluster"

@dtzar
Contributor Author

dtzar commented Feb 27, 2024

Unfortunately, I can't reproduce the problem right now - but will circle back when/if I do.

@nojnhuh
Contributor

nojnhuh commented Feb 27, 2024

It almost seems like there's something wonky in your Workload ID setup or your sub or something. I've never seen this kind of error in e2e or locally for me.

@dtzar
Contributor Author

dtzar commented Mar 14, 2024

I have a live repro of this now. I'm pretty sure you can reproduce this by doing the following:

  1. Let CAPZ create a cluster
  2. Power off the CAPZ management cluster
  3. Manually delete the AKS cluster resource (but leave everything else in the RG)
  4. Bring back online the CAPZ management cluster

Here are the logs I have right now from ASO (also using the 1.14.0 release).

I0314 21:49:12.626378       1 recorder.go:104] "msg"="Reason: ResourceNotFound, Severity: Warning, RetryClassification: RetrySlow, Cause: The Resource 'Microsoft.ContainerService/managedClusters/argocluster' under resource group 'argocluster' was not found. For more details please go to https://aka.ms/ARMResourceNotFoundFix: PUT https://management.azure.com/subscriptions/####/resourceGroups/argocluster/providers/Microsoft.ContainerService/managedClusters/argocluster/agentPools/pool1\n--------------------------------------------------------------------------------\nRESPONSE 404: 404 Not Found\nERROR CODE: ResourceNotFound\n

@nawazkh
Member

nawazkh commented Mar 15, 2024

#4609 may or may not be connected to this issue, but since both issues are around deletion, I am mentioning it here as well.

@dtzar dtzar added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. area/managedclusters Issues related to managed AKS clusters created through the CAPZ ManagedCluster Type labels Mar 25, 2024
@willie-yao
Contributor

Trying to fix this now; I followed the steps to reproduce and I'm getting a slightly different error:

[manager] E0327 00:36:47.016389       1 generic_reconciler.go:361] "msg"="Encountered error impacting Ready condition" "error"="Reason: ReconciliationBlocked, Severity: Warning, RetryClassification: RetrySlow, Cause: Managed cluster \"aks-6629\" is in provisioning state \"Deleting\"" "logger"="controllers.ManagedClustersAgentPoolController" "name"="aks-6629-pool1" "namespace"="default"

I think that might be because I didn't wait for the Azure resource to fully delete before restarting the management cluster. This should be caused by the same issue though. I think the last time we reproduced this together, we also started a delete locally before doing so on Azure.

@dtzar
Contributor Author

dtzar commented Mar 27, 2024

Yes, we did start the delete before powering the cluster off. So between steps 1 and 2, add "delete the CAPZ cluster object" and don't let the deletion finish.

Either case should be handled in some way, IMO. If you don't initiate deletion of the CAPZ cluster object and instead manually delete the cluster while the management cluster is off, CAPZ should re-create the cluster from the definition it still has when it powers back on...

@willie-yao
Contributor

Just confirmed that the cluster does delete after powering the management cluster off and on, even in the case where replicas is set to 0. The cluster deletion just takes a while (~25 minutes). This can be tracked in #4339.

/close

@k8s-ci-robot
Contributor

@willie-yao: Closing this issue.

In response to this:

Just confirmed that the cluster does delete after powering the management cluster off and on, even in the case where replicas is set to 0. The cluster deletion just takes a while (~25 minutes). This can be tracked in #4339.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
