
AzureManagedControlPlane - failed to reconcile kubeconfig secret, unable to retrieve the complete list of server APIs. #4738

Closed
itodorova1 opened this issue Apr 16, 2024 · 11 comments · Fixed by #4833
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@itodorova1

/kind bug


What steps did you take and what happened:
After a recent Cluster API upgrade from v1.11.4 to v1.14.2, we see the following error on multiple workload clusters:

E0416 11:49:50.130226       1 controller.go:329] "Reconciler error" err="error creating AzureManagedControlPlane xxx/xxx: failed to reconcile kubeconfig secret: failed to construct cluster-info: failed to reconcile certificate authority data secret for cluster: failed to get API group resources: unable to retrieve the complete list of server APIs: v1: Get \"https://xxx.hcp.westeurope.azmk8s.io:443/api/v1?timeout=10s\": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)" controller="azuremanagedcontrolplane" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AzureManagedControlPlane" AzureManagedControlPlane="xxx/xxx" namespace="xxx" name="xxx" reconcileID="X-X-X"

curl and telnet tests against the API server endpoints also fail.
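
For reference, the connectivity check was roughly along these lines (the hostname is the placeholder from the error above; run it from a machine or pod that shares the management cluster's egress path):

# Placeholder endpoint copied from the error above. Any HTTP response (even 401/403)
# would mean connectivity is fine; instead we hit the same timeout as the controller.
curl -vk --max-time 10 https://xxx.hcp.westeurope.azmk8s.io:443/version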

What did you expect to happen:
CAPI has no direct connection to the clusters' API servers; we have never explicitly established such connectivity in the past, yet reconciliation used to work.

Anything else you would like to add:

Is this expected behavior? We did not have this issue in the previous versions.

Environment:

  • cluster-api-provider-azure version: v1.14.2
  • Kubernetes version (kubectl version): 1.28.3
  • OS (e.g. from /etc/os-release): Ubuntu
@k8s-ci-robot added the kind/bug label on Apr 16, 2024
@willie-yao
Contributor

Are you using AAD Pod Identity or environment variables to set credentials? Those have been deprecated since v1.13.0. See the docs for more info on the identity changes: https://capz.sigs.k8s.io/topics/multitenancy

Looking at the logs, there may also be something wrong with the CNI on the workload cluster.
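
If it helps, a quick way to double-check which identity type each cluster references (hedged sketch, adjust as needed):

# List all AzureClusterIdentity resources and their configured identity type
kubectl get azureclusteridentity -A -o custom-columns=NAMESPACE:.metadata.namespace,NAME:.metadata.name,TYPE:.spec.type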

@itodorova1
Author

We are using AzureClusterIdentity with ServicePrincipal.
I believe ManualServicePrincipal is the deprecated one.
We also see these errors on all workload clusters, which is our main concern: something has changed and for some reason we have lost connectivity to the workload clusters' APIs.
We did not perform any changes that might have impacted the connection apart from the upgrade from v1.11.4 to v1.14.2.
We tested the cluster by whitelisting our outbound IP and the connection was restored (see the apiServerAccessProfile snippet at the end of this comment).
However, checking the capi-controller logs, we can see the following errors for all workload clusters:

E0418 11:38:17.178050       1 controller.go:329] "Reconciler error" err="failed to add &Node{ObjectMeta:{      0 0001-01-01 00:00:00 +0000 UTC <nil> <nil> map[] map[] [] [] []},Spec:NodeSpec{PodCIDR:,DoNotUseExternalID:,ProviderID:,Unschedulable:false,Taints:[]Taint{},ConfigSource:nil,PodCIDRs:[],},Status:NodeStatus{Capacity:ResourceList{},Allocatable:ResourceList{},Phase:,Conditions:[]NodeCondition{},Addresses:[]NodeAddress{},DaemonEndpoints:NodeDaemonEndpoints{KubeletEndpoint:DaemonEndpoint{Port:0,},},NodeInfo:NodeSystemInfo{MachineID:,SystemUUID:,BootID:,KernelVersion:,OSImage:,ContainerRuntimeVersion:,KubeletVersion:,KubeProxyVersion:,OperatingSystem:,Architecture:,},Images:[]ContainerImage{},VolumesInUse:[],VolumesAttached:[]AttachedVolume{},Config:nil,},} watch on cluster xxx/xxx: failed to create cluster accessor: error fetching REST client config for remote cluster \"xxx/xxx\": failed to retrieve kubeconfig secret for Cluster xxx/xxx: Secret \"xxx-kubeconfig\" not found" controller="machinepool" controllerGroup="cluster.x-k8s.io" controllerKind="MachinePool" MachinePool="xxx/xxxintds2v5" namespace="xxx" name="xxxintds2v5" reconcileID=""

However, when we run:

$ clusterctl describe cluster -n xxx xxx 
NAME                                                         READY  SEVERITY  REASON  SINCE  MESSAGE 
Cluster/xxx                                                 True                     2d3h            
├─ClusterInfrastructure - AzureManagedCluster/xxx                                              
├─ControlPlane - AzureManagedControlPlane/xxx               True                     2d3h            
└─Workers                                                                                             
  ├─MachinePool/xxxintds2v5                                 True                     35h             
  ├─MachinePool/xxxintv5                                    True                     35h             
  └─MachinePool/xxxintv5dyn                                 True                     2d17h  

Just to confirm our secrets configuration:

apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: AzureClusterIdentity
metadata:
  labels:
    clusterctl.cluster.x-k8s.io/move-hierarchy: "true"
  name: xxx
  namespace: xxx
spec:
  allowedNamespaces: {}
  clientSecret:
    name: xxx
    namespace: xxx
  type: ServicePrincipal
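
Regarding the whitelist test mentioned above: it was simply our management cluster's outbound IP added to the control plane's apiServerAccessProfile, roughly like this (x.x.x.y is a placeholder for our egress IP):

  apiServerAccessProfile:
    authorizedIPRanges:
    - x.x.x.x/32 # internal
    - x.x.x.y/32 # management cluster outbound IP, added temporarily for the test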

@willie-yao
Contributor

@itodorova1 Can you share your cluster template? It could be something unrelated to secrets. It may also help to create a thread in our Slack so we can follow up faster.

@itodorova1
Author

This is an example of one of our clusters:

apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata:
  name: xxx
  namespace: xxx
spec:
  paused: false
  clusterNetwork:
    services:
      cidrBlocks:
      - 192.168.0.0/16
  controlPlaneRef:
    apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
    kind: AzureManagedControlPlane
    name: xxx
  infrastructureRef:
    apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
    kind: AzureManagedCluster
    name: xxx
---
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: AzureManagedControlPlane
metadata:
  name: xxx
  namespace: xxx
spec:
  identityRef:
    apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
    kind: AzureClusterIdentity
    name: xxx-kubernetes-identity
  aadProfile:
    managed: true
    adminGroupObjectIDs:
    - zzz
    - zzz
  apiServerAccessProfile:
    authorizedIPRanges:
    - x.x.x.x/32 # internal
  location: westeurope
  resourceGroupName: xxx
  sshPublicKey: yyy    
  version: v1.29.2
---
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: AzureManagedCluster
metadata:
  name: xxx
  namespace: xxx

@willie-yao
Contributor

Just to confirm our secrets configuration:

Sorry for not catching this earlier, but I think you're missing the clientID and tenantID fields in the AzureClusterIdentity. These may have been set as environment variables in your previous configuration, but CAPZ removed support for auth via environment variables.
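
For reference, the Secret that clientSecret points to is typically created along these lines (names here are placeholders; CAPZ expects the service principal password under the clientSecret key):

# Hypothetical names; replace with your identity secret name and namespace
kubectl create secret generic xxx-kubernetes-identity \
  --from-literal=clientSecret='<service-principal-password>' \
  --namespace xxx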

@itodorova1
Author

Apologies, the complete AzureClusterIdentity template is:

apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: AzureClusterIdentity
metadata:
  labels:
    clusterctl.cluster.x-k8s.io/move-hierarchy: "true"
  name: xxx-kubernetes-identity
  namespace: xxx
spec:
  allowedNamespaces: {}
  clientID: xxx-yyy-zzz
  clientSecret:
    name: xxx-kubernetes-identity
    namespace: xxx
  tenantID: xxx-yyy-zzz
  type: ServicePrincipal

We have 5 workload clusters in total and the error persists for all of them:

E0424 08:32:36.096543       1 controller.go:329] "Reconciler error" err="failed to add &Node{ObjectMeta:{      0 0001-01-01 00:00:00 +0000 UTC <nil> <nil> map[] map[] [] [] []},Spec:NodeSpec{PodCIDR:,DoNotUseExternalID:,ProviderID:,Unschedulable:false,Taints:[]Taint{},ConfigSource:nil,PodCIDRs:[],},Status:NodeStatus{Capacity:ResourceList{},Allocatable:ResourceList{},Phase:,Conditions:[]NodeCondition{},Addresses:[]NodeAddress{},DaemonEndpoints:NodeDaemonEndpoints{KubeletEndpoint:DaemonEndpoint{Port:0,},},NodeInfo:NodeSystemInfo{MachineID:,SystemUUID:,BootID:,KernelVersion:,OSImage:,ContainerRuntimeVersion:,KubeletVersion:,KubeProxyVersion:,OperatingSystem:,Architecture:,},Images:[]ContainerImage{},VolumesInUse:[],VolumesAttached:[]AttachedVolume{},Config:nil,},} watch on cluster xxx-/xxx-: failed to create cluster accessor: error fetching REST client config for remote cluster \"xxx-/xxx-\": failed to retrieve kubeconfig secret for Cluster xxx-/xxx-: Secret \"xxx--kubeconfig\" not found" controller="machinepool" controllerGroup="cluster.x-k8s.io" controllerKind="MachinePool" MachinePool="xxx-/xxxintds2v5" namespace="xxx-" name="xxxintds2v5" reconcileID="e7b2133f-ad1e-45ad-816f-9d6048aa96a8"

We are working on reproducing the issue in a lab environment. Do you think there are any changes between v1.11.4 and v1.14.2 that might have impacted us? For now we do not see any reason for this behavior.

@itodorova1
Author

@willie-yao We were able to recreate the issue in a lab environment.
It seems that after we upgraded CAPI, the existing workload clusters' kubeconfig secrets are not labeled with the key-value pair cluster.x-k8s.io/cluster-name=${CLUSTER_NAME} by the control plane providers. The change was introduced in v1.5, so the CAPI managers' caches are probably failing because of the missing label.
We also tested creating a cluster with the latest CAPI version, and its kubeconfig secret is labeled by the control plane provider as expected.
We edited the lab kubeconfig secret manually and the issue seems to be resolved.
However, could you confirm that this is safe for a production environment, as we are not able to find another way for now?
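
For anyone else hitting this, the manual workaround boils down to adding the missing label to each existing kubeconfig secret, e.g.:

# Add the label that CAPI's cluster cache expects to the existing kubeconfig secret
kubectl label secret ${CLUSTER_NAME}-kubeconfig \
  cluster.x-k8s.io/cluster-name=${CLUSTER_NAME} \
  --namespace <cluster-namespace>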

@willie-yao
Contributor

@itodorova1 Good find! Are you saying that the key-value pair was removed in v1.5 and added back in v1.6? This does seem like unusual behavior, but editing the kubeconfig secret manually seems fine to me as a temporary fix. @jackfrancis @nojnhuh wdyt?

I'll try and reproduce this on my end and see if it's a CAPI issue. It could be worth creating an issue in CAPI as well.

@willie-yao
Contributor

willie-yao commented May 6, 2024

It seems like this doc PR in CAPI is related: kubernetes-sigs/cluster-api#9080

The wording in there suggests that adding the label to the kubeconfig is a recommended approach.

@nojnhuh
Contributor

nojnhuh commented May 6, 2024

I can see where CAPZ only creates new Secrets with the label but does not add it to existing ones, which I would call a bug.
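
A quick (hedged) way to spot affected clusters is to list secrets that are missing the label and filter for kubeconfigs:

# Secrets without the cluster-name label; affected kubeconfig secrets show up here
kubectl get secrets -A -l '!cluster.x-k8s.io/cluster-name' | grep -- -kubeconfig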

@willie-yao
Contributor

willie-yao commented May 6, 2024

Ah, I see where the bug is coming from. Will work on a fix! Thanks @nojnhuh
/assign
