
AzureManagedControlPlane - failed to reconcile kubeconfig secret, unable to retrieve the complete list of server APIs. #4738

Closed
itodorova1 opened this issue Apr 16, 2024 · 11 comments · Fixed by #4833
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@itodorova1

/kind bug


What steps did you take and what happened:
After a recent Cluster API upgrade from v1.11.4 to v1.14.2, we see the following error on multiple workload clusters:

E0416 11:49:50.130226       1 controller.go:329] "Reconciler error" err="error creating AzureManagedControlPlane xxx/xxx: failed to reconcile kubeconfig secret: failed to construct cluster-info: failed to reconcile certificate authority data secret for cluster: failed to get API group resources: unable to retrieve the complete list of server APIs: v1: Get \"https://xxx.hcp.westeurope.azmk8s.io:443/api/v1?timeout=10s\": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)" controller="azuremanagedcontrolplane" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AzureManagedControlPlane" AzureManagedControlPlane="xxx/xxx" namespace="xxx" name="xxx" reconcileID="X-X-X"

curl and telnet tests against the API server endpoints also fail.
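
For reference, the connectivity check was roughly along these lines (the hostname is the placeholder from the error above; run it from a machine or pod that shares the management cluster's egress path):

# Placeholder endpoint copied from the error above. Any HTTP response (even 401/403)
# would mean connectivity is fine; instead we hit the same timeout as the controller.
curl -vk --max-time 10 https://xxx.hcp.westeurope.azmk8s.io:443/version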

What did you expect to happen:
CAPI has no direct connection to the clusters' API servers; we have never explicitly established such connectivity in the past, yet reconciliation used to work.

Anything else you would like to add:

Is this expected behavior? We did not have this issue in the previous versions.

Environment:

  • cluster-api-provider-azure version: v1.14.2
  • Kubernetes version (kubectl version): 1.28.3
  • OS (e.g. from /etc/os-release): Ubuntu
@k8s-ci-robot added the kind/bug label on Apr 16, 2024
@willie-yao
Contributor

Are you using AAD Pod Identity or environment variables to set credentials? Those have been deprecated since v1.13.0. See the docs for more info on the identity changes: https://capz.sigs.k8s.io/topics/multitenancy

Looking at the logs, there may also be something wrong with the CNI on the workload cluster.
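
If it helps, a quick way to double-check which identity type each cluster references (hedged sketch, adjust as needed):

# List all AzureClusterIdentity resources and their configured identity type
kubectl get azureclusteridentity -A -o custom-columns=NAMESPACE:.metadata.namespace,NAME:.metadata.name,TYPE:.spec.type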

@itodorova1
Author

We are using AzureClusterIdentity with ServicePrincipal.
I believe ManualServicePrincipal is the deprecated one.
We also see these errors on all workload clusters, which is our main concern: something has changed and for some reason we have lost connectivity to the workload clusters' APIs.
We did not perform any changes that might have impacted the connection apart from the upgrade from v1.11.4 to v1.14.2.
We tested the cluster by whitelisting our outbound IP and the connection was restored (see the apiServerAccessProfile snippet at the end of this comment).
However, checking the capi-controller logs, we can see the following errors for all workload clusters:

E0418 11:38:17.178050       1 controller.go:329] "Reconciler error" err="failed to add &Node{ObjectMeta:{      0 0001-01-01 00:00:00 +0000 UTC <nil> <nil> map[] map[] [] [] []},Spec:NodeSpec{PodCIDR:,DoNotUseExternalID:,ProviderID:,Unschedulable:false,Taints:[]Taint{},ConfigSource:nil,PodCIDRs:[],},Status:NodeStatus{Capacity:ResourceList{},Allocatable:ResourceList{},Phase:,Conditions:[]NodeCondition{},Addresses:[]NodeAddress{},DaemonEndpoints:NodeDaemonEndpoints{KubeletEndpoint:DaemonEndpoint{Port:0,},},NodeInfo:NodeSystemInfo{MachineID:,SystemUUID:,BootID:,KernelVersion:,OSImage:,ContainerRuntimeVersion:,KubeletVersion:,KubeProxyVersion:,OperatingSystem:,Architecture:,},Images:[]ContainerImage{},VolumesInUse:[],VolumesAttached:[]AttachedVolume{},Config:nil,},} watch on cluster xxx/xxx: failed to create cluster accessor: error fetching REST client config for remote cluster \"xxx/xxx\": failed to retrieve kubeconfig secret for Cluster xxx/xxx: Secret \"xxx-kubeconfig\" not found" controller="machinepool" controllerGroup="cluster.x-k8s.io" controllerKind="MachinePool" MachinePool="xxx/xxxintds2v5" namespace="xxx" name="xxxintds2v5" reconcileID=""

However, when we run:

$ clusterctl describe cluster -n xxx xxx 
NAME                                                         READY  SEVERITY  REASON  SINCE  MESSAGE 
Cluster/xxx                                                 True                     2d3h            
├─ClusterInfrastructure - AzureManagedCluster/xxx                                              
├─ControlPlane - AzureManagedControlPlane/xxx               True                     2d3h            
└─Workers                                                                                             
  ├─MachinePool/xxxintds2v5                                 True                     35h             
  ├─MachinePool/xxxintv5                                    True                     35h             
  └─MachinePool/xxxintv5dyn                                 True                     2d17h  

Just to confirm our secrets configuration:

apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: AzureClusterIdentity
metadata:
  labels:
    clusterctl.cluster.x-k8s.io/move-hierarchy: "true"
  name: xxx
  namespace: xxx
spec:
  allowedNamespaces: {}
  clientSecret:
    name: xxx
    namespace: xxx
  type: ServicePrincipal
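
Regarding the whitelist test mentioned above: it was simply our management cluster's outbound IP added to the control plane's apiServerAccessProfile, roughly like this (x.x.x.y is a placeholder for our egress IP):

  apiServerAccessProfile:
    authorizedIPRanges:
    - x.x.x.x/32 # internal
    - x.x.x.y/32 # management cluster outbound IP, added temporarily for the test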

@willie-yao
Contributor

@itodorova1 Can you share your cluster template? It could be something unrelated to secrets. It may also help to create a thread in our Slack so we can follow up faster.

@itodorova1
Author

This is an example of one of our clusters:

apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata:
  name: xxx
  namespace: xxx
spec:
  paused: false
  clusterNetwork:
    services:
      cidrBlocks:
      - 192.168.0.0/16
  controlPlaneRef:
    apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
    kind: AzureManagedControlPlane
    name: xxx
  infrastructureRef:
    apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
    kind: AzureManagedCluster
    name: xxx
---
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: AzureManagedControlPlane
metadata:
  name: xxx
  namespace: xxx
spec:
  identityRef:
    apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
    kind: AzureClusterIdentity
    name: xxx-kubernetes-identity
  aadProfile:
    managed: true
    adminGroupObjectIDs:
    - zzz
    - zzz
  apiServerAccessProfile:
    authorizedIPRanges:
    - x.x.x.x/32 # internal
  location: westeurope
  resourceGroupName: xxx
  sshPublicKey: yyy    
  version: v1.29.2
---
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: AzureManagedCluster
metadata:
  name: xxx
  namespace: xxx

@willie-yao
Contributor

Just to confirm our secrets configuration:

Sorry for not catching this earlier, but I think you're missing the clientID and tenantID fields in the AzureClusterIdentity. These may have been set as environment variables in your previous configuration, but CAPZ removed support for auth via environment variables.
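
For reference, the Secret that clientSecret points to is typically created along these lines (names here are placeholders; CAPZ expects the service principal password under the clientSecret key):

# Hypothetical names; replace with your identity secret name and namespace
kubectl create secret generic xxx-kubernetes-identity \
  --from-literal=clientSecret='<service-principal-password>' \
  --namespace xxx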

@itodorova1
Author

Apologies, the complete AzureClusterIdentity template is:

apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: AzureClusterIdentity
metadata:
  labels:
    clusterctl.cluster.x-k8s.io/move-hierarchy: "true"
  name: xxx-kubernetes-identity
  namespace: xxx
spec:
  allowedNamespaces: {}
  clientID: xxx-yyy-zzz
  clientSecret:
    name: xxx-kubernetes-identity
    namespace: xxx
  tenantID: xxx-yyy-zzz
  type: ServicePrincipal

We have 5 workload clusters in total and the error persists for all of them:

E0424 08:32:36.096543       1 controller.go:329] "Reconciler error" err="failed to add &Node{ObjectMeta:{      0 0001-01-01 00:00:00 +0000 UTC <nil> <nil> map[] map[] [] [] []},Spec:NodeSpec{PodCIDR:,DoNotUseExternalID:,ProviderID:,Unschedulable:false,Taints:[]Taint{},ConfigSource:nil,PodCIDRs:[],},Status:NodeStatus{Capacity:ResourceList{},Allocatable:ResourceList{},Phase:,Conditions:[]NodeCondition{},Addresses:[]NodeAddress{},DaemonEndpoints:NodeDaemonEndpoints{KubeletEndpoint:DaemonEndpoint{Port:0,},},NodeInfo:NodeSystemInfo{MachineID:,SystemUUID:,BootID:,KernelVersion:,OSImage:,ContainerRuntimeVersion:,KubeletVersion:,KubeProxyVersion:,OperatingSystem:,Architecture:,},Images:[]ContainerImage{},VolumesInUse:[],VolumesAttached:[]AttachedVolume{},Config:nil,},} watch on cluster xxx-/xxx-: failed to create cluster accessor: error fetching REST client config for remote cluster \"xxx-/xxx-\": failed to retrieve kubeconfig secret for Cluster xxx-/xxx-: Secret \"xxx--kubeconfig\" not found" controller="machinepool" controllerGroup="cluster.x-k8s.io" controllerKind="MachinePool" MachinePool="xxx-/xxxintds2v5" namespace="xxx-" name="xxxintds2v5" reconcileID="e7b2133f-ad1e-45ad-816f-9d6048aa96a8"

We are working on reproducing the issue in a lab environment. Do you think there are any changes between v1.11.4 and v1.14.2 that might have impacted us? For now we do not see any reason for this behavior.

@itodorova1
Author

@willie-yao We were able to recreate the issue in a lab environment.
It seems that after we upgraded CAPI, the existing workload clusters' kubeconfig secrets are not labeled with the key-value pair cluster.x-k8s.io/cluster-name=${CLUSTER_NAME} by the control plane providers. The change was introduced in v1.5, so the CAPI managers' caches are probably failing because of the missing label.
We also tested creating a cluster with the latest CAPI version, and its kubeconfig secret is labeled by the control plane provider as expected.
We edited the lab kubeconfig secret manually and the issue seems to be resolved.
However, could you confirm that this is safe for a production environment, as we are not able to find another way for now?
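
For anyone else hitting this, the manual workaround boils down to adding the missing label to each existing kubeconfig secret, e.g.:

# Add the label that CAPI's cluster cache expects to the existing kubeconfig secret
kubectl label secret ${CLUSTER_NAME}-kubeconfig \
  cluster.x-k8s.io/cluster-name=${CLUSTER_NAME} \
  --namespace <cluster-namespace>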

@willie-yao
Contributor

@itodorova1 Good find! Are you saying that the key-value pair was removed in v1.5 and added back in v1.6? This does seem like unusual behavior, but editing the kubeconfig secret manually seems fine to me as a temporary fix. @jackfrancis @nojnhuh wdyt?

I'll try and reproduce this on my end and see if it's a CAPI issue. It could be worth creating an issue in CAPI as well.

@willie-yao
Contributor

willie-yao commented May 6, 2024

It seems like this doc PR in CAPI is related: kubernetes-sigs/cluster-api#9080

The wording in there suggests that adding the label to the kubeconfig is a recommended approach.

@nojnhuh
Contributor

nojnhuh commented May 6, 2024

I can see where CAPZ only creates new Secrets with the label but does not add it to existing ones, which I would call a bug.
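
A quick (hedged) way to spot affected clusters is to list secrets that are missing the label and filter for kubeconfigs:

# Secrets without the cluster-name label; affected kubeconfig secrets show up here
kubectl get secrets -A -l '!cluster.x-k8s.io/cluster-name' | grep -- -kubeconfig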

@willie-yao
Contributor

willie-yao commented May 6, 2024

Ah, I see where the bug is coming from. Will work on a fix! Thanks @nojnhuh
/assign
