
KCP is scaling continuously #3353

Closed · jan-est opened this issue Jul 16, 2020 · 7 comments · Fixed by #3356
Labels
kind/bug Categorizes issue or PR as related to a bug.
Comments

@jan-est
Contributor

jan-est commented Jul 16, 2020

What steps did you take and what happened:

We have been debugging a KCP problem in Metal3 for the last two days. We deploy KCP with one replica, and after the infrastructure is ready and our bare-metal node is provisioned, it stays up for a while. But then KCP scales up a second replica and deletes the first one. This stays in a loop: each time the new replica is ready, KCP starts to scale again.

What did you expect to happen:

KCP to be provisioned with a single replica.

Anything else you would like to add:

KCP controller logs are full of:

I0716 02:04:38.450184       1 controller.go:232] controllers/KubeadmControlPlane "msg"="Reconcile KubeadmControlPlane" "cluster"="test1" "kubeadmControlPlane"="test1" "namespace"="metal3" 
I0716 02:04:38.451160       1 controller.go:295] controllers/KubeadmControlPlane "msg"="Rolling out Control Plane machines" "cluster"="test1" "kubeadmControlPlane"="test1" "namespace"="metal3" 
E0716 10:21:24.809873       1 controller.go:248] controller-runtime/controller "msg"="Reconciler error" "error"="failed to reconcile the remote kubelet RBAC role: failed to determine if resource kube-system/kubeadm:kubelet-config-1.18 already exists: etcdserver: leader changed" "controller"="kubeadmcontrolplane" "name"="test1" "namespace"="metal3"
I0716 10:21:24.819917       1 controller.go:232] controllers/KubeadmControlPlane "msg"="Reconcile KubeadmControlPlane" "cluster"="test1" "kubeadmControlPlane"="test1" "namespace"="metal3" 
I0716 10:21:24.822555       1 controller.go:295] controllers/KubeadmControlPlane "msg"="Rolling out Control Plane machines" "cluster"="test1" "kubeadmControlPlane"="test1" "namespace"="metal3" 
I0716 10:21:25.959716       1 controller.go:232] controllers/KubeadmControlPlane "msg"="Reconcile KubeadmControlPlane" "cluster"="test1" "kubeadmControlPlane"="test1" "namespace"="metal3" 
I0716 10:21:25.960982       1 controller.go:295] controllers/KubeadmControlPlane "msg"="Rolling out Control Plane machines" "cluster"="test1" "kubeadmControlPlane"="test1" "namespace"="metal3" 
E0716 10:22:28.784067       1 controller.go:188] controllers/KubeadmControlPlane "msg"="Failed to update KubeadmControlPlane Status" "error"="Get https://192.168.111.249:6443/api/v1/nodes?labelSelector=node-role.kubernetes.io%2Fmaster%3D\u0026timeout=30s: net/http: request canceled (Client.Timeout exceeded while awaiting headers)" "cluster"="test1" "kubeadmControlPlane"="test1" "namespace"="metal3" 
E0716 10:22:28.821262       1 controller.go:248] controller-runtime/controller "msg"="Reconciler error" "error"="Get https://192.168.111.249:6443/api/v1/nodes?labelSelector=node-role.kubernetes.io%2Fmaster%3D\u0026timeout=30s: net/http: request canceled (Client.Timeout exceeded while awaiting headers)" "controller"="kubeadmcontrolplane" "name"="test1" "namespace"="metal3"

The problem is visible in the KCP status as well:

status:
    conditions:
    - lastTransitionTime: "2020-07-16T08:27:06Z"
      message: Rolling 2 replicas with outdated spec (0 replicas up to date)
      reason: RollingUpdateInProgress
      severity: Warning
      status: "False"
      type: Ready
    - lastTransitionTime: "2020-07-16T08:27:03Z"
      status: "True"
      type: Available
    - lastTransitionTime: "2020-07-16T08:17:17Z"
      status: "True"
      type: CertificatesAvailable
    - lastTransitionTime: "2020-07-16T08:27:35Z"
      message: 1 of 2 completed
      reason: WaitingForInfrastructure@Machine/test1-6jsrb
      severity: Info
      status: "False"
      type: MachinesReady
    - lastTransitionTime: "2020-07-16T08:27:06Z"
      message: Rolling 2 replicas with outdated spec (0 replicas up to date)
      reason: RollingUpdateInProgress
      severity: Warning
      status: "False"
      type: MachinesSpecUpToDate
    - lastTransitionTime: "2020-07-16T08:27:06Z"
      message: Scaling down to 1 replicas (actual 2)
      reason: ScalingDown
      severity: Warning
      status: "False"
      type: Resized
    initialized: true
    observedGeneration: 1
    ready: true
    readyReplicas: 1
    replicas: 2
    selector: cluster.x-k8s.io/cluster-name=test1,cluster.x-k8s.io/control-plane
    unavailableReplicas: 1

Environment:

  • metal3-dev-env
  • Kind cluster

/kind bug

@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Jul 16, 2020
@smoshiur1237

Adding some more data to this scenario for better understanding. Even though the KCP replica count is set to 1, KCP provisions a new control plane; once it is provisioned, KCP starts deleting the first control plane, and while that one is being deleted, KCP starts provisioning yet another control plane.

$ kubectl get bmh -n metal3
NAME     STATUS   PROVISIONING STATUS   CONSUMER                   BMC                         HARDWARE PROFILE   ONLINE   ERROR
node-0   OK       provisioning          test1-controlplane-9qw49   ipmi://192.168.111.1:6230   unknown            true     
node-1   OK       ready                                            ipmi://192.168.111.1:6231   unknown            false    
node-2   OK       ready                                            ipmi://192.168.111.1:6232   unknown            false    
node-3   OK       provisioned           test1-controlplane-fr7xh   ipmi://192.168.111.1:6233   unknown            true     

$ kubectl get bmh -n metal3
NAME     STATUS   PROVISIONING STATUS   CONSUMER                   BMC                         HARDWARE PROFILE   ONLINE   ERROR
node-0   OK       provisioned           test1-controlplane-9qw49   ipmi://192.168.111.1:6230   unknown            true     
node-1   OK       provisioned           test1-controlplane-nvdxb   ipmi://192.168.111.1:6231   unknown            true     
node-2   OK       ready                                            ipmi://192.168.111.1:6232   unknown            false    
node-3   OK       ready                                            ipmi://192.168.111.1:6233   unknown            false    

$ kubectl get bmh -n metal3
NAME     STATUS   PROVISIONING STATUS   CONSUMER                   BMC                         HARDWARE PROFILE   ONLINE   ERROR
node-0   OK       deprovisioning                                   ipmi://192.168.111.1:6230   unknown            false    
node-1   OK       provisioned           test1-controlplane-nvdxb   ipmi://192.168.111.1:6231   unknown            true     
node-2   OK       provisioning          test1-controlplane-9qk7b   ipmi://192.168.111.1:6232   unknown            true     
node-3   OK       ready       

@fabriziopandini
Member

@smoshiur1237 is it possible to get the output of kubectl get kcp and kubectl get machines together with the output of kubectl get bmh (I guess it is the infrastructure machine)?
Also, what command are you issuing to make "KCP starts to scale up"?
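
For reference, a sketch of how those outputs could be collected (the kubeadmcontrolplane resource name and -o wide flag are assumptions here; adjust the namespace as needed):

$ kubectl get kubeadmcontrolplane -n metal3 -o wide
$ kubectl get machines -n metal3 -o wide
$ kubectl get bmh -n metal3 -o wide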

@maelk
Contributor

maelk commented Jul 16, 2020

bmh is not the actual infrastructure object; that would be the Metal3Machine. BareMetalHost is one level below it. We do not run any command to trigger a rollout of KCP; the rollout starts as soon as we apply the manifests.

@detiber
Member

detiber commented Jul 16, 2020

I think it would still be helpful to see the output of the KCP resource, Machine resources, and infrastructure machine resources, to get a better idea of what could be causing the controller to become confused and repeatedly trigger a rolling upgrade.

@maelk
Contributor

maelk commented Jul 16, 2020

An example of the resources is here: https://kubernetes.slack.com/files/UF98WRP8R/F0177HNULUS/rollout-debug.yaml

@benmoss

benmoss commented Jul 16, 2020

Follow-up from Slack: we think this is the result of the KCP not having a clusterConfiguration set. The controller compares a cached version of the clusterConfiguration, stored in a Machine annotation, with the current clusterConfiguration. In this case the Machine has "null" as the value of the annotation, since the clusterConfiguration was nil. The theory is that this gets unmarshaled and compared with nil, and the comparison returns false.

We want to handle clusterConfiguration being nil, so this should be fixed.
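
A minimal, self-contained Go sketch of that theory (the ClusterConfiguration stand-in type is hypothetical; this is not the controller's actual code):

package main

import (
	"encoding/json"
	"fmt"
	"reflect"
)

// Hypothetical stand-in for the kubeadm ClusterConfiguration type.
type ClusterConfiguration struct {
	ClusterName string `json:"clusterName,omitempty"`
}

func main() {
	// The KCP spec has no clusterConfiguration, so the pointer is nil...
	var specConfig *ClusterConfiguration

	// ...and marshaling a nil pointer produces the literal "null",
	// which is what ends up in the Machine annotation.
	annotation, _ := json.Marshal(specConfig)
	fmt.Println(string(annotation)) // null

	// On the next reconcile the annotation is unmarshaled back.
	// Unmarshaling "null" into an already-allocated struct is a no-op,
	// so the result is a non-nil zero value rather than nil.
	machineConfig := &ClusterConfiguration{}
	_ = json.Unmarshal(annotation, machineConfig)

	// nil spec vs. non-nil zero value: not equal, so every reconcile
	// looks like an outdated spec and triggers another rollout.
	fmt.Println(reflect.DeepEqual(specConfig, machineConfig)) // false
}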

@fabriziopandini
Member

Fix in flight; possible workaround: set clusterConfiguration: {} in the KCP manifest, as sketched below.
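
A sketch of that workaround (assuming the v1alpha3 KubeadmControlPlane API; every field besides clusterConfiguration is illustrative):

apiVersion: controlplane.cluster.x-k8s.io/v1alpha3
kind: KubeadmControlPlane
metadata:
  name: test1
  namespace: metal3
spec:
  replicas: 1
  version: v1.18.0
  infrastructureTemplate:
    apiVersion: infrastructure.cluster.x-k8s.io/v1alpha3
    kind: Metal3MachineTemplate
    name: test1-controlplane
  kubeadmConfigSpec:
    clusterConfiguration: {}  # explicitly empty, so the cached annotation becomes "{}" instead of "null"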

@ncdc ncdc added this to the v0.3.8 milestone Jul 16, 2020