
KCP is scaling continuously #3353

Closed · jan-est opened this issue Jul 16, 2020 · 7 comments · Fixed by #3356
Labels
kind/bug Categorizes issue or PR as related to a bug.
Comments

@jan-est
Contributor

jan-est commented Jul 16, 2020

What steps did you take and what happened:

We have been debugging a KCP problem in Metal3 for the last two days. We deploy KCP with one replica, and after the infrastructure is ready and our bare-metal node is provisioned, it stays up for a while. But then KCP scales up a second replica and deletes the first one. This stays in a loop: each time the new replica is ready, KCP starts to scale again.

What did you expect to happen:

KCP to be provisioned with a single replica.

Anything else you would like to add:

KCP controller logs are full of:

I0716 02:04:38.450184       1 controller.go:232] controllers/KubeadmControlPlane "msg"="Reconcile KubeadmControlPlane" "cluster"="test1" "kubeadmControlPlane"="test1" "namespace"="metal3" 
I0716 02:04:38.451160       1 controller.go:295] controllers/KubeadmControlPlane "msg"="Rolling out Control Plane machines" "cluster"="test1" "kubeadmControlPlane"="test1" "namespace"="metal3" 
E0716 10:21:24.809873       1 controller.go:248] controller-runtime/controller "msg"="Reconciler error" "error"="failed to reconcile the remote kubelet RBAC role: failed to determine if resource kube-system/kubeadm:kubelet-config-1.18 already exists: etcdserver: leader changed" "controller"="kubeadmcontrolplane" "name"="test1" "namespace"="metal3"
I0716 10:21:24.819917       1 controller.go:232] controllers/KubeadmControlPlane "msg"="Reconcile KubeadmControlPlane" "cluster"="test1" "kubeadmControlPlane"="test1" "namespace"="metal3" 
I0716 10:21:24.822555       1 controller.go:295] controllers/KubeadmControlPlane "msg"="Rolling out Control Plane machines" "cluster"="test1" "kubeadmControlPlane"="test1" "namespace"="metal3" 
I0716 10:21:25.959716       1 controller.go:232] controllers/KubeadmControlPlane "msg"="Reconcile KubeadmControlPlane" "cluster"="test1" "kubeadmControlPlane"="test1" "namespace"="metal3" 
I0716 10:21:25.960982       1 controller.go:295] controllers/KubeadmControlPlane "msg"="Rolling out Control Plane machines" "cluster"="test1" "kubeadmControlPlane"="test1" "namespace"="metal3" 
E0716 10:22:28.784067       1 controller.go:188] controllers/KubeadmControlPlane "msg"="Failed to update KubeadmControlPlane Status" "error"="Get https://192.168.111.249:6443/api/v1/nodes?labelSelector=node-role.kubernetes.io%2Fmaster%3D\u0026timeout=30s: net/http: request canceled (Client.Timeout exceeded while awaiting headers)" "cluster"="test1" "kubeadmControlPlane"="test1" "namespace"="metal3" 
E0716 10:22:28.821262       1 controller.go:248] controller-runtime/controller "msg"="Reconciler error" "error"="Get https://192.168.111.249:6443/api/v1/nodes?labelSelector=node-role.kubernetes.io%2Fmaster%3D\u0026timeout=30s: net/http: request canceled (Client.Timeout exceeded while awaiting headers)" "controller"="kubeadmcontrolplane" "name"="test1" "namespace"="metal3"

The problem is visible in the KCP status as well:

status:
    conditions:
    - lastTransitionTime: "2020-07-16T08:27:06Z"
      message: Rolling 2 replicas with outdated spec (0 replicas up to date)
      reason: RollingUpdateInProgress
      severity: Warning
      status: "False"
      type: Ready
    - lastTransitionTime: "2020-07-16T08:27:03Z"
      status: "True"
      type: Available
    - lastTransitionTime: "2020-07-16T08:17:17Z"
      status: "True"
      type: CertificatesAvailable
    - lastTransitionTime: "2020-07-16T08:27:35Z"
      message: 1 of 2 completed
      reason: WaitingForInfrastructure@Machine/test1-6jsrb
      severity: Info
      status: "False"
      type: MachinesReady
    - lastTransitionTime: "2020-07-16T08:27:06Z"
      message: Rolling 2 replicas with outdated spec (0 replicas up to date)
      reason: RollingUpdateInProgress
      severity: Warning
      status: "False"
      type: MachinesSpecUpToDate
    - lastTransitionTime: "2020-07-16T08:27:06Z"
      message: Scaling down to 1 replicas (actual 2)
      reason: ScalingDown
      severity: Warning
      status: "False"
      type: Resized
    initialized: true
    observedGeneration: 1
    ready: true
    readyReplicas: 1
    replicas: 2
    selector: cluster.x-k8s.io/cluster-name=test1,cluster.x-k8s.io/control-plane
    unavailableReplicas: 1

Environment:

  • metal3-dev-env
  • Kind cluster

/kind bug

@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Jul 16, 2020
@smoshiur1237

Adding some more data to this scenario for better understanding. Even though the KCP replica count is set to 1, KCP provisions a new control plane; once it is provisioned, KCP starts deleting the first control plane, and while that one is being deleted, KCP starts provisioning yet another control plane.

$ kubectl get bmh -n metal3
NAME     STATUS   PROVISIONING STATUS   CONSUMER                   BMC                         HARDWARE PROFILE   ONLINE   ERROR
node-0   OK       provisioning          test1-controlplane-9qw49   ipmi://192.168.111.1:6230   unknown            true     
node-1   OK       ready                                            ipmi://192.168.111.1:6231   unknown            false    
node-2   OK       ready                                            ipmi://192.168.111.1:6232   unknown            false    
node-3   OK       provisioned           test1-controlplane-fr7xh   ipmi://192.168.111.1:6233   unknown            true     

$ kubectl get bmh -n metal3
NAME     STATUS   PROVISIONING STATUS   CONSUMER                   BMC                         HARDWARE PROFILE   ONLINE   ERROR
node-0   OK       provisioned           test1-controlplane-9qw49   ipmi://192.168.111.1:6230   unknown            true     
node-1   OK       provisioned           test1-controlplane-nvdxb   ipmi://192.168.111.1:6231   unknown            true     
node-2   OK       ready                                            ipmi://192.168.111.1:6232   unknown            false    
node-3   OK       ready                                            ipmi://192.168.111.1:6233   unknown            false    

$ kubectl get bmh -n metal3
NAME     STATUS   PROVISIONING STATUS   CONSUMER                   BMC                         HARDWARE PROFILE   ONLINE   ERROR
node-0   OK       deprovisioning                                   ipmi://192.168.111.1:6230   unknown            false    
node-1   OK       provisioned           test1-controlplane-nvdxb   ipmi://192.168.111.1:6231   unknown            true     
node-2   OK       provisioning          test1-controlplane-9qk7b   ipmi://192.168.111.1:6232   unknown            true     
node-3   OK       ready       

@fabriziopandini
Member

@smoshiur1237 is it possible to get the output of kubectl get kcp and kubectl get machines together with the output of kubectl get bmh (I guess it is the infrastructure machine)?
Also, what command are you issuing to make "KCP starts to scale up"?
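
For reference, a sketch of how those outputs could be collected (the kubeadmcontrolplane resource name and -o wide flag are assumptions here; adjust the namespace as needed):

$ kubectl get kubeadmcontrolplane -n metal3 -o wide
$ kubectl get machines -n metal3 -o wide
$ kubectl get bmh -n metal3 -o wide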

@maelk
Contributor

maelk commented Jul 16, 2020

bmh is not the actual infrastructure object; that would be the Metal3Machine. BareMetalHost is one level below it. We do not run any command to trigger a rollout of KCP; the rollout starts as soon as we apply the manifests.

@detiber
Member

detiber commented Jul 16, 2020

I think it would still be helpful to see the output of the KCP resource, Machine resources, and infrastructure machine resources, to get a better idea of what could be causing the controller to become confused and repeatedly trigger a rolling upgrade.

@maelk
Contributor

maelk commented Jul 16, 2020

An example of the resources is here: https://kubernetes.slack.com/files/UF98WRP8R/F0177HNULUS/rollout-debug.yaml

@benmoss

benmoss commented Jul 16, 2020

Follow-up from Slack: we think this is the result of the KCP not having a clusterConfiguration set. The controller compares a cached version of the clusterConfiguration, stored in a Machine annotation, with the current clusterConfiguration. In this case the Machine has "null" as the value of the annotation, since the clusterConfiguration was nil. The theory is that this gets unmarshaled and compared with nil, and the comparison returns false.

We want to handle clusterConfiguration being nil, so this should be fixed.
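
A minimal, self-contained Go sketch of that theory (the ClusterConfiguration stand-in type is hypothetical; this is not the controller's actual code):

package main

import (
	"encoding/json"
	"fmt"
	"reflect"
)

// Hypothetical stand-in for the kubeadm ClusterConfiguration type.
type ClusterConfiguration struct {
	ClusterName string `json:"clusterName,omitempty"`
}

func main() {
	// The KCP spec has no clusterConfiguration, so the pointer is nil...
	var specConfig *ClusterConfiguration

	// ...and marshaling a nil pointer produces the literal "null",
	// which is what ends up in the Machine annotation.
	annotation, _ := json.Marshal(specConfig)
	fmt.Println(string(annotation)) // null

	// On the next reconcile the annotation is unmarshaled back.
	// Unmarshaling "null" into an already-allocated struct is a no-op,
	// so the result is a non-nil zero value rather than nil.
	machineConfig := &ClusterConfiguration{}
	_ = json.Unmarshal(annotation, machineConfig)

	// nil spec vs. non-nil zero value: not equal, so every reconcile
	// looks like an outdated spec and triggers another rollout.
	fmt.Println(reflect.DeepEqual(specConfig, machineConfig)) // false
}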

@fabriziopandini
Member

Fix in flight; possible workaround: set clusterConfiguration: {} in the KCP manifest, as sketched below.
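
A sketch of that workaround (assuming the v1alpha3 KubeadmControlPlane API; every field besides clusterConfiguration is illustrative):

apiVersion: controlplane.cluster.x-k8s.io/v1alpha3
kind: KubeadmControlPlane
metadata:
  name: test1
  namespace: metal3
spec:
  replicas: 1
  version: v1.18.0
  infrastructureTemplate:
    apiVersion: infrastructure.cluster.x-k8s.io/v1alpha3
    kind: Metal3MachineTemplate
    name: test1-controlplane
  kubeadmConfigSpec:
    clusterConfiguration: {}  # explicitly empty, so the cached annotation becomes "{}" instead of "null"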

@ncdc ncdc added this to the v0.3.8 milestone Jul 16, 2020