[KCP] Fails to upgrade single node control plane #2915

Closed
sedefsavas opened this issue Apr 15, 2020 · 26 comments · Fixed by #2958
Labels
area/control-plane Issues or PRs related to control-plane lifecycle management kind/bug Categorizes issue or PR as related to a bug. priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now.
Milestone
v0.3.4

Comments

@sedefsavas

What steps did you take and what happened:
On a single-node control plane, I upgraded the Kubernetes version and the etcd and CoreDNS image tags by modifying the KCP object. A new node with the upgraded Kubernetes version is created and the old node is physically deleted, but the old Node object and all pods that were on the old node are left dangling.

root@cluster-tvnb5a-cluster-tvnb5a-control-plane-cdrtr:/# kubectl get nodes -A
NAME                                                  STATUS     ROLES    AGE     VERSION
cluster-tvnb5a-cluster-tvnb5a-control-plane-9jqkj     NotReady   master   7m11s   v1.17.0
cluster-tvnb5a-cluster-tvnb5a-control-plane-cdrtr     Ready      master   6m20s   v1.17.2
cluster-tvnb5a-cluster-tvnb5a-md-0-66496845f6-kvh9v   Ready      <none>   6m51s   v1.17.0
root@cluster-tvnb5a-cluster-tvnb5a-control-plane-cdrtr:/# kubectl get pods -A
NAMESPACE     NAME                                                                        READY   STATUS             RESTARTS   AGE
kube-system   coredns-6955765f44-dpwvr                                                    1/1     Terminating        0          7m12s
kube-system   coredns-6955765f44-g4df6                                                    1/1     Terminating        0          7m12s
kube-system   coredns-7987b8d68f-f2d78                                                    1/1     Running            0          4m24s
kube-system   coredns-7987b8d68f-rftvl                                                    1/1     Running            0          4m23s
kube-system   etcd-cluster-tvnb5a-cluster-tvnb5a-control-plane-9jqkj                      0/1     CrashLoopBackOff   3          7m16s
kube-system   etcd-cluster-tvnb5a-cluster-tvnb5a-control-plane-cdrtr                      1/1     Running            0          6m19s
kube-system   kindnet-s6rvq                                                               1/1     Running            0          7m8s
kube-system   kindnet-wmxqx                                                               1/1     Running            0          6m37s
kube-system   kindnet-xvdks                                                               1/1     Running            0          7m12s
kube-system   kube-apiserver-cluster-tvnb5a-cluster-tvnb5a-control-plane-9jqkj            1/1     Running            0          7m16s
kube-system   kube-apiserver-cluster-tvnb5a-cluster-tvnb5a-control-plane-cdrtr            1/1     Running            0          6m36s
kube-system   kube-controller-manager-cluster-tvnb5a-cluster-tvnb5a-control-plane-9jqkj   1/1     Running            1          7m16s
kube-system   kube-controller-manager-cluster-tvnb5a-cluster-tvnb5a-control-plane-cdrtr   1/1     Running            0          6m35s
kube-system   kube-proxy-gfghs                                                            1/1     Running            0          4m22s
kube-system   kube-proxy-lj9lv                                                            1/1     Terminating        0          7m12s
kube-system   kube-proxy-qlgpw                                                            1/1     Running            0          3m57s
kube-system   kube-scheduler-cluster-tvnb5a-cluster-tvnb5a-control-plane-9jqkj            1/1     Running            1          7m16s
kube-system   kube-scheduler-cluster-tvnb5a-cluster-tvnb5a-control-plane-cdrtr            1/1     Running            0          6m35s
root@cluster-tvnb5a-cluster-tvnb5a-control-plane-cdrtr:/#

What did you expect to happen:
I expected KCP to remove the Node object and clean up the resources that were on that node.

/kind bug
/area control-plane

@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. area/control-plane Issues or PRs related to control-plane lifecycle management labels Apr 15, 2020
@vincepri
Member

This seems like it needs to be investigated.
/priority critical-urgent
/milestone v0.3.4

@k8s-ci-robot k8s-ci-robot added this to the v0.3.4 milestone Apr 15, 2020
@k8s-ci-robot k8s-ci-robot added the priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. label Apr 15, 2020
@benmoss

benmoss commented Apr 15, 2020

I tried reproducing with CAPA and everything worked as it should: the new machine appears and joins the cluster, the old machine is deleted, and its node disappears when it's removed.

@sedefsavas
Author

I am testing it on a clusterctl-initiated cluster with CAPD. I am investigating a bit more to see if this is related to the image tag upgrades; I will update soon.

@sedefsavas
Author

sedefsavas commented Apr 15, 2020

The e2e test (docker_upgrade_test.go) fails when I change the control plane replica count from 3 to 1.
The physical container is deleted, but the Kubernetes cluster still has the old node in its node list.

E.g., below test-upgrade-0-test-upgrade-0-cwcb2 is the old node.

root@test-upgrade-0-test-upgrade-0-kzhxq:/# kubectl get nodes -A
NAME                                                STATUS     ROLES    AGE     VERSION
test-upgrade-0-test-upgrade-0-cwcb2                 NotReady   master   8m21s   v1.16.3
test-upgrade-0-test-upgrade-0-kzhxq                 Ready      master   6m20s   v1.17.2
test-upgrade-0-test-upgrade-0-md-5b5fdfd689-8dfvw   Ready      <none>   7m49s   v1.16.3

➜  cluster-api git:(sss) ✗ docker ps                                                    
CONTAINER ID        IMAGE                          COMMAND                  CREATED             STATUS              PORTS                                  NAMES
f1afbb205656        kindest/node:v1.17.2           "/usr/local/bin/entr…"   7 minutes ago       Up 7 minutes        45697/tcp, 127.0.0.1:45697->6443/tcp   test-upgrade-0-test-upgrade-0-kzhxq
b1431d3879d4        kindest/node:v1.16.3           "/usr/local/bin/entr…"   8 minutes ago       Up 8 minutes                                               test-upgrade-0-test-upgrade-0-md-5b5fdfd689-8dfvw
ceadbb46bfb1        kindest/haproxy:2.1.1-alpine   "/docker-entrypoint.…"   9 minutes ago       Up 9 minutes        33733/tcp, 0.0.0.0:33733->6443/tcp     test-upgrade-0-lb
54f344d2666e        kindest/node:v1.17.2           "/usr/local/bin/entr…"   11 minutes ago      Up 11 minutes       127.0.0.1:63611->6443/tcp              docker-e2e-hjwvew-control-plane

I also tried upgrading just the Kubernetes version without touching the image tags. Same result: the old node is left dangling. Since @benmoss confirmed that it works in AWS, I suspect this is a CAPD issue.

@vincepri
Member

@sedefsavas scaling down KCP replicas isn't a supported use case, @detiber can you confirm?

@detiber
Member

detiber commented Apr 15, 2020

Scale down is something that we had originally intended to support in the proposal (mainly as a prerequisite for upgrade); not sure if anything has changed since, though.

@vincepri
Member

I was under the assumption that we were not allowing going from 1 replica -> 3 replicas -> 1 replica.

@sedefsavas
Author

@vincepri @detiber Sorry for the confusion. I am not scaling down; I changed the hardcoded control plane replica count in the test from 3 to 1.

@sedefsavas
Author

I see this issue with CAPV too. It is happening more often than not. @benmoss can you rerun this test for CAPA to see if it consistently succeeds?

@vincepri
Member

Is scale-down not working for this node? Can you turn up the logs in the controller and try to trace what happens?

@sedefsavas
Author

sedefsavas commented Apr 16, 2020

Yes, in the scale down. It is failing to remove the etcd member, hence the machine is never deleted.
[manager] E0416 21:46:02.455746 8 scale.go:117] "msg"="Failed to remove etcd member for machine" "error"="failed to create etcd client: unable to create etcd client: context deadline exceeded" "cluster-nanme"="sedef" "name"="sedef" "namespace"="default"

I don't understand why it happens only sometimes though.
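
For context on where that error string comes from: removing an etcd member requires first dialing etcd, and an unreachable member surfaces as a client-creation timeout. The sketch below is a minimal illustration using plain clientv3 with a placeholder endpoint and member ID; it is not the actual KCP client plumbing, which uses its own etcd client wrapper against the workload cluster.

// Minimal sketch (placeholder endpoint and member ID, not CAPI's real etcd
// client code): shows how an unreachable etcd member turns into
// "failed to create etcd client: ... context deadline exceeded".
package main

import (
	"context"
	"fmt"
	"time"

	"go.etcd.io/etcd/clientv3"
	"google.golang.org/grpc"
)

func removeEtcdMember(ctx context.Context, endpoint string, memberID uint64) error {
	ctx, cancel := context.WithTimeout(ctx, 10*time.Second)
	defer cancel()

	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{endpoint}, // placeholder: an etcd endpoint on a remaining control plane node
		DialTimeout: 5 * time.Second,
		// Block on dial so an unreachable member fails here rather than on the first RPC.
		DialOptions: []grpc.DialOption{grpc.WithBlock()},
	})
	if err != nil {
		// This is the class of error seen in the log line above.
		return fmt.Errorf("failed to create etcd client: %w", err)
	}
	defer cli.Close()

	// Remove the member belonging to the machine being scaled down.
	if _, err := cli.MemberRemove(ctx, memberID); err != nil {
		return fmt.Errorf("failed to remove etcd member %d: %w", memberID, err)
	}
	return nil
}

func main() {
	if err := removeEtcdMember(context.Background(), "https://127.0.0.1:2379", 0 /* placeholder member ID */); err != nil {
		fmt.Println(err)
	}
}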

@vincepri
Member

Sounds like it could be a timing issue; it'd be great to have an exact trace when it fails.

@benmoss

benmoss commented Apr 17, 2020

Can you give some more details of what the change you're making is? You're upgrading Kubernetes, etcd, and CoreDNS all at the same time?

@sedefsavas
Author

No, only upgrading the Kubernetes version.
I see different errors at different times. One of them is a panic during ForwardEtcdLeadership(); this happens rarely.

[manager] I0417 01:59:06.749227 7 controller.go:179] controllers/KubeadmControlPlane "msg"="Reconcile KubeadmControlPlane" "cluster"="sedef" "kubeadmControlPlane"="sedef" "namespace"="default"
[manager] I0417 01:59:06.750460 7 controller.go:225] controllers/KubeadmControlPlane "msg"="Upgrading Control Plane" "cluster"="sedef" "kubeadmControlPlane"="sedef" "namespace"="default"
[manager] E0417 01:59:41.357403 7 runtime.go:78] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
[manager] goroutine 316 [running]:
[manager] k8s.io/apimachinery/pkg/util/runtime.logPanic(0x16b0ec0, 0x2712180)
[manager] /Users/ssavas/go/pkg/mod/k8s.io/[email protected]/pkg/util/runtime/runtime.go:74 +0xa3
[manager] k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
[manager] /Users/ssavas/go/pkg/mod/k8s.io/[email protected]/pkg/util/runtime/runtime.go:48 +0x82
[manager] panic(0x16b0ec0, 0x2712180)
[manager] /usr/local/Cellar/go/1.13.8/libexec/src/runtime/panic.go:679 +0x1b2
[manager] sigs.k8s.io/cluster-api/controlplane/kubeadm/internal.(*Workload).ForwardEtcdLeadership(0xc00e0051d0, 0x1b0da20, 0xc000046098, 0xc00d614900, 0xc00d614b20, 0xc00bc79788, 0x0)
[manager] /Users/ssavas/dev/capi/tilttest/cluster-api/controlplane/kubeadm/internal/workload_cluster_etcd.go:275 +0x1b9
[manager] sigs.k8s.io/cluster-api/controlplane/kubeadm/controllers.(*KubeadmControlPlaneReconciler).scaleDownControlPlane(0xc000120ba0, 0x1b0da20, 0xc000046098, 0xc00b70fc80, 0xc00c99eb00, 0xc00e857788, 0x0, 0x0, 0x0, 0x0)
[manager] /Users/ssavas/dev/capi/tilttest/cluster-api/controlplane/kubeadm/controllers/scale.go:103 +0x260
[manager] sigs.k8s.io/cluster-api/controlplane/kubeadm/controllers.(*KubeadmControlPlaneReconciler).upgradeControlPlane(0xc000120ba0, 0x1b0da20, 0xc000046098, 0xc00b70fc80, 0xc00c99eb00, 0xc00bc79788, 0x5, 0xc00012fdd0, 0x1, 0x1)
[manager] /Users/ssavas/dev/capi/tilttest/cluster-api/controlplane/kubeadm/controllers/upgrade.go:91 +0x69c
[manager] sigs.k8s.io/cluster-api/controlplane/kubeadm/controllers.(*KubeadmControlPlaneReconciler).reconcile(0xc000120ba0, 0x1b0da20, 0xc000046098, 0xc00b70fc80, 0xc00c99eb00, 0x0, 0x0, 0x0, 0xc000b34750)
[manager] /Users/ssavas/dev/capi/tilttest/cluster-api/controlplane/kubeadm/controllers/controller.go:226 +0x11c1
[manager] sigs.k8s.io/cluster-api/controlplane/kubeadm/controllers.(*KubeadmControlPlaneReconciler).Reconcile(0xc000120ba0, 0xc0007f1500, 0x7, 0xc0007f14f0, 0x5, 0xc00bc79c00, 0x4a817c800, 0x0, 0x0)
[manager] /Users/ssavas/dev/capi/tilttest/cluster-api/controlplane/kubeadm/controllers/controller.go:173 +0x632
[manager] sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0xc00071e180, 0x1712260, 0xc0037ced80, 0x0)
[manager] /Users/ssavas/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:256 +0x162
[manager] sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc00071e180, 0xc0008e4700)
[manager] /Users/ssavas/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:232 +0xcb
[manager] sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker(0xc00071e180)
[manager] /Users/ssavas/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:211 +0x2b
[manager] k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1(0xc000314200)
[manager] /Users/ssavas/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:152 +0x5e
[manager] k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc000314200, 0x3b9aca00, 0x0, 0x1, 0xc00016d3e0)
[manager] /Users/ssavas/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:153 +0xf8
[manager] k8s.io/apimachinery/pkg/util/wait.Until(0xc000314200, 0x3b9aca00, 0xc00016d3e0)
[manager] /Users/ssavas/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:88 +0x4d
[manager] created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1
[manager] /Users/ssavas/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:193 +0x328
[manager] I0417 01:59:41.367953 7 controller.go:179] controllers/KubeadmControlPlane "msg"="Reconcile KubeadmControlPlane" "cluster"="sedef" "kubeadmControlPlane"="sedef" "namespace"="default"
[manager] I0417 01:59:41.372939 7 controller.go:225] controllers/KubeadmControlPlane "msg"="Upgrading Control Plane" "cluster"="sedef" "kubeadmControlPlane"="sedef" "namespace"="default"
[manager] panic: runtime error: invalid memory address or nil pointer dereference [recovered]
[manager] panic: runtime error: invalid memory address or nil pointer dereference
[manager] [signal SIGSEGV: segmentation violation code=0x1 addr=0x28 pc=0x152f099]
[manager] goroutine 316 [running]:
[manager] k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
[manager] /Users/ssavas/go/pkg/mod/k8s.io/[email protected]/pkg/util/runtime/runtime.go:55 +0x105
[manager] panic(0x16b0ec0, 0x2712180)
[manager] /usr/local/Cellar/go/1.13.8/libexec/src/runtime/panic.go:679 +0x1b2
[manager] sigs.k8s.io/cluster-api/controlplane/kubeadm/internal.(*Workload).ForwardEtcdLeadership(0xc00e0051d0, 0x1b0da20, 0xc000046098, 0xc00d614900, 0xc00d614b20, 0xc00bc79788, 0x0)
[manager] /Users/ssavas/dev/capi/tilttest/cluster-api/controlplane/kubeadm/internal/workload_cluster_etcd.go:275 +0x1b9
[manager] sigs.k8s.io/cluster-api/controlplane/kubeadm/controllers.(*KubeadmControlPlaneReconciler).scaleDownControlPlane(0xc000120ba0, 0x1b0da20, 0xc000046098, 0xc00b70fc80, 0xc00c99eb00, 0xc00e857788, 0x0, 0x0, 0x0, 0x0)
[manager] /Users/ssavas/dev/capi/tilttest/cluster-api/controlplane/kubeadm/controllers/scale.go:103 +0x260
[manager] sigs.k8s.io/cluster-api/controlplane/kubeadm/controllers.(*KubeadmControlPlaneReconciler).upgradeControlPlane(0xc000120ba0, 0x1b0da20, 0xc000046098, 0xc00b70fc80, 0xc00c99eb00, 0xc00bc79788, 0x5, 0xc00012fdd0, 0x1, 0x1)
[manager] /Users/ssavas/dev/capi/tilttest/cluster-api/controlplane/kubeadm/controllers/upgrade.go:91 +0x69c
[manager] sigs.k8s.io/cluster-api/controlplane/kubeadm/controllers.(*KubeadmControlPlaneReconciler).reconcile(0xc000120ba0, 0x1b0da20, 0xc000046098, 0xc00b70fc80, 0xc00c99eb00, 0x0, 0x0, 0x0, 0xc000b34750)
[manager] /Users/ssavas/dev/capi/tilttest/cluster-api/controlplane/kubeadm/controllers/controller.go:226 +0x11c1
[manager] sigs.k8s.io/cluster-api/controlplane/kubeadm/controllers.(*KubeadmControlPlaneReconciler).Reconcile(0xc000120ba0, 0xc0007f1500, 0x7, 0xc0007f14f0, 0x5, 0xc00bc79c00, 0x4a817c800, 0x0, 0x0)
[manager] /Users/ssavas/dev/capi/tilttest/cluster-api/controlplane/kubeadm/controllers/controller.go:173 +0x632

@vincepri
Member

It seems like that failure is related to the leaderCandidate not having a NodeRef yet, which is a little strange.

@vincepri
Member

Are we no longer waiting for all the nodes in the control plane to be ready before proceeding to delete the older machines?

@sedefsavas
Author

I don't see a node ready check; even without a CNI installed, 3-node control plane upgrades are working fine.

@vincepri
Member

Can you run a test locally with a custom build, after adding a check that the NodeRef is there for the leaderCandidate, and see if that fixes it?
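
For reference, the kind of guard being asked for here is roughly the sketch below. Names and the requeue interval are illustrative only (the real call site is scaleDownControlPlane in controlplane/kubeadm/controllers/scale.go), and a later comment in this thread notes that the missing NodeRef turned out not to be the root cause.

// Sketch of the suggested guard: skip forwarding etcd leadership and requeue
// while the chosen leader candidate Machine has no NodeRef yet, instead of
// letting ForwardEtcdLeadership dereference a nil NodeRef.
package controllers

import (
	"time"

	clusterv1 "sigs.k8s.io/cluster-api/api/v1alpha3"
	ctrl "sigs.k8s.io/controller-runtime"
)

// hasNodeRef reports whether the Machine has registered a Node.
func hasNodeRef(m *clusterv1.Machine) bool {
	return m != nil && m.Status.NodeRef != nil
}

// guardLeaderCandidate returns a requeue result (and false) when the candidate
// is not ready to take over etcd leadership yet.
func guardLeaderCandidate(leaderCandidate *clusterv1.Machine) (ctrl.Result, bool) {
	if !hasNodeRef(leaderCandidate) {
		return ctrl.Result{RequeueAfter: 20 * time.Second}, false
	}
	return ctrl.Result{}, true
}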

@sedefsavas
Author

Another set of errors I see, in the cases when it does not panic during the upgrade, is an etcd remove-member error:

I0417 18:26:52.486997       7 controller.go:225] controllers/KubeadmControlPlane "msg"="Upgrading Control Plane" "cluster"="sedef" "kubeadmControlPlane"="sedef" "namespace"="default" 
E0417 18:27:03.883410       7 controller.go:258] controller-runtime/controller "msg"="Reconciler error" "error"="failed to reconcile the remote kubelet RBAC role: failed to determine if kubelet config rbac role \"kubeadm:kubelet-config-1.17\" already exists: etcdserver: request timed out"  "controller"="kubeadmcontrolplane" "request"={"Namespace":"default","Name":"sedef"}
I0417 18:27:03.883958       7 controller.go:179] controllers/KubeadmControlPlane "msg"="Reconcile KubeadmControlPlane" "cluster"="sedef" "kubeadmControlPlane"="sedef" "namespace"="default" 
I0417 18:27:03.884551       7 controller.go:225] controllers/KubeadmControlPlane "msg"="Upgrading Control Plane" "cluster"="sedef" "kubeadmControlPlane"="sedef" "namespace"="default" 
I0417 18:27:14.322259       7 controller.go:179] controllers/KubeadmControlPlane "msg"="Reconcile KubeadmControlPlane" "cluster"="sedef" "kubeadmControlPlane"="sedef" "namespace"="default" 
I0417 18:27:14.322810       7 controller.go:225] controllers/KubeadmControlPlane "msg"="Upgrading Control Plane" "cluster"="sedef" "kubeadmControlPlane"="sedef" "namespace"="default" 
I0417 18:27:25.408750       7 controller.go:179] controllers/KubeadmControlPlane "msg"="Reconcile KubeadmControlPlane" "cluster"="sedef" "kubeadmControlPlane"="sedef" "namespace"="default" 
I0417 18:27:25.409980       7 controller.go:225] controllers/KubeadmControlPlane "msg"="Upgrading Control Plane" "cluster"="sedef" "kubeadmControlPlane"="sedef" "namespace"="default" 
I0417 18:27:49.643253       7 controller.go:179] controllers/KubeadmControlPlane "msg"="Reconcile KubeadmControlPlane" "cluster"="sedef" "kubeadmControlPlane"="sedef" "namespace"="default" 
I0417 18:27:49.643485       7 controller.go:225] controllers/KubeadmControlPlane "msg"="Upgrading Control Plane" "cluster"="sedef" "kubeadmControlPlane"="sedef" "namespace"="default" 
I0417 18:28:09.644642       7 controller.go:179] controllers/KubeadmControlPlane "msg"="Reconcile KubeadmControlPlane" "cluster"="sedef" "kubeadmControlPlane"="sedef" "namespace"="default" 
I0417 18:28:09.645094       7 controller.go:225] controllers/KubeadmControlPlane "msg"="Upgrading Control Plane" "cluster"="sedef" "kubeadmControlPlane"="sedef" "namespace"="default" 
I0417 18:28:59.460770       7 controller.go:179] controllers/KubeadmControlPlane "msg"="Reconcile KubeadmControlPlane" "cluster"="sedef" "kubeadmControlPlane"="sedef" "namespace"="default" 
I0417 18:28:59.461633       7 controller.go:225] controllers/KubeadmControlPlane "msg"="Upgrading Control Plane" "cluster"="sedef" "kubeadmControlPlane"="sedef" "namespace"="default" 
I0417 18:29:44.932322       7 controller.go:179] controllers/KubeadmControlPlane "msg"="Reconcile KubeadmControlPlane" "cluster"="sedef" "kubeadmControlPlane"="sedef" "namespace"="default" 
I0417 18:29:44.932766       7 controller.go:225] controllers/KubeadmControlPlane "msg"="Upgrading Control Plane" "cluster"="sedef" "kubeadmControlPlane"="sedef" "namespace"="default" 
I0417 18:30:31.972144       7 controller.go:179] controllers/KubeadmControlPlane "msg"="Reconcile KubeadmControlPlane" "cluster"="sedef" "kubeadmControlPlane"="sedef" "namespace"="default" 
I0417 18:30:31.973192       7 controller.go:225] controllers/KubeadmControlPlane "msg"="Upgrading Control Plane" "cluster"="sedef" "kubeadmControlPlane"="sedef" "namespace"="default" 
I0417 18:31:02.415326       7 controller.go:179] controllers/KubeadmControlPlane "msg"="Reconcile KubeadmControlPlane" "cluster"="sedef" "kubeadmControlPlane"="sedef" "namespace"="default" 
I0417 18:31:02.415963       7 controller.go:225] controllers/KubeadmControlPlane "msg"="Upgrading Control Plane" "cluster"="sedef" "kubeadmControlPlane"="sedef" "namespace"="default" 
I0417 18:31:29.956943       7 controller.go:179] controllers/KubeadmControlPlane "msg"="Reconcile KubeadmControlPlane" "cluster"="sedef" "kubeadmControlPlane"="sedef" "namespace"="default" 
I0417 18:31:29.957272       7 controller.go:225] controllers/KubeadmControlPlane "msg"="Upgrading Control Plane" "cluster"="sedef" "kubeadmControlPlane"="sedef" "namespace"="default" 
I0417 18:31:59.924889       7 controller.go:179] controllers/KubeadmControlPlane "msg"="Reconcile KubeadmControlPlane" "cluster"="sedef" "kubeadmControlPlane"="sedef" "namespace"="default" 
I0417 18:31:59.925248       7 controller.go:225] controllers/KubeadmControlPlane "msg"="Upgrading Control Plane" "cluster"="sedef" "kubeadmControlPlane"="sedef" "namespace"="default" 
E0417 18:32:32.872552       7 scale.go:108]  "msg"="Failed to remove etcd member for machine" "error"="failed to create etcd client: unable to create etcd client: context deadline exceeded" "cluster-name"="sedef" "name"="sedef" "namespace"="default" 
E0417 18:32:34.915981       7 controller.go:258] controller-runtime/controller "msg"="Reconciler error" "error"="failed to create etcd client: unable to create etcd client: context deadline exceeded"  "controller"="kubeadmcontrolplane" "request"={"Namespace":"default","Name":"sedef"}
I0417 18:32:34.916224       7 controller.go:179] controllers/KubeadmControlPlane "msg"="Reconcile KubeadmControlPlane" "cluster"="sedef" "kubeadmControlPlane"="sedef" "namespace"="default" 
I0417 18:32:34.916528       7 controller.go:225] controllers/KubeadmControlPlane "msg"="Upgrading Control Plane" "cluster"="sedef" "kubeadmControlPlane"="sedef" "namespace"="default" 
I0417 18:32:59.927359       7 controller.go:179] controllers/KubeadmControlPlane "msg"="Reconcile KubeadmControlPlane" "cluster"="sedef" "kubeadmControlPlane"="sedef" "namespace"="default" 
I0417 18:32:59.927820       7 controller.go:225] controllers/KubeadmControlPlane "msg"="Upgrading Control Plane" "cluster"="sedef" "kubeadmControlPlane"="sedef" "namespace"="default" 
E0417 18:34:38.345803       7 controller.go:147] controllers/KubeadmControlPlane "msg"="Failed to update KubeadmControlPlane Status" "error"="Get https://192.168.5.16:6443/api/v1/nodes?labelSelector=node-role.kubernetes.io%2Fmaster%3D\u0026timeout=30s: net/http: request canceled (Client.Timeout exceeded while awaiting headers)" "cluster"="sedef" "kubeadmControlPlane"="sedef" "namespace"="default" 
E0417 18:34:38.355740       7 controller.go:258] controller-runtime/controller "msg"="Reconciler error" "error"="Get https://192.168.5.16:6443/api/v1/nodes?labelSelector=node-role.kubernetes.io%2Fmaster%3D\u0026timeout=30s: net/http: request canceled (Client.Timeout exceeded while awaiting headers)"  "controller"="kubeadmcontrolplane" "request"={"Namespace":"default","Name":"sedef"}
I0417 18:34:38.356737       7 controller.go:179] controllers/KubeadmControlPlane "msg"="Reconcile KubeadmControlPlane" "cluster"="sedef" "kubeadmControlPlane"="sedef" "namespace"="default" 
I0417 18:34:38.359137       7 controller.go:225] controllers/KubeadmControlPlane "msg"="Upgrading Control Plane" "cluster"="sedef" "kubeadmControlPlane"="sedef" "namespace"="default" 
E0417 18:34:56.915696       7 controller.go:258] controller-runtime/controller "msg"="Reconciler error" "error"="failed to reconcile the remote kubelet RBAC role: failed to determine if kubelet config rbac role \"kubeadm:kubelet-config-1.17\" already exists: Get https://192.168.5.16:6443/apis/rbac.authorization.k8s.io/v1/namespaces/kube-system/roles/kubeadm:kubelet-config-1.17?timeout=30s: http2: server sent GOAWAY and closed the connection; LastStreamID=14267, ErrCode=NO_ERROR, debug=\"\""  "controller"="kubeadmcontrolplane" "request"={"Namespace":"default","Name":"sedef"}
I0417 18:34:56.916109       7 controller.go:179] controllers/KubeadmControlPlane "msg"="Reconcile KubeadmControlPlane" "cluster"="sedef" "kubeadmControlPlane"="sedef" "namespace"="default" 
I0417 18:34:56.916387       7 controller.go:225] controllers/KubeadmControlPlane "msg"="Upgrading Control Plane" "cluster"="sedef" "kubeadmControlPlane"="sedef" "namespace"="default" 
W0417 18:35:14.844579       7 http.go:392] Error reading backend response: unexpected EOF

Now testing what @vincepri suggested.

@vincepri
Member

If I had to guess, it seems like it's trying to remove the etcd member too soon

@sedefsavas
Author

sedefsavas commented Apr 17, 2020

@vincepri NodeRef is not the issue; we already wait for the kube-apiserver to be ready.
Tested, same result.

@sedefsavas
Author

/assign

@vincepri
Member

vincepri commented Apr 17, 2020

We shouldn't have panics though, so we need a check in place somewhere before doing that scale down

we already wait for the kube-apiserver to be ready

We need to wait for the NodeRef to be on Machines
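
Concretely, the wait described here could take the form of a preflight over the control plane's Machines before any scale down is attempted. The sketch below uses an assumed helper name and a plain Machine slice rather than KCP's internal machine-collection helpers.

// Sketch of the "wait for NodeRef to be on Machines" precondition: before the
// controller picks a machine to delete (and touches etcd), require that every
// control plane Machine has a NodeRef; otherwise requeue and retry later.
package controllers

import (
	"time"

	clusterv1 "sigs.k8s.io/cluster-api/api/v1alpha3"
	ctrl "sigs.k8s.io/controller-runtime"
)

func preflightNodeRefs(machines []*clusterv1.Machine) (ctrl.Result, bool) {
	for _, m := range machines {
		if m.Status.NodeRef == nil {
			// At least one machine has not registered its Node yet; defer the scale down.
			return ctrl.Result{RequeueAfter: 20 * time.Second}, false
		}
	}
	return ctrl.Result{}, true
}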

@Xenwar

Xenwar commented Apr 28, 2020

Yes, in the scale down. It is failing to remove the etcd member, hence the machine is never deleted.
[manager] E0416 21:46:02.455746 8 scale.go:117] "msg"="Failed to remove etcd member for machine" "error"="failed to create etcd client: unable to create etcd client: context deadline exceeded" "cluster-nanme"="sedef" "name"="sedef" "namespace"="default"

I don't understand why it happens only sometimes though.

I have seen that consistently in metal3-dev-env during the Kubernetes version upgrade process: both during scale up and, once it gets that far, during scale down.

@sedefsavas Do you have any update on this issue, or a timeline?

@vincepri
Member

@Xenwar There is a PR currently open with more information, #2958. This is considered release blocking; we need to have a fix before v0.3.4 is cut.

@Xenwar

Xenwar commented Apr 28, 2020

@vincepri Thanks, will follow the issue.
