
Kubeadm fails to bring up a HA cluster due to EOF error when uploading configmap #1321

Closed
iverberk opened this issue Dec 13, 2018 · 38 comments · Fixed by kubernetes/kubernetes#73093
Labels
area/HA help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. kind/bug Categorizes issue or PR as related to a bug. lifecycle/active Indicates that an issue or PR is actively being worked on by a contributor. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release.
Milestone

Comments

@iverberk

iverberk commented Dec 13, 2018

What keywords did you search in kubeadm issues before filing this one?

  • EOF uploading config

Is this a BUG REPORT or FEATURE REQUEST?

BUG REPORT

Versions

kubeadm version : kubeadm version: &version.Info{Major:"1", Minor:"13", GitVersion:"v1.13.0", GitCommit:"ddf47ac13c1a9483ea035a79cd7c10005ff21a6d", GitTreeState:"clean", BuildDate:"2018-12-03T21:02:01Z", GoVersion:"go1.11.2", Compiler:"gc", Platform:"linux/amd64"}

Environment:

  • Kubernetes version : Client Version: version.Info{Major:"1", Minor:"13", GitVersion:"v1.13.0", GitCommit:"ddf47ac13c1a9483ea035a79cd7c10005ff21a6d", GitTreeState:"clean", BuildDate:"2018-12-03T21:04:45Z", GoVersion:"go1.11.2", Compiler:"gc", Platform:"linux/amd64"}
  • Cloud provider or hardware configuration: Virtualbox VM
  • OS (e.g. from /etc/os-release): Ubuntu 18.04.1 LTS
  • Kernel (e.g. uname -a): Linux controller-1 4.15.0-38-generic #41-Ubuntu SMP Wed Oct 10 10:59:38 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
  • Other: I have three VMs running that are connected via a host network with VirtualBox (10.10.0.11, 10.10.0.12 and 10.10.0.13). There is a Docker container running on my host that binds to the gateway address for the host network (10.10.0.1) to provide a control plane endpoint that the controller nodes can use. This worked flawlessly with the 1.12 version of Kubernetes (also a kubeadm install).

What happened?

I'm trying to set up an HA cluster with three control plane nodes. I can successfully bootstrap the first controller, but when I try to join the second controller it fails. After writing the etcd pod manifest it tries to write the new kubeadm-config (I guess with the updated controller API endpoints) but fails with:

error uploading configuration: Get https://10.10.0.1:6443/api/v1/namespaces/kube-system/configmaps/kubeadm-config: unexpected EOF

I'm using an HAProxy load balancer in front of the three (to-be) API server nodes. HAProxy is querying the health endpoint of the API server and getting successful responses. Before joining the second controller I can successfully curl the endpoint with:

watch -n0.5 curl -k https://10.10.0.1:6443/api/v1/namespaces/kube-public/configmaps/cluster-info

When the second controller joins, the above curl fails with an EOF and only starts working again about 20-30 seconds later. In the meantime the join command tries to upload the new config and crashes.
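
For reference, a rough way to measure how long this window lasts is to poll the load-balanced endpoint in a loop while the join runs. A minimal sketch using the endpoint from this report (address and interval are assumptions to adjust):

# poll the control plane endpoint once per second and log when it stops answering
while true; do
  if ! curl -sk --max-time 2 -o /dev/null \
      https://10.10.0.1:6443/api/v1/namespaces/kube-public/configmaps/cluster-info; then
    echo "$(date +%T) endpoint unavailable"
  fi
  sleep 1
done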

What you expected to happen?

I would have expected the config upload to succeed, either because kubeadm waits for a healthy control plane or because the API server has no problems in the first place.

How to reproduce it (as minimally and precisely as possible)?

I'm setting this up in a Vagrant environment, but I guess it's no different from what is described on the https://kubernetes.io/docs/setup/independent/high-availability/ page. Here is my kubeadm config for the first controller:

apiVersion: kubeadm.k8s.io/v1beta1
kind: InitConfiguration
bootstrapTokens:
- ttl: 1s
nodeRegistration:
  name: controller-1
  kubeletExtraArgs:
    node-ip: 10.10.0.11
    hostname-override: controller-1
localAPIEndpoint:
  advertiseAddress: 10.10.0.11
---
apiVersion: kubeadm.k8s.io/v1beta1
kind: ClusterConfiguration
kubernetesVersion: 1.13.0
clusterName: local
useHyperKubeImage: true
apiServer:
  certSANs:
  - "10.10.0.1"
  extraArgs:
    oidc-ca-file: /etc/kubernetes/pki/front-proxy-ca.crt
    oidc-issuer-url: https://keycloak.k8s.local/auth/realms/Kubernetes
    oidc-client-id: kubernetes
    oidc-username-claim: preferred_username
    oidc-username-prefix: user-
    oidc-groups-claim: groups
    oidc-groups-prefix: group-
    advertise-address: 10.10.0.11
    etcd-servers: "https://10.10.0.11:2379,https://10.10.0.12:2379,https://10.10.0.13:2379"
controlPlaneEndpoint: "10.10.0.1:6443"
networking:
  podSubnet: "10.200.0.0/16"

Anything else we need to know?

These are some of the api server logs when the etcd join happens:

E1213 19:03:31.444542       1 status.go:64] apiserver received an error that is not an metav1.Status: rpctypes.EtcdError{code:0xe, desc:"etcdserver: request timed out"}
I1213 19:03:31.444774       1 trace.go:76] Trace[753985076]: "Create /api/v1/namespaces/kube-system/pods" (started: 2018-12-13 19:03:24.441828372 +0000 UTC m=+114.693461957) (total time: 7.002869639s):
Trace[753985076]: [7.002869639s] [7.002606089s] END
I1213 19:03:33.109321       1 trace.go:76] Trace[1420167129]: "GuaranteedUpdate etcd3: *core.Node" (started: 2018-12-13 19:03:31.709712754 +0000 UTC m=+121.961346387) (total time: 1.399576303s):
Trace[1420167129]: [1.399371901s] [1.398742251s] Transaction committed
I1213 19:03:33.109473       1 trace.go:76] Trace[769327697]: "Patch /api/v1/nodes/controller-1/status" (started: 2018-12-13 19:03:31.709618011 +0000 UTC m=+121.961251600) (total time: 1.399840945s):
Trace[769327697]: [1.399731089s] [1.39927972s] Object stored in database
I1213 19:03:33.112016       1 trace.go:76] Trace[2022830437]: "Create /api/v1/namespaces/default/events" (started: 2018-12-13 19:03:26.478524582 +0000 UTC m=+116.730158162) (total time: 6.633469874s):
Trace[2022830437]: [6.633398498s] [6.633329302s] Object stored in database
I1213 19:03:33.112605       1 trace.go:76] Trace[1385314816]: "Create /api/v1/namespaces/kube-system/events" (started: 2018-12-13 19:03:28.97488789 +0000 UTC m=+119.226521475) (total time: 4.137566155s):
Trace[1385314816]: [4.137467107s] [4.137345776s] Object stored in database
E1213 19:03:35.352793       1 status.go:64] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"context canceled"}
I1213 19:03:35.353023       1 trace.go:76] Trace[1661503768]: "Get /api/v1/namespaces/kube-system/endpoints/kube-controller-manager" (started: 2018-12-13 19:03:25.365882522 +0000 UTC m=+115.617516103) (total time: 9.987128454s):
Trace[1661503768]: [9.987128454s] [9.987104356s] END
E1213 19:03:35.886799       1 status.go:64] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"context canceled"}
I1213 19:03:35.887162       1 trace.go:76] Trace[853992113]: "Get /api/v1/namespaces/kube-system/endpoints/kube-controller-manager" (started: 2018-12-13 19:03:25.887702792 +0000 UTC m=+116.139336376) (total time: 9.999444244s):
Trace[853992113]: [9.999444244s] [9.999421457s] END
E1213 19:03:35.932627       1 status.go:64] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"context canceled"}
E1213 19:03:35.933601       1 status.go:64] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"context canceled"}

This is my haproxy config:

defaults
  timeout connect 5000ms
  timeout check 5000ms
  timeout server 30000ms
  timeout client 30000

global
  tune.ssl.default-dh-param 2048

listen stats
  bind :9000
  mode http
  stats enable
  stats hide-version
  stats realm Haproxy\ Statistics
  stats uri /stats

listen apiserver
  bind :6443
  mode tcp
  balance roundrobin
  option httpchk GET /healthz
  http-check expect string ok

  server apiserver1 10.10.0.11:6443 check check-ssl verify none
  server apiserver2 10.10.0.12:6443 check check-ssl verify none
  server apiserver3 10.10.0.13:6443 check check-ssl verify none

listen ingress
  bind :80
  mode http
  balance roundrobin

  server worker1 10.10.0.21:30080 check
  server worker2 10.10.0.22:30080 check
  server worker3 10.10.0.23:30080 check

listen ingress-443
  bind :443 ssl crt /usr/local/etc/haproxy/local-ssl.pem
  mode http
  balance roundrobin

  server worker1 10.10.0.21:30080 check
  server worker2 10.10.0.21:30080 check
  server worker3 10.10.0.23:30080 check
@fabriziopandini
Member

fabriziopandini commented Dec 13, 2018

@iverberk Is there a reason for adding the following extra args:

etcd-servers: "https://10.10.0.11:2379,https://10.10.0.12:2379,https://10.10.0.13:2379"

Could you try without this setting?

@iverberk
Author

No, that is a left-over from the 1.12 configuration. I will try to remove that and update the issue.

@iverberk
Author

sigh I guess sometimes you need someone else to tell you the obvious... that was the culprit, sorry for the hassle. This did work well in 1.12 and, for some reason that I can't remember anymore, this was a necessary configuration parameter to make it work.

@iverberk
Author

I thought this was solved, but the problem still remains, even with the settings removed. Sometimes it works, though. I'm not sure what the exact reason is, but it's most likely some kind of race condition. I've created a test repository to isolate the problem. In this repo: https://github.com/iverberk/kubeadm-cp-test you can find a test setup with Vagrant and Docker that shows the problem when joining the second controller to the first controller. Hopefully this will illustrate the problem and make way for a solution.

@iverberk iverberk reopened this Dec 18, 2018
@neolit123 neolit123 added help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. area/HA priority/awaiting-more-evidence Lowest priority. Possibly useful, but not yet enough support to actually get it done. labels Dec 18, 2018
@iverberk
Author

Ok, new information: bootstrapping is successful if I pre-pull the hyperkube image...I tested this with my own Ansible environment but will update the test repo as well to see if it works.
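
For reference, the pre-pull step described here is just pulling the image on the joining controller before running kubeadm join. A minimal sketch (the exact image name and tag are assumptions, derived from kubernetesVersion: 1.13.0 with useHyperKubeImage: true; verify them for your setup):

# pull the hyperkube image ahead of time so the join doesn't stall on the pull
docker pull k8s.gcr.io/hyperkube:v1.13.0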

@iverberk
Author

I can't quite reproduce the same result in the test repo. I stumbled upon this because after the first installation I would reset the kubeadm installation and run it again. The second time it would succeed. I guess one of the differences is that the image is already there at that point. It is still a weird issue and I'd like to know how you test the bootstrapping scenario and why you never see this behaviour. If this is a Vagrant thing we should be able to pinpoint it.

@fabriziopandini
Member

@iverberk could you kindly retest now that kubernetes/kubernetes#72030 has merged?

@wafflespeanut

wafflespeanut commented Jan 6, 2019

I'm hitting the same issue regardless of pre-pulling the images. The first master bootstraps successfully whereas the second master fails with unexpected EOF when uploading some configuration. The weirdest part is that when I try adding another master node, it bootstraps successfully!

waffles@kube-master-1:~$ kubectl get nodes
NAME            STATUS   ROLES    AGE   VERSION
kube-master-1   Ready    master   28m   v1.13.1
kube-master-2   Ready    <none>   18m   v1.13.1
kube-master-3   Ready    master   86s   v1.13.1
kube-worker-1   Ready    <none>   26m   v1.13.1
kube-worker-2   Ready    <none>   25m   v1.13.1
kube-worker-3   Ready    <none>   25m   v1.13.1

Apparently, the second master isn't a master, but the third master has no trouble. The images were pre-pulled in all machines, so I guess pre-pulling doesn't affect anything.

My kubeadm-config.yaml was the same as the one in the docs (the only difference is the pod subnet, because I was using Flannel):

apiVersion: kubeadm.k8s.io/v1beta1
kind: ClusterConfiguration
kubernetesVersion: stable
networking:
  podSubnet: 10.244.0.0/16
apiServer:
  certSANs:
  - "PUBLIC_IP"
controlPlaneEndpoint: "PUBLIC_IP:PORT"

@iverberk Could you confirm this by adding another master after the second one fails?

@iverberk
Author

iverberk commented Jan 6, 2019 via email

@timothysc timothysc added kind/bug Categorizes issue or PR as related to a bug. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. and removed priority/awaiting-more-evidence Lowest priority. Possibly useful, but not yet enough support to actually get it done. labels Jan 7, 2019
@timothysc timothysc added this to the v1.14 milestone Jan 7, 2019
@fmehrdad

I have the exact same problem.

@fabriziopandini
Member

@iverberk @fmehrdad @wafflespeanut
IMO the EOF error isn't related to image pre-pull at all.
It was related to a race condition fixed by kubernetes/kubernetes#72030 on master and then cherry-picked into v1.13.2.

Could you kindly repeat the test against one of the above versions?
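
For anyone retesting, a sketch of bumping kubeadm to v1.13.2 on Ubuntu, assuming the Kubernetes apt repository is already configured and the package is on hold (as in the provisioning logs below):

apt-mark unhold kubeadm
apt-get update && apt-get install -y kubeadm=1.13.2-00
apt-mark hold kubeadm
kubeadm version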

@iverberk
Author

iverberk commented Jan 12, 2019

I just retested with the repo that I created to reproduce this issue. This is the result of adding the second controller:

controller-2: Setting up docker-ce-cli (5:18.09.1~3-0~ubuntu-bionic) ...
    controller-2: Setting up kubeadm (1.13.2-00) ...
    controller-2: Setting up pigz (2.4-1) ...
    controller-2: Setting up docker-ce (5:18.09.1~3-0~ubuntu-bionic) ...
    controller-2: update-alternatives: using /usr/bin/dockerd-ce to provide /usr/bin/dockerd (dockerd) in auto mode
    controller-2: Created symlink /etc/systemd/system/multi-user.target.wants/docker.service → /lib/systemd/system/docker.service.
    controller-2: Created symlink /etc/systemd/system/sockets.target.wants/docker.socket → /lib/systemd/system/docker.socket.
    controller-2: Processing triggers for ureadahead (0.100.0-20) ...
    controller-2: Processing triggers for libc-bin (2.27-3ubuntu1) ...
    controller-2: Processing triggers for systemd (237-3ubuntu10.4) ...
    controller-2: kubelet set on hold.
    controller-2: kubeadm set on hold.
    controller-2: kubectl set on hold.
    controller-2: [preflight] Running pre-flight checks
    controller-2:       [WARNING SystemVerification]: this Docker version is not on the list of validated versions: 18.09.1. Latest validated version: 18.06
    controller-2: [discovery] Trying to connect to API Server "10.11.0.1:6443"
    controller-2: [discovery] Created cluster-info discovery client, requesting info from "https://10.11.0.1:6443"
    controller-2: [discovery] Cluster info signature and contents are valid and no TLS pinning was specified, will use API Server "10.11.0.1:6443"
    controller-2: [discovery] Successfully established connection with API Server "10.11.0.1:6443"
    controller-2: [join] Reading configuration from the cluster...
    controller-2: [join] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -oyaml'
    controller-2: [join] Running pre-flight checks before initializing the new control plane instance
    controller-2:       [WARNING SystemVerification]: this Docker version is not on the list of validated versions: 18.09.1. Latest validated version: 18.06
    controller-2: [certs] Using the existing "front-proxy-client" certificate and key
    controller-2: [certs] Using the existing "etcd/peer" certificate and key
    controller-2: [certs] Using the existing "etcd/healthcheck-client" certificate and key
    controller-2: [certs] Using the existing "etcd/server" certificate and key
    controller-2: [certs] Using the existing "apiserver-etcd-client" certificate and key
    controller-2: [certs] Using the existing "apiserver" certificate and key
    controller-2: [certs] Using the existing "apiserver-kubelet-client" certificate and key
    controller-2: [certs] valid certificates and keys now exist in "/etc/kubernetes/pki"
    controller-2: [certs] Using the existing "sa" key
    controller-2: [kubeconfig] Writing "admin.conf" kubeconfig file
    controller-2: [kubeconfig] Writing "controller-manager.conf" kubeconfig file
    controller-2: [kubeconfig] Writing "scheduler.conf" kubeconfig file
    controller-2: [etcd] Checking Etcd cluster health
    controller-2: [kubelet] Downloading configuration for the kubelet from the "kubelet-config-1.13" ConfigMap in the kube-system namespace
    controller-2: [kubelet-start] Writing kubelet configuration to file "/var/lib/kubelet/config.yaml"
    controller-2: [kubelet-start] Writing kubelet environment file with flags to file "/var/lib/kubelet/kubeadm-flags.env"
    controller-2: [kubelet-start] Activating the kubelet service
    controller-2: [tlsbootstrap] Waiting for the kubelet to perform the TLS Bootstrap...
    controller-2: [patchnode] Uploading the CRI Socket information "/var/run/dockershim.sock" to the Node API object "controller-2" as an annotation
    controller-2: [etcd] Announced new etcd member joining to the existing etcd cluster
    controller-2: [etcd] Wrote Static Pod manifest for a local etcd instance to "/etc/kubernetes/manifests/etcd.yaml"
    controller-2: [uploadconfig] storing the configuration used in ConfigMap "kubeadm-config" in the "kube-system" Namespace
    controller-2: error uploading configuration: Get https://10.11.0.1:6443/api/v1/namespaces/kube-system/configmaps/kubeadm-config: unexpected EOF
The SSH command responded with a non-zero exit status. Vagrant
assumes that this means the command failed. The output for this command
should be in the log above. Please read the output to determine what
went wrong.

So unless my repo is not following the correct procedure, or the version of kubeadm that is used (1.13.2) is not correct, this is still not fixed.

@fabriziopandini would it be possible for you to evaluate my repo and assess whether it is correct and representative of a vanilla kubeadm HA bootstrap flow?

@fabriziopandini
Member

@ereslibre could you kindly check again the EOF problem on your vagrant setup?

@ereslibre
Contributor

@ereslibre could you kindly check again the EOF problem on your vagrant setup?

I will have a look at it tonight, the race condition fixed by kubernetes/kubernetes#72030 was slightly different, I never saw this one, but I'm happy to try to reproduce the issue and try to find the root cause. I will report back.

@masantiago

I have experienced the same behaviour as you guys. The weirdest thing is to find that the first join (master-2) fails, but the second one is successful (master-3).

vagrant@k8-master1:~$ kubectl get nodes
NAME         STATUS   ROLES    AGE     VERSION
k8-master1   Ready    master   10m     v1.13.2
k8-master2   Ready    <none>   8m36s   v1.13.2
k8-master3   Ready    master   5m39s   v1.13.2

master-2 joining process yields:

[uploadconfig] storing the configuration used in ConfigMap "kubeadm-config" in the "kube-system" Namespace
error uploading configuration: Get https://172.168.33.20:16443/api/v1/namespaces/kube-system/configmaps/kubeadm-config: unexpected EOF

@ereslibre
Contributor

With https://github.com/iverberk/kubeadm-cp-test it's 100% reproducible. I didn't hit it with my project https://github.com/ereslibre/kubernetes-cluster-vagrant though.

I am still in the process of checking what's wrong, but some observations: upon joining the second master, some processes on the first master crash: namely, the controller-manager, the scheduler and etcd afterwards.

Scheduler:

E0113 17:56:42.710557       1 server.go:261] lost master
lost lease

Controller manager:

I0113 17:56:43.379606       1 leaderelection.go:249] failed to renew lease kube-system/kube-controller-manager: failed to tryAcquireOrRenew context deadline exceeded
F0113 17:56:43.379824       1 controllermanager.go:254] leaderelection lost

etcd keeps restarting on the first master too. etcd on the second master keeps failing to start:

2019-01-13 18:03:14.092724 I | etcdmain: rejected connection from "10.11.0.11:50946" (error "remote error: tls: bad certificate", ServerName "")
2019-01-13 18:03:14.107213 I | etcdmain: rejected connection from "10.11.0.11:50948" (error "remote error: tls: bad certificate", ServerName "")
2019-01-13 18:03:14.163532 W | etcdserver: could not get cluster response from https://10.11.0.11:2380: Get https://10.11.0.11:2380/members: EOF
2019-01-13 18:03:14.164947 C | etcdmain: cannot fetch cluster info from peer urls: could not retrieve cluster information from the given urls

@masantiago are you experiencing similar errors in your setup?

@masantiago

Yes, indeed. In fact, I only ended up with the final status above after several restarts of the Weave pod. It seems to be an unstable situation, confirmed when I shut down master-3: I have not been able to access the cluster since then. It always responds with an unexpected EOF.

Do not hesitate to ask me any trace you require.

I will have a look at your project @ereslibre. Did you also try with a 1.13.x version?

@ereslibre
Contributor

I will have a look at your project @ereslibre. Did you also try with a 1.13.x version?

kubernetes-cluster-vagrant is merely a project to make it easier to work on Kubernetes itself (it will soon be deprecated completely in favour of kind). I didn't focus on deploying existing released versions, but it shouldn't be hard to extend it to support that use case.

@iverberk I am changing the code of your project a bit to confirm some things: iverberk/kubeadm-cp-test@master...ereslibre:master

The main change I did was to avoid copying too many certificates and keys, because some certificates won't be valid if you copy all of them directly (certificate SANs use the detected IP address on each machine, and some certificates just cannot be reused on all machines). After this change I no longer see crashes when growing etcd.

So, what I can see at this point (with the changes applied) is that etcd takes a bit longer to start on the second controller, and since we are automatically using the stacked etcd cluster, the uploadconfig phase times out. I don't consider this a race condition (if the theory proves to be right), but rather a timeout that is too low, taking into account that some images need to be pulled on the new machine.

I think I'm not getting this problem on my project because in my case I create a base box with all the dependencies pulled, used for all machines. This means that I don't have to wait for image pulls on kubernetes-cluster-vagrant.

I will report back when I have more information.

@ereslibre
Contributor

ereslibre commented Jan 13, 2019

I think my theory stands. So, what I did was to use: iverberk/kubeadm-cp-test@master...ereslibre:master. It's very important to only copy the certificates that need copying, and not the rest, because if certain certificates exist they won't be generated and their SANs won't include the proper IP addresses on the new machines.

With the changes I previously linked I ran:

# ./create-controllers.sh
# vagrant provision controller-2

And the node joined just fine, without any timeout. This is because I added a docker pull of the images before kubeadm join is called (and only copied the certificates and keys that needed copying). I will create a PR with a fix for the timeout issue.
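
One way to do that pre-pull on the joining controller (a sketch, not necessarily exactly what the linked diff does; the config file path is an assumption, and kubeadm config images pull reads the ClusterConfiguration to decide which images to fetch):

# on the joining controller, before kubeadm join
kubeadm config images pull --config /path/to/kubeadm-config.yaml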

@MalloZup

OK, hopefully this can be fixed then with #1341

@ereslibre
Contributor

ereslibre commented Jan 14, 2019

So, two things after discussing with @fabriziopandini:

  1. Copying the whole /etc/kubernetes/pki from one machine to another will lead to this problem.

    • When we do a kubeadm join we are only checking if the certificates are present, not checking if the SANs match what we expect. @fabriziopandini is +1 to not only check for the presence of the certificate, but to also check its SANs, and if they don't match what we expect, we regenerate them. Priority 0 is etcd, then priority 1 is the apiserver.
  2. The lack of image pre-pulling is addressed in issue #1341 (kubeadm join controlplane not pulling images and fails), with a PR in the works here: kubernetes/kubernetes#72870 (Kubeadm/HA: pull images during join for control-plane).

As for 1., even though we name the explicit certificates that need copying in the documentation (https://kubernetes.io/docs/setup/independent/high-availability/#steps-for-the-rest-of-the-control-plane-nodes), I think we can expect more people to try to copy /etc/kubernetes/pki directly between machines, basically because it's handier. If we address the issue of SAN checking when doing a kubeadm join this wouldn't be a problem, because the certificates that don't match what we expect would simply be recreated.
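
For reference, a sketch of copying only the files that the HA documentation lists, from the first control plane node to a joining one (the hostname and root SSH access are assumptions; everything else under /etc/kubernetes/pki should be left to be generated on the joining node):

NODE=controller-2
ssh "${NODE}" mkdir -p /etc/kubernetes/pki/etcd
scp /etc/kubernetes/pki/ca.crt /etc/kubernetes/pki/ca.key \
    /etc/kubernetes/pki/sa.key /etc/kubernetes/pki/sa.pub \
    /etc/kubernetes/pki/front-proxy-ca.crt /etc/kubernetes/pki/front-proxy-ca.key \
    "${NODE}":/etc/kubernetes/pki/
scp /etc/kubernetes/pki/etcd/ca.crt /etc/kubernetes/pki/etcd/ca.key "${NODE}":/etc/kubernetes/pki/etcd/
scp /etc/kubernetes/admin.conf "${NODE}":/etc/kubernetes/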

I see this as a temporary solution until we have completely addressed the automatic transmission of secrets between masters when creating an HA cluster with kubeadm. Until we have a proper solution this would make it easier to copy things around between machines.

@fabriziopandini
Member

@ereslibre thanks for the wrap up

even though we name the explicit certificates that need copying in the documentation ...

What about adding an explicit warning in the document: don't copy all the certs!

I see this as a temporary solution

Unfortunately automatic transmission of certs will be optional, so this fix is necessary for v1.14 too; nevertheless, let's check if this can be fixed with a small change eligible for backport to v1.13.

if it doesn't match what we expect, we regenerate them

It might be better to error out with a clearer error message (instead of silently changing certs). Wdyt?

@ereslibre
Contributor

It might be better to error out with a clearer error message (instead of silently changing certs). Wdyt?

I would also agree with this solution and feel it's a better one, because by automatically fixing the certificates we would be "promoting" the bad habit of copying everything when not everything is needed, so I'm good with erroring out and panicking the join in that case. We don't try to be smart, we just check if a certificate exists and if it doesn't match what we expect we abort.

@fmehrdad

FYI I am only copying the files listed in https://kubernetes.io/docs/setup/independent/high-availability/ and still have this problem.

@ereslibre
Contributor

ereslibre commented Jan 14, 2019

@fmehrdad Can you double check whether pre-pulling the images on the new node before calling kubeadm join helps? If that's the case, #1341 is the issue.

@fmehrdad

fmehrdad commented Jan 14, 2019 via email

@ereslibre
Contributor

I have been pre-pulling the images using "kubeadm config images pull" ahead of time. It did not help.
I also tried 1.13.2 with no success.

I need more information in order to know what's going on in your setup @fmehrdad. This issue reported by @iverberk comes with a repository that shows two different issues.

  1. etcd failing to grow; this is going to be fixed by a PR that checks certificates are correct if they exist.
  2. Pre-pulling does not happen (#1341: kubeadm join controlplane not pulling images and fails).

Both problems cause the same visible error, but the issues have a different nature. Can you please paste the configurations you are using to deploy the cluster and how your setup is done (HA...)?

@masantiago

masantiago commented Jan 14, 2019

I've just tested with pre-pulling on the second master, and the same behaviour remains, as for @fmehrdad. My configuration is as follows:

  • cni_version: 0.6.0-00
  • kubelet_version: 1.13.2-00
  • kubeadm_version: 1.13.2-00
  • docker_version: 18.06.0ce3-0~ubuntu

Vagrantfile like:

BOX = "ubuntu/xenial64"
config.vm.define "k8-master1" do |app|
  app.vm.box = BOX
  app.vm.network "private_network", ip: "172.168.33.10"
  app.vm.hostname = "k8-master1"
...
config.vm.define "k8-master2" do |app|
  app.vm.box = BOX
  app.vm.network "private_network", ip: "172.168.33.11"
  app.vm.hostname = "k8-master2"
...
config.vm.define "k8-master3" do |app|
  app.vm.box = BOX
  app.vm.network "private_network", ip: "172.168.33.12"
  app.vm.hostname = "k8-master3"

kubeadm-config.yaml

apiVersion: kubeadm.k8s.io/v1beta1
kind: InitConfiguration
localAPIEndpoint:
  advertiseAddress: "172.168.33.10"
  bindPort: 6443
---
apiVersion: kubeadm.k8s.io/v1beta1
kind: ClusterConfiguration
kubernetesVersion: 1.13.2
apiServer:
  certSANs:
  - "172.168.33.20"
controlPlaneEndpoint: "172.168.33.20:16443"   

where 172.168.33.20 is the VIP for the three masters, using keepalived and nginx load balancing.

master1

sudo kubeadm init --config=kubeadm-config.yaml
...
kubectl apply -f "https://cloud.weave.works/k8s/net?k8s-version=$(kubectl version | base64 | tr -d '\n')"

Copying exactly the certs indicated in https://kubernetes.io/docs/setup/independent/high-availability/#stacked-control-plane-and-etcd-nodes

master2

sudo kubeadm config images pull
sudo kubeadm join 172.168.33.20:16443 --token n44hpu.7goanq56edi9v2dl --discovery-token-ca-cert-hash sha256:b40c6a97c2b9c984f471b46c7bf1c40f90a826eec5996d49a63ce8bf19b67608 --experimental-control-plane --apiserver-advertise-address 172.168.33.11

master3

sudo kubeadm config images pull
sudo kubeadm join 172.168.33.20:16443 --token n44hpu.7goanq56edi9v2dl --discovery-token-ca-cert-hash sha256:b40c6a97c2b9c984f471b46c7bf1c40f90a826eec5996d49a63ce8bf19b67608 --experimental-control-plane --apiserver-advertise-address 172.168.33.12

Result:

vagrant@k8-master1:~$ kubectl get nodes
NAME         STATUS   ROLES    AGE   VERSION
k8-master1   Ready    master   17m   v1.13.2
k8-master2   Ready    <none>   13m   v1.13.2
k8-master3   Ready    master   11m   v1.13.2

@ereslibre
Contributor

I can see something that is probably related. When joining the second master to the cluster, the https://172.28.128.25:6443/api/v1/namespaces/kube-system/configmaps/kubeadm-config endpoint takes far longer to answer than on the third.

Master 2 (~14 seconds):

[uploadconfig] storing the configuration used in ConfigMap "kubeadm-config" in the "kube-system" Namespace
I0115 12:52:38.633739    2027 round_trippers.go:419] curl -k -v -XGET  -H "Accept: application/json, */*" -H "User-Agent: kubeadm/v1.14.0 (linux/amd64) kubernetes/1b28775" 'https://172.28.128.25:6443/api/v1/namespaces/kube-system/configmaps/kubeadm-config'
I0115 12:52:52.960533    2027 round_trippers.go:438] GET https://172.28.128.25:6443/api/v1/namespaces/kube-system/configmaps/kubeadm-config 200 OK in 14325 milliseconds
I0115 12:52:52.960568    2027 round_trippers.go:444] Response Headers:
I0115 12:52:52.960576    2027 round_trippers.go:447]     Content-Type: application/json
I0115 12:52:52.960593    2027 round_trippers.go:447]     Content-Length: 1149
I0115 12:52:52.960604    2027 round_trippers.go:447]     Date: Tue, 15 Jan 2019 12:52:54 GMT

Master 3 (~40 milliseconds):

[uploadconfig] storing the configuration used in ConfigMap "kubeadm-config" in the "kube-system" Namespace
I0115 12:54:02.519284    2118 round_trippers.go:419] curl -k -v -XGET  -H "Accept: application/json, */*" -H "User-Agent: kubeadm/v1.14.0 (linux/amd64) kubernetes/1b28775" 'https://172.28.128.25:6443/api/v1/namespaces/kube-system/configmaps/kubeadm-config'
I0115 12:54:02.557663    2118 round_trippers.go:438] GET https://172.28.128.25:6443/api/v1/namespaces/kube-system/configmaps/kubeadm-config 200 OK in 38 milliseconds
I0115 12:54:02.557732    2118 round_trippers.go:444] Response Headers:
I0115 12:54:02.557741    2118 round_trippers.go:447]     Content-Type: application/json
I0115 12:54:02.557747    2118 round_trippers.go:447]     Content-Length: 1218
I0115 12:54:02.557753    2118 round_trippers.go:447]     Date: Tue, 15 Jan 2019 12:54:02 GMT

So I can confirm that the first control-plane join takes a bit longer. It seems we have yet another cause for this issue, and this would match @masantiago's and @fmehrdad's descriptions.

I did the kubeadm join with -v10 in order to find which request was blocking. I'll keep digging for the root cause of this, but we have 3 different issues causing the same visible problem; maybe it's time to split this issue :)
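
For reference, the verbose join invocation looks roughly like this (token, hash and address are placeholders; the flags match the v1.13 commands shown earlier in this thread):

kubeadm join <LB_IP>:6443 --token <token> \
    --discovery-token-ca-cert-hash sha256:<hash> \
    --experimental-control-plane -v=10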

@fmehrdad

My problem was not related to k8s. My nginx-lb config had a very short timeout.

I changed my proxy_timeout from 3s to 24h.

Here is my config:

user nginx;
worker_processes 1;

error_log /var/log/nginx/error.log warn;
pid /var/run/nginx.pid;

events {
  worker_connections 1024;
}

http {
  include /etc/nginx/mime.types;
  default_type application/octet-stream;

  log_format main '$remote_addr - $remote_user [$time_local] "$request" '
                  '$status $body_bytes_sent "$http_referer" '
                  '"$http_user_agent" "$http_x_forwarded_for"';

  access_log /var/log/nginx/access.log main;

  sendfile on;
  #tcp_nopush on;

  keepalive_timeout 65;

  #gzip on;

  include /etc/nginx/conf.d/*.conf;
}

stream {
  upstream apiserver {
    #server IP1:6443 weight=5 max_fails=9 fail_timeout=30s;
    server IP2:6443 weight=5 max_fails=9 fail_timeout=30s;
    #server IP3:6443 weight=5 max_fails=9 fail_timeout=30s;
  }

  server {
    listen 16443;
    proxy_connect_timeout 1s;
    proxy_timeout 24h;
    proxy_pass apiserver;
  }

  log_format proxy '$remote_addr [$time_local] '
                   '$protocol $status $bytes_sent $bytes_received '
                   '$session_time "$upstream_addr" '
                   '"$upstream_bytes_sent" "$upstream_bytes_received" "$upstream_connect_time"';
  access_log /var/log/nginx/access.log proxy;
}

@ereslibre
Contributor

/assign

@k8s-ci-robot
Contributor

@ereslibre: GitHub didn't allow me to assign the following users: ereslibre.

Note that only kubernetes members and repo collaborators can be assigned and that issues/PRs can only have 10 assignees at the same time.
For more information please see the contributor guide

In response to this:

/assign

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@rosti

rosti commented Jan 16, 2019

@ereslibre will be working on this one.

/lifecycle active

@k8s-ci-robot k8s-ci-robot added the lifecycle/active Indicates that an issue or PR is actively being worked on by a contributor. label Jan 16, 2019
@masantiago

@fmehrdad Yesterday I ran a similar test but removed the LB (nginx) entirely. I managed to get all three masters active, but:

  • They correctly scheduled three pods (the masters were untainted; I do not have any workers).
  • When I started shutting down masters, the pods were not rescheduled to the remaining masters, so the situation was not stable.

Can you check whether this is also the case for you?

@masantiago

I just tested with your nginx config, @fmehrdad, and got the same unexpected behaviour when shutting down masters. The pod replicas are not rescheduled and, moreover, when only master 1 is left, I get:

>>kubectl get pods -o wide
Unable to connect to the server: EOF

It seems that the HA of either etcd or control plane is not working properly.

I would really appreciate your feedback on this case.

@ereslibre
Contributor

It seems that the HA of either etcd or control plane is not working properly.

If you have grown your cluster to 3 masters, etcd was also grown to 3. As per the etcd admin guide, in a cluster of 3 the fault tolerance is 1: if you shut down 2 out of 3 etcd instances, your etcd cluster will be unavailable.
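
If you want to verify this from a surviving control plane node, a sketch using etcdctl with the client certificates kubeadm generates (the paths are the kubeadm defaults seen earlier in this thread; the etcdctl binary itself is an assumption, it is not installed by kubeadm):

ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
    --key=/etc/kubernetes/pki/etcd/healthcheck-client.key \
    member list
# "endpoint health" instead of "member list" reports per-member health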

Please, from now on let's keep this issue for the certificate problem, since the reporter provided a repository that exhibited it, and we are keeping this issue open because of that. This issue splits into 3 different cases:

  1. Bad etcd certificates: this issue.
  2. Not pre-pulling images when joining a control plane: #1341 (kubeadm join controlplane not pulling images and fails; fix already merged in master).
  3. Not explicitly waiting for etcd to be healthy when we grow the cluster: #1353 (kubeadm join does not explicitly wait for etcd to have grown when joining secondary control plane).

For any related issue please refer to the explicit issues linked above; otherwise please open a new bug report, since this one is already mixing very different things. Thank you!

@iverberk
Author

@ereslibre and others, just wanted to say a big thank you for investigating this! It's great to see such commitment on making kubeadm a great tool to use.

@ReSearchITEng

Hello,
I have something very similar in the logs, and I can't figure out what is causing these messages in the apiserver logs.

Using k8s v1.12.1 in HA mode.
We have a 3-master cluster up and running. The difference from the above is that we use keepalived to move the VIP from one master to another (therefore no haproxy in front).

E0403 06:41:59.216530 1 status.go:64] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"The order in patch list:\n[map[address: CLUSTER_VIP_ADDR type:ExternalIP] map[address: ACTIVE_MASTER_IP_ADDR type:ExternalIP] map[address:CLUSTER_VIP_ADDR type:InternalIP] map[address:ACTIVE_MASTER_IP_ADDR type:InternalIP]]\n doesn't match $setElementOrder list:\n[map[type:ExternalIP] map[type:InternalIP] map[type:ExternalIP] map[type:InternalIP] map[type:Hostname]]\n"}

ACTIVE_MASTER_IP_ADDR -> is the current master node address where the keepalived is currently in master state (the other 2 are in slave/backup mode).
CLUSTER_VIP_ADDR -> is the VIP address which keepalived moves based on the health of the apiserver.

How to reproduce:
The entire setup is done with the project we have maintained here for quite a while: https://github.com/ReSearchITEng/kubeadm-playbook/
