create HA cluster is flaky #588

Closed · neolit123 opened this issue Jun 4, 2019 · 62 comments
Labels: kind/bug, lifecycle/active, priority/important-soon

neolit123 (Member) commented Jun 4, 2019:

What happened:
I started seeing odd failures in the kind-master and kind-1.14 kubeadm jobs:
https://k8s-testgrid.appspot.com/sig-cluster-lifecycle-kubeadm#kubeadm-kind-master
https://k8s-testgrid.appspot.com/sig-cluster-lifecycle-kubeadm#kubeadm-kind-1.14

after switching to this HA config:

# a cluster with 3 control-planes and 3 workers
kind: Cluster
apiVersion: kind.sigs.k8s.io/v1alpha3
nodes:
- role: control-plane
- role: control-plane
- role: control-plane
- role: worker
- role: worker
- role: worker
I0604 19:15:09.075770     760 join.go:480] [preflight] Retrieving KubeConfig objects
[preflight] Reading configuration from the cluster...
[preflight] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -oyaml'
I0604 19:15:10.310249     760 round_trippers.go:438] GET https://172.17.0.2:6443/api/v1/namespaces/kube-system/configmaps/kubeadm-config 401 Unauthorized in 1233 milliseconds
error execution phase preflight: unable to fetch the kubeadm-config ConfigMap: failed to get config map: Unauthorized 
 ✗ Joining more control-plane nodes 🎮
DEBU[22:15:10] Running: /usr/bin/docker [docker ps -q -a --no-trunc --filter label=io.k8s.sigs.kind.cluster --format {{.Names}}\t{{.Label "io.k8s.sigs.kind.cluster"}} --filter label=io.k8s.sigs.kind.cluster=kind] 
$KUBECONFIG is still set to use /home/lubo-it/.kube/kind-config-kind even though that file has been deleted, remember to unset it
DEBU[22:15:10] Running: /usr/bin/docker [docker rm -f -v kind-control-plane2 kind-control-plane kind-control-plane3 kind-worker kind-worker3 kind-worker2 kind-external-load-balancer] 
⠈⠁ Joining more control-plane nodes 🎮 Error: failed to create cluster: failed to join a control plane node with kubeadm: exit status 1

What you expected to happen:
No errors.

How to reproduce it (as minimally and precisely as possible):

cd kind-src-path
GO111MODULE=on go build
# install the kind binary to PATH
cd kubernetes-src-path
kind build node-image --kube-root=$(pwd)
kind create cluster --config=<path-to-above-ha-config> --image kindest/node:latest
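
While reproducing, the individual containers can also be inspected directly; the label filter and container names below come from the debug output above (shown for illustration, not part of the original steps):

docker ps --filter label=io.k8s.sigs.kind.cluster=kind
docker logs kind-external-load-balancer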

Anything else we need to know?:

  • I cannot reproduce the bug without --loglevel=debug.
  • Sometimes it fails while joining the extra CP nodes, sometimes while joining the workers.

Environment:

  • kind version: (use kind version): master at 43bf0e2
  • Kubernetes version: master at 1409ff38e5828f55
  • Docker version: (use docker info):
Containers: 10
 Running: 7
 Paused: 0
 Stopped: 3
Images: 128
Server Version: 18.06.3-ce
Storage Driver: overlay2
 Backing Filesystem: extfs
 Supports d_type: true
 Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: systemd
Plugins:
 Volume: local
 Network: bridge host macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 468a545b9edcd5932818eb9de8e72413e616e86e
runc version: a592beb5bc4c4092b1b1bac971afed27687340c5
init version: fec3683
Security Options:
 apparmor
 seccomp
  Profile: default
Kernel Version: 4.13.0-41-generic
Operating System: Ubuntu 17.10
OSType: linux
Architecture: x86_64
CPUs: 4
Total Memory: 15.66GiB
Name: luboitvbox
ID: K2H6:2I6N:FSBZ:S77V:R5CQ:X22B:VYTF:WZ4R:UIKC:HGOT:UCHD:GCR2
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false
  • OS (e.g. from /etc/os-release):
NAME="Ubuntu"
VERSION="17.10 (Artful Aardvark)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 17.10"
VERSION_ID="17.10"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=artful
UBUNTU_CODENAME=artful

/kind bug
/priority important-soon

neolit123 added the kind/bug label Jun 4, 2019
k8s-ci-robot added the priority/important-soon label Jun 4, 2019
neolit123 (Member, Author) commented Jun 4, 2019:

Seems to work with 3e5f299; possible regression in #585.

EDIT: never mind, I was able to reproduce it with v0.3.0 (flakes?)
EDIT2: v0.2.0 seems to work consistently.

/assign @BenTheElder

@BenTheElder I'm going to switch the kubeadm jobs to use a stable kind version (kubetest/kind needs an update to pin a newer stable version than 0.1.0):
kubernetes/test-infra#12883

Possibly the sig-testing kind jobs can continue to use kind from master.

aojea (Contributor) commented Jun 4, 2019:

I can't reproduce it, nor can I find a similar error in the jobs. Can you give me a link to some logs? I'm not familiar with testgrid and prow 😅

I0604 19:15:10.310249     760 round_trippers.go:438] GET https://172.17.0.2:6443/api/v1/namespaces/kube-system/configmaps/kubeadm-config 401 Unauthorized in 1233 milliseconds
error execution phase preflight: unable to fetch the kubeadm-config ConfigMap: failed to get config map: Unauthorized 

The error is 401 Unauthorized 🤔

neolit123 (Member, Author) commented Jun 4, 2019:

This:

error execution phase preflight: unable to fetch the kubeadm-config ConfigMap: failed to get config map: Unauthorized

happened here:

https://storage.googleapis.com/kubernetes-jenkins/logs/ci-kubernetes-e2e-kubeadm-kind-master/1135916134365663234/build-log.txt

It's flaky, and I'm able to reproduce it locally with --loglevel=debug; it happens sometimes.
Make sure that you use the HA config too.

Some of the other failures here, however, are different:
https://k8s-testgrid.appspot.com/sig-cluster-lifecycle-kubeadm#kubeadm-kind-master

e.g.
https://storage.googleapis.com/kubernetes-jenkins/logs/ci-kubernetes-e2e-kubeadm-kind-master/1136007475389009922/build-log.txt

The e2e suite passes:

SUCCESS! -- 182 Passed | 0 Failed | 0 Pending | 4230 Skipped

but then kind export logs times out:

time="20:51:10" level=debug msg="Running: /usr/bin/docker [docker exec --privileged kind-kubetest-external-load-balancer tar --hard-dereference -C /var/log -chf - .]"
2019/06/04 21:02:51 process.go:199: Abort after 30m0s timeout during kind export logs /logs/artifacts --loglevel=debug --name=kind-kubetest. Will terminate in another 15m
SIGABRT: abort

neolit123 (Member, Author) commented:

@aojea
I think the kind export logs timeout is reproducible consistently with the HA config.
It just stays at:

DEBU[01:17:24] Running: /usr/bin/docker [docker exec --privileged kind-external-load-balancer tar --hard-dereference -C /var/log -chf - .]

neolit123 (Member, Author) commented Jun 4, 2019:

Also:
kind build node-image --kube-root=$(pwd)
for k/k commit 1409ff38e5828f55

gives me some CRI errors:

INFO[2019-06-04T22:36:46.255776190Z] Get image filesystem path "/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs" 
ERRO[2019-06-04T22:36:46.259945113Z] Failed to load cni during init, please check CRI plugin status before setting up network for pods  error="cni config load failed: no network config found in /etc/cni/net.d: cni plugin not initialized: failed to load cni config"
INFO[2019-06-04T22:36:46.263918747Z] loading plugin "io.containerd.grpc.v1.introspection"...  type=io.containerd.grpc.v1
INFO[2019-06-04T22:36:46.264185013Z] Start subscribing containerd event           
INFO[2019-06-04T22:36:46.265648130Z] Start recovering state                       
INFO[2019-06-04T22:36:46.266610875Z] Start event monitor                          
INFO[2019-06-04T22:36:46.266651618Z] Start snapshots syncer                       
INFO[2019-06-04T22:36:46.266663684Z] Start streaming server                       
INFO[2019-06-04T22:36:46.266727343Z] serving...                                    address="/run/containerd/containerd.sock"
INFO[2019-06-04T22:36:46.266748482Z] containerd successfully booted in 0.099826s  
ERRO[2019-06-04T22:36:46.885194357Z] (*service).Write failed                       error="rpc error: code = Unavailable desc = ref k8s.io/1/tar-0eae156e7f492f3e2fabe7ae144944ce8ce99166f49ed8191f53916870b19e55/VERSION locked: unavailable" ref="tar-0eae156e7f492f3e2fabe7ae144944ce8ce99166f49ed8191f53916870b19e55/VERSION" total=3
ERRO[2019-06-04T22:36:47.112182694Z] (*service).Write failed                       error="rpc error: code = Unavailable desc = ref k8s.io/1/tar-0eae156e7f492f3e2fabe7ae144944ce8ce99166f49ed8191f53916870b19e55/VERSION locked: unavailable" ref="tar-0eae156e7f492f3e2fabe7ae144944ce8ce99166f49ed8191f53916870b19e55/VERSION" total=3
ERRO[2019-06-04T22:36:47.666346302Z] (*service).Write failed                       error="rpc error: code = Unavailable desc = ref k8s.io/1/tar-0eae156e7f492f3e2fabe7ae144944ce8ce99166f49ed8191f53916870b19e55/VERSION locked: unavailable" ref="tar-0eae156e7f492f3e2fabe7ae144944ce8ce99166f49ed8191f53916870b19e55/VERSION" total=3
ERRO[2019-06-04T22:36:48.154099576Z] (*service).Write failed                       error="rpc error: code = Unavailable desc = ref k8s.io/1/tar-0eae156e7f492f3e2fabe7ae144944ce8ce99166f49ed8191f53916870b19e55/layer.tar locked: unavailable" ref="tar-0eae156e7f492f3e2fabe7ae144944ce8ce99166f49ed8191f53916870b19e55/layer.tar" total=43872768
ERRO[2019-06-04T22:36:49.001964021Z] (*service).Write failed                       error="rpc error: code = Unavailable desc = ref k8s.io/1/tar-0eae156e7f492f3e2fabe7ae144944ce8ce99166f49ed8191f53916870b19e55/layer.tar locked: unavailable" ref="tar-0eae156e7f492f3e2fabe7ae144944ce8ce99166f49ed8191f53916870b19e55/layer.tar" total=43872768
ERRO[2019-06-04T22:36:49.503738885Z] (*service).Write failed                       error="rpc error: code = Unavailable desc = ref k8s.io/1/tar-0eae156e7f492f3e2fabe7ae144944ce8ce99166f49ed8191f53916870b19e55/layer.tar locked: unavailable" ref="tar-0eae156e7f492f3e2fabe7ae144944ce8ce99166f49ed8191f53916870b19e55/layer.tar" total=43872768
ERRO[2019-06-04T22:36:50.248597676Z] (*service).Write failed                       error="rpc error: code = Unavailable desc = ref k8s.io/1/tar-0eae156e7f492f3e2fabe7ae144944ce8ce99166f49ed8191f53916870b19e55/layer.tar locked: unavailable" ref="tar-0eae156e7f492f3e2fabe7ae144944ce8ce99166f49ed8191f53916870b19e55/layer.tar" total=43872768
INFO[2019-06-04T22:36:50.630062574Z] ImageCreate event &ImageCreate{Name:k8s.gcr.io/pause:3.1,Labels:map[string]string{},} 
INFO[2019-06-04T22:36:50.656049117Z] ImageCreate event &ImageCreate{Name:sha256:da86e6ba6ca197bf6bc5e9d900febd906b133eaa4750e6bed647b0fbe50ed43e,Labels:map[string]string{io.cri-containerd.image: managed,},} 
INFO[2019-06-04T22:36:50.657063170Z] ImageUpdate event &ImageUpdate{Name:k8s.gcr.io/pause:3.1,Labels:map[string]string{io.cri-containerd.image: managed,},} 
ERRO[2019-06-04T22:36:51.166245090Z] (*service).Write failed                       error="rpc error: code = Unavailable desc = ref k8s.io/1/tar-0eae156e7f492f3e2fabe7ae144944ce8ce99166f49ed8191f53916870b19e55/layer.tar locked: unavailable" ref="tar-0eae156e7f492f3e2fabe7ae144944ce8ce99166f49ed8191f53916870b19e55/layer.tar" total=43872768
ERRO[2019-06-04T22:36:51.618692113Z] (*service).Write failed                       error="rpc error: code = Unavailable desc = ref k8s.io/1/tar-0eae156e7f492f3e2fabe7ae144944ce8ce99166f49ed8191f53916870b19e55/layer.tar locked: unavailable" ref="tar-0eae156e7f492f3e2fabe7ae144944ce8ce99166f49ed8191f53916870b19e55/layer.tar" total=43872768
INFO[2019-06-04T22:36:52.051331802Z] ImageCreate event &ImageCreate{Name:k8s.gcr.io/kube-proxy:v1.16.0-alpha.0.841_1409ff38e5828f-dirty,Labels:map[string]string{},} 
INFO[2019-06-04T22:36:52.100652939Z] ImageCreate event &ImageCreate{Name:sha256:ea22abc123a1c4162fdd6e37a6594b242d5b83284faf8b5dff1fcaaf6c11ae48,Labels:map[string]string{io.cri-containerd.image: managed,},} 
INFO[2019-06-04T22:36:52.101579489Z] ImageUpdate event &ImageUpdate{Name:k8s.gcr.io/kube-proxy:v1.16.0-alpha.0.841_1409ff38e5828f-dirty,Labels:map[string]string{io.cri-containerd.image: managed,},} 
ERRO[2019-06-04T22:36:52.512814469Z] (*service).Write failed                       error="rpc error: code = Unavailable desc = ref k8s.io/1/tar-0eae156e7f492f3e2fabe7ae144944ce8ce99166f49ed8191f53916870b19e55/layer.tar locked: unavailable" ref="tar-0eae156e7f492f3e2fabe7ae144944ce8ce99166f49ed8191f53916870b19e55/layer.tar" total=43872768
ERRO[2019-06-04T22:36:52.979902246Z] (*service).Write failed                       error="rpc error: code = Unavailable desc = ref k8s.io/1/tar-0eae156e7f492f3e2fabe7ae144944ce8ce99166f49ed8191f53916870b19e55/layer.tar locked: unavailable" ref="tar-0eae156e7f492f3e2fabe7ae144944ce8ce99166f49ed8191f53916870b19e55/layer.tar" total=43872768
INFO[2019-06-04T22:36:53.243875342Z] ImageCreate event &ImageCreate{Name:k8s.gcr.io/ip-masq-agent:v2.4.1,Labels:map[string]string{},} 
INFO[2019-06-04T22:36:53.267360225Z] ImageCreate event &ImageCreate{Name:sha256:19bb968f77bba3a5b5f56b5c033d71f699c22bdc8bbe9412f0bfaf7f674a64cc,Labels:map[string]string{io.cri-containerd.image: managed,},} 
INFO[2019-06-04T22:36:53.268120372Z] ImageUpdate event &ImageUpdate{Name:k8s.gcr.io/ip-masq-agent:v2.4.1,Labels:map[string]string{io.cri-containerd.image: managed,},} 
INFO[2019-06-04T22:36:55.324315128Z] ImageCreate event &ImageCreate{Name:k8s.gcr.io/kube-apiserver:v1.16.0-alpha.0.841_1409ff38e5828f-dirty,Labels:map[string]string{},} 
INFO[2019-06-04T22:36:55.328863044Z] ImageCreate event &ImageCreate{Name:sha256:25876c9e6eb0c50f882e6e586f1a41f98d19f44fb94677877d8c096c5b2249c2,Labels:map[string]string{io.cri-containerd.image: managed,},} 
INFO[2019-06-04T22:36:55.329891024Z] ImageUpdate event &ImageUpdate{Name:k8s.gcr.io/kube-apiserver:v1.16.0-alpha.0.841_1409ff38e5828f-dirty,Labels:map[string]string{io.cri-containerd.image: managed,},} 
INFO[2019-06-04T22:36:55.412397894Z] ImageCreate event &ImageCreate{Name:k8s.gcr.io/coredns:1.3.1,Labels:map[string]string{},} 
INFO[2019-06-04T22:36:55.421736948Z] ImageCreate event &ImageCreate{Name:sha256:eb516548c180f8a6e0235034ccee2428027896af16a509786da13022fe95fe8c,Labels:map[string]string{io.cri-containerd.image: managed,},} 
INFO[2019-06-04T22:36:55.422692034Z] ImageUpdate event &ImageUpdate{Name:k8s.gcr.io/coredns:1.3.1,Labels:map[string]string{io.cri-containerd.image: managed,},} 
ERRO[2019-06-04T22:36:57.098307600Z] (*service).Write failed                       error="rpc error: code = Unavailable desc = ref k8s.io/1/tar-repositories locked: unavailable" ref=tar-repositories total=142
INFO[2019-06-04T22:36:57.229823963Z] ImageCreate event &ImageCreate{Name:docker.io/kindest/kindnetd:0.1.0,Labels:map[string]string{},} 
INFO[2019-06-04T22:36:57.280581565Z] ImageCreate event &ImageCreate{Name:sha256:f227066bdc5f9aa2f8a9bb54854e5b7a23c6db8fce0f927e5c4feef8a9e74d46,Labels:map[string]string{io.cri-containerd.image: managed,},} 
INFO[2019-06-04T22:36:57.281617808Z] ImageUpdate event &ImageUpdate{Name:docker.io/kindest/kindnetd:0.1.0,Labels:map[string]string{io.cri-containerd.image: managed,},} 
INFO[2019-06-04T22:36:57.301074980Z] ImageCreate event &ImageCreate{Name:k8s.gcr.io/kube-controller-manager:v1.16.0-alpha.0.841_1409ff38e5828f-dirty,Labels:map[string]string{},} 
INFO[2019-06-04T22:36:57.318367087Z] ImageCreate event &ImageCreate{Name:sha256:c155832fa95abf9b0c33d3f9c39ad43bf55002b853c44394a6dbb89dcc68f55a,Labels:map[string]string{io.cri-containerd.image: managed,},} 
INFO[2019-06-04T22:36:57.319170150Z] ImageUpdate event &ImageUpdate{Name:k8s.gcr.io/kube-controller-manager:v1.16.0-alpha.0.841_1409ff38e5828f-dirty,Labels:map[string]string{io.cri-containerd.image: managed,},} 
INFO[2019-06-04T22:36:57.320687117Z] ImageCreate event &ImageCreate{Name:k8s.gcr.io/kube-scheduler:v1.16.0-alpha.0.841_1409ff38e5828f-dirty,Labels:map[string]string{},} 
INFO[2019-06-04T22:36:57.326232183Z] ImageCreate event &ImageCreate{Name:sha256:7dd88ed8b36300ced4153dcc2d0e4995378da059be497d31271771b19976fedc,Labels:map[string]string{io.cri-containerd.image: managed,},} 
INFO[2019-06-04T22:36:57.327088405Z] ImageUpdate event &ImageUpdate{Name:k8s.gcr.io/kube-scheduler:v1.16.0-alpha.0.841_1409ff38e5828f-dirty,Labels:map[string]string{io.cri-containerd.image: managed,},} 
INFO[2019-06-04T22:37:00.575908437Z] ImageCreate event &ImageCreate{Name:k8s.gcr.io/etcd:3.3.10,Labels:map[string]string{},} 
INFO[2019-06-04T22:37:00.580934619Z] ImageCreate event &ImageCreate{Name:sha256:2c4adeb21b4ff8ed3309d0e42b6b4ae39872399f7b37e0856e673b13c4aba13d,Labels:map[string]string{io.cri-containerd.image: managed,},} 
INFO[2019-06-04T22:37:00.581375948Z] Stop CRI service                           

EDIT: I think the cause of:

error execution phase preflight: unable to fetch the kubeadm-config ConfigMap: failed to get config map: Unauthorized

is a bad node image.

BenTheElder (Member) commented Jun 4, 2019:

"CRI errors" during kind build node-image are normal.

kind export logs and kind load ... appear to be broken on HA because they're trying to do that against the nginx load-balancer node. This has nothing to do with CRI etc.

neolit123 (Member, Author) commented:

A couple of problems:

Looks like --loglevel=debug is not a factor after all.

lubo-it@luboitvbox:~/go/src/k8s.io/kubernetes$ kind create cluster --image kindest/node:latest --config ./../../k8s.io/kubeadm/tests/e2e/kind/ha-cp/ha-cp.yaml
Creating cluster "kind" ...
 ✓ Ensuring node image (kindest/node:latest) 🖼 
 ✓ Preparing nodes 📦📦📦📦📦📦📦 
 ✓ Configuring the external load balancer ⚖️ 
 ✓ Creating kubeadm config 📜 
 ✓ Starting control-plane 🕹️ 
 ✓ Installing CNI 🔌 
 ✓ Installing StorageClass 💾 
 ✗ Joining more control-plane nodes 🎮 
$KUBECONFIG is still set to use /home/lubo-it/.kube/kind-config-kind even though that file has been deleted, remember to unset it
Error: failed to create cluster: failed to join a control plane node with kubeadm: exit status 1
lubo-it@luboitvbox:~/go/src/k8s.io/kubernetes$ kind create cluster --image kindest/node:latest --config ./../../k8s.io/kubeadm/tests/e2e/kind/ha-cp/ha-cp.yaml
Creating cluster "kind" ...
 ✓ Ensuring node image (kindest/node:latest) 🖼 
 ✓ Preparing nodes 📦📦📦📦📦📦📦 
 ✓ Configuring the external load balancer ⚖️ 
 ✓ Creating kubeadm config 📜 
 ✓ Starting control-plane 🕹️ 
 ✓ Installing CNI 🔌 
 ✓ Installing StorageClass 💾 
 ✓ Joining more control-plane nodes 🎮 
 ✓ Joining worker nodes 🚜 
Cluster creation complete. You can now use the cluster with:

export KUBECONFIG="$(kind get kubeconfig-path --name="kind")"
kubectl cluster-info

neolit123 changed the title from "create HA cluster fails with loglevel=debug" to "create HA cluster is flaky" Jun 4, 2019
BenTheElder (Member) commented:

#593 fixed the log export and image load issues; not sure what the remaining flakiness is about.

neolit123 (Member, Author) commented:

https://k8s-testgrid.appspot.com/sig-cluster-lifecycle-kubeadm#kubeadm-kind-master

The latest green run passes with kind from master,
but the previous run has the "Unauthorized in 3567 milliseconds / error execution phase preflight: unable to fetch the kubeadm-config ConfigMap: failed to get config map: Unauthorized" failure.

The one before that had the timeout problem that #593 fixed.

A run that has kubernetes/test-infra#12883 should be in about 1 hour.

aojea (Contributor) commented Jun 5, 2019:

I narrowed it down to this snippet. What can the problem be? What does "invalid bearer token" mean?

Starting autoregister controller
2019-06-05T18:52:11.653957701Z stderr F I0605 18:52:11.653932       1 cache.go:32] Waiting for caches to sync for autoregister controller
2019-06-05T18:52:11.657095624Z stderr F E0605 18:52:11.656920       1 controller.go:148] Unable to remove old endpoints from kubernetes service: StorageError: key not found, Code: 1, Key: /registry/masterleases/172.17.0.7, ResourceVersion: 0, AdditionalErrorMsg: 
2019-06-05T18:52:11.682115562Z stderr F E0605 18:52:11.671678       1 authentication.go:65] Unable to authenticate the request due to an error: invalid bearer token
2019-06-05T18:52:11.682181186Z stderr F E0605 18:52:11.673921       1 authentication.go:65] Unable to authenticate the request due to an error: invalid bearer token
2019-06-05T18:52:11.754070454Z stderr F I0605 18:52:11.753862       1 cache.go:39] Caches are synced for AvailableConditionController controller

https://storage.googleapis.com/kubernetes-jenkins/logs/ci-kubernetes-e2e-kubeadm-kind-master/1136342315124658176/artifacts/kind-kubetest-control-plane3/pods/kube-system_kube-apiserver-kind-kubetest-control-plane3_58215256c50a36017e211d17dfbfd67c/kube-apiserver/0.log

neolit123 (Member, Author) commented Jun 5, 2019:

Bootstrap tokens are technically bearer tokens:
https://kubernetes.io/docs/reference/access-authn-authz/bootstrap-tokens/
AFAIK, this error can happen if kubeadm join passed a bad JWT (the --token part of join).
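
For illustration, a bootstrap token is a simple id.secret pair that the joining node presents as an HTTP bearer token; the request below mirrors the failing GET from the log above (the token value here is made up):

# hypothetical token; real bootstrap tokens match [a-z0-9]{6}.[a-z0-9]{16}
TOKEN="abcdef.0123456789abcdef"
curl -sk -H "Authorization: Bearer ${TOKEN}" \
  https://172.17.0.2:6443/api/v1/namespaces/kube-system/configmaps/kubeadm-config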

Not sure how a token can flake though...
Were you able to reproduce it locally?

On a side note, kubeadm does have concurrent CP join in k/k master now:
xref https://github.com/kubernetes-sigs/kind/blob/master/pkg/cluster/internal/create/actions/kubeadmjoin/join.go#L90-L95
(we can probably try it after this ticket is closed)


neolit123 (Member, Author) commented Jun 6, 2019:

Yes, but we switched it to kind v0.2.0, because it's in the release-informing dashboard and the release team wants it green.
kubernetes/test-infra#12883

aojea (Contributor) commented Jun 6, 2019:

Seems that the etcd cluster wasn't ready; in this case, kind-control-plane3 is the one that wasn't able to join the cluster at the moment of the authentication.
A hypothesis is that the LB forwarded the worker3 join connection to it, and that's why it failed, since the etcd cluster wasn't assembled yet. Should kubeadm handle this scenario?

I think that kind is too fast at the moment :)

2019-06-05T18:52:01.871076762Z stderr F 2019-06-05 18:52:01.870776 W | rafthttp: health check for peer b0588fe16c7809d9 could not connect: dial tcp 172.17.0.7:2380: connect: connection refused (prober "ROUND_TRIPPER_RAFT_MESSAGE")
2019-06-05T18:52:01.871232212Z stderr F 2019-06-05 18:52:01.871073 W | rafthttp: health check for peer b0588fe16c7809d9 could not connect: dial tcp 172.17.0.7:2380: connect: connection refused (prober "ROUND_TRIPPER_SNAPSHOT")
2019-06-05T18:52:01.877611555Z stderr F 2019-06-05 18:52:01.877310 I | embed: rejected connection from "172.17.0.7:50372" (error "remote error: tls: bad certificate", ServerName "")
2019-06-05T18:52:06.918508278Z stderr F 2

https://storage.googleapis.com/kubernetes-jenkins/logs/ci-kubernetes-e2e-kubeadm-kind-master/1136342315124658176/artifacts/kind-kubetest-control-plane3/pods/kube-system_etcd-kind-kubetest-control-plane3_5c9595671afd08252eb2e03d05489ee5/etcd/0.log

neolit123 (Member, Author) commented:

This is a possibility, yes, but the kinder HA jobs are consistently green in this regard:
https://k8s-testgrid.appspot.com/sig-cluster-lifecycle-kubeadm#kubeadm-kinder-master-on-stable
I wonder what ends up being different.

I think we should bisect kind v0.2.0 -> latest and find out what happened.

aojea (Contributor) commented Jun 6, 2019:

That's a fantastic idea. I think I know what's happening; let me do some tests.

aojea (Contributor) commented Jun 6, 2019:

/assign @aojea

thomastaylor312 commented:

I am experiencing this flakiness as well. Let me know if I can help test any fixes!

fabriziopandini (Member) commented Jun 7, 2019:

@aojea @neolit123

One additional question:

Is it possible that the health checks on the load balancer are not working properly (the load balancer lets traffic pass before the API server is fully operative, with etcd)?

And some answers (hopefully helpful):

> kinder HA jobs are consistently green in this regard.

Kinder has wait loops after each action and it is less aggressive than kind in terms of parallelism; e.g.
https://github.com/kubernetes/kubeadm/blob/30be64eaa19c663dd1f8e8d9826608e52d06607c/kinder/pkg/actions/util.go#L85

It is possible to pass --wait=0 to get a behaviour more similar to what is implemented in kind, but parallelism for join is not there at all.

> Should kubeadm handle this scenario?

AFAIK kubeadm join implements a wait loop, but as soon as one API server answers it proceeds with joining.
I will try to investigate whether there is some regression there, even if it seems that the problem is not on the waiting side but on the answering side (kube-apiserver answers before everything, e.g. etcd and RBAC, is properly set up).

aojea (Contributor) commented Jun 7, 2019:

@fabriziopandini sorry for not updating the ticket; you are absolutely right. The problem is that nginx is only doing health checks at the TCP level, whereas haproxy was doing health checks at the TLS level with the ssl-hello-chk option.
As soon as the TCP port is open, nginx forwards traffic to it, but it can happen that the kube-apiserver is not working yet, and thus the join fails.
I was doing research yesterday on load balancers; basically, we need SSL passthrough (L4 load balancing) with SSL health checks:

  • haproxy is the best solution because it has the ssl-hello-chk option (see the config sketch after this list); however, the docker container doesn't have a default configuration and dies unless we bind-mount the config file. We could change the entrypoint to sleep forever, for example; is this an option?
  • nginx: active health checks seem to be only available in the commercial version or with custom modules
  • traefik: TCP services are only available in alpha in the latest version, 2.0.0, and the docs don't mention health checks. It's very powerful and flexible and could be used to solve the port forwarding too; something to check in the future.
  • envoy seems a good alternative; I want to test it today

Platforms/architectures: haproxy and nginx seem to have docker images for almost all platforms, but traefik and envoy are just starting to support more platforms.
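
For reference, here is a minimal haproxy config sketch of what ssl-hello-chk gives us: pure L4/TLS passthrough, but a backend is only marked up once it completes an SSL hello (server names and addresses are made up, and this is not necessarily the config kind ended up shipping):

defaults
    mode tcp
    timeout connect 5s
    timeout client 50s
    timeout server 50s

frontend control-plane
    bind *:6443
    default_backend kube-apiservers

backend kube-apiservers
    # only route to servers that answer an SSL client hello, i.e. skip
    # apiservers whose TCP port is open but whose TLS stack is not up yet
    option ssl-hello-chk
    balance roundrobin
    server control-plane1 172.17.0.3:6443 check
    server control-plane2 172.17.0.4:6443 check
    server control-plane3 172.17.0.5:6443 check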

BenTheElder (Member) commented:

Alternatively, we can simply not load-balance to nodes that have not joined yet.

fabriziopandini (Member) commented:

@aojea thanks for your great work trying to address this issue.

Before answering your question, let me recap the options that, in my opinion, should be considered for solving this issue:

  1. act on the API server, implementing a change for not answering while not healthy
  2. act on kubeadm, implementing a wait cycle that detects when the new etcd member is fully synchronized, and then create the new API server only after this condition is satisfied
  3. act on kubeadm, tuning probes
  4. act on the kind load balancer

IMO 1 is the best option, but it will take some time

I'm in contact with the etcd maintainers about implementing 2 in a smart way (I don't like the idea of adding arbitrary sleeps). ATM there is a promising feature still only on master; let's see if there is also something immediately actionable.
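
To make option 2 concrete, a rough sketch of such a wait loop, assuming etcdctl v3 and kubeadm's default etcd certificate paths (purely illustrative, not kubeadm's actual implementation):

# block until the local etcd member reports healthy before
# generating the API server manifest for the joining node
until ETCDCTL_API=3 etcdctl \
    --endpoints=https://127.0.0.1:2379 \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
    --key=/etc/kubernetes/pki/etcd/healthcheck-client.key \
    endpoint health; do
  sleep 2
done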

You are looking at 2/3

With regard to timers, they are the result of a painful trade-off between different types of kubeadm users. We should be really careful in touching those values unless we find something badly wrong or we can assume that the change doesn't hurt users with slower devices (Raspberry Pi is the corner case in this area).

AFAIK, the liveness probe isn't the best candidate for this use case (detecting things that have not started); I'm also trying to understand if readiness/liveness can actually block traffic for a static pod using host networking...
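
For reference, the liveness probe kubeadm generates for the kube-apiserver static pod looks roughly like this (values from the 1.14/1.15 era and may differ per version; the host address is illustrative):

livenessProbe:
  httpGet:
    host: 172.17.0.7
    path: /healthz
    port: 6443
    scheme: HTTPS
  initialDelaySeconds: 15
  timeoutSeconds: 15
  failureThreshold: 8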

neolit123 (Member, Author) commented:

xref kubernetes/kubernetes#72102

neolit123 (Member, Author) commented:

> My understanding of Readiness is that it only works for Services, should it work here?
> Can we use more aggressive timers for the liveness probe?

Immediate answer: no, we cannot, as these are well balanced. We have found cases where the values do not match a use case, but normally there is another issue at play there.

aojea (Contributor) commented Jun 11, 2019:

I've looked at envoy and I'm afraid it cannot solve the problem; it seems it doesn't have (at least I couldn't find) an option to do HTTPS health checks against the API /healthz endpoint (it does HTTP only). If anybody wants to give it a shot, this is the configuration I've tried:

envoy config

docker run -d envoyproxy/envoy-alpine:v1.10.0
docker exec -it ENVOYCONTAINER sh
cat <<EOF > /etc/envoy/envoy.yaml
admin:
  access_log_path: "/dev/null"
  address:
    socket_address:
      protocol: TCP
      address: 127.0.0.1
      port_value: 9901
static_resources:
  listeners:
  - name: listener_0
    address:
      socket_address:
        protocol: TCP
        address: 0.0.0.0
        port_value: 10000
    filter_chains:
    - filters:
      - name: envoy.tcp_proxy
        typed_config:
          "@type": type.googleapis.com/envoy.config.filter.network.tcp_proxy.v2.TcpProxy
          stat_prefix: kind_tcp
          cluster: kind_cluster
  clusters:
  - name: kind_cluster
    connect_timeout: 0.25s
    type: STATIC
    lb_policy: ROUND_ROBIN
    load_assignment:
      cluster_name: kind_cluster
      endpoints:
      - lb_endpoints:
        - endpoint:
            address:
              socket_address:
                address: 172.16.0.4
                port_value: 6443
    health_checks:
      - timeout: 1s
        interval: 2s
        interval_jitter: 1s
        unhealthy_threshold: 1
        healthy_threshold: 3
        http_health_check:
          path: "/healthz"

exit

docker restart ENVOYCONTAINER 

I think we should go with options 1-3 mentioned by fabrizio and, if we want to work around the problem until we find the definitive solution, we can go for the active loop in #598.

aojea (Contributor) commented Jun 11, 2019:

/unassign @aojea

neolit123 (Member, Author) commented:

Can't seem to reproduce the problem after the workaround in kubernetes/kubernetes#78915.
cc @ereslibre

The issue of ConfigMap retries is now tracked in kubernetes/kubeadm#1606.

aojea (Contributor) commented Jun 14, 2019:

Thanks @neolit123 and @ereslibre, that's awesome.
Should we close if the jobs are stable for some days?

ereslibre commented:

> Should we close if the jobs are stable for some days?

I'll defer this decision to others. However, in my opinion, the workaround included in kubeadm was a temporary fix for the flakiness, and is meant to be reverted as far as we have discussed.

Ideally, during the 1.16 cycle we should be able to understand the source(s) of the problems with concurrent joins, and as a result of those investigations it could happen that they circle back to kind (e.g. the LB would need to perform health checks, so we only route traffic to initialized apiservers).

BenTheElder (Member) commented Jun 14, 2019:

NOTE: kubespray also has a mode with an internal nginx LB...
https://github.com/kubernetes-sigs/kubespray/blob/master/docs/ha-mode.md#kube-apiserver

aojea (Contributor) commented Jun 14, 2019:

/assign @aojea

I was talking about this today with @ereslibre. Despite all the issues, one fact is that the API server is advertising the correct status on the healthz endpoint, and that's info that the LB should honor.

After trying different LBs, it seems that the only open-source LBs that allow health checks on an HTTPS endpoint with HTTPS passthrough load balancing are:

  • haproxy: we have to overcome the limitation that the default image comes without a config and dies ... maybe we can boot it with a sleep command ??
  • nginx: only the paid version, or installing extra modules, allows active health checks

BenTheElder (Member) commented:

> haproxy: we have to overcome the limitation that the default image comes without a config and dies ... maybe we can boot it with a sleep command ??

The standard docs show adding your config to the image. You can put in a dummy config. If we boot with a sleep, then restart doesn't work properly. Instead we can:

  • add a dummy config to our custom image kindest/haproxy
  • rewrite the config at runtime by writing it into the container
  • use SIGHUP to reload the config

We actually had this working at one point already :-)
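
A sketch of that flow, assuming the custom image ships a dummy config at the official image's config path (file and container names here are illustrative):

# at cluster create time: write the real config into the running
# load balancer container, then signal haproxy to reload it
docker cp haproxy.cfg kind-external-load-balancer:/usr/local/etc/haproxy/haproxy.cfg
docker kill -s HUP kind-external-load-balancer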

BenTheElder (Member) commented:

#645 implemented haproxy with healthchecks per #588 (comment)

BenTheElder added the lifecycle/active label Jun 24, 2019
BenTheElder (Member) commented:

So far I haven't gotten it to flake; still "soak testing" this locally 🙃

neolit123 (Member, Author) commented:

#588 (comment)

If using k/k from master, our workaround in kubernetes/kubernetes#78915 might be kicking in.

neolit123 (Member, Author) commented:

Given #645 merged, we can locally revert kubernetes/kubernetes#78915 and test again for flakes.

But this issue can also be closed, as kind is now doing what is needed (e.g. using haproxy)?

ereslibre commented:

@neolit123 absolutely, I will start investigating in my spare time. Will ping you back with my findings.

BenTheElder (Member) commented:

> If using k/k from master, our workaround in kubernetes/kubernetes#78915 might be kicking in.

I'm testing with v1.14.2 currently (the default image until we bump it... about to do that).

BenTheElder (Member) commented:

$ docker logs kind-external-load-balancer 
[WARNING] 174/214223 (1) : config : missing timeouts for frontend 'controlPlane'.
   | While not properly invalid, you will certainly encounter various problems
   | with such a configuration. To fix this, please ensure that all following
   | timeouts are set to a non-zero value: 'client', 'connect', 'server'.
[NOTICE] 174/214223 (1) : New worker #1 (6) forked
[WARNING] 174/214225 (1) : Reexecuting Master process
[ALERT] 174/214225 (1) : sendmsg()/writev() failed in logger #1: No such file or directory (errno=2)
[WARNING] 174/214225 (6) : Stopping frontend controlPlane in 0 ms.
[WARNING] 174/214225 (6) : Stopping backend kube-apiservers in 0 ms.
[WARNING] 174/214225 (6) : Stopping frontend GLOBAL in 0 ms.
[WARNING] 174/214225 (6) : Proxy controlPlane stopped (FE: 0 conns, BE: 0 conns).
[NOTICE] 174/214225 (1) : New worker #1 (22) forked
[WARNING] 174/214225 (6) : Proxy kube-apiservers stopped (FE: 0 conns, BE: 0 conns).
[WARNING] 174/214225 (6) : Proxy GLOBAL stopped (FE: 0 conns, BE: 0 conns).
[WARNING] 174/214225 (22) : Server kube-apiservers/kind-control-plane is DOWN, reason: Layer4 connection problem, info: "SSL handshake failure (Connection refused)", check duration: 0ms. 2 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
[WARNING] 174/214225 (1) : Former worker #1 (6) exited with code 0 (Exit)
[WARNING] 174/214226 (22) : Server kube-apiservers/kind-control-plane2 is DOWN, reason: Layer4 connection problem, info: "SSL handshake failure (Connection refused)", check duration: 0ms. 1 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
[WARNING] 174/214226 (22) : Server kube-apiservers/kind-control-plane3 is DOWN, reason: Layer4 connection problem, info: "SSL handshake failure (Connection refused)", check duration: 0ms. 0 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
[ALERT] 174/214226 (22) : backend 'kube-apiservers' has no server available!
[WARNING] 174/214250 (22) : Server kube-apiservers/kind-control-plane is UP, reason: Layer7 check passed, code: 200, info: "OK", check duration: 2ms. 1 active and 0 backup servers online. 0 sessions requeued, 0 total in queue.
[WARNING] 174/214336 (22) : Server kube-apiservers/kind-control-plane is DOWN, reason: Layer7 timeout, check duration: 2001ms. 0 active and 0 backup servers left. 19 sessions active, 0 requeued, 0 remaining in queue.
[ALERT] 174/214336 (22) : backend 'kube-apiservers' has no server available!
[WARNING] 174/214341 (22) : Server kube-apiservers/kind-control-plane is UP, reason: Layer7 check passed, code: 200, info: "OK", check duration: 3ms. 1 active and 0 backup servers online. 0 sessions requeued, 0 total in queue.
[WARNING] 174/214344 (22) : Server kube-apiservers/kind-control-plane2 is UP, reason: Layer7 check passed, code: 200, info: "OK", check duration: 2ms. 2 active and 0 backup servers online. 0 sessions requeued, 0 total in queue.
[WARNING] 174/214413 (22) : Server kube-apiservers/kind-control-plane3 is UP, reason: Layer7 check passed, code: 200, info: "OK", check duration: 4ms. 3 active and 0 backup servers online. 0 sessions requeued, 0 total in queue.

BenTheElder (Member) commented:

I think we can close this now; please re-open or file a new issue if you see further problems!
Thanks everyone!
