Cluster cannot be created because of (second) server node's issue #619

Closed · nguyenvulong opened this issue Jun 2, 2021 · 28 comments
Labels: help wanted, question


nguyenvulong commented Jun 2, 2021

Original problem: I tried to create a cluster with multiple nodes on a single physical machine, using this command:

k3d cluster create -a3 -s3 test3 --trace

and the process got stuck forever at this point, while the cluster was firing up the second server node:

DEBU[0005] Waiting for node k3d-test3-server-0 to get ready (Log: 'Running kube-apiserver') 
TRAC[0005] NodeWaitForLogMessage: Node 'k3d-test3-server-0' waiting for log message 'Running kube-apiserver' since '2021-06-02 12:45:02.516841645 +0000 UTC' 
DEBU[0006] Finished waiting for log message 'Running kube-apiserver' from node 'k3d-test3-server-0' 
....
DEBU[0007] Waiting for node k3d-test3-server-1 to get ready (Log: 'k3s is up and running') 
TRAC[0007] NodeWaitForLogMessage: Node 'k3d-test3-server-1' waiting for log message 'k3s is up and running' since '2021-06-02 12:45:04.23375296 +0000 UTC

It took a while for k3d-test3-server-1 (the second server node) to start (it kept restarting several times before that). As a result, the other nodes (both servers and agents) could not start up. You can see the "created" status in the docker ps output:

[screenshot]

I have tried purging all Docker packages and reinstalling them (both docker-ce and the apt package). Nothing has worked so far; it gets stuck there forever. The weird thing is that my other Ubuntu servers did not have this problem, so I suspect it is a local issue - BUT I have reinstalled Docker and even switched to another Linux user just to make sure there is nothing between k3d and Docker. Any suggestions for debugging are really appreciated.

k3d version

k3d version v4.4.4
k3s version v1.20.6-k3s1 (default)

OS version

LSB Version:	core-9.20170808ubuntu1-noarch:security-9.20170808ubuntu1-noarch
Distributor ID:	Ubuntu
Description:	Ubuntu 18.04.3 LTS
Release:	18.04
Codename:	bionic

Docker Info:

Client:
 Context:    default
 Debug Mode: false

Server:
 Containers: 11
  Running: 2
  Paused: 0
  Stopped: 9
 Images: 2
 Server Version: 20.10.2
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Native Overlay Diff: true
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Cgroup Version: 1
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: runc io.containerd.runc.v2 io.containerd.runtime.v1.linux
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 
 runc version: 
 init version: 
 Security Options:
  apparmor
  seccomp
   Profile: default
 Kernel Version: 5.4.0-66-generic
 Operating System: Ubuntu 18.04.3 LTS
 OSType: linux
 Architecture: x86_64
 CPUs: 16
 Total Memory: 31.26GiB
 Name: AMC_Test
 ID: HTTK:CWTG:YLKI:Q2AH:5MXF:MVOE:XHFV:V6BF:ZXDX:ZGZ2:EDWM:ZIIE
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false

WARNING: No swap limit support
WARNING: No blkio weight support
WARNING: No blkio weight_device support

Additional logs

longnv@AMC_Test  ~  docker logs 0faef53a27b0           
time="2021-06-02T12:45:04.467620702Z" level=info msg="Starting k3s v1.20.6+k3s1 (8d043282)"
time="2021-06-02T12:45:04.517018890Z" level=fatal msg="starting kubernetes: preparing server: failed to get CA certs: https://k3d-test3-server-0:6443/cacerts: 503 Service Unavailable"
time="2021-06-02T12:45:05.755968864Z" level=info msg="Starting k3s v1.20.6+k3s1 (8d043282)"
time="2021-06-02T12:45:05.801646534Z" level=fatal msg="starting kubernetes: preparing server: failed to get CA certs: https://k3d-test3-server-0:6443/cacerts: 503 Service Unavailable"
time="2021-06-02T12:45:06.947352136Z" level=info msg="Starting k3s v1.20.6+k3s1 (8d043282)"
time="2021-06-02T12:45:06.993416610Z" level=fatal msg="starting kubernetes: preparing server: failed to get CA certs: https://k3d-test3-server-0:6443/cacerts: 503 Service Unavailable"
time="2021-06-02T12:45:08.305266405Z" level=info msg="Starting k3s v1.20.6+k3s1 (8d043282)"
time="2021-06-02T12:45:08.347054640Z" level=fatal msg="starting kubernetes: preparing server: failed to get CA certs: https://k3d-test3-server-0:6443/cacerts: 503 Service Unavailable"
time="2021-06-02T12:45:10.095231184Z" level=info msg="Starting k3s v1.20.6+k3s1 (8d043282)"
time="2021-06-02T12:45:10.144121231Z" level=warning msg="Cluster CA certificate is not trusted by the host CA bundle, but the token does not include a CA hash. Use the full token from the server's node-token file to enable Cluster CA validation."
time="2021-06-02T12:45:10.144182981Z" level=info msg="Managed etcd cluster not yet initialized"
time="2021-06-02T12:45:10.148323728Z" level=info msg="certificate CN=system:admin,O=system:masters signed by CN=k3s-client-ca@1622637902: notBefore=2021-06-02 12:45:02 +0000 UTC notAfter=2022-06-02 12:45:10 +0000 UTC"
time="2021-06-02T12:45:10.148965934Z" level=info msg="certificate CN=system:kube-controller-manager signed by CN=k3s-client-ca@1622637902: notBefore=2021-06-02 12:45:02 +0000 UTC notAfter=2022-06-02 12:45:10 +0000 UTC"
time="2021-06-02T12:45:10.149625481Z" level=info msg="certificate CN=system:kube-scheduler signed by CN=k3s-client-ca@1622637902: notBefore=2021-06-02 12:45:02 +0000 UTC notAfter=2022-06-02 12:45:10 +0000 UTC"
time="2021-06-02T12:45:10.150251560Z" level=info msg="certificate CN=kube-apiserver signed by CN=k3s-client-ca@1622637902: notBefore=2021-06-02 12:45:02 +0000 UTC notAfter=2022-06-02 12:45:10 +0000 UTC"
time="2021-06-02T12:45:10.150885419Z" level=info msg="certificate CN=system:kube-proxy signed by CN=k3s-client-ca@1622637902: notBefore=2021-06-02 12:45:02 +0000 UTC notAfter=2022-06-02 12:45:10 +0000 UTC"
time="2021-06-02T12:45:10.151474780Z" level=info msg="certificate CN=system:k3s-controller signed by CN=k3s-client-ca@1622637902: notBefore=2021-06-02 12:45:02 +0000 UTC notAfter=2022-06-02 12:45:10 +0000 UTC"
time="2021-06-02T12:45:10.152189659Z" level=info msg="certificate CN=cloud-controller-manager signed by CN=k3s-client-ca@1622637902: notBefore=2021-06-02 12:45:02 +0000 UTC notAfter=2022-06-02 12:45:10 +0000 UTC"
time="2021-06-02T12:45:10.152901356Z" level=info msg="certificate CN=kube-apiserver signed by CN=k3s-server-ca@1622637902: notBefore=2021-06-02 12:45:02 +0000 UTC notAfter=2022-06-02 12:45:10 +0000 UTC"
time="2021-06-02T12:45:10.153597948Z" level=info msg="certificate CN=system:auth-proxy signed by CN=k3s-request-header-ca@1622637902: notBefore=2021-06-02 12:45:02 +0000 UTC notAfter=2022-06-02 12:45:10 +0000 UTC"
time="2021-06-02T12:45:10.154228404Z" level=info msg="certificate CN=etcd-server signed by CN=etcd-server-ca@1622637902: notBefore=2021-06-02 12:45:02 +0000 UTC notAfter=2022-06-02 12:45:10 +0000 UTC"
time="2021-06-02T12:45:10.154830831Z" level=info msg="certificate CN=etcd-client signed by CN=etcd-server-ca@1622637902: notBefore=2021-06-02 12:45:02 +0000 UTC notAfter=2022-06-02 12:45:10 +0000 UTC"
time="2021-06-02T12:45:10.155457319Z" level=info msg="certificate CN=etcd-peer signed by CN=etcd-peer-ca@1622637902: notBefore=2021-06-02 12:45:02 +0000 UTC notAfter=2022-06-02 12:45:10 +0000 UTC"
time="2021-06-02T12:45:10.157501228Z" level=info msg="certificate CN=k3s,O=k3s signed by CN=k3s-server-ca@1622637902: notBefore=2021-06-02 12:45:02 +0000 UTC notAfter=2022-06-02 12:45:10 +0000 UTC"
time="2021-06-02T12:45:10.158114939Z" level=info msg="Active TLS secret  (ver=) (count 8): map[listener.cattle.io/cn-0.0.0.0:0.0.0.0 listener.cattle.io/cn-10.43.0.1:10.43.0.1 listener.cattle.io/cn-127.0.0.1:127.0.0.1 listener.cattle.io/cn-172.19.0.3:172.19.0.3 listener.cattle.io/cn-kubernetes:kubernetes listener.cattle.io/cn-kubernetes.default:kubernetes.default listener.cattle.io/cn-kubernetes.default.svc.cluster.local:kubernetes.default.svc.cluster.local listener.cattle.io/cn-localhost:localhost listener.cattle.io/fingerprint:SHA1=8E02A6749D1D91B3184B0B5B3517BD3A59F159D6]"
time="2021-06-02T12:45:10.166925515Z" level=info msg="Adding https://172.19.0.3:2380 to etcd cluster [k3d-test3-server-0-673cddf7=https://172.19.0.2:2380]"
time="2021-06-02T12:45:10.171117654Z" level=info msg="Starting etcd for cluster [k3d-test3-server-0-673cddf7=https://172.19.0.2:2380 k3d-test3-server-1-d28439f8=https://172.19.0.3:2380]"
{"level":"info","ts":"2021-06-02T12:45:10.173Z","caller":"embed/etcd.go:117","msg":"configuring peer listeners","listen-peer-urls":["https://172.19.0.3:2380"]}
{"level":"info","ts":"2021-06-02T12:45:10.173Z","caller":"embed/etcd.go:468","msg":"starting with peer TLS","tls-info":"cert = /var/lib/rancher/k3s/server/tls/etcd/peer-server-client.crt, key = /var/lib/rancher/k3s/server/tls/etcd/peer-server-client.key, trusted-ca = /var/lib/rancher/k3s/server/tls/etcd/peer-ca.crt, client-cert-auth = true, crl-file = ","cipher-suites":[]}
{"level":"info","ts":"2021-06-02T12:45:10.174Z","caller":"embed/etcd.go:127","msg":"configuring client listeners","listen-client-urls":["https://127.0.0.1:2379","https://172.19.0.3:2379"]}
{"level":"info","ts":"2021-06-02T12:45:10.174Z","caller":"embed/etcd.go:302","msg":"starting an etcd server","etcd-version":"3.4.13","git-sha":"Not provided (use ./build instead of go build)","go-version":"go1.15.10","go-os":"linux","go-arch":"amd64","max-cpu-set":16,"max-cpu-available":16,"member-initialized":false,"name":"k3d-test3-server-1-d28439f8","data-dir":"/var/lib/rancher/k3s/server/db/etcd","wal-dir":"","wal-dir-dedicated":"","member-dir":"/var/lib/rancher/k3s/server/db/etcd/member","force-new-cluster":false,"heartbeat-interval":"500ms","election-timeout":"5s","initial-election-tick-advance":true,"snapshot-count":100000,"snapshot-catchup-entries":5000,"initial-advertise-peer-urls":["http://localhost:2380"],"listen-peer-urls":["https://172.19.0.3:2380"],"advertise-client-urls":["https://172.19.0.3:2379"],"listen-client-urls":["https://127.0.0.1:2379","https://172.19.0.3:2379"],"listen-metrics-urls":["http://127.0.0.1:2381"],"cors":["*"],"host-whitelist":["*"],"initial-cluster":"k3d-test3-server-0-673cddf7=https://172.19.0.2:2380,k3d-test3-server-1-d28439f8=https://172.19.0.3:2380","initial-cluster-state":"existing","initial-cluster-token":"etcd-cluster","quota-size-bytes":2147483648,"pre-vote":false,"initial-corrupt-check":false,"corrupt-check-time-interval":"0s","auto-compaction-mode":"","auto-compaction-retention":"0s","auto-compaction-interval":"0s","discovery-url":"","discovery-proxy":""}
{"level":"info","ts":"2021-06-02T12:45:10.177Z","caller":"etcdserver/backend.go:80","msg":"opened backend db","path":"/var/lib/rancher/k3s/server/db/etcd/member/snap/db","took":"2.328973ms"}
{"level":"warn","ts":"2021-06-02T12:45:10.245Z","caller":"etcdserver/cluster_util.go:76","msg":"failed to get cluster response","address":"https://172.19.0.2:2380/members","error":"Get \"https://172.19.0.2:2380/members\": EOF"}
{"level":"info","ts":"2021-06-02T12:45:10.246Z","caller":"embed/etcd.go:363","msg":"closing etcd server","name":"k3d-test3-server-1-d28439f8","data-dir":"/var/lib/rancher/k3s/server/db/etcd","advertise-peer-urls":["http://localhost:2380"],"advertise-client-urls":["https://172.19.0.3:2379"]}
{"level":"info","ts":"2021-06-02T12:45:10.247Z","caller":"embed/etcd.go:367","msg":"closed etcd server","name":"k3d-test3-server-1-d28439f8","data-dir":"/var/lib/rancher/k3s/server/db/etcd","advertise-peer-urls":["http://localhost:2380"],"advertise-client-urls":["https://172.19.0.3:2379"]}
time="2021-06-02T12:45:10.247410008Z" level=info msg="Running kube-apiserver --advertise-port=6443 --allow-privileged=true --anonymous-auth=false --api-audiences=https://kubernetes.default.svc.cluster.local,k3s --authorization-mode=Node,RBAC --bind-address=127.0.0.1 --cert-dir=/var/lib/rancher/k3s/server/tls/temporary-certs --client-ca-file=/var/lib/rancher/k3s/server/tls/client-ca.crt --enable-admission-plugins=NodeRestriction --etcd-cafile=/var/lib/rancher/k3s/server/tls/etcd/server-ca.crt --etcd-certfile=/var/lib/rancher/k3s/server/tls/etcd/client.crt --etcd-keyfile=/var/lib/rancher/k3s/server/tls/etcd/client.key --etcd-servers=https://127.0.0.1:2379 --feature-gates=ServiceAccountIssuerDiscovery=false --insecure-port=0 --kubelet-certificate-authority=/var/lib/rancher/k3s/server/tls/server-ca.crt --kubelet-client-certificate=/var/lib/rancher/k3s/server/tls/client-kube-apiserver.crt --kubelet-client-key=/var/lib/rancher/k3s/server/tls/client-kube-apiserver.key --profiling=false --proxy-client-cert-file=/var/lib/rancher/k3s/server/tls/client-auth-proxy.crt --proxy-client-key-file=/var/lib/rancher/k3s/server/tls/client-auth-proxy.key --requestheader-allowed-names=system:auth-proxy --requestheader-client-ca-file=/var/lib/rancher/k3s/server/tls/request-header-ca.crt --requestheader-extra-headers-prefix=X-Remote-Extra- --requestheader-group-headers=X-Remote-Group --requestheader-username-headers=X-Remote-User --secure-port=6444 --service-account-issuer=https://kubernetes.default.svc.cluster.local --service-account-key-file=/var/lib/rancher/k3s/server/tls/service.key --service-account-signing-key-file=/var/lib/rancher/k3s/server/tls/service.key --service-cluster-ip-range=10.43.0.0/16 --service-node-port-range=30000-32767 --storage-backend=etcd3 --tls-cert-file=/var/lib/rancher/k3s/server/tls/serving-kube-apiserver.crt --tls-private-key-file=/var/lib/rancher/k3s/server/tls/serving-kube-apiserver.key"
{"level":"warn","ts":"2021-06-02T12:45:10.247Z","caller":"grpclog/grpclog.go:60","msg":"grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:2379  <nil> 0 <nil>}. Err :connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused\". Reconnecting..."}
{"level":"warn","ts":"2021-06-02T12:45:20.247Z","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"passthrough:///https://127.0.0.1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused\""}
time="2021-06-02T12:45:20.247501638Z" level=info msg="Failed to test data store connection: context deadline exceeded"
{"level":"warn","ts":"2021-06-02T12:45:25.157Z","caller":"grpclog/grpclog.go:60","msg":"grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:2379  <nil> 0 <nil>}. Err :connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused\". Reconnecting..."}

{"level":"warn","ts":"2021-06-02T12:45:35.247Z","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"passthrough:///https://127.0.0.1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused\""}
time="2021-06-02T12:45:35.247878656Z" level=info msg="Failed to test data store connection: context deadline exceeded"
{"level":"warn","ts":"2021-06-02T12:45:35.985Z","caller":"grpclog/grpclog.go:60","msg":"grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:2379  <nil> 0 <nil>}. Err :connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused\". Reconnecting..."}

nguyenvulong added the question label Jun 2, 2021

nguyenvulong commented Jun 3, 2021

The second server node's (server-1) creation was unstable (up, then restarting, then up again, then off).

[screenshot]

logs

[screenshot]

whereas on the first server node (server-0):


 longnv@AMC_Test  ~  docker logs k3d-test4-server-0                     
time="2021-06-03T06:34:28.634909121Z" level=info msg="Starting k3s v1.20.6+k3s1 (8d043282)"
time="2021-06-03T06:34:28.635415960Z" level=info msg="Managed etcd cluster initializing"
time="2021-06-03T06:34:28.652940033Z" level=info msg="certificate CN=system:admin,O=system:masters signed by CN=k3s-client-ca@1622702068: notBefore=2021-06-03 06:34:28 +0000 UTC notAfter=2022-06-03 06:34:28 +0000 UTC"
time="2021-06-03T06:34:28.653570579Z" level=info msg="certificate CN=system:kube-controller-manager signed by CN=k3s-client-ca@1622702068: notBefore=2021-06-03 06:34:28 +0000 UTC notAfter=2022-06-03 06:34:28 +0000 UTC"

time="2021-06-03T06:34:31.191665668Z" level=info msg="Waiting for API server to become available"
time="2021-06-03T06:34:31.192695596Z" level=info msg="Running kube-controller-manager --address=127.0.0.1 --allocate-node-cidrs=true --bind-address=127.0.0.1 --cluster-cidr=10.42.0.0/16 --cluster-signing-cert-file=/var/lib/rancher/k3s/server/tls/client-ca.crt --cluster-signing-key-file=/var/lib/rancher/k3s/server/tls/client-ca.key --configure-cloud-routes=false --controllers=*,-service,-route,-cloud-node-lifecycle --kubeconfig=/var/lib/rancher/k3s/server/cred/controller.kubeconfig --port=10252 --profiling=false --root-ca-file=/var/lib/rancher/k3s/server/tls/server-ca.crt --secure-port=0 --service-account-private-key-file=/var/lib/rancher/k3s/server/tls/service.key --use-service-account-credentials=true"
time="2021-06-03T06:34:31.194879272Z" level=info msg="Node token is available at /var/lib/rancher/k3s/server/token"
time="2021-06-03T06:34:31.194940316Z" level=info msg="To join node to cluster: k3s agent -s https://172.17.0.3:6443 -t ${NODE_TOKEN}"

time="2021-06-03T06:34:31.258702206Z" level=info msg="Cluster-Http-Server 2021/06/03 06:34:31 http: TLS handshake error from 127.0.0.1:57182: remote error: tls: bad certificate"
time="2021-06-03T06:34:31.266185845Z" level=info msg="Cluster-Http-Server 2021/06/03 06:34:31 http: TLS handshake error from 127.0.0.1:57188: remote error: tls: bad certificate"
time="2021-06-03T06:34:31.283555019Z" level=info msg="certificate CN=k3d-test4-server-0 signed by CN=k3s-server-ca@1622702068: notBefore=2021-06-03 06:34:28 +0000 UTC notAfter=2022-06-03 06:34:31 +0000 UTC"
time="2021-06-03T06:34:31.287757212Z" level=info msg="certificate CN=system:node:k3d-test4-server-0,O=system:nodes signed by CN=k3s-client-ca@1622702068: notBefore=2021-06-03 06:34:28 +0000 UTC notAfter=2022-06-03 06:34:31 +0000 UTC"
time="2021-06-03T06:34:31.303917155Z" level=info msg="Logging containerd to /var/lib/rancher/k3s/agent/containerd/containerd.log"
time="2021-06-03T06:34:31.304105936Z" level=info msg="Running containerd -c /var/lib/rancher/k3s/agent/etc/containerd/config.toml -a /run/k3s/containerd/containerd.sock --state /run/k3s/containerd --root /var/lib/rancher/k3s/agent/containerd"
{"level":"warn","ts":"2021-06-03T06:34:31.304Z","caller":"grpclog/grpclog.go:60","msg":"grpc: addrConn.createTransport failed to connect to {/run/k3s/containerd/containerd.sock  <nil> 0 <nil>}. Err :connection error: desc = \"transport: Error while dialing dial unix /run/k3s/containerd/containerd.sock: connect: no such file or directory\". Reconnecting..."}
time="2021-06-03T06:34:32.306542001Z" level=info msg="Containerd is now running"
time="2021-06-03T06:34:32.325151498Z" level=info msg="Connecting to proxy" url="wss://127.0.0.1:6443/v1-k3s/connect"
time="2021-06-03T06:34:32.328649824Z" level=info msg="Handling backend connection request [k3d-test4-server-0]"
time="2021-06-03T06:34:32.329658931Z" level=warning msg="Disabling CPU quotas due to missing cpu.cfs_period_us"
time="2021-06-03T06:34:32.329802862Z" level=info msg="Running kubelet --address=0.0.0.0 --anonymous-auth=false --authentication-token-webhook=true --authorization-mode=Webhook --cgroup-driver=cgroupfs --client-ca-file=/var/lib/rancher/k3s/agent/client-ca.crt --cloud-provider=external --cluster-dns=10.43.0.10 --cluster-domain=cluster.local --cni-bin-dir=/bin --cni-conf-dir=/var/lib/rancher/k3s/agent/etc/cni/net.d --container-runtime-endpoint=unix:///run/k3s/containerd/containerd.sock --container-runtime=remote --containerd=/run/k3s/containerd/containerd.sock --cpu-cfs-quota=false --eviction-hard=imagefs.available<5%,nodefs.available<5% --eviction-minimum-reclaim=imagefs.available=10%,nodefs.available=10% --fail-swap-on=false --healthz-bind-address=127.0.0.1 --hostname-override=k3d-test4-server-0 --kubeconfig=/var/lib/rancher/k3s/agent/kubelet.kubeconfig --kubelet-cgroups=/k3s --node-labels= --pod-manifest-path=/var/lib/rancher/k3s/agent/pod-manifests --read-only-port=0 --resolv-conf=/etc/resolv.conf --runtime-cgroups=/k3s --serialize-image-pulls=false --tls-cert-file=/var/lib/rancher/k3s/agent/serving-kubelet.crt --tls-private-key-file=/var/lib/rancher/k3s/agent/serving-kubelet.key"
time="2021-06-03T06:34:32.331056447Z" level=info msg="Running kube-proxy --cluster-cidr=10.42.0.0/16 --conntrack-max-per-core=0 --healthz-bind-address=127.0.0.1 --hostname-override=k3d-test4-server-0 --kubeconfig=/var/lib/rancher/k3s/agent/kubeproxy.kubeconfig --proxy-mode=iptables"
Flag --cloud-provider has been deprecated, will be removed in 1.23, in favor of removing cloud provider code from Kubelet.
Flag --containerd has been deprecated, This is a cadvisor flag that was mistakenly registered with the Kubelet. Due to legacy concerns, it will follow the standard CLI deprecation timeline before being removed.

W0603 06:34:49.523374       7 shared_informer.go:494] resyncPeriod 14h11m56.881090259s is smaller than resyncCheckPeriod 21h33m50.047935796s and the informer has already started. Changing it to 21h33m50.047935796s

I0603 06:35:25.088367       7 reconciler.go:319] Volume detached for volume "values" (UniqueName: "kubernetes.io/configmap/bd9c276a-142a-4b4f-bbea-40f09cef112c-values") on node "k3d-test4-server-0" DevicePath ""
I0603 06:35:25.088407       7 reconciler.go:319] Volume detached for volume "helm-traefik-token-2264x" (UniqueName: "kubernetes.io/secret/bd9c276a-142a-4b4f-bbea-40f09cef112c-helm-traefik-token-2264x") on node "k3d-test4-server-0" DevicePath ""
I0603 06:35:25.088424       7 reconciler.go:319] Volume detached for volume "content" (UniqueName: "kubernetes.io/configmap/bd9c276a-142a-4b4f-bbea-40f09cef112c-content") on node "k3d-test4-server-0" DevicePath ""
W0603 06:35:25.903656       7 pod_container_deletor.go:79] Container "febbf7e0e21b940017fa14c0aa5ccd549d101cf8bc2f2e3f7a4bcab30cee80f5" not found in pod's containers
W0603 06:39:37.429666       7 info.go:53] Couldn't collect info from any of the files in "/etc/machine-id,/var/lib/dbus/machine-id"


nguyenvulong commented Jun 3, 2021

Tested with other versions (v4.4.0 and v3.4.0); the same problem happened. It always stopped at the second server node.

[screenshot]


setting K3D_FIX_CGROUPV2=true
k3d version v4.4.4
k3s version v1.20.6-k3s1 (default)
[screenshot]


nguyenvulong commented Jun 3, 2021

@iwilltry42 sorry for bothering you, but if there are any debugging tricks I can use, please suggest them.
I have tried docker logs, --trace, dockerd -D (daemon debug mode), syslog, ... none of these helped me find the culprit.
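
For reference, a rough sketch of those debugging commands as they apply here (container and cluster names are taken from the output above; the exact invocations are assumptions, not something k3d prescribes):

# follow the logs of the failing node
docker logs -f k3d-test3-server-1

# re-run the creation with trace-level k3d logging
k3d cluster create -a3 -s3 test3 --trace

# run the Docker daemon in debug mode (stop the docker service first)
sudo dockerd -D

# watch the system journal / syslog for Docker messages
journalctl -u docker -f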

Similar issue(s) - I'll keep updating this list:

#592
#538
#621

@iwilltry42 (Member)

Hi @nguyenvulong , thanks for opening this issue!
At first sight, I cannot find any obvious issue 🤔
However, there have recently been quite a few upstream changes in k3s to fix various OS-level issues (kernel changes that broke things, upstream k8s changes that had to be fixed, etc.).
Maybe you can use the --image flag to test one of the newer k3s releases?
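
For example, something along these lines (the k3s image tag below is only an illustration; pick whichever newer release you want to test):

k3d cluster create -a3 -s3 test3 --image rancher/k3s:v1.21.1-k3s1 --trace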


nguyenvulong commented Jun 4, 2021

Thanks a lot for your time.
Trying with the newest image:

[screenshot]

TL;DR: same logs.


Jasper-Ben commented Jun 4, 2021

Running into the same issue. Running with the env variable K3D_FIX_CGROUPV2=1 set, I was able to get past the server creation part, but now I'm stuck waiting for the second agent to get ready.
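
For reference, that looks roughly like this (cluster name and node counts are just examples):

K3D_FIX_CGROUPV2=1 k3d cluster create -a3 -s3 test3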

Edit: It actually seems like node-0 is in a crashloop:

Flag --address has been deprecated, see --bind-address instead.
I0604 22:49:41.463769       7 controllermanager.go:176] Version: v1.20.6+k3s1
I0604 22:49:41.464096       7 deprecated_insecure_serving.go:53] Serving insecurely on 127.0.0.1:10252
I0604 22:49:41.464130       7 leaderelection.go:243] attempting to acquire leader lease kube-system/kube-controller-manager...
W0604 22:49:41.487265       7 authorization.go:47] Authorization is disabled
W0604 22:49:41.487293       7 authentication.go:40] Authentication is disabled
I0604 22:49:41.487308       7 deprecated_insecure_serving.go:51] Serving healthz insecurely on 127.0.0.1:10251
time="2021-06-04T22:49:41.489954029Z" level=info msg="Writing static file: /var/lib/rancher/k3s/server/static/charts/traefik-1.81.0.tgz"
time="2021-06-04T22:49:41.490241820Z" level=info msg="Writing manifest: /var/lib/rancher/k3s/server/manifests/rolebindings.yaml"
time="2021-06-04T22:49:41.490368071Z" level=info msg="Writing manifest: /var/lib/rancher/k3s/server/manifests/ccm.yaml"
time="2021-06-04T22:49:41.490472310Z" level=info msg="Writing manifest: /var/lib/rancher/k3s/server/manifests/metrics-server/auth-delegator.yaml"
time="2021-06-04T22:49:41.490578123Z" level=info msg="Writing manifest: /var/lib/rancher/k3s/server/manifests/metrics-server/metrics-apiservice.yaml"
time="2021-06-04T22:49:41.490687612Z" level=info msg="Writing manifest: /var/lib/rancher/k3s/server/manifests/metrics-server/auth-reader.yaml"
time="2021-06-04T22:49:41.490961797Z" level=info msg="Writing manifest: /var/lib/rancher/k3s/server/manifests/metrics-server/metrics-server-deployment.yaml"
time="2021-06-04T22:49:41.491075143Z" level=info msg="Writing manifest: /var/lib/rancher/k3s/server/manifests/metrics-server/metrics-server-service.yaml"
time="2021-06-04T22:49:41.491183851Z" level=info msg="Writing manifest: /var/lib/rancher/k3s/server/manifests/metrics-server/resource-reader.yaml"
time="2021-06-04T22:49:41.491411497Z" level=info msg="Writing manifest: /var/lib/rancher/k3s/server/manifests/traefik.yaml"
time="2021-06-04T22:49:41.491642979Z" level=info msg="Writing manifest: /var/lib/rancher/k3s/server/manifests/coredns.yaml"
time="2021-06-04T22:49:41.491835468Z" level=info msg="Writing manifest: /var/lib/rancher/k3s/server/manifests/local-storage.yaml"
time="2021-06-04T22:49:41.491941871Z" level=info msg="Writing manifest: /var/lib/rancher/k3s/server/manifests/metrics-server/aggregated-metrics-reader.yaml"
I0604 22:49:41.587506       7 leaderelection.go:243] attempting to acquire leader lease kube-system/kube-scheduler...
time="2021-06-04T22:49:41.592358201Z" level=info msg="Starting k3s.cattle.io/v1, Kind=Addon controller"
I0604 22:49:41.592418       7 leaderelection.go:243] attempting to acquire leader lease kube-system/k3s...
time="2021-06-04T22:49:41.592416702Z" level=info msg="Starting /v1, Kind=Node controller"
time="2021-06-04T22:49:41.596076723Z" level=info msg="Cluster dns configmap already exists"
I0604 22:49:41.603215       7 leaderelection.go:253] successfully acquired lease kube-system/k3s
time="2021-06-04T22:49:41.603467478Z" level=warning msg="Unable to fetch coredns config map: configmap \"coredns\" not found"
time="2021-06-04T22:49:41.603824200Z" level=warning msg="Unable to fetch coredns config map: configmap \"coredns\" not found"
I0604 22:49:41.613799       7 controller.go:609] quota admission added evaluator for: addons.k3s.cattle.io
I0604 22:49:41.674145       7 controller.go:609] quota admission added evaluator for: deployments.apps
time="2021-06-04T22:49:41.705348408Z" level=info msg="Starting helm.cattle.io/v1, Kind=HelmChartConfig controller"
time="2021-06-04T22:49:41.705404175Z" level=info msg="Starting /v1, Kind=Pod controller"
time="2021-06-04T22:49:41.705415847Z" level=info msg="Starting /v1, Kind=Endpoints controller"
time="2021-06-04T22:49:41.705394236Z" level=info msg="Starting /v1, Kind=Service controller"
time="2021-06-04T22:49:41.705386872Z" level=info msg="Starting batch/v1, Kind=Job controller"
time="2021-06-04T22:49:41.705367735Z" level=info msg="Starting helm.cattle.io/v1, Kind=HelmChart controller"
I0604 22:49:41.721060       7 controller.go:609] quota admission added evaluator for: helmcharts.helm.cattle.io
I0604 22:49:41.838616       7 node.go:172] Successfully retrieved node IP: 172.26.0.2
I0604 22:49:41.838650       7 server_others.go:143] kube-proxy node IP is an IPv4 address (172.26.0.2), assume IPv4 operation
I0604 22:49:41.839653       7 server_others.go:186] Using iptables Proxier.
I0604 22:49:41.839837       7 server.go:650] Version: v1.20.6+k3s1
I0604 22:49:41.840114       7 conntrack.go:103] Set sysctl 'net/netfilter/nf_conntrack_max' to 524288
F0604 22:49:41.840129       7 server.go:495] open /proc/sys/net/netfilter/nf_conntrack_max: permission denied


nguyenvulong commented Jun 5, 2021

Running into the same issue. Running with the env variable K3D_FIX_CGROUPV2=1 set, I was able to get past the server creation part, but now I'm stuck waiting for the second agent to get ready.

Edit: It actually seems like node-0 is in a crashloop:

Can you screenshot this?
In my case it got stuck at "Starting Node" (the second node).

@Jasper-Ben

Sure:

[screenshot]

For me, server-0 was stuck in a crashloop for quite some time (see server-0 log). But after a few minutes it eventually started successfully (see server-0-success log). Now server-1 is stuck in a crashloop (see server-1 log).

For completeness' sake, here are the logs of all the other running containers:
server-2 log
agent-0 log

Some background info on my system:

  • up-to-date Arch Linux
  • Kernel 5.12.8-arch1-1
  • k3d version v4.4.4
  • k3s version v1.20.6-k3s1 (default)
  • docker:
Client:
 Context:    default
 Debug Mode: false
 Plugins:
  app: Docker App (Docker Inc., v0.9.1-beta3)
  buildx: Build with BuildKit (Docker Inc., v0.5.1-tp-docker)

Server:
 Containers: 8
  Running: 4
  Paused: 0
  Stopped: 4
 Images: 140
 Server Version: 20.10.6
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Native Overlay Diff: false
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: systemd
 Cgroup Version: 2
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: io.containerd.runtime.v1.linux runc io.containerd.runc.v2
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 36cc874494a56a253cd181a1a685b44b58a2e34a.m
 runc version: b9ee9c6314599f1b4a7f497e1f1f856fe433d3b7
 init version: de40ad0
 Security Options:
  seccomp
   Profile: default
  cgroupns
 Kernel Version: 5.12.8-arch1-1
 Operating System: Arch Linux
 OSType: linux
 Architecture: x86_64
 CPUs: 16
 Total Memory: 62.3GiB
 Name: NB-2243
 ID: JYPS:XTIZ:P3QB:PARG:ATOW:SVJX:TNKP:COPU:6IAF:KTYS:E7DX:7U6G
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Username: jasperben
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false

WARNING: No kernel memory limit support
WARNING: No oom kill disable support


nguyenvulong commented Jun 5, 2021

@Jasper-Ben thanks, I checked your logs, and I think you have a different problem than mine; your log contains

F0605 09:33:43.803005 7 server.go:495] open /proc/sys/net/netfilter/nf_conntrack_max: permission denied

See #612; you may want to open a new issue if it does not help.
In your case you've made it further, as all of your servers eventually started. The nf_conntrack_max fix may help!
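
For reference, the workaround usually referenced alongside #612 looks roughly like the following (k3d v4 flag names; treat this as a sketch and double-check against that issue):

k3d cluster create test \
  --k3s-server-arg "--kube-proxy-arg=conntrack-max-per-core=0" \
  --k3s-agent-arg "--kube-proxy-arg=conntrack-max-per-core=0"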


Jasper-Ben commented Jun 5, 2021

@Jasper-Ben thanks, I checked your logs, and I think you have a different problem than mine; your log contains

F0605 09:33:43.803005 7 server.go:495] open /proc/sys/net/netfilter/nf_conntrack_max: permission denied

See #612; you may want to open a new issue if it does not help.
In your case you've made it further, as all of your servers eventually started. The nf_conntrack_max fix may help!

Thanks, I'll have a look 🙂

Edit: Workaround did the trick in my case! 🎉 I hope you find a solution to your problem soon 😕

@nguyenvulong (Author)

Glad to hear that, @Jasper-Ben.

I'll keep everyone updated on my case if I find something.


nguyenvulong commented Jun 7, 2021

After failing on one physical server, I tried to reproduce the experiment on another physical server; the result was "one is working and one is not".
So I decided to exec into the container shell (sh) of the two server nodes and compare the running processes inside.
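
For reference, that comparison boils down to something like this (cluster name taken from the earlier test; this assumes the busybox ps applet is available inside the k3s image):

docker exec -it k3d-test4-server-0 ps
docker exec -it k3d-test4-server-1 ps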

This is a normally running (physical) server: the left side is server-node-0, the right side is server-node-1. Pay attention to the running processes.

[screenshot]

This is the troublesome (physical) server where I got the issue; server-node-1 is not working properly.

The left side is server-node-0, the right side is server-node-1:

[screenshot]

It is obvious that some services are not running, namely:

containerd
/bin/containerd-shim-runc-v2 -namespace k8s.io -id
/pause
{entry} /bin/sh /usr/bin/entry
/traefik --configfile=/config/traefik.toml

Do you have any idea how to start them manually, @iwilltry42?
I would like to start the services inside node-1 manually for further debugging.
I don't really know why they were not started by k3d, though +_+


iwilltry42 commented Jun 7, 2021

Hi @nguyenvulong, thanks for investing the time to investigate here!
I'll have a closer look into this. Can you please paste the OS/Kernel/Docker details of both physical machines that you're running there?

I don't really know why they were not started by k3d though

k3d only starts k3s via the docker image; all the rest is started by k3s. That means that, according to the logs, the k3s bootstrap process doesn't finish successfully because it cannot connect to the initializing server node properly.

Maybe this is interesting for @brandond as well or he has some input from the logs? 🤔

UPDATE:

I feel like those log lines are not only "Info" level 🤔
(server-0)


time="2021-06-03T06:34:37.332814805Z" level=info msg="Stopped tunnel to 127.0.0.1:6443"
time="2021-06-03T06:34:37.333253412Z" level=info msg="Connecting to proxy" url="wss://172.17.0.3:6443/v1-k3s/connect"
time="2021-06-03T06:34:37.333705386Z" level=info msg="Proxy done" err="context canceled" url="wss://127.0.0.1:6443/v1-k3s/connect"
time="2021-06-03T06:34:37.334073910Z" level=info msg="error in remotedialer server [400]: websocket: close 1006 (abnormal closure): unexpected EOF"
...


nguyenvulong commented Jun 7, 2021

@iwilltry42 Actually I have tested with 4 physical machines. All the information is below (the failed one is server 2).

Let me summarize what I have done while trying to fix it:

  • Upgraded the kernel from 5.4 to 5.9 (both kernels gave the same result)
  • Purged and reinstalled docker.io, then did a fresh install (both the Debian release and docker-ce)
  • Tried multiple k3d versions and the newest image, and even rebuilt k3d from scratch
  • Went through all the logs while trying the different things above; they show the same errors. That's why, when I ran docker exec into the container (of the second server node) today, I realized the cause was that the services inside did not start (found by comparing with the other physical machines that work)
  • There is one difference I found in docker info: on physical server 2, at the end of the output, it says
WARNING: No blkio weight support
WARNING: No blkio weight_device support

(I have no idea about this one and am not sure it is even related, but at least the other three physical machines don't show it.)

  • In my humble opinion, I do not think the kernel version or the Ubuntu version would affect the creation of services inside a container. I focused on how to debug the container and its running services, but could not get enough from the logs.

And thank you so much for spending time on this. I have checked other issues in this repository and some of them may be related (see this comment)

Information of the physical machine 1 (working)

OS

No LSB modules are available.
Distributor ID:	Ubuntu
Description:	Ubuntu 16.04.2 LTS
Release:	16.04
Codename:	xenial

Kernel

Linux AMC 4.4.0-200-generic #232-Ubuntu SMP Wed Jan 13 10:18:39 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
Docker

Containers: 7
 Running: 7
 Paused: 0
 Stopped: 0
Images: 2
Server Version: 18.09.7
Storage Driver: overlay2
 Backing Filesystem: extfs
 Supports d_type: true
 Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 
runc version: N/A
init version: v0.18.0 (expected: fec3683b971d9c3ef73f284f176672c44b448662)
Security Options:
 apparmor
 seccomp
  Profile: default
Kernel Version: 4.4.0-200-generic
Operating System: Ubuntu 16.04.2 LTS
OSType: linux
Architecture: x86_64
CPUs: 16
Total Memory: 31.32GiB
Name: AMC
ID: UFUJ:3OIP:RGYE:SXC4:ZA4U:3Y5R:C4LM:VMUD:FNZU:T66Q:I6WH:IMGB
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false

WARNING: No swap limit support

Information of the physical machine 2 (NOT working)

OS

LSB Version:	core-9.20170808ubuntu1-noarch:security-9.20170808ubuntu1-noarch
Distributor ID:	Ubuntu
Description:	Ubuntu 18.04.3 LTS
Release:	18.04
Codename:	bionic

Kernel
Linux AMC_Test 5.9.0-050900-lowlatency #202010112230 SMP PREEMPT Sun Oct 11 22:37:09 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

Docker

Client:
 Context:    default
 Debug Mode: false

Server:
 Containers: 0
  Running: 0
  Paused: 0
  Stopped: 0
 Images: 0
 Server Version: 20.10.2
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Native Overlay Diff: true
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Cgroup Version: 1
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: io.containerd.runtime.v1.linux runc io.containerd.runc.v2
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 
 runc version: 
 init version: 
 Security Options:
  apparmor
  seccomp
   Profile: default
 Kernel Version: 5.9.0-050900-lowlatency
 Operating System: Ubuntu 18.04.3 LTS
 OSType: linux
 Architecture: x86_64
 CPUs: 16
 Total Memory: 31.25GiB
 Name: AMC_Test
 ID: RZ75:HWN5:HFEC:4JDL:3LUQ:SIRP:YLIQ:JPXX:ZM2K:2BPC:HEHI:TAGC
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false

WARNING: No blkio weight support
WARNING: No blkio weight_device support

Physical Machine 3 (working)

No LSB modules are available.
Distributor ID:	Ubuntu
Description:	Ubuntu 18.04.5 LTS
Release:	18.04
Codename:	bionic

Linux AISRC 5.4.0-72-generic #80~18.04.1-Ubuntu SMP Mon Apr 12 23:26:25 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

Client:
 Debug Mode: false

Server:
 Containers: 12
  Running: 12
  Paused: 0
  Stopped: 0
 Images: 5
 Server Version: 19.03.6
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Native Overlay Diff: true
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 
 runc version: 
 init version: 
 Security Options:
  apparmor
  seccomp
   Profile: default
 Kernel Version: 5.4.0-72-generic
 Operating System: Ubuntu 18.04.5 LTS
 OSType: linux
 Architecture: x86_64
 CPUs: 24
 Total Memory: 62.5GiB
 Name: AISRC
 ID: 4UIK:MMHI:WOCM:M4LL:W7RJ:WB24:JKYS:GR7D:LYHP:ZJ3F:L43H:2HSR
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false

WARNING: No swap limit support

Physical Machine 4 (working)

No LSB modules are available.
Distributor ID:	Ubuntu
Description:	Ubuntu 16.04.2 LTS
Release:	16.04
Codename:	xenial

Linux s4urc 4.4.0-200-generic #232-Ubuntu SMP Wed Jan 13 10:18:39 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

Client:
 Context:    default
 Debug Mode: false
 Plugins:
  app: Docker App (Docker Inc., v0.9.1-beta3)
  buildx: Build with BuildKit (Docker Inc., v0.5.1-docker)

Server:
 Containers: 12
  Running: 12
  Paused: 0
  Stopped: 0
 Images: 9
 Server Version: 20.10.5
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Native Overlay Diff: true
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Cgroup Version: 1
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: io.containerd.runc.v2 io.containerd.runtime.v1.linux runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 05f951a3781f4f2c1911b05e61c160e9c30eaa8e
 runc version: 12644e614e25b05da6fd08a38ffa0cfe1903fdec
 init version: de40ad0
 Security Options:
  apparmor
  seccomp
   Profile: default
 Kernel Version: 4.4.0-200-generic
 Operating System: Ubuntu 16.04.2 LTS
 OSType: linux
 Architecture: x86_64
 CPUs: 16
 Total Memory: 15.57GiB
 Name: s4urc
 ID: TSHZ:WQL3:NW4S:H2EY:XJDC:4L4R:BQU5:7VLT:GBK2:YTYU:XQ7Z:H3UR
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false

WARNING: No swap limit support

UPDATE:

I feel like those log lines are not only "Info" level 🤔
(server-0)

One of my wild guesses was that the second node (node-1) may have failed because of something to do with node-0, but let me see if I can find anything...


brandond commented Jun 7, 2021

Wait I'm confused, are you trying to make a k3d cluster out of 4 separate hosts? Normally you would use k3d to mock up a multi-node cluster on a single host; if you want to build a cluster out of multiple hosts why not just use K3s?


nguyenvulong commented Jun 8, 2021

Wait I'm confused, are you trying to make a k3d cluster out of 4 separate hosts? Normally you would use k3d to mock up a multi-node cluster on a single host; if you want to build a cluster out of multiple hosts why not just use K3s?

No, @brandond, I was trying to create a k3d cluster on a single physical machine. However, I could not do that due to some errors when k3d tried to spawn the second node (a server node, aka node-1, hence the title of this thread).

Because of that, I tried installing k3d onto 3 additional physical machines, independently, and repeated the same test. Their configurations are above. As you can see, only one physical machine had the problem; the rest were fine.

P.S.
Out of curiosity, I thought that maybe that physical machine had some package problem, so I tried to fix it by fresh-installing k3d, Docker, and other components as described above. Nothing worked, and I also found other issues in this repository that seem similar to mine.

@iwilltry42 (Member)

@nguyenvulong , I just tried, but couldn't replicate your issue with those OS and docker versions 🤔
Or rather, I couldn't replicate it with Ubuntu 18.04.5 (couldn't quickly find a Vagrant image with .3 that works with libvirt) and Docker 20.10.2 and kernel 4.15.0.
So it's the same experience as with your other machines.
Seems like there is something there (kernel version or Docker version, or one of its dependencies) that doesn't work well here 🤔
Any chance you could try with the latest patch releases of the Kernel and Docker? 🤔

i also found similar issues in this repository that seem to be similar to mine

Can you link them here please, so we can try to connect the dots?


nguyenvulong commented Jun 10, 2021

@iwilltry42

Seems like there is something there (Kernel Version or Docker Version (or one of its deps)), that doesn't work well here 🤔
Any chance you could try with the latest patch releases of the Kernel and Docker? 🤔

I will update that machine to the latest and let you know within 24h. Thanks a lot for your time.

Can you link them here please, so we can try to connect the dots?

I linked them in this comment: #619 (comment).
Please double-check, and if you think they are not similar, I will remove them from my comment.

@nguyenvulong (Author)

@iwilltry42
My updates after upgrading to the newest kernel and Docker versions:

Linux AMC_Test 5.12.0-051200-lowlatency #202104252130 SMP PREEMPT Sun Apr 25 21:36:42 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux


 longnv@AMC_Test  ~  docker info
Client:
 Context:    default
 Debug Mode: false
 Plugins:
  app: Docker App (Docker Inc., v0.9.1-beta3)
  buildx: Build with BuildKit (Docker Inc., v0.5.1-docker)
  scan: Docker Scan (Docker Inc., v0.8.0)

Server:
 Containers: 7
  Running: 2
  Paused: 0
  Stopped: 5
 Images: 2
 Server Version: 20.10.7
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Cgroup Version: 1
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: io.containerd.runc.v2 io.containerd.runtime.v1.linux runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: d71fcd7d8303cbf684402823e425e9dd2e99285d
 runc version: b9ee9c6314599f1b4a7f497e1f1f856fe433d3b7
 init version: de40ad0
 Security Options:
  apparmor
  seccomp
   Profile: default
 Kernel Version: 5.12.0-051200-lowlatency
 Operating System: Ubuntu 18.04.3 LTS
 OSType: linux
 Architecture: x86_64
 CPUs: 16
 Total Memory: 31.25GiB
 Name: AMC_Test
 ID: 6W66:3A2U:5MZM:XJCH:N7NZ:OI4A:4FLB:VYHL:6K6N:5E5U:VH62:PGEN
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false

I attach the logs of the two server nodes below.
k3d-test-server-1.txt
k3d-test-server-0.txt


iwilltry42 commented Jun 11, 2021

So I inspected the logs again and something is wrong there with the etcd connection.

server-1 is added to the etcd cluster according to the server-0 logs:

{"level":"info","ts":"2021-06-11T09:02:24.876Z","caller":"membership/cluster.go:392","msg":"added member","cluster-id":"c74448475845f0fb","local-member-id":"e58c878e0e01014","added-peer-id":"7cb0e34e17d82726","added-peer-peer-urls":["https://172.18.0.3:2380"]}

Looks like server-1 is then trying to initialize, but right after opening the backend db, it fails to get the members from server-0 and closes etcd:

{"level":"warn","ts":"2021-06-11T09:02:24.937Z","caller":"etcdserver/cluster_util.go:76","msg":"failed to get cluster response","address":"https://172.18.0.2:2380/members","error":"Get \"https://172.18.0.2:2380/members\": EOF"}
{"level":"info","ts":"2021-06-11T09:02:24.938Z","caller":"embed/etcd.go:363","msg":"closing etcd server","name":"k3d-test-server-1-d8d11e8a","data-dir":"/var/lib/rancher/k3s/server/db/etcd","advertise-peer-urls":["http://localhost:2380"],"advertise-client-urls":["https://172.18.0.3:2379"]}
{"level":"info","ts":"2021-06-11T09:02:24.938Z","caller":"embed/etcd.go:367","msg":"closed etcd server","name":"k3d-test-server-1-d8d11e8a","data-dir":"/var/lib/rancher/k3s/server/db/etcd","advertise-peer-urls":["http://localhost:2380"],"advertise-client-urls":["https://172.18.0.3:2379"]}

What's unexpected is that it just goes ahead and starts the API server:

time="2021-06-11T09:02:24.939180074Z" level=info msg="Running kube-apiserver ...

In the aftermath, this obviously fails to connect to the local etcd, as that was closed already:

{"level":"warn","ts":"2021-06-11T09:02:24.939Z","caller":"grpclog/grpclog.go:60","msg":"grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:2379  <nil> 0 <nil>}. Err :connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused\". Reconnecting..."}
...
time="2021-06-11T09:02:49.939949826Z" level=info msg="Failed to test data store connection: context deadline exceeded"

server-0 then eventually removes the failed learner:

{"level":"info","ts":"2021-06-11T09:07:32.341Z","caller":"membership/cluster.go:422","msg":"removed member","cluster-id":"c74448475845f0fb","local-member-id":"e58c878e0e01014","removed-remote-peer-id":"7cb0e34e17d82726","removed-remote-peer-urls":["https://172.18.0.3:2380"]}

And this just goes on and on without exiting, so k3d doesn't catch the error 🤔

UPDATE 1:

At the same time that server-1 logs the EOF error (after which it shuts down etcd), server-0 logs this:

{"level":"warn","ts":"2021-06-11T09:02:24.937Z","caller":"embed/config_logging.go:270","msg":"rejected connection","remote-addr":"172.18.0.1:43776","server-name":"","ip-addresses":["0.0.0.0","127.0.0.1","172.18.0.3","10.43.0.1"],"dns-names":["localhost"],"error":"tls: \"172.18.0.1\" does not match any of DNSNames [\"localhost\"] (lookup 172.18.0.1: Name does not resolve)"}

No clue if that's related, as 172.18.0.1 is the docker network's default gateway 🤔


@brandond, should this be caught in k3s, so that it doesn't just continue with the init process once the etcd server has failed? 🤔

@nguyenvulong, this means that your issue is different from the others that you've found so far.
-> While this failing cluster creation is still running (meaning that both containers are still running), can you try docker exec k3d-test-server-1 traceroute 172.18.0.2 and paste the output here, please?

@nguyenvulong (Author)

while this failing cluster creation is still running (meaning that both containers are still running), can you try and do docker exec k3d-test-server-1 traceroute 172.18.0.2 and paste the output here, please?

Here it is @iwilltry42

[screenshot]

iwilltry42 modified the milestones: v4.4.5, Backlog Jun 11, 2021
@brandond

Yeah, our etcd ready check clearly isn't handling this edge case well. I do wonder what's going on with your Docker network that's causing the new member to get an EOF when trying to talk to the first member:

{"level":"warn","ts":"2021-06-11T09:02:24.937Z","caller":"etcdserver/cluster_util.go:76","msg":"failed to get cluster response","address":"https://172.18.0.2:2380/members","error":"Get \"https://172.18.0.2:2380/members\": EOF"}

@iwilltry42 (Member)

@brandond exactly my thought... I'm a bit lost here 🙄
I thought maybe there was a routing issue or network overlaps or something similar, but nothing seems to apply as the request doesn't even leave the network 🤷‍♂️


brandond commented Jun 11, 2021

Would it be possible to start up another container in the same network, and run a curl test? curl -vks https://172.18.0.2:2380/members

Note that this should be done from another container, NOT by execing into the server container.
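
For reference, one way to do that (the network name follows k3d's default k3d-<clustername> naming, here assuming a cluster named "test"; the curl image choice is just an example, and the IP should be adjusted to the actual server-0 address):

docker run --rm -it --network k3d-test curlimages/curl:latest -vks https://172.18.0.2:2380/members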


nguyenvulong commented Jun 12, 2021

@brandond @iwilltry42

Would it be possible to start up another container in the same network, and run a curl test? curl -vks https://172.18.0.2:2380/members

Note that this should be done from another container, NOT by execing into the server container.

Understood.

First I created the cluster; it hung as usual:
[screenshot]

Next I created a debug container from alpine in the same network; here are the IP addresses:

[screenshot]

Finally I tried to curl

[screenshot]

so I guess the culprit is here?

* Using HTTP2, server supports multi-use
* Connection state changed (HTTP/2 confirmed)
* Copying HTTP/2 data in stream buffer to connection buffer after upgrade: len=0
* Using Stream ID: 1 (easy handle 0x7f54b9ddea90)
* OpenSSL SSL_write: Broken pipe, errno 32
* Failed sending HTTP2 data
* Failed sending HTTP request
* Connection #0 to host 172.20.0.2 left intact

(Update: I guess not; I tried with the other, working physical server, and the output was the same.)
So is * OpenSSL SSL_write: Broken pipe, errno 32 expected?

iwilltry42 added the help wanted label Jun 14, 2021
@nguyenvulong (Author)

I re-ran the test with new versions of k3d and Docker. The log has changed a bit, but I guess it's the same issue.

WARN[0006] warning: encountered fatal log from node k3d-test3-server-1 (retrying 0/10): ¯time="2021-12-29T06:30:18.174131527Z" level=fatal msg="ETCD join failed: cannot fetch cluster info from peer urls: could not retrieve cluster information from the given URLs"

INFO

❯ k3d version
k3d version v5.2.2
k3s version v1.21.7-k3s1 (default)
❯ docker version
Client: Docker Engine - Community
 Version:           20.10.9
 API version:       1.41
 Go version:        go1.16.8
 Git commit:        c2ea9bc
 Built:             Mon Oct  4 16:08:29 2021
 OS/Arch:           linux/amd64
 Context:           default
 Experimental:      true

Server: Docker Engine - Community
 Engine:
  Version:          20.10.9
  API version:      1.41 (minimum version 1.12)
  Go version:       go1.16.8
  Git commit:       79ea9d3
  Built:            Mon Oct  4 16:06:34 2021
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.4.11
  GitCommit:        5b46e404f6b9f661a205e28d59c982d3634148f8
 runc:
  Version:          1.0.2
  GitCommit:        v1.0.2-0-g52b36a2
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0

LOGS --verbose

INFO[0002] Using the k3d-tools node to gather environment information 
DEBU[0002] no netlabel present on container /k3d-test3-tools 
DEBU[0002] failed to get IP for container /k3d-test3-tools as we couldn't find the cluster network 
DEBU[0002] DOCKER_SOCK=/var/run/docker.sock             
INFO[0002] HostIP: using network gateway 172.28.0.1 address 
INFO[0002] Starting cluster 'test3'                     
INFO[0002] Starting the initializing server...          
DEBU[0002] Deleting node k3d-test3-tools ...            
DEBU[0002] DOCKER_SOCK=/var/run/docker.sock             
DEBU[0002] No fix enabled.                              
DEBU[0002] Node k3d-test3-server-0 Start Time: 2021-12-29 15:30:14.133389946 +0900 KST m=+2.576182749 
INFO[0002] Starting Node 'k3d-test3-server-0'           
DEBU[0002] Truncated 2021-12-29 06:30:14.503862759 +0000 UTC to 2021-12-29 06:30:14 +0000 UTC 
DEBU[0002] Waiting for node k3d-test3-server-0 to get ready (Log: 'Containerd is now running') 
DEBU[0004] Finished waiting for log message 'Containerd is now running' from node 'k3d-test3-server-0' 
INFO[0004] Starting servers...                          
DEBU[0004] DOCKER_SOCK=/var/run/docker.sock             
DEBU[0004] No fix enabled.                              
DEBU[0004] Node k3d-test3-server-1 Start Time: 2021-12-29 15:30:16.336411023 +0900 KST m=+4.779203802 
INFO[0004] Starting Node 'k3d-test3-server-1'           
DEBU[0005] Truncated 2021-12-29 06:30:16.710825562 +0000 UTC to 2021-12-29 06:30:16 +0000 UTC 
DEBU[0005] Waiting for node k3d-test3-server-1 to get ready (Log: 'k3s is up and running') 
WARN[0006] warning: encountered fatal log from node k3d-test3-server-1 (retrying 0/10): ¯time="2021-12-29T06:30:18.174131527Z" level=fatal msg="ETCD join failed: cannot fetch cluster info from peer urls: could not retrieve cluster information from the given URLs" 
ERRO[0008] Failed Cluster Start: Failed to start server k3d-test3-server-1: Node k3d-test3-server-1 failed to get ready: error waiting for log line `k3s is up and running` from node 'k3d-test3-server-1': stopped returning log lines 
ERRO[0008] Failed to create cluster >>> Rolling Back    
INFO[0008] Deleting cluster 'test3'                     
DEBU[0008] Cluster Details: &{Name:test3 Network:{Name:k3d-test3 ID:6298c0e9dc07eaf8b71adbd1bc7ec70a9dc01af05840c6ff45b8e14f5fb7b896 External:false IPAM:{IPPrefix:172.28.0.0/16 IPsUsed:[] Managed:false} Members:[]} Token:ihzNQjTgfGjTLXPHzSGP Nodes:[0xc000148600 0xc000148780 0xc000148900 0xc000148a80 0xc000148c00 0xc000148d80 0xc000148f00] InitNode:0xc000148780 ExternalDatastore:<nil> KubeAPI:0xc00032cf00 ServerLoadBalancer:0xc000307150 ImageVolume:k3d-test3-images} 
DEBU[0008] Deleting node k3d-test3-serverlb ...         
DEBU[0008] Deleting node k3d-test3-server-0 ...         
DEBU[0009] Deleting node k3d-test3-server-1 ...         
DEBU[0009] Deleting node k3d-test3-server-2 ...         
DEBU[0009] Deleting node k3d-test3-agent-0 ...          
DEBU[0009] Deleting node k3d-test3-agent-1 ...          
DEBU[0009] Deleting node k3d-test3-agent-2 ...          
INFO[0009] Deleting cluster network 'k3d-test3'         
INFO[0009] Deleting image volume 'k3d-test3-images'     
FATA[0009] Cluster creation FAILED, all changes have been rolled back! 

@iwilltry42
Member

@nguyenvulong, so it's still failing on that one single physical host, while working on all the other hosts?
Is there anything like ufw or firewalld running?
You could try k3d cluster create issue-619 --k3s-arg="--cluster-init" to create a single-server cluster with etcd enabled, followed by k3d node create --trace server-619 --role server --cluster issue-619 to add the second server node to it.
That way you have all the time you need to inspect the initializing server node for issues, since you can already start using the cluster with kubectl.
You can also run kubectl delete node and k3d node delete on the secondary server node to remove it and retry without having to re-create the first server node.
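A minimal sketch of that two-step flow in shell, assuming the example names issue-619 / server-619 from above (the exact --k3s-arg node-filter suffix depends on the k3d version; v5 expects an @NODEFILTER part such as @server:0):

# 1) create a single-server cluster with embedded etcd enabled
k3d cluster create issue-619 --k3s-arg "--cluster-init@server:0"

# 2) the cluster is usable right away, so the first server can be inspected at leisure
docker logs -f k3d-issue-619-server-0
kubectl get nodes

# 3) add the second server node once the first one looks healthy
k3d node create server-619 --role server --cluster issue-619 --trace

# 4) if the join fails, drop only the secondary node (k3d names it k3d-server-619-0) and retry
kubectl delete node k3d-server-619-0
k3d node delete k3d-server-619-0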

@nguyenvulong
Author

nguyenvulong commented Aug 15, 2022

Hello, I was finally able to make it work.
The problem is quite silly: it was related to iptables rules. I don't know which one exactly, but I had accidentally run another test and had to flush iptables back to a clean slate.
I'm sorry it took so long to get back; I was about to give up on this toy project.

I'll leave some commands and logs below for those who might have similar problems.
Once again, thanks so much for your guidance, especially @iwilltry42. Have a great day!

Commands executed

# reset the default policies to ACCEPT and flush all IPv4 rules
sudo iptables -P INPUT ACCEPT
sudo iptables -P FORWARD ACCEPT
sudo iptables -P OUTPUT ACCEPT
sudo iptables -t nat -F
sudo iptables -t mangle -F
sudo iptables -F

# same for IPv6, plus deleting any user-defined chains
sudo ip6tables -P INPUT ACCEPT
sudo ip6tables -P FORWARD ACCEPT
sudo ip6tables -P OUTPUT ACCEPT
sudo ip6tables -t nat -F
sudo ip6tables -t mangle -F
sudo ip6tables -F
sudo ip6tables -X

# strip all rules from the saved ruleset, keep only the built-in chains with ACCEPT policy, restore, and verify
iptables-save | awk '/^[*]/ { print $1 } 
                        /^:[A-Z]+ [^-]/ { print $1 " ACCEPT" ; }
                        /COMMIT/ { print $0; }' | iptables-restore
sudo iptables -nvL
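
One caveat (my addition, not part of the commands above): flushing the filter and nat tables also removes the DOCKER/DOCKER-USER chains and NAT rules that the Docker daemon manages, so it's worth restarting Docker afterwards so it can recreate them before retrying the cluster creation. On a systemd host:

# let Docker recreate its own iptables chains after the flush (assumes systemd)
sudo systemctl restart docker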

Create the init server (I had to add "@" to make it work; I guess the syntax changed a bit)

❯ k3d cluster create multiserver --k3s-arg="--cluster-init@" --servers 1
WARN[0000] No node filter specified                     
INFO[0000] Prep: Network                                
INFO[0000] Re-using existing network 'k3d-multiserver' (218dca2d5aee91d3f96600bce485a7eab49fd0c0cba0e9639cc5b895a364a84d) 
INFO[0000] Created volume 'k3d-multiserver-images'      
INFO[0000] Starting new tools node...                   
INFO[0000] Starting Node 'k3d-multiserver-tools'        
INFO[0001] Creating node 'k3d-multiserver-server-0'     
INFO[0001] Creating LoadBalancer 'k3d-multiserver-serverlb' 
INFO[0001] Using the k3d-tools node to gather environment information 
INFO[0001] HostIP: using network gateway 172.29.0.1 address 
INFO[0001] Starting cluster 'multiserver'               
INFO[0001] Starting servers...                          
INFO[0001] Starting Node 'k3d-multiserver-server-0'     
INFO[0009] All agents already running.                  
INFO[0009] Starting helpers...                          
INFO[0009] Starting Node 'k3d-multiserver-serverlb'     
INFO[0015] Injecting '172.29.0.1 host.k3d.internal' into /etc/hosts of all nodes... 
INFO[0015] Injecting records for host.k3d.internal and for 3 network members into CoreDNS configmap... 
INFO[0016] Cluster 'multiserver' created successfully! 
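
For context, this is an assumption based on k3d v5's flag syntax rather than anything stated in this thread: --k3s-arg expects a node-filter suffix after the @ (e.g. server:0), and leaving it empty, as in the command above, produces the "No node filter specified" warning while still passing the flag through. An explicit variant would look like:

# hypothetical explicit form, targeting only the first server node
k3d cluster create multiserver --servers 1 --k3s-arg "--cluster-init@server:0"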

Join a server to the cluster; everything works fine now

❯ k3d node create --trace multiserver --role server --cluster multiserver

DEBU[0000] DOCKER_SOCK=/var/run/docker.sock             
DEBU[0000] Runtime Info:
&{Name:docker Endpoint:/var/run/docker.sock Version:20.10.17 OSType:linux OS:Ubuntu 20.04.4 LTS Arch:x86_64 CgroupVersion:1 CgroupDriver:cgroupfs Filesystem:extfs} 
INFO[0000] Adding 1 node(s) to the runtime local cluster 'multiserver'... 
TRAC[0000] TranslateContainerDetailsToNode: Checking for default object label app=k3d on container /k3d-multiserver-serverlb 
TRAC[0000] TranslateContainerDetailsToNode: Checking for default object label app=k3d on container /k3d-multiserver-server-0 
TRAC[0000] Reading path /etc/confd/values.yaml from node k3d-multiserver-serverlb... 
DEBU[0000] DOCKER_SOCK=/var/run/docker.sock             
INFO[0000] Using the k3d-tools node to gather environment information 
INFO[0000] Starting new tools node...                   
DEBU[0000] DOCKER_SOCK=/var/run/docker.sock             
DEBU[0000] DOCKER_SOCK=/var/run/docker.sock             
DEBU[0000] DOCKER_SOCK=/var/run/docker.sock             
TRAC[0000] Creating node from spec
&{Name:k3d-multiserver-tools Role:noRole Image:docker.io/rancher/k3d-tools:5.2.2 Volumes:[k3d-multiserver-images:/k3d/images /var/run/docker.sock:/var/run/docker.sock] Env:[] Cmd:[] Args:[noop] Ports:map[] Restart:false Created: RuntimeLabels:map[app:k3d k3d.cluster:multiserver k3d.version:v5.2.2] K3sNodeLabels:map[] Networks:[k3d-multiserver] ExtraHosts:[host.k3d.internal:host-gateway] ServerOpts:{IsInit:false KubeAPI:<nil>} AgentOpts:{} GPURequest: Memory: State:{Running:false Status: Started:} IP:{IP:zero IP Static:false} HookActions:[]} 
TRAC[0000] Creating docker container with translated config
&{ContainerConfig:{Hostname:k3d-multiserver-tools Domainname: User: AttachStdin:false AttachStdout:false AttachStderr:false ExposedPorts:map[] Tty:false OpenStdin:false StdinOnce:false Env:[K3S_KUBECONFIG_OUTPUT=/output/kubeconfig.yaml] Cmd:[noop] Healthcheck:<nil> ArgsEscaped:false Image:docker.io/rancher/k3d-tools:5.2.2 Volumes:map[] WorkingDir: Entrypoint:[] NetworkDisabled:false MacAddress: OnBuild:[] Labels:map[app:k3d k3d.cluster:multiserver k3d.role:noRole k3d.version:v5.2.2] StopSignal: StopTimeout:<nil> Shell:[]} HostConfig:{Binds:[k3d-multiserver-images:/k3d/images /var/run/docker.sock:/var/run/docker.sock] ContainerIDFile: LogConfig:{Type: Config:map[]} NetworkMode: PortBindings:map[] RestartPolicy:{Name: MaximumRetryCount:0} AutoRemove:false VolumeDriver: VolumesFrom:[] CapAdd:[] CapDrop:[] CgroupnsMode: DNS:[] DNSOptions:[] DNSSearch:[] ExtraHosts:[host.k3d.internal:host-gateway] GroupAdd:[] IpcMode: Cgroup: Links:[] OomScoreAdj:0 PidMode: Privileged:true PublishAllPorts:false ReadonlyRootfs:false SecurityOpt:[] StorageOpt:map[] Tmpfs:map[/run: /var/run:] UTSMode: UsernsMode: ShmSize:0 Sysctls:map[] Runtime: ConsoleSize:[0 0] Isolation: Resources:{CPUShares:0 Memory:0 NanoCPUs:0 CgroupParent: BlkioWeight:0 BlkioWeightDevice:[] BlkioDeviceReadBps:[] BlkioDeviceWriteBps:[] BlkioDeviceReadIOps:[] BlkioDeviceWriteIOps:[] CPUPeriod:0 CPUQuota:0 CPURealtimePeriod:0 CPURealtimeRuntime:0 CpusetCpus: CpusetMems: Devices:[] DeviceCgroupRules:[] DeviceRequests:[] KernelMemory:0 KernelMemoryTCP:0 MemoryReservation:0 MemorySwap:0 MemorySwappiness:<nil> OomKillDisable:<nil> PidsLimit:<nil> Ulimits:[] CPUCount:0 CPUPercent:0 IOMaximumIOps:0 IOMaximumBandwidth:0} Mounts:[] MaskedPaths:[] ReadonlyPaths:[] Init:0xc0002230af} NetworkingConfig:{EndpointsConfig:map[k3d-multiserver:0xc000378240]}} 
DEBU[0000] Created container k3d-multiserver-tools (ID: 9fbc60597753770c1ff87cb69101d0879ca938793a654721b57f84dceb8da89d) 
DEBU[0000] Node k3d-multiserver-tools Start Time: 2022-08-15 15:56:15.107324244 +0900 KST m=+0.279922353 
TRAC[0000] Starting node 'k3d-multiserver-tools'        
INFO[0000] Starting Node 'k3d-multiserver-tools'        
DEBU[0000] Truncated 2022-08-15 06:56:15.594737218 +0000 UTC to 2022-08-15 06:56:15 +0000 UTC 
DEBU[0000] DOCKER_SOCK=/var/run/docker.sock             
TRAC[0000] GOOS: linux / Runtime OS: linux (Ubuntu 20.04.4 LTS) 
INFO[0000] HostIP: using network gateway 172.29.0.1 address 
DEBU[0000] Deleting node k3d-multiserver-tools ...      
TRAC[0000] TranslateContainerDetailsToNode: Checking for default object label app=k3d on container /k3d-multiserver-server-0 
TRAC[0000] Dropping '--cluster-init' from source node's cmd 
DEBU[0000] Adding node k3d-multiserver-0 to cluster multiserver based on existing (sanitized) node k3d-multiserver-server-0 
TRAC[0000] Sanitized Source Node: &{Name:k3d-multiserver-server-0 Role:server Image:sha256:4cbf38ec7da6c499ecc50cc6c54ef8c89d88f02a95da66eab5e404ade7067c61 Volumes:[k3d-multiserver-images:/k3d/images] Env:[K3S_TOKEN=hNunmhtejzjPxTWIQTOv K3S_KUBECONFIG_OUTPUT=/output/kubeconfig.yaml PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/bin/aux CRI_CONFIG_FILE=/var/lib/rancher/k3s/agent/etc/crictl.yaml] Cmd:[server --tls-san 0.0.0.0] Args:[] Ports:map[] Restart:true Created:2022-08-15T06:55:29.132020889Z RuntimeLabels:map[k3d.cluster:multiserver k3d.cluster.imageVolume:k3d-multiserver-images k3d.cluster.network:k3d-multiserver k3d.cluster.network.external:true k3d.cluster.network.id:218dca2d5aee91d3f96600bce485a7eab49fd0c0cba0e9639cc5b895a364a84d k3d.cluster.network.iprange:172.29.0.0/16 k3d.cluster.token:hNunmhtejzjPxTWIQTOv k3d.cluster.url:https://k3d-multiserver-server-0:6443 k3d.role:server k3d.server.api.host:0.0.0.0 k3d.server.api.hostIP:0.0.0.0 k3d.server.api.port:41121 k3d.version:v5.2.2] K3sNodeLabels:map[] Networks:[k3d-multiserver] ExtraHosts:[] ServerOpts:{IsInit:false KubeAPI:0xc0000bc4c0} AgentOpts:{} GPURequest: Memory: State:{Running:true Status:running Started:} IP:{IP:172.29.0.3 Static:false} HookActions:[]}
New Node: &{Name:k3d-multiserver-0 Role:server Image:docker.io/rancher/k3s:v1.21.7-k3s1 Volumes:[] Env:[] Cmd:[] Args:[] Ports:map[] Restart:true Created: RuntimeLabels:map[k3d.role:server] K3sNodeLabels:map[] Networks:[k3d-multiserver] ExtraHosts:[] ServerOpts:{IsInit:false KubeAPI:<nil>} AgentOpts:{} GPURequest: Memory: State:{Running:false Status: Started:} IP:{IP:zero IP Static:false} HookActions:[]} 
TRAC[0000] Reading path /etc/rancher/k3s/registries.yaml from node k3d-multiserver-server-0... 
TRAC[0000] Resulting node &{Name:k3d-multiserver-0 Role:server Image:docker.io/rancher/k3s:v1.21.7-k3s1 Volumes:[k3d-multiserver-images:/k3d/images] Env:[K3S_TOKEN=hNunmhtejzjPxTWIQTOv K3S_KUBECONFIG_OUTPUT=/output/kubeconfig.yaml PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/bin/aux CRI_CONFIG_FILE=/var/lib/rancher/k3s/agent/etc/crictl.yaml] Cmd:[server --tls-san 0.0.0.0] Args:[] Ports:map[] Restart:true Created:2022-08-15T06:55:29.132020889Z RuntimeLabels:map[k3d.cluster:multiserver k3d.cluster.imageVolume:k3d-multiserver-images k3d.cluster.network:k3d-multiserver k3d.cluster.network.external:true k3d.cluster.network.id:218dca2d5aee91d3f96600bce485a7eab49fd0c0cba0e9639cc5b895a364a84d k3d.cluster.network.iprange:172.29.0.0/16 k3d.cluster.token:hNunmhtejzjPxTWIQTOv k3d.cluster.url:https://k3d-multiserver-server-0:6443 k3d.role:server k3d.server.api.host:0.0.0.0 k3d.server.api.hostIP:0.0.0.0 k3d.server.api.port:41121 k3d.version:v5.2.2] K3sNodeLabels:map[] Networks:[k3d-multiserver] ExtraHosts:[] ServerOpts:{IsInit:false KubeAPI:0xc0000bc4c0} AgentOpts:{} GPURequest: Memory: State:{Running:true Status:running Started:} IP:{IP:zero IP Static:false} HookActions:[]} 
DEBU[0000] DOCKER_SOCK=/var/run/docker.sock             
TRAC[0000] Creating node from spec
&{Name:k3d-multiserver-0 Role:server Image:docker.io/rancher/k3s:v1.21.7-k3s1 Volumes:[k3d-multiserver-images:/k3d/images] Env:[K3S_TOKEN=hNunmhtejzjPxTWIQTOv K3S_KUBECONFIG_OUTPUT=/output/kubeconfig.yaml PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/bin/aux CRI_CONFIG_FILE=/var/lib/rancher/k3s/agent/etc/crictl.yaml K3S_URL=https://k3d-multiserver-server-0:6443] Cmd:[server --tls-san 0.0.0.0] Args:[] Ports:map[] Restart:true Created:2022-08-15T06:55:29.132020889Z RuntimeLabels:map[k3d.cluster:multiserver k3d.cluster.imageVolume:k3d-multiserver-images k3d.cluster.network:k3d-multiserver k3d.cluster.network.external:true k3d.cluster.network.id:218dca2d5aee91d3f96600bce485a7eab49fd0c0cba0e9639cc5b895a364a84d k3d.cluster.network.iprange:172.29.0.0/16 k3d.cluster.token:hNunmhtejzjPxTWIQTOv k3d.cluster.url:https://k3d-multiserver-server-0:6443 k3d.role:server k3d.server.api.host:0.0.0.0 k3d.server.api.hostIP:0.0.0.0 k3d.server.api.port:41121 k3d.version:v5.2.2] K3sNodeLabels:map[] Networks:[k3d-multiserver] ExtraHosts:[] ServerOpts:{IsInit:false KubeAPI:0xc0000bc4c0} AgentOpts:{} GPURequest: Memory: State:{Running:false Status: Started:} IP:{IP:zero IP Static:false} HookActions:[]} 
TRAC[0000] Creating docker container with translated config
&{ContainerConfig:{Hostname:k3d-multiserver-0 Domainname: User: AttachStdin:false AttachStdout:false AttachStderr:false ExposedPorts:map[] Tty:false OpenStdin:false StdinOnce:false Env:[K3S_TOKEN=hNunmhtejzjPxTWIQTOv K3S_KUBECONFIG_OUTPUT=/output/kubeconfig.yaml PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/bin/aux CRI_CONFIG_FILE=/var/lib/rancher/k3s/agent/etc/crictl.yaml K3S_URL=https://k3d-multiserver-server-0:6443 K3S_KUBECONFIG_OUTPUT=/output/kubeconfig.yaml] Cmd:[server --tls-san 0.0.0.0 --tls-san 0.0.0.0] Healthcheck:<nil> ArgsEscaped:false Image:docker.io/rancher/k3s:v1.21.7-k3s1 Volumes:map[] WorkingDir: Entrypoint:[] NetworkDisabled:false MacAddress: OnBuild:[] Labels:map[app:k3d k3d.cluster:multiserver k3d.cluster.imageVolume:k3d-multiserver-images k3d.cluster.network:k3d-multiserver k3d.cluster.network.external:true k3d.cluster.network.id:218dca2d5aee91d3f96600bce485a7eab49fd0c0cba0e9639cc5b895a364a84d k3d.cluster.network.iprange:172.29.0.0/16 k3d.cluster.token:hNunmhtejzjPxTWIQTOv k3d.cluster.url:https://k3d-multiserver-server-0:6443 k3d.role:server k3d.server.api.host:0.0.0.0 k3d.server.api.hostIP:0.0.0.0 k3d.server.api.port:41121 k3d.version:v5.2.2] StopSignal: StopTimeout:<nil> Shell:[]} HostConfig:{Binds:[k3d-multiserver-images:/k3d/images] ContainerIDFile: LogConfig:{Type: Config:map[]} NetworkMode: PortBindings:map[] RestartPolicy:{Name:unless-stopped MaximumRetryCount:0} AutoRemove:false VolumeDriver: VolumesFrom:[] CapAdd:[] CapDrop:[] CgroupnsMode: DNS:[] DNSOptions:[] DNSSearch:[] ExtraHosts:[] GroupAdd:[] IpcMode: Cgroup: Links:[] OomScoreAdj:0 PidMode: Privileged:true PublishAllPorts:false ReadonlyRootfs:false SecurityOpt:[] StorageOpt:map[] Tmpfs:map[/run: /var/run:] UTSMode: UsernsMode: ShmSize:0 Sysctls:map[] Runtime: ConsoleSize:[0 0] Isolation: Resources:{CPUShares:0 Memory:0 NanoCPUs:0 CgroupParent: BlkioWeight:0 BlkioWeightDevice:[] BlkioDeviceReadBps:[] BlkioDeviceWriteBps:[] BlkioDeviceReadIOps:[] BlkioDeviceWriteIOps:[] CPUPeriod:0 CPUQuota:0 CPURealtimePeriod:0 CPURealtimeRuntime:0 CpusetCpus: CpusetMems: Devices:[] DeviceCgroupRules:[] DeviceRequests:[] KernelMemory:0 KernelMemoryTCP:0 MemoryReservation:0 MemorySwap:0 MemorySwappiness:<nil> OomKillDisable:<nil> PidsLimit:<nil> Ulimits:[] CPUCount:0 CPUPercent:0 IOMaximumIOps:0 IOMaximumBandwidth:0} Mounts:[] MaskedPaths:[] ReadonlyPaths:[] Init:0xc00026e94f} NetworkingConfig:{EndpointsConfig:map[k3d-multiserver:0xc0002f0000]}} 
DEBU[0000] Created container k3d-multiserver-0 (ID: 590cbc1ce10825f30608303bbeacd52934c3dd8d051cede1da914b9360a24579) 
DEBU[0000] DOCKER_SOCK=/var/run/docker.sock             
DEBU[0000] No fix enabled.                              
DEBU[0000] Node k3d-multiserver-0 Start Time: 2022-08-15 15:56:15.780042321 +0900 KST m=+0.952640423 
TRAC[0000] Starting node 'k3d-multiserver-0'            
INFO[0000] Starting Node 'k3d-multiserver-0'            
TRAC[0001] [Docker] Deleted Container k3d-multiserver-tools 
DEBU[0001] Truncated 2022-08-15 06:56:16.369727944 +0000 UTC to 2022-08-15 06:56:16 +0000 UTC 
DEBU[0001] Waiting for node k3d-multiserver-0 to get ready (Log: 'k3s is up and running') 
TRAC[0001] NodeWaitForLogMessage: Node 'k3d-multiserver-0' waiting for log message 'k3s is up and running' since '2022-08-15 06:56:16 +0000 UTC' 
TRAC[0024] Found target message `k3s is up and running` in log line `time="2022-08-15T06:56:39.144380185Z" level=info msg="k3s is up and running"` 
DEBU[0024] Finished waiting for log message 'k3s is up and running' from node 'k3d-multiserver-0' 
TRAC[0024] Node k3d-multiserver-0: Executing postStartAction: [ExecAction] Executing `[sh -c echo '172.29.0.1 host.k3d.internal' >> /etc/hosts]` (0 retries): Inject /etc/hosts record for host.k3d.internal 
TRAC[0024] ExecAction ([sh -c echo '172.29.0.1 host.k3d.internal' >> /etc/hosts] in k3d-multiserver-0) try 1/1 
DEBU[0024] Executing command '[sh -c echo '172.29.0.1 host.k3d.internal' >> /etc/hosts]' in node 'k3d-multiserver-0' 
TRAC[0024] Exec process '[sh -c echo '172.29.0.1 host.k3d.internal' >> /etc/hosts]' still running in node 'k3d-multiserver-0'.. sleeping for 1 second... 
DEBU[0025] Exec process in node 'k3d-multiserver-0' exited with '0' 
INFO[0025] Updating loadbalancer config to include new server node(s) 
TRAC[0025] TranslateContainerDetailsToNode: Checking for default object label app=k3d on container /k3d-multiserver-0 
TRAC[0025] TranslateContainerDetailsToNode: Checking for default object label app=k3d on container /k3d-multiserver-serverlb 
TRAC[0025] TranslateContainerDetailsToNode: Checking for default object label app=k3d on container /k3d-multiserver-server-0 
TRAC[0025] Reading path /etc/confd/values.yaml from node k3d-multiserver-serverlb... 
TRAC[0025] Current loadbalancer config:
{Ports:map[6443.tcp:[k3d-multiserver-server-0]] Settings:{WorkerConnections:1024 DefaultProxyTimeout:0}} 
TRAC[0025] New loadbalancer config:
{Ports:map[6443.tcp:[k3d-multiserver-server-0]] Settings:{WorkerConnections:1024 DefaultProxyTimeout:0}} 
DEBU[0025] Updating the loadbalancer with this diff: [Ports.map[6443.tcp].slice[1]: <no value> != k3d-multiserver-0 Settings.WorkerConnections: 1024 != 1026] 
DEBU[0025] Writing lb config:
ports:
  6443.tcp:
  - k3d-multiserver-server-0
  - k3d-multiserver-0
settings:
  workerConnections: 1026 
TRAC[0025] NodeWaitForLogMessage: Node 'k3d-multiserver-serverlb' waiting for log message 'start worker processes' since '2022-08-15 06:56:40 +0000 UTC' 
TRAC[0025] Found target message `start worker processes` in log line `2022/08/15 06:56:40 [notice] 32#32: start worker processes` 
DEBU[0025] Finished waiting for log message 'start worker processes' from node 'k3d-multiserver-serverlb' 
INFO[0025] Successfully configured loadbalancer k3d-multiserver-serverlb! 
INFO[0026] Successfully created 1 node(s)! 
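
As a quick sanity check (my addition), the join can be confirmed from both kubectl's and k3d's point of view; the new node should show up alongside server-0 with the control-plane/etcd roles:

kubectl get nodes -o wide
k3d node list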
