[BUG] Cluster fails to create with servers > 1 #538

Closed · viq opened this issue Mar 12, 2021 · 13 comments
Labels: bug (Something isn't working) · help wanted (Extra attention is needed) · os/fedora (Issues related to perks of Fedora Linux)

viq commented Mar 12, 2021

What did you do

  • How was the cluster created?
    • k3d cluster create -a 1 -s 3 --registry-create viq-test

What did you expect to happen

A running cluster with three servers and one agent node.

Screenshots or terminal output

The first server starts:

$ docker logs -n 30 k3d-viq-test-server-0
{"level":"warn","ts":"2021-03-12T17:52:02.265Z","caller":"etcdserver/cluster_util.go:315","msg":"failed to reach the peer URL","address":"https://172.21.0.3:2380/version","remote-member-id":"3bf66290abb06d2f","error":"Get \"https://172.21.0.3:2380/version\": dial tcp 172.21.0.3:2380: connect: connection refused"}
{"level":"warn","ts":"2021-03-12T17:52:02.265Z","caller":"etcdserver/cluster_util.go:168","msg":"failed to get version","remote-member-id":"3bf66290abb06d2f","error":"Get \"https://172.21.0.3:2380/version\": dial tcp 172.21.0.3:2380: connect: connection refused"}
{"level":"warn","ts":"2021-03-12T17:52:02.723Z","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"endpoint://client-2ffd25dd-830d-4198-aedc-e7a67f3ca8ec/127.0.0.1:2379","attempt":0,"error":"rpc error: code = FailedPrecondition desc = etcdserver: can only promote a learner member which is in sync with leader"}
time="2021-03-12T17:52:02.723774148Z" level=warning msg="Learner  stalled at RaftAppliedIndex=0 for 45.723765588s"
{"level":"warn","ts":"2021-03-12T17:52:04.370Z","caller":"rafthttp/probing_status.go:70","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_SNAPSHOT","remote-peer-id":"3bf66290abb06d2f","rtt":"0s","error":"dial tcp 172.21.0.3:2380: connect: connection refused"}
{"level":"warn","ts":"2021-03-12T17:52:04.370Z","caller":"rafthttp/probing_status.go:70","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_RAFT_MESSAGE","remote-peer-id":"3bf66290abb06d2f","rtt":"0s","error":"dial tcp 172.21.0.3:2380: connect: connection refused"}
{"level":"warn","ts":"2021-03-12T17:52:06.266Z","caller":"etcdserver/cluster_util.go:315","msg":"failed to reach the peer URL","address":"https://172.21.0.3:2380/version","remote-member-id":"3bf66290abb06d2f","error":"Get \"https://172.21.0.3:2380/version\": dial tcp 172.21.0.3:2380: connect: connection refused"}
{"level":"warn","ts":"2021-03-12T17:52:06.267Z","caller":"etcdserver/cluster_util.go:168","msg":"failed to get version","remote-member-id":"3bf66290abb06d2f","error":"Get \"https://172.21.0.3:2380/version\": dial tcp 172.21.0.3:2380: connect: connection refused"}
{"level":"warn","ts":"2021-03-12T17:52:09.374Z","caller":"rafthttp/probing_status.go:70","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_RAFT_MESSAGE","remote-peer-id":"3bf66290abb06d2f","rtt":"0s","error":"dial tcp 172.21.0.3:2380: connect: connection refused"}
{"level":"warn","ts":"2021-03-12T17:52:09.374Z","caller":"rafthttp/probing_status.go:70","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_SNAPSHOT","remote-peer-id":"3bf66290abb06d2f","rtt":"0s","error":"dial tcp 172.21.0.3:2380: connect: connection refused"}
{"level":"warn","ts":"2021-03-12T17:52:10.268Z","caller":"etcdserver/cluster_util.go:315","msg":"failed to reach the peer URL","address":"https://172.21.0.3:2380/version","remote-member-id":"3bf66290abb06d2f","error":"Get \"https://172.21.0.3:2380/version\": dial tcp 172.21.0.3:2380: connect: connection refused"}
{"level":"warn","ts":"2021-03-12T17:52:10.268Z","caller":"etcdserver/cluster_util.go:168","msg":"failed to get version","remote-member-id":"3bf66290abb06d2f","error":"Get \"https://172.21.0.3:2380/version\": dial tcp 172.21.0.3:2380: connect: connection refused"}
{"level":"warn","ts":"2021-03-12T17:52:14.268Z","caller":"etcdserver/cluster_util.go:315","msg":"failed to reach the peer URL","address":"https://172.21.0.3:2380/version","remote-member-id":"3bf66290abb06d2f","error":"Get \"https://172.21.0.3:2380/version\": dial tcp 172.21.0.3:2380: connect: connection refused"}
{"level":"warn","ts":"2021-03-12T17:52:14.268Z","caller":"etcdserver/cluster_util.go:168","msg":"failed to get version","remote-member-id":"3bf66290abb06d2f","error":"Get \"https://172.21.0.3:2380/version\": dial tcp 172.21.0.3:2380: connect: connection refused"}
{"level":"warn","ts":"2021-03-12T17:52:14.374Z","caller":"rafthttp/probing_status.go:70","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_SNAPSHOT","remote-peer-id":"3bf66290abb06d2f","rtt":"0s","error":"dial tcp 172.21.0.3:2380: connect: connection refused"}
{"level":"warn","ts":"2021-03-12T17:52:14.374Z","caller":"rafthttp/probing_status.go:70","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_RAFT_MESSAGE","remote-peer-id":"3bf66290abb06d2f","rtt":"0s","error":"dial tcp 172.21.0.3:2380: connect: connection refused"}
{"level":"warn","ts":"2021-03-12T17:52:17.724Z","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"endpoint://client-2ffd25dd-830d-4198-aedc-e7a67f3ca8ec/127.0.0.1:2379","attempt":0,"error":"rpc error: code = FailedPrecondition desc = etcdserver: can only promote a learner member which is in sync with leader"}
time="2021-03-12T17:52:17.725071757Z" level=warning msg="Learner  stalled at RaftAppliedIndex=0 for 1m0.72482006s"
{"level":"info","ts":"2021-03-12T17:52:17.725Z","caller":"raft/raft.go:1530","msg":"c2ed0eb8fc7bc4fc switched to configuration voters=(14045899000511186172)"}
{"level":"info","ts":"2021-03-12T17:52:17.726Z","caller":"membership/cluster.go:422","msg":"removed member","cluster-id":"d94dd8791a0b7a16","local-member-id":"c2ed0eb8fc7bc4fc","removed-remote-peer-id":"3bf66290abb06d2f","removed-remote-peer-urls":["https://172.21.0.3:2380"]}
{"level":"info","ts":"2021-03-12T17:52:17.726Z","caller":"rafthttp/peer.go:333","msg":"stopping remote peer","remote-peer-id":"3bf66290abb06d2f"}
{"level":"warn","ts":"2021-03-12T17:52:17.726Z","caller":"rafthttp/stream.go:301","msg":"stopped TCP streaming connection with remote peer","stream-writer-type":"unknown stream","remote-peer-id":"3bf66290abb06d2f"}
{"level":"warn","ts":"2021-03-12T17:52:17.726Z","caller":"rafthttp/stream.go:301","msg":"stopped TCP streaming connection with remote peer","stream-writer-type":"unknown stream","remote-peer-id":"3bf66290abb06d2f"}
{"level":"info","ts":"2021-03-12T17:52:17.726Z","caller":"rafthttp/pipeline.go:86","msg":"stopped HTTP pipelining with remote peer","local-member-id":"c2ed0eb8fc7bc4fc","remote-peer-id":"3bf66290abb06d2f"}
{"level":"info","ts":"2021-03-12T17:52:17.726Z","caller":"rafthttp/stream.go:459","msg":"stopped stream reader with remote peer","stream-reader-type":"stream MsgApp v2","local-member-id":"c2ed0eb8fc7bc4fc","remote-peer-id":"3bf66290abb06d2f"}
{"level":"info","ts":"2021-03-12T17:52:17.727Z","caller":"rafthttp/stream.go:459","msg":"stopped stream reader with remote peer","stream-reader-type":"stream Message","local-member-id":"c2ed0eb8fc7bc4fc","remote-peer-id":"3bf66290abb06d2f"}
{"level":"info","ts":"2021-03-12T17:52:17.727Z","caller":"rafthttp/peer.go:340","msg":"stopped remote peer","remote-peer-id":"3bf66290abb06d2f"}
{"level":"info","ts":"2021-03-12T17:52:17.727Z","caller":"rafthttp/transport.go:369","msg":"removed remote peer","local-member-id":"c2ed0eb8fc7bc4fc","removed-remote-peer-id":"3bf66290abb06d2f"}
{"level":"info","ts":"2021-03-12T17:52:17.727Z","caller":"etcdserver/server.go:1967","msg":"applied a configuration change through raft","local-member-id":"c2ed0eb8fc7bc4fc","raft-conf-change":"ConfChangeRemoveNode","raft-conf-change-node-id":"3bf66290abb06d2f"}
time="2021-03-12T17:52:17.728424850Z" level=warning msg="Removed learner  from etcd cluster"

but the second one never gets far enough to report as ready:

$ docker logs -n 30 k3d-viq-test-server-1
{"level":"warn","ts":"2021-03-12T17:55:49.698Z","caller":"grpclog/grpclog.go:60","msg":"grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:2379  <nil> 0 <nil>}. Err :connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused\". Reconnecting..."}
{"level":"warn","ts":"2021-03-12T17:55:50.342Z","caller":"grpclog/grpclog.go:60","msg":"grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:2379  <nil> 0 <nil>}. Err :connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused\". Reconnecting..."}
{"level":"warn","ts":"2021-03-12T17:55:50.385Z","caller":"grpclog/grpclog.go:60","msg":"grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:2379  <nil> 0 <nil>}. Err :connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused\". Reconnecting..."}
{"level":"warn","ts":"2021-03-12T17:55:51.693Z","caller":"grpclog/grpclog.go:60","msg":"grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:2379  <nil> 0 <nil>}. Err :connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused\". Reconnecting..."}
{"level":"warn","ts":"2021-03-12T17:55:52.237Z","caller":"grpclog/grpclog.go:60","msg":"grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:2379  <nil> 0 <nil>}. Err :connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused\". Reconnecting..."}
{"level":"warn","ts":"2021-03-12T17:55:54.139Z","caller":"grpclog/grpclog.go:60","msg":"grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:2379  <nil> 0 <nil>}. Err :connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused\". Reconnecting..."}
{"level":"warn","ts":"2021-03-12T17:55:55.152Z","caller":"grpclog/grpclog.go:60","msg":"grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:2379  <nil> 0 <nil>}. Err :connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused\". Reconnecting..."}
{"level":"warn","ts":"2021-03-12T17:55:57.832Z","caller":"grpclog/grpclog.go:60","msg":"grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:2379  <nil> 0 <nil>}. Err :connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused\". Reconnecting..."}
{"level":"warn","ts":"2021-03-12T17:55:59.324Z","caller":"grpclog/grpclog.go:60","msg":"grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:2379  <nil> 0 <nil>}. Err :connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused\". Reconnecting..."}
{"level":"warn","ts":"2021-03-12T17:55:59.341Z","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"passthrough:///https://127.0.0.1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused\""}
time="2021-03-12T17:55:59.342028314Z" level=error msg="Failed to check local etcd status for learner management: context deadline exceeded"
{"level":"warn","ts":"2021-03-12T17:55:59.385Z","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"passthrough:///https://127.0.0.1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused\""}
time="2021-03-12T17:55:59.386766390Z" level=info msg="Failed to test data store connection: context deadline exceeded"
{"level":"warn","ts":"2021-03-12T17:56:04.342Z","caller":"grpclog/grpclog.go:60","msg":"grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:2379  <nil> 0 <nil>}. Err :connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused\". Reconnecting..."}
{"level":"warn","ts":"2021-03-12T17:56:04.387Z","caller":"grpclog/grpclog.go:60","msg":"grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:2379  <nil> 0 <nil>}. Err :connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused\". Reconnecting..."}
{"level":"warn","ts":"2021-03-12T17:56:05.342Z","caller":"grpclog/grpclog.go:60","msg":"grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:2379  <nil> 0 <nil>}. Err :connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused\". Reconnecting..."}
{"level":"warn","ts":"2021-03-12T17:56:05.387Z","caller":"grpclog/grpclog.go:60","msg":"grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:2379  <nil> 0 <nil>}. Err :connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused\". Reconnecting..."}
{"level":"warn","ts":"2021-03-12T17:56:07.002Z","caller":"grpclog/grpclog.go:60","msg":"grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:2379  <nil> 0 <nil>}. Err :connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused\". Reconnecting..."}
{"level":"warn","ts":"2021-03-12T17:56:07.239Z","caller":"grpclog/grpclog.go:60","msg":"grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:2379  <nil> 0 <nil>}. Err :connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused\". Reconnecting..."}
{"level":"warn","ts":"2021-03-12T17:56:09.200Z","caller":"grpclog/grpclog.go:60","msg":"grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:2379  <nil> 0 <nil>}. Err :connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused\". Reconnecting..."}
{"level":"warn","ts":"2021-03-12T17:56:09.770Z","caller":"grpclog/grpclog.go:60","msg":"grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:2379  <nil> 0 <nil>}. Err :connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused\". Reconnecting..."}
{"level":"warn","ts":"2021-03-12T17:56:12.706Z","caller":"grpclog/grpclog.go:60","msg":"grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:2379  <nil> 0 <nil>}. Err :connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused\". Reconnecting..."}
{"level":"warn","ts":"2021-03-12T17:56:14.341Z","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"passthrough:///https://127.0.0.1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused\""}
time="2021-03-12T17:56:14.342024025Z" level=error msg="Failed to check local etcd status for learner management: context deadline exceeded"
{"level":"warn","ts":"2021-03-12T17:56:14.387Z","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"passthrough:///https://127.0.0.1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused\""}
time="2021-03-12T17:56:14.387834487Z" level=info msg="Failed to test data store connection: context deadline exceeded"
{"level":"warn","ts":"2021-03-12T17:56:19.342Z","caller":"grpclog/grpclog.go:60","msg":"grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:2379  <nil> 0 <nil>}. Err :connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused\". Reconnecting..."}
{"level":"warn","ts":"2021-03-12T17:56:19.388Z","caller":"grpclog/grpclog.go:60","msg":"grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:2379  <nil> 0 <nil>}. Err :connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused\". Reconnecting..."}
{"level":"warn","ts":"2021-03-12T17:56:20.342Z","caller":"grpclog/grpclog.go:60","msg":"grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:2379  <nil> 0 <nil>}. Err :connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused\". Reconnecting..."}
{"level":"warn","ts":"2021-03-12T17:56:20.389Z","caller":"grpclog/grpclog.go:60","msg":"grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:2379  <nil> 0 <nil>}. Err :connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused\". Reconnecting..."}

Which OS & Architecture

Linux (Qubes, Fedora-33 based Xen VM)

Which version of k3d

$ k3d version
k3d version v4.3.0
k3s version v1.20.4-k3s1 (default)

Which version of docker

$ docker version ; docker info
Client: Docker Engine - Community
 Version:           20.10.5
 API version:       1.41
 Go version:        go1.13.15
 Git commit:        55c4c88
 Built:             Tue Mar  2 20:19:16 2021
 OS/Arch:           linux/amd64
 Context:           default
 Experimental:      true

Server: Docker Engine - Community
 Engine:
  Version:          20.10.5
  API version:      1.41 (minimum version 1.12)
  Go version:       go1.13.15
  Git commit:       363e9a8
  Built:            Tue Mar  2 20:16:56 2021
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.4.4
  GitCommit:        05f951a3781f4f2c1911b05e61c160e9c30eaa8e
 runc:
  Version:          1.0.0-rc93
  GitCommit:        12644e614e25b05da6fd08a38ffa0cfe1903fdec
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0
Client:
 Context:    default
 Debug Mode: false
 Plugins:
  app: Docker App (Docker Inc., v0.9.1-beta3)
  buildx: Build with BuildKit (Docker Inc., v0.5.1-docker)

Server:
 Containers: 6
  Running: 2
  Paused: 0
  Stopped: 4
 Images: 8
 Server Version: 20.10.5
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Native Overlay Diff: true
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Cgroup Version: 1
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: io.containerd.runc.v2 io.containerd.runtime.v1.linux runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 05f951a3781f4f2c1911b05e61c160e9c30eaa8e
 runc version: 12644e614e25b05da6fd08a38ffa0cfe1903fdec
 init version: de40ad0
 Security Options:
  seccomp
   Profile: default
 Kernel Version: 5.11.4-1.fc25.qubes.x86_64
 Operating System: Fedora 33 (Thirty Three)
 OSType: linux
 Architecture: x86_64
 CPUs: 2
 Total Memory: 2.506GiB
 Name: k8s-test
 ID: LLY7:NVU7:EA6V:5TEB:G37H:YZA7:4JUE:7CQV:JOBT:XBSA:HS5V:IOHD
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false
viq added the bug label on Mar 12, 2021

viq commented Mar 12, 2021

Meanwhile, creating a single-server cluster, for example k3d cluster create -a 5 -s 1 --registry-create single-server, works properly.

iwilltry42 self-assigned this on Mar 15, 2021
iwilltry42 added this to the v4.4.0 milestone on Mar 15, 2021
iwilltry42 commented:

Hi @viq, thanks for opening this issue!
I can't seem to replicate this (though I'm not testing on Fedora).
Could it be related to some firewall settings required on Fedora? (I've read about such issues in a different form before, e.g. in #423.)
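
For reference, a minimal sketch of how one might rule firewalld in or out on Fedora before digging deeper (stopping the service is only a temporary diagnostic, not a recommended setting):

$ sudo firewall-cmd --state
$ sudo firewall-cmd --get-active-zones
# temporarily stop firewalld, then retry the multi-server create
$ sudo systemctl stop firewalld
$ k3d cluster create -a 1 -s 3 --registry-create viq-test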


viq commented Mar 15, 2021

Hm, I flushed all firewall rules and changed policies to ACCEPT as far as I can tell:

$ sudo iptables -vnL
Chain INPUT (policy ACCEPT 0 packets, 0 bytes)
 pkts bytes target     prot opt in     out     source               destination         

Chain FORWARD (policy ACCEPT 0 packets, 0 bytes)
 pkts bytes target     prot opt in     out     source               destination         
 116K  121M DOCKER-USER  all  --  *      *       0.0.0.0/0            0.0.0.0/0           
 116K  121M DOCKER-ISOLATION-STAGE-1  all  --  *      *       0.0.0.0/0            0.0.0.0/0           
    0     0 ACCEPT     all  --  *      docker0  0.0.0.0/0            0.0.0.0/0            ctstate RELATED,ESTABLISHED
    0     0 DOCKER     all  --  *      docker0  0.0.0.0/0            0.0.0.0/0           
    0     0 ACCEPT     all  --  docker0 !docker0  0.0.0.0/0            0.0.0.0/0           
    0     0 ACCEPT     all  --  docker0 docker0  0.0.0.0/0            0.0.0.0/0           

Chain OUTPUT (policy ACCEPT 0 packets, 0 bytes)
 pkts bytes target     prot opt in     out     source               destination         

Chain DOCKER (1 references)
 pkts bytes target     prot opt in     out     source               destination         

Chain DOCKER-ISOLATION-STAGE-1 (1 references)
 pkts bytes target     prot opt in     out     source               destination         
    0     0 DOCKER-ISOLATION-STAGE-2  all  --  docker0 !docker0  0.0.0.0/0            0.0.0.0/0           
 234K  242M RETURN     all  --  *      *       0.0.0.0/0            0.0.0.0/0           

Chain DOCKER-ISOLATION-STAGE-2 (1 references)
 pkts bytes target     prot opt in     out     source               destination         
    0     0 DROP       all  --  *      docker0  0.0.0.0/0            0.0.0.0/0           
59987 4017K RETURN     all  --  *      *       0.0.0.0/0            0.0.0.0/0           

Chain DOCKER-USER (1 references)
 pkts bytes target     prot opt in     out     source               destination         
 234K  242M RETURN     all  --  *      *       0.0.0.0/0            0.0.0.0/0           

Chain QBS-FORWARD (0 references)
 pkts bytes target     prot opt in     out     source               destination         

and on the second server I'm still getting loops of

{"level":"warn","ts":"2021-03-15T21:35:37.746Z","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"passthrough:///https://127.0.0.1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused\""}              
time="2021-03-15T21:35:37.746649982Z" level=error msg="Failed to check local etcd status for learner management: context deadline exceeded"
{"level":"warn","ts":"2021-03-15T21:35:37.803Z","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"passthrough:///https://127.0.0.1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused\""}              
time="2021-03-15T21:35:37.803375956Z" level=info msg="Failed to test data store connection: context deadline exceeded"
{"level":"warn","ts":"2021-03-15T21:35:42.746Z","caller":"grpclog/grpclog.go:60","msg":"grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:2379  <nil> 0 <nil>}. Err :connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused\". Reconnecting..."}
{"level":"warn","ts":"2021-03-15T21:35:42.803Z","caller":"grpclog/grpclog.go:60","msg":"grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:2379  <nil> 0 <nil>}. Err :connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused\". Reconnecting..."}
{"level":"warn","ts":"2021-03-15T21:35:43.747Z","caller":"grpclog/grpclog.go:60","msg":"grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:2379  <nil> 0 <nil>}. Err :connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused\". Reconnecting..."}
{"level":"warn","ts":"2021-03-15T21:35:43.804Z","caller":"grpclog/grpclog.go:60","msg":"grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:2379  <nil> 0 <nil>}. Err :connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused\". Reconnecting..."}
{"level":"warn","ts":"2021-03-15T21:35:45.190Z","caller":"grpclog/grpclog.go:60","msg":"grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:2379  <nil> 0 <nil>}. Err :connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused\". Reconnecting..."}
{"level":"warn","ts":"2021-03-15T21:35:45.238Z","caller":"grpclog/grpclog.go:60","msg":"grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:2379  <nil> 0 <nil>}. Err :connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused\". Reconnecting..."}
{"level":"warn","ts":"2021-03-15T21:35:48.042Z","caller":"grpclog/grpclog.go:60","msg":"grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:2379  <nil> 0 <nil>}. Err :connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused\". Reconnecting..."}
{"level":"warn","ts":"2021-03-15T21:35:48.211Z","caller":"grpclog/grpclog.go:60","msg":"grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:2379  <nil> 0 <nil>}. Err :connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused\". Reconnecting..."}
{"level":"warn","ts":"2021-03-15T21:35:52.001Z","caller":"grpclog/grpclog.go:60","msg":"grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:2379  <nil> 0 <nil>}. Err :connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused\". Reconnecting..."}
{"level":"warn","ts":"2021-03-15T21:35:52.746Z","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"passthrough:///https://127.0.0.1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused\""}              
time="2021-03-15T21:35:52.746507956Z" level=error msg="Failed to check local etcd status for learner management: context deadline exceeded"

It doesn't seem to be an issue with nftables, since

[root@k8s-test ~]# ls -l /usr/sbin/iptables
lrwxrwxrwx 1 root root 26 Dec 10 05:12 /usr/sbin/iptables -> /etc/alternatives/iptables
[root@k8s-test ~]# ls -l /etc/alternatives/iptables
lrwxrwxrwx 1 root root 25 Dec 10 05:12 /etc/alternatives/iptables -> /usr/sbin/iptables-legacy
[root@k8s-test ~]# ls -l /usr/sbin/iptables-legacy
lrwxrwxrwx 1 root root 20 Jan 28 15:40 /usr/sbin/iptables-legacy -> xtables-legacy-multi
[root@k8s-test ~]# ls -l /usr/sbin/xtables-legacy-multi 
-rwxr-xr-x 1 root root 101640 Jan 28 15:40 /usr/sbin/xtables-legacy-multi

It most likely is just a Fedora / Qubes thing; I'm just thinking aloud here, a bit in hopes you'll spot something or have some pointers ;)
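
A shorter way to confirm which backend the iptables binary resolves to (iptables 1.8+ reports it directly in its version string, so the symlink chase above is not strictly necessary):

$ iptables --version
iptables v1.8.5 (legacy)    # example output; "(nf_tables)" here would indicate the nftables shim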


viq commented Mar 15, 2021

Oh, I may have found something:

W0315 22:06:53.432911       9 proxier.go:651] Failed to read file /lib/modules/5.11.4-1.fc25.qubes.x86_64/modules.builtin with error open /lib/modules/5.11.4-1.fc25.qubes.x86_64/modules.builtin: no such file or directory. You can ignore this message when kube-proxy is running inside container without mounting /lib/modules
W0315 22:06:53.484766       9 proxier.go:661] Failed to load kernel module ip_vs with modprobe. You can ignore this message when kube-proxy is running inside container without mounting /lib/modules
W0315 22:06:53.521118       9 proxier.go:661] Failed to load kernel module ip_vs_rr with modprobe. You can ignore this message when kube-proxy is running inside container without mounting /lib/modules
W0315 22:06:53.571583       9 proxier.go:661] Failed to load kernel module ip_vs_wrr with modprobe. You can ignore this message when kube-proxy is running inside container without mounting /lib/modules
W0315 22:06:53.575478       9 proxier.go:661] Failed to load kernel module ip_vs_sh with modprobe. You can ignore this message when kube-proxy is running inside container without mounting /lib/modules
W0315 22:06:53.576683       9 proxier.go:661] Failed to load kernel module nf_conntrack with modprobe. You can ignore this message when kube-proxy is running inside container without mounting /lib/modules

Is this a legitimate issue, or is this indeed "running inside container" and to be expected?

iwilltry42 added the help wanted label on Mar 16, 2021

iwilltry42 commented Mar 16, 2021

It most likely is just a Fedora / Qubes thing; I'm just thinking aloud here, a bit in hopes you'll spot something or have some pointers ;)

Unfortunately, I'm completely unfamiliar with Fedora, as I've never used it... but it does seem to introduce lots of issues when trying to use k3d 🤔

kube-proxy warnings
Is this a legitimate issue, or is this indeed "running inside container" and to be expected?

This is expected and doesn't cause any issues (those logs appear for me as well and everything starts up normally).

Could it in some way be related to firewalld on Fedora? (like mentioned here: #423 (comment))
Maybe @Ka0o0 even has some idea here 🤔

Edit: the KIND docs have an additional entry regarding firewalld/nftables
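
If the kube-proxy module warnings should be silenced anyway, the host's module directory can be mounted into the nodes with k3d's volume flag; a sketch, not expected to change anything for the etcd failure discussed above:

$ k3d cluster create viq-test -a 1 -s 3 --registry-create --volume /lib/modules:/lib/modules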

iwilltry42 added the os/fedora label on Mar 16, 2021

viq commented Mar 18, 2021

It's probably more related to Qubes than Fedora as such. I'll try to play with different scenarios on different distros over the weekend and get back to you.


viq commented Mar 26, 2021

Sorry for the delay. I'll be investigating more, but I've just set up fresh VMs of Workstation versions of Fedora 33 and Ubuntu 20.04 with docker, and indeed creating a cluster with 3 servers succeeded on Ubuntu and failed on Fedora. I'll try to find out what the difference is.


viq commented Mar 28, 2021

It looks like on Fedora it fails because of cgroup v2. Comparing the first server's logs between Fedora and Ubuntu, the divergence on Fedora starts around here:

E0328 12:15:14.338719       8 kubelet.go:2264] node "k3d-multinode-server-0" not found
I0328 12:15:14.339865       8 kubelet_node_status.go:71] Attempting to register node k3d-multinode-server-0
I0328 12:15:14.356500       8 policy_none.go:43] [cpumanager] none policy: Start
I0328 12:15:14.366353       8 kubelet_node_status.go:74] Successfully registered node k3d-multinode-server-0
E0328 12:15:14.369934       8 node_container_manager_linux.go:57] Failed to create ["kubepods"] cgroup
F0328 12:15:14.370108       8 kubelet.go:1368] Failed to start ContainerManager cannot enter cgroupv2 "/sys/fs/cgroup/kubepods" with domain controllers -- it is in an invalid state
goroutine 10391 [running]:
github.com/rancher/k3s/vendor/k8s.io/klog/v2.stacks(0xc000120001, 0xc0100950e0, 0xb5, 0x1cf)
        /go/src/github.com/rancher/k3s/vendor/k8s.io/klog/v2/klog.go:1026 +0xb9
github.com/rancher/k3s/vendor/k8s.io/klog/v2.(*loggingT).output(0x70557a0, 0xc000000003, 0x0, 0x0, 0xc00806a0e0, 0x6d07591, 0xa, 0x558, 0x0)
        /go/src/github.com/rancher/k3s/vendor/k8s.io/klog/v2/klog.go:975 +0x19b
github.com/rancher/k3s/vendor/k8s.io/klog/v2.(*loggingT).printf(0x70557a0, 0xc000000003, 0x0, 0x0, 0x0, 0x0, 0x4661cf8, 0x23, 0xc00b88f940, 0x1, ...)
        /go/src/github.com/rancher/k3s/vendor/k8s.io/klog/v2/klog.go:750 +0x191
github.com/rancher/k3s/vendor/k8s.io/klog/v2.Fatalf(...)
        /go/src/github.com/rancher/k3s/vendor/k8s.io/klog/v2/klog.go:1502
github.com/rancher/k3s/vendor/k8s.io/kubernetes/pkg/kubelet.(*Kubelet).initializeRuntimeDependentModules(0xc00a298000)
        /go/src/github.com/rancher/k3s/vendor/k8s.io/kubernetes/pkg/kubelet/kubelet.go:1368 +0x446

though before that there is a

W0328 12:15:09.172259       8 manager.go:159] Cannot detect current cgroup on cgroup v2
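
A quick way to confirm which cgroup hierarchy the host is actually running (assuming Docker 20.10+, which reports the version in docker info):

$ stat -fc %T /sys/fs/cgroup/                  # "cgroup2fs" on the unified v2 hierarchy, "tmpfs" on the legacy v1 layout
$ docker info --format '{{.CgroupVersion}}'    # prints 1 or 2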


viq commented Mar 28, 2021

On a "normal" Fedora 33 k3d cluster create multinode -s 2 --no-lb --timestamps --no-rollback worked after setting in kernel parameters systemd.unified_cgroup_hierarchy=0. I'm now retrying on Qubes.


nguyenvulong commented May 27, 2021

The same thing happened to me. I got a similar log to @viq's when --timeout is set:

Context deadline exceeded

If I do not set the timeout, it hangs forever.

But when I tried with 5 servers, it also did not work.
I linked this to my issue for further investigation.

[screenshot attached]

iwilltry42 commented:

Hey 👋
Is this still a thing with k3d v4.4.4 (or v4.4.3 and K3D_FIX_CGROUPV2=true set)?
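
The suggested check, reusing the create command from the original report with the cgroup v2 fix enabled (the environment variable applies to v4.4.3 as noted above; v4.4.4 presumably ships the fix without it):

$ K3D_FIX_CGROUPV2=true k3d cluster create -a 1 -s 3 --registry-create viq-test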

iwilltry42 commented:

Hey @viq, do you have any update on this issue, or can it be closed now that the cgroup v2 fix has been introduced?

iwilltry42 modified the milestones: v4.4.5, Backlog on Jun 11, 2021
iwilltry42 commented:

Closing this issue due to inactivity, and since a potential fix has been provided. Feel free to reopen if it still applies 👍
