Cannot start HA/embedded etcd through NAT #2850

Closed
optix2000 opened this issue Jan 23, 2021 · 7 comments

@optix2000

Environmental Info:
K3s Version:

k3s version v1.20.2+k3s1 (1d4adb03)
go version go1.15.5

Node(s) CPU architecture, OS, and Version: Stock Debian 10
Linux k3s2 4.19.0-13-amd64 #1 SMP Debian 4.19.160-2 (2020-11-28) x86_64 GNU/Linux

Cluster Configuration:
3 masters: 1 behind a NAT with all traffic forwarded; the others have public IPs.

Host A:

  • VPS with weird networking
  • Host behind NAT
  • Private IP (10.x.x.x) on the interface
  • Has publicly routable IP with all traffic NATed

Host B/C:

  • Standard VPS
  • Public IP on interface

Describe the bug:
etcd fails to connect because it tries to use Host A's internal IP, soft-locking k3s in an infinite loop.

Jan 23 10:14:19 k3s2 k3s[9216]: {"level":"warn","ts":"2021-01-23T10:14:19.777Z","caller":"etcdserver/cluster_util.go:76","msg":"failed to get cluster response","address":"https://10.64.244.91:2380/members","error":"Get \"https://10.64.244.91:2380/members\": dial tcp 10.64.244.91:2380: i/o timeout"}
Jan 23 10:14:19 k3s2 k3s[9216]: {"level":"info","ts":"2021-01-23T10:14:19.778Z","caller":"embed/etcd.go:363","msg":"closing etcd server","name":"k3s2-21335ec2","data-dir":"/var/lib/rancher/k3s/server/db/etcd","advertise-peer-urls":["http://localhost:2380"],"advertise-client-urls":["https://144.202.111.132:2379"]}
Jan 23 10:14:19 k3s2 k3s[9216]: {"level":"info","ts":"2021-01-23T10:14:19.779Z","caller":"embed/etcd.go:367","msg":"closed etcd server","name":"k3s2-21335ec2","data-dir":"/var/lib/rancher/k3s/server/db/etcd","advertise-peer-urls":["http://localhost:2380"],"advertise-client-urls":["https://144.202.111.132:2379"]}

As an aside, etcd never retries because the server shuts down, which could cause problems during a transient network failure.

Steps To Reproduce:

  1. Installed K3s on Host A: curl -sfL https://get.k3s.io | K3S_TOKEN=test sh -s - --cluster-init
  2. Installed K3s on Host B: curl -sfL https://get.k3s.io | K3S_TOKEN=test sh -s - --server https://<host-A-external-IP>:6443

Trying to pass --node-external-ip <host-A-IP> on Host A doesn't affect etcd.
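
For reference, the attempted workaround was roughly the following (a reconstruction; <host-A-external-IP> is a placeholder):

curl -sfL https://get.k3s.io | K3S_TOKEN=test sh -s - --cluster-init --node-external-ip <host-A-external-IP>

The flag sets the node's advertised ExternalIP, but as noted above it does not change the embedded etcd peer URLs, which keep the internal IP.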

Reversing the roles (i.e. Host B runs cluster-init, Host A connects) doesn't work either, and results in a slightly different error.

Expected behavior:
etcd bootstraps using the external IP of Host A instead of the internal IP.

Actual behavior:
etcd tries to bootstrap using the internal IP of Host A and fails to connect (as the internal IP is not reachable from the other hosts).

Additional context / logs:
Host B logs:

Jan 23 10:13:49 k3s2 systemd[1]: Starting Lightweight Kubernetes...
Jan 23 10:13:49 k3s2 k3s[9216]: time="2021-01-23T10:13:49Z" level=info msg="Acquiring lock file /var/lib/rancher/k3s/data/.lock"
Jan 23 10:13:49 k3s2 k3s[9216]: time="2021-01-23T10:13:49Z" level=info msg="Preparing data dir /var/lib/rancher/k3s/data/8f4b194129852507eab4a55117fc942e0688ec9a70ffdaa5911ccc6652220f76"
Jan 23 10:13:51 k3s2 k3s[9216]: time="2021-01-23T10:13:51.646991720Z" level=info msg="Starting k3s v1.20.2+k3s1 (1d4adb03)"
Jan 23 10:13:52 k3s2 k3s[9216]: time="2021-01-23T10:13:52.852626481Z" level=info msg="Managed etcd cluster not yet initialized"
Jan 23 10:13:53 k3s2 k3s[9216]: time="2021-01-23T10:13:53.302173717Z" level=info msg="certificate CN=system:admin,O=system:masters signed by CN=k3s-client-ca@1611396783: notBefore=2021-01-23 10:13:03 +0000 UTC notAfter=2022-01-23 10:13:53 +0000 UTC"
Jan 23 10:13:53 k3s2 k3s[9216]: time="2021-01-23T10:13:53.303310644Z" level=info msg="certificate CN=system:kube-controller-manager signed by CN=k3s-client-ca@1611396783: notBefore=2021-01-23 10:13:03 +0000 UTC notAfter=2022-01-23 10:13:53 +0000 UTC"
Jan 23 10:13:53 k3s2 k3s[9216]: time="2021-01-23T10:13:53.304088771Z" level=info msg="certificate CN=system:kube-scheduler signed by CN=k3s-client-ca@1611396783: notBefore=2021-01-23 10:13:03 +0000 UTC notAfter=2022-01-23 10:13:53 +0000 UTC"
Jan 23 10:13:53 k3s2 k3s[9216]: time="2021-01-23T10:13:53.304858188Z" level=info msg="certificate CN=kube-apiserver signed by CN=k3s-client-ca@1611396783: notBefore=2021-01-23 10:13:03 +0000 UTC notAfter=2022-01-23 10:13:53 +0000 UTC"
Jan 23 10:13:53 k3s2 k3s[9216]: time="2021-01-23T10:13:53.305607942Z" level=info msg="certificate CN=system:kube-proxy signed by CN=k3s-client-ca@1611396783: notBefore=2021-01-23 10:13:03 +0000 UTC notAfter=2022-01-23 10:13:53 +0000 UTC"
Jan 23 10:13:53 k3s2 k3s[9216]: time="2021-01-23T10:13:53.306327781Z" level=info msg="certificate CN=system:k3s-controller signed by CN=k3s-client-ca@1611396783: notBefore=2021-01-23 10:13:03 +0000 UTC notAfter=2022-01-23 10:13:53 +0000 UTC"
Jan 23 10:13:53 k3s2 k3s[9216]: time="2021-01-23T10:13:53.307009780Z" level=info msg="certificate CN=cloud-controller-manager signed by CN=k3s-client-ca@1611396783: notBefore=2021-01-23 10:13:03 +0000 UTC notAfter=2022-01-23 10:13:53 +0000 UTC"
Jan 23 10:13:53 k3s2 k3s[9216]: time="2021-01-23T10:13:53.307866060Z" level=info msg="certificate CN=kube-apiserver signed by CN=k3s-server-ca@1611396783: notBefore=2021-01-23 10:13:03 +0000 UTC notAfter=2022-01-23 10:13:53 +0000 UTC"
Jan 23 10:13:53 k3s2 k3s[9216]: time="2021-01-23T10:13:53.308625557Z" level=info msg="certificate CN=system:auth-proxy signed by CN=k3s-request-header-ca@1611396783: notBefore=2021-01-23 10:13:03 +0000 UTC notAfter=2022-01-23 10:13:53 +0000 UTC"
Jan 23 10:13:53 k3s2 k3s[9216]: time="2021-01-23T10:13:53.309501071Z" level=info msg="certificate CN=etcd-server signed by CN=etcd-server-ca@1611396783: notBefore=2021-01-23 10:13:03 +0000 UTC notAfter=2022-01-23 10:13:53 +0000 UTC"
Jan 23 10:13:53 k3s2 k3s[9216]: time="2021-01-23T10:13:53.310114119Z" level=info msg="certificate CN=etcd-client signed by CN=etcd-server-ca@1611396783: notBefore=2021-01-23 10:13:03 +0000 UTC notAfter=2022-01-23 10:13:53 +0000 UTC"
Jan 23 10:13:53 k3s2 k3s[9216]: time="2021-01-23T10:13:53.310798291Z" level=info msg="certificate CN=etcd-peer signed by CN=etcd-peer-ca@1611396783: notBefore=2021-01-23 10:13:03 +0000 UTC notAfter=2022-01-23 10:13:53 +0000 UTC"
Jan 23 10:13:53 k3s2 k3s[9216]: time="2021-01-23T10:13:53.314596648Z" level=info msg="certificate CN=k3s,O=k3s signed by CN=k3s-server-ca@1611396783: notBefore=2021-01-23 10:13:03 +0000 UTC notAfter=2022-01-23 10:13:53 +0000 UTC"
Jan 23 10:13:53 k3s2 k3s[9216]: time="2021-01-23T10:13:53.315058588Z" level=info msg="Active TLS secret  (ver=) (count 7): map[listener.cattle.io/cn-10.43.0.1:10.43.0.1 listener.cattle.io/cn-127.0.0.1:127.0.0.1 listener.cattle.io/cn-144.202.111.132:144.202.111.132 listener.cattle.io/cn-kubernetes:kubernetes listener.cattle.io/cn-kubernetes.default:kubernetes.default listener.cattle.io/cn-kubernetes.default.svc.cluster.local:kubernetes.default.svc.cluster.local listener.cattle.io/cn-localhost:localhost listener.cattle.io/fingerprint:SHA1=6054879B80E51EF66AA88E51BFF27DB275633C62]"
Jan 23 10:14:13 k3s2 k3s[9216]: {"level":"warn","ts":"2021-01-23T10:14:13.757Z","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"endpoint://client-65ce3af4-f120-4b3c-81ff-34b3771b3df5/10.64.244.91:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: Error while dialing dial tcp 10.64.244.91:2379: i/o timeout\""}
Jan 23 10:14:13 k3s2 k3s[9216]: time="2021-01-23T10:14:13.757637368Z" level=error msg="Failed to get member list from etcd cluster. Will assume this member is already added"
Jan 23 10:14:13 k3s2 k3s[9216]: time="2021-01-23T10:14:13.757680652Z" level=info msg="Starting etcd for cluster [k3s1-13136390=https://10.64.244.91:2380 k3s2-21335ec2=https://144.202.111.132:2380]"
Jan 23 10:14:13 k3s2 k3s[9216]: {"level":"info","ts":"2021-01-23T10:14:13.769Z","caller":"embed/etcd.go:117","msg":"configuring peer listeners","listen-peer-urls":["https://144.202.111.132:2380"]}
Jan 23 10:14:13 k3s2 k3s[9216]: {"level":"info","ts":"2021-01-23T10:14:13.770Z","caller":"embed/etcd.go:468","msg":"starting with peer TLS","tls-info":"cert = /var/lib/rancher/k3s/server/tls/etcd/peer-server-client.crt, key = /var/lib/rancher/k3s/server/tls/etcd/peer-server-client.key, trusted-ca = /var/lib/rancher/k3s/server/tls/etcd/peer-ca.crt, client-cert-auth = true, crl-file = ","cipher-suites":[]}
Jan 23 10:14:13 k3s2 k3s[9216]: {"level":"info","ts":"2021-01-23T10:14:13.770Z","caller":"embed/etcd.go:127","msg":"configuring client listeners","listen-client-urls":["https://127.0.0.1:2379","https://144.202.111.132:2379"]}
Jan 23 10:14:13 k3s2 k3s[9216]: {"level":"info","ts":"2021-01-23T10:14:13.771Z","caller":"embed/etcd.go:302","msg":"starting an etcd server","etcd-version":"3.4.13","git-sha":"Not provided (use ./build instead of go build)","go-version":"go1.15.5","go-os":"linux","go-arch":"amd64","max-cpu-set":1,"max-cpu-available":1,"member-initialized":false,"name":"k3s2-21335ec2","data-dir":"/var/lib/rancher/k3s/server/db/etcd","wal-dir":"","wal-dir-dedicated":"","member-dir":"/var/lib/rancher/k3s/server/db/etcd/member","force-new-cluster":false,"heartbeat-interval":"500ms","election-timeout":"5s","initial-election-tick-advance":true,"snapshot-count":100000,"snapshot-catchup-entries":5000,"initial-advertise-peer-urls":["http://localhost:2380"],"listen-peer-urls":["https://144.202.111.132:2380"],"advertise-client-urls":["https://144.202.111.132:2379"],"listen-client-urls":["https://127.0.0.1:2379","https://144.202.111.132:2379"],"listen-metrics-urls":["http://127.0.0.1:2381"],"cors":["*"],"host-whitelist":["*"],"initial-cluster":"k3s1-13136390=https://10.64.244.91:2380,k3s2-21335ec2=https://144.202.111.132:2380","initial-cluster-state":"existing","initial-cluster-token":"etcd-cluster","quota-size-bytes":2147483648,"pre-vote":false,"initial-corrupt-check":false,"corrupt-check-time-interval":"0s","auto-compaction-mode":"","auto-compaction-retention":"0s","auto-compaction-interval":"0s","discovery-url":"","discovery-proxy":""}
Jan 23 10:14:13 k3s2 k3s[9216]: {"level":"info","ts":"2021-01-23T10:14:13.776Z","caller":"etcdserver/backend.go:80","msg":"opened backend db","path":"/var/lib/rancher/k3s/server/db/etcd/member/snap/db","took":"4.638771ms"}
Jan 23 10:14:18 k3s2 k3s[9216]: {"level":"warn","ts":"2021-01-23T10:14:18.315Z","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"passthrough:///https://127.0.0.1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused\""}
Jan 23 10:14:18 k3s2 k3s[9216]: time="2021-01-23T10:14:18.316532671Z" level=error msg="Failed to check local etcd status for learner management: context deadline exceeded"
Jan 23 10:14:18 k3s2 k3s[9216]: {"level":"warn","ts":"2021-01-23T10:14:18.316Z","caller":"grpclog/grpclog.go:60","msg":"grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:2379  <nil> 0 <nil>}. Err :connection error: desc = \"transport: authentication handshake failed: context canceled\". Reconnecting..."}
Jan 23 10:14:19 k3s2 k3s[9216]: {"level":"warn","ts":"2021-01-23T10:14:19.777Z","caller":"etcdserver/cluster_util.go:76","msg":"failed to get cluster response","address":"https://10.64.244.91:2380/members","error":"Get \"https://10.64.244.91:2380/members\": dial tcp 10.64.244.91:2380: i/o timeout"}
Jan 23 10:14:19 k3s2 k3s[9216]: {"level":"info","ts":"2021-01-23T10:14:19.778Z","caller":"embed/etcd.go:363","msg":"closing etcd server","name":"k3s2-21335ec2","data-dir":"/var/lib/rancher/k3s/server/db/etcd","advertise-peer-urls":["http://localhost:2380"],"advertise-client-urls":["https://144.202.111.132:2379"]}
Jan 23 10:14:19 k3s2 k3s[9216]: {"level":"info","ts":"2021-01-23T10:14:19.779Z","caller":"embed/etcd.go:367","msg":"closed etcd server","name":"k3s2-21335ec2","data-dir":"/var/lib/rancher/k3s/server/db/etcd","advertise-peer-urls":["http://localhost:2380"],"advertise-client-urls":["https://144.202.111.132:2379"]}
Jan 23 10:14:19 k3s2 k3s[9216]: time="2021-01-23T10:14:19.779865470Z" level=info msg="Running kube-apiserver --advertise-port=6443 --allow-privileged=true --anonymous-auth=false --api-audiences=unknown --authorization-mode=Node,RBAC --bind-address=127.0.0.1 --cert-dir=/var/lib/rancher/k3s/server/tls/temporary-certs --client-ca-file=/var/lib/rancher/k3s/server/tls/client-ca.crt --enable-admission-plugins=NodeRestriction --etcd-cafile=/var/lib/rancher/k3s/server/tls/etcd/server-ca.crt --etcd-certfile=/var/lib/rancher/k3s/server/tls/etcd/client.crt --etcd-keyfile=/var/lib/rancher/k3s/server/tls/etcd/client.key --etcd-servers=https://127.0.0.1:2379 --feature-gates=ServiceAccountIssuerDiscovery=false --insecure-port=0 --kubelet-certificate-authority=/var/lib/rancher/k3s/server/tls/server-ca.crt --kubelet-client-certificate=/var/lib/rancher/k3s/server/tls/client-kube-apiserver.crt --kubelet-client-key=/var/lib/rancher/k3s/server/tls/client-kube-apiserver.key --profiling=false --proxy-client-cert-file=/var/lib/rancher/k3s/server/tls/client-auth-proxy.crt --proxy-client-key-file=/var/lib/rancher/k3s/server/tls/client-auth-proxy.key --requestheader-allowed-names=system:auth-proxy --requestheader-client-ca-file=/var/lib/rancher/k3s/server/tls/request-header-ca.crt --requestheader-extra-headers-prefix=X-Remote-Extra- --requestheader-group-headers=X-Remote-Group --requestheader-username-headers=X-Remote-User --secure-port=6444 --service-account-issuer=k3s --service-account-key-file=/var/lib/rancher/k3s/server/tls/service.key --service-account-signing-key-file=/var/lib/rancher/k3s/server/tls/service.key --service-cluster-ip-range=10.43.0.0/16 --storage-backend=etcd3 --tls-cert-file=/var/lib/rancher/k3s/server/tls/serving-kube-apiserver.crt --tls-private-key-file=/var/lib/rancher/k3s/server/tls/serving-kube-apiserver.key"
Jan 23 10:14:19 k3s2 k3s[9216]: {"level":"warn","ts":"2021-01-23T10:14:19.780Z","caller":"grpclog/grpclog.go:60","msg":"grpc: addrConn.createTransport failed to connect to {https://10.64.244.91:2379  <nil> 0 <nil>}. Err :connection error: desc = \"transport: Error while dialing dial tcp 10.64.244.91:2379: operation was canceled\". Reconnecting..."}
Jan 23 10:14:19 k3s2 k3s[9216]: {"level":"warn","ts":"2021-01-23T10:14:19.780Z","caller":"grpclog/grpclog.go:60","msg":"grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:2379  <nil> 0 <nil>}. Err :connection error: desc = \"transport: authentication handshake failed: read tcp 127.0.0.1:53766->127.0.0.1:2379: read: connection reset by peer\". Reconnecting..."}
Jan 23 10:14:19 k3s2 k3s[9216]: {"level":"warn","ts":"2021-01-23T10:14:19.781Z","caller":"grpclog/grpclog.go:60","msg":"grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:2379  <nil> 0 <nil>}. Err :connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused\". Reconnecting..."}
Jan 23 10:14:20 k3s2 k3s[9216]: {"level":"warn","ts":"2021-01-23T10:14:20.781Z","caller":"grpclog/grpclog.go:60","msg":"grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:2379  <nil> 0 <nil>}. Err :connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused\". Reconnecting..."}
Jan 23 10:14:22 k3s2 k3s[9216]: {"level":"warn","ts":"2021-01-23T10:14:22.099Z","caller":"grpclog/grpclog.go:60","msg":"grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:2379  <nil> 0 <nil>}. Err :connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused\". Reconnecting..."}
...
@brandond
Member

brandond commented Jan 23, 2021

That is correct. At the moment, all etcd nodes must be able to reach each other via their internal IPs. Due to etcd's requirement for low latency between nodes, we don't generally expect to see it deployed in a distributed configuration.
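
The reachability requirement can be checked from Host B with a quick port probe (illustrative commands; 10.64.244.91 is Host A's internal IP from the logs above):

nc -vz -w 5 10.64.244.91 2380            # peer port via internal IP: times out, matching the i/o timeout in the logs
nc -vz -w 5 <host-A-external-IP> 2380    # via the NATed public IP: succeeds only if the NAT forwards 2380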

@optix2000
Author

I'm a bit curious about this, since it prevents using the embedded etcd for regional HA. It also prevents odd network configurations from working, even if they are low latency (<5 ms).

@optix2000
Author

Due to etcd's requirement for low latency between nodes, we don't generally expect to see it deployed in a distributed configuration.

Where does etcd require this? From their docs, the default heartbeat interval is 100 ms, which in turn allows for 50-150 ms of latency between etcd nodes.
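
For reference, both values are tunable in upstream etcd (a sketch using stock etcd flags; note from the startup log above that k3s's embedded etcd actually runs with a 500 ms heartbeat and a 5 s election timeout):

# upstream etcd defaults, in milliseconds
etcd --heartbeat-interval=100 --election-timeout=1000
# etcd's tuning guide suggests a heartbeat of roughly the RTT between members
# and an election timeout of about 10x the heartbeat interval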

@brandond
Member

At some point we might extend it so that the server can also advertise the external IPs in the client and peer endpoint lists, but for the initial implementation we wanted to keep things pretty basic and easy to support. This means traditional LAN deployments only.
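
Hypothetically, such an extension would map onto upstream etcd's existing split between listen and advertise URLs (illustrative flags only; k3s did not expose these for the embedded etcd at the time):

etcd --listen-peer-urls https://0.0.0.0:2380 \
     --initial-advertise-peer-urls https://<host-A-external-IP>:2380 \
     --listen-client-urls https://0.0.0.0:2379 \
     --advertise-client-urls https://<host-A-external-IP>:2379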

@Tchoupinax

Hello @brandond,

Does this mean k3s will never support multi-region clusters, even with an external DB (because of the latency)?

@brandond
Member

This is just for embedded etcd, not external databases. You're welcome to try whatever you like, but if the latency is high, performance will probably suffer.
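
For context, pointing k3s at an external datastore goes through the --datastore-endpoint flag (a sketch; the connection string is a placeholder):

curl -sfL https://get.k3s.io | sh -s - server --datastore-endpoint='postgres://user:pass@<db-host>:5432/k3s'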

@stale

stale bot commented Oct 25, 2021

This repository uses a bot to automatically label issues which have not had any activity (commit/comment/label) for 180 days. This helps us manage the community issues better. If the issue is still relevant, please add a comment to the issue so the bot can remove the label and we know it is still valid. If it is no longer relevant (or possibly fixed in the latest release), the bot will automatically close the issue in 14 days. Thank you for your contributions.

@stale stale bot added the status/stale label Oct 25, 2021
@stale stale bot closed this as completed Nov 8, 2021