FAILING TO CREATE HA CLUSTER #5249

Closed · rkitindi-kr opened this issue Jan 11, 2024 · 7 comments

Comments

@rkitindi-kr commented Jan 11, 2024

Hi Guys,

I am pretty new to RKE2 and I need your help.

I am trying to create an HA RKE2 cluster, but I am getting a bunch of errors and I am not sure what the problem is. The cluster join fails, and these are the error logs I see:

Jan 11 18:37:34 ecs1u rke2[1891416]: time="2024-01-11T18:37:34Z" level=info msg="Running containerd -c /var/lib/rancher/rke2/agent/etc/containerd/config.toml -a /run/k3s/containerd/containerd.sock --state /run/k3s/containerd --root /var/lib/rancher/rke2/agent/containerd"
Jan 11 18:37:34 ecs1u rke2[1891416]: W0111 18:37:34.164206 1891416 logging.go:59] [core] [Channel #7 SubChannel #8] grpc: addrConn.createTransport failed to connect to {Addr: "/run/k3s/containerd/containerd.sock", ServerName: "%2Frun%2Fk3s%2Fcontainerd%2Fcontainerd.sock", }. Err: connection error: desc = "transport>
Jan 11 18:37:34 ecs1u rke2[1891416]: W0111 18:37:34.533320 1891416 logging.go:59] [core] [Channel #3 SubChannel #4] grpc: addrConn.createTransport failed to connect to {Addr: "127.0.0.1:2379", ServerName: "127.0.0.1", }. Err: connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:2379: connect>
Jan 11 18:37:35 ecs1u rke2[1891416]: time="2024-01-11T18:37:35Z" level=info msg="containerd is now running"
Jan 11 18:37:35 ecs1u rke2[1891416]: time="2024-01-11T18:37:35Z" level=debug msg="Deleting existing lease: {rke2 2024-01-11 18:22:31.557372149 +0000 UTC map[]}"
Jan 11 18:37:35 ecs1u rke2[1891416]: time="2024-01-11T18:37:35Z" level=info msg="Importing images from /var/lib/rancher/rke2/agent/images/rke2-images.linux-amd64.tar"
Jan 11 18:37:36 ecs1u rke2[1891416]: W0111 18:37:36.825137 1891416 logging.go:59] [core] [Channel #1 SubChannel #2] grpc: addrConn.createTransport failed to connect to {Addr: "127.0.0.1:2379", ServerName: "127.0.0.1", }. Err: connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:2379: connect>
Jan 11 18:37:37 ecs1u rke2[1891416]: W0111 18:37:37.597159 1891416 logging.go:59] [core] [Channel #3 SubChannel #4] grpc: addrConn.createTransport failed to connect to {Addr: "127.0.0.1:2379", ServerName: "127.0.0.1", }. Err: connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:2379: connect>
Jan 11 18:37:40 ecs1u rke2[1891416]: W0111 18:37:40.437631 1891416 logging.go:59] [core] [Channel #1 SubChannel #2] grpc: addrConn.createTransport failed to connect to {Addr: "127.0.0.1:2379", ServerName: "127.0.0.1", }. Err: connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:2379: connect>
Jan 11 18:37:41 ecs1u rke2[1891416]: W0111 18:37:41.507851 1891416 logging.go:59] [core] [Channel #3 SubChannel #4] grpc: addrConn.createTransport failed to connect to {Addr: "127.0.0.1:2379", ServerName: "127.0.0.1", }. Err: connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:2379: connect>
Jan 11 18:37:45 ecs1u rke2[1891416]: W0111 18:37:45.712363 1891416 logging.go:59] [core] [Channel #1 SubChannel #2] grpc: addrConn.createTransport failed to connect to {Addr: "127.0.0.1:2379", ServerName: "127.0.0.1", }. Err: connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:2379: connect>
Jan 11 18:37:46 ecs1u rke2[1891416]: {"level":"warn","ts":"2024-01-11T18:37:46.685863Z","logger":"etcd-client","caller":"[email protected]/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0000396c0/127.0.0.1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded de>
Jan 11 18:37:46 ecs1u rke2[1891416]: {"level":"info","ts":"2024-01-11T18:37:46.6861Z","logger":"etcd-client","caller":"[email protected]/client.go:210","msg":"Auto sync endpoints failed.","error":"context deadline exceeded"}
Jan 11 18:37:49 ecs1u rke2[1891416]: W0111 18:37:49.226046 1891416 logging.go:59] [core] [Channel #3 SubChannel #4] grpc: addrConn.createTransport failed to connect to {Addr: "127.0.0.1:2379", ServerName: "127.0.0.1", }. Err: connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:2379: connect>
Jan 11 18:37:52 ecs1u rke2[1891416]: time="2024-01-11T18:37:52Z" level=info msg="Pod for etcd is synced"
Jan 11 18:37:57 ecs1u rke2[1891416]: W0111 18:37:57.861430 1891416 logging.go:59] [core] [Channel #1 SubChannel #2] grpc: addrConn.createTransport failed to connect to {Addr: "127.0.0.1:2379", ServerName: "127.0.0.1", }. Err: connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:2379: connect>
Jan 11 18:37:58 ecs1u rke2[1891416]: W0111 18:37:58.559169 1891416 logging.go:59] [core] [Channel #3 SubChannel #4] grpc: addrConn.createTransport failed to connect to {Addr: "127.0.0.1:2379", ServerName: "127.0.0.1", }. Err: connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:2379: connect>
Jan 11 18:38:01 ecs1u rke2[1891416]: time="2024-01-11T18:38:01Z" level=info msg="Waiting for agent to become ready before joining etcd cluster"
Jan 11 18:38:01 ecs1u rke2[1891416]: {"level":"warn","ts":"2024-01-11T18:38:01.685784Z","logger":"etcd-client","caller":"[email protected]/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0000396c0/127.0.0.1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded de>
Jan 11 18:38:01 ecs1u rke2[1891416]: time="2024-01-11T18:38:01Z" level=info msg="Failed to test data store connection: context deadline exceeded"
Jan 11 18:38:01 ecs1u rke2[1891416]: {"level":"warn","ts":"2024-01-11T18:38:01.686706Z","logger":"etcd-client","caller":"[email protected]/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0000396c0/127.0.0.1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded de>
Jan 11 18:38:01 ecs1u rke2[1891416]: {"level":"info","ts":"2024-01-11T18:38:01.686863Z","logger":"etcd-client","caller":"[email protected]/client.go:210","msg":"Auto sync endpoints failed.","error":"context deadline exceeded"}
Jan 11 18:38:02 ecs1u rke2[1891416]: time="2024-01-11T18:38:02Z" level=info msg="Waiting for API server to become available"
Jan 11 18:38:02 ecs1u rke2[1891416]: time="2024-01-11T18:38:02Z" level=info msg="Waiting for etcd server to become available"

Can someone point me in the right direction?
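
For reference, a server joining an existing cluster is configured via /etc/rancher/rke2/config.yaml along these lines (the values below are placeholders, not my real hostnames or token):

# /etc/rancher/rke2/config.yaml on a joining server (placeholder values)
server: https://ecs3u.example.com:9345
token: <contents of /var/lib/rancher/rke2/server/node-token from the first server>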

Thank you

@rkitindi-kr (Author) commented Jan 11, 2024

Errors from another node:

Jan 11 18:49:07 ecs2u rke2[495556]: time="2024-01-11T18:49:07Z" level=info msg="Imported images from /var/lib/rancher/rke2/agent/images/rke2-images.linux-amd64.tar in 1m31.392378348s"
Jan 11 18:49:07 ecs2u rke2[495556]: time="2024-01-11T18:49:07Z" level=info msg="Connecting to proxy" url="wss://127.0.0.1:9345/v1-rke2/connect"
Jan 11 18:49:07 ecs2u rke2[495556]: time="2024-01-11T18:49:07Z" level=debug msg="Kubelet image credential provider bin directory check failed: stat /var/lib/rancher/credentialprovider/bin: no such file or directory"
Jan 11 18:49:07 ecs2u rke2[495556]: time="2024-01-11T18:49:07Z" level=info msg="Running kubelet --address=0.0.0.0 --alsologtostderr=false --anonymous-auth=false --authentication-token-webhook=true --authorization-mode=Webhook --cgroup-driver=systemd --client-ca-file=/var/lib/rancher/rke2/agent/client-ca.crt --cloud>
Jan 11 18:49:07 ecs2u rke2[495556]: time="2024-01-11T18:49:07Z" level=info msg="Cluster-Http-Server 2024/01/11 18:49:07 http: TLS handshake error from 127.0.0.1:53452: remote error: tls: bad certificate"
Jan 11 18:49:07 ecs2u rke2[495556]: time="2024-01-11T18:49:07Z" level=info msg="Adding member ecs2u-973170ea=https://10.159.12.52:2380 to etcd cluster [ecs3u-b37102ae=https://10.159.12.53:2380]"
Jan 11 18:49:07 ecs2u rke2[495556]: time="2024-01-11T18:49:07Z" level=debug msg="Tunnel server handing HTTP/1.1 GET request for /v1-rke2/connect from 127.0.0.1:53438"
Jan 11 18:49:07 ecs2u rke2[495556]: time="2024-01-11T18:49:07Z" level=info msg="Handling backend connection request [ecs2u]"
Jan 11 18:49:07 ecs2u rke2[495556]: time="2024-01-11T18:49:07Z" level=info msg="Starting etcd to join cluster with members [ecs3u-b37102ae=https://10.159.12.53:2380 ecs2u-973170ea=https://10.159.12.52:2380]"
Jan 11 18:49:08 ecs2u rke2[495556]: time="2024-01-11T18:49:08Z" level=info msg="Waiting to retrieve kube-proxy configuration; server is not ready: https://127.0.0.1:9345/v1-rke2/readyz: 500 Internal Server Error"
Jan 11 18:49:12 ecs2u rke2[495556]: time="2024-01-11T18:49:12Z" level=debug msg="Wrote ping"
Jan 11 18:49:13 ecs2u rke2[495556]: {"level":"warn","ts":"2024-01-11T18:49:13.027761Z","logger":"etcd-client","caller":"[email protected]/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0007628c0/127.0.0.1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded des>
Jan 11 18:49:13 ecs2u rke2[495556]: time="2024-01-11T18:49:13Z" level=info msg="Failed to test data store connection: context deadline exceeded"
Jan 11 18:49:13 ecs2u rke2[495556]: time="2024-01-11T18:49:13Z" level=info msg="Cluster-Http-Server 2024/01/11 18:49:13 http: TLS handshake error from 127.0.0.1:43528: remote error: tls: bad certificate"
Jan 11 18:49:13 ecs2u rke2[495556]: time="2024-01-11T18:49:13Z" level=info msg="Waiting to retrieve kube-proxy configuration; server is not ready: https://127.0.0.1:9345/v1-rke2/readyz: 500 Internal Server Error"
Jan 11 18:49:17 ecs2u rke2[495556]: time="2024-01-11T18:49:17Z" level=debug msg="Wrote ping"
Jan 11 18:49:18 ecs2u rke2[495556]: {"level":"warn","ts":"2024-01-11T18:49:18.03341Z","logger":"etcd-client","caller":"[email protected]/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0007628c0/127.0.0.1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc>
Jan 11 18:49:18 ecs2u rke2[495556]: {"level":"info","ts":"2024-01-11T18:49:18.033492Z","logger":"etcd-client","caller":"[email protected]/client.go:210","msg":"Auto sync endpoints failed.","error":"context deadline exceeded"}
Jan 11 18:49:18 ecs2u rke2[495556]: time="2024-01-11T18:49:18Z" level=info msg="Cluster-Http-Server 2024/01/11 18:49:18 http: TLS handshake error from 127.0.0.1:43564: remote error: tls: bad certificate"
Jan 11 18:49:18 ecs2u rke2[495556]: time="2024-01-11T18:49:18Z" level=info msg="Waiting to retrieve kube-proxy configuration; server is not ready: https://127.0.0.1:9345/v1-rke2/readyz: 500 Internal Server Error"
Jan 11 18:49:22 ecs2u rke2[495556]: time="2024-01-11T18:49:22Z" level=debug msg="Wrote ping"
Jan 11 18:49:23 ecs2u rke2[495556]: time="2024-01-11T18:49:23Z" level=info msg="Cluster-Http-Server 2024/01/11 18:49:23 http: TLS handshake error from 127.0.0.1:56468: remote error: tls: bad certificate"
Jan 11 18:49:23 ecs2u rke2[495556]: time="2024-01-11T18:49:23Z" level=info msg="Waiting to retrieve kube-proxy configuration; server is not ready: https://127.0.0.1:9345/v1-rke2/readyz: 500 Internal Server Error"
Jan 11 18:49:27 ecs2u rke2[495556]: time="2024-01-11T18:49:27Z" level=debug msg="Wrote ping"
Jan 11 18:49:28 ecs2u rke2[495556]: time="2024-01-11T18:49:28Z" level=info msg="Cluster-Http-Server 2024/01/11 18:49:28 http: TLS handshake error from 127.0.0.1:56504: remote error: tls: bad certificate"
Jan 11 18:49:28 ecs2u rke2[495556]: time="2024-01-11T18:49:28Z" level=info msg="Waiting to retrieve kube-proxy configuration; server is not ready: https://127.0.0.1:9345/v1-rke2/readyz: 500 Internal Server Error"
Jan 11 18:49:32 ecs2u rke2[495556]: time="2024-01-11T18:49:32Z" level=debug msg="Wrote ping"
Jan 11 18:49:33 ecs2u rke2[495556]: {"level":"warn","ts":"2024-01-11T18:49:33.034982Z","logger":"etcd-client","caller":"[email protected]/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0007628c0/127.0.0.1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded des>
Jan 11 18:49:33 ecs2u rke2[495556]: {"level":"info","ts":"2024-01-11T18:49:33.035229Z","logger":"etcd-client","caller":"[email protected]/client.go:210","msg":"Auto sync endpoints failed.","error":"context deadline exceeded"}
Jan 11 18:49:33 ecs2u rke2[495556]: time="2024-01-11T18:49:33Z" level=info msg="Cluster-Http-Server 2024/01/11 18:49:33 http: TLS handshake error from 127.0.0.1:47376: remote error: tls: bad certificate"
Jan 11 18:49:33 ecs2u rke2[495556]: time="2024-01-11T18:49:33Z" level=info msg="Waiting to retrieve kube-proxy configuration; server is not ready: https://127.0.0.1:9345/v1-rke2/readyz: 500 Internal Server Error"
Jan 11 18:49:33 ecs2u rke2[495556]: time="2024-01-11T18:49:33Z" level=info msg="Waiting for etcd server to become available"
Jan 11 18:49:33 ecs2u rke2[495556]: time="2024-01-11T18:49:33Z" level=info msg="Waiting for API server to become available"
Jan 11 18:49:37 ecs2u rke2[495556]: {"level":"warn","ts":"2024-01-11T18:49:37.866549Z","logger":"etcd-client","caller":"[email protected]/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0007628c0/127.0.0.1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded des>
Jan 11 18:49:37 ecs2u rke2[495556]: time="2024-01-11T18:49:37Z" level=error msg="Failed to check local etcd status for learner management: context deadline exceeded"
Jan 11 18:49:37 ecs2u rke2[495556]: time="2024-01-11T18:49:37Z" level=debug msg="Wrote ping"
Jan 11 18:49:38 ecs2u rke2[495556]: time="2024-01-11T18:49:38Z" level=info msg="Cluster-Http-Server 2024/01/11 18:49:38 http: TLS handshake error from 127.0.0.1:47406: remote error: tls: bad certificate"
Jan 11 18:49:38 ecs2u rke2[495556]: time="2024-01-11T18:49:38Z" level=info msg="Waiting to retrieve kube-proxy configuration; server is not ready: https://127.0.0.1:9345/v1-rke2/readyz: 500 Internal Server Error"
Jan 11 18:49:42 ecs2u rke2[495556]: time="2024-01-11T18:49:42Z" level=debug msg="Wrote ping"
Jan 11 18:49:43 ecs2u rke2[495556]: time="2024-01-11T18:49:43Z" level=info msg="Cluster-Http-Server 2024/01/11 18:49:43 http: TLS handshake error from 127.0.0.1:40226: remote error: tls: bad certificate"
Jan 11 18:49:43 ecs2u rke2[495556]: time="2024-01-11T18:49:43Z" level=info msg="Waiting to retrieve kube-proxy configuration; server is not ready: https://127.0.0.1:9345/v1-rke2/readyz: 500 Internal Server Error"
Jan 11 18:49:47 ecs2u rke2[495556]: time="2024-01-11T18:49:47Z" level=debug msg="Wrote ping"
Jan 11 18:49:48 ecs2u rke2[495556]: {"level":"warn","ts":"2024-01-11T18:49:48.028267Z","logger":"etcd-client","caller":"[email protected]/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0007628c0/127.0.0.1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded des>
Jan 11 18:49:48 ecs2u rke2[495556]: time="2024-01-11T18:49:48Z" level=info msg="Failed to test data store connection: context deadline exceeded"
Jan 11 18:49:48 ecs2u rke2[495556]: {"level":"warn","ts":"2024-01-11T18:49:48.037024Z","logger":"etcd-client","caller":"[email protected]/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0007628c0/127.0.0.1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded des>
Jan 11 18:49:48 ecs2u rke2[495556]: {"level":"info","ts":"2024-01-11T18:49:48.037151Z","logger":"etcd-client","caller":"[email protected]/client.go:210","msg":"Auto sync endpoints failed.","error":"context deadline exceeded"}
Jan 11 18:49:48 ecs2u rke2[495556]: time="2024-01-11T18:49:48Z" level=info msg="Cluster-Http-Server 2024/01/11 18:49:48 http: TLS handshake error from 127.0.0.1:40270: remote error: tls: bad certificate"
Jan 11 18:49:48 ecs2u rke2[495556]: time="2024-01-11T18:49:48Z" level=info msg="Waiting to retrieve kube-proxy configuration; server is not ready: https://127.0.0.1:9345/v1-rke2/readyz: 500 Internal Server Error"

@brandond (Member) commented Jan 11, 2024

The logs suggest that you're only waiting a minute or so for RKE2 to start. Depending on how long it takes your nodes to pull the etcd and kubernetes images from the registry, it may take a bit longer to finish starting.

Assuming the issue does not resolve itself after a little patience, I would check the kubelet.log and containerd.log files on the first server, as well as the files under /var/log/pods, if any are present.
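
On a default install those files should be at roughly these paths:

# Default RKE2 log locations (paths assume a standard install)
tail -n 100 /var/lib/rancher/rke2/agent/logs/kubelet.log
tail -n 100 /var/lib/rancher/rke2/agent/containerd/containerd.log
# Static pod logs, if any pods have started
ls -R /var/log/pods/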

@rkitindi-kr (Author) commented Jan 11, 2024 via email

@rkitindi-kr (Author) commented Jan 11, 2024 via email

@brandond (Member) commented:

Please take a look at the logs I suggested examining earlier.

Also, please avoid replying via e-mail; your client apparently formats the replies in a way that prevents the GitHub mail relay from properly filtering out the noise from your responses.

@rkitindi-kr (Author) commented:

I checked the logs you suggested and didn't find anything interesting.

Can you help me understand why we see connection refused on port 2379 while the server is listening on that port and IP?

Jan 12 02:45:45 ecs2u rke2[520627]: time="2024-01-12T02:45:45Z" level=info msg="Cluster-Http-Server 2024/01/12 02:45:45 http: TLS handshake error from 127.0.0.1:58658: remote error: tls: bad certificate"
Jan 12 02:45:45 ecs2u rke2[520627]: {"level":"warn","ts":"2024-01-12T02:45:45.492857Z","logger":"etcd-client","caller":"[email protected]/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0003cfdc0/127.0.0.1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:2379: connect: connection refused""}
Jan 12 02:45:45 ecs2u rke2[520627]: {"level":"info","ts":"2024-01-12T02:45:45.492948Z","logger":"etcd-client","caller":"[email protected]/client.go:210","msg":"Auto sync endpoints failed.","error":"context deadline exceeded"}
Jan 12 02:45:45 ecs2u rke2[520627]: time="2024-01-12T02:45:45Z" level=info msg="Waiting to retrieve kube-proxy configuration; server is not ready: https://127.0.0.1:9345/v1-rke2/readyz: 500 Internal Server Error"
Jan 12 02:45:50 ecs2u rke2[520627]: time="2024-01-12T02:45:50Z" level=debug msg="Wrote ping"
Jan 12 02:45:50 ecs2u rke2[520627]: time="2024-01-12T02:45:50Z" level=info msg="Cluster-Http-Server 2024/01/12 02:45:50 http: TLS handshake error from 127.0.0.1:58688: remote error: tls: bad certificate"
Jan 12 02:45:50 ecs2u rke2[520627]: time="2024-01-12T02:45:50Z" level=info msg="Waiting to retrieve kube-proxy configuration; server is not ready: https://127.0.0.1:9345/v1-rke2/readyz: 500 Internal Server Error"

nga-admin@ecs2u:~$ netstat -tunlp | grep 2379
(Not all processes could be identified, non-owned process info
will not be shown, you would have to be root to see it all.)
tcp       10      0 127.0.0.1:2379       0.0.0.0:*    LISTEN    -
tcp        0      0 10.159.12.52:2379    0.0.0.0:*    LISTEN    -
nga-admin@ecs2u:~$
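
For what it's worth, my understanding is that etcd can also be probed directly with its client certificates (paths below assume a default RKE2 install; adjust if yours differs):

# Query the etcd health endpoint using the RKE2-managed client certs
sudo curl --cacert /var/lib/rancher/rke2/server/tls/etcd/server-ca.crt \
  --cert /var/lib/rancher/rke2/server/tls/etcd/server-client.crt \
  --key /var/lib/rancher/rke2/server/tls/etcd/server-client.key \
  https://127.0.0.1:2379/health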


@caroline-suse-rancher (Contributor) commented:

I'm going to convert this to a discussion, so we can continue to help you, but declassify this as an issue.

@rancher locked and limited conversation to collaborators Jan 12, 2024
@caroline-suse-rancher converted this issue into discussion #5258 Jan 12, 2024

This issue was moved to a discussion.

You can continue the conversation there. Go to discussion →
