failed to join the cluster #62

jicki opened this issue Mar 26, 2021 · 6 comments

jicki commented Mar 26, 2021

Error log:

level=error msg="failed to join the cluster" error="failed to add ourselves as a member of the cluster: rpc error: code = Unknown desc = raft: stopped"
jicki changed the title from 'error "EOF", ServerName ""' to 'failed to join the cluster' on Mar 29, 2021

jicki commented Mar 29, 2021

time="2021-03-29T08:47:35Z" level=error msg="failed to join the cluster" error="failed to start etcd: listen tcp 172.130.55.215:2380: bind: address already in use"
2021-03-29 08:47:35.461401 W | rafthttp: failed to process raft message (raft: stopped)
2021-03-29 08:47:35.461527 E | rafthttp: failed to find member 3d298e0baea116b7 in cluster dec77a780dc17e55
2021-03-29 08:47:35.462263 E | rafthttp: failed to find member ee032a1060085af5 in cluster dec77a780dc17e55
2021-03-29 08:47:35.483293 E | rafthttp: failed to find member 3d298e0baea116b7 in cluster dec77a780dc17e55
2021-03-29 08:47:35.491297 E | rafthttp: failed to find member ee032a1060085af5 in cluster dec77a780dc17e55
2021-03-29 08:47:35.562004 W | rafthttp: failed to process raft message (raft: stopped)
2021-03-29 08:47:35.581607 W | rafthttp: failed to process raft message (raft: stopped)
2021-03-29 08:47:35.611941 E | rafthttp: failed to find member ee032a1060085af5 in cluster dec77a780dc17e55
2021-03-29 08:47:35.683490 W | rafthttp: failed to process raft message (raft: stopped)
2021-03-29 08:47:35.692528 E | rafthttp: failed to find member 3d298e0baea116b7 in cluster dec77a780dc17e55
2021-03-29 08:47:35.761913 E | rafthttp: failed to find member ee032a1060085af5 in cluster dec77a780dc17e55
2021-03-29 08:47:35.762180 E | rafthttp: failed to find member 3d298e0baea116b7 in cluster dec77a780dc17e55
2021-03-29 08:47:35.762324 W | rafthttp: failed to process raft message (raft: stopped)
2021-03-29 08:47:35.863423 E | rafthttp: failed to find member 3d298e0baea116b7 in cluster dec77a780dc17e55
2021-03-29 08:47:35.865259 E | rafthttp: failed to find member ee032a1060085af5 in cluster dec77a780dc17e55

Quentin-M (Owner) commented:

Hi @jicki,

A bit of a curious error message.

  • Are you running a 3-member cluster? On 3 different nodes in an ASG (using the provided Terraform code), or something else?
  • What are the other members saying?
  • I do see in your second message that the etcd address is already in use locally. What/who is holding it?

jicki commented Mar 30, 2021

Yes, I am running a 3-member cluster.

Pods:

kubectl -n demo-etcd get pods
NAME              READY   STATUS    RESTARTS   AGE
etcd-operator-0   1/1     Running   0          2m26s
etcd-operator-1   1/1     Running   0          2m13s
etcd-operator-2   1/1     Running   0          2m1s

PVCs:

kubectl -n demo-etcd get pvc
NAME                        STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
user-data-etcd-operator-0   Bound    pvc-e3226260-313e-4e6b-8fc4-db853ad88947   10Gi       RWO            nfs-retain     4m44s
user-data-etcd-operator-1   Bound    pvc-5a91311b-db9c-4f5a-bf22-3e2916a13042   10Gi       RWO            nfs-retain     4m31s
user-data-etcd-operator-2   Bound    pvc-a6a494cc-2ed7-42aa-ad5f-c59511ee9189   10Gi       RWO            nfs-retain     4m19s

Delete a pod:

kubectl -n demo-etcd delete pods etcd-operator-1 
pod "etcd-operator-1" deleted

etcd-operator-1 logs:

kubectl -n demo-etcd logs -f etcd-operator-1 
time="2021-03-30T01:16:25Z" level=info msg="loaded configuration file /etc/eco/eco.yaml"
time="2021-03-30T01:16:25Z" level=info msg="STATUS: Healthy + Not running -> Join"
time="2021-03-30T01:16:25Z" level=info msg="attempting to rejoin cluster under existing identity with local data"
[WARNING] Deprecated '--logger=capnslog' flag is set; use '--logger=zap' flag instead
2021-03-30 01:16:25.972790 I | embed: peerTLS: cert = /var/lib/etcd/fixtures/peer/cert.pem, key = /var/lib/etcd/fixtures/peer/key.pem, trusted-ca = , client-cert-auth = false, crl-file = 
2021-03-30 01:16:26.006962 I | embed: name = etcd-operator-1
2021-03-30 01:16:26.007001 I | embed: data dir = /var/lib/etcd
2021-03-30 01:16:26.007015 I | embed: member dir = /var/lib/etcd/member
2021-03-30 01:16:26.007024 I | embed: heartbeat = 100ms
2021-03-30 01:16:26.007031 I | embed: election = 1000ms
2021-03-30 01:16:26.007040 I | embed: snapshot count = 100000
2021-03-30 01:16:26.007064 I | embed: advertise client URLs = https://172.130.151.122:2379
2021-03-30 01:16:26.007073 I | embed: initial advertise peer URLs = https://172.130.151.122:2380
2021-03-30 01:16:26.007085 I | embed: initial cluster = 
2021-03-30 01:16:26.007156 W | pkg/fileutil: check file permission: directory "/var/lib/etcd" exist, but the permission is "drwxrwxrwx". The recommended permission is "-rwx------" to prevent possible unprivileged access to the data.
2021-03-30 01:16:26.846878 I | etcdserver: restarting member 573447da2f35d1f in cluster d8e38796de92f4c3 at commit index 4293
raft2021/03/30 01:16:26 INFO: 573447da2f35d1f switched to configuration voters=()
raft2021/03/30 01:16:26 INFO: 573447da2f35d1f became follower at term 2
raft2021/03/30 01:16:26 INFO: newRaft 573447da2f35d1f [peers: [], term: 2, commit: 4293, applied: 0, lastindex: 4293, lastterm: 2]
2021-03-30 01:16:27.002929 W | auth: simple token is not cryptographically signed
2021-03-30 01:16:27.061180 I | etcdserver: starting server... [version: 3.4.13, cluster version: to_be_decided]
raft2021/03/30 01:16:27 INFO: 573447da2f35d1f switched to configuration voters=(2380116157566068054)
2021-03-30 01:16:27.066084 I | etcdserver/membership: added member 2107df08efe4a556 [https://172.130.55.213:2380] to cluster d8e38796de92f4c3
2021-03-30 01:16:27.066110 I | rafthttp: starting peer 2107df08efe4a556...
2021-03-30 01:16:27.066164 I | rafthttp: started HTTP pipelining with peer 2107df08efe4a556
2021-03-30 01:16:27.067419 I | rafthttp: started streaming with peer 2107df08efe4a556 (writer)
2021-03-30 01:16:27.068012 I | rafthttp: started streaming with peer 2107df08efe4a556 (writer)
2021-03-30 01:16:27.069454 I | rafthttp: started peer 2107df08efe4a556
2021-03-30 01:16:27.069523 I | rafthttp: added peer 2107df08efe4a556
2021-03-30 01:16:27.069591 I | rafthttp: started streaming with peer 2107df08efe4a556 (stream Message reader)
2021-03-30 01:16:27.069844 N | etcdserver/membership: set the initial cluster version to 3.4
2021-03-30 01:16:27.069955 I | etcdserver/api: enabled capabilities for version 3.4
raft2021/03/30 01:16:27 INFO: 573447da2f35d1f switched to configuration voters=(2380116157566068054 15333572839742947380)
2021-03-30 01:16:27.070312 I | etcdserver/membership: added member d4cbcb41ca74c034 [https://172.130.61.89:2380] to cluster d8e38796de92f4c3
2021-03-30 01:16:27.070352 I | rafthttp: starting peer d4cbcb41ca74c034...
2021-03-30 01:16:27.070396 I | rafthttp: started HTTP pipelining with peer d4cbcb41ca74c034
2021-03-30 01:16:27.072092 I | rafthttp: started streaming with peer d4cbcb41ca74c034 (writer)
2021-03-30 01:16:27.072171 I | rafthttp: started peer d4cbcb41ca74c034
2021-03-30 01:16:27.072206 I | rafthttp: added peer d4cbcb41ca74c034
2021-03-30 01:16:27.072749 I | rafthttp: started streaming with peer d4cbcb41ca74c034 (writer)
2021-03-30 01:16:27.072794 I | rafthttp: started streaming with peer 2107df08efe4a556 (stream MsgApp v2 reader)
2021-03-30 01:16:27.072846 I | rafthttp: started streaming with peer d4cbcb41ca74c034 (stream MsgApp v2 reader)
raft2021/03/30 01:16:27 INFO: 573447da2f35d1f switched to configuration voters=(392732898906823967 2380116157566068054 15333572839742947380)
2021-03-30 01:16:27.072945 I | etcdserver/membership: added member 573447da2f35d1f [https://172.130.151.86:2380] to cluster d8e38796de92f4c3
2021-03-30 01:16:27.072974 I | rafthttp: started streaming with peer d4cbcb41ca74c034 (stream Message reader)
time="2021-03-30T01:16:27Z" level=info msg="embedded etcd server is now running"
2021-03-30 01:16:27.094719 I | embed: ClientTLS: cert = /opt/etcd/ssl/etcd.pem, key = /opt/etcd/ssl/etcd-key.pem, trusted-ca = /opt/etcd/ssl/ca.pem, client-cert-auth = false, crl-file = 
2021-03-30 01:16:27.094872 I | embed: listening for peers on 172.130.151.122:2380
2021-03-30 01:16:27.095061 I | embed: listening for metrics on http://172.130.151.122:2381
2021-03-30 01:16:27.095111 I | embed: listening for metrics on http://127.0.0.1:2381
2021-03-30 01:16:27.156831 I | rafthttp: peer d4cbcb41ca74c034 became active
2021-03-30 01:16:27.156913 I | rafthttp: established a TCP streaming connection with peer d4cbcb41ca74c034 (stream MsgApp v2 reader)
2021-03-30 01:16:27.224486 I | rafthttp: peer 2107df08efe4a556 became active
2021-03-30 01:16:27.224560 I | rafthttp: established a TCP streaming connection with peer 2107df08efe4a556 (stream MsgApp v2 reader)
2021-03-30 01:16:27.224669 I | rafthttp: established a TCP streaming connection with peer 2107df08efe4a556 (stream Message reader)
2021-03-30 01:16:27.226101 I | rafthttp: established a TCP streaming connection with peer d4cbcb41ca74c034 (stream Message reader)
2021-03-30 01:16:27.269742 I | etcdserver: 573447da2f35d1f initialized peer connection; fast-forwarding 8 ticks (election ticks 10) with 2 active peer(s)
raft2021/03/30 01:16:27 INFO: raft.node: 573447da2f35d1f elected leader 2107df08efe4a556 at term 2
2021-03-30 01:16:27.366438 I | etcdserver: published {Name:etcd-operator-1 ClientURLs:[https://172.130.151.122:2379]} to cluster d8e38796de92f4c3
2021-03-30 01:16:27.366496 I | embed: ready to serve client requests
2021-03-30 01:16:27.371519 I | embed: serving client requests on 172.130.151.122:2379
2021-03-30 01:16:39.967624 I | embed: rejected connection from "172.130.61.89:39960" (error "EOF", ServerName "")
time="2021-03-30T01:16:40Z" level=info msg="STATUS: Healthy + Running -> Standby"
raft2021/03/30 01:16:46 INFO: 573447da2f35d1f switched to configuration voters=(2380116157566068054 15333572839742947380)
2021-03-30 01:16:46.337493 I | etcdserver/membership: removed member 573447da2f35d1f from cluster d8e38796de92f4c3
2021-03-30 01:16:46.337779 W | rafthttp: lost the TCP streaming connection with peer d4cbcb41ca74c034 (stream MsgApp v2 reader)
2021-03-30 01:16:46.337868 W | rafthttp: lost the TCP streaming connection with peer 2107df08efe4a556 (stream MsgApp v2 reader)
2021-03-30 01:16:46.338683 W | rafthttp: lost the TCP streaming connection with peer d4cbcb41ca74c034 (stream Message reader)
2021-03-30 01:16:46.340022 W | rafthttp: lost the TCP streaming connection with peer 2107df08efe4a556 (stream Message reader)
2021-03-30 01:16:46.345448 E | rafthttp: failed to write 2107df08efe4a556 on pipeline (unexpected EOF)
2021-03-30 01:16:46.345484 I | rafthttp: peer 2107df08efe4a556 became inactive (message send to peer failed)
2021-03-30 01:16:46.386799 E | rafthttp: failed to dial d4cbcb41ca74c034 on stream MsgApp v2 (the member has been permanently removed from the cluster)
2021-03-30 01:16:46.386830 I | rafthttp: peer d4cbcb41ca74c034 became inactive (message send to peer failed)
2021-03-30 01:16:46.386882 E | etcdserver: the member has been permanently removed from the cluster
2021-03-30 01:16:46.386909 I | etcdserver: the data-dir used by this member must be removed.
2021-03-30 01:16:46.388060 I | rafthttp: stopping peer 2107df08efe4a556...
2021-03-30 01:16:46.388103 I | rafthttp: stopped streaming with peer 2107df08efe4a556 (writer)
2021-03-30 01:16:46.388142 I | rafthttp: stopped streaming with peer 2107df08efe4a556 (writer)
2021-03-30 01:16:46.388190 I | rafthttp: stopped HTTP pipelining with peer 2107df08efe4a556
2021-03-30 01:16:46.388249 I | rafthttp: stopped streaming with peer 2107df08efe4a556 (stream MsgApp v2 reader)
2021-03-30 01:16:46.388283 I | rafthttp: stopped streaming with peer 2107df08efe4a556 (stream Message reader)
2021-03-30 01:16:46.388293 I | rafthttp: stopped peer 2107df08efe4a556
2021-03-30 01:16:46.388301 I | rafthttp: stopping peer d4cbcb41ca74c034...
2021-03-30 01:16:46.388323 I | rafthttp: stopped streaming with peer d4cbcb41ca74c034 (writer)
2021-03-30 01:16:46.388338 I | rafthttp: stopped streaming with peer d4cbcb41ca74c034 (writer)
2021-03-30 01:16:46.388382 I | rafthttp: stopped HTTP pipelining with peer d4cbcb41ca74c034
2021-03-30 01:16:46.388413 I | rafthttp: stopped streaming with peer d4cbcb41ca74c034 (stream MsgApp v2 reader)
2021-03-30 01:16:46.388435 I | rafthttp: stopped streaming with peer d4cbcb41ca74c034 (stream Message reader)
2021-03-30 01:16:46.388450 I | rafthttp: stopped peer d4cbcb41ca74c034
time="2021-03-30T01:16:46Z" level=warning msg="etcd server is stopping"
1.6170670073679156e+09	warn	clientv3/retry_interceptor.go:62	retrying of unary invoker failed	{"target": "endpoint://client-d02ff7bc-6397-4bc4-8374-dd3ba70f7a32/172.130.151.86:2379", "attempt": 0, "error": "rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
2021-03-30 01:16:49.865472 I | embed: rejected connection from "172.130.55.213:45440" (error "EOF", ServerName "")
1.6170670123684206e+09	warn	clientv3/retry_interceptor.go:62	retrying of unary invoker failed	{"target": "endpoint://client-d02ff7bc-6397-4bc4-8374-dd3ba70f7a32/172.130.151.86:2379", "attempt": 0, "error": "rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
time="2021-03-30T01:16:55Z" level=info msg="STATUS: Healthy + Not running -> Join"
1.6170670156336913e+09	warn	clientv3/retry_interceptor.go:62	retrying of unary invoker failed	{"target": "endpoint://client-919bb5ff-9235-462c-aede-a662f695ec12/172.130.61.89:2379", "attempt": 0, "error": "rpc error: code = Unknown desc = raft: stopped"}
time="2021-03-30T01:16:55Z" level=error msg="failed to join the cluster" error="failed to add ourselves as a member of the cluster: rpc error: code = Unknown desc = raft: stopped"
1.6170670173688858e+09	warn	clientv3/retry_interceptor.go:62	retrying of unary invoker failed	{"target": "endpoint://client-d02ff7bc-6397-4bc4-8374-dd3ba70f7a32/172.130.151.86:2379", "attempt": 0, "error": "rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
1.6170670223696744e+09	warn	clientv3/retry_interceptor.go:62	retrying of unary invoker failed	{"target": "endpoint://client-d02ff7bc-6397-4bc4-8374-dd3ba70f7a32/172.130.151.86:2379", "attempt": 0, "error": "rpc error: code = DeadlineExceeded desc = latest balancer error: all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: Error while dialing dial tcp 172.130.151.86:2379: i/o timeout\""}
2021-03-30 01:17:04.865899 I | embed: rejected connection from "172.130.55.213:45690" (error "EOF", ServerName "")
1.6170670273700879e+09	warn	clientv3/retry_interceptor.go:62	retrying of unary invoker failed	{"target": "endpoint://client-d02ff7bc-6397-4bc4-8374-dd3ba70f7a32/172.130.151.86:2379", "attempt": 0, "error": "rpc error: code = DeadlineExceeded desc = latest balancer error: all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: Error while dialing dial tcp 172.130.151.86:2379: i/o timeout\""}
time="2021-03-30T01:17:10Z" level=info msg="STATUS: Healthy + Not running -> Join"
[WARNING] Deprecated '--logger=capnslog' flag is set; use '--logger=zap' flag instead
2021-03-30 01:17:10.592319 I | embed: peerTLS: cert = /var/lib/etcd/fixtures/peer/cert.pem, key = /var/lib/etcd/fixtures/peer/key.pem, trusted-ca = , client-cert-auth = false, crl-file = 
1.6170670305989053e+09	warn	clientv3/retry_interceptor.go:62	retrying of unary invoker failed	{"target": "endpoint://client-a0b211bf-6874-4053-aac7-db0ab47dc85d/172.130.61.89:2379", "attempt": 0, "error": "rpc error: code = Unknown desc = raft: stopped"}
1.6170670306201253e+09	warn	clientv3/retry_interceptor.go:62	retrying of unary invoker failed	{"target": "endpoint://client-a0b211bf-6874-4053-aac7-db0ab47dc85d/172.130.61.89:2379", "attempt": 0, "error": "rpc error: code = Unknown desc = raft: stopped"}
time="2021-03-30T01:17:10Z" level=error msg="failed to join the cluster" error="failed to start etcd: listen tcp 172.130.151.122:2380: bind: address already in use"
2021-03-30 01:17:10.645065 E | rafthttp: failed to find member d4cbcb41ca74c034 in cluster d8e38796de92f4c3
2021-03-30 01:17:10.656916 E | rafthttp: failed to find member d4cbcb41ca74c034 in cluster d8e38796de92f4c3
2021-03-30 01:17:10.658378 W | rafthttp: failed to process raft message (raft: stopped)
2021-03-30 01:17:10.671906 E | rafthttp: failed to find member 2107df08efe4a556 in cluster d8e38796de92f4c3
2021-03-30 01:17:10.672779 E | rafthttp: failed to find member 2107df08efe4a556 in cluster d8e38796de92f4c3
2021-03-30 01:17:10.773059 E | rafthttp: failed to find member d4cbcb41ca74c034 in cluster d8e38796de92f4c3
2021-03-30 01:17:10.845863 E | rafthttp: failed to find member d4cbcb41ca74c034 in cluster d8e38796de92f4c3
2021-03-30 01:17:10.853979 E | rafthttp: failed to find member 2107df08efe4a556 in cluster d8e38796de92f4c3
2021-03-30 01:17:10.860586 E | rafthttp: failed to find member 2107df08efe4a556 in cluster d8e38796de92f4c3
2021-03-30 01:17:10.885153 W | rafthttp: failed to process raft message (raft: stopped)
2021-03-30 01:17:10.921513 W | rafthttp: failed to process raft message (raft: stopped)
.....
time="2021-03-30T01:17:25Z" level=error msg="failed to join the cluster" error="failed to add ourselves as a member of the cluster: rpc error: code = Unknown desc = raft: stopped"
1.6170670455224566e+09	warn	clientv3/retry_interceptor.go:62	retrying of unary invoker failed	{"target": "endpoint://client-504e129c-7371-424d-96fd-eb94c6d257d8/172.130.61.89:2379", "attempt": 0, "error": "rpc error: code = Unknown desc = raft: stopped"}

etcd-operator-0 logs:

time="2021-03-30T01:16:09Z" level=info msg="STATUS: Healthy + Running -> Standby"
2021-03-30 01:16:18.473725 W | rafthttp: lost the TCP streaming connection with peer 573447da2f35d1f (stream MsgApp v2 reader)
2021-03-30 01:16:18.476112 W | rafthttp: lost the TCP streaming connection with peer 573447da2f35d1f (stream Message reader)
2021-03-30 01:16:18.548695 E | rafthttp: failed to dial 573447da2f35d1f on stream MsgApp v2 (peer 573447da2f35d1f failed to find local node d4cbcb41ca74c034)
2021-03-30 01:16:18.548713 I | rafthttp: peer 573447da2f35d1f became inactive (message send to peer failed)
2021-03-30 01:16:19.418600 I | embed: rejected connection from "10.9.9.48:24871" (error "read tcp 172.130.61.89:2379->10.9.9.48:24871: read: connection reset by peer", ServerName "")
2021-03-30 01:16:23.869174 W | rafthttp: lost the TCP streaming connection with peer 573447da2f35d1f (stream MsgApp v2 writer)
2021-03-30 01:16:23.869621 W | rafthttp: lost the TCP streaming connection with peer 573447da2f35d1f (stream Message writer)
time="2021-03-30T01:16:24Z" level=info msg="STATUS: Healthy + Running -> Standby"
2021-03-30 01:16:27.118410 I | embed: rejected connection from "10.9.9.48:30105" (error "read tcp 172.130.61.89:2379->10.9.9.48:30105: read: connection reset by peer", ServerName "")
2021-03-30 01:16:27.156575 I | rafthttp: peer 573447da2f35d1f became active
2021-03-30 01:16:27.156607 I | rafthttp: established a TCP streaming connection with peer 573447da2f35d1f (stream MsgApp v2 writer)
2021-03-30 01:16:27.224808 I | rafthttp: established a TCP streaming connection with peer 573447da2f35d1f (stream Message writer)
2021-03-30 01:16:27.974920 E | rafthttp: failed to dial 573447da2f35d1f on stream MsgApp v2 (dial tcp 172.130.151.86:2380: i/o timeout)
2021-03-30 01:16:27.974936 I | rafthttp: peer 573447da2f35d1f became inactive (message send to peer failed)
1.6170669882369318e+09	warn	clientv3/retry_interceptor.go:62	retrying of unary invoker failed	{"target": "endpoint://client-cdf76436-0832-4c1a-8044-b771dbef19a2/172.130.151.86:2379", "attempt": 0, "error": "rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
2021-03-30 01:16:29.095730 I | embed: rejected connection from "172.130.55.192:15923" (error "EOF", ServerName "")
2021-03-30 01:16:30.422564 I | embed: rejected connection from "10.9.9.48:43183" (error "read tcp 172.130.61.89:2379->10.9.9.48:43183: read: connection reset by peer", ServerName "")
1.6170669932371874e+09	warn	clientv3/retry_interceptor.go:62	retrying of unary invoker failed	{"target": "endpoint://client-cdf76436-0832-4c1a-8044-b771dbef19a2/172.130.151.86:2379", "attempt": 0, "error": "rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
2021-03-30 01:16:36.774647 I | embed: rejected connection from "10.9.9.48:51354" (error "EOF", ServerName "")
1.617066998237407e+09	warn	clientv3/retry_interceptor.go:62	retrying of unary invoker failed	{"target": "endpoint://client-cdf76436-0832-4c1a-8044-b771dbef19a2/172.130.151.86:2379", "attempt": 0, "error": "rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
time="2021-03-30T01:16:39Z" level=info msg="STATUS: Healthy + Running -> Standby"
2021-03-30 01:16:42.203827 W | rafthttp: health check for peer 573447da2f35d1f could not connect: dial tcp 172.130.151.86:2380: i/o timeout
2021-03-30 01:16:42.203867 W | rafthttp: health check for peer 573447da2f35d1f could not connect: dial tcp 172.130.151.86:2380: i/o timeout
1.6170670032375584e+09	warn	clientv3/retry_interceptor.go:62	retrying of unary invoker failed	{"target": "endpoint://client-cdf76436-0832-4c1a-8044-b771dbef19a2/172.130.151.86:2379", "attempt": 0, "error": "rpc error: code = DeadlineExceeded desc = latest balancer error: all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: Error while dialing dial tcp 172.130.151.86:2379: i/o timeout\""}
raft2021/03/30 01:16:46 INFO: d4cbcb41ca74c034 switched to configuration voters=(2380116157566068054 15333572839742947380)
2021-03-30 01:16:46.336242 I | etcdserver/membership: removed member 573447da2f35d1f from cluster d8e38796de92f4c3
2021-03-30 01:16:46.336257 I | rafthttp: stopping peer 573447da2f35d1f...
2021-03-30 01:16:46.336803 I | rafthttp: closed the TCP streaming connection with peer 573447da2f35d1f (stream MsgApp v2 writer)
2021-03-30 01:16:46.336818 I | rafthttp: stopped streaming with peer 573447da2f35d1f (writer)
2021-03-30 01:16:46.341113 I | rafthttp: closed the TCP streaming connection with peer 573447da2f35d1f (stream Message writer)
2021-03-30 01:16:46.341136 I | rafthttp: stopped streaming with peer 573447da2f35d1f (writer)
2021-03-30 01:16:46.341572 I | rafthttp: stopped HTTP pipelining with peer 573447da2f35d1f
2021-03-30 01:16:46.341604 I | rafthttp: stopped streaming with peer 573447da2f35d1f (stream MsgApp v2 reader)
2021-03-30 01:16:46.341622 I | rafthttp: stopped streaming with peer 573447da2f35d1f (stream Message reader)
2021-03-30 01:16:46.341633 I | rafthttp: stopped peer 573447da2f35d1f
2021-03-30 01:16:46.341648 I | rafthttp: removed peer 573447da2f35d1f
2021-03-30 01:16:46.385245 W | rafthttp: rejected the stream from peer 573447da2f35d1f since it was removed
1.6170670082377918e+09	warn	clientv3/retry_interceptor.go:62	retrying of unary invoker failed	{"target": "endpoint://client-cdf76436-0832-4c1a-8044-b771dbef19a2/172.130.151.86:2379", "attempt": 0, "error": "rpc error: code = DeadlineExceeded desc = latest balancer error: all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: Error while dialing dial tcp 172.130.151.86:2379: i/o timeout\""}
time="2021-03-30T01:16:48Z" level=info msg="removing member \"etcd-operator-1\" that's been unhealthy for 30s"
raft2021/03/30 01:16:48 INFO: d4cbcb41ca74c034 switched to configuration voters=(2380116157566068054 15333572839742947380)
1.6170670082805386e+09	warn	clientv3/retry_interceptor.go:62	retrying of unary invoker failed	{"target": "endpoint://client-1ab04957-f7bb-45c1-9d64-9bc17137ef7b/172.130.61.89:2379", "attempt": 0, "error": "rpc error: code = NotFound desc = etcdserver: member not found"}
time="2021-03-30T01:16:54Z" level=info msg="STATUS: Healthy + Running -> Standby"
time="2021-03-30T01:17:09Z" level=info msg="STATUS: Healthy + Running -> Standby"
raft2021/03/30 01:17:10 INFO: d4cbcb41ca74c034 switched to configuration voters=(2380116157566068054 15333572839742947380 17335131455808352196)
2021-03-30 01:17:10.162275 I | etcdserver/membership: added member f092c236ae282fc4 [https://172.130.151.122:2380] to cluster d8e38796de92f4c3
2021-03-30 01:17:10.162298 I | rafthttp: starting peer f092c236ae282fc4...
2021-03-30 01:17:10.162332 I | rafthttp: started HTTP pipelining with peer f092c236ae282fc4
2021-03-30 01:17:10.163998 I | rafthttp: started streaming with peer f092c236ae282fc4 (writer)
2021-03-30 01:17:10.165136 I | rafthttp: started streaming with peer f092c236ae282fc4 (writer)
2021-03-30 01:17:10.166524 I | rafthttp: started peer f092c236ae282fc4
2021-03-30 01:17:10.166680 I | rafthttp: started streaming with peer f092c236ae282fc4 (stream Message reader)
2021-03-30 01:17:10.166793 I | rafthttp: started streaming with peer f092c236ae282fc4 (stream MsgApp v2 reader)
2021-03-30 01:17:10.167203 I | rafthttp: added peer f092c236ae282fc4
time="2021-03-30T01:17:25Z" level=info msg="STATUS: Healthy + Running -> Standby"
2021-03-30 01:17:36.774532 I | embed: rejected connection from "10.9.9.48:52470" (error "EOF", ServerName "")
time="2021-03-30T01:17:39Z" level=info msg="STATUS: Healthy + Running -> Standby"
2021-03-30 01:17:45.167442 W | rafthttp: health check for peer f092c236ae282fc4 could not connect: remote error: tls: internal error
2021-03-30 01:17:50.167592 W | rafthttp: health check for peer f092c236ae282fc4 could not connect: remote error: tls: internal error
time="2021-03-30T01:17:54Z" level=info msg="STATUS: Healthy + Running -> Standby"
2021-03-30 01:17:55.167751 W | rafthttp: health check for peer f092c236ae282fc4 could not connect: remote error: tls: internal error
2021-03-30 01:18:00.167902 W | rafthttp: health check for peer f092c236ae282fc4 could not connect: remote error: tls: internal error
2021-03-30 01:18:05.168073 W | rafthttp: health check for peer f092c236ae282fc4 could not connect: remote error: tls: internal error
time="2021-03-30T01:18:09Z" level=info msg="STATUS: Healthy + Running -> Standby"
2021-03-30 01:18:10.168215 W | rafthttp: health check for peer f092c236ae282fc4 could not connect: remote error: tls: internal error
2021-03-30 01:18:15.168379 W | rafthttp: health check for peer f092c236ae282fc4 could not connect: remote error: tls: internal error
2021-03-30 01:18:20.168526 W | rafthttp: health check for peer f092c236ae282fc4 could not connect: remote error: tls: internal error
time="2021-03-30T01:18:24Z" level=info msg="STATUS: Healthy + Running -> Standby"
2021-03-30 01:18:25.168681 W | rafthttp: health check for peer f092c236ae282fc4 could not connect: remote error: tls: internal error
2021-03-30 01:18:30.168826 W | rafthttp: health check for peer f092c236ae282fc4 could not connect: remote error: tls: internal error
2021-03-30 01:18:34.845063 I | embed: rejected connection from "172.130.55.213:42352" (error "EOF", ServerName "")
2021-03-30 01:18:35.168982 W | rafthttp: health check for peer f092c236ae282fc4 could not connect: remote error: tls: internal error
2021-03-30 01:18:36.774508 I | embed: rejected connection from "10.9.9.48:55210" (error "EOF", ServerName "")
time="2021-03-30T01:18:39Z" level=info msg="STATUS: Healthy + Running -> Standby"
2021-03-30 01:18:40.169201 W | rafthttp: health check for peer f092c236ae282fc4 could not connect: remote error: tls: internal error
2021-03-30 01:18:45.169357 W | rafthttp: health check for peer f092c236ae282fc4 could not connect: remote error: tls: internal error
2021-03-30 01:18:50.169501 W | rafthttp: health check for peer f092c236ae282fc4 could not connect: remote error: tls: internal error
time="2021-03-30T01:18:54Z" level=info msg="STATUS: Healthy + Running -> Standby"
2021-03-30 01:18:55.169648 W | rafthttp: health check for peer f092c236ae282fc4 could not connect: remote error: tls: internal error
2021-03-30 01:19:00.169807 W | rafthttp: health check for peer f092c236ae282fc4 could not connect: remote error: tls: internal error
2021-03-30 01:19:05.169963 W | rafthttp: health check for peer f092c236ae282fc4 could not connect: remote error: tls: internal error
time="2021-03-30T01:19:09Z" level=info msg="STATUS: Healthy + Running -> Standby"
2021-03-30 01:19:10.170139 W | rafthttp: health check for peer f092c236ae282fc4 could not connect: remote error: tls: internal error
2021-03-30 01:19:15.170280 W | rafthttp: health check for peer f092c236ae282fc4 could not connect: remote error: tls: internal error
2021-03-30 01:19:20.170464 W | rafthttp: health check for peer f092c236ae282fc4 could not connect: remote error: tls: internal error
time="2021-03-30T01:19:24Z" level=info msg="STATUS: Healthy + Running -> Standby"
2021-03-30 01:19:25.170597 W | rafthttp: health check for peer f092c236ae282fc4 could not connect: remote error: tls: internal error
2021-03-30 01:19:30.170763 W | rafthttp: health check for peer f092c236ae282fc4 could not connect: remote error: tls: internal error
2021-03-30 01:19:35.170916 W | rafthttp: health check for peer f092c236ae282fc4 could not connect: remote error: tls: internal error
2021-03-30 01:19:36.774644 I | embed: rejected connection from "10.9.9.48:57958" (error "EOF", ServerName "")
time="2021-03-30T01:19:39Z" level=info msg="STATUS: Healthy + Running -> Standby"
2021-03-30 01:19:40.171044 W | rafthttp: health check for peer f092c236ae282fc4 could not connect: remote error: tls: internal error
2021-03-30 01:19:45.171195 W | rafthttp: health check for peer f092c236ae282fc4 could not connect: remote error: tls: internal error
2021-03-30 01:19:50.171353 W | rafthttp: health check for peer f092c236ae282fc4 could not connect: remote error: tls: internal error
time="2021-03-30T01:19:54Z" level=info msg="STATUS: Healthy + Running -> Standby"
2021-03-30 01:19:55.171526 W | rafthttp: health check for peer f092c236ae282fc4 could not connect: remote error: tls: internal error
2021-03-30 01:20:00.171683 W | rafthttp: health check for peer f092c236ae282fc4 could not connect: remote error: tls: internal error
2021-03-30 01:20:05.171878 W | rafthttp: health check for peer f092c236ae282fc4 could not connect: remote error: tls: internal error
time="2021-03-30T01:20:09Z" level=info msg="STATUS: Healthy + Running -> Standby"
2021-03-30 01:20:10.172041 W | rafthttp: health check for peer f092c236ae282fc4 could not connect: remote error: tls: internal error
2021-03-30 01:20:15.172221 W | rafthttp: health check for peer f092c236ae282fc4 could not connect: remote error: tls: internal error
2021-03-30 01:20:20.172419 W | rafthttp: health check for peer f092c236ae282fc4 could not connect: remote error: tls: internal error
time="2021-03-30T01:20:24Z" level=info msg="STATUS: Healthy + Running -> Standby"
2021-03-30 01:20:25.172597 W | rafthttp: health check for peer f092c236ae282fc4 could not connect: remote error: tls: internal error
2021-03-30 01:20:30.172790 W | rafthttp: health check for peer f092c236ae282fc4 could not connect: remote error: tls: internal error
2021-03-30 01:20:34.847300 I | embed: rejected connection from "172.130.55.213:51870" (error "EOF", ServerName "")
2021-03-30 01:20:35.172987 W | rafthttp: health check for peer f092c236ae282fc4 could not connect: remote error: tls: internal error
2021-03-30 01:20:36.774541 I | embed: rejected connection from "10.9.9.48:60700" (error "EOF", ServerName "")
time="2021-03-30T01:20:39Z" level=info msg="STATUS: Healthy + Running -> Standby"
2021-03-30 01:20:40.173187 W | rafthttp: health check for peer f092c236ae282fc4 could not connect: remote error: tls: internal error
2021-03-30 01:20:45.173358 W | rafthttp: health check for peer f092c236ae282fc4 could not connect: remote error: tls: internal error
2021-03-30 01:20:50.173536 W | rafthttp: health check for peer f092c236ae282fc4 could not connect: remote error: tls: internal error
time="2021-03-30T01:20:54Z" level=info msg="STATUS: Healthy + Running -> Standby"
2021-03-30 01:20:55.173721 W | rafthttp: health check for peer f092c236ae282fc4 could not connect: remote error: tls: internal error
2021-03-30 01:21:00.173878 W | rafthttp: health check for peer f092c236ae282fc4 could not connect: remote error: tls: internal error

etcd-operator-2 logs:

2021-03-30 01:17:10.193453 I | rafthttp: started streaming with peer f092c236ae282fc4 (stream Message reader)
time="2021-03-30T01:17:19Z" level=info msg="STATUS: Healthy + Running -> Standby"
time="2021-03-30T01:17:34Z" level=info msg="STATUS: Healthy + Running -> Standby"
2021-03-30 01:17:39.991120 I | embed: rejected connection from "172.130.61.89:60872" (error "EOF", ServerName "")
2021-03-30 01:17:44.484998 I | embed: rejected connection from "10.9.9.49:49004" (error "EOF", ServerName "")
2021-03-30 01:17:45.193959 W | rafthttp: health check for peer f092c236ae282fc4 could not connect: remote error: tls: internal error
2021-03-30 01:17:45.194015 W | rafthttp: health check for peer f092c236ae282fc4 could not connect: remote error: tls: internal error
time="2021-03-30T01:17:49Z" level=info msg="STATUS: Healthy + Running -> Standby"
2021-03-30 01:17:50.194275 W | rafthttp: health check for peer f092c236ae282fc4 could not connect: remote error: tls: internal error
2021-03-30 01:17:50.194366 W | rafthttp: health check for peer f092c236ae282fc4 could not connect: remote error: tls: internal error
2021-03-30 01:17:54.992506 I | embed: rejected connection from "172.130.61.89:33328" (error "EOF", ServerName "")
2021-03-30 01:17:55.194655 W | rafthttp: health check for peer f092c236ae282fc4 could not connect: remote error: tls: internal error
2021-03-30 01:17:55.194734 W | rafthttp: health check for peer f092c236ae282fc4 could not connect: remote error: tls: internal error
2021-03-30 01:18:00.194997 W | rafthttp: health check for peer f092c236ae282fc4 could not connect: remote error: tls: internal error
2021-03-30 01:18:00.195050 W | rafthttp: health check for peer f092c236ae282fc4 could not connect: remote error: tls: internal error
time="2021-03-30T01:18:04Z" level=info msg="STATUS: Healthy + Running -> Standby"
2021-03-30 01:18:05.195339 W | rafthttp: health check for peer f092c236ae282fc4 could not connect: remote error: tls: internal error
2021-03-30 01:18:05.195413 W | rafthttp: health check for peer f092c236ae282fc4 could not connect: remote error: tls: internal error
2021-03-30 01:18:09.994961 I | embed: rejected connection from "172.130.61.89:34008" (error "EOF", ServerName "")
2021-03-30 01:18:10.175480 I | embed: rejected connection from "172.130.151.122:35742" (error "EOF", ServerName "")
2021-03-30 01:18:10.195724 W | rafthttp: health check for peer f092c236ae282fc4 could not connect: remote error: tls: internal error
2021-03-30 01:18:10.195777 W | rafthttp: health check for peer f092c236ae282fc4 could not connect: remote error: tls: internal error
2021-03-30 01:18:15.196035 W | rafthttp: health check for peer f092c236ae282fc4 could not connect: remote error: tls: internal error
2021-03-30 01:18:15.196115 W | rafthttp: health check for peer f092c236ae282fc4 could not connect: remote error: tls: internal error
time="2021-03-30T01:18:19Z" level=info msg="STATUS: Healthy + Running -> Standby"
2021-03-30 01:18:19.880835 I | embed: rejected connection from "172.130.55.213:43020" (error "EOF", ServerName "")
2021-03-30 01:18:20.196317 W | rafthttp: health check for peer f092c236ae282fc4 could not connect: remote error: tls: internal error
2021-03-30 01:18:20.196390 W | rafthttp: health check for peer f092c236ae282fc4 could not connect: remote error: tls: internal error
2021-03-30 01:18:24.997107 I | embed: rejected connection from "172.130.61.89:34692" (error "EOF", ServerName "")
2021-03-30 01:18:25.184443 I | embed: rejected connection from "172.130.151.122:36076" (error "EOF", ServerName "")
2021-03-30 01:18:25.196638 W | rafthttp: health check for peer f092c236ae282fc4 could not connect: remote error: tls: internal error
2021-03-30 01:18:25.196795 W | rafthttp: health check for peer f092c236ae282fc4 could not connect: remote error: tls: internal error
2021-03-30 01:18:30.197591 W | rafthttp: health check for peer f092c236ae282fc4 could not connect: remote error: tls: internal error
2021-03-30 01:18:30.197666 W | rafthttp: health check for peer f092c236ae282fc4 could not connect: remote error: tls: internal error
time="2021-03-30T01:18:34Z" level=info msg="STATUS: Healthy + Running -> Standby"
2021-03-30 01:18:35.197897 W | rafthttp: health check for peer f092c236ae282fc4 could not connect: remote error: tls: internal error
2021-03-30 01:18:35.198028 W | rafthttp: health check for peer f092c236ae282fc4 could not connect: remote error: tls: internal error
2021-03-30 01:18:40.198271 W | rafthttp: health check for peer f092c236ae282fc4 could not connect: remote error: tls: internal error
2021-03-30 01:18:40.198342 W | rafthttp: health check for peer f092c236ae282fc4 could not connect: remote error: tls: internal error
2021-03-30 01:18:40.217402 W | etcdserver/membership: Reject add member request: the number of started member (2) will be less than the quorum number of the cluster (3)
2021-03-30 01:18:40.217525 W | etcdserver: not enough started members, rejecting member add {ID:e2c105ebabbe332a RaftAttributes:{PeerURLs:[https://172.130.151.122:2380] IsLearner:false} Attributes:{Name: ClientURLs:[]}}
2021-03-30 01:18:44.485323 I | embed: rejected connection from "10.9.9.49:53758" (error "EOF", ServerName "")
2021-03-30 01:18:45.198548 W | rafthttp: health check for peer f092c236ae282fc4 could not connect: remote error: tls: internal error
2021-03-30 01:18:45.198610 W | rafthttp: health check for peer f092c236ae282fc4 could not connect: remote error: tls: internal error
time="2021-03-30T01:18:49Z" level=info msg="STATUS: Healthy + Running -> Standby"
2021-03-30 01:18:50.198845 W | rafthttp: health check for peer f092c236ae282fc4 could not connect: remote error: tls: internal error
2021-03-30 01:18:50.198933 W | rafthttp: health check for peer f092c236ae282fc4 could not connect: remote error: tls: internal error
2021-03-30 01:18:55.199051 W | rafthttp: health check for peer f092c236ae282fc4 could not connect: remote error: tls: internal error
2021-03-30 01:18:55.199182 W | rafthttp: health check for peer f092c236ae282fc4 could not connect: remote error: tls: internal error
2021-03-30 01:19:00.199307 W | rafthttp: health check for peer f092c236ae282fc4 could not connect: remote error: tls: internal error
2021-03-30 01:19:00.199332 W | rafthttp: health check for peer f092c236ae282fc4 could not connect: remote error: tls: internal error
time="2021-03-30T01:19:04Z" level=info msg="STATUS: Healthy + Running -> Standby"
2021-03-30 01:19:05.199612 W | rafthttp: health check for peer f092c236ae282fc4 could not connect: remote error: tls: internal error
2021-03-30 01:19:05.199712 W | rafthttp: health check for peer f092c236ae282fc4 could not connect: remote error: tls: internal error
2021-03-30 01:19:10.000760 I | embed: rejected connection from "172.130.61.89:36762" (error "EOF", ServerName "")
2021-03-30 01:19:10.199911 W | rafthttp: health check for peer f092c236ae282fc4 could not connect: remote error: tls: internal error
2021-03-30 01:19:10.200053 W | rafthttp: health check for peer f092c236ae282fc4 could not connect: remote error: tls: internal error
2021-03-30 01:19:10.225730 W | etcdserver/membership: Reject add member request: the number of started member (2) will be less than the quorum number of the cluster (3)
2021-03-30 01:19:10.225869 W | etcdserver: not enough started members, rejecting member add {ID:a712ae29ab30d7b1 RaftAttributes:{PeerURLs:[https://172.130.151.122:2380] IsLearner:false} Attributes:{Name: ClientURLs:[]}}
2021-03-30 01:19:15.200289 W | rafthttp: health check for peer f092c236ae282fc4 could not connect: remote error: tls: internal error
2021-03-30 01:19:15.200456 W | rafthttp: health check for peer f092c236ae282fc4 could not connect: remote error: tls: internal error
time="2021-03-30T01:19:19Z" level=info msg="STATUS: Healthy + Running -> Standby"
2021-03-30 01:19:20.200557 W | rafthttp: health check for peer f092c236ae282fc4 could not connect: remote error: tls: internal error
2021-03-30 01:19:20.200601 W | rafthttp: health check for peer f092c236ae282fc4 could not connect: remote error: tls: internal error
2021-03-30 01:19:25.008565 I | embed: rejected connection from "172.130.61.89:37442" (error "EOF", ServerName "")
2021-03-30 01:19:25.186679 W | etcdserver/membership: Reject add member request: the number of started member (2) will be less than the quorum number of the cluster (3)
2021-03-30 01:19:25.186853 W | etcdserver: not enough started members, rejecting member add {ID:a197287f4f0cad70 RaftAttributes:{PeerURLs:[https://172.130.151.122:2380] IsLearner:false} Attributes:{Name: ClientURLs:[]}}
2021-03-30 01:19:25.200880 W | rafthttp: health check for peer f092c236ae282fc4 could not connect: remote error: tls: internal error
2021-03-30 01:19:25.200974 W | rafthttp: health check for peer f092c236ae282fc4 could not connect: remote error: tls: internal error
2021-03-30 01:19:30.201297 W | rafthttp: health check for peer f092c236ae282fc4 could not connect: remote error: tls: internal error
2021-03-30 01:19:30.201372 W | rafthttp: health check for peer f092c236ae282fc4 could not connect: remote error: tls: internal error
time="2021-03-30T01:19:34Z" level=info msg="STATUS: Healthy + Running -> Standby"
2021-03-30 01:19:35.201544 W | rafthttp: health check for peer f092c236ae282fc4 could not connect: remote error: tls: internal error
2021-03-30 01:19:35.201582 W | rafthttp: health check for peer f092c236ae282fc4 could not connect: remote error: tls: internal error
2021-03-30 01:19:40.204538 W | rafthttp: health check for peer f092c236ae282fc4 could not connect: remote error: tls: internal error
2021-03-30 01:19:40.204599 W | rafthttp: health check for peer f092c236ae282fc4 could not connect: remote error: tls: internal error
2021-03-30 01:19:40.214848 W | etcdserver/membership: Reject add member request: the number of started member (2) will be less than the quorum number of the cluster (3)
2021-03-30 01:19:40.214958 W | etcdserver: not enough started members, rejecting member add {ID:e07fde3d4dd7199d RaftAttributes:{PeerURLs:[https://172.130.151.122:2380] IsLearner:false} Attributes:{Name: ClientURLs:[]}}
2021-03-30 01:19:44.484952 I | embed: rejected connection from "10.9.9.49:58504" (error "EOF", ServerName "")
2021-03-30 01:19:45.204741 W | rafthttp: health check for peer f092c236ae282fc4 could not connect: remote error: tls: internal error
2021-03-30 01:19:45.204821 W | rafthttp: health check for peer f092c236ae282fc4 could not connect: remote error: tls: internal error
time="2021-03-30T01:19:49Z" level=info msg="STATUS: Healthy + Running -> Standby"
2021-03-30 01:19:50.205079 W | rafthttp: health check for peer f092c236ae282fc4 could not connect: remote error: tls: internal error
2021-03-30 01:19:50.205203 W | rafthttp: health check for peer f092c236ae282fc4 could not connect: remote error: tls: internal error
2021-03-30 01:19:55.005799 I | embed: rejected connection from "172.130.61.89:38816" (error "EOF", ServerName "")
2021-03-30 01:19:55.198775 W | etcdserver/membership: Reject add member request: the number of started member (2) will be less than the quorum number of the cluster (3)
2021-03-30 01:19:55.198989 W | etcdserver: not enough started members, rejecting member add {ID:2d2ab99a4ae65e85 RaftAttributes:{PeerURLs:[https://172.130.151.122:2380] IsLearner:false} Attributes:{Name: ClientURLs:[]}}
2021-03-30 01:19:55.205377 W | rafthttp: health check for peer f092c236ae282fc4 could not connect: remote error: tls: internal error
2021-03-30 01:19:55.205531 W | rafthttp: health check for peer f092c236ae282fc4 could not connect: remote error: tls: internal error
2021-03-30 01:20:00.205614 W | rafthttp: health check for peer f092c236ae282fc4 could not connect: remote error: tls: internal error
2021-03-30 01:20:00.205806 W | rafthttp: health check for peer f092c236ae282fc4 could not connect: remote error: tls: internal error
time="2021-03-30T01:20:04Z" level=info msg="STATUS: Healthy + Running -> Standby"
2021-03-30 01:20:05.205937 W | rafthttp: health check for peer f092c236ae282fc4 could not connect: remote error: tls: internal error
2021-03-30 01:20:05.206011 W | rafthttp: health check for peer f092c236ae282fc4 could not connect: remote error: tls: internal error
2021-03-30 01:20:10.004395 I | embed: rejected connection from "172.130.61.89:39500" (error "EOF", ServerName "")
2021-03-30 01:20:10.206286 W | rafthttp: health check for peer f092c236ae282fc4 could not connect: remote error: tls: internal error
2021-03-30 01:20:10.206348 W | rafthttp: health check for peer f092c236ae282fc4 could not connect: remote error: tls: internal error
2021-03-30 01:20:15.206613 W | rafthttp: health check for peer f092c236ae282fc4 could not connect: remote error: tls: internal error
2021-03-30 01:20:15.206716 W | rafthttp: health check for peer f092c236ae282fc4 could not connect: remote error: tls: internal error
time="2021-03-30T01:20:19Z" level=info msg="STATUS: Healthy + Running -> Standby"
2021-03-30 01:20:19.879359 I | embed: rejected connection from "172.130.55.213:52512" (error "EOF", ServerName "")
2021-03-30 01:20:20.206971 W | rafthttp: health check for peer f092c236ae282fc4 could not connect: remote error: tls: internal error
2021-03-30 01:20:20.207045 W | rafthttp: health check for peer f092c236ae282fc4 could not connect: remote error: tls: internal error
2021-03-30 01:20:25.207355 W | rafthttp: health check for peer f092c236ae282fc4 could not connect: remote error: tls: internal error
2021-03-30 01:20:25.207441 W | rafthttp: health check for peer f092c236ae282fc4 could not connect: remote error: tls: internal error
2021-03-30 01:20:30.207697 W | rafthttp: health check for peer f092c236ae282fc4 could not connect: remote error: tls: internal error
2021-03-30 01:20:30.207795 W | rafthttp: health check for peer f092c236ae282fc4 could not connect: remote error: tls: internal error
time="2021-03-30T01:20:34Z" level=info msg="STATUS: Healthy + Running -> Standby"
2021-03-30 01:20:35.207992 W | rafthttp: health check for peer f092c236ae282fc4 could not connect: remote error: tls: internal error
2021-03-30 01:20:35.208031 W | rafthttp: health check for peer f092c236ae282fc4 could not connect: remote error: tls: internal error
2021-03-30 01:20:40.208257 W | rafthttp: health check for peer f092c236ae282fc4 could not connect: remote error: tls: internal error
2021-03-30 01:20:40.208328 W | rafthttp: health check for peer f092c236ae282fc4 could not connect: remote error: tls: internal error
2021-03-30 01:20:44.485327 I | embed: rejected connection from "10.9.9.49:35054" (error "EOF", ServerName "")
2021-03-30 01:20:45.208658 W | rafthttp: health check for peer f092c236ae282fc4 could not connect: remote error: tls: internal error
2021-03-30 01:20:45.208759 W | rafthttp: health check for peer f092c236ae282fc4 could not connect: remote error: tls: internal error
time="2021-03-30T01:20:49Z" level=info msg="STATUS: Healthy + Running -> Standby"
2021-03-30 01:20:50.208996 W | rafthttp: health check for peer f092c236ae282fc4 could not connect: remote error: tls: internal error
2021-03-30 01:20:50.209142 W | rafthttp: health check for peer f092c236ae282fc4 could not connect: remote error: tls: internal error
2021-03-30 01:20:55.174336 W | etcdserver/membership: Reject add member request: the number of started member (2) will be less than the quorum number of the cluster (3)
2021-03-30 01:20:55.174430 W | etcdserver: not enough started members, rejecting member add {ID:77595d6157d31391 RaftAttributes:{PeerURLs:[https://172.130.151.122:2380] IsLearner:false} Attributes:{Name: ClientURLs:[]}}
2021-03-30 01:20:55.209301 W | rafthttp: health check for peer f092c236ae282fc4 could not connect: remote error: tls: internal error
2021-03-30 01:20:55.209411 W | rafthttp: health check for peer f092c236ae282fc4 could not connect: remote error: tls: internal error
2021-03-30 01:21:00.209564 W | rafthttp: health check for peer f092c236ae282fc4 could not connect: remote error: tls: internal error
2021-03-30 01:21:00.209602 W | rafthttp: health check for peer f092c236ae282fc4 could not connect: remote error: tls: internal error
time="2021-03-30T01:21:04Z" level=info msg="STATUS: Healthy + Running -> Standby"
2021-03-30 01:21:05.209863 W | rafthttp: health check for peer f092c236ae282fc4 could not connect: remote error: tls: internal error
2021-03-30 01:21:05.209972 W | rafthttp: health check for peer f092c236ae282fc4 could not connect: remote error: tls: internal error
2021-03-30 01:21:10.158782 W | etcdserver/membership: Reject add member request: the number of started member (2) will be less than the quorum number of the cluster (3)
2021-03-30 01:21:10.158880 W | etcdserver: not enough started members, rejecting member add {ID:641228db1fafcd2d RaftAttributes:{PeerURLs:[https://172.130.151.122:2380] IsLearner:false} Attributes:{Name: ClientURLs:[]}}
2021-03-30 01:21:10.210252 W | rafthttp: health check for peer f092c236ae282fc4 could not connect: remote error: tls: internal error
2021-03-30 01:21:10.210341 W | rafthttp: health check for peer f092c236ae282fc4 could not connect: remote error: tls: internal error
2021-03-30 01:21:15.210630 W | rafthttp: health check for peer f092c236ae282fc4 could not connect: remote error: tls: internal error
2021-03-30 01:21:15.211016 W | rafthttp: health check for peer f092c236ae282fc4 could not connect: remote error: tls: internal error
time="2021-03-30T01:21:19Z" level=info msg="STATUS: Healthy + Running -> Standby"
2021-03-30 01:21:20.210983 W | rafthttp: health check for peer f092c236ae282fc4 could not connect: remote error: tls: internal error
2021-03-30 01:21:20.211444 W | rafthttp: health check for peer f092c236ae282fc4 could not connect: remote error: tls: internal error
2021-03-30 01:21:24.974626 I | embed: rejected connection from "172.130.61.89:42936" (error "EOF", ServerName "")
2021-03-30 01:21:25.211277 W | rafthttp: health check for peer f092c236ae282fc4 could not connect: remote error: tls: internal error
2021-03-30 01:21:25.211636 W | rafthttp: health check for peer f092c236ae282fc4 could not connect: remote error: tls: internal error
2021-03-30 01:21:28.380618 I | embed: rejected connection from "10.9.9.49:35164" (error "read tcp 172.130.55.213:2379->10.9.9.49:35164: read: connection timed out", ServerName "")
2021-03-30 01:21:30.211637 W | rafthttp: health check for peer f092c236ae282fc4 could not connect: remote error: tls: internal error
2021-03-30 01:21:30.212044 W | rafthttp: health check for peer f092c236ae282fc4 could not connect: remote error: tls: internal error
2021-03-30 01:21:30.428558 I | embed: rejected connection from "10.9.9.49:50978" (error "read tcp 172.130.55.213:2379->10.9.9.49:50978: read: connection timed out", ServerName "")
time="2021-03-30T01:21:34Z" level=info msg="STATUS: Healthy + Running -> Standby"
2021-03-30 01:21:35.212005 W | rafthttp: health check for peer f092c236ae282fc4 could not connect: remote error: tls: internal error
2021-03-30 01:21:35.212414 W | rafthttp: health check for peer f092c236ae282fc4 could not connect: remote error: tls: internal error
2021-03-30 01:21:36.572477 I | embed: rejected connection from "172.130.61.64:4238" (error "read tcp 172.130.55.213:2379->172.130.61.64:4238: read: connection timed out", ServerName "")
2021-03-30 01:21:39.969046 I | embed: rejected connection from "172.130.61.89:43620" (error "EOF", ServerName "")
2021-03-30 01:21:40.173880 W | etcdserver/membership: Reject add member request: the number of started member (2) will be less than the quorum number of the cluster (3)
2021-03-30 01:21:40.173988 W | etcdserver: not enough started members, rejecting member add {ID:631050ed5f222514 RaftAttributes:{PeerURLs:[https://172.130.151.122:2380] IsLearner:false} Attributes:{Name: ClientURLs:[]}}
2021-03-30 01:21:40.212297 W | rafthttp: health check for peer f092c236ae282fc4 could not connect: remote error: tls: internal error
2021-03-30 01:21:40.212685 W | rafthttp: health check for peer f092c236ae282fc4 could not connect: remote error: tls: internal error

Config:

eco:
  # The interval between each cluster verification by the operator.
  check-interval: 15s
  # The time after which, an unhealthy member will be removed from the cluster.
  unhealthy-member-ttl: 30s
  # Defines whether the operator will attempt to seed a new cluster from a
  # snapshot after the managed cluster has lost quorum.
  auto-disaster-recovery: true
  # Configuration of the etcd instance.
  etcd:
    # The address that clients should use to connect to the etcd cluster (i.e.
    # load balancer public address - hostname only, no schema or port number).
    advertise-address: 
    # The directory where the etcd data is stored.
    data-dir: /var/lib/etcd
    # The TLS configuration for clients communication.
    client-transport-security:
      auto-tls: false
      cert-file: /opt/etcd/ssl/etcd.pem
      key-file: /opt/etcd/ssl/etcd-key.pem
      trusted-ca-file: /opt/etcd/ssl/ca.pem
      client-cert-auth: false
    # The TLS configuration for peers communications.
    peer-transport-security:
      auto-tls: true
      cert-file: 
      key-file: 
      trusted-ca-file: 
      peer-client-cert-auth: false
    # Defines the maximum amount of data that etcd can store, in bytes, before going into maintenance mode
    backend-quota: 2147483648
    # Defines the auto-compaction policy (set retention to 0 to disable).
    auto-compaction-mode: periodic
    auto-compaction-retention: "0"
    # Defines the initial acl that will be applied to the etcd during provisioning.
  # Configuration of the auto-scaling group provider.
  asg:
    provider: sts
  # Configuration of the snapshot provider.
  snapshot:
    provider: file
    # The interval between each snapshot.
    interval: 30m
    # The time after which a backup has to be deleted.
    ttl: 24h

Quentin-M (Owner) commented:

Hi again,

Thanks for all the details..! I don't really test the Kubernetes integration as it was a contribution from another developer.. not really surprised to see a few issues there.

  • Seems that etcd-operator-1 starts up at 01:16:25, after etcd-operator-0 lost its connection to it around 01:16:18. I assume this was due to the kubectl delete pod etcd-operator-1, correct?
  • The new etcd-operator-1 became ready quickly at 01:16:27, as witnessed by etcd-operator-0 at that time - but the operator itself ended up kicking etcd-operator-1 out at 01:16:48 after it appeared unhealthy for 30s.

I see two potential problems here:

  • I think that unhealthy-member-ttl: 30s might be too short given that the member cleaning job's interval is set to 15s by default.. Actually it does not hurt much to increase that TTL a bit as it only blocks new members (with empty data) from re-joining after another member died.. so making full reconciliation a bit slower. Setting it to 3min sounds fine to me.
  • I think that the etcd server in etcd-operator-1 did not fully stop by 01:17:10 (that's why we see the bind failure) - either it's still in progress or it's stuck - so that prevents etcd-operator-1 from coming back inside the cluster as we would expect. I believe that the "raft: stopped" message actually comes from the fact that the operator is trying to re-join, but it sends the message to any of the three endpoints behind the client.. and it ends up hitting the local etcd server which is stopped but still listening, which is quite insane. At least, we can see etcd-operator-1 is smart enough to understand it needs to reset its data to re-join. I know that the embedded etcd server is not the greatest when it comes to clean shutdown. The operator's error watcher caught the etcd message about itself stopping but we don't actually act upon it further.. We may want to change the watcher to wait for a certain grace period (or somehow check if etcd is stopped yet or not) and then issue a force stop with c.Stop(false, false) rather than just doing c.isRunning = false. The idea would be to try fully killing it (see the rough sketch after this list). On EC2, we have a safeguard for that kind of unclean etcd shutdown.. if the local etcd is not available for more than 15min, we shut down the instance completely. We could also implement a change to the statefulset to actually do just that.. have a custom health check, and if the local etcd is dead for too long, restart the pod completely. That would be quite ideal, if we are not sure that even with c.Stop(false, false), we would be able to properly stop etcd completely.
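
For what it's worth, here is a rough, self-contained sketch of that watcher idea. The cluster type, the Stop(bool, bool) signature and the isRunning flag below are placeholders standing in for eco's internals (only the names come from the discussion above), so treat this as an illustration of the grace-period approach rather than the actual implementation:

package main

import (
	"fmt"
	"time"
)

// cluster stands in for the operator's local etcd wrapper; everything here
// is a placeholder, not eco's real type.
type cluster struct {
	isRunning bool
}

// Stop force-stops the embedded etcd server (placeholder for c.Stop(false, false)).
func (c *cluster) Stop(snapshot, trim bool) {
	fmt.Println("force-stopping embedded etcd so the peer port is released")
	c.isRunning = false
}

// stopped would report whether the embedded etcd has actually shut down,
// e.g. by probing whether the peer listener (:2380) has been released.
func (c *cluster) stopped() bool {
	return !c.isRunning
}

// onEtcdStopping is what the error watcher could do when it catches the
// "etcd server is stopping" message: wait for a grace period, and if etcd
// is still lingering, force a full stop instead of only flipping isRunning,
// so the next Join attempt does not hit "bind: address already in use".
func (c *cluster) onEtcdStopping(grace time.Duration) {
	deadline := time.Now().Add(grace)
	for time.Now().Before(deadline) {
		if c.stopped() {
			return // clean shutdown finished on its own
		}
		time.Sleep(time.Second)
	}
	c.Stop(false, false)
}

func main() {
	c := &cluster{isRunning: true}
	c.onEtcdStopping(5 * time.Second)
}

If even a forced Stop cannot reliably free the peer port, the same "is etcd really stopped" check could feed a pod health check, which is essentially the statefulset-level safeguard mentioned above.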

@jicki
Author

jicki commented Mar 30, 2021

Hi,
Very strange: if I use

      volumes:
      - name: data
        emptyDir: {}

then the cluster joins OK.

kubectl -n demo-etcd delete pods etcd-operator-1
pod "etcd-operator-1" deleted
  • etcd-operator-1 logs
time="2021-03-30T03:39:25Z" level=info msg="loaded configuration file /etc/eco/eco.yaml"
time="2021-03-30T03:39:25Z" level=info msg="STATUS: Healthy + Not running -> Join"
1.617075565169975e+09	warn	clientv3/retry_interceptor.go:62	retrying of unary invoker failed	{"target": "endpoint://client-04294a3d-a4f6-4519-a337-4cf543c9f69b/172.130.61.101:2379", "attempt": 0, "error": "rpc error: code = Unavailable desc = etcdserver: unhealthy cluster"}
time="2021-03-30T03:39:25Z" level=error msg="failed to join the cluster" error="failed to add ourselves as a member of the cluster: etcdserver: unhealthy cluster"
time="2021-03-30T03:39:40Z" level=info msg="STATUS: Healthy + Not running -> Join"
1.6170755801426466e+09	warn	clientv3/retry_interceptor.go:62	retrying of unary invoker failed	{"target": "endpoint://client-e144b77d-429e-4e43-83f9-be3b85d4b01a/172.130.61.101:2379", "attempt": 0, "error": "rpc error: code = Unavailable desc = etcdserver: unhealthy cluster"}
time="2021-03-30T03:39:40Z" level=error msg="failed to join the cluster" error="failed to add ourselves as a member of the cluster: etcdserver: unhealthy cluster"
root@shell:/srv/ops-kubeconfig# kubectl -n demo-etcd logs etcd-operator-1 -f
time="2021-03-30T03:39:25Z" level=info msg="loaded configuration file /etc/eco/eco.yaml"
time="2021-03-30T03:39:25Z" level=info msg="STATUS: Healthy + Not running -> Join"
1.617075565169975e+09	warn	clientv3/retry_interceptor.go:62	retrying of unary invoker failed	{"target": "endpoint://client-04294a3d-a4f6-4519-a337-4cf543c9f69b/172.130.61.101:2379", "attempt": 0, "error": "rpc error: code = Unavailable desc = etcdserver: unhealthy cluster"}
time="2021-03-30T03:39:25Z" level=error msg="failed to join the cluster" error="failed to add ourselves as a member of the cluster: etcdserver: unhealthy cluster"
time="2021-03-30T03:39:40Z" level=info msg="STATUS: Healthy + Not running -> Join"
1.6170755801426466e+09	warn	clientv3/retry_interceptor.go:62	retrying of unary invoker failed	{"target": "endpoint://client-e144b77d-429e-4e43-83f9-be3b85d4b01a/172.130.61.101:2379", "attempt": 0, "error": "rpc error: code = Unavailable desc = etcdserver: unhealthy cluster"}
time="2021-03-30T03:39:40Z" level=error msg="failed to join the cluster" error="failed to add ourselves as a member of the cluster: etcdserver: unhealthy cluster"
time="2021-03-30T03:39:55Z" level=info msg="STATUS: Healthy + Not running -> Join"
1.6170755951473594e+09	warn	clientv3/retry_interceptor.go:62	retrying of unary invoker failed	{"target": "endpoint://client-e6b5ba46-e13d-4e09-a1b1-828ec9f95fb6/172.130.61.101:2379", "attempt": 0, "error": "rpc error: code = Unavailable desc = etcdserver: unhealthy cluster"}
time="2021-03-30T03:39:55Z" level=error msg="failed to join the cluster" error="failed to add ourselves as a member of the cluster: etcdserver: unhealthy cluster"
time="2021-03-30T03:40:10Z" level=info msg="STATUS: Healthy + Not running -> Join"
1.617075610142918e+09	warn	clientv3/retry_interceptor.go:62	retrying of unary invoker failed	{"target": "endpoint://client-6dcd34b5-f0f1-4ff4-9282-2bcf144d3f55/172.130.61.101:2379", "attempt": 0, "error": "rpc error: code = Unavailable desc = etcdserver: unhealthy cluster"}
time="2021-03-30T03:40:10Z" level=error msg="failed to join the cluster" error="failed to add ourselves as a member of the cluster: etcdserver: unhealthy cluster"
time="2021-03-30T03:40:10Z" level=error msg="failed to join the cluster" error="failed to add ourselves as a member of the cluster: etcdserver: unhealthy cluster"
time="2021-03-30T03:40:25Z" level=info msg="STATUS: Healthy + Not running -> Join"
[WARNING] Deprecated '--logger=capnslog' flag is set; use '--logger=zap' flag instead
2021-03-30 03:40:25.186629 I | embed: peerTLS: cert = /var/lib/etcd/fixtures/peer/cert.pem, key = /var/lib/etcd/fixtures/peer/key.pem, trusted-ca = , client-cert-auth = false, crl-file = 
2021-03-30 03:40:25.214039 I | embed: name = etcd-operator-1
2021-03-30 03:40:25.214092 I | embed: data dir = /var/lib/etcd
2021-03-30 03:40:25.214107 I | embed: member dir = /var/lib/etcd/member
2021-03-30 03:40:25.214119 I | embed: heartbeat = 100ms
2021-03-30 03:40:25.214131 I | embed: election = 1000ms
2021-03-30 03:40:25.214143 I | embed: snapshot count = 100000
2021-03-30 03:40:25.214171 I | embed: advertise client URLs = https://172.130.151.119:2379
2021-03-30 03:40:25.214309 W | pkg/fileutil: check file permission: directory "/var/lib/etcd" exist, but the permission is "drwxrwxrwx". The recommended permission is "-rwx------" to prevent possible unprivileged access to the data.
2021-03-30 03:40:25.593012 I | etcdserver: starting member 6f273af7bfad8b8 in cluster f50188c8af6a2f58
raft2021/03/30 03:40:25 INFO: 6f273af7bfad8b8 switched to configuration voters=()
raft2021/03/30 03:40:25 INFO: 6f273af7bfad8b8 became follower at term 0
raft2021/03/30 03:40:25 INFO: newRaft 6f273af7bfad8b8 [peers: [], term: 0, commit: 0, applied: 0, lastindex: 0, lastterm: 0]
2021-03-30 03:40:25.601966 W | auth: simple token is not cryptographically signed
2021-03-30 03:40:25.665044 I | rafthttp: started HTTP pipelining with peer 582dc87539c99702
2021-03-30 03:40:25.665172 I | rafthttp: started HTTP pipelining with peer aab1bf705d6d42b3
2021-03-30 03:40:25.665217 I | rafthttp: starting peer 582dc87539c99702...
2021-03-30 03:40:25.665271 I | rafthttp: started HTTP pipelining with peer 582dc87539c99702
2021-03-30 03:40:25.667614 I | rafthttp: started streaming with peer 582dc87539c99702 (writer)
2021-03-30 03:40:25.669819 I | rafthttp: started streaming with peer 582dc87539c99702 (writer)
2021-03-30 03:40:25.680001 I | rafthttp: started peer 582dc87539c99702
2021-03-30 03:40:25.680207 I | rafthttp: added peer 582dc87539c99702
2021-03-30 03:40:25.680299 I | rafthttp: started streaming with peer 582dc87539c99702 (stream MsgApp v2 reader)
2021-03-30 03:40:25.681002 I | rafthttp: starting peer aab1bf705d6d42b3...
2021-03-30 03:40:25.681176 I | rafthttp: started HTTP pipelining with peer aab1bf705d6d42b3
2021-03-30 03:40:25.681250 I | rafthttp: started streaming with peer 582dc87539c99702 (stream Message reader)
2021-03-30 03:40:25.684474 I | rafthttp: started streaming with peer aab1bf705d6d42b3 (writer)
2021-03-30 03:40:25.687217 I | rafthttp: started streaming with peer aab1bf705d6d42b3 (writer)
2021-03-30 03:40:25.693814 I | rafthttp: started peer aab1bf705d6d42b3
2021-03-30 03:40:25.693901 I | rafthttp: added peer aab1bf705d6d42b3
2021-03-30 03:40:25.693937 I | rafthttp: started streaming with peer aab1bf705d6d42b3 (stream Message reader)
2021-03-30 03:40:25.694203 I | etcdserver: starting server... [version: 3.4.13, cluster version: to_be_decided]
2021-03-30 03:40:25.694435 I | rafthttp: started streaming with peer aab1bf705d6d42b3 (stream MsgApp v2 reader)
2021-03-30 03:40:25.724944 I | embed: ClientTLS: cert = /opt/etcd/ssl/etcd-client.pem, key = /opt/etcd/ssl/etcd-client-key.pem, trusted-ca = /opt/etcd/ssl/ca.pem, client-cert-auth = false, crl-file = 
2021-03-30 03:40:25.725319 I | embed: listening for metrics on http://127.0.0.1:2381
time="2021-03-30T03:40:25Z" level=info msg="embedded etcd server is now running"
2021-03-30 03:40:25.725402 I | embed: listening for peers on 172.130.151.119:2380
2021-03-30 03:40:25.725647 I | embed: listening for metrics on http://172.130.151.119:2381
raft2021/03/30 03:40:25 INFO: 6f273af7bfad8b8 [term: 0] received a MsgHeartbeat message with higher term from 582dc87539c99702 [term: 2]
raft2021/03/30 03:40:25 INFO: 6f273af7bfad8b8 became follower at term 2
raft2021/03/30 03:40:25 INFO: raft.node: 6f273af7bfad8b8 elected leader 582dc87539c99702 at term 2
2021-03-30 03:40:25.726434 I | rafthttp: peer 582dc87539c99702 became active
2021-03-30 03:40:25.726501 I | rafthttp: established a TCP streaming connection with peer 582dc87539c99702 (stream Message writer)
2021-03-30 03:40:25.726638 I | rafthttp: peer aab1bf705d6d42b3 became active
2021-03-30 03:40:25.726784 I | rafthttp: established a TCP streaming connection with peer aab1bf705d6d42b3 (stream Message writer)
2021-03-30 03:40:25.727491 I | rafthttp: established a TCP streaming connection with peer 582dc87539c99702 (stream MsgApp v2 writer)
2021-03-30 03:40:25.729224 I | rafthttp: established a TCP streaming connection with peer aab1bf705d6d42b3 (stream MsgApp v2 writer)
2021-03-30 03:40:25.746354 I | etcdserver: 6f273af7bfad8b8 initialized peer connection; fast-forwarding 8 ticks (election ticks 10) with 2 active peer(s)
raft2021/03/30 03:40:25 INFO: 6f273af7bfad8b8 switched to configuration voters=(6353955055095879426)
2021-03-30 03:40:25.750356 I | etcdserver/membership: added member 582dc87539c99702 [https://172.130.55.227:2380] to cluster f50188c8af6a2f58
2021-03-30 03:40:25.750749 N | etcdserver/membership: set the initial cluster version to 3.4
2021-03-30 03:40:25.750880 I | etcdserver/api: enabled capabilities for version 3.4
raft2021/03/30 03:40:25 INFO: 6f273af7bfad8b8 switched to configuration voters=(4036908777817914600 6353955055095879426)
2021-03-30 03:40:25.752744 I | etcdserver/membership: added member 3805fb251c7e3ce8 [https://172.130.151.74:2380] to cluster f50188c8af6a2f58
2021-03-30 03:40:25.752803 I | rafthttp: starting peer 3805fb251c7e3ce8...
2021-03-30 03:40:25.752884 I | rafthttp: started HTTP pipelining with peer 3805fb251c7e3ce8
2021-03-30 03:40:25.754387 I | rafthttp: started streaming with peer 3805fb251c7e3ce8 (writer)
2021-03-30 03:40:25.755326 I | rafthttp: started streaming with peer 3805fb251c7e3ce8 (writer)
2021-03-30 03:40:25.756652 I | rafthttp: started peer 3805fb251c7e3ce8
2021-03-30 03:40:25.756701 I | rafthttp: started streaming with peer 3805fb251c7e3ce8 (stream MsgApp v2 reader)
2021-03-30 03:40:25.756731 I | rafthttp: added peer 3805fb251c7e3ce8
2021-03-30 03:40:25.757180 I | rafthttp: started streaming with peer 3805fb251c7e3ce8 (stream Message reader)
raft2021/03/30 03:40:25 INFO: 6f273af7bfad8b8 switched to configuration voters=(4036908777817914600 6353955055095879426 12299822546650219187)
2021-03-30 03:40:25.759150 I | etcdserver/membership: added member aab1bf705d6d42b3 [https://172.130.61.101:2380] to cluster f50188c8af6a2f58
2021-03-30 03:40:25.763217 I | rafthttp: established a TCP streaming connection with peer aab1bf705d6d42b3 (stream MsgApp v2 reader)
2021-03-30 03:40:25.818774 I | rafthttp: established a TCP streaming connection with peer 582dc87539c99702 (stream Message reader)
2021-03-30 03:40:25.833583 I | rafthttp: established a TCP streaming connection with peer aab1bf705d6d42b3 (stream Message reader)
2021-03-30 03:40:25.836032 I | rafthttp: established a TCP streaming connection with peer 582dc87539c99702 (stream MsgApp v2 reader)
raft2021/03/30 03:40:25 INFO: 6f273af7bfad8b8 switched to configuration voters=(6353955055095879426 12299822546650219187)
2021-03-30 03:40:25.867690 I | etcdserver/membership: removed member 3805fb251c7e3ce8 from cluster f50188c8af6a2f58
2021-03-30 03:40:25.867710 I | rafthttp: stopping peer 3805fb251c7e3ce8...
2021-03-30 03:40:25.867733 I | rafthttp: stopped streaming with peer 3805fb251c7e3ce8 (writer)
2021-03-30 03:40:25.867748 I | rafthttp: stopped streaming with peer 3805fb251c7e3ce8 (writer)
2021-03-30 03:40:25.867783 I | rafthttp: stopped HTTP pipelining with peer 3805fb251c7e3ce8
2021-03-30 03:40:25.867815 I | rafthttp: stopped streaming with peer 3805fb251c7e3ce8 (stream MsgApp v2 reader)
2021-03-30 03:40:25.867840 I | rafthttp: stopped streaming with peer 3805fb251c7e3ce8 (stream Message reader)
2021-03-30 03:40:25.867872 I | rafthttp: stopped peer 3805fb251c7e3ce8
2021-03-30 03:40:25.867886 I | rafthttp: removed peer 3805fb251c7e3ce8
raft2021/03/30 03:40:25 INFO: 6f273af7bfad8b8 switched to configuration voters=(6353955055095879426 12299822546650219187)
raft2021/03/30 03:40:25 INFO: 6f273af7bfad8b8 switched to configuration voters=(500589706128054456 6353955055095879426 12299822546650219187)
2021-03-30 03:40:25.869487 I | etcdserver/membership: added member 6f273af7bfad8b8 [https://172.130.151.119:2380] to cluster f50188c8af6a2f58
2021-03-30 03:40:25.872459 I | embed: ready to serve client requests
2021-03-30 03:40:25.873079 I | etcdserver: published {Name:etcd-operator-1 ClientURLs:[https://172.130.151.119:2379]} to cluster f50188c8af6a2f58
2021-03-30 03:40:25.876401 I | embed: serving client requests on 172.130.151.119:2379
2021-03-30 03:40:27.850965 I | embed: rejected connection from "172.130.61.101:43076" (error "EOF", ServerName "")
time="2021-03-30T03:40:40Z" level=info msg="STATUS: Healthy + Running -> Standby"
2021-03-30 03:40:42.848499 I | embed: rejected connection from "172.130.61.101:43164" (error "EOF", ServerName "")
2021-03-30 03:40:50.459593 I | embed: rejected connection from "10.9.9.46:40716" (error "EOF", ServerName "")
2021-03-30 03:40:53.798282 I | embed: rejected connection from "172.130.55.227:32864" (error "EOF", ServerName "")
time="2021-03-30T03:40:55Z" level=info msg="STATUS: Healthy + Running -> Standby"
2021-03-30 03:40:55.106705 I | embed: rejected connection from "172.130.151.119:33280" (error "EOF", ServerName "")
2021-03-30 03:41:08.799491 I | embed: rejected connection from "172.130.55.227:33142" (error "EOF", ServerName "")
time="2021-03-30T03:41:10Z" level=info msg="STATUS: Healthy + Running -> Standby"
  • etcd-operator-2 logs
2021-03-30 03:38:03.797182 I | embed: rejected connection from "10.9.9.49:38986" (error "EOF", ServerName "")
time="2021-03-30T03:38:08Z" level=info msg="STATUS: Healthy + Running -> Standby"
time="2021-03-30T03:38:23Z" level=info msg="STATUS: Healthy + Running -> Standby"
2021-03-30 03:38:27.866023 I | embed: rejected connection from "172.130.61.101:45798" (error "EOF", ServerName "")
time="2021-03-30T03:38:38Z" level=info msg="STATUS: Healthy + Running -> Standby"
time="2021-03-30T03:38:53Z" level=info msg="STATUS: Healthy + Running -> Standby"
2021-03-30 03:39:03.797294 I | embed: rejected connection from "10.9.9.49:40010" (error "EOF", ServerName "")
time="2021-03-30T03:39:09Z" level=info msg="STATUS: Healthy + Running -> Standby"
2021-03-30 03:39:12.867150 I | embed: rejected connection from "172.130.61.101:46074" (error "EOF", ServerName "")
2021-03-30 03:39:14.881968 I | embed: rejected connection from "172.130.61.64:19741" (error "EOF", ServerName "")
2021-03-30 03:39:16.377053 W | rafthttp: lost the TCP streaming connection with peer 3805fb251c7e3ce8 (stream MsgApp v2 reader)
2021-03-30 03:39:16.377219 W | rafthttp: lost the TCP streaming connection with peer 3805fb251c7e3ce8 (stream Message reader)
2021-03-30 03:39:16.384852 E | rafthttp: failed to dial 3805fb251c7e3ce8 on stream Message (EOF)
2021-03-30 03:39:16.384991 I | rafthttp: peer 3805fb251c7e3ce8 became inactive (message send to peer failed)
2021-03-30 03:39:16.457580 W | rafthttp: lost the TCP streaming connection with peer 3805fb251c7e3ce8 (stream MsgApp v2 writer)
2021-03-30 03:39:16.602742 W | rafthttp: lost the TCP streaming connection with peer 3805fb251c7e3ce8 (stream Message writer)
2021-03-30 03:39:21.828763 W | etcdserver: failed to reach the peerURL(https://172.130.151.74:2380) of member 3805fb251c7e3ce8 (Get https://172.130.151.74:2380/version: dial tcp 172.130.151.74:2380: i/o timeout)
2021-03-30 03:39:21.828832 W | etcdserver: cannot get the version of member 3805fb251c7e3ce8 (Get https://172.130.151.74:2380/version: dial tcp 172.130.151.74:2380: i/o timeout)
time="2021-03-30T03:39:23Z" level=info msg="STATUS: Healthy + Running -> Standby"
2021-03-30 03:39:25.856688 W | etcdserver: read-only range request "key:\"key1130\" " with result "range_response_count:1 size:37" took too long (541.199757ms) to execute
2021-03-30 03:39:27.891204 W | etcdserver: failed to reach the peerURL(https://172.130.151.74:2380) of member 3805fb251c7e3ce8 (Get https://172.130.151.74:2380/version: dial tcp 172.130.151.74:2380: i/o timeout)
2021-03-30 03:39:27.891240 W | etcdserver: cannot get the version of member 3805fb251c7e3ce8 (Get https://172.130.151.74:2380/version: dial tcp 172.130.151.74:2380: i/o timeout)
1.617075569606453e+09	warn	clientv3/retry_interceptor.go:62	retrying of unary invoker failed	{"target": "endpoint://client-523c0dca-b889-498e-9b7d-a34c1c379f2a/172.130.151.74:2379", "attempt": 0, "error": "rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
2021-03-30 03:39:32.866055 W | etcdserver: read-only range request "key:\"key1216\" " with result "range_response_count:1 size:37" took too long (103.197164ms) to execute
2021-03-30 03:39:33.246894 W | etcdserver: read-only range request "key:\"key1219\" " with result "range_response_count:1 size:37" took too long (122.110531ms) to execute
2021-03-30 03:39:33.894591 W | etcdserver: failed to reach the peerURL(https://172.130.151.74:2380) of member 3805fb251c7e3ce8 (Get https://172.130.151.74:2380/version: dial tcp 172.130.151.74:2380: i/o timeout)
2021-03-30 03:39:33.894633 W | etcdserver: cannot get the version of member 3805fb251c7e3ce8 (Get https://172.130.151.74:2380/version: dial tcp 172.130.151.74:2380: i/o timeout)
1.6170755746068401e+09	warn	clientv3/retry_interceptor.go:62	retrying of unary invoker failed	{"target": "endpoint://client-523c0dca-b889-498e-9b7d-a34c1c379f2a/172.130.151.74:2379", "attempt": 0, "error": "rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
time="2021-03-30T03:39:38Z" level=info msg="STATUS: Healthy + Running -> Standby"
1.6170755796071749e+09	warn	clientv3/retry_interceptor.go:62	retrying of unary invoker failed	{"target": "endpoint://client-523c0dca-b889-498e-9b7d-a34c1c379f2a/172.130.151.74:2379", "attempt": 0, "error": "rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
2021-03-30 03:39:39.895827 W | etcdserver: failed to reach the peerURL(https://172.130.151.74:2380) of member 3805fb251c7e3ce8 (Get https://172.130.151.74:2380/version: dial tcp 172.130.151.74:2380: i/o timeout)
2021-03-30 03:39:39.895880 W | etcdserver: cannot get the version of member 3805fb251c7e3ce8 (Get https://172.130.151.74:2380/version: dial tcp 172.130.151.74:2380: i/o timeout)
2021-03-30 03:39:42.871356 I | embed: rejected connection from "172.130.61.101:46400" (error "EOF", ServerName "")
1.6170755846075819e+09	warn	clientv3/retry_interceptor.go:62	retrying of unary invoker failed	{"target": "endpoint://client-523c0dca-b889-498e-9b7d-a34c1c379f2a/172.130.151.74:2379", "attempt": 0, "error": "rpc error: code = DeadlineExceeded desc = latest balancer error: all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: Error while dialing dial tcp 172.130.151.74:2379: i/o timeout\""}
2021-03-30 03:39:45.217218 W | rafthttp: health check for peer 3805fb251c7e3ce8 could not connect: dial tcp 172.130.151.74:2380: i/o timeout
2021-03-30 03:39:45.218081 W | rafthttp: health check for peer 3805fb251c7e3ce8 could not connect: dial tcp 172.130.151.74:2380: i/o timeout
2021-03-30 03:39:45.897660 W | etcdserver: failed to reach the peerURL(https://172.130.151.74:2380) of member 3805fb251c7e3ce8 (Get https://172.130.151.74:2380/version: dial tcp 172.130.151.74:2380: i/o timeout)
2021-03-30 03:39:45.897720 W | etcdserver: cannot get the version of member 3805fb251c7e3ce8 (Get https://172.130.151.74:2380/version: dial tcp 172.130.151.74:2380: i/o timeout)
1.6170755896080518e+09	warn	clientv3/retry_interceptor.go:62	retrying of unary invoker failed	{"target": "endpoint://client-523c0dca-b889-498e-9b7d-a34c1c379f2a/172.130.151.74:2379", "attempt": 0, "error": "rpc error: code = DeadlineExceeded desc = latest balancer error: all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: Error while dialing dial tcp 172.130.151.74:2379: i/o timeout\""}
2021-03-30 03:39:50.217569 W | rafthttp: health check for peer 3805fb251c7e3ce8 could not connect: dial tcp 172.130.151.74:2380: i/o timeout
2021-03-30 03:39:50.218354 W | rafthttp: health check for peer 3805fb251c7e3ce8 could not connect: dial tcp 172.130.151.74:2380: i/o timeout
2021-03-30 03:39:51.907783 W | etcdserver: failed to reach the peerURL(https://172.130.151.74:2380) of member 3805fb251c7e3ce8 (Get https://172.130.151.74:2380/version: dial tcp 172.130.151.74:2380: i/o timeout)
2021-03-30 03:39:51.907855 W | etcdserver: cannot get the version of member 3805fb251c7e3ce8 (Get https://172.130.151.74:2380/version: dial tcp 172.130.151.74:2380: i/o timeout)
time="2021-03-30T03:39:53Z" level=info msg="STATUS: Healthy + Running -> Standby"
1.6170755946557484e+09	warn	clientv3/retry_interceptor.go:62	retrying of unary invoker failed	{"target": "endpoint://client-000c5d16-7f6d-4fc0-9833-cbdcc8632cf4/172.130.151.74:2379", "attempt": 0, "error": "rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
2021-03-30 03:39:55.168102 W | etcdserver: not healthy for reconfigure, rejecting member add {ID:686b2644b04a5908 RaftAttributes:{PeerURLs:[https://172.130.151.119:2380] IsLearner:false} Attributes:{Name: ClientURLs:[]}}
2021-03-30 03:39:55.217910 W | rafthttp: health check for peer 3805fb251c7e3ce8 could not connect: dial tcp 172.130.151.74:2380: i/o timeout
2021-03-30 03:39:55.218708 W | rafthttp: health check for peer 3805fb251c7e3ce8 could not connect: dial tcp 172.130.151.74:2380: i/o timeout
2021-03-30 03:39:57.873308 I | embed: rejected connection from "172.130.61.101:46536" (error "EOF", ServerName "")
2021-03-30 03:39:57.910001 W | etcdserver: failed to reach the peerURL(https://172.130.151.74:2380) of member 3805fb251c7e3ce8 (Get https://172.130.151.74:2380/version: dial tcp 172.130.151.74:2380: i/o timeout)
2021-03-30 03:39:57.910036 W | etcdserver: cannot get the version of member 3805fb251c7e3ce8 (Get https://172.130.151.74:2380/version: dial tcp 172.130.151.74:2380: i/o timeout)
1.6170755996561596e+09	warn	clientv3/retry_interceptor.go:62	retrying of unary invoker failed	{"target": "endpoint://client-000c5d16-7f6d-4fc0-9833-cbdcc8632cf4/172.130.151.74:2379", "attempt": 0, "error": "rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
2021-03-30 03:40:00.218252 W | rafthttp: health check for peer 3805fb251c7e3ce8 could not connect: dial tcp 172.130.151.74:2380: i/o timeout
2021-03-30 03:40:00.218985 W | rafthttp: health check for peer 3805fb251c7e3ce8 could not connect: dial tcp 172.130.151.74:2380: i/o timeout
2021-03-30 03:40:03.797380 I | embed: rejected connection from "10.9.9.49:41438" (error "EOF", ServerName "")
2021-03-30 03:40:03.912293 W | etcdserver: failed to reach the peerURL(https://172.130.151.74:2380) of member 3805fb251c7e3ce8 (Get https://172.130.151.74:2380/version: dial tcp 172.130.151.74:2380: i/o timeout)
2021-03-30 03:40:03.912391 W | etcdserver: cannot get the version of member 3805fb251c7e3ce8 (Get https://172.130.151.74:2380/version: dial tcp 172.130.151.74:2380: i/o timeout)
1.6170756046566145e+09	warn	clientv3/retry_interceptor.go:62	retrying of unary invoker failed	{"target": "endpoint://client-000c5d16-7f6d-4fc0-9833-cbdcc8632cf4/172.130.151.74:2379", "attempt": 0, "error": "rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
2021-03-30 03:40:05.218613 W | rafthttp: health check for peer 3805fb251c7e3ce8 could not connect: dial tcp 172.130.151.74:2380: i/o timeout
2021-03-30 03:40:05.219334 W | rafthttp: health check for peer 3805fb251c7e3ce8 could not connect: dial tcp 172.130.151.74:2380: i/o timeout
time="2021-03-30T03:40:08Z" level=info msg="STATUS: Healthy + Running -> Standby"
1.6170756096569276e+09	warn	clientv3/retry_interceptor.go:62	retrying of unary invoker failed	{"target": "endpoint://client-000c5d16-7f6d-4fc0-9833-cbdcc8632cf4/172.130.151.74:2379", "attempt": 0, "error": "rpc error: code = DeadlineExceeded desc = latest balancer error: all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: Error while dialing dial tcp 172.130.151.74:2379: i/o timeout\""}
2021-03-30 03:40:09.914009 W | etcdserver: failed to reach the peerURL(https://172.130.151.74:2380) of member 3805fb251c7e3ce8 (Get https://172.130.151.74:2380/version: dial tcp 172.130.151.74:2380: i/o timeout)
2021-03-30 03:40:09.914040 W | etcdserver: cannot get the version of member 3805fb251c7e3ce8 (Get https://172.130.151.74:2380/version: dial tcp 172.130.151.74:2380: i/o timeout)
2021-03-30 03:40:10.218931 W | rafthttp: health check for peer 3805fb251c7e3ce8 could not connect: dial tcp 172.130.151.74:2380: i/o timeout
2021-03-30 03:40:10.219625 W | rafthttp: health check for peer 3805fb251c7e3ce8 could not connect: dial tcp 172.130.151.74:2380: i/o timeout
2021-03-30 03:40:12.870126 I | embed: rejected connection from "172.130.61.101:46668" (error "EOF", ServerName "")
1.6170756146573615e+09	warn	clientv3/retry_interceptor.go:62	retrying of unary invoker failed	{"target": "endpoint://client-000c5d16-7f6d-4fc0-9833-cbdcc8632cf4/172.130.151.74:2379", "attempt": 0, "error": "rpc error: code = DeadlineExceeded desc = latest balancer error: all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: Error while dialing dial tcp 172.130.151.74:2379: i/o timeout\""}
time="2021-03-30T03:40:14Z" level=info msg="removing member \"etcd-operator-1\" that's been unhealthy for 1m0s"
raft2021/03/30 03:40:14 INFO: 582dc87539c99702 switched to configuration voters=(6353955055095879426 12299822546650219187)
2021-03-30 03:40:14.762882 I | etcdserver/membership: removed member 3805fb251c7e3ce8 from cluster f50188c8af6a2f58
2021-03-30 03:40:14.762921 I | rafthttp: stopping peer 3805fb251c7e3ce8...
2021-03-30 03:40:14.762974 I | rafthttp: stopped streaming with peer 3805fb251c7e3ce8 (writer)
2021-03-30 03:40:14.762999 I | rafthttp: stopped streaming with peer 3805fb251c7e3ce8 (writer)
2021-03-30 03:40:14.770072 I | rafthttp: stopped HTTP pipelining with peer 3805fb251c7e3ce8
2021-03-30 03:40:14.770197 I | rafthttp: stopped streaming with peer 3805fb251c7e3ce8 (stream MsgApp v2 reader)
2021-03-30 03:40:14.770239 I | rafthttp: stopped streaming with peer 3805fb251c7e3ce8 (stream Message reader)
2021-03-30 03:40:14.770267 I | rafthttp: stopped peer 3805fb251c7e3ce8
2021-03-30 03:40:14.770301 I | rafthttp: removed peer 3805fb251c7e3ce8
2021-03-30 03:40:15.956059 W | etcdserver: failed to reach the peerURL(https://172.130.151.74:2380) of member 3805fb251c7e3ce8 (Get https://172.130.151.74:2380/version: dial tcp 172.130.151.74:2380: i/o timeout)
2021-03-30 03:40:15.956116 W | etcdserver: cannot get the version of member 3805fb251c7e3ce8 (Get https://172.130.151.74:2380/version: dial tcp 172.130.151.74:2380: i/o timeout)
raft2021/03/30 03:40:18 INFO: 582dc87539c99702 switched to configuration voters=(6353955055095879426 12299822546650219187)
time="2021-03-30T03:40:23Z" level=info msg="STATUS: Healthy + Running -> Standby"
raft2021/03/30 03:40:25 INFO: 582dc87539c99702 switched to configuration voters=(500589706128054456 6353955055095879426 12299822546650219187)
2021-03-30 03:40:25.152922 I | etcdserver/membership: added member 6f273af7bfad8b8 [https://172.130.151.119:2380] to cluster f50188c8af6a2f58
2021-03-30 03:40:25.152974 I | rafthttp: starting peer 6f273af7bfad8b8...
2021-03-30 03:40:25.153039 I | rafthttp: started HTTP pipelining with peer 6f273af7bfad8b8
2021-03-30 03:40:25.154215 I | rafthttp: started streaming with peer 6f273af7bfad8b8 (writer)
2021-03-30 03:40:25.159405 I | rafthttp: started streaming with peer 6f273af7bfad8b8 (writer)
2021-03-30 03:40:25.169689 I | rafthttp: started peer 6f273af7bfad8b8
2021-03-30 03:40:25.169809 I | rafthttp: added peer 6f273af7bfad8b8
2021-03-30 03:40:25.169972 I | rafthttp: started streaming with peer 6f273af7bfad8b8 (stream MsgApp v2 reader)
2021-03-30 03:40:25.170480 I | rafthttp: started streaming with peer 6f273af7bfad8b8 (stream Message reader)
2021-03-30 03:40:25.740733 I | rafthttp: peer 6f273af7bfad8b8 became active
2021-03-30 03:40:25.741193 I | rafthttp: established a TCP streaming connection with peer 6f273af7bfad8b8 (stream MsgApp v2 reader)
2021-03-30 03:40:25.741401 I | rafthttp: established a TCP streaming connection with peer 6f273af7bfad8b8 (stream Message reader)
2021-03-30 03:40:25.777996 I | rafthttp: established a TCP streaming connection with peer 6f273af7bfad8b8 (stream Message writer)
2021-03-30 03:40:25.852040 I | rafthttp: established a TCP streaming connection with peer 6f273af7bfad8b8 (stream MsgApp v2 writer)
time="2021-03-30T03:40:38Z" level=info msg="STATUS: Healthy + Running -> Standby"

@Quentin-M
Owner

Hi @jicki,

Yeah, it's just a timing issue. In those new logs with TTL=1min, the unhealthy member gets removed at 03:40:15, and the new member rejoins at 03:40:25 - which is fine.

Let me give you more context on the unhealthy-member-ttl.

  • When using volumes, ideally you want that value to be comfortably higher than the time it takes a dead pod to restart and rejoin - scheduling, attaching the volume, starting. This makes the operator tolerate the member being disconnected for that long, and gives it a chance to rejoin with its own data. Once a member has not been seen for that period, the operator drops it from the cluster, invalidating its state and allowing a brand new member to join. Upon attempting to join, the operator then sees that it has been removed, clears its data, gets a new identity and requests to join again - that new member is sent the current data and finally joins. So if a valid member with valid data takes 1min to restart (pod scheduling, volume attachment, pod start...), a TTL of 3min works well, as the member can rejoin with its own data quickly. Otherwise, no big deal, it can discard its data and ask to rejoin. The problem you had initially was that the TTL was 30s and the member removal check interval 15s, which made the cluster remove the member right after it had actually joined properly - and the embedded etcd server did not exit cleanly, leaving the operator unable to re-create an etcd server to join again. (A persistent-volume sketch follows this list.)
  • When volumes are not used, a restarting member has no data anyway, so it does not attempt to simply rejoin but asks to join from scratch. The etcd cluster is at 2/3 at this stage and therefore refuses the new member, because adding it would put the cluster at 2/4 and risk losing quorum. The restarting member keeps trying until the operator removes the unhealthy member after the TTL expires, making it 2/2 - at which point the restarting member is welcomed, making the cluster 2/3 again during data sync and soon enough 3/3. This is the case where a long TTL can make full cluster recovery slower.
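If you want the restart-with-data path, the usual statefulset approach is to replace the emptyDir shown above with a persistent volume per member. A hedged sketch only: the volume name data matches your snippet, while the storage class and size are placeholders for your environment.

  volumeClaimTemplates:
  - metadata:
      name: data                       # replaces the emptyDir "data" volume
    spec:
      accessModes: ["ReadWriteOnce"]
      # storageClassName: <your-storage-class>
      resources:
        requests:
          storage: 8Gi                 # placeholder size

With a volume, a deleted pod comes back with its member data, so it can simply rejoin within the TTL instead of being removed and re-added from scratch.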
