etcd go client fails when querying a cluster with a down node #9949

gnuoy · 2018-07-21T10:46:18Z

Describe the bug
The etcd go client fails if multiple https endpoints are specified when the client is initialised and the first etcd endpoint is unavailable.

To Reproduce
I set up an etcd cluster with http (port 2378) and https (port 2379) listeners, then used the etcd go client library to query the cluster. I then took down the first unit listed when the client was established (in my case the unit at 10.53.82.119). The http go client continues to work but the https one fails.

https client:

	// keep the cancel func so the context's resources are released
	ctx, cancel := context.WithTimeout(context.Background(), requestTimeout)
	defer cancel()

	cfg := clientv3.Config{
		Endpoints:   []string{"https://10.53.82.119:2379" ,"https://10.53.82.150:2379", "https://10.53.82.157:2379"},
		DialTimeout: 5 * time.Second,
	}

	cert := "/home/liam/tls_vault_certs/etcd-cert.pem"
	key := "/home/liam/tls_vault_certs/etcd.key"
	ca := "/home/liam/tls_vault_certs/etcd-ca.pem"
	tls := transport.TLSInfo{
		TrustedCAFile: ca,
		CertFile:      cert,
		KeyFile:       key,
	}

	tlscfg, err := tls.ClientConfig()
	if err != nil {
		log.Fatal(err)
	}
	cfg.TLS = tlscfg

	cli, err := clientv3.New(cfg)
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()
	kv := clientv3.NewKV(cli)

Fails with:
2018-07-21 11:34:05.728613 I | context deadline exceeded

http client:

        // keep the cancel func so the context's resources are released
        ctx, cancel := context.WithTimeout(context.Background(), requestTimeout)
        defer cancel()

        cfg := clientv3.Config{
                Endpoints:   []string{"http://10.53.82.119:2378" ,"http://10.53.82.150:2378", "http://10.53.82.157:2378"},
                DialTimeout: 5 * time.Second,
        }

        cli, err := clientv3.New(cfg)
        if err != nil {
                log.Fatal(err)
        }
        defer cli.Close()
        kv := clientv3.NewKV(cli)

gnuoy · 2018-09-04T08:57:27Z

In my setup each etcd unit has an individual server certificate that includes an IP SAN with the ip address of the unit.

If I generate a new certificate that has IP SAN entries for all units in the cluster and then use that certificate on all units, the problem goes away. I don't like this workaround though, as it means I need to reissue the certificate every time I scale out.
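
A minimal sketch of why the shared certificate sidesteps the problem, assuming the client checks every server certificate against the IP of the first endpoint. `newCertWithIPSANs` is a hypothetical helper that builds throwaway self-signed certificates for the demonstration; it is not part of etcd, and the addresses are the ones from the report above.

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"fmt"
	"math/big"
	"net"
	"time"
)

// newCertWithIPSANs returns a throwaway self-signed certificate whose IP SAN
// list holds every address in ips. Hypothetical helper, for illustration only.
func newCertWithIPSANs(ips ...string) (*x509.Certificate, error) {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return nil, err
	}
	tmpl := &x509.Certificate{
		SerialNumber: big.NewInt(1),
		Subject:      pkix.Name{CommonName: "etcd-demo"},
		NotBefore:    time.Now().Add(-time.Hour),
		NotAfter:     time.Now().Add(time.Hour),
	}
	for _, s := range ips {
		ip := net.ParseIP(s)
		if ip == nil {
			return nil, fmt.Errorf("invalid IP %q", s)
		}
		tmpl.IPAddresses = append(tmpl.IPAddresses, ip)
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
	if err != nil {
		return nil, err
	}
	return x509.ParseCertificate(der)
}

func main() {
	units := []string{"10.53.82.119", "10.53.82.150", "10.53.82.157"}

	// One certificate per unit: only that unit's own IP is listed.
	perUnit, err := newCertWithIPSANs(units[1])
	if err != nil {
		panic(err)
	}
	// One certificate shared by all units: every IP is listed.
	shared, err := newCertWithIPSANs(units...)
	if err != nil {
		panic(err)
	}

	// Assumption: the name being verified is always the first endpoint's IP.
	fmt.Println("per-unit cert:", perUnit.VerifyHostname(units[0]))
	fmt.Println("shared cert:  ", shared.VerifyHostname(units[0]))
}
```

The per-unit certificate is rejected for any IP other than its own, while the shared one passes for every unit, which matches the behaviour described above but ties certificate issuance to cluster membership.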

gnuoy · 2018-09-04T10:30:10Z

In my current setup I have three etcd nodes: 10.53.82.141, 10.53.82.217 and 10.53.82.233
I have shut down etcd on 10.53.82.141, which is the first entry in my Endpoints list. Running the client and capturing the error message from https://github.com/etcd-io/etcd/blob/master/vendor/google.golang.org/grpc/credentials/credentials.go#L166 gives:

{"level":"info","ts":1536056536.818413,"caller":"balancer/balancer.go:134","msg":"resolved","balancer-id":"bo4mljc94cxt","addresses":["https://10.53.82.141:2379","https://10.53.82.217:2379","https://10.53.82.233:2379"]}
{"level":"info","ts":1536056536.8184912,"caller":"balancer/balancer.go:192","msg":"state changed","balancer-id":"bo4mljc94cxt","connected":false,"subconn":"0xc420396550","address":"https://10.53.82.141:2379","old-state":"IDLE","new-state":"CONNECTING"}
{"level":"info","ts":1536056536.8185225,"caller":"balancer/balancer.go:192","msg":"state changed","balancer-id":"bo4mljc94cxt","connected":false,"subconn":"0xc420396570","address":"https://10.53.82.233:2379","old-state":"IDLE","new-state":"CONNECTING"}
{"level":"info","ts":1536056536.8185527,"caller":"balancer/balancer.go:192","msg":"state changed","balancer-id":"bo4mljc94cxt","connected":false,"subconn":"0xc420396590","address":"https://10.53.82.217:2379","old-state":"IDLE","new-state":"CONNECTING"}
{"level":"info","ts":1536056536.8217986,"caller":"balancer/balancer.go:192","msg":"state changed","balancer-id":"bo4mljc94cxt","connected":false,"subconn":"0xc420396550","address":"https://10.53.82.141:2379","old-state":"CONNECTING","new-state":"TRANSIENT_FAILURE"}

x509: certificate is valid for 10.53.82.233, not 10.53.82.141
{"level":"info","ts":1536056536.8290203,"caller":"balancer/balancer.go:192","msg":"state changed","balancer-id":"bo4mljc94cxt","connected":false,"subconn":"0xc420396570","address":"https://10.53.82.233:2379","old-state":"CONNECTING","new-state":"TRANSIENT_FAILURE"}

x509: certificate is valid for 10.53.82.217, not 10.53.82.141
{"level":"info","ts":1536056536.8310082,"caller":"balancer/balancer.go:192","msg":"state changed","balancer-id":"bo4mljc94cxt","connected":false,"subconn":"0xc420396590","address":"https://10.53.82.217:2379","old-state":"CONNECTING","new-state":"TRANSIENT_FAILURE"}

gnuoy · 2018-09-04T14:33:53Z

It looks like the chain of events (when servername is not used) is:

newClient is called from clientv3/client.go. The first endpoint is
extracted from the Endpoints list and used to call client.dial.

func newClient(cfg *Config) (*Client, error) {
...
	dialEndpoint := cfg.Endpoints[0]
...
	conn, err := client.dial(dialEndpoint, grpc.WithBalancerName(roundRobinBalancerName))

client.dial in turn calls DialContext from ./vendor/google.golang.org/grpc/clientconn.go

func (c *Client) dial(ep string, dopts ...grpc.DialOption) (*grpc.ClientConn, error) {
...
	conn, err := grpc.DialContext(dctx, target, opts...)

DialContext sets the authority to the url of the first endpoint (./vendor/google.golang.org/grpc/clientconn.go)

func DialContext(ctx context.Context, target string, opts ...DialOption) (conn *ClientConn, err error) {
...
        cc := &ClientConn{
                target: target,
                csMgr:  &connectivityStateManager{},
                conns:  make(map[*addrConn]struct{}),
                blockingpicker: newPickerWrapper(),
        }
...
        creds := cc.dopts.copts.TransportCredentials
        if creds != nil && creds.Info().ServerName != "" {
                cc.authority = creds.Info().ServerName
        } else if cc.dopts.insecure && cc.dopts.copts.Authority != "" {
                cc.authority = cc.dopts.copts.Authority
        } else {
                // Use endpoint from "scheme://authority/endpoint" as the default
                // authority for ClientConn.
                cc.authority = cc.parsedTarget.Endpoint
        }
...

Connection attempts to any endpoint other than the first now fail because
their IP does not match the IP of the first endpoint. ./vendor/google.golang.org/grpc/credentials/credentials.go

func (c *tlsCreds) ClientHandshake(ctx context.Context, authority string, rawConn net.Conn) (_ net.Conn, _ AuthInfo, err error) {
	// use local cfg to avoid clobbering ServerName if using multiple endpoints
	cfg := cloneTLSConfig(c.config)
	if cfg.ServerName == "" {
		colonPos := strings.LastIndex(authority, ":")
		if colonPos == -1 {
			colonPos = len(authority)
		}
		cfg.ServerName = authority[:colonPos]

jsok · 2018-12-13T06:58:22Z

Experiencing the same issue, however with SRV discovery, which uses FQDNs for the endpoint list rather than IP addresses.

Similarly only the first endpoint will ever successfully be balanced.

My client is Hashicorp vault also, so related to hashicorp/vault#4961.
Using env ETCD_CLIENT_DEBUG=on GRPC_GO_LOG_VERBOSITY_LEVEL=INFO has helped debug.

Highlights from the log lines follow...

All endpoints are successfully enumerated:

{"level":"info","ts":1544668240.554902,"caller":"balancer/balancer.go:134","msg":"resolved","balancer-id":"bqhf6xf67v83","addresses":["https://etcd1.nodes.example.com.:2379","https://etcd2.nodes.example.com.:2379","https://etcd3.nodes.example.com.:2379"]}

However only a single endpoint will result in a READY subconn (etcd3 in this case):

Dec 13 02:30:40 runc[31079]: {"level":"info","ts":1544668240.581273,"caller":"balancer/balancer.go:192","msg":"state changed","balancer-id":"bqhf6xf67v83","connected":false,"subconn":"0xc0008c69e0","address":"https://etcd2.nodes.example.com:2379","old-state":"CONNECTING","new-state":"TRANSIENT_FAILURE"}
Dec 13 02:30:40 runc[31079]: {"level":"info","ts":1544668240.5928783,"caller":"balancer/balancer.go:192","msg":"state changed","balancer-id":"bqhf6xf67v83","connected":false,"subconn":"0xc0008c69c0","address":"https://etcd1.nodes.example.com:2379","old-state":"CONNECTING","new-state":"TRANSIENT_FAILURE"}
Dec 13 02:30:40 runc[31079]: {"level":"info","ts":1544668240.5934908,"caller":"balancer/balancer.go:192","msg":"state changed","balancer-id":"bqhf6xf67v83","connected":true,"subconn":"0xc0008c6990","address":"https://etcd3.nodes.example.com:2379","old-state":"CONNECTING","new-state":"READY"}
Dec 13 02:30:40 runc[31079]: {"level":"info","ts":1544668240.5935388,"caller":"balancer/balancer.go:261","msg":"generated picker","balancer-id":"bqhf6xf67v83","policy":"etcd-client-roundrobin-balanced","subconn-ready":["https://etcd3.nodes.example.com:2379 (0xc0008c6990)"],"subconn-size":1}

It will generally pick a different etcd[1,2,3] server each time I restart vault. Just depends on the order the SRV record is returned.

TLS CN and SANs

As per the TLS advice for SRV discovery, each certificate covers the SRV domain, the FQDN and the IP address. Ditto for the etcd[2,3] servers.

Subject: CN = etcd1.nodes.example.com
X509v3 Subject Alternative Name:
    DNS:etcd1.nodes.example.com, DNS:*.etcd.example.com, DNS:etcd.example.com, IP Address:10.xx.xx.xx

Full startup logs

Dec 13 02:30:40 runc[31079]: {"level":"info","ts":1544668240.5091121,"caller":"balancer/balancer.go:39","msg":"registered balancer","policy":"etcd-client-roundrobin-balanced","name":"etcd-etcd-client-roundrobin-balanced"}
Dec 13 02:30:40 runc[31079]: {"level":"info","ts":1544668240.5548,"caller":"balancer/balancer.go:82","msg":"built balancer","balancer-id":"bqhf6xf67v83","policy":"etcd-client-roundrobin-balanced","resolver-target":"endpoint://client-bqhf6xf64qh7/etcd3.nodes.example.com.:2379"}
Dec 13 02:30:40 runc[31079]: {"level":"info","ts":1544668240.554902,"caller":"balancer/balancer.go:134","msg":"resolved","balancer-id":"bqhf6xf67v83","addresses":["https://etcd1.nodes.example.com.:2379","https://etcd2.nodes.example.com.:2379","https://etcd3.nodes.example.com.:2379"]}
Dec 13 02:30:40 runc[31079]: {"level":"info","ts":1544668240.5549786,"caller":"balancer/balancer.go:192","msg":"state changed","balancer-id":"bqhf6xf67v83","connected":false,"subconn":"0xc000868b90","address":"https://etcd3.nodes.example.com.:2379","old-state":"IDLE","new-state":"CONNECTING"}
Dec 13 02:30:40 runc[31079]: {"level":"info","ts":1544668240.5550096,"caller":"balancer/balancer.go:192","msg":"state changed","balancer-id":"bqhf6xf67v83","connected":false,"subconn":"0xc000868bb0","address":"https://etcd2.nodes.example.com.:2379","old-state":"IDLE","new-state":"CONNECTING"}
Dec 13 02:30:40 runc[31079]: {"level":"info","ts":1544668240.5550244,"caller":"balancer/balancer.go:192","msg":"state changed","balancer-id":"bqhf6xf67v83","connected":false,"subconn":"0xc000868bd0","address":"https://etcd1.nodes.example.com.:2379","old-state":"IDLE","new-state":"CONNECTING"}
Dec 13 02:30:40 runc[31079]: {"level":"info","ts":1544668240.572446,"caller":"balancer/balancer.go:192","msg":"state changed","balancer-id":"bqhf6xf67v83","connected":false,"subconn":"0xc000868bd0","address":"https://etcd1.nodes.example.com.:2379","old-state":"CONNECTING","new-state":"TRANSIENT_FAILURE"}
Dec 13 02:30:40 runc[31079]: {"level":"info","ts":1544668240.5726738,"caller":"balancer/balancer.go:192","msg":"state changed","balancer-id":"bqhf6xf67v83","connected":false,"subconn":"0xc000868bb0","address":"https://etcd2.nodes.example.com.:2379","old-state":"CONNECTING","new-state":"TRANSIENT_FAILURE"}
Dec 13 02:30:40 runc[31079]: {"level":"info","ts":1544668240.573283,"caller":"balancer/balancer.go:192","msg":"state changed","balancer-id":"bqhf6xf67v83","connected":true,"subconn":"0xc000868b90","address":"https://etcd3.nodes.example.com.:2379","old-state":"CONNECTING","new-state":"READY"}
Dec 13 02:30:40 runc[31079]: {"level":"info","ts":1544668240.5733666,"caller":"balancer/balancer.go:261","msg":"generated picker","balancer-id":"bqhf6xf67v83","policy":"etcd-client-roundrobin-balanced","subconn-ready":["https://etcd3.nodes.example.com.:2379 (0xc000868b90)"],"subconn-size":1}
Dec 13 02:30:40 runc[31079]: {"level":"info","ts":1544668240.5749798,"caller":"balancer/balancer.go:134","msg":"resolved","balancer-id":"bqhf6xf67v83","addresses":["https://etcd1.nodes.example.com:2379","https://etcd2.nodes.example.com:2379","https://etcd3.nodes.example.com:2379"]}
Dec 13 02:30:40 runc[31079]: {"level":"info","ts":1544668240.5750618,"caller":"balancer/balancer.go:161","msg":"removed subconn","balancer-id":"bqhf6xf67v83","address":"https://etcd3.nodes.example.com.:2379","subconn":"0xc000868b90"}
Dec 13 02:30:40 runc[31079]: {"level":"info","ts":1544668240.575098,"caller":"balancer/balancer.go:161","msg":"removed subconn","balancer-id":"bqhf6xf67v83","address":"https://etcd2.nodes.example.com.:2379","subconn":"0xc000868bb0"}
Dec 13 02:30:40 runc[31079]: {"level":"info","ts":1544668240.57512,"caller":"balancer/balancer.go:161","msg":"removed subconn","balancer-id":"bqhf6xf67v83","address":"https://etcd1.nodes.example.com.:2379","subconn":"0xc000868bd0"}
Dec 13 02:30:40 runc[31079]: {"level":"info","ts":1544668240.5751421,"caller":"balancer/balancer.go:192","msg":"state changed","balancer-id":"bqhf6xf67v83","connected":false,"subconn":"0xc0008c6990","address":"https://etcd3.nodes.example.com:2379","old-state":"IDLE","new-state":"CONNECTING"}
Dec 13 02:30:40 runc[31079]: {"level":"info","ts":1544668240.575167,"caller":"balancer/balancer.go:192","msg":"state changed","balancer-id":"bqhf6xf67v83","connected":false,"subconn":"0xc0008c69c0","address":"https://etcd1.nodes.example.com:2379","old-state":"IDLE","new-state":"CONNECTING"}
Dec 13 02:30:40 runc[31079]: {"level":"info","ts":1544668240.5751889,"caller":"balancer/balancer.go:192","msg":"state changed","balancer-id":"bqhf6xf67v83","connected":false,"subconn":"0xc0008c69e0","address":"https://etcd2.nodes.example.com:2379","old-state":"IDLE","new-state":"CONNECTING"}
Dec 13 02:30:40 runc[31079]: {"level":"info","ts":1544668240.57521,"caller":"balancer/balancer.go:192","msg":"state changed","balancer-id":"bqhf6xf67v83","connected":false,"subconn":"0xc000868b90","address":"https://etcd3.nodes.example.com.:2379","old-state":"READY","new-state":"SHUTDOWN"}
Dec 13 02:30:40 runc[31079]: {"level":"info","ts":1544668240.5752327,"caller":"balancer/balancer.go:261","msg":"generated picker","balancer-id":"bqhf6xf67v83","policy":"etcd-client-roundrobin-balanced","subconn-ready":[],"subconn-size":0}
Dec 13 02:30:40 runc[31079]: {"level":"info","ts":1544668240.575254,"caller":"balancer/balancer.go:192","msg":"state changed","balancer-id":"bqhf6xf67v83","connected":false,"subconn":"0xc000868bb0","address":"https://etcd2.nodes.example.com.:2379","old-state":"TRANSIENT_FAILURE","new-state":"SHUTDOWN"}
Dec 13 02:30:40 runc[31079]: {"level":"info","ts":1544668240.575295,"caller":"balancer/balancer.go:192","msg":"state changed","balancer-id":"bqhf6xf67v83","connected":false,"subconn":"0xc000868bd0","address":"https://etcd1.nodes.example.com.:2379","old-state":"TRANSIENT_FAILURE","new-state":"SHUTDOWN"}
Dec 13 02:30:40 runc[31079]: {"level":"info","ts":1544668240.581273,"caller":"balancer/balancer.go:192","msg":"state changed","balancer-id":"bqhf6xf67v83","connected":false,"subconn":"0xc0008c69e0","address":"https://etcd2.nodes.example.com:2379","old-state":"CONNECTING","new-state":"TRANSIENT_FAILURE"}
Dec 13 02:30:40 runc[31079]: {"level":"info","ts":1544668240.5928783,"caller":"balancer/balancer.go:192","msg":"state changed","balancer-id":"bqhf6xf67v83","connected":false,"subconn":"0xc0008c69c0","address":"https://etcd1.nodes.example.com:2379","old-state":"CONNECTING","new-state":"TRANSIENT_FAILURE"}
Dec 13 02:30:40 runc[31079]: {"level":"info","ts":1544668240.5934908,"caller":"balancer/balancer.go:192","msg":"state changed","balancer-id":"bqhf6xf67v83","connected":true,"subconn":"0xc0008c6990","address":"https://etcd3.nodes.example.com:2379","old-state":"CONNECTING","new-state":"READY"}
Dec 13 02:30:40 runc[31079]: {"level":"info","ts":1544668240.5935388,"caller":"balancer/balancer.go:261","msg":"generated picker","balancer-id":"bqhf6xf67v83","policy":"etcd-client-roundrobin-balanced","subconn-ready":["https://etcd3.nodes.example.com:2379 (0xc0008c6990)"],"subconn-size":1}
Dec 13 02:30:40 runc[31079]: ==> Vault server configuration:
Dec 13 02:30:40 runc[31079]:              Api Address: https://etcd2.nodes.example.com:8200
Dec 13 02:30:40 runc[31079]:                      Cgo: disabled
Dec 13 02:30:40 runc[31079]:          Cluster Address: https://etcd2.nodes.example.com:8201
Dec 13 02:30:40 runc[31079]:               Listener 1: tcp (addr: "0.0.0.0:8200", cluster address: "0.0.0.0:8201", max_request_duration: "1m30s", max_request_size: "33554432", tls: "enabled")
Dec 13 02:30:40 runc[31079]:                Log Level: (not set)
Dec 13 02:30:40 runc[31079]:                    Mlock: supported: true, enabled: true
Dec 13 02:30:40 runc[31079]:                  Storage: etcd (HA available)
Dec 13 02:30:40 runc[31079]:                  Version: Vault v0.11.5
Dec 13 02:30:40 runc[31079]:              Version Sha: a59ffa4a0f09bbf198241fe6793a96722789b639
Dec 13 02:30:40 runc[31079]: ==> Vault server started! Log data will stream in below:
Dec 13 02:30:41 runc[31079]: {"level":"info","ts":1544668241.575287,"caller":"balancer/balancer.go:192","msg":"state changed","balancer-id":"bqhf6xf67v83","connected":false,"subconn":"0xc0008c69e0","address":"https://etcd2.nodes.example.com:2379","old-state":"TRANSIENT_FAILURE","new-state":"CONNECTING"}
Dec 13 02:30:41 runc[31079]: {"level":"info","ts":1544668241.5754151,"caller":"balancer/balancer.go:192","msg":"state changed","balancer-id":"bqhf6xf67v83","connected":false,"subconn":"0xc0008c69c0","address":"https://etcd1.nodes.example.com:2379","old-state":"TRANSIENT_FAILURE","new-state":"CONNECTING"}
Dec 13 02:30:41 runc[31079]: {"level":"info","ts":1544668241.5860226,"caller":"balancer/balancer.go:192","msg":"state changed","balancer-id":"bqhf6xf67v83","connected":false,"subconn":"0xc0008c69e0","address":"https://etcd2.nodes.example.com:2379","old-state":"CONNECTING","new-state":"TRANSIENT_FAILURE"}
Dec 13 02:30:41 runc[31079]: {"level":"info","ts":1544668241.5864868,"caller":"balancer/balancer.go:192","msg":"state changed","balancer-id":"bqhf6xf67v83","connected":false,"subconn":"0xc0008c69c0","address":"https://etcd1.nodes.example.com:2379","old-state":"CONNECTING","new-state":"TRANSIENT_FAILURE"}
Dec 13 02:30:42 runc[31079]: 2018-12-13T02:30:42.721Z [INFO]  core: vault is unsealed
Dec 13 02:30:42 runc[31079]: 2018-12-13T02:30:42.721Z [INFO]  core: entering standby mode

jsok · 2018-12-14T03:59:45Z

The code tracing done by @gnuoy is illuminating in that the actual sequence of events is:

3 subconns are spawned
They race to ClientHandshake; the winner will set the ServerName on the TLS config, e.g. etcd1.example.com
The remaining subconns (which seem to share the parent clientconn and therefore the TLS config) will now be forced to verify against the same ServerName

So I can only conclude that using the same authority for all subconnections in the balancer is fundamentally flawed when used with TLS.

Under the current implementation there are 3 possible solutions:

Disable TLS verification - Giant 👎
Add *.example.com as a SAN to each member - not great
All members must list every other member as a SAN - defeats the purpose of an HA cluster since all members must have their certs rolled each time the cluster is reconfigured

alexbrand · 2019-02-04T20:52:15Z

/subscribe

xiang90 · 2019-02-06T23:02:30Z

/cc @gyuho @jpbetz

@jsok Is the TLS config on the etcd client side or on the gRPC side? Can we switch to a fresh config through the balancer if the endpoint is different from the previous one? Can you take a look at whether we can fix this problem from the etcd client side?

gyuho · 2019-02-19T18:39:41Z

@xiang90 @jpbetz I can reproduce this. Let me see if I can fix this in etcd client side.

dunjut · 2019-02-22T07:16:21Z

We're experiencing exactly the same issue as @gnuoy :

One certificate per server IP address and etcdv3 client fails if first endpoint unavailable.

pravinsinghal · 2019-04-16T06:26:16Z

@gyuho I am also running into the same issue. From the comments above it looks like you have been working on a patch. Any update on it?

timothysc · 2019-05-15T15:29:23Z

@gyuho - It's difficult to decipher if this has been fixed or if it's still pending now as there are about a dozen cross-referenced issues and PRs.

alexbrand · 2019-05-15T16:10:54Z

I looked at the status of this last week. I believe this is still pending on #10489. I could be wrong though.

bahar-p · 2019-05-15T19:45:14Z

Any updates on this? We're still running into this with Kubernetes v1.14.1 and etcd 3.3.12

timothysc · 2019-05-15T21:11:39Z

We should really get an update on this for k8s - v1.15

gyuho · 2019-06-13T00:52:55Z

Just discussed with gRPC team, and got some good feedback.
I will rework this in the next few weeks.

Protopopys · 2019-07-02T11:07:22Z

Any updates on this? We're still running into this with Kubernetes v1.15.0 and etcd 3.3.13

ymmt2005 · 2019-07-17T22:33:34Z

FWIW, we work around this problem by placing TCP reverse proxies
on each node that connect to etcd. Each client can connect to etcd
via localhost:12379. Since TLS certificates of etcd servers have "localhost"
SAN and "127.0.0.1" IP-SAN, the problem can be avoided.

A possible better workaround would be to place TLS-terminating
TCP reverse proxies. That is, they terminate TLS both for client connections
and for etcd servers while validating server certificates with their public
IP addresses.

gyuho · 2019-07-27T05:41:18Z

The fix #10911 has been merged, and will be released in 3.4 #10943.

frittentheke · 2019-07-28T21:59:07Z

@bahar-p @Protopopys ... I was just wondering the same thing.
The question still is ... will Kubernetes / kubeadm or other installer tools move to etcd 3.4 (when released), or will etcd do a backport to etcd 3.3.xx so that the tools just have to pick up the point release?

See my question at #10911 (comment)

abezard · 2019-08-07T23:45:37Z

@gyuho I tried this code on my staging environment and I still face the same issue:

ETCD_CLIENT_DEBUG=1 ETCDCTL_API=3 /usr/local/bin/etcdctl --endpoints 10.10.10.242:2379,10.10.10.252:2379,10.10.10.139:2379 --cacert /etc/ssl/etcd-root-ca.pem --cert /etc/ssl/client.pem --key /etc/ssl/client-key.pem put foo bar
{"level":"debug","ts":"2019-08-07T22:54:09.070Z","caller":"balancer/balancer.go:60","msg":"registered balancer","policy":"picker-roundrobin-balanced","name":"etcd-picker-roundrobin-balanced"}
{"level":"info","ts":"2019-08-07T22:54:09.074Z","caller":"balancer/balancer.go:97","msg":"built balancer","balancer-id":"bw3rmsvwfygd","policy":"picker-roundrobin-balanced","resolver-target":"endpoint://client-3d04f4c4-1cb3-4ad3-87c1-6e7577af040f/10.10.10.242:2379"}
{"level":"info","ts":"2019-08-07T22:54:09.075Z","caller":"balancer/balancer.go:148","msg":"resolved","picker":"picker-error","balancer-id":"bw3rmsvwfygd","addresses":["10.10.10.139:2379","10.10.10.242:2379","10.10.10.252:2379"]}
{"level":"info","ts":"2019-08-07T22:54:09.075Z","caller":"balancer/balancer.go:166","msg":"created subconn","address":"10.10.10.242:2379"}
{"level":"info","ts":"2019-08-07T22:54:09.075Z","caller":"balancer/balancer.go:166","msg":"created subconn","address":"10.10.10.252:2379"}
{"level":"info","ts":"2019-08-07T22:54:09.075Z","caller":"balancer/balancer.go:166","msg":"created subconn","address":"10.10.10.139:2379"}
{"level":"info","ts":"2019-08-07T22:54:09.075Z","caller":"balancer/balancer.go:214","msg":"state changed","picker":"picker-error","balancer-id":"bw3rmsvwfygd","connected":false,"subconn":"0xc00029f3b0","subconn-size":3,"address":"10.10.10.242:2379","old-state":"IDLE","new-state":"CONNECTING"}
{"level":"warn","ts":"2019-08-07T22:54:09.075Z","caller":"connectivity/connectivity.go:81","msg":"connectivity recorder received unknown state","connectivity-state":"IDLE"}
{"level":"info","ts":"2019-08-07T22:54:09.075Z","caller":"balancer/balancer.go:214","msg":"state changed","picker":"picker-error","balancer-id":"bw3rmsvwfygd","connected":false,"subconn":"0xc00029f3d0","subconn-size":3,"address":"10.10.10.252:2379","old-state":"IDLE","new-state":"CONNECTING"}
{"level":"warn","ts":"2019-08-07T22:54:09.075Z","caller":"connectivity/connectivity.go:81","msg":"connectivity recorder received unknown state","connectivity-state":"IDLE"}
{"level":"info","ts":"2019-08-07T22:54:09.075Z","caller":"balancer/balancer.go:214","msg":"state changed","picker":"picker-error","balancer-id":"bw3rmsvwfygd","connected":false,"subconn":"0xc00029f3f0","subconn-size":3,"address":"10.10.10.139:2379","old-state":"IDLE","new-state":"CONNECTING"}
{"level":"warn","ts":"2019-08-07T22:54:09.075Z","caller":"connectivity/connectivity.go:81","msg":"connectivity recorder received unknown state","connectivity-state":"IDLE"}
{"level":"info","ts":"2019-08-07T22:54:09.076Z","caller":"balancer/balancer.go:214","msg":"state changed","picker":"picker-error","balancer-id":"bw3rmsvwfygd","connected":false,"subconn":"0xc00029f3b0","subconn-size":3,"address":"10.10.10.242:2379","old-state":"CONNECTING","new-state":"TRANSIENT_FAILURE"}
{"level":"info","ts":"2019-08-07T22:54:09.086Z","caller":"balancer/balancer.go:214","msg":"state changed","picker":"picker-error","balancer-id":"bw3rmsvwfygd","connected":false,"subconn":"0xc00029f3f0","subconn-size":3,"address":"10.10.10.139:2379","old-state":"CONNECTING","new-state":"TRANSIENT_FAILURE"}
{"level":"info","ts":"2019-08-07T22:54:09.088Z","caller":"balancer/balancer.go:214","msg":"state changed","picker":"picker-error","balancer-id":"bw3rmsvwfygd","connected":false,"subconn":"0xc00029f3d0","subconn-size":3,"address":"10.10.10.252:2379","old-state":"CONNECTING","new-state":"TRANSIENT_FAILURE"}
{"level":"info","ts":"2019-08-07T22:54:09.088Z","caller":"balancer/balancer.go:256","msg":"updated picker to transient error picker","picker":"picker-error","balancer-id":"bw3rmsvwfygd","policy":"picker-roundrobin-balanced"}
{"level":"info","ts":"2019-08-07T22:54:10.076Z","caller":"balancer/balancer.go:214","msg":"state changed","picker":"picker-error","balancer-id":"bw3rmsvwfygd","connected":false,"subconn":"0xc00029f3b0","subconn-size":3,"address":"10.10.10.242:2379","old-state":"TRANSIENT_FAILURE","new-state":"CONNECTING"}
{"level":"info","ts":"2019-08-07T22:54:10.076Z","caller":"balancer/balancer.go:278","msg":"updated picker","picker":"picker-roundrobin-balanced","balancer-id":"bw3rmsvwfygd","policy":"picker-roundrobin-balanced","subconn-ready":[],"subconn-size":0}
{"level":"info","ts":"2019-08-07T22:54:10.077Z","caller":"balancer/balancer.go:214","msg":"state changed","picker":"picker-roundrobin-balanced","balancer-id":"bw3rmsvwfygd","connected":false,"subconn":"0xc00029f3b0","subconn-size":3,"address":"10.10.10.242:2379","old-state":"CONNECTING","new-state":"TRANSIENT_FAILURE"}
{"level":"info","ts":"2019-08-07T22:54:10.077Z","caller":"balancer/balancer.go:256","msg":"updated picker to transient error picker","picker":"picker-error","balancer-id":"bw3rmsvwfygd","policy":"picker-roundrobin-balanced"}
{"level":"info","ts":"2019-08-07T22:54:10.086Z","caller":"balancer/balancer.go:214","msg":"state changed","picker":"picker-error","balancer-id":"bw3rmsvwfygd","connected":false,"subconn":"0xc00029f3f0","subconn-size":3,"address":"10.10.10.139:2379","old-state":"TRANSIENT_FAILURE","new-state":"CONNECTING"}
{"level":"info","ts":"2019-08-07T22:54:10.086Z","caller":"balancer/balancer.go:278","msg":"updated picker","picker":"picker-roundrobin-balanced","balancer-id":"bw3rmsvwfygd","policy":"picker-roundrobin-balanced","subconn-ready":[],"subconn-size":0}
{"level":"info","ts":"2019-08-07T22:54:10.088Z","caller":"balancer/balancer.go:214","msg":"state changed","picker":"picker-roundrobin-balanced","balancer-id":"bw3rmsvwfygd","connected":false,"subconn":"0xc00029f3d0","subconn-size":3,"address":"10.10.10.252:2379","old-state":"TRANSIENT_FAILURE","new-state":"CONNECTING"}
{"level":"info","ts":"2019-08-07T22:54:10.098Z","caller":"balancer/balancer.go:214","msg":"state changed","picker":"picker-roundrobin-balanced","balancer-id":"bw3rmsvwfygd","connected":false,"subconn":"0xc00029f3f0","subconn-size":3,"address":"10.10.10.139:2379","old-state":"CONNECTING","new-state":"TRANSIENT_FAILURE"}
{"level":"info","ts":"2019-08-07T22:54:10.101Z","caller":"balancer/balancer.go:214","msg":"state changed","picker":"picker-roundrobin-balanced","balancer-id":"bw3rmsvwfygd","connected":false,"subconn":"0xc00029f3d0","subconn-size":3,"address":"10.10.10.252:2379","old-state":"CONNECTING","new-state":"TRANSIENT_FAILURE"}
{"level":"info","ts":"2019-08-07T22:54:10.101Z","caller":"balancer/balancer.go:256","msg":"updated picker to transient error picker","picker":"picker-error","balancer-id":"bw3rmsvwfygd","policy":"picker-roundrobin-balanced"}
{"level":"info","ts":"2019-08-07T22:54:11.405Z","caller":"balancer/balancer.go:214","msg":"state changed","picker":"picker-error","balancer-id":"bw3rmsvwfygd","connected":false,"subconn":"0xc00029f3f0","subconn-size":3,"address":"10.10.10.139:2379","old-state":"TRANSIENT_FAILURE","new-state":"CONNECTING"}
{"level":"info","ts":"2019-08-07T22:54:11.405Z","caller":"balancer/balancer.go:278","msg":"updated picker","picker":"picker-roundrobin-balanced","balancer-id":"bw3rmsvwfygd","policy":"picker-roundrobin-balanced","subconn-ready":[],"subconn-size":0}
{"level":"info","ts":"2019-08-07T22:54:11.416Z","caller":"balancer/balancer.go:214","msg":"state changed","picker":"picker-roundrobin-balanced","balancer-id":"bw3rmsvwfygd","connected":false,"subconn":"0xc00029f3f0","subconn-size":3,"address":"10.10.10.139:2379","old-state":"CONNECTING","new-state":"TRANSIENT_FAILURE"}
{"level":"info","ts":"2019-08-07T22:54:11.416Z","caller":"balancer/balancer.go:256","msg":"updated picker to transient error picker","picker":"picker-error","balancer-id":"bw3rmsvwfygd","policy":"picker-roundrobin-balanced"}
{"level":"info","ts":"2019-08-07T22:54:11.601Z","caller":"balancer/balancer.go:214","msg":"state changed","picker":"picker-error","balancer-id":"bw3rmsvwfygd","connected":false,"subconn":"0xc00029f3b0","subconn-size":3,"address":"10.10.10.242:2379","old-state":"TRANSIENT_FAILURE","new-state":"CONNECTING"}
{"level":"info","ts":"2019-08-07T22:54:11.601Z","caller":"balancer/balancer.go:278","msg":"updated picker","picker":"picker-roundrobin-balanced","balancer-id":"bw3rmsvwfygd","policy":"picker-roundrobin-balanced","subconn-ready":[],"subconn-size":0}
{"level":"info","ts":"2019-08-07T22:54:11.602Z","caller":"balancer/balancer.go:214","msg":"state changed","picker":"picker-roundrobin-balanced","balancer-id":"bw3rmsvwfygd","connected":false,"subconn":"0xc00029f3b0","subconn-size":3,"address":"10.10.10.242:2379","old-state":"CONNECTING","new-state":"TRANSIENT_FAILURE"}
{"level":"info","ts":"2019-08-07T22:54:11.603Z","caller":"balancer/balancer.go:256","msg":"updated picker to transient error picker","picker":"picker-error","balancer-id":"bw3rmsvwfygd","policy":"picker-roundrobin-balanced"}
{"level":"info","ts":"2019-08-07T22:54:11.809Z","caller":"balancer/balancer.go:214","msg":"state changed","picker":"picker-error","balancer-id":"bw3rmsvwfygd","connected":false,"subconn":"0xc00029f3d0","subconn-size":3,"address":"10.10.10.252:2379","old-state":"TRANSIENT_FAILURE","new-state":"CONNECTING"}
{"level":"info","ts":"2019-08-07T22:54:11.809Z","caller":"balancer/balancer.go:278","msg":"updated picker","picker":"picker-roundrobin-balanced","balancer-id":"bw3rmsvwfygd","policy":"picker-roundrobin-balanced","subconn-ready":[],"subconn-size":0}
{"level":"info","ts":"2019-08-07T22:54:11.822Z","caller":"balancer/balancer.go:214","msg":"state changed","picker":"picker-roundrobin-balanced","balancer-id":"bw3rmsvwfygd","connected":false,"subconn":"0xc00029f3d0","subconn-size":3,"address":"10.10.10.252:2379","old-state":"CONNECTING","new-state":"TRANSIENT_FAILURE"}
{"level":"info","ts":"2019-08-07T22:54:11.822Z","caller":"balancer/balancer.go:256","msg":"updated picker to transient error picker","picker":"picker-error","balancer-id":"bw3rmsvwfygd","policy":"picker-roundrobin-balanced"}
{"level":"info","ts":"2019-08-07T22:54:13.497Z","caller":"balancer/balancer.go:214","msg":"state changed","picker":"picker-error","balancer-id":"bw3rmsvwfygd","connected":false,"subconn":"0xc00029f3f0","subconn-size":3,"address":"10.10.10.139:2379","old-state":"TRANSIENT_FAILURE","new-state":"CONNECTING"}
{"level":"info","ts":"2019-08-07T22:54:13.497Z","caller":"balancer/balancer.go:278","msg":"updated picker","picker":"picker-roundrobin-balanced","balancer-id":"bw3rmsvwfygd","policy":"picker-roundrobin-balanced","subconn-ready":[],"subconn-size":0}
{"level":"info","ts":"2019-08-07T22:54:13.509Z","caller":"balancer/balancer.go:214","msg":"state changed","picker":"picker-roundrobin-balanced","balancer-id":"bw3rmsvwfygd","connected":false,"subconn":"0xc00029f3f0","subconn-size":3,"address":"10.10.10.139:2379","old-state":"CONNECTING","new-state":"TRANSIENT_FAILURE"}
{"level":"info","ts":"2019-08-07T22:54:13.509Z","caller":"balancer/balancer.go:256","msg":"updated picker to transient error picker","picker":"picker-error","balancer-id":"bw3rmsvwfygd","policy":"picker-roundrobin-balanced"}
{"level":"info","ts":"2019-08-07T22:54:14.044Z","caller":"balancer/balancer.go:214","msg":"state changed","picker":"picker-error","balancer-id":"bw3rmsvwfygd","connected":false,"subconn":"0xc00029f3b0","subconn-size":3,"address":"10.10.10.242:2379","old-state":"TRANSIENT_FAILURE","new-state":"CONNECTING"}
{"level":"info","ts":"2019-08-07T22:54:14.044Z","caller":"balancer/balancer.go:278","msg":"updated picker","picker":"picker-roundrobin-balanced","balancer-id":"bw3rmsvwfygd","policy":"picker-roundrobin-balanced","subconn-ready":[],"subconn-size":0}
{"level":"info","ts":"2019-08-07T22:54:14.045Z","caller":"balancer/balancer.go:214","msg":"state changed","picker":"picker-roundrobin-balanced","balancer-id":"bw3rmsvwfygd","connected":false,"subconn":"0xc00029f3b0","subconn-size":3,"address":"10.10.10.242:2379","old-state":"CONNECTING","new-state":"TRANSIENT_FAILURE"}
{"level":"info","ts":"2019-08-07T22:54:14.045Z","caller":"balancer/balancer.go:256","msg":"updated picker to transient error picker","picker":"picker-error","balancer-id":"bw3rmsvwfygd","policy":"picker-roundrobin-balanced"}
{"level":"warn","ts":"2019-08-07T22:54:14.074Z","caller":"clientv3/retry_interceptor.go:61","msg":"retrying of unary invoker failed","target":"endpoint://client-3d04f4c4-1cb3-4ad3-87c1-6e7577af040f/10.10.10.242:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest connection error: connection error: desc = \"transport: Error while dialing dial tcp 10.10.10.242:2379: connect: connection refused\""}

10.10.10.242 is down on purpose.

And from the same machine I can check the health of those etcd nodes just fine:

root@etcd-client-1:/etc/ssl# etcdctl --endpoints https://10.10.10.252:2379 --cacert=/etc/ssl/etcd-root-ca.pem --cert=/etc/ssl/client.pem  --key=/etc/ssl/client-key.pem --debug endpoint health
https://10.10.10.252:2379 is healthy: successfully committed proposal: took = 18.506072ms
root@etcd-client-1:/etc/ssl# etcdctl --endpoints https://10.10.10.139:2379 --cacert=/etc/ssl/etcd-root-ca.pem --cert=/etc/ssl/client.pem  --key=/etc/ssl/client-key.pem --debug endpoint health
https://10.10.10.139:2379 is healthy: successfully committed proposal: took = 16.068855ms
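
The balancer log above is easier to read once the state changes are tallied per address. Below is a minimal Go sketch, assuming the log lines are the JSON shown above; `balancerEvent` and `countFailures` are hypothetical names for illustration, not part of clientv3. In the excerpt above, all three addresses reach TRANSIENT_FAILURE at least once, including the two that etcdctl reports as healthy.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// balancerEvent holds the few fields of a balancer log line that matter
// here. The JSON keys are copied from the log output in this thread; the
// struct itself is a hypothetical helper.
type balancerEvent struct {
	Msg      string `json:"msg"`
	Address  string `json:"address"`
	NewState string `json:"new-state"`
}

// countFailures returns, per address, how many "state changed" events
// ended in TRANSIENT_FAILURE. Lines that are not valid JSON are skipped.
func countFailures(lines []string) map[string]int {
	failures := make(map[string]int)
	for _, line := range lines {
		var ev balancerEvent
		if err := json.Unmarshal([]byte(line), &ev); err != nil {
			continue
		}
		if ev.Msg == "state changed" && ev.NewState == "TRANSIENT_FAILURE" {
			failures[ev.Address]++
		}
	}
	return failures
}

func main() {
	// Abbreviated lines in the same shape as the log above.
	lines := []string{
		`{"msg":"state changed","address":"10.10.10.242:2379","old-state":"CONNECTING","new-state":"TRANSIENT_FAILURE"}`,
		`{"msg":"state changed","address":"10.10.10.252:2379","old-state":"CONNECTING","new-state":"TRANSIENT_FAILURE"}`,
		`{"msg":"updated picker to transient error picker","picker":"picker-error"}`,
	}
	failures := countFailures(lines)

	// Sort the addresses so the summary is printed in a stable order.
	addrs := make([]string, 0, len(failures))
	for addr := range failures {
		addrs = append(addrs, addr)
	}
	sort.Strings(addrs)
	for _, addr := range addrs {
		fmt.Printf("%s: %d transient failure(s)\n", addr, failures[addr])
	}
}
```

Feeding the full client log through `countFailures` gives a quick view of whether only the down node is failing or every subconn is.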

So I'm pretty confused here. I double checked to make sure I was working off a branch with your changes and I'm...

root@etcd-client-1:/etc/ssl# etcdctl --endpoints https://10.10.10.139:2379 --cacert=/etc/ssl/etcd-root-ca.pem --cert=/etc/ssl/client.pem  --key=/etc/ssl/client-key.pem --debug endpoint status
[...]
https://10.10.10.139:2379, 146b0b16a4192d16, 3.5.0-pre, 20 kB, false, false, 48, 71, 71,

~/go/src/github.com/etcd-io/etcd (release-3.4)$ make build
GO_BUILD_FLAGS="-v" ./build
./bin/etcd --version
etcd Version: 3.5.0-pre
Git SHA: 524278c18
Go Version: go1.12
Go OS/Arch: linux/amd64
./bin/etcdctl version
etcdctl version: 3.5.0-pre
API version: 3.5

(sorry for commenting on a closed ticket)

abezard · 2019-08-09T00:04:11Z

Fixed it with this change: moondev/kubernetes@45f6cb2#diff-c9fae1df26aedd520ef93a008d255581R136-R148

Contains an important fix in clientv3 that allows vault to successfully failover to another etcdv3 endpoint in the event that the current active connection becomes unavailable.

See also:
* etcd-io/etcd#9949
* etcd-io/etcd#10911
* https://github.com/etcd-io/etcd/blob/master/CHANGELOG-3.3.md#v3314-2019-08-16

Fixes hashicorp#4961
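
The failover behaviour described here can be illustrated at the application level. Below is a rough Go sketch, assuming a caller that wants to pick the first reachable endpoint before building a client. It is not how the clientv3 balancer is implemented; `firstReachable`, `tcpDial`, and `errUnreachable` are hypothetical names. `tcpDial` only checks TCP reachability and does no TLS handshake or etcd health check.

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"time"
)

// errUnreachable is a hypothetical sentinel used by the stub dialer below.
var errUnreachable = errors.New("endpoint unreachable")

// tcpDial is a plain TCP reachability probe with a fixed timeout.
// It does not verify certificates or cluster health.
func tcpDial(addr string) error {
	conn, err := net.DialTimeout("tcp", addr, 2*time.Second)
	if err != nil {
		return err
	}
	return conn.Close()
}

// firstReachable tries each endpoint in order and returns the first one
// for which dial succeeds. If none succeeds, it returns the last dial
// error, annotated with the endpoint that produced it.
func firstReachable(endpoints []string, dial func(addr string) error) (string, error) {
	var lastErr error
	for _, ep := range endpoints {
		if err := dial(ep); err != nil {
			lastErr = fmt.Errorf("%s: %w", ep, err)
			continue
		}
		return ep, nil
	}
	if lastErr == nil {
		lastErr = errors.New("no endpoints given")
	}
	return "", lastErr
}

func main() {
	endpoints := []string{"10.10.10.242:2379", "10.10.10.252:2379", "10.10.10.139:2379"}

	// Stub dialer: pretend only 10.10.10.242 is down, as in the report.
	// Pass tcpDial instead to probe real addresses.
	stub := func(addr string) error {
		if addr == "10.10.10.242:2379" {
			return errUnreachable
		}
		return nil
	}

	ep, err := firstReachable(endpoints, stub)
	if err != nil {
		fmt.Println("no endpoint reachable:", err)
		return
	}
	fmt.Println("using endpoint", ep)
}
```

Taking the dialer as a parameter keeps the selection logic testable without a live cluster.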

gnuoy mentioned this issue Jul 21, 2018

Vault is inaccessible if an etcd unit is lost hashicorp/vault#4961

Closed

gyuho added the area/clientv3 label Jul 23, 2018

mitsutaka pushed a commit to cybozu-go/cke that referenced this issue Sep 4, 2018

Update mtest to avoid known bug

1c11959

The etcd go client fails if multiple https endpoints are specified when the client is initialised and the first etcd endpoint is unavailable. etcd-io/etcd#9949

mitsutaka pushed a commit to cybozu-go/cke that referenced this issue Sep 4, 2018

Update mtest to avoid known bug

af8b084

The etcd go client fails if multiple https endpoints are specified when the client is initialised and the first etcd endpoint is unavailable. etcd-io/etcd#9949

ymmt2005 mentioned this issue Sep 4, 2018

Workaround etcd/issues/9949 cybozu-go/etcdutil#4

Closed

gyuho added this to the etcd-v3.4 milestone Sep 4, 2018

jsok mentioned this issue Dec 13, 2018

Vault with etcd backend storage with multiple adresses uses only the first server's ServerName for all endpoints, generating bad cert tls error on etcd side hashicorp/vault#4349

Closed

JishanXing mentioned this issue Dec 20, 2018

kube-apiserver 1.13.x refuses to work when first etcd-server is not available. kubernetes/kubernetes#72102

Closed

This was referenced Jan 15, 2019

etcd 3.3.x benign TLS error when etcdctl uses 3 --endpoints values #10040

Closed

Multiple endpoints in ETCDCTL_ENDPOINTS lead to rejected connection from "1.2.3.4:5678" (error "EOF", ServerName "") spam in server log #10391

Closed

gyuho self-assigned this Feb 11, 2019

This was referenced Feb 14, 2019

clientv3: clarify retry interceptor logging #10475

Closed

clientv3: clarify retry interceptor logging #10476

Merged

gyuho pinned this issue Feb 19, 2019

This was referenced Feb 21, 2019

[gRPC] clientconn: set authority with the latest dial target #10489

Closed

clientconn: set dial target "Authority" with target address grpc/grpc-go#2650

Closed

hexfusion mentioned this issue Mar 1, 2019

How to deal with programing by clientv3 library when one etcd node is down #10507

Closed

danbeaulieu mentioned this issue Mar 1, 2019

kubeadm reset success but this node ip still in kubeadm-config configmap kubernetes/kubeadm#1300

Closed

This was referenced Mar 22, 2019

kubeadm join is not fault tolerant to etcd endpoint failures kubernetes/kubeadm#1432

Closed

kubeadm: check etcd member health before trying to sync kubernetes/kubernetes#75600

Closed

jcrowthe mentioned this issue Mar 26, 2019

(error "EOF", ServerName "") error on etcd servers #10587

Closed

hexfusion mentioned this issue Apr 27, 2019

Bug 1698456 *: use wildcard domain in DNS: SAN for etcd server certs openshift/machine-config-operator#676

Merged

tbg unpinned this issue May 1, 2019

jingyih mentioned this issue May 12, 2019

TCP connections between apiserver and etcd looks strange #10717

Closed

gyuho mentioned this issue Jul 19, 2019

clientv3: fix secure endpoint failover, refactor with gRPC 1.22 upgrade #10911

Merged


gyuho closed this as completed in #10911 Jul 26, 2019

liuxu623 mentioned this issue Aug 22, 2019

support etcd inseure client kubernetes-sigs/kubespray#5102

Closed

jsok mentioned this issue Sep 2, 2019

Upgrade etcd dependency to v3.3.15 hashicorp/vault#7403

Closed

klizhentas mentioned this issue Sep 3, 2019

Single node failure of etcd brings Teleport 3.1.8 down. gravitational/teleport#2762

Closed

klizhentas mentioned this issue Sep 6, 2019

Updating dependencies for etcd v3.3.15 gravitational/teleport#2962

Closed

echlebek mentioned this issue Sep 13, 2019

Embedded etcd has a critical bug sensu/sensu-go#3278

Closed

cfc4n mentioned this issue Jun 13, 2020

balancer: create many tcp connections use the same endpoint #11371

Closed

justinsb mentioned this issue Feb 4, 2023

MemberList doesn't work after adding a new member to one-node cluster #15243

Closed
