etcd go client fails when querying a cluster with a down node #9949
Comments
The etcd go client fails if multiple https endpoints are specified when the client is initialised and the first etcd endpoint is unavailable. etcd-io/etcd#9949
In my setup each etcd unit has an individual server certificate that includes an IP SAN with the IP address of the unit. If I generate a new certificate that has IP SAN entries for all units in the cluster and then use that certificate on all units, the problem goes away. I don't like this workaround though, as it means I need to reissue the certificate every time I scale out.
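For illustration, here is a minimal sketch of that workaround using only Go's standard library: one certificate carrying an IP SAN for every unit, which then verifies against any of them. The helper name `newSharedCert` is hypothetical, the certificate is self-signed purely for the demo, and the three addresses are the ones quoted elsewhere in this thread. A real deployment would sign with the cluster CA.

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"fmt"
	"math/big"
	"net"
	"time"
)

// newSharedCert (hypothetical helper) returns a self-signed certificate
// whose IP SANs cover every address in ips.
func newSharedCert(ips []string) (*x509.Certificate, error) {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return nil, err
	}
	tmpl := &x509.Certificate{
		SerialNumber: big.NewInt(1),
		Subject:      pkix.Name{CommonName: "etcd"},
		NotBefore:    time.Now().Add(-time.Minute),
		NotAfter:     time.Now().Add(time.Hour),
		KeyUsage:     x509.KeyUsageDigitalSignature,
		ExtKeyUsage:  []x509.ExtKeyUsage{x509.ExtKeyUsageServerAuth},
	}
	for _, s := range ips {
		ip := net.ParseIP(s)
		if ip == nil {
			return nil, fmt.Errorf("invalid IP address %q", s)
		}
		tmpl.IPAddresses = append(tmpl.IPAddresses, ip)
	}
	// Self-signed: the template is its own parent.
	der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
	if err != nil {
		return nil, err
	}
	return x509.ParseCertificate(der)
}

func main() {
	units := []string{"10.53.82.141", "10.53.82.217", "10.53.82.233"}
	cert, err := newSharedCert(units)
	if err != nil {
		panic(err)
	}
	// The shared certificate is valid for whichever unit the client checks it against.
	for _, ip := range units {
		fmt.Println(ip, "verifies:", cert.VerifyHostname(ip) == nil)
	}
}
```

The cost is the one described above: adding a unit means extending the SAN list and redistributing the certificate.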
In my current setup I have three etcd nodes: 10.53.82.141, 10.53.82.217 and 10.53.82.233
It looks like the chain of events (when servername is not used):
Experiencing the same issue, but with SRV discovery, which uses FQDNs rather than IP addresses for the endpoint list. Similarly, only the first endpoint will ever successfully be balanced. My client is also HashiCorp Vault, so this is related to hashicorp/vault#4961. Highlights from the log lines follow. All endpoints are successfully enumerated:
However only a single endpoint will result in a
It will generally pick a different
TLS CN and SANs
As per the TLS advice for SRV discovery - the SRV domain, fqdn and IP address. Ditto for
Full startup logs
The code tracing done by @gnuoy is illuminating in that the actual sequence of events is:
So I can only conclude that using the same authority for all subconnections in the balancer is fundamentally flawed when used with TLS. Under the current implementation there are 3 possible solutions:
/subscribe
We're experiencing exactly the same issue as @gnuoy: one certificate per server IP address, and the etcdv3 client fails if the first endpoint is unavailable.
@gyuho I am also running into the same issue. From the comments above it looks like you have been working on a patch. Any update on it?
@gyuho - It's difficult to decipher if this has been fixed or if it's still pending now as there are about a dozen cross-referenced issues and PRs.
I looked at the status of this last week. I believe this is still pending on #10489. I could be wrong though.
Any updates on this? We're still running into this with Kubernetes v1.14.1 and etcd 3.3.12
We should really get an update on this for k8s - v1.15
Just discussed with gRPC team, and got some good feedback.
Any updates on this? We're still running into this with Kubernetes v1.15.0 and etcd 3.3.13
FWIW, we work around this problem by placing TCP reverse proxies A possible better workaround would be to place TLS-terminating
@bahar-p @Protopopys ... I was just wondering the same thing. See my question at #10911 (comment)
@gyuho I tried this code on my staging environment and I still face the same issue:
And from the same machine I can check the health of those etcd nodes just fine:
So I'm pretty confused here. I double checked to make sure I was working off a branch with your changes and I'm...
(sorry for commenting on a closed ticket)
fixed it with that: moondev/kubernetes@45f6cb2#diff-c9fae1df26aedd520ef93a008d255581R136-R148
Contains an important fix in clientv3 that allows vault to successfully fail over to another etcdv3 endpoint in the event that the current active connection becomes unavailable.
See also:
* etcd-io/etcd#9949
* etcd-io/etcd#10911
* https://github.com/etcd-io/etcd/blob/master/CHANGELOG-3.3.md#v3314-2019-08-16
Fixes hashicorp#4961
Describe the bug
The etcd go client fails if multiple https endpoints are specified when the client is initialised and the first etcd endpoint is unavailable.
To Reproduce
I set up an etcd cluster with http (port 2378) and https (port 2379) listeners, then used the etcd go client library to query the cluster. Then I took down the first unit listed when the client was established (in my case the unit at 10.53.82.119). The http go client continues to work but the https one fails.
https client:
Fails with:
2018-07-21 11:34:05.728613 I | context deadline exceeded
http client: