All proxies leave the etcd2 cluster upon quorum loss ?? (4 nodes, version 2.0.12) #3580

Closed
kayrus opened this issue Sep 23, 2015 · 8 comments


@kayrus
Contributor

kayrus commented Sep 23, 2015

@domq
I've copied from google groups

Hello fellow CoreOS users, we[1] are busy setting up a CoreOS cluster on bare metal (60 nodes or so) for a university in Switzerland.

We ran into a somewhat worrying failure scenario with etcd2 version 2.0.12, and a cluster of 4 nodes. An excerpt of the logs of node #6 (which was configured as a proxy) is at https://gist.github.com/domq/e23e08fab098d915f88f.

From what I can figure out, the following happened:

  1. around 2015/09/11 14:11:09, node #7 (member of a quorum of four) goes off-line for a planned reinstall. As long as node #7 stays off-line, things work as expected, i.e. the proxy spews more errors to the logs but is otherwise happy.
  2. around 15:32:33 (line 30 in the gist), node #7 comes back online but its configuration is somehow faulty; the surviving quorum members go on strike (as can be confirmed from their own log stream)
  3. immediately after that, node #6 starts to notice that the cluster is down ("zero endpoints currently available"), so far so good...
  4. ... but at 15:33:37, node #6 in turn goes on strike (line 233 in the gist, https://gist.github.com/domq/e23e08fab098d915f88f#file-journalctl-log-L233) and decides that it better proxy to itself from now on!

When we restored the quorum today[2], node #6 stayed in that same state. Restarting it after purging /var/lib/etcd2 kicked it back into shape, I think. All other proxy nodes met the same fate, i.e. fleetctl only showed them back in the cluster after we ran "systemctl stop etcd2; rm -rf /var/lib/etcd2/proxy; systemctl start etcd2" on them.
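
For completeness, the per-node reset we applied looks roughly like this (assuming etcd2 runs with its default data directory /var/lib/etcd2):

```sh
# Stop the local proxy, drop its cached cluster state, and start it again.
# /var/lib/etcd2 is assumed to be the default etcd2 data directory;
# only the proxy/ subdirectory (the cached member list) is removed.
sudo systemctl stop etcd2.service
sudo rm -rf /var/lib/etcd2/proxy
sudo systemctl start etcd2.service
```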

Could someone please explain whether event 4. above is a bug?

@yichengq @xiang90 probably relates to #3215

@yichengq
Contributor

From the line you pointed me to (`Sep 11 15:33:37 c06.ne.cloud.epfl.ch etcd2[3406]: 2015/09/11 15:33:37 proxy: updated peer urls in cluster file from [http://192.168.11.1:2380 http://192.168.11.3:2380 http://192.168.11.7:2380 http://192.1`), I cannot see that it tries to proxy to itself.

Moreover, could you check the member list in your etcd cluster?
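
Something like the following should show it (the 192.168.11.1 endpoint and the default client port 2379 below are assumptions; substitute one of your healthy members):

```sh
# Ask a running member for the current membership, via etcdctl...
etcdctl --peers http://192.168.11.1:2379 member list

# ...or via the members API, which is also what proxies refresh from.
curl -s http://192.168.11.1:2379/v2/members
```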

@xiang90
Contributor

xiang90 commented Oct 3, 2015

@kayrus Kindly ping

@domq

domq commented Oct 4, 2015

@yichengq, interesting, it looks like the gist cut off the "to" part of that line, which read something like [http://127.0.0.1:2380]
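
In case it helps, the cached peer URLs can presumably be inspected on a stuck proxy before wiping it; the directory below is the one we removed to recover, and the exact file name is an assumption inferred from the "updated peer urls in cluster file" log message:

```sh
# Look at what the proxy has cached under its data directory.
# /var/lib/etcd2/proxy is the directory removed during recovery;
# the "cluster" file name is an assumption based on the
# "proxy: updated peer urls in cluster file" log line.
ls /var/lib/etcd2/proxy
cat /var/lib/etcd2/proxy/cluster
```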

@domq

domq commented Oct 4, 2015

The bug described here happened in CoreOS 723.3.
We recently upgraded to CoreOS 766.3 and haven't witnessed the mass leave event again. (Every now and then a proxy fails to join the cluster upon reinstall, but this is easily cured with systemctl stop etcd2.service; rm -rf /var/lib/etcd2/proxy; systemctl start etcd2.service)

@yichengq
Contributor

yichengq commented Oct 5, 2015

The story here is that a proxy always refreshes its target endpoints from the advertised client URLs listed at /v2/members on one random etcd member. Since the faulty server you started may have advertised [http://127.0.0.1:2380] as its client URLs, it could have misled the proxies into maintaining a wrong endpoint list.
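
For illustration, this is roughly what a poisoned member list could look like (the member names, IDs, and addresses below are made up; only the /v2/members path and the general response shape are real):

```sh
# Fetch the member list a proxy would refresh from. If one member advertises
# loopback URLs (as in the second entry below), any proxy that syncs against
# this list can end up pointing at 127.0.0.1 instead of the real cluster.
curl -s http://192.168.11.1:2379/v2/members
# {"members":[
#   {"id":"...","name":"c01",
#    "peerURLs":["http://192.168.11.1:2380"],
#    "clientURLs":["http://192.168.11.1:2379"]},
#   {"id":"...","name":"c07",
#    "peerURLs":["http://127.0.0.1:2380"],
#    "clientURLs":["http://127.0.0.1:2379"]},
#   ...
# ]}
```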

In conclusion, I think this was caused by the misoperation.

@kayrus
Contributor Author

kayrus commented Oct 5, 2015

@xiang90 asked customer for detailed logs https://groups.google.com/forum/#!topic/coreos-user/OuqvJIRAtho

@yichengq
Contributor

@domq Any update?

@xiang90
Contributor

xiang90 commented Nov 6, 2015

I am closing this due to low activity. I think @yichengq has an answer.

xiang90 closed this as completed Nov 6, 2015