All proxies leave the etcd2 cluster upon quorum loss ?? (4 nodes, version 2.0.12) #3580

Closed
kayrus opened this issue Sep 23, 2015 · 8 comments


@kayrus
Contributor

kayrus commented Sep 23, 2015

@domq
I've copied from google groups

Hello fellow CoreOS users, we[1] are busy setting up a CoreOS cluster on bare metal (60 nodes or so) for a university in Switzerland.

We ran into a somewhat worrying failure scenario with etcd2 version 2.0.12, and a cluster of 4 nodes. An excerpt of the logs of node #6 (which was configured as a proxy) is at https://gist.github.com/domq/e23e08fab098d915f88f.

From what I can figure out, the following happened:

  1. around 2015/09/11 14:11:09, node #7 (member of a quorum of four) goes off-line for a planned reinstall. As long as node #7 stays off-line, things work as expected, i.e. the proxy spews more errors to the logs but is otherwise happy.
  2. around 15:32:33 (line 30 in the gist), node #7 comes back online but its configuration is somehow faulty; the surviving quorum members go on strike (as can be confirmed from their own log stream)
  3. immediately after that, node #6 starts to notice that the cluster is down ("zero endpoints currently available"), so far so good...
  4. ... but at 15:33:37, node #6 in turn goes on strike (line 233 in the gist, https://gist.github.com/domq/e23e08fab098d915f88f#file-journalctl-log-L233) and decides that it better proxy to itself from now on!

When we restored the quorum today[2], node #6 stayed in that same state. Restarting it after purging /var/lib/etcd2 kicked it back into shape, I think. All other proxy nodes met the same fate, i.e. fleetctl only showed them back in the cluster after we ran "systemctl stop etcd2; rm -rf /var/lib/etcd2/proxy; systemctl start etcd2" on them.
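
For completeness, the per-node reset we applied looks roughly like this (assuming etcd2 runs with its default data directory /var/lib/etcd2):

```sh
# Stop the local proxy, drop its cached cluster state, and start it again.
# /var/lib/etcd2 is assumed to be the default etcd2 data directory;
# only the proxy/ subdirectory (the cached member list) is removed.
sudo systemctl stop etcd2.service
sudo rm -rf /var/lib/etcd2/proxy
sudo systemctl start etcd2.service
```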

Could someone please explain whether event 4. above is a bug?

@yichengq @xiang90 probably relates to #3215

@yichengq
Contributor

From the line you pointed me to (`Sep 11 15:33:37 c06.ne.cloud.epfl.ch etcd2[3406]: 2015/09/11 15:33:37 proxy: updated peer urls in cluster file from [http://192.168.11.1:2380 http://192.168.11.3:2380 http://192.168.11.7:2380 http://192.1`), I cannot see that it tries to proxy to itself.

Moreover, could you check the member list in your etcd cluster?
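
Something like the following should show it (the 192.168.11.1 endpoint and the default client port 2379 below are assumptions; substitute one of your healthy members):

```sh
# Ask a running member for the current membership, via etcdctl...
etcdctl --peers http://192.168.11.1:2379 member list

# ...or via the members API, which is also what proxies refresh from.
curl -s http://192.168.11.1:2379/v2/members
```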

@xiang90
Contributor

xiang90 commented Oct 3, 2015

@kayrus Kindly ping

@domq

domq commented Oct 4, 2015

@yichengq, interesting, it looks like the gist cut off the "to" part of that line, which read something like [http://127.0.0.1:2380]
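
In case it helps, the cached peer URLs can presumably be inspected on a stuck proxy before wiping it; the directory below is the one we removed to recover, and the exact file name is an assumption inferred from the "updated peer urls in cluster file" log message:

```sh
# Look at what the proxy has cached under its data directory.
# /var/lib/etcd2/proxy is the directory removed during recovery;
# the "cluster" file name is an assumption based on the
# "proxy: updated peer urls in cluster file" log line.
ls /var/lib/etcd2/proxy
cat /var/lib/etcd2/proxy/cluster
```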

@domq

domq commented Oct 4, 2015

The bug described here happened in CoreOS 723.3.
We recently upgraded to CoreOS 766.3 and haven't witnessed the mass leave event again. (Every now and then a proxy fails to join the cluster upon reinstall, but this is easily cured with systemctl stop etcd2.service; rm -rf /var/lib/etcd2/proxy; systemctl start etcd2.service)

@yichengq
Contributor

yichengq commented Oct 5, 2015

The story here is that a proxy always refreshes its target endpoints from the advertised client URLs listed at /v2/members on one random etcd member. Since the faulty server you started may have advertised [http://127.0.0.1:2380] as its client URLs, it could have misled the proxies into maintaining a wrong endpoint list.
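
For illustration, this is roughly what a poisoned member list could look like (the member names, IDs, and addresses below are made up; only the /v2/members path and the general response shape are real):

```sh
# Fetch the member list a proxy would refresh from. If one member advertises
# loopback URLs (as in the second entry below), any proxy that syncs against
# this list can end up pointing at 127.0.0.1 instead of the real cluster.
curl -s http://192.168.11.1:2379/v2/members
# {"members":[
#   {"id":"...","name":"c01",
#    "peerURLs":["http://192.168.11.1:2380"],
#    "clientURLs":["http://192.168.11.1:2379"]},
#   {"id":"...","name":"c07",
#    "peerURLs":["http://127.0.0.1:2380"],
#    "clientURLs":["http://127.0.0.1:2379"]},
#   ...
# ]}
```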

In conclusion, I think this was caused by the misoperation.

@kayrus
Contributor Author

kayrus commented Oct 5, 2015

@xiang90 asked customer for detailed logs https://groups.google.com/forum/#!topic/coreos-user/OuqvJIRAtho

@yichengq
Contributor

@domq Any update?

@xiang90
Contributor

xiang90 commented Nov 6, 2015

I am closing this due to low activity. I think @yichengq has an answer.

xiang90 closed this as completed Nov 6, 2015