
0.7.3->0.8.4 upgrade leads to protocol version (2) is incompatible: [1, 0] #3217

Closed
kamaradclimber opened this issue Jul 3, 2017 · 24 comments
Labels
needs-investigation The issue described is detailed and complex. theme/internal-cleanup Used to identify tech debt, testing improvements, code refactoring, and non-impactful optimization type/bug Feature does not function as expected
Milestone

Comments

@kamaradclimber
Contributor

kamaradclimber commented Jul 3, 2017

consul version for both Client and Server

Client:

  • most use Consul v0.7.3-criteo1-criteo1 (f3d518bc+CHANGES) Protocol 2 spoken by default, understands 2 to 3 (agent will automatically use protocol >2 when speaking to compatible agents)
  • some are already updated to 0.8.4 (same as the servers)
  • a few have 0.8.5 (an experiment done by one team)

Server: Consul v0.8.4 Protocol 2 spoken by default, understands 2 to 3 (agent will automatically use protocol >2 when speaking to compatible agents)

The custom criteo version is based on f3d518b and only contains the patches from #2474 and #2657.

consul info for both Client and Server

Client:

agent:
	check_monitors = 0
	check_ttls = 0
	checks = 2
	services = 3
build:
	prerelease = criteo1
	revision = 'f3d518b
	version = 0.7.3
consul:
	known_servers = 0
	server = false
runtime:
	arch = amd64
	cpu_count = 24
	goroutines = 30
	max_procs = 2
	os = windows
	version = go1.7.4
serf_lan:
	encrypted = false
	event_queue = 0
	event_time = 1
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 1
	members = 1
	query_queue = 0
	query_time = 1

Server:

agent:
	check_monitors = 0
	check_ttls = 0
	checks = 1
	services = 3
build:
	prerelease = 
	revision = f436077
	version = 0.8.4
consul:
	bootstrap = false
	known_datacenters = 9
	leader = false
	leader_addr = 10.71.4.141:8300
	server = true
raft:
	applied_index = 213985112
	commit_index = 213985112
	fsm_pending = 0
	last_contact = 12.873464ms
	last_log_index = 213985113
	last_log_term = 5511
	last_snapshot_index = 213979994
	last_snapshot_term = 5511
	latest_configuration = [{Suffrage:Voter ID:10.71.4.143:8300 Address:10.71.4.143:8300} {Suffrage:Voter ID:10.71.4.142:8300 Address:10.71.4.142:8300} {Suffrage:Voter ID:10.71.4.141:8300 Address:10.71.4.141:8300}]
	latest_configuration_index = 129059017
	num_peers = 2
	protocol_version = 2
	protocol_version_max = 3
	protocol_version_min = 0
	snapshot_version_max = 1
	snapshot_version_min = 0
	state = Follower
	term = 5511
runtime:
	arch = amd64
	cpu_count = 24
	goroutines = 8348
	max_procs = 23
	os = linux
	version = go1.8.3
serf_lan:
	coordinate_resets = 0
	encrypted = false
	event_queue = 0
	event_time = 1872
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 10
	member_time = 1095554
	members = 1872
	query_queue = 0
	query_time = 370
serf_wan:
	coordinate_resets = 0
	encrypted = false
	event_queue = 0
	event_time = 1
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 66250
	members = 27
	query_queue = 0
	query_time = 1

Operating system and Environment details

50% of the servers run Linux (mostly CentOS 7), 50% run Windows Server 2012 R2.

Description of the Issue (and unexpected/desired result)

Consul agents already present in the serf cluster upgraded without issues.
New Consul agents added during the upgrade cannot join the cluster.
Here are the logs from an agent (on Windows):

2017/07/03 14:35:46 [WARN] manager: No servers available
2017/07/03 14:35:46 [ERR] agent: failed to sync remote state: No known Consul servers
2017/07/03 14:35:50 [INFO] agent: (LAN) joining: [consul02-ty5.central.criteo.prod consul01-ty5.central.criteo.prod consul03-ty5.central.criteo.prod]
2017/07/03 14:35:50 [DEBUG] memberlist: TCP-first lookup failed for 'consul02-ty5.central.criteo.prod:8301', falling back to UDP: open /etc/resolv.conf: The system cannot find the path specified.
2017/07/03 14:35:50 [DEBUG] memberlist: Initiating push/pull sync with: 10.71.4.141:8301
2017/07/03 14:35:50 [DEBUG] memberlist: Failed to join 10.71.4.141: Node 'web-fbx019-ty5.ty5.ad.criteo.prod' protocol version (2) is incompatible: [1, 0]
2017/07/03 14:35:50 [DEBUG] memberlist: TCP-first lookup failed for 'consul01-ty5.central.criteo.prod:8301', falling back to UDP: open /etc/resolv.conf: The system cannot find the path specified.
2017/07/03 14:35:50 [DEBUG] memberlist: Initiating push/pull sync with: 10.71.4.142:8301
2017/07/03 14:35:50 [DEBUG] memberlist: Failed to join 10.71.4.142: Node 'hostw145-ty5.ty5.ad.criteo.prod' protocol version (2) is incompatible: [1, 0]
2017/07/03 14:35:50 [DEBUG] memberlist: TCP-first lookup failed for 'consul03-ty5.central.criteo.prod:8301', falling back to UDP: open /etc/resolv.conf: The system cannot find the path specified.
2017/07/03 14:35:50 [DEBUG] memberlist: Initiating push/pull sync with: 10.71.4.143:8301
2017/07/03 14:35:50 [DEBUG] memberlist: Failed to join 10.71.4.143: Node 'couchs01e23-ty5.storage.criteo.prod' protocol version (2) is incompatible: [1, 0]
2017/07/03 14:35:50 [INFO] agent: (LAN) joined: 0 Err: 3 error(s) occurred:

* Failed to join 10.71.4.141: Node 'web-fbx019-ty5.ty5.ad.criteo.prod' protocol version (2) is incompatible: [1, 0]
* Failed to join 10.71.4.142: Node 'hostw145-ty5.ty5.ad.criteo.prod' protocol version (2) is incompatible: [1, 0]
* Failed to join 10.71.4.143: Node 'couchs01e23-ty5.storage.criteo.prod' protocol version (2) is incompatible: [1, 0]
2017/07/03 14:35:50 [WARN] agent: Join failed: <nil>, retrying in 30s
2017/07/03 14:35:50 [WARN] manager: No servers available

This seems weird for several reasons:

  • the "Failed to join xxx" messages mention the Consul servers' IP addresses but the FQDNs of unrelated nodes
  • all agents I've tried claim to support protocols 2-3 (I'm not absolutely sure whether this is the serf protocol version or the raft one, since it's not mentioned anywhere and both protocols have similar version numbers at the moment)
  • nodes should be able to join during an upgrade (which will last ~1 week)
@slackpad
Contributor

slackpad commented Jul 3, 2017

Hi @kamaradclimber, these errors are related to the Serf protocol:

* Failed to join 10.71.4.141: Node 'web-fbx019-ty5.ty5.ad.criteo.prod' protocol version (2) is incompatible: [1, 0]
* Failed to join 10.71.4.142: Node 'hostw145-ty5.ty5.ad.criteo.prod' protocol version (2) is incompatible: [1, 0]
* Failed to join 10.71.4.143: Node 'couchs01e23-ty5.storage.criteo.prod' protocol version (2) is incompatible: [1, 0]

Consul 0.7 dropped support for protocol version 1, which hasn't been used since Consul 0.3. What version of Consul is running on web-fbx019-ty5.ty5, hostw145-ty5.ty5, and couchs01e23-ty5?

@kamaradclimber
Contributor Author

kamaradclimber commented Jul 3, 2017

According to consul members, those nodes are running 0.7.3-criteo1:

couchs01e23-ty5.storage.criteo.prod             10.211.136.240:8301  alive   client  0.7.3criteo1  2         ty5
hostw145-ty5.ty5.ad.criteo.prod                 10.211.161.238:8301  alive   client  0.7.3criteo1  2         ty5
web-fbx019-ty5.ty5.ad.criteo.prod               10.211.165.91:8301   alive   client  0.7.3criteo1  2         ty5

Here is a summary of the versions in that Consul cluster:

   1692 0.7.3criteo1
     70 0.8.4
    110 0.8.5

@kamaradclimber
Contributor Author

We finally solved the issue by restarting a Consul server that looked weird.

We spotted this server because many agents in the serf cluster did not see it as a member.

For instance, some agents saw only two servers:

consul members|grep server
consul01-ty5.central.criteo.prod                10.71.4.142:8301     alive    server  0.8.4         2         ty5
consul02-ty5.central.criteo.prod                10.71.4.141:8301     alive    server  0.8.4         2         ty5

while others (including the faulty server, consul03-ty5.central.criteo.prod) saw all three:

consul01-ty5.central.criteo.prod                10.71.4.142:8301     alive    server  0.8.4         2         ty5
consul02-ty5.central.criteo.prod                10.71.4.141:8301     alive    server  0.8.4         2         ty5
consul03-ty5.central.criteo.prod                10.71.4.143:8301     alive    server  0.8.4         2         ty5

kamaradclimber added a commit to criteo-forks/consul that referenced this issue Jul 5, 2017
Just display information about how min/max protocols are computed

Change-Id: I91d264ac90c7f37cbbb006a0efdcf012bdfe8b37
@kamaradclimber
Contributor Author

Using code from criteo-forks@4e58f83 during the incident, we could see:

Node 'mems12e06-ty5.storage.criteo.prod' put constraint on maxpmin to 1
Node 'mems12e06-ty5.storage.criteo.prod' put constraint on minpmax to 5
Node 'mems12e06-ty5.storage.criteo.prod' put constraint on maxdmin to 2
Node 'mems12e06-ty5.storage.criteo.prod' put constraint on mindmax to 4
Node 'mesos-slave035-ty5.central.criteo.prod' put constraint on minpmax to 0
Node 'mesos-slave035-ty5.central.criteo.prod' put constraint on mindmax to 0
End of iteration on remote nodes
Node 'web-rtb032-ty5.ty5.ad.criteo.prod' put constraint on maxpmin to 1
Node 'web-rtb032-ty5.ty5.ad.criteo.prod' put constraint on minpmax to 5
Node 'web-rtb032-ty5.ty5.ad.criteo.prod' put constraint on maxdmin to 2
Node 'web-rtb032-ty5.ty5.ad.criteo.prod' put constraint on mindmax to 4
Node 'mesos-slave035-ty5.central.criteo.prod' put constraint on minpmax to 0
Node 'mesos-slave035-ty5.central.criteo.prod' put constraint on mindmax to 0
End of iteration on remote nodes
Node 'web-fbx024-ty5.ty5.ad.criteo.prod' put constraint on maxpmin to 1
Node 'web-fbx024-ty5.ty5.ad.criteo.prod' put constraint on minpmax to 5
Node 'web-fbx024-ty5.ty5.ad.criteo.prod' put constraint on maxdmin to 2
Node 'web-fbx024-ty5.ty5.ad.criteo.prod' put constraint on mindmax to 4
Node 'mesos-slave035-ty5.central.criteo.prod' put constraint on minpmax to 0
Node 'mesos-slave035-ty5.central.criteo.prod' put constraint on mindmax to 0
End of iteration on remote nodes
    2017/07/04 18:59:22 [INFO] agent: (LAN) joined: 0 Err: 3 error(s) occurred:

* Failed to join 10.71.4.141: Node 'mems12e06-ty5.storage.criteo.prod' protocol version (2) is incompatible: [1, 0]
* Failed to join 10.71.4.142: Node 'web-rtb032-ty5.ty5.ad.criteo.prod' protocol version (2) is incompatible: [1, 0]
* Failed to join 10.71.4.143: Node 'web-fbx024-ty5.ty5.ad.criteo.prod' protocol version (2) is incompatible: [1, 0]
    2017/07/04 18:59:22 [WARN] agent: Join failed: <nil>, retrying in 30s

I think the weird part can be seen when minpmax is set to 0.
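
For reference, the maxpmin/minpmax (and maxdmin/mindmax) values above are the bounds of the protocol range computed across every entry in the local member list. Here is a minimal sketch of that kind of intersection (illustrative only, not the actual memberlist code; the node names are just borrowed from the output above), showing how a single member advertising all-zero versions collapses minpmax to 0:

```go
package main

import "fmt"

// member holds only the protocol range a member-list entry advertises
// (illustrative struct, not the real memberlist.Node type). The delegate
// range (maxdmin/mindmax above) would be intersected the same way.
type member struct {
	name       string
	pMin, pMax uint8
}

// acceptableRange intersects the ranges of all members, in the spirit of
// the constraint logging shown above.
func acceptableRange(members []member) (maxpmin, minpmax uint8) {
	maxpmin, minpmax = 0, 255
	for _, m := range members {
		if m.pMin > maxpmin {
			maxpmin = m.pMin // this node raises the lower bound
		}
		if m.pMax < minpmax {
			minpmax = m.pMax // this node lowers the upper bound
		}
	}
	return maxpmin, minpmax
}

func main() {
	healthy := []member{
		{name: "web-rtb032", pMin: 1, pMax: 5},
		{name: "mems12e06", pMin: 1, pMax: 5},
	}
	lo, hi := acceptableRange(healthy)
	fmt.Printf("healthy: acceptable protocols [%d, %d]\n", lo, hi) // [1, 5]

	poisoned := append(healthy, member{name: "mesos-slave035", pMin: 0, pMax: 0})
	lo, hi = acceptableRange(poisoned)
	// One all-zero entry yields [1, 0], an empty range that rejects every
	// joiner, which lines up with the "[1, 0]" in the errors above.
	fmt.Printf("poisoned: acceptable protocols [%d, %d]\n", lo, hi)
}
```

With an empty range like that, there is no protocol version a joining node could advertise that would pass the check.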

@kamaradclimber
Contributor Author

Anyway, the incident is closed on our side. I'd be happy to contribute to a discussion about how to detect such scenarios better.

@slackpad
Contributor

Thanks for the follow-up note @kamaradclimber. I'm not sure how things got into that state, though, and I've never seen anything like this before. It's odd that the server was related to the issue but the weird version in the log is associated with mesos-slave035-ty5.central.criteo.prod. Am I missing something about mesos-slave035-ty5.central.criteo.prod?

@kamusin

kamusin commented Nov 27, 2017

We are facing a similar issue in our environment which we haven't been able to fix so far. At the moment we have 462 nodes running Consul: 375 are on 0.9.x, 61 on 0.8.x, and 10 on 0.7.x. The Consul servers are running 0.9.x.

Some of the messages we have seen so far (they look very similar to each other and may have the same root cause) are:

Error1

2017/11/27 17:33:54 [ERR] memberlist: Push/Pull with i-asdasdasd failed: Node 'i-asdasdsda' protocol version (2) is incompatible: [1, 0]

Error2

 2017/11/27 17:33:42 [ERR] memberlist: Failed push/pull merge: Node 'i-asdasdsadd' protocol version (2) is incompatible: [1, 0] from=10.x.x.x:60109

We have been wondering where those [1, 0] values come from; as far as we know, the default protocol version spoken within the cluster is version 2, which is also set as the minimum version in the source code.

Feel free to reach me if you guys need more information :)

@slackpad slackpad added this to the 1.0.2 milestone Nov 27, 2017
@slackpad slackpad modified the milestones: 1.0.2, 1.0.3 Dec 13, 2017
@kamusin

kamusin commented Jan 3, 2018

Thanks @kamaradclimber, with the patch you provided we were able to identify the node that was causing the issue. Pretty much the output we got was:

./consul agent -config-dir /etc/consul
==> Starting Consul agent...
==> Joining cluster...
Node 'i-000000000' put constraint on maxpmin to 1
Node 'i-000000000' put constraint on minpmax to 5
Node 'i-000000000' put constraint on maxdmin to 2
Node 'i-000000000' put constraint on mindmax to 5
Node 'i-555555555' put constraint on minpmax to 4
Node 'i-555555555' put constraint on mindmax to 4
End of iteration on remote nodes
==> 1 error(s) occurred:

* Failed to join 10.51.229.135: Node 'i-abcdfe' protocol version (0) is incompatible: [1, 4]

After forcing the removal of the node from the cluster and wiping out its data directory, we were able to re-join the failing nodes.

@jarro2783

I work with @kamusin; we saw this again yesterday and used the patch above to debug. There was a node that allowed inbound traffic on 8301 but not outbound, so it somehow became part of the cluster at one point but subsequently couldn't initiate connections to anything. Eventually it seemed like the server thought its maximum supported protocol version was 0, and nothing new could join the cluster.

We resolved it by stopping consul on the node and then doing a force-leave.

@ianwestcott

I just experienced this same bug on a Consul cluster with 6 servers and 800+ agents, all running Consul v0.9.3. This was during normal operation of the cluster, not an upgrade as described in the original post. But as in earlier comments, new nodes could not join the cluster, and existing nodes were logging the following error at high frequency:

[ERR] memberlist: Failed push/pull merge: Node 'i-xxxxxxxxxxxxxxx' protocol version (2) is incompatible: [1, 0]

Oddly, each instance of the above error referenced a random node, seemingly indicating that a large number of nodes were faulty, which was not actually the case.

Using @kamaradclimber's patch, I found what turned out to be a single culprit, despite the numerous errors. After forcing the bad node out of the cluster and wiping its data directory, it rejoined fine. Other nodes were then able to join without issue as well, and the protocol version error messages ceased.

Another difference from the original post is that in my case the culprit was an agent, not a server.

@slackpad slackpad added type/bug Feature does not function as expected theme/internal-cleanup Used to identify tech debt, testing improvements, code refactoring, and non-impactful optimization labels Jan 4, 2018
@slackpad
Contributor

slackpad commented Jan 4, 2018

Tagging this as a bug (which will likely end up being fixed in memberlist or Serf). It looks like there's some case where we can poison the version-checking algorithm with zero-valued node entries in the member list.

@slackpad slackpad modified the milestones: 1.0.3, Next Jan 13, 2018
@mauriziod

This is also happening on version 0.9.3, using the same version on the servers and clients.

@pierresouchay
Contributor

Same error while migrating from Consul 1.0.6 with patches to 1.1.0 with patches (none of the patches are related to the serf protocol).

@kamaradclimber
Contributor Author

Would someone from HashiCorp have any idea regarding the cause of this issue?

@mkeeler
Member

mkeeler commented Jun 29, 2018

@kamaradclimber I just started to dive into the memberlist code yesterday afternoon and am going to continue some more today. Unfortunately I don't have much to report yet.

Am I correct in thinking this only happens during upgrades? If so, how do you go about performing them? (Just kill Consul and restart with a newer version?)

@pierresouchay
Contributor

@mkeeler Yes, it seems it happens only when upgrading clients.

It probably means that auto-negotiation does not work well.

Basically we:

  • upgraded the Consul servers to 1.1.0 a few weeks ago
  • upgraded some of the Consul agents to 1.1.0 a few weeks ago (a few hours after the servers)

At that point, everything that had already been migrated was upgraded without issues.

Yesterday we upgraded some of the remaining clients still on 1.0.6, but during the upgrade some of the agents could not join the cluster, with this message.

Our cluster contains ONLY 1.0.6 and 1.1.0 clients (and only 1.1.0 servers).

After several hours of investigation, it seems the only way out is to restart all Consul servers sequentially; after that, the new agents can join the cluster properly.

@mkeeler
Member

mkeeler commented Jun 29, 2018

@pierresouchay Are you using gossip encryption? It shouldn't matter; I'm just trying to rule things out.

@pierresouchay
Contributor

Yes.

Note that in that case, some of the failing nodes see each other (but no server in their member list).

@pearkes
Contributor

pearkes commented Oct 26, 2018

Related issue: #4342

@banks banks added the needs-investigation The issue described is detailed and complex. label Nov 16, 2018
@ianwestcott

Update: I experienced this issue again recently on the same Consul cluster as previously reported, which has been running Consul v1.0.2 for a while now. As with the last time this occurred for us, it was not during a Consul upgrade. However, we've noticed that the problem correlates to operations wherein a large number of agents are restarted, meaning those agents have to leave the cluster and re-join it. I gather the protocol negotiation is a normal part of restarting a cluster, and that if a corrupt node somewhere in the cluster is reporting a bad protocol version, that negotiation cannot succeed. The result in our case is that agents cannot rejoin the cluster after being restarted.

Again, the only method I'm aware of for finding and removing the corrupt node from the cluster is to run a patched version of Consul with criteo-forks@4e58f83 included, and use the output from its own failed negotiation attempt to determine the bad node. Without that patch, the logs are not clear on which node is the corrupt one. Perhaps a good place to start in determining the root cause of this bug may be to add more precise logging to the memberlist and/or Serf code that can more clearly identify the culprit?
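
In the meantime, a rough external check can help: the agent HTTP API exposes each member's advertised protocol range, so a small script using the official Go client (github.com/hashicorp/consul/api) can flag entries that advertise a zero range. Whether the corrupted entry is actually visible this way is an assumption on my part, so treat this as a sketch:

```go
package main

import (
	"fmt"
	"log"

	"github.com/hashicorp/consul/api"
)

func main() {
	// Talks to the local agent; CONSUL_HTTP_ADDR is honored by the default config.
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// LAN members as seen by this agent (pass true for the WAN pool).
	members, err := client.Agent().Members(false)
	if err != nil {
		log.Fatal(err)
	}

	for _, m := range members {
		// Heuristic: a member advertising a maximum protocol version of 0
		// would drag the cluster-wide acceptable range down to nothing.
		if m.ProtocolMax == 0 || m.DelegateMax == 0 {
			fmt.Printf("suspicious member %s (%s): protocol [%d, %d], delegate [%d, %d]\n",
				m.Name, m.Addr, m.ProtocolMin, m.ProtocolMax, m.DelegateMin, m.DelegateMax)
		}
	}
}
```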

One more note about this most recent occurrence: Judging by the cluster logs, the corrupt node went bad on a Friday afternoon (coincident to an operation that restarted a large number of agents), but the problem wasn't discovered until a subsequent restart-triggering operation was performed the following Monday. I noticed that it took a long time for the cluster to recover after the corrupt node was forcibly removed – around 15 minutes, which is much longer than it took in previous occurrences. I don't know enough about Consul's internals to speculate as to what that could mean, but it seemed worth mentioning.

@pierresouchay
Contributor

@ianwestcott Yes, we can confirm this exact behavior.

We often see this when restarting agents en masse. Sometimes a simple restart is enough, but sometimes the only reliable way to fix it is to restart all Consul servers sequentially.

@banks banks added needs-discussion Topic needs discussion with the larger Consul maintainers before committing to for a release and removed needs-discussion Topic needs discussion with the larger Consul maintainers before committing to for a release labels Nov 26, 2018
@pierresouchay
Contributor

Here is more context from a new incident we had:

Jan 21 08:36:44 consul03-ty5 consul[444]: 2019/01/21 08:36:44 [ERR] consul.rpc: multiplex conn accept failed: keepalive timeout from=10.232.22.21:38828
Jan 21 08:36:51 consul03-ty5 consul[444]: 2019/01/21 08:36:51 [INFO] serf: EventMemberUpdate: mesos-slave068-ty5.central.criteo.prod
Jan 21 08:36:57 consul03-ty5 consul[444]: 2019/01/21 08:36:57 [INFO] serf: EventMemberUpdate: mesos-slave048-ty5.central.criteo.prod
Jan 21 08:36:57 consul03-ty5 consul[444]: 2019/01/21 08:36:57 [INFO] serf: EventMemberUpdate: mesos-slave026-ty5.central.criteo.prod
Jan 21 08:36:58 consul03-ty5 consul[444]: 2019/01/21 08:36:58 [ERR] memberlist: Failed push/pull merge: Node 'mesos-slave026-ty5.central.criteo.prod' protocol version (0) is incompatible: [1, 5] from=10.211.13
Jan 21 08:37:15 consul03-ty5 consul[444]: 2019/01/21 08:37:15 [ERR] memberlist: Push/Pull with d8-c4-97-b5-1d-e8.inventory.criteo.prod failed: Node 'web-cat039-ty5.ty5.ad.criteo.prod' protocol version (2) is i
Jan 21 08:37:20 consul03-ty5 consul[444]: 2019/01/21 08:37:20 [ERR] memberlist: Failed push/pull merge: Node 'mems19e08-ty5.storage.criteo.prod' protocol version (2) is incompatible: [1, 0] from=10.211.130.2:3
Jan 21 08:37:24 consul03-ty5 consul[444]: 2019/01/21 08:37:24 [INFO] serf: EventMemberUpdate: mesos-slave110-ty5.central.criteo.prod
Jan 21 08:37:30 consul03-ty5 consul[444]: 2019/01/21 08:37:30 [INFO] serf: EventMemberUpdate: mesos-slave037-ty5.central.criteo.prod
Jan 21 08:37:58 consul03-ty5 consul[444]: 2019/01/21 08:37:58 [ERR] memberlist: Failed push/pull merge: Node 'web-rtb336-ty5.ty5.ad.criteo.prod' protocol version (2) is incompatible: [1, 0] from=10.211.130.2:3
Jan 21 08:38:48 consul03-ty5 consul[444]: 2019/01/21 08:38:48 [ERR] memberlist: Failed push/pull merge: Node 'hostw842-ty5.ty5.ad.criteo.prod' protocol version (2) is incompatible: [1, 0] from=10.232.14.26:421
Jan 21 08:38:51 consul03-ty5 consul[444]: 2019/01/21 08:38:51 [INFO] serf: EventMemberFailed: d8-c4-97-71-a3-99.inventory.criteo.prod 10.232.64.28
Jan 21 08:38:51 consul03-ty5 consul[444]: 2019/01/21 08:38:51 [INFO] serf: EventMemberJoin: d8-c4-97-71-a3-99.inventory.criteo.prod 10.232.64.28
Jan 21 08:38:51 consul03-ty5 consul[444]: 2019/01/21 08:38:51 [ERR] memberlist: Failed push/pull merge: Node 'mesos-slave049-ty5.central.criteo.prod' protocol version (2) is incompatible: [1, 0] from=10.211.13

We can clearly see that before this line:

Jan 21 08:36:58 consul03-ty5 consul[444]: 2019/01/21 08:36:58 [ERR] memberlist: Failed push/pull merge: Node 'mesos-slave026-ty5.central.criteo.prod' protocol version (0) is incompatible: [1, 5] from=10.211.13

no problem was seen, but right after it we get:

Jan 21 08:37:20 consul03-ty5 consul[444]: 2019/01/21 08:37:20 [ERR] memberlist: Failed push/pull merge: Node 'mems19e08-ty5.storage.criteo.prod' protocol version (2) is incompatible: [1, 0] from=10.211.130.2:3

and this message keeps repeating...

In order to fix it, I had to restart all the servers sequentially.

@mkeeler What this means is that a single agent DID change the range of acceptable protocols, which then changed the acceptable range of protocols on the server side. That seems like a good starting point for finding the root cause, what do you think? We are going to try to provide a PR if we find the error that causes this corruption.

pierresouchay added a commit to pierresouchay/memberlist that referenced this issue Jan 24, 2019
In Consul, nodes sometimes send pMin = pMax = 0 in Vsn.
This causes a corruption of the acceptable protocol versions,
thus requiring version = [0, 1].

After this corruption occurs, new nodes cannot join anymore,
which then forces a restart of all Consul servers to resume normal
operations.

While not fixing the root cause, this patch discards alive nodes
claiming version 0,0,0 and avoids this breakage.

See hashicorp/consul#3217
@pearkes pearkes modified the milestones: Upcoming, 1.4.3 Feb 4, 2019
mkeeler pushed a commit to hashicorp/memberlist that referenced this issue Feb 4, 2019
* Avoid taking into account wrong protocol versions in Vsn.

In Consul, nodes sometimes send pMin = pMax = 0 in Vsn.
This causes a corruption of the acceptable protocol versions,
thus requiring version = [0, 1].

After this corruption occurs, new nodes cannot join anymore,
which then forces a restart of all Consul servers to resume normal
operations.

While not fixing the root cause, this patch discards alive nodes
claiming version 0,0,0 and avoids this breakage.

See hashicorp/consul#3217

* Always set the Vsn when creating state, so a race condition cannot happen

* Do not move m.encodeBroadcastNotify(a.Node, aliveMsg, a, notify) since it is not needed

* Test the bare minimum for the size of Vsn

Co-Authored-By: pierresouchay <[email protected]>

* Fixed test TestMemberList_ProbeNode_Awareness_OldProtocol

* Avoid crashing when len(Vsn) is incorrect, and ignore the message when there is an Alive delegate
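
For anyone skimming the thread, the guard described in the commit message above boils down to refusing to apply an alive message whose version vector is all zeros. A simplified illustration (not the actual memberlist change; the Vsn layout noted in the comment is an assumption):

```go
package main

import "fmt"

// aliveMsg carries just the fields relevant here (hypothetical struct; the
// real memberlist alive message also carries address, incarnation, meta, ...).
type aliveMsg struct {
	Node string
	Vsn  []uint8 // assumed layout: [pMin, pMax, pCur, dMin, dMax, dCur]
}

// acceptAlive reports whether an alive message should be applied to the
// local member list.
func acceptAlive(a aliveMsg) bool {
	if len(a.Vsn) < 6 {
		// Too short to carry a full version vector: ignore rather than crash.
		return false
	}
	if a.Vsn[0] == 0 && a.Vsn[1] == 0 && a.Vsn[2] == 0 {
		// A node claiming protocol versions 0,0,0 is treated as corrupt and
		// discarded, so it can no longer collapse the cluster's range.
		return false
	}
	return true
}

func main() {
	good := aliveMsg{Node: "web-rtb032", Vsn: []uint8{1, 5, 2, 2, 4, 4}}
	bad := aliveMsg{Node: "mesos-slave035", Vsn: []uint8{0, 0, 0, 0, 0, 0}}
	fmt.Println(acceptAlive(good), acceptAlive(bad)) // true false
}
```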
@mkeeler
Member

mkeeler commented Feb 4, 2019

Now that the PR has been merged into memberlist, we need to re-vendor to pull in the changes.

pierresouchay added a commit to pierresouchay/consul that referenced this issue Feb 4, 2019
…ible: [1, 0]

This is fixed in hashicorp/memberlist#178, bump
memberlist to fix possible split brain in Consul.
@pierresouchay
Contributor

@mkeeler PR in Consul: #5313

pierresouchay added a commit to criteo-forks/consul that referenced this issue Feb 5, 2019
…ible: [1, 0]

This is fixed in hashicorp/memberlist#178, bump
memberlist to fix possible split brain in Consul.
@mkeeler mkeeler closed this as completed in bfcfcc0 Feb 5, 2019
ShimmerGlass pushed a commit to criteo-forks/consul that referenced this issue Feb 8, 2019
…ible: [1, 0]

This is fixed in hashicorp/memberlist#178, bump
memberlist to fix possible split brain in Consul.
LeoCavaille pushed a commit to DataDog/consul that referenced this issue Mar 30, 2019
…ible: [1, 0]

This is fixed in hashicorp/memberlist#178, bump
memberlist to fix possible split brain in Consul.