0.7.3->0.8.4 upgrade leads to protocol version (2) is incompatible: [1, 0] #3217
Comments
Hi @kamaradclimber these errors are related to the Serf protocol:
Consul 0.7 dropped support for protocol version 1, which hasn't been used since Consul 0.3. What version of Consul is running on web-fbx019-ty5.ty5, hostw145-ty5.ty5, and couchs01e23-ty5?
according to consul members, those nodes are running 0.7.3-criteo1:
here is a summary of versions in that consul cluster:
We finally solved the issue by restarting a consul server that looked suspicious. That server was identified because many agents in the serf cluster were not seeing it as part of the serf cluster. For instance, some agents saw:
while others (including the faulty server, consul03-ty5.central.criteo.prod) saw:
Just display information about how min/max protocols are computed.
Change-Id: I91d264ac90c7f37cbbb006a0efdcf012bdfe8b37
Using code from criteo-forks@4e58f83 during the incident, we could see:
I think the weird part can be seen when minpmax is set to 0.
Anyway, the incident is closed on our side. I'd be happy to contribute to a discussion on how to detect such a scenario better.
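For readers hitting this for the first time, here is a minimal, self-contained sketch of the mechanism being debugged above. It is not the actual memberlist/Serf negotiation code, and the type and field names are illustrative; it only shows how an acceptable protocol range derived from every member's advertised min/max can be collapsed to [1, 0] by a single entry claiming pMin = pMax = 0:

```go
package main

import "fmt"

// member holds the protocol range a node advertises when it joins.
// The names are illustrative, not the real memberlist types.
type member struct {
	name       string
	pMin, pMax uint8
}

// acceptableRange computes the range a joining node must satisfy:
// at least the highest pMin and at most the lowest pMax advertised
// by existing members.
func acceptableRange(members []member) (lo, hi uint8) {
	lo, hi = 0, 255
	for _, m := range members {
		if m.pMin > lo {
			lo = m.pMin
		}
		if m.pMax < hi {
			hi = m.pMax
		}
	}
	return lo, hi
}

func main() {
	healthy := []member{
		{"server-1", 1, 5},
		{"agent-1", 1, 5},
	}
	lo, hi := acceptableRange(healthy)
	fmt.Printf("healthy cluster accepts [%d, %d]\n", lo, hi) // [1, 5]

	// A single corrupt entry claiming pMin = pMax = 0 collapses the range
	// to the empty interval [1, 0]: a joining node speaking protocol 2 is
	// then rejected as "incompatible: [1, 0]".
	poisoned := append(healthy, member{"corrupt-node", 0, 0})
	lo, hi = acceptableRange(poisoned)
	fmt.Printf("poisoned cluster accepts [%d, %d]\n", lo, hi) // [1, 0]
}
```

Once such an entry is in the member list, every new joiner fails the check even though all the real nodes are healthy, which matches the behavior reported in this issue.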
Thanks for the follow-up note @kamaradclimber - not sure how things got into that state, though, and I've never seen something like this before. It's odd that the server was related to the issue but the weird version in the log is associated with
We are facing a similar issue in our environment which we haven't been able to fix so far. At the moment we have 462 nodes running Consul, and out of those 462, 375 are
Some of the messages we have seen so far look very similar to each other and may have the same root cause:
Error1
Error2
We have been wondering where those values [1, 0] come from; as far as we know, the default protocol version spoken within the cluster is version 2.
Feel free to reach out if you need more information :)
Thanks @kamaradclimber, with the patch provided we were able to identify a node that was causing the issue. Pretty much the output we got was:
After forcing the removal of the node from the cluster and wiping out its data directory, we were able to re-join the failing nodes.
I work with @kamusin; we saw this again yesterday and used the patch above to debug. There was a node that allowed inbound traffic on 8301, but not outbound, so somehow it became part of the cluster at one point, but subsequently it couldn't initiate connections to anything. Eventually it seemed like the server thought its maximum supported protocol version was 0, and nothing new could join the cluster. We resolved it by stopping consul on the node and then forcing it out of the cluster.
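As a side note, a quick way to check for that "inbound ok, outbound blocked" state from a suspect node is a plain TCP dial towards a known peer's Serf LAN port. This is only a sketch: the peer addresses below are hypothetical and 8301 assumes the default LAN gossip port:

```go
package main

import (
	"fmt"
	"net"
	"time"
)

func main() {
	// Hypothetical peers; replace with real cluster members.
	peers := []string{
		"consul-server-1.example.internal:8301",
		"consul-server-2.example.internal:8301",
	}
	for _, p := range peers {
		// If inbound traffic works but these dials fail, the node can be
		// gossiped about by others while being unable to gossip back,
		// which is the asymmetric state described above.
		conn, err := net.DialTimeout("tcp", p, 3*time.Second)
		if err != nil {
			fmt.Printf("outbound to %s FAILED: %v\n", p, err)
			continue
		}
		conn.Close()
		fmt.Printf("outbound to %s ok\n", p)
	}
}
```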
I just experienced this same bug on a Consul cluster with 6 servers and 800+ agents, all running Consul v0.9.3. This was during normal operation of the cluster, not an upgrade as described in the original post. But as in earlier comments, new nodes could not join the cluster, and existing nodes were logging the following error at high frequency:
Oddly, each instance of the above error referenced a random node, seemingly indicating that a large number of nodes were faulty, which was not actually the case. Using @kamaradclimber's patch, I found what turned out to be a single culprit, despite the numerous errors. After forcing the bad node out of the cluster and wiping its data directory, it rejoined fine. Other nodes were then able to join without issue as well, and the protocol version error messages ceased. Another difference from the original post is that in my case the culprit was an agent, not a server.
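Short of running a patched build, one place the advertised version ranges are already exposed is the agent members endpoint, so a small scan can sometimes surface a member announcing an all-zero range. This is only a sketch using the standard Go API client against a local agent; whether the corrupt announcement shows up there depends on what the queried agent has recorded, so the patched build above remains the reliable option:

```go
package main

import (
	"fmt"
	"log"

	"github.com/hashicorp/consul/api"
)

func main() {
	// Talks to the local agent (CONSUL_HTTP_ADDR or 127.0.0.1:8500 by default).
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	members, err := client.Agent().Members(false) // LAN gossip pool
	if err != nil {
		log.Fatal(err)
	}

	for _, m := range members {
		// No real node supports only version 0, so a zero max is a red flag
		// for the corruption discussed in this issue.
		if m.ProtocolMax == 0 || m.DelegateMax == 0 {
			fmt.Printf("suspicious member %s (%s): protocol [%d,%d] delegate [%d,%d]\n",
				m.Name, m.Addr, m.ProtocolMin, m.ProtocolMax, m.DelegateMin, m.DelegateMax)
		}
	}
}
```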
Tagging this as a bug (which will likely end up being fixed in memberlist or Serf). It looks like there's some case where we can poison the version-checking algorithm with zero-valued node entries in memberlist.
This is also happening on version 0.9.3, using the same version on servers and clients.
Same error while migrating from Consul 1.0.6 with patches to 1.1.0 with patches (but nothing related to the serf protocol).
Would someone from HashiCorp have any idea regarding the cause of this issue?
@kamaradclimber I just started to dive into the memberlist code yesterday afternoon and am going to continue some more today. Unfortunately I don't have much to report yet. Am I correct in thinking this only happens during upgrades? If so, how do you go about performing them? (just kill Consul and restart with a newer version?)
@mkeeler Yes, it seems it happens only when upgrading clients. It probably means that auto-negotiation does not work well. Basically we:
At this point, all the already-migrated servers had been upgraded. Yesterday, we upgraded some of the remaining clients still on 1.0.6, but during the upgrade some of the agents could not join the cluster, with this message. Our cluster contains ONLY 1.0.6 and 1.1.0 clients (and only 1.1.0 servers). After several hours of investigation, it seems the only way to recover is to restart all Consul servers sequentially; then the new agents can join the cluster properly.
@pierresouchay Are you doing Gossip Encryption? It shouldn't matter, just trying to rule things out.
Yes. Note that in that case, some of the nodes in error see each other (but no server in the list).
Related issue: #4342
Update: I experienced this issue again recently on the same Consul cluster as previously reported, which has been running Consul v1.0.2 for a while now. As with the last time this occurred for us, it was not during a Consul upgrade. However, we've noticed that the problem correlates with operations in which a large number of agents are restarted, meaning those agents have to leave the cluster and re-join it. I gather that protocol negotiation is a normal part of restarting a cluster, and that if a corrupt node somewhere in the cluster is reporting a bad protocol version, that negotiation cannot succeed. The result in our case is that agents cannot rejoin the cluster after being restarted.

Again, the only method I'm aware of for finding and removing the corrupt node from the cluster is to run a patched version of Consul with criteo-forks@4e58f83 included, and use the output from its own failed negotiation attempt to determine the bad node. Without that patch, the logs are not clear on which node is the corrupt one. Perhaps a good place to start in determining the root cause of this bug would be to add more precise logging to the memberlist and/or Serf code that can more clearly identify the culprit?

One more note about this most recent occurrence: judging by the cluster logs, the corrupt node went bad on a Friday afternoon (coinciding with an operation that restarted a large number of agents), but the problem wasn't discovered until a subsequent restart-triggering operation was performed the following Monday. I also noticed that it took a long time for the cluster to recover after the corrupt node was forcibly removed (around 15 minutes, which is much longer than it took in previous occurrences). I don't know enough about Consul's internals to speculate as to what that could mean, but it seemed worth mentioning.
@ianwestcott Yes, we confirm this exact behavior. We often see this when restarting many agents at once. Sometimes a simple restart is enough, but sometimes the only reliable way to fix it is to restart all Consul servers sequentially.
Here is more context in a new incident we had:
We can see clearly that before
No problem was seen, but the line right after does a:
and this message keeps repeating... To fix it, I had to restart all the servers sequentially. @mkeeler What this means is that a single agent DID change the range of acceptable protocols, and it then changed the acceptable range of protocols on the servers' side. Quite a good starting point for finding the root cause, what do you think? We will try to provide a PR if we find the error that causes this corruption.
On Consul, sometimes, nodes send pMin = pMax = 0 in Vsn. This causes a corruption of the acceptable protocol versions, thus requiring version = [0, 1]. After this corruption occurs, no new nodes can join anymore; it then forces the restart of all Consul servers to resume normal operations. While not fixing the root cause, this patch discards alive nodes claiming version 0,0,0 and will avoid this breakage. See hashicorp/consul#3217
* Avoid taking into account wrong protocol versions in Vsn. On Consul, sometimes, nodes send pMin = pMax = 0 in Vsn. This causes a corruption of the acceptable protocol versions, thus requiring version = [0, 1]. After this corruption occurs, no new nodes can join anymore; it then forces the restart of all Consul servers to resume normal operations. While not fixing the root cause, this patch discards alive nodes claiming version 0,0,0 and will avoid this breakage. See hashicorp/consul#3217
* Always set the Vsn when creating state, so the race condition cannot happen
* Do not move m.encodeBroadcastNotify(a.Node, aliveMsg, a, notify) since it is not needed
* Test the bare minimum for the size of Vsn
Co-Authored-By: pierresouchay <[email protected]>
* Fixed test TestMemberList_ProbeNode_Awareness_OldProtocol
* Avoid crashing when len(Vsn) is incorrect, and ignore the message when there is an Alive delegate
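For illustration, the guard described by the patch boils down to refusing to merge an alive announcement whose version vector is clearly bogus. This is only a sketch of that idea; the Vsn layout (protocol min/max/cur in the first three bytes) is an assumption here, and the real change lives in hashicorp/memberlist#178:

```go
package main

import "fmt"

// isUsableVsn returns false for version vectors that cannot belong to a
// real node: too short, or claiming pMin = pMax = pCur = 0. An alive
// message failing this check would be dropped instead of being merged
// into the member list, where it would poison the protocol range.
func isUsableVsn(vsn []uint8) bool {
	if len(vsn) < 3 {
		return false
	}
	return !(vsn[0] == 0 && vsn[1] == 0 && vsn[2] == 0)
}

func main() {
	fmt.Println(isUsableVsn([]uint8{1, 5, 2, 2, 5, 4})) // true: sane announcement
	fmt.Println(isUsableVsn([]uint8{0, 0, 0, 0, 0, 0})) // false: corrupt, ignore it
	fmt.Println(isUsableVsn(nil))                       // false: missing Vsn
}
```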
Now that the PR has been merged into memberlist, we will need to re-vendor to pull in the changes.
…ible: [1, 0]
This is fixed in hashicorp/memberlist#178; bump memberlist to fix possible split brain in Consul.
consul version
for both Client and Server

Client:
Consul v0.7.3-criteo1-criteo1 (f3d518bc+CHANGES)
Protocol 2 spoken by default, understands 2 to 3 (agent will automatically use protocol >2 when speaking to compatible agents)

Server:
Consul v0.8.4
Protocol 2 spoken by default, understands 2 to 3 (agent will automatically use protocol >2 when speaking to compatible agents)
The custom criteo version is based on f3d518b and only contains patches from #2474 and #2657.
consul info
for both Client and Server

Client:
Server:
Operating system and Environment details
50% of the servers are Linux (mostly CentOS 7), 50% are Windows Server 2012 R2.
Description of the Issue (and unexpected/desired result)
Consul agents already present in the serf cluster upgraded without issues.
New consul agents added during the upgrade cannot join the cluster.
Here are the logs from an agent (on Windows):
Seems weird for many reasons: