Node health flapping - EC2 #1212
Those parameters are not configuration-tunable, but it would be simple to patch and build Consul with a different set of values. Do you know the approximate round-trip times between your different AZs? Also, is there anything interesting in the logs on the node side during one of these flapping events?
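For context, the knobs in question live in the memberlist library that Serf and Consul build on. A minimal sketch of what a patched set of values might look like, assuming the hashicorp/memberlist Go API; the exact defaults vary by version, so treat the numbers as illustrative only:

```go
package main

import (
	"fmt"
	"time"

	"github.com/hashicorp/memberlist"
)

func main() {
	// Start from the LAN defaults that Consul's gossip layer inherits.
	cfg := memberlist.DefaultLANConfig()

	// Illustrative, more forgiving values -- not a recommendation:
	cfg.ProbeTimeout = 1 * time.Second  // how long to wait for an ack (default is around 500ms)
	cfg.ProbeInterval = 2 * time.Second // how often a random node is probed (default is around 1s)
	cfg.SuspicionMult = 6               // scales the suspect timeout before a node is declared dead

	fmt.Printf("probe timeout=%s, probe interval=%s, suspicion mult=%d\n",
		cfg.ProbeTimeout, cfg.ProbeInterval, cfg.SuspicionMult)
}
```

A patched Consul build would bake values like these into its Serf/memberlist configuration rather than setting them at runtime as above.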
@slackpad Sorry for the late response! Pinging between the three AZs in us-west-2 is roughly 1.3ms. Nothing interesting is happening in the logs or being reported in Sysdig. Our nodes are pretty under-utilized right now; the highest traffic belongs to Consul and consul-template, with an average rate of about 7 KiB/s, spiking to 20 KiB/s about every 30 seconds. Our nodes are reporting an average rate of 7 KiB/s overall, so yes, pretty much all Consul traffic.
We have this problem too. At first it worked fine with 8 nodes, but now we have 12 nodes (and will be adding more) and we see node health flapping all the time. It happens randomly, on a random node, every few seconds. The error below repeats with a random node:
FYI, we run Consul in Docker containers. The hosts run on EC2 in a VPC on the same subnet, and ping works fine at under 2ms without any loss.
We have this problem too, but we're running a slightly larger cluster than those in this thread: approximately 160 nodes in our production cluster, and I'm testing this problem in a cluster with a little over 60 healthy nodes. Both of them have this flapping issue.

I enabled metrics in our testing environment and have been analyzing them to try to find some kind of pattern. I first thought maybe the probes were regularly taking 400ms and that some just happened to take slightly longer, at 500ms, and were failing. In fact, the mean time for probing nodes is 5-25ms with just an occasional outlier, and the standard deviation has a max of 100ms in the metrics I've taken over the last day. The curious part is that while the times for probing a node are very low, the sum indicates that 1-2 probes fail per minute: the sum hops between a little above 500ms and 1000ms, which seems to indicate 1-2 failed probes.

I tried checking the Serf queues to see if they were backed up. The metrics that Serf reports for intents, events, and queries seem to be consistently zero. I have no idea if these queues are the same as the memberlist queue, though, and I don't know if they have anything to do with acks. A cursory look at the code seems to indicate that these queues shouldn't affect acks: https://github.com/hashicorp/memberlist/blob/master/net.go#L234

At this point, I'm confused about why this is happening. I know EC2's network isn't the best, but failing this often doesn't seem to happen anywhere else that I'm aware of. I have already checked all security groups and we're operating inside of a VPC. I see traffic traveling over both TCP and UDP, so I know it's not a configuration issue at this point.
A correction to the above: there appeared to be a few nodes that had UDP blocked in our testing environment. I've confirmed we no longer have those and am gathering metrics again. I think we still have this problem, though, as our production environment doesn't have those ports blocked and we still regularly get nodes failing.
So @jsternberg, just to clarify: you're still having the flapping problem, but have solved the bad metrics issue you were having?
@djenriquez yep. I'm now attempting to get more data to try to find some kind of root cause. I'll keep running in this state and will likely be able to confirm what is happening after monitoring the metrics for a couple of days. I already see that one probe failed, but no node was marked dead. Since this is such a common issue, it may be worth adding some additional logging for when probes fail, or a test command for environment validation.
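Since several reports in this thread eventually traced back to blocked or dropped UDP, a standalone round-trip check along the lines of the "test command" suggested above could look like the sketch below. This is not part of Consul; the port is arbitrary, and the 500ms deadline simply mirrors memberlist's usual probe timeout.

```go
// udpcheck: run "udpcheck listen :9300" on one node and
// "udpcheck ping <other-host>:9300" on another to verify UDP round trips.
package main

import (
	"fmt"
	"net"
	"os"
	"time"
)

func main() {
	if len(os.Args) != 3 {
		fmt.Println("usage: udpcheck listen|ping <addr>")
		os.Exit(1)
	}

	switch os.Args[1] {
	case "listen":
		// Echo every datagram back to its sender.
		pc, err := net.ListenPacket("udp", os.Args[2])
		if err != nil {
			panic(err)
		}
		buf := make([]byte, 1024)
		for {
			n, from, err := pc.ReadFrom(buf)
			if err != nil {
				panic(err)
			}
			pc.WriteTo(buf[:n], from)
		}
	case "ping":
		// Send one datagram and wait up to 500ms for the echo.
		conn, err := net.Dial("udp", os.Args[2])
		if err != nil {
			panic(err)
		}
		start := time.Now()
		conn.Write([]byte("ping"))
		conn.SetReadDeadline(time.Now().Add(500 * time.Millisecond))
		buf := make([]byte, 1024)
		if _, err := conn.Read(buf); err != nil {
			fmt.Println("no UDP reply within 500ms:", err)
			os.Exit(1)
		}
		fmt.Println("UDP round trip:", time.Since(start))
	}
}
```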
This is definitely a good idea and we've been talking about it internally as well. Given the randomized nature of the pings, and the fact that Serf will try indirect pings via other nodes, Serf/Consul can paper over a lot of different types of failures and configuration issues in ways that can be confusing. In Consul 0.6 we added a TCP fallback ping, which helps keep the cluster stable while providing a log message about a possible misconfiguration ("node X was reachable via TCP but not UDP").

I forgot to ask, @djenriquez: are you running Consul inside Docker containers as well?
This can be a helpful debug message when trying to find misconfigurations of firewalls or EC2 security groups that cause UDP pings to occasionally fail. When a UDP ping fails, that might indicate a configuration problem, and having a debug message about it, with information about the node it's trying to ping, can be useful in finding the source of the failure. This should help with ambiguous flapping issues like hashicorp/consul#1212.
@slackpad Yes sir.
This is the known problem with Docker. Until it is fixed, please see my workaround here.
@winggundamth I do not believe this is the same issue. I am actually very familiar with the conntrack fix and have run into it with Consul in the past. The UDP issue fixed by conntrack is much more consistent in its failures than the flapping problem we are having here. The flapping we are seeing is roughly 30-90 seconds of downtime every few hours; the nodes are up 90-95% of the time. But when you start increasing the number of nodes, your cluster will see failures more often, because the chance of a single node being in that 5-10% failure window increases.
@djenriquez So does that mean the workaround cannot fix the problem for you?
@winggundamth correct, this does not fix the problem for us.
I retract my previous comments from this thread. It appears our core problem was something mentioned in a Google Groups mailing list message about this. After resolving the network errors in our testing environment, I looked at our staging environment, which was repeatedly failing. Luckily, the log messages mentioned a node that I know has been having trouble due to too much IO load. I'll have more data within the next couple of days, but I think this will probably fix our issue. I'll report back if I'm wrong and there is still an issue; otherwise assume that we're fine and have no issues.

The testing environment has been working perfectly with no node flapping. The PR I referenced above helped in figuring out which nodes were having problems and failing their UDP ping checks. I also made another fix to the metrics, which I'll open an issue for, that caused "alive" metrics to get reported at invalid times.

@djenriquez I'm not sure if my issue is the same as yours, but I would suggest looking at the metrics and seeing if you can make any heads or tails of them. They may point you to the problem.
Thanks for the update @jsternberg - could you link to that Google Groups thread here?
It was in response to the original reason this issue was created, which is also how I found this issue number: https://groups.google.com/d/msg/consul-tool/zyh8Kbifv6M/c1WWpknQ8H8J It was one of the first responses, so I'm a little embarrassed that this was our underlying issue. With the gossip protocol it is certainly difficult to find who is causing the failure; it turns out I was always looking at the wrong nodes.
Awesome, glad to see it's working better for you @jsternberg. Unfortunately, in our case it's not a single node but a completely random node that will fail for a short period of time, including nodes in the same VPC as the Consul servers. We have all traffic set up to pass through the required Consul ports across all servers. At first I thought there was a VPN issue between our AZs, but that wouldn't explain why nodes in the same AZ and VPC as the Consul servers also flap periodically. I haven't spent much time analyzing the data because the issue is minor, more of an annoyance. I'll go ahead and start looking deeper at this issue.
@djenriquez to clarify what happened to us: it was a random node that would fail. That's why it was so hard to find: the server that was reported as failing was not the one actually causing the failure.

A single instance of this happening isn't too bad, but if it happens with every ping you get a bunch of random suspect messages being sent to the cluster. Even if each one is only sent to a fraction (5%) of the nodes, you get 3 suspect messages a minute. Eventually, the suspect timeout gets hit before the node can refute the message and you end up with dead nodes. Unfortunately, I don't have enough evidence that this is exactly what happened, but removing the loaded node from our cluster seems to be making the cluster healthier. There could also be other reasons why the probe failed.
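To make the "suspect timeout" reasoning concrete, here is a back-of-the-envelope sketch of how that timeout scales with cluster size. The formula mirrors the spirit of memberlist's calculation (a suspicion multiplier times a log10 node-count factor times the probe interval), but the exact expression and defaults differ across versions, so treat this as an approximation rather than the library's code:

```go
package main

import (
	"fmt"
	"math"
	"time"
)

// Approximate suspect timeout: suspicionMult * ceil(log10(n+1)) * probeInterval.
// The multiplier and interval used below are assumed LAN-style defaults, not
// values read from any cluster in this thread.
func approxSuspicionTimeout(suspicionMult, n int, probeInterval time.Duration) time.Duration {
	nodeScale := math.Ceil(math.Log10(float64(n + 1)))
	return time.Duration(suspicionMult) * time.Duration(nodeScale) * probeInterval
}

func main() {
	for _, n := range []int{8, 60, 160} {
		fmt.Printf("%3d nodes -> suspect timeout of roughly %s\n",
			n, approxSuspicionTimeout(4, n, time.Second))
	}
}
```

With only a handful of seconds available to refute a suspicion, a node stalled by IO or CPU pressure can easily miss the window, which is consistent with the loaded node described above being declared dead.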
Ah, @jsternberg, that seems logical. The other wrinkle in my case, however, is that none of our nodes reach over 50% CPU utilization; the average utilization is ~15% across our entire infrastructure, and the heaviest traffic on our nodes belongs to Consul at a whopping 40 KiB/s average (sarcasm 😛). These are all new nodes that we're looking to utilize soon.
My old cluster had 5 leaders that were t2.micro instances, with around 50 agents connected, using it as DNS for auto-discovery with TTLs of 5 seconds for services and nodes, along with allow_stale turned on. All of these were running in Docker with net=host. I have been seeing between 2-5 leader elections a day. I relaunched all of the leaders on new nodes yesterday and put them on m3.mediums. This new cluster just had its first random leader election; it had been running for about 8 hours before the first event occurred.

Some stats on this m3.medium cluster: Node: 10.81, Node: 20.71
Any status updates, or additional info we can provide, to get this issue up and moving? The flapping is starting to affect some of our services, since consul-template is removing the routes from our NGINX config because the nodes are considered unhealthy. Eventually, when they flap back, all is fine, but this service interruption is very problematic for the nodes hosting important services.
Updates?
Hi @djenriquez - haven't forgotten about this but working through a backlog.
Hey @djenriquez, I'm not sure if you follow the mailing list (https://groups.google.com/forum/#!forum/consul-tool), but there are a number of threads on this topic. I saw some flapping issues, but they are now massively reduced through a few tweaks (if this helps):
I'm now seeing 1-2 leader elections a day on a small cluster (3 Consul servers, another 6-12 agents, and ~16 services). Feel free to ping me if you want to dive into the details.
I think I read through the whole thread. Did I miss the solution? The issue says it's closed, so what is the solution? I am having the exact same problem.

I have two VPCs connected via VPC peering. In one VPC there are four Consul servers, and all agents in both VPCs register to them. When I look at the Consul UI, I can see random nodes going orange and then back to green. There are ~70 servers in my infrastructure and they all have at least the Consul agent running. I have everything going to Logstash so I can mine logs quickly for patterns. I am running Consul 0.6.4 and using consul-template to update HAProxy. The infrastructure is built by CloudFormation and all security groups are open to each other with respect to Consul and consul-template.

The flapping is not bad most of the time, but there have been several instances where it was such that there were no servers in rotation in HAProxy, though only for less than a minute. I don't know how to solve this issue.
Hi @pong-takepart, sorry the paper trail isn't very clear on this one. We've got some changes teed up to go out in the next release of Consul to make this much better - here's the PR that pulled them in: #2101.
@slackpad OSSUM!!! Thanks!!
@pong-takepart My issue turned out to be a low-memory problem caused by running monitoring containers. When something happened in the Consul cluster that resulted in an increase in logging, the monitoring container used more memory, which led to more issues with Consul due to low memory, and in turn more logging. I ended up removing the monitoring containers and the Consul cluster has been rock solid ever since.
@pong-takepart What size are your servers? And if you are not collecting the Consul metrics, I would recommend it.
@slackpad Hello. Is there any tentative guesstimate about when you will cut a release that contains these changes? I am guessing a lot of folks will be interested to pick up these improvements and see if they address the flapping issues. Thanks for spending cycles investigating this!
Hi @mrwilby I don't have a firm date on the final release but hopefully a release candidate in the next several weeks. We are burning down a few more things before we cut a release, though this feature is fully integrated in master now.
Thanks @slackpad - We're trying out master now in one of our test environments. Fingers folded for finally fixing flapping faults!
@mrwilby excellent - would appreciate any feedback. The way these fixes should manifest is that the node experiencing non-real-time behavior (CPU exhaustion, network performance issues, dropped packets, etc.) should still get marked failed, but it shouldn't be able to falsely accuse other, healthy nodes of being failed and cause them to flap.
@slackpad - OK. Are there (or will there be) tuning parameters we can use to adjust the "non-real-time behavior" tolerances? We, and I am sure others, would prefer to avoid having to provision costly/large cloud instance types for a very small cluster (a few handfuls of nodes) simply because Consul is over-aggressive in determining whether or not a node has failed.
We are definitely planning to do that for Raft and the servers as part of this release. We didn't plan manual tunes at the Serf level, but this new algorithm should be much more forgiving in that regard, since it requires independent confirmations in order to quickly declare a failure. Depending on what's causing the non-real-time behavior and how bad it is, you may find that the degraded node itself isn't getting marked failed at all, especially if your load spikes are short-lived and erratic. As we continue testing and get feedback we may consider some Serf tunes as well, but hopefully we won't need to.
Would be great to have access to Serf tunables at least to some extent.
@slackpad I am running a build from master (0.7.0dev @ 6af6baf) and unfortunately I am still experiencing leader elections several times a day. This is a 5-node cluster that is completely idle except for Consul (which is just sitting there, not yet in use). The servers are relatively small (t2.small), spread across 3 AWS availability zones. I have tried using m3.medium and had the same result. What data can I provide that will help?
@kingpong Coming from an AWS setup, I can tell you I always experienced leader elections when running t2 instance types. It was not until I moved to m3 and larger that the election issue went away.
I am running my 3-node EC2 cluster on t2.smalls and do not have the election issues.
@kingpong Check out this thread: hashicorp/vault#1585. It is related to Vault going down because of Consul flapping. TL;DR: as noted in this thread and elsewhere, you can't run Consul on super small instances. In summary, I found two things:
Thanks for the additional info, @skippy.

tl;dr: I think GOMAXPROCS=2 was the culprit. Changing it to 10 seems to have solved the problem.

These servers are completely unused outside of Consul, and Consul itself is literally doing nothing except gossiping with itself. Even a t2.small is ridiculously overpowered for that task. They are 97% idle all of the time, including during the times when these leader elections have occurred. This smells like an application bug, because I have been monitoring with vmstat, ping, and tcpdump for the last 24 hours, and none of those tools indicate system load or link-layer network instability during the elections. I do see some TCP retransmits and connection resets, but other traffic (e.g. pings) between the hosts at literally the very same second is unaffected.

I have been using the Chef consul cookbook to deploy Consul. The cookbook automatically sets the Consul process's GOMAXPROCS to the number of cores or 2, whichever is greater, so on my single-core machine that means GOMAXPROCS=2. On a whim, I set GOMAXPROCS=10 (picked out of thin air) across the cluster. It's been six hours without an election so far (which is a record, by a margin of about 5 hours). Tomorrow I will try removing GOMAXPROCS altogether.
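For anyone comparing notes on this, a quick way to double-check how a GOMAXPROCS override maps onto what the Go runtime actually uses is a small standalone sketch like the one below (not something built into Consul; it only reports on its own process, which is why it is just an illustration of the env-var-to-runtime relationship):

```go
package main

import (
	"fmt"
	"os"
	"runtime"
)

func main() {
	// What the environment asked for vs. what the Go runtime is actually using.
	fmt.Println("GOMAXPROCS env var:   ", os.Getenv("GOMAXPROCS"))
	fmt.Println("GOMAXPROCS in effect: ", runtime.GOMAXPROCS(0)) // 0 queries without changing it
	fmt.Println("Logical CPUs (NumCPU):", runtime.NumCPU())
}
```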
Just to provide my perspective again: I use an Ansible playbook to install and run Consul, and the default GOMAXPROCS is also 2. Haven't had issues with it. ¯\_(ツ)_/¯
@MrMMorris Are you running inside of Docker? If not, that's likely why you can run on t2s by default. I noticed results similar to @kingpong's, with t2 instance types not being utilized by default.
yep, no docker for consul here
@kingpong Did your GOMAXPROCS=10 change have any measurable effect over time? We are running 0.7.x (with the latest flapping-related improvements) but still see a few flaps now and then (on c4.large in AWS). This is with a small cluster of ~30 nodes (incl. 3 Consul servers).

@slackpad It would be very helpful if there were some way to diagnose the root cause of flaps, i.e. whether they are due to networking issues, CPU starvation, or something else. Right now we have Consul debug logging turned on, but the logs are not particularly enlightening. We also have the Consul metrics piped into our stats backend, but again, we don't see any real correlations with the flap times. We are probably not looking in the correct place inside the Consul logs and/or stats.

IMO the collection of flapping-related issues is definitely something that HashiCorp should prioritize; so far we've yet to find any recipe that solves them. The only recommended solution is to massively over-provision your instance types, but when the agents themselves seem to contribute (not just the Consul servers), this effectively means every instance type has to be over-provisioned just to use Consul itself, which is simply not viable.
@mrwilby Is Consul eating up your CPU on the m4.large, or is the CPU not doing a lot of work? When we were on m3.mediums we had zero flapping until we pegged our CPUs at 80%+, and moving to c4.larges resolved that. We run a large cluster and had a lot going on, but we could likely go down to m3.mediums again. With Docker + non-t2 instance types, CPU should only be an issue if the entire server's CPU is pegged. Setting GOMAXPROCS should not be necessary.
@sstarcher Sorry, my mistake - we're actually running the Consul servers on c4.large (I edited my post to correct it). The Consul servers are listed as 100% idle in our metrics (of course, I am sure there are small bursts of activity that our metric granularity doesn't reflect). But anyway, they are massively over-provisioned for our very light workloads.

The 0.7.x changes appear to be a lot more stable for us than 0.6.x, but we do still see occasional flaps with no substantiating metrics to enable us to pin down why. Again, we are most likely not looking in the correct place, which is why I asked if there is any info about how to diagnose the root cause of flaps.

We have a small Kafka cluster of 3 brokers. The last flap involved 2 of the brokers deciding that the 3rd had failed. From looking at the stats and logs of the 3rd broker, it was >90% idle during this time. Our test environment was running Kafka on the m4.large instance type, so it wasn't an under-provisioned t2-series instance. We moved Consul from Docker to native EC2 a long time ago, due to all the Docker UDP issues and in the hope that it would resolve the flapping, but to no avail.
@mrwilby your case certainly sounds like a networking issue. I would recommend looking at the logs of the node that is the leader.
@mrwilby thanks for the feedback - fixing these flaps was a big area of focus for the 0.7 release and hopefully we can tune things before the release to get performance even better. It's not in master yet but we are also going to expose Raft timing controls that should let you run your servers on less powerful instances as well.
Are you running 0.7.0-rc1, and is it on all nodes in this test cluster or just some of them? To get the full flap reduction benefits you'd want most or ideally all nodes to have the new logic. You can see who actually detected the node as failed by looking for |
When adding a new node and distributing data without downtime, will performance degrade?
We're currently running Consul 0.6.4 on a cluster of ~100 nodes with both Windows and Linux boxes. We're seeing a high rate of failure and subsequent join events (almost immediately afterwards) when there's a max-CPU situation on a few of the nodes (4 or 5). The number of failure events (EventMemberFailure) seen at a peer is around 200-250 per hour. The surprising thing is that the failures reported are not just for the nodes with maxed CPU, but across the entire cluster, though the majority (a little over 50%) are for the nodes with maxed CPU. Force-stopping the Consul agent on the slammed nodes makes all the failures go away.

We therefore suspect that false alarms are being triggered by the slammed nodes, due to them not being able to send/receive/process UDP packets in time, causing the entire cluster to experience churn. We're still surprised that false alarms can progress to the point that many healthy nodes get reported as failed. We see that 0.7.0 adds guards for similar/related scenarios. Before we put in the effort to upgrade and re-validate our deployments, it would be great if we could confirm that this is 'as expected' behavior with 0.6.4 when a few nodes are at max CPU, and that we haven't missed some tricks in the book to address this with 0.6.4. Thanks!
Hi @er-kiran you are correct - Consul 0.7's Lifeguard changes were specifically targeted at limiting the damage that a degraded node could do to the rest of the cluster. Previously, one degraded node could start making false accusations that could lead to churn around the cluster.
Unfortunately, 0.7 does not seem to fix this. We had a few incidents and were surprised to find out that our staging instances were affecting production (even though staging was set up, via auth tokens, to only be able to read Consul information, not write to it). We upgraded everything to 0.7 but are still seeing flapping. It would be nice to be able to tune timeouts, or at least to disallow certain instances from having a vote (e.g. via access tokens).
We have a five-node Consul cluster handling roughly 30 nodes across 4 different AWS accounts in a shared VPC, across different availability zones. For the most part, everything works great. However, quite frequently a random node will flap from healthy to critical. The flapping happens on completely random nodes with no consistency whatsoever.

Every time a node "flaps", it causes consul-template, which populates our NGINX reverse-proxy config, to reload. This causes things like our Apache benchmark tests to fail.

We are looking to use Consul in production, but this issue has caused a lot of people to worry about consistency. We also have all required TCP/UDP ports open between all the nodes.

We believe the issue is just a latency problem with Serf's polling. Is there a way to modify the Serf health-check interval to adjust for geographical latency?
Here's the log from one of the Consul servers: