Node health flapping - EC2 #1212
Those parameters are not configuration-tunable, but it would be simple to patch and build Consul with a different set of values. Do you know the approximate round-trip times between your different AZs? Also, is there anything interesting in the logs on the node side during one of these flapping events?
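For context, the knobs in question live in the memberlist library that Serf and Consul build on. A minimal sketch of what a patched set of values might look like, assuming the hashicorp/memberlist Go API; the exact defaults vary by version, so treat the numbers as illustrative only:

```go
package main

import (
	"fmt"
	"time"

	"github.com/hashicorp/memberlist"
)

func main() {
	// Start from the LAN defaults that Consul's gossip layer inherits.
	cfg := memberlist.DefaultLANConfig()

	// Illustrative, more forgiving values -- not a recommendation:
	cfg.ProbeTimeout = 1 * time.Second  // how long to wait for an ack (default is around 500ms)
	cfg.ProbeInterval = 2 * time.Second // how often a random node is probed (default is around 1s)
	cfg.SuspicionMult = 6               // scales the suspect timeout before a node is declared dead

	fmt.Printf("probe timeout=%s, probe interval=%s, suspicion mult=%d\n",
		cfg.ProbeTimeout, cfg.ProbeInterval, cfg.SuspicionMult)
}
```

A patched Consul build would bake values like these into its Serf/memberlist configuration rather than setting them at runtime as above.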
@slackpad Sorry for the late response! Pinging between the three AZs in us-west-2 is roughly 1.3ms. Nothing interesting is happening in the logs or being reported in Sysdig. Our nodes are pretty under-utilized right now; the highest traffic belongs to Consul and consul-template, with an average rate of about 7 KiB/s, spiking to 20 KiB/s about every 30 seconds. Our nodes are reporting an average rate of 7 KiB/s overall, so yes, pretty much all Consul traffic.
We have this problem too. At first it worked fine with 8 nodes, but now we have 12 nodes (and will be adding more) and we see node health flapping all the time. It happens randomly, on a random node, every few seconds. The error below repeats with a random node:
FYI, we run Consul in Docker containers. The hosts run on EC2 in a VPC on the same subnet, and ping works fine at under 2ms without any loss.
We have this problem too, but we're running a slightly larger cluster than those in this thread: approximately 160 nodes in our production cluster, and I'm testing this problem in a cluster with a little over 60 healthy nodes. Both of them have this flapping issue.

I enabled metrics in our testing environment and have been analyzing them to try to find some kind of pattern. I first thought maybe the probes were regularly taking 400ms and that some just happened to take slightly longer, at 500ms, and were failing. In fact, the mean time for probing nodes is 5-25ms with just an occasional outlier, and the standard deviation has a max of 100ms in the metrics I've taken over the last day. The curious part is that while the times for probing a node are very low, the sum indicates that 1-2 probes fail per minute: the sum hops between a little above 500ms and 1000ms, which seems to indicate 1-2 failed probes.

I tried checking the Serf queues to see if they were backed up. The metrics that Serf reports for intents, events, and queries seem to be consistently zero. I have no idea if these queues are the same as the memberlist queue, though, and I don't know if they have anything to do with acks. A cursory look at the code seems to indicate that these queues shouldn't affect acks: https://github.com/hashicorp/memberlist/blob/master/net.go#L234

At this point, I'm confused about why this is happening. I know EC2's network isn't the best, but failing this often doesn't seem to happen anywhere else that I'm aware of. I have already checked all security groups and we're operating inside of a VPC. I see traffic traveling over both TCP and UDP, so I know it's not a configuration issue at this point.
A correction to the above: there appeared to be a few nodes that had UDP blocked in our testing environment. I've confirmed we no longer have those and am gathering metrics again. I think we still have this problem, though, as our production environment doesn't have those ports blocked and we still regularly get nodes failing.
So @jsternberg, just to clarify: you're still having the flapping problem, but have solved the bad metrics issue you were having?
@djenriquez yep. I'm now attempting to get more data to try to find some kind of root cause. I'll keep running in this state and will likely be able to confirm what is happening after monitoring the metrics for a couple of days. I already see that one probe failed, but no node was marked dead. Since this is such a common issue, it may be worth adding some additional logging for when probes fail, or a test command for environment validation.
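Since several reports in this thread eventually traced back to blocked or dropped UDP, a standalone round-trip check along the lines of the "test command" suggested above could look like the sketch below. This is not part of Consul; the port is arbitrary, and the 500ms deadline simply mirrors memberlist's usual probe timeout.

```go
// udpcheck: run "udpcheck listen :9300" on one node and
// "udpcheck ping <other-host>:9300" on another to verify UDP round trips.
package main

import (
	"fmt"
	"net"
	"os"
	"time"
)

func main() {
	if len(os.Args) != 3 {
		fmt.Println("usage: udpcheck listen|ping <addr>")
		os.Exit(1)
	}

	switch os.Args[1] {
	case "listen":
		// Echo every datagram back to its sender.
		pc, err := net.ListenPacket("udp", os.Args[2])
		if err != nil {
			panic(err)
		}
		buf := make([]byte, 1024)
		for {
			n, from, err := pc.ReadFrom(buf)
			if err != nil {
				panic(err)
			}
			pc.WriteTo(buf[:n], from)
		}
	case "ping":
		// Send one datagram and wait up to 500ms for the echo.
		conn, err := net.Dial("udp", os.Args[2])
		if err != nil {
			panic(err)
		}
		start := time.Now()
		conn.Write([]byte("ping"))
		conn.SetReadDeadline(time.Now().Add(500 * time.Millisecond))
		buf := make([]byte, 1024)
		if _, err := conn.Read(buf); err != nil {
			fmt.Println("no UDP reply within 500ms:", err)
			os.Exit(1)
		}
		fmt.Println("UDP round trip:", time.Since(start))
	}
}
```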
This is definitely a good idea and we've been talking about it internally as well. Given the randomized nature of the pings, and the fact that Serf will try indirect pings via other nodes, Serf/Consul can paper over a lot of different types of failures and configuration issues in ways that can be confusing. In Consul 0.6 we added a TCP fallback ping, which helps keep the cluster stable while providing a log message about a possible misconfiguration ("node X was reachable via TCP but not UDP").

I forgot to ask, @djenriquez: are you running Consul inside Docker containers as well?
This can be a helpful debug message when trying to find misconfigurations of firewalls or EC2 security groups that cause UDP pings to occasionally fail. When a UDP ping fails, that might indicate a configuration problem, and having a debug message about it, with information about the node it's trying to ping, can be useful in finding the source of the failure. This should help with ambiguous flapping issues like hashicorp/consul#1212.
@slackpad Yes sir.
This is the known problem with Docker. Until it is fixed, please see my workaround here.
@winggundamth I do not believe this is the same issue. I am actually very familiar with the conntrack fix and have run into it with Consul in the past. The UDP issue fixed by conntrack is much more consistent in its failures than the flapping problem we are having here. The flapping we are seeing is roughly 30-90 seconds of downtime every few hours; the nodes are up 90-95% of the time. But when you start increasing the number of nodes, your cluster will see failures more often, because the chance of a single node being in that 5-10% failure window increases.
@djenriquez So does that mean the workaround cannot fix the problem for you?
@winggundamth correct, this does not fix the problem for us.
I retract my previous comments from this thread. It appears our core problem was something mentioned in a Google Groups mailing list message about this. After resolving the network errors in our testing environment, I looked at our staging environment, which was repeatedly failing. Luckily, the log messages mentioned a node that I know has been having trouble due to too much IO load. I'll have more data within the next couple of days, but I think this will probably fix our issue. I'll report back if I'm wrong and there is still an issue; otherwise assume that we're fine and have no issues.

The testing environment has been working perfectly with no node flapping. The PR I referenced above helped in figuring out which nodes were having problems and failing their UDP ping checks. I also made another fix to the metrics, which I'll open an issue for, that caused "alive" metrics to get reported at invalid times.

@djenriquez I'm not sure if my issue is the same as yours, but I would suggest looking at the metrics and seeing if you can make any heads or tails of them. They may point you to the problem.
Thanks for the update @jsternberg - could you link to that Google Groups thread here?
It was in response to the original reason this issue was created, which is also how I found this issue number: https://groups.google.com/d/msg/consul-tool/zyh8Kbifv6M/c1WWpknQ8H8J It was one of the first responses, so I'm a little embarrassed that this was our underlying issue. With the gossip protocol it is certainly difficult to find who is causing the failure; it turns out I was always looking at the wrong nodes.
Awesome, glad to see it's working better for you @jsternberg. Unfortunately, in our case it's not a single node but a completely random node that will fail for a short period of time, including nodes in the same VPC as the Consul servers. We have all traffic set up to pass through the required Consul ports across all servers. At first I thought there was a VPN issue between our AZs, but that wouldn't explain why nodes in the same AZ and VPC as the Consul servers also flap periodically. I haven't spent much time analyzing the data because the issue is minor, more of an annoyance. I'll go ahead and start looking deeper at this issue.
@djenriquez to clarify what happened to us: it was a random node that would fail. That's why it was so hard to find: the server that was reported as failing was not the one actually causing the failure.

A single instance of this happening isn't too bad, but if it happens with every ping you get a bunch of random suspect messages being sent to the cluster. Even if each one is only sent to a fraction (5%) of the nodes, you get 3 suspect messages a minute. Eventually, the suspect timeout gets hit before the node can refute the message and you end up with dead nodes. Unfortunately, I don't have enough evidence that this is exactly what happened, but removing the loaded node from our cluster seems to be making the cluster healthier. There could also be other reasons why the probe failed.
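To make the "suspect timeout" reasoning concrete, here is a back-of-the-envelope sketch of how that timeout scales with cluster size. The formula mirrors the spirit of memberlist's calculation (a suspicion multiplier times a log10 node-count factor times the probe interval), but the exact expression and defaults differ across versions, so treat this as an approximation rather than the library's code:

```go
package main

import (
	"fmt"
	"math"
	"time"
)

// Approximate suspect timeout: suspicionMult * ceil(log10(n+1)) * probeInterval.
// The multiplier and interval used below are assumed LAN-style defaults, not
// values read from any cluster in this thread.
func approxSuspicionTimeout(suspicionMult, n int, probeInterval time.Duration) time.Duration {
	nodeScale := math.Ceil(math.Log10(float64(n + 1)))
	return time.Duration(suspicionMult) * time.Duration(nodeScale) * probeInterval
}

func main() {
	for _, n := range []int{8, 60, 160} {
		fmt.Printf("%3d nodes -> suspect timeout of roughly %s\n",
			n, approxSuspicionTimeout(4, n, time.Second))
	}
}
```

With only a handful of seconds available to refute a suspicion, a node stalled by IO or CPU pressure can easily miss the window, which is consistent with the loaded node described above being declared dead.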
Ah, @jsternberg, that seems logical. The other wrinkle in my case, however, is that none of our nodes reach over 50% CPU utilization; the average utilization is ~15% across our entire infrastructure, and the heaviest traffic on our nodes belongs to Consul at a whopping 40 KiB/s average (sarcasm 😛). These are all new nodes that we're looking to utilize soon.
My old cluster had 5 leaders that were t2.micro instances, with around 50 agents connected, using it as DNS for auto-discovery with TTLs of 5 seconds for services and nodes, along with allow_stale turned on. All of these were running in Docker with net=host. I have been seeing between 2-5 leader elections a day. I relaunched all of the leaders on new nodes yesterday and put them on m3.mediums. This new cluster just had its first random leader election; it had been running for about 8 hours before the first event occurred.

Some stats on this m3.medium cluster: Node: 10.81, Node: 20.71
Any status updates, or additional info we can provide, to get this issue up and moving? The flapping is starting to affect some of our services, since consul-template is removing the routes from our NGINX config because the nodes are considered unhealthy. Eventually, when they flap back, all is fine, but this service interruption is very problematic for the nodes hosting important services.
Updates?
Hi @djenriquez - haven't forgotten about this but working through a backlog.
Hey @djenriquez, I'm not sure if you follow the mailing list (https://groups.google.com/forum/#!forum/consul-tool), but there are a number of threads on this topic. I saw some flapping issues, but they are now massively reduced through a few tweaks (if this helps):
I'm now seeing 1-2 leader elections a day on a small cluster (3 Consul servers, another 6-12 agents, and ~16 services). Feel free to ping me if you want to dive into the details.
I think I read through the whole thread. Did I miss the solution? The issue says it's closed, so what is the solution? I am having the exact same problem.

I have two VPCs connected via VPC peering. In one VPC there are four Consul servers, and all agents in both VPCs register to them. When I look at the Consul UI, I can see random nodes going orange and then back to green. There are ~70 servers in my infrastructure and they all have at least the Consul agent running. I have everything going to Logstash so I can mine logs quickly for patterns. I am running Consul 0.6.4 and using consul-template to update HAProxy. The infrastructure is built by CloudFormation and all security groups are open to each other with respect to Consul and consul-template.

The flapping is not bad most of the time, but there have been several instances where it was such that there were no servers in rotation in HAProxy, though only for less than a minute. I don't know how to solve this issue.
Hi @pong-takepart, sorry the paper trail isn't very clear on this one. We've got some changes teed up to go out in the next release of Consul to make this much better - here's the PR that pulled them in: #2101.
@slackpad OSSUM!!! Thanks!!
@pong-takepart My issue turned out to be a low-memory problem caused by running monitoring containers. When something happened in the Consul cluster that resulted in an increase in logging, the monitoring container used more memory, which led to more issues with Consul due to low memory, and in turn more logging. I ended up removing the monitoring containers and the Consul cluster has been rock solid ever since.
@pong-takepart What size are your servers? And if you are not collecting the Consul metrics, I would recommend it.
@slackpad Hello. Is there any tentative guesstimate about when you will cut a release that contains these changes? I am guessing a lot of folks will be interested to pick up these improvements and see if they address the flapping issues. Thanks for spending cycles investigating this!
Hi @mrwilby I don't have a firm date on the final release but hopefully a release candidate in the next several weeks. We are burning down a few more things before we cut a release, though this feature is fully integrated in master now.
Thanks @slackpad - We're trying out master now in one of our test environments. Fingers folded for finally fixing flapping faults!
@mrwilby excellent - would appreciate any feedback. The way these fixes should manifest is that the node experiencing non-real-time behavior (CPU exhaustion, network performance issues, dropped packets, etc.) should still get marked failed, but it shouldn't be able to falsely accuse other, healthy nodes of being failed and cause them to flap.
@slackpad - OK. Are there (or will there be) tuning parameters we can use to adjust the "non-real-time behavior" tolerances? We, and I am sure others, would prefer to avoid having to provision costly/large cloud instance types for a very small cluster (a few handfuls of nodes) simply because Consul is over-aggressive in determining whether or not a node has failed.
We are definitely planning to do that for Raft and the servers as part of this release. We didn't plan manual tunes at the Serf level, but this new algorithm should be much more forgiving in that regard, since it requires independent confirmations in order to quickly declare a failure. Depending on what's causing the non-real-time behavior and how bad it is, you may find that the degraded node itself isn't getting marked failed at all, especially if your load spikes are short-lived and erratic. As we continue testing and get feedback we may consider some Serf tunes as well, but hopefully we won't need to.
Would be great to have access to Serf tunables at least to some extent.
@slackpad I am running a build from master (0.7.0dev @ 6af6baf) and unfortunately I am still experiencing leader elections several times a day. This is a 5-node cluster that is completely idle except for Consul (which is just sitting there, not yet in use). The servers are relatively small (t2.small), spread across 3 AWS availability zones. I have tried using m3.medium and had the same result. What data can I provide that will help?
@kingpong Coming from an AWS setup, I can tell you I always experienced leader elections when running t2 instance types. It was not until I moved to m3 and larger that the election issue went away.
I am running my 3-node EC2 cluster on t2.smalls and do not have the election issues.
@kingpong Check out this thread: hashicorp/vault#1585. It is related to Vault going down because of Consul flapping. TL;DR: as noted in this thread and elsewhere, you can't run Consul on super small instances. In summary, I found two things:
Thanks for the additional info, @skippy.

tl;dr: I think GOMAXPROCS=2 was the culprit. Changing it to 10 seems to have solved the problem.

These servers are completely unused outside of Consul, and Consul itself is literally doing nothing except gossiping with itself. Even a t2.small is ridiculously overpowered for that task. They are 97% idle all of the time, including during the times when these leader elections have occurred. This smells like an application bug, because I have been monitoring with vmstat, ping, and tcpdump for the last 24 hours, and none of those tools indicate system load or link-layer network instability during the elections. I do see some TCP retransmits and connection resets, but other traffic (e.g. pings) between the hosts at literally the very same second is unaffected.

I have been using the Chef consul cookbook to deploy Consul. The cookbook automatically sets the Consul process's GOMAXPROCS to the number of cores or 2, whichever is greater, so on my single-core machine that means GOMAXPROCS=2. On a whim, I set GOMAXPROCS=10 (picked out of thin air) across the cluster. It's been six hours without an election so far (which is a record, by a margin of about 5 hours). Tomorrow I will try removing GOMAXPROCS altogether.
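For anyone comparing notes on this, a quick way to double-check how a GOMAXPROCS override maps onto what the Go runtime actually uses is a small standalone sketch like the one below (not something built into Consul; it only reports on its own process, which is why it is just an illustration of the env-var-to-runtime relationship):

```go
package main

import (
	"fmt"
	"os"
	"runtime"
)

func main() {
	// What the environment asked for vs. what the Go runtime is actually using.
	fmt.Println("GOMAXPROCS env var:   ", os.Getenv("GOMAXPROCS"))
	fmt.Println("GOMAXPROCS in effect: ", runtime.GOMAXPROCS(0)) // 0 queries without changing it
	fmt.Println("Logical CPUs (NumCPU):", runtime.NumCPU())
}
```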
Just to provide my perspective again: I use an Ansible playbook to install and run Consul, and the default GOMAXPROCS is also 2. Haven't had issues with it. ¯\_(ツ)_/¯
@MrMMorris Are you running inside of Docker? If not, that's likely why you can run on t2s by default. I noticed results similar to @kingpong's, with t2 instance types not being utilized by default.
yep, no docker for consul here
@kingpong Did your GOMAXPROCS=10 change have any measurable effect over time? We are running 0.7.x (with the latest flapping-related improvements) but still see a few flaps now and then (on c4.large in AWS). This is with a small cluster of ~30 nodes (incl. 3 Consul servers).

@slackpad It would be very helpful if there were some way to diagnose the root cause of flaps, i.e. whether they are due to networking issues, CPU starvation, or something else. Right now we have Consul debug logging turned on, but the logs are not particularly enlightening. We also have the Consul metrics piped into our stats backend, but again, we don't see any real correlations with the flap times. We are probably not looking in the correct place inside the Consul logs and/or stats.

IMO the collection of flapping-related issues is definitely something that HashiCorp should prioritize; so far we've yet to find any recipe that solves them. The only recommended solution is to massively over-provision your instance types, but when the agents themselves seem to contribute (not just the Consul servers), this effectively means every instance type has to be over-provisioned just to use Consul itself, which is simply not viable.
@mrwilby Is Consul eating up your CPU on the m4.large, or is the CPU not doing a lot of work? When we were on m3.mediums we had zero flapping until we pegged our CPUs at 80%+, and moving to c4.larges resolved that. We run a large cluster and had a lot going on, but we could likely go down to m3.mediums again. With Docker + non-t2 instance types, CPU should only be an issue if the entire server's CPU is pegged. Setting GOMAXPROCS should not be necessary.
@sstarcher Sorry, my mistake - we're actually running the Consul servers on c4.large (I edited my post to correct it). The Consul servers are listed as 100% idle in our metrics (of course, I am sure there are small bursts of activity that our metric granularity doesn't reflect). But anyway, they are massively over-provisioned for our very light workloads.

The 0.7.x changes appear to be a lot more stable for us than 0.6.x, but we do still see occasional flaps with no substantiating metrics to enable us to pin down why. Again, we are most likely not looking in the correct place, which is why I asked if there is any info about how to diagnose the root cause of flaps.

We have a small Kafka cluster of 3 brokers. The last flap involved 2 of the brokers deciding that the 3rd had failed. From looking at the stats and logs of the 3rd broker, it was >90% idle during this time. Our test environment was running Kafka on the m4.large instance type, so it wasn't an under-provisioned t2-series instance. We moved Consul from Docker to native EC2 a long time ago, due to all the Docker UDP issues and in the hope that it would resolve the flapping, but to no avail.
@mrwilby your case certainly sounds like a networking issue. I would recommend looking at the logs of the node that is the leader.
@mrwilby thanks for the feedback - fixing these flaps was a big area of focus for the 0.7 release and hopefully we can tune things before the release to get performance even better. It's not in master yet but we are also going to expose Raft timing controls that should let you run your servers on less powerful instances as well.
Are you running 0.7.0-rc1, and is it on all nodes in this test cluster or just some of them? To get the full flap reduction benefits you'd want most or ideally all nodes to have the new logic. You can see who actually detected the node as failed by looking for |
When adding a new node and distributing data without downtime, will performance degrade?
We're currently running Consul 0.6.4 on a cluster of ~100 nodes with both Windows and Linux boxes. We're seeing a high rate of failure and subsequent join events (almost immediately afterwards) when there's a max-CPU situation on a few of the nodes (4 or 5). The number of failure events (EventMemberFailure) seen at a peer is around 200-250 per hour. The surprising thing is that the failures reported are not just for the nodes with maxed CPU, but across the entire cluster, though the majority (a little over 50%) are for the nodes with maxed CPU. Force-stopping the Consul agent on the slammed nodes makes all the failures go away.

We therefore suspect that false alarms are being triggered by the slammed nodes, due to them not being able to send/receive/process UDP packets in time, causing the entire cluster to experience churn. We're still surprised that false alarms can progress to the point that many healthy nodes get reported as failed. We see that 0.7.0 adds guards for similar/related scenarios. Before we put in the effort to upgrade and re-validate our deployments, it would be great if we could confirm that this is 'as expected' behavior with 0.6.4 when a few nodes are at max CPU, and that we haven't missed some tricks in the book to address this with 0.6.4. Thanks!
Hi @er-kiran you are correct - Consul 0.7's Lifeguard changes were specifically targeted at limiting the damage that a degraded node could do to the rest of the cluster. Previously, one degraded node could start making false accusations that could lead to churn around the cluster.
Unfortunately, 0.7 does not seem to fix this. We had a few incidents and were surprised to find out that our staging instances were affecting production (even though staging was set up, via auth tokens, to only be able to read Consul information, not write to it). We upgraded everything to 0.7 but are still seeing flapping. It would be nice to be able to tune timeouts, or at least to disallow certain instances from having a vote (e.g. via access tokens).
We have a five-node Consul cluster handling roughly 30 nodes across 4 different AWS accounts in a shared VPC, across different availability zones. For the most part, everything works great. However, quite frequently a random node will flap from healthy to critical. The flapping happens on completely random nodes with no consistency whatsoever.

Every time a node "flaps", it causes consul-template, which populates our NGINX reverse-proxy config, to reload. This causes things like our Apache benchmark tests to fail.

We are looking to use Consul in production, but this issue has caused a lot of people to worry about consistency. We also have all required TCP/UDP ports open between all the nodes.

We believe the issue is just a latency problem with Serf's polling. Is there a way to modify the Serf health-check interval to adjust for geographical latency?
Here's the log from one of the Consul servers: