Serf health check failures #1180
I also seem to be getting new leader elections periodically:
This is v0.5.2.
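One rough way to see how often the leader is actually changing is to poll the leader address from any agent with the official Go API client (github.com/hashicorp/consul/api). A minimal sketch, assuming a local agent on the default HTTP address; the 5-second poll interval is arbitrary:

```go
package main

import (
	"fmt"
	"log"
	"time"

	"github.com/hashicorp/consul/api"
)

func main() {
	// Talk to the local agent on the default address (127.0.0.1:8500).
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	var last string
	for {
		// Status().Leader() returns the current Raft leader, e.g. "10.0.0.1:8300".
		leader, err := client.Status().Leader()
		if err != nil {
			log.Printf("error querying leader: %v", err)
		} else if leader != last {
			fmt.Printf("%s leader changed: %q -> %q\n",
				time.Now().Format(time.RFC3339), last, leader)
			last = leader
		}
		time.Sleep(5 * time.Second)
	}
}
```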
Your network config looks fine to me - very similar to what I run. Some questions about your environment:
Also, can you post the telemetry from your leader?
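For agents of this vintage, sending the Consul process a SIGUSR1 signal dumps the current telemetry to its log. Newer agents also expose the same data over the HTTP API; a minimal sketch using the Go client's Agent().Metrics() call, which assumes a reasonably recent agent and client:

```go
package main

import (
	"fmt"
	"log"

	"github.com/hashicorp/consul/api"
)

func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// Fetch the agent's current in-memory telemetry snapshot.
	info, err := client.Agent().Metrics()
	if err != nil {
		log.Fatal(err)
	}

	for _, g := range info.Gauges {
		fmt.Printf("gauge   %-45s %v\n", g.Name, g.Value)
	}
	for _, c := range info.Counters {
		// Counters include the serf/memberlist event counts relevant here.
		fmt.Printf("counter %-45s count=%d\n", c.Name, c.Count)
	}
}
```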
Here is the telemetry:
I do have a health check that fails consistently on my standby Postgres server. It runs every 10s and checks to see if the service is running. Its output does not change.
+1
Here is a tcpdump of traffic on port 8300 from the leader:
The Consul servers are:
We've been seeing similar issues on servers that are under relatively high load and doing a lot of network IO. Out of curiosity, what is your GOMAXPROCS set at?
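For context: with the Go releases Consul was built against at the time, GOMAXPROCS defaulted to 1 unless the GOMAXPROCS environment variable was set (Go 1.5 later changed the default to the number of CPUs). A small sketch of how a Go process resolves it:

```go
package main

import (
	"fmt"
	"os"
	"runtime"
)

func main() {
	// GOMAXPROCS(0) queries the current value without changing it.
	fmt.Println("GOMAXPROCS env var:", os.Getenv("GOMAXPROCS"))
	fmt.Println("effective value:   ", runtime.GOMAXPROCS(0))
	fmt.Println("logical CPUs:      ", runtime.NumCPU())
}
```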
It should be set to 2, but looking at our configs, I am not certain that is the case now. I am starting to think this is related to performance, because we aren't seeing this issue in production, where we run larger instances. However, average CPU utilization is less than 5%, so I am leaning towards network IO as the suspect. The strange thing is that I am only seeing it in some networks but not others, even though they are configured with the exact same Ansible roles. I can't seem to find the difference.
Pretty sure this and #1212 are related.
My understanding of Raft and Serf is limited, but would it be possible for a server doing a lot of network IO to appear unresponsive to enough of the peers in the cluster that it would flap between healthy and unhealthy states? This makes me really wish we had some kind of high-level visualization tool for the Serf cluster to monitor the state of the underlying member list.
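Short of a real visualization tool, the underlying member list can at least be dumped from any agent. A minimal sketch with the Go API client; the status-code mapping below follows Serf's member states (none/alive/leaving/left/failed) and should be treated as an assumption:

```go
package main

import (
	"fmt"
	"log"

	"github.com/hashicorp/consul/api"
)

// Serf member status codes as reported by the agent (assumed mapping).
var statusNames = map[int]string{
	0: "none",
	1: "alive",
	2: "leaving",
	3: "left",
	4: "failed",
}

func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// false = LAN gossip pool; pass true for the WAN pool of servers.
	members, err := client.Agent().Members(false)
	if err != nil {
		log.Fatal(err)
	}

	for _, m := range members {
		fmt.Printf("%-25s %-15s %s\n", m.Name, m.Addr, statusNames[m.Status])
	}
}
```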
I confirmed that this is set to 2. I never see this in our larger production environment, but it happens constantly where we are using t2.micro/t2.small instances, even when all servers in the VPC are idle at night. Average CPU usage is < 3%.
Are there any updates on this, or any ways to further debug similar issues? We are also seeing a lot of Serf health failures.
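One way to keep an eye on these from the outside is to poll for checks that are currently critical; the gossip-level check is registered on every node with the ID serfHealth. A rough sketch, assuming the Go API client and a local agent:

```go
package main

import (
	"fmt"
	"log"

	"github.com/hashicorp/consul/api"
)

func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// List every check in the "critical" state, cluster-wide.
	checks, _, err := client.Health().State(api.HealthCritical, nil)
	if err != nil {
		log.Fatal(err)
	}

	for _, c := range checks {
		// The built-in gossip check has CheckID "serfHealth".
		if c.CheckID == "serfHealth" {
			fmt.Printf("node %s failing serf health: %s\n", c.Node, c.Output)
		}
	}
}
```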
I feel we need a tracing system to track the requests. How about using Appdash for tracing:
I'm going to close this out as Consul 0.7 introduced Lifeguard enhancements to the gossip layer to combat these kinds of issues. We also did some work to allow you to relax the Raft timing if you want to run on smaller, slightly overloaded servers - https://www.consul.io/docs/guides/performance.html.
I am seeing intermittent Serf failures across all my nodes. There doesn't seem to be any real consistency: random nodes, random times, and very short-lived failures.
Here are the network ACL rules for my subnet:
Every machine in the VPC has this security group:
The servers have these rules defined as well:
Here is my Consul server config:
What am I missing?