"consul members" command hangs on the agents #4011

Closed
neerajb opened this issue Apr 4, 2018 · 13 comments
Labels: type/bug (Feature does not function as expected)
Milestone: Next

Comments

neerajb commented Apr 4, 2018

We have a 5000-node Consul cluster: 5 servers and the rest clients. After running them for about 48 hours, the number of "failed" nodes reported by the "consul members" command has been increasing gradually (1000 so far).
On the failed nodes these are the observations:

  1. The "consul members" command and the "agent/self" API hang and are non-responsive (a bounded-timeout probe sketch is included at the end of this comment).
  2. The RSS size of the consul agent process is between 1 GB and 2.5 GB, compared to ~40 MB on the responsive nodes.

Version of client and server - 1.0.2

Output of consul monitor --log-level=DEBUG
Heap dump taken on 1 server when RSS size was around 275 MB
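
For anyone hitting the same symptom: the hang described in point 1 can be detected without blocking forever by probing the local agent's /v1/agent/self endpoint with a bounded timeout. This is a hypothetical sketch, not part of the original report, and it assumes the default HTTP API address 127.0.0.1:8500:

    package main

    import (
        "fmt"
        "net/http"
        "time"
    )

    func main() {
        // Give up after 5 seconds instead of hanging indefinitely when the
        // agent's HTTP API has stalled, as described in point 1 above.
        client := &http.Client{Timeout: 5 * time.Second}
        resp, err := client.Get("http://127.0.0.1:8500/v1/agent/self")
        if err != nil {
            fmt.Println("agent unresponsive:", err)
            return
        }
        defer resp.Body.Close()
        fmt.Println("agent responded:", resp.Status)
    }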

mkeeler (Member) commented Apr 4, 2018

@neerajb What platform was this on? I am looking to pull the right binary from https://releases.hashicorp.com/consul/1.0.2/ to match the heap dump you gathered.

neerajb (Author) commented Apr 4, 2018

@mkeeler The binary is amd64, running on Debian 8.

mkeeler added the type/bug (Feature does not function as expected) label Apr 5, 2018
mkeeler (Member) commented Apr 6, 2018

After looking over all of this, a few things stand out.

  1. In the monitor log there are many messages being dropped because queues are filling up. That internal queue should be processed fairly quickly under normal circumstances unless the CPU on that system is overloaded.
  2. Other entries in the monitor log make it look like there are intermittent networking issues. In particular, I saw this:

2018/04/04 15:16:31 [WARN] memberlist: Was able to connect to g1-kvm-208266 but other probes failed, network may be misconfigured

  3. The monitor log contains ping message timeouts when the agent attempts to contact other nodes in the cluster. This could be high network latency or CPU starvation on the other systems.
  4. The node the heap dump was taken on appears to be processing many different full state syncs (over 130). Under normal circumstances this shouldn't happen, as each node picks a random non-failed node in the cluster to perform a full state sync with every 30 seconds (a rough sketch of that selection loop follows below). It's unlikely that over 130 nodes just happened to pick that one node and made all their requests at the same time. Instead it looks like these requests are taking longer than usual, again pointing to potential CPU overload.
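
To make the expectation in point 4 concrete, here is a rough, illustrative sketch of that periodic push/pull selection. It uses made-up function names and is not the actual serf/memberlist code:

    package sketch

    import (
        "log"
        "math/rand"
        "time"
    )

    // pushPullLoop: roughly every interval (~30s, per the description above),
    // pick one random non-failed peer and run a full state sync with it.
    // With thousands of nodes, no single node should normally be on the
    // receiving end of 130+ of these syncs at once unless the syncs
    // themselves are taking far longer than expected to finish.
    func pushPullLoop(alivePeers func() []string, syncWith func(peer string) error, interval time.Duration) {
        for {
            time.Sleep(interval)
            peers := alivePeers()
            if len(peers) == 0 {
                continue
            }
            peer := peers[rand.Intn(len(peers))]
            if err := syncWith(peer); err != nil {
                log.Printf("[WARN] push/pull with %s failed: %v", peer, err)
            }
        }
    }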

Given that, my next batch of questions is:

  1. On the failing nodes, what does the overall CPU utilization look like compared to the healthy systems?
  2. Could there be any packet loss or high latency between the nodes in the cluster?

neerajb (Author) commented Apr 9, 2018

  1. Overall CPU utilization is very low. Here is a sar sample:
    sar.txt

  2. Packet loss also seems improbable. Here is a ping sample between two nodes:
    ping.txt

Also, I can see >9900 goroutines inside the consul agent process stuck in
"/goroot/src/runtime/sema.go:56 sync.runtime_Semacquire (0x43ee19)"

It seems most of the goroutines are stuck trying to acquire a read lock on the serf memberLock mutex.
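
For reference, a goroutine dump like the one summarized above can be produced from inside any Go process with the standard runtime/pprof package. A minimal sketch (general Go usage, not Consul-specific):

    package main

    import (
        "os"
        "runtime/pprof"
    )

    func main() {
        // debug=2 prints the full stack of every goroutine, which is how you
        // can count how many are parked in sync.runtime_Semacquire waiting on
        // a lock such as serf's memberLock.
        pprof.Lookup("goroutine").WriteTo(os.Stderr, 2)
    }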

mkeeler (Member) commented Apr 9, 2018

@neerajb Thanks for the update. 9900+ goroutines waiting on that lock could certainly cause this behavior. I am going to dig into the code a bit more to audit the locking and will get back to you.

mkeeler (Member) commented Apr 9, 2018

@neerajb Could you grab the output of:
go tool pprof -proto -output consul_mutex.prof /path/to/your/consul/binary http://<your agent:port>/debug/pprof/mutex

I am hoping it will have some information about lock contention and what is holding the locks for so long.
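
One general Go caveat, offered as an assumption rather than something verified against this Consul build: the /debug/pprof/mutex profile only contains samples if mutex profiling has been enabled inside the process, for example:

    package main

    import "runtime"

    func init() {
        // Record roughly 1 out of every 5 mutex contention events.
        // The default fraction is 0, which disables mutex profiling entirely.
        runtime.SetMutexProfileFraction(5)
    }

    func main() {}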

neerajb (Author) commented Apr 10, 2018

@mkeeler We seem to have found the issue. There is a deadlock scenario being triggered.
The Stats() API in serf.go takes an RLock on memberLock and then waits for the TransmitLimitedQueue q.Lock inside NumQueued() in memberlist,
while getBroadcasts() in memberlist has taken q.Lock and is waiting for an RLock on memberLock.

The Stats() API was hit via the /agent/self HTTP API, which was triggered by our health-check script.
Disabling this script has fixed the issue for us for the time being.
I think that releasing the memberLock inside the Stats() API before trying to acquire the TransmitLimitedQueue lock should fix the issue (an illustrative sketch follows below). I would be happy to submit a PR for this.
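
To make the cycle easier to see, here is a simplified illustration of the lock ordering described above, using made-up types rather than the actual serf/memberlist code. Note that with Go's sync.RWMutex, once a writer is blocked in Lock(), new RLock() calls also block, which is what allows two read-lock paths to wedge against each other here:

    package sketch

    import "sync"

    // cluster stands in for the serf/memberlist pair: memberLock guards the
    // member list, queueLock guards the broadcast queue (TransmitLimitedQueue).
    type cluster struct {
        memberLock sync.RWMutex
        members    []string

        queueLock sync.Mutex
        queue     [][]byte
    }

    // statsDeadlockProne mirrors the problematic ordering: it still holds the
    // memberLock read lock while it reaches for the queue lock. If another
    // goroutine (the getBroadcasts path) already holds queueLock and is
    // itself waiting for a memberLock read lock behind a queued writer,
    // neither side can make progress.
    func (c *cluster) statsDeadlockProne() (members, queued int) {
        c.memberLock.RLock()
        defer c.memberLock.RUnlock()
        members = len(c.members)

        c.queueLock.Lock() // second lock taken while the first is still held
        queued = len(c.queue)
        c.queueLock.Unlock()
        return
    }

    // statsFixed applies the change described above: finish with memberLock
    // and release it before touching the queue lock, so the two locks are
    // never held at the same time.
    func (c *cluster) statsFixed() (members, queued int) {
        c.memberLock.RLock()
        members = len(c.members)
        c.memberLock.RUnlock()

        c.queueLock.Lock()
        queued = len(c.queue)
        c.queueLock.Unlock()
        return
    }

Breaking the Stats() → queue-lock ordering on one side is enough to remove the cycle, which is why a serf-side change alone resolves the hang.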

neerajb (Author) commented Apr 10, 2018

@mkeeler I patched the consul binary with the following change: release the memberLock inside the Stats() API before trying to acquire the TransmitLimitedQueue lock. With the health-check script re-enabled in our cluster, we are no longer encountering the deadlock.

mkeeler (Member) commented Apr 10, 2018

@neerajb Thanks a ton. If you have a branch ready for a PR, we would be glad to take it.

neerajb (Author) commented Apr 11, 2018

@mkeeler I have created a pull request here.

mkeeler added this to the Next milestone Apr 12, 2018
mkeeler (Member) commented Apr 12, 2018

After the 1.0.7 release I am going to update the vendoring for serf to pull in these changes.

neerajb (Author) commented Apr 17, 2018

Thanks @mkeeler

pearkes (Contributor) commented May 7, 2018

Should be fixed by #4088.
