"consul members" command hangs on the agents #4011
Comments
@neerajb What platform was this on? I am looking to pull the right binary from https://releases.hashicorp.com/consul/1.0.2/ to match the heap dump you gathered.
@mkeeler The binary is amd64, running on Debian 8.
After looking over all of this, there are a few things that stand out.
So given that, my next batch of questions are:
Also, I can see > 9900 goroutines inside the consul agent process that are stuck. It seems most of the goroutines are stuck trying to acquire a read lock on the serf.memberLock mutex.
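(For context, a minimal illustrative sketch, not part of the original report: a goroutine dump of the kind described can be produced from a Go process with the standard runtime/pprof "goroutine" profile. Debug level 2 prints every goroutine's full stack, which is how thousands of goroutines parked in sync.(*RWMutex).RLock show up.)

```go
package main

import (
	"os"
	"runtime/pprof"
)

func main() {
	// debug=2 prints an unabridged stack trace for every goroutine,
	// including the primitive each one is blocked on
	// (e.g. sync.(*RWMutex).RLock).
	pprof.Lookup("goroutine").WriteTo(os.Stderr, 2)
}
```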
@neerajb Thanks for the update. 9900+ goroutines hanging out waiting for the lock could certainly cause this behavior too. I am going to dig into the code a bit more to audit the locking and will get back to you.
@neerajb Could you grab the output? I am hoping it will have some information about lock contention and what is holding the locks for so long.
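(The requested command was dropped from this transcript; presumably it was a pprof query. As a hedged sketch of how lock-contention data surfaces in a Go program, and not Consul's actual configuration: Go records a mutex profile once sampling is enabled, and a blank import of net/http/pprof serves it over HTTP. The port below is this sketch's choice.)

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* handlers on the default mux
	"runtime"
)

func main() {
	// Record every mutex contention event (1 = all; higher values sample less).
	runtime.SetMutexProfileFraction(1)

	// The contention report can then be fetched from
	// http://localhost:6060/debug/pprof/mutex
	log.Fatal(http.ListenAndServe("localhost:6060", nil))
}
```

With that in place, `go tool pprof http://localhost:6060/debug/pprof/mutex` shows which call sites are holding contended locks.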
@mkeeler I seem to have found the issue: there is a deadlock scenario being triggered. The Stats() API was hit via the /agent/self HTTP API, which was triggered by our healthcheck script.
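(The stack traces are not reproduced in this thread, so the following is an illustration only, not serf's actual code: one classic way a Go sync.RWMutex produces exactly this pile-up is a nested read lock racing a queued writer. Go's RWMutex blocks new readers once a writer is waiting, so the nested RLock deadlocks, and every subsequent reader queues behind the writer as well.)

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

var mu sync.RWMutex

// stats re-acquires the read lock while its caller already holds it.
func stats() int {
	mu.RLock() // blocks forever once a writer is queued
	defer mu.RUnlock()
	return 42
}

func main() {
	mu.RLock() // reader enters
	go func() {
		mu.Lock() // writer queues; new readers are now excluded
		mu.Unlock()
	}()
	time.Sleep(100 * time.Millisecond) // let the writer queue up
	// Deadlock: stats() waits on the writer, the writer waits on the
	// outer RLock, and the outer RLock is never released.
	fmt.Println(stats())
	mu.RUnlock()
}
```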
@mkeeler I patched the consul binary with the following change
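(The actual diff was not captured in this thread; the upstream fix was later vendored into Consul, per the comments below. As a hypothetical sketch only of the usual remedy for the pattern above: give lock-holding internal callers an unlocked variant so the read lock is acquired exactly once per call path. All names here are stand-ins, not the real serf patch.)

```go
package main

import (
	"fmt"
	"sync"
)

// Minimal stand-in for the real serf.Serf type; fields are illustrative.
type Serf struct {
	memberLock sync.RWMutex
	members    map[string]struct{}
}

// Stats is the only entry point that takes the lock, so no call path
// can re-enter RLock while a writer is queued.
func (s *Serf) Stats() map[string]string {
	s.memberLock.RLock()
	defer s.memberLock.RUnlock()
	return s.statsLocked()
}

// statsLocked does the actual work and assumes memberLock is already held.
func (s *Serf) statsLocked() map[string]string {
	return map[string]string{"members": fmt.Sprintf("%d", len(s.members))}
}

func main() {
	s := &Serf{members: map[string]struct{}{"a": {}, "b": {}}}
	fmt.Println(s.Stats())
}
```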
@neerajb Thanks a ton. If you have a branch ready for a PR, we would be glad to take it.
After the 1.0.7 release I am going to update the vendoring for serf to pull in these changes. |
Thanks @mkeeler
Should be fixed by #4088. |
We have a 5000-node Consul cluster: 5 servers and the rest clients. After running them for about 48 hours, the number of "failed" nodes reported by the "consul members" command has been increasing gradually (1000 so far).
On the failed nodes, these are the observations:
Version of client and server: 1.0.2
Output of consul monitor --log-level=DEBUG
Heap dump taken on one server when the RSS size was around 275 MB