"consul members" command hangs on the agents #4011

Closed
neerajb opened this issue Apr 4, 2018 · 13 comments
Labels: type/bug (Feature does not function as expected)
Milestone: Next

Comments

neerajb commented Apr 4, 2018

We have a 5000-node Consul cluster: 5 servers and the rest clients. After running them for about 48 hours, the number of "failed" nodes reported by the "consul members" command has been increasing gradually (1000 so far).
On the failed nodes these are the observations:

  1. The "consul members" command and the "agent/self" API hang and are non-responsive (a bounded-timeout probe sketch is included at the end of this comment).
  2. The RSS size of the consul agent process is between 1 GB and 2.5 GB, compared to ~40 MB on the responsive nodes.

Version of client and server - 1.0.2

Output of consul monitor --log-level=DEBUG
Heap dump taken on 1 server when RSS size was around 275 MB
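
For anyone hitting the same symptom: the hang described in point 1 can be detected without blocking forever by probing the local agent's /v1/agent/self endpoint with a bounded timeout. This is a hypothetical sketch, not part of the original report, and it assumes the default HTTP API address 127.0.0.1:8500:

    package main

    import (
        "fmt"
        "net/http"
        "time"
    )

    func main() {
        // Give up after 5 seconds instead of hanging indefinitely when the
        // agent's HTTP API has stalled, as described in point 1 above.
        client := &http.Client{Timeout: 5 * time.Second}
        resp, err := client.Get("http://127.0.0.1:8500/v1/agent/self")
        if err != nil {
            fmt.Println("agent unresponsive:", err)
            return
        }
        defer resp.Body.Close()
        fmt.Println("agent responded:", resp.Status)
    }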

mkeeler (Member) commented Apr 4, 2018

@neerajb What platform was this on? I am looking to pull the right binary from https://releases.hashicorp.com/consul/1.0.2/ to match the heap dump you gathered.

neerajb (Author) commented Apr 4, 2018

@mkeeler The binary is amd64, running on Debian 8.

mkeeler added the type/bug (Feature does not function as expected) label Apr 5, 2018
mkeeler (Member) commented Apr 6, 2018

After looking over all of this, a few things stand out.

  1. In the monitor log there are many messages being dropped because queues are filling up. That internal queue should be processed fairly quickly under normal circumstances unless the CPU on that system is overloaded.
  2. Other entries in the monitor log make it look like there are intermittent networking issues. In particular, I saw this:

2018/04/04 15:16:31 [WARN] memberlist: Was able to connect to g1-kvm-208266 but other probes failed, network may be misconfigured

  3. The monitor log contains ping message timeouts when the agent attempts to contact other nodes in the cluster. This could be high network latency or CPU starvation on the other systems.
  4. The node the heap dump was taken on appears to be processing many different full state syncs (over 130). Under normal circumstances this shouldn't happen, as each node picks a random non-failed node in the cluster to perform a full state sync with every 30 seconds (a rough sketch of that selection loop follows below). It's unlikely that over 130 nodes just happened to pick that one node and made all their requests at the same time. Instead it looks like these requests are taking longer than usual, again pointing to potential CPU overload.
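
To make the expectation in point 4 concrete, here is a rough, illustrative sketch of that periodic push/pull selection. It uses made-up function names and is not the actual serf/memberlist code:

    package sketch

    import (
        "log"
        "math/rand"
        "time"
    )

    // pushPullLoop: roughly every interval (~30s, per the description above),
    // pick one random non-failed peer and run a full state sync with it.
    // With thousands of nodes, no single node should normally be on the
    // receiving end of 130+ of these syncs at once unless the syncs
    // themselves are taking far longer than expected to finish.
    func pushPullLoop(alivePeers func() []string, syncWith func(peer string) error, interval time.Duration) {
        for {
            time.Sleep(interval)
            peers := alivePeers()
            if len(peers) == 0 {
                continue
            }
            peer := peers[rand.Intn(len(peers))]
            if err := syncWith(peer); err != nil {
                log.Printf("[WARN] push/pull with %s failed: %v", peer, err)
            }
        }
    }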

Given that, my next batch of questions is:

  1. On the failing nodes, what does the overall CPU utilization look like compared to the healthy systems?
  2. Could there be any packet loss or high latency between the nodes in the cluster?

neerajb (Author) commented Apr 9, 2018

  1. Overall CPU utilization is very low. Here is a sar sample:
    sar.txt

  2. Packet loss also seems improbable. Here is a ping sample between two nodes:
    ping.txt

Also, I can see >9900 goroutines inside the consul agent process stuck in
"/goroot/src/runtime/sema.go:56 sync.runtime_Semacquire (0x43ee19)"

It seems most of the goroutines are stuck trying to acquire a read lock on the serf memberLock mutex.
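
For reference, a goroutine dump like the one summarized above can be produced from inside any Go process with the standard runtime/pprof package. A minimal sketch (general Go usage, not Consul-specific):

    package main

    import (
        "os"
        "runtime/pprof"
    )

    func main() {
        // debug=2 prints the full stack of every goroutine, which is how you
        // can count how many are parked in sync.runtime_Semacquire waiting on
        // a lock such as serf's memberLock.
        pprof.Lookup("goroutine").WriteTo(os.Stderr, 2)
    }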

mkeeler (Member) commented Apr 9, 2018

@neerajb Thanks for the update. 9900+ goroutines waiting on that lock could certainly cause this behavior. I am going to dig into the code a bit more to audit the locking and will get back to you.

mkeeler (Member) commented Apr 9, 2018

@neerajb Could you grab the output of:
go tool pprof -proto -output consul_mutex.prof /path/to/your/consul/binary http://<your agent:port>/debug/pprof/mutex

I am hoping it will have some information about lock contention and what is holding the locks for so long.
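
One general Go caveat, offered as an assumption rather than something verified against this Consul build: the /debug/pprof/mutex profile only contains samples if mutex profiling has been enabled inside the process, for example:

    package main

    import "runtime"

    func init() {
        // Record roughly 1 out of every 5 mutex contention events.
        // The default fraction is 0, which disables mutex profiling entirely.
        runtime.SetMutexProfileFraction(5)
    }

    func main() {}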

neerajb (Author) commented Apr 10, 2018

@mkeeler We seem to have found the issue. There is a deadlock scenario being triggered.
The Stats() API in serf.go takes an RLock on memberLock and then waits for the TransmitLimitedQueue q.Lock inside NumQueued() in memberlist,
while getBroadcasts() in memberlist has taken q.Lock and is waiting for an RLock on memberLock.

The Stats() API was hit via the /agent/self HTTP API, which was triggered by our health-check script.
Disabling this script has fixed the issue for us for the time being.
I think that releasing the memberLock inside the Stats() API before trying to acquire the TransmitLimitedQueue lock should fix the issue (an illustrative sketch follows below). I would be happy to submit a PR for this.
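
To make the cycle easier to see, here is a simplified illustration of the lock ordering described above, using made-up types rather than the actual serf/memberlist code. Note that with Go's sync.RWMutex, once a writer is blocked in Lock(), new RLock() calls also block, which is what allows two read-lock paths to wedge against each other here:

    package sketch

    import "sync"

    // cluster stands in for the serf/memberlist pair: memberLock guards the
    // member list, queueLock guards the broadcast queue (TransmitLimitedQueue).
    type cluster struct {
        memberLock sync.RWMutex
        members    []string

        queueLock sync.Mutex
        queue     [][]byte
    }

    // statsDeadlockProne mirrors the problematic ordering: it still holds the
    // memberLock read lock while it reaches for the queue lock. If another
    // goroutine (the getBroadcasts path) already holds queueLock and is
    // itself waiting for a memberLock read lock behind a queued writer,
    // neither side can make progress.
    func (c *cluster) statsDeadlockProne() (members, queued int) {
        c.memberLock.RLock()
        defer c.memberLock.RUnlock()
        members = len(c.members)

        c.queueLock.Lock() // second lock taken while the first is still held
        queued = len(c.queue)
        c.queueLock.Unlock()
        return
    }

    // statsFixed applies the change described above: finish with memberLock
    // and release it before touching the queue lock, so the two locks are
    // never held at the same time.
    func (c *cluster) statsFixed() (members, queued int) {
        c.memberLock.RLock()
        members = len(c.members)
        c.memberLock.RUnlock()

        c.queueLock.Lock()
        queued = len(c.queue)
        c.queueLock.Unlock()
        return
    }

Breaking the Stats() → queue-lock ordering on one side is enough to remove the cycle, which is why a serf-side change alone resolves the hang.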

neerajb (Author) commented Apr 10, 2018

@mkeeler I patched the consul binary with the following change: release the memberLock inside the Stats() API before trying to acquire the TransmitLimitedQueue lock. With the health-check script re-enabled in our cluster, we are no longer encountering the deadlock.

mkeeler (Member) commented Apr 10, 2018

@neerajb Thanks a ton. If you have a branch ready for a PR, we would be glad to take it.

neerajb (Author) commented Apr 11, 2018

@mkeeler I have created a pull request here.

mkeeler added this to the Next milestone Apr 12, 2018
mkeeler (Member) commented Apr 12, 2018

After the 1.0.7 release I am going to update the vendoring for serf to pull in these changes.

neerajb (Author) commented Apr 17, 2018

Thanks @mkeeler

pearkes (Contributor) commented May 7, 2018

Should be fixed by #4088.
