Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix for deadlock between stats() in serf and getBroadcasts() in memberlist #507

Merged
merged 1 commit into from
Apr 11, 2018

Conversation

neerajb
Copy link
Contributor

@neerajb neerajb commented Apr 11, 2018

Stats function can cause deadlock by holding on to the readLock on memberLock while attempting to read the length of broadcast queues. This is described in this issue.
This fix releases the memberLock before invoking the broadcast queues.
This is the stacktrace of the 2 goroutines that are stuck in the deadlock

GoRoutine 1 (waiting for memberlock.RLock in serf.go)

0 0x000000000042d6ec in runtime.gopark
at /goroot/src/runtime/proc.go:288
1 0x000000000042d7de in runtime.goparkunlock
at /goroot/src/runtime/proc.go:293
2 0x000000000043ed74 in runtime.semacquire1
at /goroot/src/runtime/sema.go:144
3 0x000000000043e999 in sync.runtime_Semacquire
at /goroot/src/runtime/sema.go:56
4 0x0000000000473b79 in sync.(*RWMutex).RLock
at /goroot/src/sync/rwmutex.go:50
5 0x00000000008fd216 in github.com/hashicorp/consul/vendor/github.com/hashicorp/serf/serf.(*Serf).NumNodes
at /gopath/src/github.com/hashicorp/consul/vendor/github.com/hashicorp/serf/serf/serf.go:1749
6 0x000000000090181a in github.com/hashicorp/consul/vendor/github.com/hashicorp/serf/serf.(*Serf).NumNodes-fm
at /gopath/src/github.com/hashicorp/consul/vendor/github.com/hashicorp/serf/serf/serf.go:342
7 0x00000000008da16c in github.com/hashicorp/consul/vendor/github.com/hashicorp/memberlist.(*TransmitLimitedQueue).GetBroadcasts
at /gopath/src/github.com/hashicorp/consul/vendor/github.com/hashicorp/memberlist/queue.go:80
8 0x00000000008e8df1 in github.com/hashicorp/consul/vendor/github.com/hashicorp/serf/serf.(*delegate).GetBroadcasts
at /gopath/src/github.com/hashicorp/consul/vendor/github.com/hashicorp/serf/serf/delegate.go:124
9 0x00000000008c9c39 in github.com/hashicorp/consul/vendor/github.com/hashicorp/memberlist.(*Memberlist).getBroadcasts
at /gopath/src/github.com/hashicorp/consul/vendor/github.com/hashicorp/memberlist/broadcast.go:88
10 0x00000000008de833 in github.com/hashicorp/consul/vendor/github.com/hashicorp/memberlist.(*Memberlist).gossip
at /gopath/src/github.com/hashicorp/consul/vendor/github.com/hashicorp/memberlist/state.go:511
11 0x00000000008e56fa in github.com/hashicorp/consul/vendor/github.com/hashicorp/memberlist.(*Memberlist).(github.com/hashicorp/consul/vendor/github.com/hashicorp/memberlist.gossip)-fm
at /gopath/src/github.com/hashicorp/consul/vendor/github.com/hashicorp/memberlist/state.go:111
12 0x00000000008dc18f in github.com/hashicorp/consul/vendor/github.com/hashicorp/memberlist.(*Memberlist).triggerFunc
at /gopath/src/github.com/hashicorp/consul/vendor/github.com/hashicorp/memberlist/state.go:135
13 0x000000000045ce61 in runtime.goexit
at /goroot/src/runtime/asm_amd64.s:2337

Goroutine 2 (waiting for q.Lock in queue.go)

0 0x000000000042d6ec in runtime.gopark
at /goroot/src/runtime/proc.go:288
1 0x000000000042d7de in runtime.goparkunlock
at /goroot/src/runtime/proc.go:293
2 0x000000000043ed74 in runtime.semacquire1
at /goroot/src/runtime/sema.go:144
3 0x000000000043ea8d in sync.runtime_SemacquireMutex
at /goroot/src/runtime/sema.go:71
4 0x0000000000472c5e in sync.(*Mutex).Lock
at /goroot/src/sync/mutex.go:134
5 0x00000000008da5c3 in github.com/hashicorp/consul/vendor/github.com/hashicorp/memberlist.(*TransmitLimitedQueue).NumQueued
at /gopath/src/github.com/hashicorp/consul/vendor/github.com/hashicorp/memberlist/queue.go:116
6 0x00000000008fc40b in github.com/hashicorp/consul/vendor/github.com/hashicorp/serf/serf.(*Serf).Stats
at /gopath/src/github.com/hashicorp/consul/vendor/github.com/hashicorp/serf/serf/serf.go:1682
7 0x00000000009c7c2d in github.com/hashicorp/consul/agent/consul.(*Client).Stats
at /gopath/src/github.com/hashicorp/consul/agent/consul/client.go:346
8 0x0000000000f2ceaf in github.com/hashicorp/consul/agent.(*Agent).Stats
at /gopath/src/github.com/hashicorp/consul/agent/agent.go:2034
9 0x0000000000f3235d in github.com/hashicorp/consul/agent.(*HTTPServer).AgentSelf
at /gopath/src/github.com/hashicorp/consul/agent/agent_endpoint.go:75
10 0x0000000000f72c3f in github.com/hashicorp/consul/agent.(*HTTPServer).handler.func2
at /gopath/src/github.com/hashicorp/consul/agent/http.go:104

@mkeeler
Copy link
Member

mkeeler commented Apr 11, 2018

So at first glance this didn't seem like it should be a problem because the memberLock is a RWMutex. However we could still get into this scenario when Stats gains a read lock on memberLock prior to the second RLock call from within NumNodes then an event handler for other gossip related activities attempts to gain a full lock on the memberLock. Then according to the RWMutex documentation as soon as something tries to get that full lock, future read locks even recursive ones from the same goroutine that already holds a lock will block. Essentially this problem boils down the fact that the RWMutex can't be used as a normal recursive/reentrant lock. This PR fixes the problem by making sure to release the current read lock prior to doing any operations which may acquire another read lock to prevent recursive usage.

Thanks again @neerajb for finding and fixing this.

@mkeeler mkeeler self-requested a review April 11, 2018 15:19
Copy link
Member

@mkeeler mkeeler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me.

Copy link

@preetapan preetapan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me

@mkeeler mkeeler merged commit 75eaa37 into hashicorp:master Apr 11, 2018
@neerajb neerajb deleted the fix_stats_deadlock branch April 12, 2018 04:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants