-
Notifications
You must be signed in to change notification settings - Fork 9.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
etcdserver: make WaitGroup.Add sync with Wait #6628
Conversation
@@ -649,7 +651,7 @@ func (s *EtcdServer) run() { | |||
sched.Stop() | |||
|
|||
// wait for gouroutines before closing raft so wal stays open | |||
s.wg.Wait() | |||
s.goWait() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
since it's rejecting goroutines now, would it be cleaner to do this?
s.wgMu.Lock()
close(s.stopping)
s.wgMu.Unlock()
s.wg.Wait()
sched.Stop()
goAttach() {
s.wgMu.RLock()
defer s.mgMu.RUnlock()
select {
case <-s.stopping:
plog.Warning(...)
return
default:
}
...
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This breaks etcdserver/server_test.go TestTriggerSnap
tests since it's returning before goAttach
triggers snapshot. Then should we change the test? I will investigate more. Thanks.
I do no think this pull request is necessary. Previously the sync is driven by the raft routine, which runs concurrently with server routine. Thus we might call wg.Add and wg.Wait concurrently. Now we actually move the sync thing into server routine, so it wont be called concurrently. Right? |
One potential problem is actually at line706, when we call goAttach in another routine we just created. |
|
@heyitsanthony @gyuho OK. I will give this another look. |
d9fb22c
to
4771ac5
Compare
@heyitsanthony @xiang90 Fixed as Anthony suggested. PTAL. Thanks. |
@@ -235,6 +235,9 @@ type EtcdServer struct { | |||
|
|||
msgSnapC chan raftpb.Message | |||
|
|||
// wgMu blocks concurrent waitgroup mutation while server stopping | |||
wgMu sync.RWMutex | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove this empty line? group this together with wg.
|
||
// wait for gouroutines before closing raft so wal stays open | ||
s.wg.Wait() | ||
|
||
sched.Stop() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
any reason we change the ordering here? why wait then stop?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reverted that change. Thanks.
<-donec | ||
srv.Stop() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why move this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
goAttach
of running snapshot operation exits (not executing the snapshot) from EtcdServer.Stop
that is called before this line https://github.com/coreos/etcd/blob/master/etcdserver/server.go#L1390. So the test was blocking forever to wait for snapshot to finish.
I think we should wait for snapshot to finish first (<-donec
) and then stop the server?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, fair enough
lgtm |
LGTM. Thanks! |
This PR just helped me with the same problem I encountered in a completely unrelated project. Thanks! |
Fix #6534.
After stopping the cluster, I think it's ok to drop other remaining goroutines
i.e., we might drop goroutines here https://github.com/coreos/etcd/blob/master/etcdserver/server.go#L1503-L1510 when it misses to receive from
<-s.stopping
/cc @xiang90 @heyitsanthony
Any feedback?