Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

etcdserver: make WaitGroup.Add sync with Wait #6628

Merged
merged 1 commit into from
Oct 12, 2016

Conversation

gyuho
Copy link
Contributor

@gyuho gyuho commented Oct 11, 2016

Fix #6534.

After stopping the cluster, I think it's ok to drop other remaining goroutines

i.e., we might drop goroutines here https://github.com/coreos/etcd/blob/master/etcdserver/server.go#L1503-L1510 when it misses to receive from <-s.stopping

/cc @xiang90 @heyitsanthony

Any feedback?

@@ -649,7 +651,7 @@ func (s *EtcdServer) run() {
sched.Stop()

// wait for gouroutines before closing raft so wal stays open
s.wg.Wait()
s.goWait()
Copy link
Contributor

@heyitsanthony heyitsanthony Oct 11, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since it's rejecting goroutines now, would it be cleaner to do this?

s.wgMu.Lock()
close(s.stopping)
s.wgMu.Unlock()
s.wg.Wait()

sched.Stop()
goAttach() {
s.wgMu.RLock()
defer s.mgMu.RUnlock()
select {
case <-s.stopping:
    plog.Warning(...)
    return
default:
}
...
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This breaks etcdserver/server_test.go TestTriggerSnap tests since it's returning before goAttach triggers snapshot. Then should we change the test? I will investigate more. Thanks.

@xiang90
Copy link
Contributor

xiang90 commented Oct 11, 2016

I do no think this pull request is necessary. Previously the sync is driven by the raft routine, which runs concurrently with server routine. Thus we might call wg.Add and wg.Wait concurrently. Now we actually move the sync thing into server routine, so it wont be called concurrently. Right?

@xiang90
Copy link
Contributor

xiang90 commented Oct 11, 2016

One potential problem is actually at line706, when we call goAttach in another routine we just created.

@heyitsanthony
Copy link
Contributor

monitorVersions will also race, this is probably necessary

@xiang90
Copy link
Contributor

xiang90 commented Oct 11, 2016

@heyitsanthony @gyuho OK. I will give this another look.

@gyuho gyuho force-pushed the fix-waitgroup branch 2 times, most recently from d9fb22c to 4771ac5 Compare October 12, 2016 18:51
@gyuho
Copy link
Contributor Author

gyuho commented Oct 12, 2016

@heyitsanthony @xiang90 Fixed as Anthony suggested.

PTAL. Thanks.

@@ -235,6 +235,9 @@ type EtcdServer struct {

msgSnapC chan raftpb.Message

// wgMu blocks concurrent waitgroup mutation while server stopping
wgMu sync.RWMutex

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove this empty line? group this together with wg.


// wait for gouroutines before closing raft so wal stays open
s.wg.Wait()

sched.Stop()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

any reason we change the ordering here? why wait then stop?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reverted that change. Thanks.

<-donec
srv.Stop()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why move this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

goAttach of running snapshot operation exits (not executing the snapshot) from EtcdServer.Stop that is called before this line https://github.com/coreos/etcd/blob/master/etcdserver/server.go#L1390. So the test was blocking forever to wait for snapshot to finish.

I think we should wait for snapshot to finish first (<-donec) and then stop the server?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, fair enough

@heyitsanthony
Copy link
Contributor

lgtm

@xiang90
Copy link
Contributor

xiang90 commented Oct 12, 2016

LGTM. Thanks!

@gyuho gyuho merged commit 614adb0 into etcd-io:master Oct 12, 2016
@gyuho gyuho deleted the fix-waitgroup branch October 12, 2016 21:10
@warent
Copy link

warent commented Sep 7, 2017

This PR just helped me with the same problem I encountered in a completely unrelated project. Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

4 participants