Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix flannel hang if lease expired #1334

Merged
merged 1 commit into from
Mar 18, 2021
Merged

Conversation

chenchun
Copy link
Contributor

Description

Flannel sometimes hang when lease expired. I checked goroutines, it hangs on https://github.com/coreos/flannel/blob/v0.12.0/subnet/watch.go#L59 .
I can't share the goroutine screenshot because I'm using a modified version of flannel.

Todos

  • Tests
  • Documentation
  • Release note

Release Note

Fix flannel hang if lease expired

@chenchun
Copy link
Contributor Author

The problem is that two goroutines share events via a chan, if the goroutine which reads from chan terminates before the one that writes to it, the later one would have a change to hang on writing to chan.

n.handleSubnetEvents(evtBatch)

case <-ctx.Done():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why remove this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we keep this, there is a chance this for loop ends before subnet.WatchLeases goroutine terminates, then no one will read events from evts chan, subnet.WatchLeases goroutine may happen to be writting to it. And it results in subnet.WatchLeases hangs forever on receiver <- batch.

Copy link
Contributor

@rajatchopra rajatchopra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@chenchun Looks good. But I am not sure why you want to remove the context close event catching.

@rajatchopra
Copy link
Contributor

@chenchun Do you mind joining the Community Meeting one of the times and pursue this PR? We are still not sure on ignoring cancelled contexts.
https://github.com/coreos/flannel/blob/master/README.md#community-meeting

@chenchun
Copy link
Contributor Author

I'd like to, but it seems the next meeting is two weeks later.
I write a test case to expose the bug.
Without this patch, the test hangs at subnet/watch.go:64. You can try patch my fix to see if it resolves the issue.

# curl http://127.0.0.1:6060/debug/pprof/goroutine?debug=1
goroutine profile: total 8
1 @ 0x432be0 0x406b2b 0x406b01 0x4068e5 0x5d2406 0x7aa8cf 0x460001
#       0x5d2405        github.com/coreos/flannel/subnet.WatchLeases+0x355                      /root/go/src/github.com/coreos/flannel/subnet/watch.go:64
#       0x7aa8ce        github.com/coreos/flannel/backend/extension.(*network).Run.func1+0x5e   /root/go/src/github.com/coreos/flannel/backend/extension/extension_network.go:56

1 @ 0x432be0 0x407748 0x40771e 0x40740b 0x502947 0x506298 0x502579 0x503e57 0x502db6 0x7ab115 0x43280e 0x460001
#       0x502946        testing.(*T).Run+0x376          /usr/local/go/src/testing/testing.go:961
#       0x506297        testing.runTests.func1+0x77     /usr/local/go/src/testing/testing.go:1202
#       0x502578        testing.tRunner+0xc8            /usr/local/go/src/testing/testing.go:909
#       0x503e56        testing.runTests+0x2a6          /usr/local/go/src/testing/testing.go:1200
#       0x502db5        testing.(*M).Run+0x175          /usr/local/go/src/testing/testing.go:1117
#       0x7ab114        main.main+0x134                 _testmain.go:44
#       0x43280d        runtime.main+0x21d              /usr/local/go/src/runtime/proc.go:203

1 @ 0x432be0 0x407748 0x40771e 0x40744b 0x5cfc5b 0x460001
#       0x5cfc5a        k8s.io/klog.(*loggingT).flushDaemon+0x8a        /root/go/pkg/mod/k8s.io/[email protected]/klog.go:1010

1 @ 0x432be0 0x42d7da 0x42cda5 0x4911c5 0x493b48 0x493b27 0x5954b2 0x5b26d2 0x5b1257 0x7391f6 0x738f17 0x7aa973 0x7aa922 0x460001
#       0x42cda4        internal/poll.runtime_pollWait+0x54                             /usr/local/go/src/runtime/netpoll.go:184
#       0x4911c4        internal/poll.(*pollDesc).wait+0x44                             /usr/local/go/src/internal/poll/fd_poll_runtime.go:87
#       0x493b47        internal/poll.(*pollDesc).waitRead+0x1f7                        /usr/local/go/src/internal/poll/fd_poll_runtime.go:92
#       0x493b26        internal/poll.(*FD).Accept+0x1d6                                /usr/local/go/src/internal/poll/fd_unix.go:384
#       0x5954b1        net.(*netFD).accept+0x41                                        /usr/local/go/src/net/fd_unix.go:238
#       0x5b26d1        net.(*TCPListener).accept+0x31                                  /usr/local/go/src/net/tcpsock_posix.go:139
#       0x5b1256        net.(*TCPListener).Accept+0x46                                  /usr/local/go/src/net/tcpsock.go:261
#       0x7391f5        net/http.(*Server).Serve+0x285                                  /usr/local/go/src/net/http/server.go:2896
#       0x738f16        net/http.(*Server).ListenAndServe+0xb6                          /usr/local/go/src/net/http/server.go:2825
#       0x7aa972        net/http.ListenAndServe+0x72                                    /usr/local/go/src/net/http/server.go:3080
#       0x7aa921        github.com/coreos/flannel/backend/extension.TestRun.func1+0x21  /root/go/src/github.com/coreos/flannel/backend/extension/extension_network_test.go:54

1 @ 0x432be0 0x443130 0x44311b 0x442d82 0x46fd54 0x7a90e0 0x7aaa23 0x460001
#       0x442d81        sync.runtime_Semacquire+0x41                                            /usr/local/go/src/runtime/sema.go:56
#       0x46fd53        sync.(*WaitGroup).Wait+0x63                                             /usr/local/go/src/sync/waitgroup.go:130
#       0x7a90df        github.com/coreos/flannel/backend/extension.(*network).Run+0x28f        /root/go/src/github.com/coreos/flannel/backend/extension/extension_network.go:68
#       0x7aaa22        github.com/coreos/flannel/backend/extension.TestRun.func2+0x72          /root/go/src/github.com/coreos/flannel/backend/extension/extension_network_test.go:66

1 @ 0x432be0 0x443130 0x44311b 0x442d82 0x46fd54 0x7aa80e 0x502579 0x460001
#       0x442d81        sync.runtime_Semacquire+0x41                                    /usr/local/go/src/runtime/sema.go:56
#       0x46fd53        sync.(*WaitGroup).Wait+0x63                                     /usr/local/go/src/sync/waitgroup.go:130
#       0x7aa80d        github.com/coreos/flannel/backend/extension.TestRun+0x20d       /root/go/src/github.com/coreos/flannel/backend/extension/extension_network_test.go:71
#       0x502578        testing.tRunner+0xc8                                            /usr/local/go/src/testing/testing.go:909

1 @ 0x47bb85 0x47996a 0x4920d4 0x492095 0x594b3f 0x5a81a8 0x72e968 0x460001
#       0x47bb84        syscall.Syscall+0x4                             /usr/local/go/src/syscall/asm_linux_amd64.s:18
#       0x479969        syscall.read+0x59                               /usr/local/go/src/syscall/zsyscall_linux_amd64.go:732
#       0x4920d3        syscall.Read+0x163                              /usr/local/go/src/syscall/syscall_unix.go:183
#       0x492094        internal/poll.(*FD).Read+0x124                  /usr/local/go/src/internal/poll/fd_unix.go:165
#       0x594b3e        net.(*netFD).Read+0x4e                          /usr/local/go/src/net/fd_unix.go:202
#       0x5a81a7        net.(*conn).Read+0x67                           /usr/local/go/src/net/net.go:184
#       0x72e967        net/http.(*connReader).backgroundRead+0x57      /usr/local/go/src/net/http/server.go:677

1 @ 0x546ef5 0x546d10 0x54393a 0x7a685a 0x7a7271 0x7359e4 0x7378bd 0x738e34 0x7347d5 0x460001
#       0x546ef4        runtime/pprof.writeRuntimeProfile+0x94  /usr/local/go/src/runtime/pprof/pprof.go:708
#       0x546d0f        runtime/pprof.writeGoroutine+0x9f       /usr/local/go/src/runtime/pprof/pprof.go:670
#       0x543939        runtime/pprof.(*Profile).WriteTo+0x3d9  /usr/local/go/src/runtime/pprof/pprof.go:329
#       0x7a6859        net/http/pprof.handler.ServeHTTP+0x339  /usr/local/go/src/net/http/pprof/pprof.go:245
#       0x7a7270        net/http/pprof.Index+0x6f0              /usr/local/go/src/net/http/pprof/pprof.go:268
#       0x7359e3        net/http.HandlerFunc.ServeHTTP+0x43     /usr/local/go/src/net/http/server.go:2007
#       0x7378bc        net/http.(*ServeMux).ServeHTTP+0x1bc    /usr/local/go/src/net/http/server.go:2387
#       0x738e33        net/http.serverHandler.ServeHTTP+0xa3   /usr/local/go/src/net/http/server.go:2802
#       0x7347d4        net/http.(*conn).serve+0x874            /usr/local/go/src/net/http/server.go:1890

Copy link
Contributor

@rajatchopra rajatchopra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants