Respider timer can die without running #3

Closed
philpennock opened this issue Dec 31, 2012 · 8 comments

@philpennock
Owner

This is worrying:

2012/11/21 21:11:27 main.go:157: Sleeping 7h59m53.410974498s before next respider
2012/11/24 05:19:01 main.go:157: Sleeping 8h0m13.584457147s before next respider
2012/11/24 13:22:31 main.go:157: Sleeping 7h59m41.516842812s before next respider
2012/11/24 23:32:56 main.go:157: Sleeping 7h59m44.415633029s before next respider
2012/11/25 09:43:24 main.go:157: Sleeping 8h0m35.908781979s before next respider
2012/12/31 16:02:07 main.go:157: Sleeping 8h0m58.686923411s before next respider

Before the last one there, I moved aside the persisted JSON file, so there was a fresh index. I should explore the execution paths to ensure that even after loading content instead of spidering it, the timer is set up and everything still works.
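
Roughly, the control flow that needs to hold is the following; a minimal sketch with hypothetical function names standing in for the real ones:

    package main

    import (
        "log"
        "time"
    )

    // Hypothetical stand-ins for the real persisted-load and spidering entry points.
    func loadPersistedIndex(path string) bool { return false } // pretend there is no persisted JSON
    func spiderOnce()                         { log.Print("spidering") }

    // The respider timer loop; it must run no matter how the initial index was produced.
    func respiderPeriodically(period time.Duration) {
        for {
            log.Printf("Sleeping %s before next respider", period)
            time.Sleep(period)
            log.Print("Awoken!  Time to spider.")
            spiderOnce()
        }
    }

    func main() {
        if !loadPersistedIndex("persisted.json") {
            spiderOnce() // no persisted index: build a fresh one by spidering
        }
        // Whichever branch produced the initial index, the timer loop has to be
        // started here, or no further spidering ever happens after start-up.
        respiderPeriodically(8 * time.Hour)
    }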

@ghost assigned philpennock Dec 31, 2012
@sgrayban

I'm seeing the same thing:

$ cat logs/sksdaemon.log |grep respider
2012/12/31 03:58:16 main.go:157: Sleeping 8h0m43.082153551s before next respider
2012/12/31 04:17:42 main.go:157: Sleeping 8h0m9.436884807s before next respider
2012/12/31 05:05:42 main.go:154: respider period too low, capping 1791666145821 up to 1800000000000
2012/12/31 05:05:42 main.go:157: Sleeping 30m0s before next respider
2012/12/31 05:07:47 main.go:157: Sleeping 30m54.235010051s before next respider
2012/12/31 05:07:57 main.go:157: Sleeping 30m54.235010051s before next respider
2012/12/31 05:24:28 main.go:157: Sleeping 30m11.860293954s before next respider
2012/12/31 05:43:24 main.go:157: Sleeping 30m11.860293954s before next respider
2012/12/31 06:39:38 main.go:157: Sleeping 30m11.860293954s before next respider
2012/12/31 07:26:24 main.go:154: respider period too low, capping 1773983192302 up to 1800000000000
2012/12/31 07:26:24 main.go:157: Sleeping 30m0s before next respider
2012/12/31 08:12:13 main.go:154: respider period too low, capping 1750886052001 up to 1800000000000
2012/12/31 08:12:13 main.go:157: Sleeping 30m0s before next respider

@philpennock
Owner Author

I don't see any sign there of the respider failing to be scheduled for any period of time; your server appears to be operating correctly.

The key problem in this bug is that there's a large window of time where no respider was scheduled at all. The messages you see there are not signs of a problem: they're routine diagnostic traces saying "okay, this is when you should next expect to see this sort of activity from me".

You should be aware that some keyserver operators look at their logs for people hitting the op=stats page and will ask questions if they see it hit a lot. You might want to dial back from "every 30 minutes". I picked 8 hours as a reasonable default and that's what I run with.

@philpennock
Owner Author

a43ae09 adds a recover from panic in the spider launched from the infinite loop; that should prevent the respider from silently disappearing and will log what went wrong, so it can actually be fixed.
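
For reference, the shape of that change is the usual recover-inside-the-goroutine pattern; this is only a sketch with illustrative names, not the actual a43ae09 diff:

    package main

    import (
        "log"
        "time"
    )

    // runSpiderOnce stands in for one real spider pass; here it always panics.
    func runSpiderOnce() {
        panic("something went wrong mid-spider")
    }

    // spiderWithRecover converts a panic in the spider pass into a log entry,
    // instead of letting it silently kill the goroutine, which is what made the
    // respider appear to die without ever running again.
    func spiderWithRecover() {
        defer func() {
            if r := recover(); r != nil {
                log.Printf("spider panicked: %v", r)
            }
        }()
        runSpiderOnce()
    }

    func main() {
        for i := 0; i < 3; i++ {
            spiderWithRecover()
            time.Sleep(time.Second) // the real loop sleeps ~8h between respiders
        }
    }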

philpennock added a commit that referenced this issue Jan 20, 2013
Still seeing spider thread stop, still no reason given for why.

So, import net/http/pprof which gives us /debug/pprof/ URLs, with
  /debug/pprof/goroutine?debug=1
being roughly equivalent to the old Python version's /threadz.

We also get some other human-readable output at links from /debug/pprof/

Also, import expvar and set some basic stats counters from spidering.

SECURITY: definitely do not pass all of the URL space through from the
front-end, only /sks-stats as strongly suggested.

[ Affects issue #3 ]
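
The instrumentation that commit describes boils down to roughly the following; a sketch with assumed counter names, not the actual diff:

    package main

    import (
        "expvar"
        "log"
        "net/http"
        _ "net/http/pprof" // registers the /debug/pprof/ handlers on the default mux
    )

    // Basic counters to bump from the spidering code; visible at /debug/vars.
    var (
        hostsQueried = expvar.NewInt("spider_hosts_queried")
        fetchErrors  = expvar.NewInt("spider_fetch_errors")
    )

    func main() {
        // Bind to localhost only, and do not pass /debug/ through the front-end
        // proxy; only /sks-stats should be forwarded, per the commit message.
        log.Fatal(http.ListenAndServe("localhost:8001", nil))
    }

With that in place, curl http://localhost:8001/debug/pprof/goroutine?debug=1 dumps every goroutine's stack, which is where the traces below come from.
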
@philpennock
Owner Author

Okay, the lack of timeouts on HTTP client methods is very likely what's going on.

On the very first run of the newly instrumented code, I see this in the output of http://localhost:8001/debug/pprof/goroutine?debug=1:

1 @ 0x410134 0x417586 0x4176da 0x47ae3e 0x42d6c3 0x429b57 0x400c18 0x40eec0 0x40ef63
#       0x4176da        sync.runtime_Semacquire+0x25                            /usr/local/go/src/pkg/runtime/zsema_amd64.c:146
#       0x47ae3e        sync.(*WaitGroup).Wait+0xf2                             /usr/local/go/src/pkg/sync/waitgroup.go:80
#       0x42d6c3        github.com/syscomet/sks_spider.(*Spider).Wait+0x2a      /home/pdp/src/go/src/github.com/syscomet/sks_spider/spider.go:128
#       0x429b57        github.com/syscomet/sks_spider.Main+0xa92               /home/pdp/src/go/src/github.com/syscomet/sks_spider/main.go:255
#       0x400c18        main.main+0x18                                          /home/pdp/src/go/src/github.com/syscomet/sks_spider/sks_stats_daemon.go:9
#       0x40eec0        runtime.main+0x92                                       /usr/local/go/src/pkg/runtime/proc.c:245

[...]

1 @ 0x410134 0x403100 0x4034a8 0x4ace21 0x4a9f6f 0x4986f0 0x498efb 0x49828b 0x42b464 0x42f7aa 0x40ef63
#       0x4ace21        net/http.(*persistConn).roundTrip+0x33a                         /usr/local/go/src/pkg/net/http/transport.go:682
#       0x4a9f6f        net/http.(*Transport).RoundTrip+0x32f                           /usr/local/go/src/pkg/net/http/transport.go:161
#       0x4986f0        net/http.send+0x3ca                                             /usr/local/go/src/pkg/net/http/client.go:140
#       0x498efb        net/http.(*Client).doFollowingRedirects+0x633                   /usr/local/go/src/pkg/net/http/client.go:251
#       0x49828b        net/http.(*Client).Do+0x7e                                      /usr/local/go/src/pkg/net/http/client.go:106
#       0x42b464        github.com/syscomet/sks_spider.(*SksNode).Fetch+0xfd            /home/pdp/src/go/src/github.com/syscomet/sks_spider/server_info.go:119
#       0x42f7aa        github.com/syscomet/sks_spider.(*spiderShared).QueryHost+0x5b   /home/pdp/src/go/src/github.com/syscomet/sks_spider/spider.go:308

So, we're waiting on the WaitGroup behind spider.Wait(), which is why nothing appears in the log: the spidering is still happening.

The lsof command shows an open connection to one particular remote IP address.

Using the BSD tcpdrop command to manually drop that TCP connection causes spidering to complete and the results to be shown.

The question is whether this is a "3WHS (three-way handshake) completes, nothing ever comes back", "3WHS never completes" or "tarpit" scenario. After all, I already have "operation timed out" entries in the log, so it's not the normal failure path that causes this.

philpennock added a commit that referenced this issue Jan 20, 2013
In addition to "operation timed out", there's some state where a TCP
connection can be established but we never, after days, get a response.
It might be tarpitting.  I don't know or care: instead, the client fetch
now wraps the fetch with a timeout, leading to:

    2013/01/20 05:37:53 spider.go:337: Failure fetching "[censored]": HTTP Do() timed out after 2m0s

Fixes issue #3
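
In shape, the wrapper that commit describes is the usual run-the-fetch-in-a-goroutine-and-select-against-a-timer pattern (the Go of that era had no built-in http.Client timeout). A self-contained sketch under that assumption, with a placeholder address and illustrative names rather than the real code:

    package main

    import (
        "fmt"
        "log"
        "net/http"
        "time"
    )

    // doWithTimeout bounds client.Do so that a connection which establishes but
    // never yields a response (tarpit or otherwise) cannot wedge a fetch forever.
    func doWithTimeout(client *http.Client, req *http.Request, limit time.Duration) (*http.Response, error) {
        type result struct {
            resp *http.Response
            err  error
        }
        // Buffered so the fetching goroutine can still send its result and exit
        // even after a timeout has made the caller stop listening.
        ch := make(chan result, 1)

        go func() {
            resp, err := client.Do(req)
            ch <- result{resp, err}
        }()

        select {
        case r := <-ch:
            return r.resp, r.err
        case <-time.After(limit):
            return nil, fmt.Errorf("HTTP Do() timed out after %s", limit)
        }
    }

    func main() {
        // 192.0.2.1 is a documentation address standing in for a real SKS host.
        req, err := http.NewRequest("GET", "http://192.0.2.1:11371/pks/lookup?op=stats", nil)
        if err != nil {
            log.Fatal(err)
        }
        if _, err := doWithTimeout(http.DefaultClient, req, 2*time.Minute); err != nil {
            log.Printf("Failure fetching %q: %s", req.URL, err)
        }
    }

In later Go releases the same bound can be had by setting http.Client.Timeout or attaching a deadline via the request's context.
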
@philpennock reopened this Jan 23, 2013
@philpennock
Owner Author

2013/01/20 22:19:18 main.go:162: Sleeping 8h0m37.609083209s before next respider
2013/01/21 06:19:56 main.go:164: Awoken!  Time to spider.

No open TCP connections, one listening socket.

% curl -s http://localhost:8001/debug/pprof/goroutine\?debug=1 
goroutine profile: total 13
1 @ 0x554473 0x5542a8 0x5516a0 0x4b3715 0x4b385d 0x4a42ac 0x4a5100 0x4a40bf 0x40ef63
#   0x554473    runtime/pprof.writeRuntimeProfile+0x88  /usr/local/go/src/pkg/runtime/pprof/pprof.go:517
#   0x5542a8    runtime/pprof.writeGoroutine+0x82   /usr/local/go/src/pkg/runtime/pprof/pprof.go:479
#   0x5516a0    runtime/pprof.(*Profile).WriteTo+0xa2   /usr/local/go/src/pkg/runtime/pprof/pprof.go:221
#   0x4b3715    net/http/pprof.handler.ServeHTTP+0x210  /usr/local/go/src/pkg/net/http/pprof/pprof.go:163
#   0x4b385d    net/http/pprof.Index+0x143      /usr/local/go/src/pkg/net/http/pprof/pprof.go:175
#   0x4a42ac    net/http.HandlerFunc.ServeHTTP+0x3d /usr/local/go/src/pkg/net/http/server.go:704
#   0x4a5100    net/http.(*ServeMux).ServeHTTP+0xb9 /usr/local/go/src/pkg/net/http/server.go:942
#   0x4a40bf    net/http.(*conn).serve+0x621        /usr/local/go/src/pkg/net/http/server.go:670

1 @ 0x410134 0x417586 0x4176da 0x47b53e 0x429bd3 0x400c18 0x40eec0 0x40ef63
#   0x4176da    sync.runtime_Semacquire+0x25            /usr/local/go/src/pkg/runtime/zsema_amd64.c:146
#   0x47b53e    sync.(*WaitGroup).Wait+0xf2         /usr/local/go/src/pkg/sync/waitgroup.go:80
#   0x429bd3    github.com/syscomet/sks_spider.Main+0x982   /home/pdp/src/go/src/github.com/syscomet/sks_spider/main.go:290
#   0x400c18    main.main+0x18                  /home/pdp/src/go/src/github.com/syscomet/sks_spider/sks_stats_daemon.go:9
#   0x40eec0    runtime.main+0x92               /usr/local/go/src/pkg/runtime/proc.c:245

1 @ 0x410170 0x40da91 0x40ef63
#   0x410170    runtime.entersyscall+0x37   /usr/local/go/src/pkg/runtime/proc.c:952
#   0x40da91    runtime.MHeap_Scavenger+0xce    /usr/local/go/src/pkg/runtime/mheap.c:364

1 @ 0x4102d3 0x41795d 0x4d3234 0x40ef63
#   0x4d3234    os/signal.loop+0x1c /usr/local/go/src/pkg/os/signal/signal_unix.go:20

1 @ 0x410134 0x41038a 0x4d83f5 0x4d683b 0x4d53db 0x44f482 0x44be97 0x44c884 0x44ec00 0x45c855 0x45c94f 0x4a584f 0x4a57ac 0x428e50 0x40ef63
#   0x4d83f5    syscall.Syscall6+0x61                   /usr/local/go/src/pkg/syscall/asm_freebsd_amd64.s:57
#   0x4d683b    syscall.kevent+0x88                 /usr/local/go/src/pkg/syscall/zsyscall_freebsd_amd64.go:199
#   0x4d53db    syscall.Kevent+0xa4                 /usr/local/go/src/pkg/syscall/syscall_bsd.go:546
#   0x44f482    net.(*pollster).AddFD+0xe5              /usr/local/go/src/pkg/net/fd_freebsd.go:52
#   0x44be97    net.(*pollServer).AddFD+0xe8                /usr/local/go/src/pkg/net/fd.go:119
#   0x44c884    net.(*pollServer).WaitRead+0x37             /usr/local/go/src/pkg/net/fd.go:266
#   0x44ec00    net.(*netFD).accept+0x20d               /usr/local/go/src/pkg/net/fd.go:622
#   0x45c855    net.(*TCPListener).AcceptTCP+0x71           /usr/local/go/src/pkg/net/tcpsock_posix.go:320
#   0x45c94f    net.(*TCPListener).Accept+0x49              /usr/local/go/src/pkg/net/tcpsock_posix.go:330
#   0x4a584f    net/http.(*Server).Serve+0x88               /usr/local/go/src/pkg/net/http/server.go:1029
#   0x4a57ac    net/http.(*Server).ListenAndServe+0xb6          /usr/local/go/src/pkg/net/http/server.go:1019
#   0x428e50    github.com/syscomet/sks_spider.startHttpServing+0xd4    /home/pdp/src/go/src/github.com/syscomet/sks_spider/main.go:186

3 @ 0x410134 0x403cdc 0x4039db 0x42e2d4 0x40ef63
#   0x403cdc    selectgo+0x2df                      /usr/local/go/src/pkg/runtime/chan.c:915
#   0x4039db    runtime.selectgo+0x12                   /usr/local/go/src/pkg/runtime/chan.c:792
#   0x42e2d4    github.com/syscomet/sks_spider.spiderMainLoop+0x43b /home/pdp/src/go/src/github.com/syscomet/sks_spider/spider.go:151

1 @ 0x4102d3 0x4d8399 0x4d683b 0x4d53db 0x44f9f5 0x44c695 0x40ef63
#   0x4d8399    syscall.Syscall6+0x5        /usr/local/go/src/pkg/syscall/asm_freebsd_amd64.s:38
#   0x4d683b    syscall.kevent+0x88     /usr/local/go/src/pkg/syscall/zsyscall_freebsd_amd64.go:199
#   0x4d53db    syscall.Kevent+0xa4     /usr/local/go/src/pkg/syscall/syscall_bsd.go:546
#   0x44f9f5    net.(*pollster).WaitFD+0x185    /usr/local/go/src/pkg/net/fd_freebsd.go:91
#   0x44c695    net.(*pollServer).Run+0xe4  /usr/local/go/src/pkg/net/fd.go:236

1 @ 0x410134 0x4188e3 0x40ef63
#   0x4188e3    timerproc+0x8e  /usr/local/go/src/pkg/runtime/ztime_amd64.c:144

1 @ 0x410134 0x40c59b 0x40ef63
#   0x40c59b    runfinq+0x75    /usr/local/go/src/pkg/runtime/mgc0.c:1055

1 @ 0x410134 0x4033d8 0x4034f3 0x428f8e 0x40ef63
#   0x428f8e    github.com/syscomet/sks_spider.shutdownRunner+0x39  /home/pdp/src/go/src/github.com/syscomet/sks_spider/main.go:194

1 @ 0x410134 0x402c80 0x40346b 0x41af69 0x42db31 0x432618 0x428d5c 0x40ef63
#   0x41af69    github.com/syscomet/sks_spider.KillDummySpiderForDiagnosticsChannel+0x70    /home/pdp/src/go/src/github.com/syscomet/sks_spider/diagnostics.go:70
#   0x42db31    github.com/syscomet/sks_spider.StartSpider+0x2bd                /home/pdp/src/go/src/github.com/syscomet/sks_spider/spider.go:105
#   0x432618    github.com/syscomet/sks_spider._func_009+0x1c                   /home/pdp/src/go/src/github.com/syscomet/sks_spider/main.go:167
#   0x428d5c    github.com/syscomet/sks_spider.respiderPeriodically+0x264           /home/pdp/src/go/src/github.com/syscomet/sks_spider/main.go:177

@philpennock
Owner Author

Looks like I need to take a closer look at the known gross hacks involved in KillDummySpiderForDiagnosticsChannel(), and why I even did that, when I get time to get back to it.

@philpennock
Owner Author

Okay, KillDummySpiderForDiagnosticsChannel() uses a channel with a queue depth of 1, so if starting the spider fails a second time, the second kill writes into the buffered channel and the third one hangs. That matches the timestamps, so it's a filled channel. I spotted this after already deciding those channels should be synchronous, which means failures will become more visible (and thus debuggable).
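
That failure mode is just ordinary buffered-channel behaviour. A self-contained sketch (not the real code) of how a depth-1 channel quietly absorbs one unread kill and then blocks the next sender:

    package main

    import (
        "fmt"
        "time"
    )

    func main() {
        // A "kill the dummy spider" channel with a queue depth of 1.
        kill := make(chan struct{}, 1)

        // Nobody is reading, which is the situation once starting the spider
        // has already failed.
        kill <- struct{}{} // fills the buffer and returns immediately, silently
        fmt.Println("one kill absorbed by the buffer, no error anywhere")

        select {
        case kill <- struct{}{}:
            fmt.Println("sent (won't happen: the buffer is already full)")
        case <-time.After(time.Second):
            fmt.Println("the next kill would block forever: full buffer, no reader")
        }
        // With a synchronous (unbuffered) channel, even the first send would
        // have blocked, so the missing reader shows up immediately.
    }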

@philpennock
Owner Author

For the record: confirmed that this really has solved the issue, as far as I can see.

Last startup message:

2013/01/24 09:13:44 main.go:258: Spidering complete

(and I've just changed the text for that to be "Start-up initial spidering complete", which is clearer).

Since then:

2013/01/24 09:13:44 main.go:162: Sleeping 8h0m58.024909786s before next respider
2013/01/24 17:16:11 main.go:162: Sleeping 7h59m51.449684086s before next respider
2013/01/25 01:17:55 main.go:162: Sleeping 8h0m44.161869005s before next respider
2013/01/25 09:20:16 main.go:162: Sleeping 7h59m36.630029087s before next respider
2013/01/25 17:21:20 main.go:162: Sleeping 8h0m11.346044423s before next respider
2013/01/26 01:23:30 main.go:162: Sleeping 7h59m58.279115477s before next respider
2013/01/26 09:25:29 main.go:162: Sleeping 7h59m15.152512747s before next respider
2013/01/26 17:26:17 main.go:162: Sleeping 7h59m22.512759888s before next respider
2013/01/27 01:27:40 main.go:162: Sleeping 7h59m47.909222224s before next respider
2013/01/27 09:29:30 main.go:162: Sleeping 8h0m49.1585311s before next respider
2013/01/27 17:32:20 main.go:162: Sleeping 8h0m3.822322141s before next respider
2013/01/28 01:34:24 main.go:162: Sleeping 7h59m23.927246502s before next respider
2013/01/28 09:35:48 main.go:162: Sleeping 7h59m14.624590453s before next respider
2013/01/28 17:36:39 main.go:162: Sleeping 8h0m59.352209924s before next respider

So, spidering is no longer dying, or at least if it is, something else is causing it and it's much rarer.
