runtime: SIGSEGV while handling SIGSEGV #13978
Thanks for the report. Unfortunately I don't see enough here to know what to fix. Are you saying that the Go program simply crashes with "Segmentation fault (core dumped)" and doesn't print anything else? That should (of course) never happen.

There is something odd here. Your version output says the compiler is darwin/amd64, but the stack trace is clearly from GNU/Linux. The backtrace line suggests that the crash is because either g.m or g.m.gsignal is nil.

If you can figure out the arguments in the core dump, you can determine where the first SIGSEGV occurred by looking at the ctx argument to sigtrampgo. It will be a pointer to a sigcontext struct, defined in runtime/defs_linux_amd64.go. The rip field should point to the instruction that got the error.
I'm cross-compiling from darwin to linux, so the compiler is darwin/amd64 and the runtime is sys_linux_amd64. Should have mentioned that.
Yes, the Go program simply crashes with "Segmentation fault (core dumped)" and doesn't print anything else.

I'll poke and see if I can figure out where the first SIGSEGV is occurring.
Does your program crash if you compile it natively on the target Linux machine?
I've reproduced the issue with a Linux-built binary; this also served to test it with go1.6beta2.
Can you please paste the full stack trace from the crash?
The Go stack trace doesn't really exist, but here is what it prints out:
gdb:
I don't understand. There should be a panic from Go, especially for SEGV. Are you sure there is no cgo in your project? If you're doing packet processing, are you using SOCK_RAW with libpcap?
We're unfortunately not able to share the source code for the program. There's no cgo in this program outside of the core net and runtime/race packages. The packet processing is done on UDP packets in pure Go, not on SOCK_RAW with libpcap. The race detector has not found any data races in the program.

The program crashes with SIGSEGV in runtime.sigtrampgo, apparently because either g.m or g.m.gsignal is nil. I suspect that the instruction pointer for the first SIGSEGV is being accurately reported; that is, the original SIGSEGV is caused by an attempt to execute non-executable memory. I think this is likely due to memory corruption in the scheduler's data structures.

How can we hunt down corruption in the scheduler's memory? Does corruption there sound like a plausible explanation for the crashes we see?
Can you reproduce the problem on previous versions of Go, 1.5.3 or 1.4.2?
Possible, but unlikely; corruption of this sort usually comes from data races in the program. If you suspect that your tests do not exercise possible racy paths, could you build a race-enabled version of your program and deploy it? go build -race $PKG will do it.
Can you run the process under gdb? That should stop at the point of the first segfault rather than when Go's handler breaks, and may make it easier to find out what's going on. Alternatively, I guess, you could disable Go's handling of SIGSEGV entirely.

As to figuring out why you're jumping into the weeds: I guess if you're lucky you're jumping via CALL and the address you jumped from is on the top of the stack. If not, you might still be able to figure out where you were in the code by looking through the stack, though of course if the jump is from runtime·gogo or something like that this won't help.
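For example, gdb stops at the first SIGSEGV by default (before the process's own handler sees it), so a session along these lines, with the binary name as a placeholder, shows the faulting instruction and the top of the stack:

gdb --args ./yourprog
(gdb) run
...Program received signal SIGSEGV, Segmentation fault...
(gdb) info registers rip rsp
(gdb) x/3i $rip
(gdb) x/16gx $rsp

x/3i $rip disassembles at the faulting address, and x/16gx $rsp dumps the top of the stack; if the bad jump came from a CALL, the return address should be the first word there.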
The program has been running with the race detector for the past hour with no races reported. It hasn't encountered the SIGSEGV yet, but all other periodic code paths should have been exercised a few hundred times. We can run the process under gdb, though it may take a day before we see the crash again. Disabling Go's SIGSEGV handling to force an immediate core dump sounds attractive. Can this be done via a setting, or does it require hacking the runtime?
I don't actually know the answer to this. I was thinking of hacking out the runtime code that installs the SIGSEGV handler... |
The goroutine that's segfaulting is one of the GC's background mark workers. Both usable core dumps we have show the same thing (below).
I think I've found the cause of our issues. The G that runs the GC background mark worker is sometimes running without an M. When that goroutine calls runtime.acquirem, the nil g.m is dereferenced and we get the first SIGSEGV.
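For reference, acquirem in the runtime of this era is essentially the following (paraphrased from the runtime sources; the comment is mine), so a nil g.m faults on the very first field access:

//go:nosplit
func acquirem() *m {
	_g_ := getg()
	_g_.m.locks++ // faults here if _g_.m is nil
	return _g_.m
}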
Crash 1452891329, built on darwin/amd64 to run on linux/amd64 at go version 505fa7b:
The 40-byte offset is because I'd been misinterpreting the meaning of the ctx argument to runtime.sigtrampgo. It's an unsafe.Pointer to a runtime.ucontext value, which on amd64 includes a runtime.sigcontext value at a 40-byte offset (rather than a pointer directly to a runtime.sigcontext value). runtime/signal_linux_amd64.go shows how to correctly interpret the contents. I don't have any further answers for why the GC worker is running without an M.
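To make the 40-byte offset concrete, here is a small self-contained sketch. The struct shapes mirror the linux/amd64 definitions the runtime uses, but they are mocks for illustration only; in particular, the real runtime.sigcontext carries the full register set, of which rip is just one field.

package main

import (
	"fmt"
	"unsafe"
)

// Mock of the stack_t layout on linux/amd64: 8 + 4 (+4 padding) + 8 = 24 bytes.
type stackt struct {
	ss_sp    *byte
	ss_flags int32
	_        [4]byte
	ss_size  uintptr
}

// Stand-in for runtime.sigcontext; the real struct has many register fields.
type sigcontext struct {
	rip uint64
	// ... more fields in the real definition ...
}

// Mock of runtime.ucontext: uc_flags (8) + uc_link (8) + uc_stack (24)
// put uc_mcontext, and therefore the saved registers, at offset 40.
type ucontext struct {
	uc_flags    uint64
	uc_link     *ucontext
	uc_stack    stackt
	uc_mcontext sigcontext
	// ... more fields follow in the real definition ...
}

func main() {
	// Prints 40: the ctx argument to sigtrampgo points at a ucontext,
	// and the sigcontext (with the faulting rip) lives 40 bytes in.
	fmt.Println(unsafe.Offsetof(ucontext{}.uc_mcontext))
}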
It looks like this is the same bug as #13363, although the CPU profiler was not involved with these crashes. /cc @aclements
@rhysh thanks for the good investigative work. @RLH @aclements it appears that in some (uncommon) case gcBgMarkWorker can be started up without an m.
Crash 1452891329 on go version 505fa7b comes in the middle of runtime.gcBgMarkWorker during a call to runtime.acquirem. This isn't the first call to runtime.acquirem in that function, which means that the worker must have had an m when it started and then lost it (otherwise we'd see a segfault at the first call).

Crash 1453060415 on go version go1.6beta2 comes at the call to runtime.acquirem at the start of runtime.gopark. I looked through the stack of the crashed goroutine for instruction addresses and found a 64-bit aligned value of 0x000000000041f0fc at address 0xc8202a7f40, between the stack.hi value of 0xc8202a7fe0 and the rsp value of 0xc8202a7f20. The instruction address corresponds to the return from runtime.gopark at the top of runtime.gcBgMarkWorker's for loop. If I've unwound the stack correctly, this means that gcBgMarkWorker calling gopark calling acquirem led to the second crash.
The gcBgMarkWorker goroutines had an m when they were created, but had their execution resumed at some point without an m.
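For anyone repeating this from one of the core dumps, the scan is just stock gdb; the addresses here are the ones quoted above. x/24gx dumps the 192 bytes between rsp (0xc8202a7f20) and stack.hi (0xc8202a7fe0) as 64-bit words, and info symbol maps a candidate word back to the function containing it:

(gdb) x/24gx 0xc8202a7f20
(gdb) info symbol 0x41f0fc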
Nice debugging! I agree that this is almost certainly the same failure as #13363. You're positive that CPU profiling wasn't enabled? Is it possible the application is receiving other asynchronous signals? (I would love to rule out async signals as a possible cause.)
pprof is installed, but I was not actually calling any of the pprof endpoints for most of the crashes. For many of them I'd start the process, close the screen session, and then go home for the day, only to have it crash the next morning. I don't think any async signals are being delivered, though they may be under the hood. Most of the syscalls are reads/writes on network fds. I'll poke around and see if I can find any others, but I seriously doubt it.
I can think of a few hypotheses:
We may be able to disprove 1 by looking at where the SIGSEGVs happen relative to possible preemption points. The background mark worker has fairly limited preemption points and since the goroutine can only resume at a preemption point, if there are any successful accesses to g.m between the crash and the previous preemption point, we can rule out 1. My case from #13363 does not disprove 1. In fact, since we resume from an involuntary preemption at the call to morestack, the crash happened just a few instructions after a possible resume point. This lends support to theory 1, but does not prove it. @rhysh, in 1452891329, do you know which acquirem crashed?

Crash 1453060415 would seem to disprove 1. As of the acquirem in gopark, the goroutine hasn't been preemptible for quite a while. On all of the code paths that lead from disabling preemption to gopark (of which I think there are three: function entry, regular looping around, and looping around from the gcMarkDone condition), we've done our own acquirem, which would indicate g.m != nil since the last preemption point. Further evidence against 1 is that execute itself would crash if g.m were nil. However, there are other calls to gogo that wouldn't crash if g.m were nil.

For 2 and 3, we need to consider the places that modify g.m. Ignoring places where we assign g0.m or gsignal.m, there are four: newextram, execute, dropg, and goexit0. newextram always sets it to non-nil (and immediately uses it), so we can discount it. Likewise, execute would crash if g.m were nil. dropg and goexit0 do set g.m to nil. @rhysh or @tarrant, could you have GDB dump the whole current G? For 2, it's almost certainly not goexit0. For 3, it could conceivably be goexit0. However, in #13363, it's definitely not goexit0 because the status is _Grunning and the waitstatus is set. Likewise, all of the calls to dropg are immediately preceded by setting the status to something other than _Grunning. Most paths back to _Grunning would crash if g.m were nil, but I'm not sure all of them would. This would be a useful place to dig deeper.

For 4, I'm pretty sure there's nothing specific in the signal code that modifies g.m and I don't think it can call any of the functions I mentioned above. So 4 seems unlikely. I hope it's not 5.
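For context, dropg, the simplest of the four g.m-modifying sites above, just severs the G-to-M association in both directions; in the runtime of this era it is roughly (paraphrased from src/runtime/proc.go, comments mine):

// dropg removes the association between the current M and the
// goroutine it has been running (m.curg).
func dropg() {
	_g_ := getg()
	_g_.m.curg.m = nil // that G no longer points at an M ...
	_g_.m.curg = nil   // ... and this M no longer points at that G
}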
Thanks for looking into this, Austin. The acquirem call in 1452891329 is the one towards the top of gcBgMarkWorker's for loop: https://github.com/golang/go/blob/505fa7b/src/runtime/mgc.go#L1390

I've collected quite a few core dumps from reproducing #13363, which show five different call stacks at the time of the crash. I'm building binaries of the runtime/pprof test for linux/amd64, using the stress command, gdb script, and package changes listed below.
stress command:
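(The exact command isn't shown here; a hypothetical equivalent using golang.org/x/tools/cmd/stress, with the parallelism and -test.run pattern as placeholder choices, would be:)

go get golang.org/x/tools/cmd/stress
go test -c -o pprof.test runtime/pprof
stress -p 8 ./pprof.test -test.run=TestCPUProfile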
gdb script:
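(Likewise the script itself isn't shown; a hypothetical crash.gdb that harvests the state of each failing run with standard gdb commands, invoked as gdb -batch -x crash.gdb --args ./pprof.test -test.run=TestCPUProfile, might be:)

set pagination off
handle SIGSEGV stop print nopass
run
bt
info registers
info threads
quit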
changes to core packages to make reproducing easier, per #13363:
I think that last crash may be the most interesting, as it happens within runtime.gopark after a successful call to acquirem. The write to
Thanks @tarrant, it looks like that G is very similar to the one from #13363. It's in state _Grunning and the wait reason is set, so the same conclusions apply.

@rhysh, you're right that that last crash is really interesting. What makes it even more interesting is that there's also a write barrier on the previous line that would have crashed if g.m was nil on that line. That strongly suggests that either the write barrier itself can somehow clear g.m, or something is concurrently clearing it. It's hard to say for sure, but I don't think any of the other stack traces are obviously incompatible with the theory that the write barrier itself is doing it. Though my hunch is that something is concurrently clearing it, since there are so many different stacks and the write barrier itself is fairly deterministic.
I may know how this is happening. In the gcMarkDone case of the mark worker, the worker disassociates from the P and makes itself preemptible, and then later attempts to reassociate with the P. If the scheduler moved the worker to a different P while it was preemptible (and a new worker hasn't been started for the original P), then as soon as the worker reassociates with the original P, the scheduler on that P can run the worker. It may do so before the worker has fully parked itself, at which point the worker G is effectively running on two Ms simultaneously. The sequence of steps is something like:

1. The worker, associated with P1, reaches the gcMarkDone case, disassociates from P1, and becomes preemptible.
2. The scheduler preempts the worker and later resumes it on a different P, call it P2.
3. While finishing up on P2, the worker reassociates itself with P1 (no new worker has been started for P1 in the meantime).
4. P1's scheduler sees the reassociated worker and starts running it before the instance on P2 has fully parked.
5. The same G is now running on two Ms simultaneously, and the resulting confusion over g.m produces the SIGSEGV.
@rhysh or @tarrant, can you give https://go-review.googlesource.com/18741 a try? It's not a good fix, but it should test if I'm right about the cause.
Hmm. The trybots are indicating that maybe you don't want to try CL 18741. It seems my "XXX This is definitely not ideal." is actually "XXX This causes random deadlocks all over the place." However, this does confirm that the workers really do migrate between Ps sometimes in this code path.
This time for real. @rhysh or @tarrant, can you give https://go-review.googlesource.com/18761 a try? I think that's actually the right fix (and it passes the trybots).
CL https://golang.org/cl/18761 mentions this issue.
Sweet, I've gotten a patched version of Go set up and the binary is built and running. I'll let it run for a few days, and if nothing happens we can almost safely say this is fixed.
I think there's a bug in my CL that causes it to deadlock occasionally, but
any results you can come up with are useful.
Yeah, looking like a deadlock.
Doesn't look like any time is being spent in my code; instead it's just spinning trying to find a thread to run.
Actually, it was spinning waiting for a dead thread to enter the "waiting" state, which, needless to say, was never going to happen. :) But I think I've fixed the problem and it's passing my stress tests. Could you try the new version of CL 18761?
This one seems to run better, but my program is still spending most of its time spinning.
@tarrant, interesting. What does the profile look like without the CL?
It would look a lot better if I had frame pointers enabled. But:
Thanks @tarrant. Unfortunately, I can't compare these profiles because there aren't any Go symbols in the second one. It looks like it's spending roughly the same amount of time in __ticket_spin_lock, but, for instance, I don't know if runtime.findrunnable has become more expensive with my CL.
Hopefully this will help, this is a
Hi @tarrant. I haven't been able to reproduce any performance problems with this CL with either the go1 benchmarks or the x/benchmarks garbage benchmark. Are you sure those profiles are comparable? Do you have end-to-end benchmark numbers that show a problem?
I'm going to go ahead and push this CL so we can get it out there, but please open a new issue if it causes performance problems for your workload.
Just tried go1.6rc1 and am having zero issues. I'm going to chalk up my issues to some oddity in how I was patching or compiling. Thanks a ton for your help in getting this all sorted out and fixed.
Thanks to you and @rhysh for all the help debugging!
It appears that my program is segfaulting (receiving a SIGSEGV) while handling a SIGSEGV.
I was having some GC issues and a coworker recommended trying go1.6beta1 to see if it would fix them. It didn't (I've resolved them in other ways), but it did start causing my program to segfault.
Sadly I can't share the actual binary or core dump but I can share other
information which should hopefully assist in tracking it down.
The segfault occurs after 12-24 hours of running. The program is network-I/O intensive, taking UDP packets in on one side and streaming TCP out on the other.
CC: @rsc and @ianlancetaylor I was told this would be of interest to you two.
Version
I also had this occur with go1.6beta1. I'm attempting beta2 now, but it won't be until tomorrow at the earliest that I know.
Do you use cgo or unsafe?
I don't believe so and no unsafe.
tarrant@box-name:~$ ldd canal
not a dynamic executable
GDB StackTrace
Go stacktrace
Memory info from gdb:
Registers
Object Dump