Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime: fatal error: runtime: mcall called on m->g0 stack while concurrently fetching /profile and /trace #70454

Open
piob-io opened this issue Nov 19, 2024 · 7 comments
Assignees
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Milestone

Comments

@piob-io
Copy link

piob-io commented Nov 19, 2024

Go version

go version go1.23.3 linux/amd64

Output of go env in your module/workspace:

GO111MODULE=''
GOARCH='amd64'
GOBIN=''
GOCACHE='/home/piob/.cache/go-build'
GOENV='/home/piob/.config/go/env'
GOEXE=''
GOEXPERIMENT=''
GOFLAGS=''
GOHOSTARCH='amd64'
GOHOSTOS='linux'
GOINSECURE=''
GOMODCACHE='/home/piob/go/pkg/mod'
GONOPROXY=''
GONOSUMDB=''
GOOS='linux'
GOPATH='/home/piob/go'
GOPRIVATE=''
GOPROXY='https://proxy.golang.org,direct'
GOROOT='/usr/local/go'
GOSUMDB='sum.golang.org'
GOTMPDIR=''
GOTOOLCHAIN='auto'
GOTOOLDIR='/usr/local/go/pkg/tool/linux_amd64'
GOVCS=''
GOVERSION='go1.23.3'
GODEBUG=''
GOTELEMETRY='local'
GOTELEMETRYDIR='/home/piob/.config/go/telemetry'
GCCGO='gccgo'
GOAMD64='v1'
AR='ar'
CC='gcc'
CXX='g++'
CGO_ENABLED='1'
GOMOD='/dev/null'
GOWORK=''
CGO_CFLAGS='-O2 -g'
CGO_CPPFLAGS=''
CGO_CXXFLAGS='-O2 -g'
CGO_FFLAGS='-O2 -g'
CGO_LDFLAGS='-O2 -g'
PKG_CONFIG='pkg-config'
GOGCCFLAGS='-fPIC -m64 -pthread -Wl,--no-gc-sections -fmessage-length=0 -ffile-prefix-map=/tmp/go-build1174589415=/tmp/go-build -gno-record-gcc-switches'

What did you do?

Run the attached go program, which does some computation, and exposes profiling endpoints.

Fetch the cpu profile, and while it is being captured, fetch execution trace. Specifically to programmatically assure that cpu profile collection has started, hit the profile endpoint twice, expect one to fail with 500 error (only one profile can be captured concurrently), and then immediately start capturing the trace.

Repeat many times to reproduce the issue.

Attached is the go program code, and a rust binary that will repeatedly launch this program, and hit the profiling endpoints. I tried writing a go program to fetch the profiles, but it doesn't seem to repro, I suspect that race condition is subtle. The rust program has a concurrency target set to 32, problem seems to repro when all cores are busy.

Unpack the zip and:
repro.zip
$ go build fib.go
$ cargo run test

What did you see happen?

Sometimes the go program ends up crashing.

Output:
err.txt

What did you expect to see?

Successfully produced profiles without crashing.

@ajwerner
Copy link

More notes:

  • This reproduces on Linux on both AMD and Intel CPUs
    • Both machines where it reproduced thus far have had 32 vCPU
  • If GOMAXPROCS is lowered, it seems to reduce the frequency of the problem (it's never been observed with GOMAXPROCS=1)
  • It has reproduced with binaries built with go1.21.8, go1.22.5, go1.23.1, and go1.23.3
  • It reproduces with GODEBUG=asyncpremptoff=1
  • It reproduces with GOGC=off

@ianlancetaylor ianlancetaylor changed the title fatal error: runtime: mcall called on m->g0 stack while concurrently fetching /profile and /trace runtime: fatal error: runtime: mcall called on m->g0 stack while concurrently fetching /profile and /trace Nov 19, 2024
@ianlancetaylor
Copy link
Member

CC @golang/runtime

@gopherbot gopherbot added the compiler/runtime Issues related to the Go compiler and/or runtime. label Nov 19, 2024
@rhysh
Copy link
Contributor

rhysh commented Nov 20, 2024

Here's the head of the log file:

    Finished `dev` profile [unoptimized + debuginfo] target(s) in 0.05s
     Running `target/debug/test test`
fatal error: runtime: mcall called on m->g0 stack

runtime stack:
runtime.throw({0x721602?, 0x44f1ce?})
	/usr/local/go/src/runtime/panic.go:1067 +0x48 fp=0x7f5eafdffd48 sp=0x7f5eafdffd18 pc=0x471d28
runtime.badmcall(0xc000182fc0?)
	/usr/local/go/src/runtime/proc.go:539 +0x1f fp=0x7f5eafdffd68 sp=0x7f5eafdffd48 pc=0x43cb1f
runtime.badmcall(0x44f755)
	<autogenerated>:1 +0x1f fp=0x7f5eafdffd80 sp=0x7f5eafdffd68 pc=0x47c35f
runtime: g 0: unexpected return pc for runtime.badmcall called from 0x7f5eafdffdd0
stack: frame={sp:0x7f5eafdffd68, fp:0x7f5eafdffd80} stack=[0x7f5eaf600250,0x7f5eafdffe50)
0x00007f5eafdffc68:  0x0000000000439237 <runtime.startpanic_m+0x0000000000000157>  0x000000000043baf4 <runtime.printstring+0x0000000000000054> 
0x00007f5eafdffc78:  0x0000000000795a40  0x000000c000182e00 
0x00007f5eafdffc88:  0x00007f5eafdffcc8  0x0000000000438f2b <runtime.fatalthrow.func1+0x000000000000006b> 
0x00007f5eafdffc98:  0x000000c000182e00  0x0000000000471d28 <runtime.throw+0x0000000000000048> 
0x00007f5eafdffca8:  0x00007f5eafdffd18  0x00007f5eafdffd18 
0x00007f5eafdffcb8:  0x0000000000471d28 <runtime.throw+0x0000000000000048>  0x000000c000182e00 
0x00007f5eafdffcc8:  0x00007f5eafdffd08  0x0000000000438e98 <runtime.fatalthrow+0x0000000000000058> 
0x00007f5eafdffcd8:  0x00007f5eafdffce8  0x000000c000182e00 
0x00007f5eafdffce8:  0x0000000000438ec0 <runtime.fatalthrow.func1+0x0000000000000000>  0x000000c000182e00 
0x00007f5eafdffcf8:  0x0000000000471d28 <runtime.throw+0x0000000000000048>  0x00007f5eafdffd18 
0x00007f5eafdffd08:  0x00007f5eafdffd38  0x0000000000471d28 <runtime.throw+0x0000000000000048> 
0x00007f5eafdffd18:  0x00007f5eafdffd20  0x0000000000438a60 <runtime.throw.func1+0x0000000000000000> 
0x00007f5eafdffd28:  0x0000000000721602  0x0000000000000024 
0x00007f5eafdffd38:  0x00007f5eafdffd58  0x000000000043cb1f <runtime.badmcall+0x000000000000001f> 
0x00007f5eafdffd48:  0x0000000000721602  0x000000000044f1ce <runtime.readyWithTime+0x000000000000006e> 
0x00007f5eafdffd58:  0x00007f5eafdffd70  0x000000000047c35f <runtime.badmcall+0x000000000000001f> 
0x00007f5eafdffd68: <0x000000c000182fc0  0x00007f5eafdffd78 
0x00007f5eafdffd78:  0x00007f5eafdffdd0 >0x000000000044f755 <runtime.semrelease1+0x00000000000001b5> 
0x00007f5eafdffd88:  0x0000000000971840  0x0000000000441adf <runtime.execute+0x000000000000017f> 
0x00007f5eafdffd98:  0x0000000000000000  0x0000000000000000 
0x00007f5eafdffda8:  0x0000000000971840  0x000000c0000b26c0 
0x00007f5eafdffdb8:  0x0000000000971840  0x0000000000971850 
0x00007f5eafdffdc8:  0x000000000096b138  0x00007f5eafdffe10 
0x00007f5eafdffdd8:  0x000000000045c275 <runtime.readTrace0+0x00000000000001d5>  0x000000c0001c7180 
0x00007f5eafdffde8:  0x0000000100000000  0x0000000000000000 
0x00007f5eafdffdf8:  0x00007f5eb7f40000  0x0000000000000000 
0x00007f5eafdffe08:  0x000000000096b138  0x00007f5eafdffe30 
0x00007f5eafdffe18:  0x000000000045c045 <runtime.ReadTrace.func1+0x0000000000000025>  0x000000c000094f5f 
0x00007f5eafdffe28:  0x000000c000094f60  0x000000c000094f28 
0x00007f5eafdffe38:  0x0000000000477e0a <runtime.systemstack+0x000000000000004a>  0x0000000000000000 
0x00007f5eafdffe48:  0x01ffffffffffff80 

goroutine 65 gp=0xc0001c7180 m=8 mp=0xc00008b108 [running]:
runtime.systemstack_switch()
	/usr/local/go/src/runtime/asm_amd64.s:479 +0x8 fp=0xc000094f38 sp=0xc000094f28 pc=0x477da8
runtime.ReadTrace()
	/usr/local/go/src/runtime/trace.go:753 +0x73 fp=0xc000094fa0 sp=0xc000094f38 pc=0x45bfd3
runtime/trace.Start.func1()
	/usr/local/go/src/runtime/trace/trace.go:130 +0x45 fp=0xc000094fe0 sp=0xc000094fa0 pc=0x6901a5
runtime.goexit({})
	/usr/local/go/src/runtime/asm_amd64.s:1700 +0x1 fp=0xc000094fe8 sp=0xc000094fe0 pc=0x479dc1
created by runtime/trace.Start in goroutine 11
	/usr/local/go/src/runtime/trace/trace.go:128 +0xcf

ReadTrace switches to the system stack to run readTrace0, which has two semrelease(&trace.doneSema[gen%2]) calls, with semrelease being a trivial wrapper around semrelease1.

Maybe this is the goyield call in semrelease1. It's right after a readyWithTime call, and gated (only) by if s.ticket == 1 && getg().m.locks == 0 {. goyield then does an mcall.

I'm not sure at the moment if systemstack increments mp.locks. If not, this code path could be the problem, and so semrelease1 should have an additional getg() == getg().m.curg gate before calling goyield.

@dmitshur dmitshur added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Nov 20, 2024
@ajwerner
Copy link

Nice catch @rhysh! I can confirm that the below patch makes what was a reliable repro that ~always hit within 1,000 runs to not happening in 10,000 runs. I'll leave it to the runtime experts to decide whether that's the appropriate fix.

diff --git a/src/runtime/sema.go b/src/runtime/sema.go
index f6b1b84f5f..26c33f578c 100644
--- a/src/runtime/sema.go
+++ b/src/runtime/sema.go
@@ -247,7 +247,7 @@ func semrelease1(addr *uint32, handoff bool, skipframes int) {
                        s.ticket = 1
                }
                readyWithTime(s, 5+skipframes)
-               if s.ticket == 1 && getg().m.locks == 0 {
+               if s.ticket == 1 && getg().m.locks == 0 && getg() == getg().m.curg {
                        // Direct G handoff
                        // readyWithTime has added the waiter G as runnext in the
                        // current P; we now call the scheduler so that we start running

@mknyszek mknyszek self-assigned this Nov 20, 2024
@mknyszek mknyszek added this to the Backlog milestone Nov 20, 2024
@rhysh
Copy link
Contributor

rhysh commented Nov 20, 2024

I don't have an explanation for how s.ticket == 1 would be true in the crash report. It looks like we know a few lines up that it's zero, and only becomes non-zero if handoff is true. But readTrace0's calls to semrelease1 are via the semrelease wrapper, which passes false for handoff.

Looks like @mknyszek is on it, I'll leave it there.

@ajwerner
Copy link

I added a bit of logging to the point that before readyWithTime is called, s.ticket is 0, and after it is 1. 🍿

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Projects
Development

No branches or pull requests

8 participants