-
Notifications
You must be signed in to change notification settings - Fork 17.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
runtime: use frame pointers for callers #16638
Comments
That could work, and be much faster. |
There is no need to handle inlining at traceback time. Today, each PC corresponds to a single file/line/function. When we can inline non-leaf functions, each PC corresponds to a list of file/line/function tuples. At traceback time, we only need that PC. At symbolization time, we need to expand that PC into a list of file/line/function tuples. There is already code for handling this in The point is, traceback can be fast while still supporting non-leaf inlined functions when we interpret the traceback. |
CL https://golang.org/cl/33754 mentions this issue. |
func f() { g() } We mistakenly don't add a frame pointer for f. This means f isn't seen when walking the frame pointer linked list. That matters for kernel-gathered profiles, and is an impediment for issues like #16638. To fix, allocate a stack frame even for otherwise frameless functions like f. It is a bit tricky because we need to avoid some runtime internals that really, really don't want one. No test at the moment, as only kernel CPU profiles would catch it. Tests will come with the implementation of #16638. Fixes #18103 Change-Id: I411206cc9de4c8fdd265bee2e4fa61d161ad1847 Reviewed-on: https://go-review.googlesource.com/33754 Run-TryBot: Keith Randall <[email protected]> TryBot-Result: Gobot Gobot <[email protected]> Reviewed-by: Austin Clements <[email protected]>
CL https://golang.org/cl/33895 mentions this issue. |
When we copy the stack, we need to adjust all BPs. We correctly adjust the ones on the stack, but we also need to adjust the one that is in g.sched.bp. Like CL 33754, no test as only kernel-gathered profiles will notice. Tests will come (in 1.9) with the implementation of #16638. The invariant should hold that every frame pointer points to somewhere within its stack. After this CL, it is mostly true, but something about cgo breaks it. The runtime checks are disabled until I figure that out. Update #16638 Fixes #18174 Change-Id: I6023ee64adc80574ee3e76491d4f0fa5ede3dbdb Reviewed-on: https://go-review.googlesource.com/33895 Reviewed-by: Austin Clements <[email protected]>
CL 43150 may also help speed up tracing; that list of hot functions looks familiar from when I was working on that CL. |
Change https://golang.org/cl/33809 mentions this issue: |
Change https://golang.org/cl/212301 mentions this issue: |
@dvyukov I took a look at this and it seems that if we want to get runtime.Callers() semantics fully correct (including a proper skip, based on any inlined frames due to mid-stack inlining), then we will lose a lot of the benefit of the optimization. See my message for my prototype code https://go-review.googlesource.com/c/go/+/212301 But then I noticed that actually the tracer and mprof are using a separate routine gcallers. We could possible only optimize gcallers(). In that case, would it be acceptible to not expand out the logical frames that are not physically there because of mid-stack inlining? Or do the inline expanding at CallerFrames()/Next() time only, but do the early skip processing based on physical frames? One really rough set of numbers that I did for your benchmark (take it with a grain of salt) is 40% overhead currently [like your 55% overhead], then 24% overhead if we do the optimization but have to deal with inlined frames, and 6% overhead if we don't have to deal with inlined frames (just follow FP pointers and done). |
Hi @danscales, The most profitable and critical is to optimize tracer, maybe mprof.
Either standard unwinding become faster, or everything else become slower :)
This sounds great. I would rely on tests that we correctly deal with inlining/skip. Thanks! |
The problem is that the semantics of Callers (and presumable gcallers) is that tne pcbuf slice must be filled up to its max (as possible) after skipping 'skip' frames, and the skipped frames must be counted including inlined frames. So, if we don't understand the inlining at the time we fill up pcbuf initially, we don't know exactly how many physical frames to skip, so we may not grab enough frames to fill up the max after we do the skip. (The easier thing would be to grab all physical frame pcs until we do the inlining later, but where do we store them all -- we only have pcbuf.) So, we need a slightly looser definition for skip and for filling in pcbuf for gcallers() if we are going to use the frame pointer optimization. But I will proceed with trying to optimizing gcallers, and doing the inlining interpretation later, and see how things go with your tests. |
We don't need these strict skip semantics here. We need to remove the non-interesting frames, how exactly does not matter. So instead of removing N inlined frames, we can remove M non-inlined frames. I would assume that the relevant runtime functions should not be inlined, so doing correct skip should be doable. Alternatively, we could consider changing what's being stripped a bit, if it makes things much simpler and faster on the unwinding side. |
I uploaded CL 463835 which is similar to CL 33809 and tested it against Before:
After:
I've also tested it against the sweet benchmarks and also got very good results, see gist. I suspect some changes from CL 212301 would have to be absorbed to make this more robust, but hopefully this could be done without losing performance (I missed this CL when I started this work, otherwise I would have probably used it as my starting point). Anyway, I'm willing to continue this work towards a production-ready patch. As far as the discussion above is concerned, I'm voting to make runtime/trace unwinding best effort: The more it can resemble gentraceback, the better. But small deviations are acceptable when it has a significant performance/simplicity benefits. Some more information is available in this blog post (ignore the first half). |
Change https://go.dev/cl/463835 mentions this issue: |
Change https://go.dev/cl/474916 mentions this issue: |
Change https://go.dev/cl/475795 mentions this issue: |
Change tracer to use frame pointer unwinding by default on amd64. The expansion of inline frames is delayed until the stack table is dumped at the end of the trace. This requires storing the skip argument in the stack table, which now resides in pcBuf[0]. For stacks that are not produced by traceStackID (e.g. CPU samples), a logicalStackSentinel value in pcBuf[0] indicates that no inline expansion is needed. Add new GODEBUG=tracefpunwindoff=1 option to use the old unwinder if needed. Benchmarks show a considerable decrease in CPU overhead when using frame pointer unwinding for trace events: GODEBUG=tracefpunwindoff=1 ../bin/go test -run '^$' -bench '.+PingPong' -count 20 -v -trace /dev/null ./runtime | tee tracefpunwindoff1.txt GODEBUG=tracefpunwindoff=0 ../bin/go test -run '^$' -bench '.+PingPong' -count 20 -v -trace /dev/null ./runtime | tee tracefpunwindoff0.txt goos: linux goarch: amd64 pkg: runtime cpu: Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz │ tracefpunwindoff1.txt │ tracefpunwindoff0.txt │ │ sec/op │ sec/op vs base │ PingPongHog-32 3782.5n ± 0% 740.7n ± 2% -80.42% (p=0.000 n=20) For #16638 Change-Id: I2928a2fcd8779a31c45ce0f2fbcc0179641190bb Reviewed-on: https://go-review.googlesource.com/c/go/+/463835 Reviewed-by: Michael Pratt <[email protected]> Run-TryBot: Felix Geisendörfer <[email protected]> TryBot-Result: Gopher Robot <[email protected]> Reviewed-by: Michael Knyszek <[email protected]>
Introduce a new m.incgocallback field that is true while C code calls into Go code. Use it in the tracer in order to fallback to the default unwinder instead of frame pointer unwinding for this scenario. The existing fields (incgo, ncgo) were not sufficient to detect the case where a thread created in C calls into Go code. Motivation: 1. Take advantage of a cgo symbolizer, if registered, to unwind through C stacks without frame pointers. 2. Reduce the chance of crashes. It seems unsafe to follow frame pointers when there could be C code that was compiled without frame pointers. Removing the curgp.m.incgocallback check in traceStackID shows the following minor differences between frame pointer unwinding and the default unwinder when there is no cgo symbolizer involved. trace_test.go:60: "goCalledFromCThread": got stack: main.goCalledFromCThread /src/runtime/testdata/testprogcgo/trace.go:58 _cgoexp_45c15a3efb3a_goCalledFromCThread _cgo_gotypes.go:694 runtime.cgocallbackg1 /src/runtime/cgocall.go:318 runtime.cgocallbackg /src/runtime/cgocall.go:236 runtime.cgocallback /src/runtime/asm_amd64.s:998 crosscall2 /src/runtime/cgo/asm_amd64.s:30 want stack: main.goCalledFromCThread /src/runtime/testdata/testprogcgo/trace.go:58 _cgoexp_45c15a3efb3a_goCalledFromCThread _cgo_gotypes.go:694 runtime.cgocallbackg1 /src/runtime/cgocall.go:318 runtime.cgocallbackg /src/runtime/cgocall.go:236 runtime.cgocallback /src/runtime/asm_amd64.s:998 trace_test.go:60: "goCalledFromC": got stack: main.goCalledFromC /src/runtime/testdata/testprogcgo/trace.go:51 _cgoexp_45c15a3efb3a_goCalledFromC _cgo_gotypes.go:687 runtime.cgocallbackg1 /src/runtime/cgocall.go:318 runtime.cgocallbackg /src/runtime/cgocall.go:236 runtime.cgocallback /src/runtime/asm_amd64.s:998 crosscall2 /src/runtime/cgo/asm_amd64.s:30 runtime.asmcgocall /src/runtime/asm_amd64.s:848 main._Cfunc_cCalledFromGo _cgo_gotypes.go:263 main.goCalledFromGo /src/runtime/testdata/testprogcgo/trace.go:46 main.Trace /src/runtime/testdata/testprogcgo/trace.go:37 main.main /src/runtime/testdata/testprogcgo/main.go:34 want stack: main.goCalledFromC /src/runtime/testdata/testprogcgo/trace.go:51 _cgoexp_45c15a3efb3a_goCalledFromC _cgo_gotypes.go:687 runtime.cgocallbackg1 /src/runtime/cgocall.go:318 runtime.cgocallbackg /src/runtime/cgocall.go:236 runtime.cgocallback /src/runtime/asm_amd64.s:998 runtime.systemstack_switch /src/runtime/asm_amd64.s:463 runtime.cgocall /src/runtime/cgocall.go:168 main._Cfunc_cCalledFromGo _cgo_gotypes.go:263 main.goCalledFromGo /src/runtime/testdata/testprogcgo/trace.go:46 main.Trace /src/runtime/testdata/testprogcgo/trace.go:37 main.main /src/runtime/testdata/testprogcgo/main.go:34 For #16638 Change-Id: I95fa27a3170c5abd923afc6eadab4eae777ced31 Reviewed-on: https://go-review.googlesource.com/c/go/+/474916 TryBot-Result: Gopher Robot <[email protected]> Reviewed-by: Michael Pratt <[email protected]> Reviewed-by: Michael Knyszek <[email protected]> Run-TryBot: Felix Geisendörfer <[email protected]>
This allows comparing frame pointer unwinding against the default unwinder as shown below. goos: linux goarch: amd64 pkg: runtime cpu: Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz │ callers.txt │ │ sec/op │ Callers/cached-32 1.254µ ± 0% FPCallers/cached-32 24.99n ± 0% For #16638 Change-Id: I4dd05f82254726152ef4a5d5beceab33641e9d2b Reviewed-on: https://go-review.googlesource.com/c/go/+/475795 TryBot-Result: Gopher Robot <[email protected]> Run-TryBot: Felix Geisendörfer <[email protected]> Reviewed-by: Michael Knyszek <[email protected]> Reviewed-by: Cherry Mui <[email protected]>
@felixge it looks like one of these CLs is causing failures on solaris on https://build.golang.org/ https://build.golang.org/log/3ed4fadb9027af0908e4ca04ea104b69651ad68d |
Change https://go.dev/cl/481617 mentions this issue: |
The previous name was wrong due to the mistaken assumption that calling f->g->getcallerpc and f->g->getcallersp would respectively return the pc/sp at g. However, they are actually referring to their caller's caller, i.e. f. Rename getcallerfp to getfp in order to stay consistent with this naming convention. Also see discussion on CL 463835. For #16638 Change-Id: I07990645da78819efd3db92f643326652ee516f8 Reviewed-on: https://go-review.googlesource.com/c/go/+/481617 Reviewed-by: Michael Knyszek <[email protected]> Reviewed-by: Cherry Mui <[email protected]> Run-TryBot: Felix Geisendörfer <[email protected]> TryBot-Result: Gopher Robot <[email protected]>
Change https://go.dev/cl/494857 mentions this issue: |
The previous name was wrong due to the mistaken assumption that calling f->g->getcallerpc and f->g->getcallersp would respectively return the pc/sp at g. However, they are actually referring to their caller's caller, i.e. f. Rename getcallerfp to getfp in order to stay consistent with this naming convention. Also see discussion on CL 463835. For #16638 This is a redo of CL 481617 that became necessary because CL 461738 added another call site for getcallerfp(). Change-Id: If0b536e85a6c26061b65e7b5c2859fc31385d025 Reviewed-on: https://go-review.googlesource.com/c/go/+/494857 Reviewed-by: Michael Pratt <[email protected]> TryBot-Result: Gopher Robot <[email protected]> Reviewed-by: Cherry Mui <[email protected]> Run-TryBot: Felix Geisendörfer <[email protected]>
Traceback is the main source of slowdown for tracer. On net/http.BenchmarkClientServerParallel4:
BenchmarkClientServerParallel4-6 200000 10627 ns/op 4482 B/op 57 allocs/op
with tracer:
BenchmarkClientServerParallel4-6 200000 16444 ns/op 4482 B/op 57 allocs/op
That's +55% slowdown. Top functions of profile are:
6.09% http.test http.test [.] runtime.pcvalue
5.88% http.test http.test [.] runtime.gentraceback
5.41% http.test http.test [.] runtime.readvarint
4.31% http.test http.test [.] runtime.findfunc
2.98% http.test http.test [.] runtime.step
2.12% http.test http.test [.] runtime.mallocgc
runtime.callers/gcallers/Callers are not interested in frame/func/sp/args/etc for each frame, they only need PC values. PC values can be obtained using frame pointers, which must be much faster. Note that there calls are always synchronous (can't happen during function prologue or in the middle of goroutine switch), so should be much simpler to handle.
We should use frame pointers in runtime.callers.
@aclements @ianlancetaylor @hyangah
The text was updated successfully, but these errors were encountered: