Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime: use frame pointers for callers #16638

Open
dvyukov opened this issue Aug 8, 2016 · 20 comments
Open

runtime: use frame pointers for callers #16638

dvyukov opened this issue Aug 8, 2016 · 20 comments
Labels
early-in-cycle A change that should be done early in the 3 month dev cycle. NeedsFix The path to resolution is known, but the work has not been done. Performance
Milestone

Comments

@dvyukov
Copy link
Member

dvyukov commented Aug 8, 2016

Traceback is the main source of slowdown for tracer. On net/http.BenchmarkClientServerParallel4:
BenchmarkClientServerParallel4-6 200000 10627 ns/op 4482 B/op 57 allocs/op
with tracer:
BenchmarkClientServerParallel4-6 200000 16444 ns/op 4482 B/op 57 allocs/op
That's +55% slowdown. Top functions of profile are:
6.09% http.test http.test [.] runtime.pcvalue
5.88% http.test http.test [.] runtime.gentraceback
5.41% http.test http.test [.] runtime.readvarint
4.31% http.test http.test [.] runtime.findfunc
2.98% http.test http.test [.] runtime.step
2.12% http.test http.test [.] runtime.mallocgc

runtime.callers/gcallers/Callers are not interested in frame/func/sp/args/etc for each frame, they only need PC values. PC values can be obtained using frame pointers, which must be much faster. Note that there calls are always synchronous (can't happen during function prologue or in the middle of goroutine switch), so should be much simpler to handle.

We should use frame pointers in runtime.callers.

@aclements @ianlancetaylor @hyangah

@quentinmit quentinmit added this to the Go1.8 milestone Aug 8, 2016
@randall77
Copy link
Contributor

That could work, and be much faster.
My only worry is that at some point we're going to tackle the "inline non-leaf functions" problem and then just the list of PCs from the frame pointer walk won't be enough. We'll need to somehow expand PCs that correspond to call sites that have been inlined into other functions. I'm not sure how that would work without doing everything the generic gentraceback is doing (findfunc, mainly).

@ianlancetaylor
Copy link
Member

There is no need to handle inlining at traceback time.  Today, each PC corresponds to a single file/line/function.  When we can inline non-leaf functions, each PC corresponds to a list of file/line/function tuples.  At traceback time, we only need that PC. At symbolization time, we need to expand that PC into a list of file/line/function tuples. There is already code for handling this in runtime.Frames.Next, we just need to move from FuncForPC to something that can return multiple file/line/function tuples.

The point is, traceback can be fast while still supporting non-leaf inlined functions when we interpret the traceback.

@quentinmit quentinmit added the NeedsFix The path to resolution is known, but the work has not been done. label Oct 11, 2016
@rsc rsc modified the milestones: Go1.9Early, Go1.8 Nov 3, 2016
@gopherbot
Copy link
Contributor

CL https://golang.org/cl/33754 mentions this issue.

gopherbot pushed a commit that referenced this issue Dec 1, 2016
func f() {
    g()
}

We mistakenly don't add a frame pointer for f.  This means f
isn't seen when walking the frame pointer linked list.  That
matters for kernel-gathered profiles, and is an impediment for
issues like #16638.

To fix, allocate a stack frame even for otherwise frameless functions
like f.  It is a bit tricky because we need to avoid some runtime
internals that really, really don't want one.

No test at the moment, as only kernel CPU profiles would catch it.
Tests will come with the implementation of #16638.

Fixes #18103

Change-Id: I411206cc9de4c8fdd265bee2e4fa61d161ad1847
Reviewed-on: https://go-review.googlesource.com/33754
Run-TryBot: Keith Randall <[email protected]>
TryBot-Result: Gobot Gobot <[email protected]>
Reviewed-by: Austin Clements <[email protected]>
@gopherbot
Copy link
Contributor

CL https://golang.org/cl/33895 mentions this issue.

gopherbot pushed a commit that referenced this issue Dec 7, 2016
When we copy the stack, we need to adjust all BPs.
We correctly adjust the ones on the stack, but we also
need to adjust the one that is in g.sched.bp.

Like CL 33754, no test as only kernel-gathered profiles will notice.
Tests will come (in 1.9) with the implementation of #16638.

The invariant should hold that every frame pointer points to
somewhere within its stack.  After this CL, it is mostly true, but
something about cgo breaks it.  The runtime checks are disabled
until I figure that out.

Update #16638
Fixes #18174

Change-Id: I6023ee64adc80574ee3e76491d4f0fa5ede3dbdb
Reviewed-on: https://go-review.googlesource.com/33895
Reviewed-by: Austin Clements <[email protected]>
@bradfitz bradfitz modified the milestones: Go1.10Early, Go1.9Early May 3, 2017
@josharian
Copy link
Contributor

CL 43150 may also help speed up tracing; that list of hot functions looks familiar from when I was working on that CL.

@bradfitz bradfitz added early-in-cycle A change that should be done early in the 3 month dev cycle. and removed early-in-cycle A change that should be done early in the 3 month dev cycle. labels Jun 14, 2017
@bradfitz bradfitz modified the milestones: Go1.10Early, Go1.10 Jun 14, 2017
@gopherbot
Copy link
Contributor

Change https://golang.org/cl/33809 mentions this issue: runtime: use frame pointers for callers

@rsc rsc modified the milestones: Go1.10, Go1.11 Nov 22, 2017
@bradfitz bradfitz modified the milestones: Go1.11, Go1.12 Jun 19, 2018
@aclements aclements modified the milestones: Go1.12, Go1.13 Jan 8, 2019
@randall77 randall77 modified the milestones: Go1.13, Go1.14 May 6, 2019
@rsc rsc modified the milestones: Go1.14, Backlog Oct 9, 2019
@randall77
Copy link
Contributor

@danscales

@gopherbot
Copy link
Contributor

Change https://golang.org/cl/212301 mentions this issue: runtime: use frame pointers to implement runtime.callers()

@danscales
Copy link
Contributor

@dvyukov I took a look at this and it seems that if we want to get runtime.Callers() semantics fully correct (including a proper skip, based on any inlined frames due to mid-stack inlining), then we will lose a lot of the benefit of the optimization. See my message for my prototype code https://go-review.googlesource.com/c/go/+/212301

But then I noticed that actually the tracer and mprof are using a separate routine gcallers. We could possible only optimize gcallers(). In that case, would it be acceptible to not expand out the logical frames that are not physically there because of mid-stack inlining? Or do the inline expanding at CallerFrames()/Next() time only, but do the early skip processing based on physical frames?

One really rough set of numbers that I did for your benchmark (take it with a grain of salt) is 40% overhead currently [like your 55% overhead], then 24% overhead if we do the optimization but have to deal with inlined frames, and 6% overhead if we don't have to deal with inlined frames (just follow FP pointers and done).

@dvyukov
Copy link
Member Author

dvyukov commented Dec 21, 2019

Hi @danscales,

The most profitable and critical is to optimize tracer, maybe mprof.
Yes, expanding later is perfectly fine. And I think we already do this in trace. Inline frames don't have PCs anyway. So even if we expand and apply skip, we can't possibly shrink it back to an array of PCs.
I've added a bunch of tests for tracer that check that resulting stacks capture precisely what we want, including exact behavior of skip arguments. These should be useful, you could add more if you see other tricky cases.

40% overhead currently [like your 55% overhead]

Either standard unwinding become faster, or everything else become slower :)

and 6% overhead if we don't have to deal with inlined frames

This sounds great. I would rely on tests that we correctly deal with inlining/skip.

Thanks!

@danscales
Copy link
Contributor

danscales commented Dec 24, 2019

The most profitable and critical is to optimize tracer, maybe mprof.
Yes, expanding later is perfectly fine. And I think we already do this in trace. Inline frames don't have PCs anyway. So even if we expand and apply skip, we can't possibly shrink it back to an array of PCs.

The problem is that the semantics of Callers (and presumable gcallers) is that tne pcbuf slice must be filled up to its max (as possible) after skipping 'skip' frames, and the skipped frames must be counted including inlined frames. So, if we don't understand the inlining at the time we fill up pcbuf initially, we don't know exactly how many physical frames to skip, so we may not grab enough frames to fill up the max after we do the skip. (The easier thing would be to grab all physical frame pcs until we do the inlining later, but where do we store them all -- we only have pcbuf.) So, we need a slightly looser definition for skip and for filling in pcbuf for gcallers() if we are going to use the frame pointer optimization.

But I will proceed with trying to optimizing gcallers, and doing the inlining interpretation later, and see how things go with your tests.

@dvyukov
Copy link
Member Author

dvyukov commented Jan 7, 2020

We don't need these strict skip semantics here. We need to remove the non-interesting frames, how exactly does not matter. So instead of removing N inlined frames, we can remove M non-inlined frames. I would assume that the relevant runtime functions should not be inlined, so doing correct skip should be doable. Alternatively, we could consider changing what's being stripped a bit, if it makes things much simpler and faster on the unwinding side.

@felixge
Copy link
Contributor

felixge commented Jan 31, 2023

I uploaded CL 463835 which is similar to CL 33809 and tested it against BenchmarkPingPongHog.

Before:

goos: linux
goarch: amd64
pkg: runtime
cpu: Intel(R) Xeon(R) Platinum 8124M CPU @ 3.00GHz
               │ baseline.txt │           gentraceback.txt            │
               │    sec/op    │    sec/op     vs base                 │
PingPongHog-72    642.8n ± 4%   5616.5n ± 3%  +773.82% (p=0.000 n=10)

After:

goos: linux
goarch: amd64
pkg: runtime
cpu: Intel(R) Xeon(R) Platinum 8124M CPU @ 3.00GHz
               │ baseline.txt │               fp.txt                │
               │    sec/op    │   sec/op     vs base                │
PingPongHog-72    642.8n ± 4%   834.8n ± 1%  +29.87% (p=0.000 n=10)

I've also tested it against the sweet benchmarks and also got very good results, see gist.

I suspect some changes from CL 212301 would have to be absorbed to make this more robust, but hopefully this could be done without losing performance (I missed this CL when I started this work, otherwise I would have probably used it as my starting point).

Anyway, I'm willing to continue this work towards a production-ready patch. As far as the discussion above is concerned, I'm voting to make runtime/trace unwinding best effort: The more it can resemble gentraceback, the better. But small deviations are acceptable when it has a significant performance/simplicity benefits.

Some more information is available in this blog post (ignore the first half).

@gopherbot
Copy link
Contributor

Change https://go.dev/cl/463835 mentions this issue: runtime: frame pointer unwinding for tracer

@gopherbot
Copy link
Contributor

Change https://go.dev/cl/474916 mentions this issue: runtime: use regular unwinding for trace events from cgo callbacks

@gopherbot
Copy link
Contributor

Change https://go.dev/cl/475795 mentions this issue: runtime: add a benchmark of FPCallers

gopherbot pushed a commit that referenced this issue Mar 30, 2023
Change tracer to use frame pointer unwinding by default on amd64. The
expansion of inline frames is delayed until the stack table is dumped at
the end of the trace. This requires storing the skip argument in the
stack table, which now resides in pcBuf[0]. For stacks that are not
produced by traceStackID (e.g. CPU samples), a logicalStackSentinel
value in pcBuf[0] indicates that no inline expansion is needed.

Add new GODEBUG=tracefpunwindoff=1 option to use the old unwinder if
needed.

Benchmarks show a considerable decrease in CPU overhead when using frame
pointer unwinding for trace events:

GODEBUG=tracefpunwindoff=1 ../bin/go test -run '^$' -bench '.+PingPong' -count 20 -v -trace /dev/null ./runtime | tee tracefpunwindoff1.txt
GODEBUG=tracefpunwindoff=0 ../bin/go test -run '^$' -bench '.+PingPong' -count 20 -v -trace /dev/null ./runtime | tee tracefpunwindoff0.txt

goos: linux
goarch: amd64
pkg: runtime
cpu: Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz
               │ tracefpunwindoff1.txt │        tracefpunwindoff0.txt        │
               │        sec/op         │   sec/op     vs base                │
PingPongHog-32            3782.5n ± 0%   740.7n ± 2%  -80.42% (p=0.000 n=20)

For #16638

Change-Id: I2928a2fcd8779a31c45ce0f2fbcc0179641190bb
Reviewed-on: https://go-review.googlesource.com/c/go/+/463835
Reviewed-by: Michael Pratt <[email protected]>
Run-TryBot: Felix Geisendörfer <[email protected]>
TryBot-Result: Gopher Robot <[email protected]>
Reviewed-by: Michael Knyszek <[email protected]>
gopherbot pushed a commit that referenced this issue Mar 30, 2023
Introduce a new m.incgocallback field that is true while C code calls
into Go code. Use it in the tracer in order to fallback to the default
unwinder instead of frame pointer unwinding for this scenario. The
existing fields (incgo, ncgo) were not sufficient to detect the case
where a thread created in C calls into Go code.

Motivation:

1. Take advantage of a cgo symbolizer, if registered, to unwind through
   C stacks without frame pointers.
2. Reduce the chance of crashes. It seems unsafe to follow frame
   pointers when there could be C code that was compiled without frame
   pointers.

Removing the curgp.m.incgocallback check in traceStackID shows the
following minor differences between frame pointer unwinding and the
default unwinder when there is no cgo symbolizer involved.

    trace_test.go:60: "goCalledFromCThread": got stack:
        main.goCalledFromCThread
        	/src/runtime/testdata/testprogcgo/trace.go:58
        _cgoexp_45c15a3efb3a_goCalledFromCThread
        	_cgo_gotypes.go:694
        runtime.cgocallbackg1
        	/src/runtime/cgocall.go:318
        runtime.cgocallbackg
        	/src/runtime/cgocall.go:236
        runtime.cgocallback
        	/src/runtime/asm_amd64.s:998
        crosscall2
        	/src/runtime/cgo/asm_amd64.s:30

        want stack:
        main.goCalledFromCThread
        	/src/runtime/testdata/testprogcgo/trace.go:58
        _cgoexp_45c15a3efb3a_goCalledFromCThread
        	_cgo_gotypes.go:694
        runtime.cgocallbackg1
        	/src/runtime/cgocall.go:318
        runtime.cgocallbackg
        	/src/runtime/cgocall.go:236
        runtime.cgocallback
        	/src/runtime/asm_amd64.s:998

    trace_test.go:60: "goCalledFromC": got stack:
        main.goCalledFromC
        	/src/runtime/testdata/testprogcgo/trace.go:51
        _cgoexp_45c15a3efb3a_goCalledFromC
        	_cgo_gotypes.go:687
        runtime.cgocallbackg1
        	/src/runtime/cgocall.go:318
        runtime.cgocallbackg
        	/src/runtime/cgocall.go:236
        runtime.cgocallback
        	/src/runtime/asm_amd64.s:998
        crosscall2
        	/src/runtime/cgo/asm_amd64.s:30
        runtime.asmcgocall
        	/src/runtime/asm_amd64.s:848
        main._Cfunc_cCalledFromGo
        	_cgo_gotypes.go:263
        main.goCalledFromGo
        	/src/runtime/testdata/testprogcgo/trace.go:46
        main.Trace
        	/src/runtime/testdata/testprogcgo/trace.go:37
        main.main
        	/src/runtime/testdata/testprogcgo/main.go:34

        want stack:
        main.goCalledFromC
        	/src/runtime/testdata/testprogcgo/trace.go:51
        _cgoexp_45c15a3efb3a_goCalledFromC
        	_cgo_gotypes.go:687
        runtime.cgocallbackg1
        	/src/runtime/cgocall.go:318
        runtime.cgocallbackg
        	/src/runtime/cgocall.go:236
        runtime.cgocallback
        	/src/runtime/asm_amd64.s:998
        runtime.systemstack_switch
        	/src/runtime/asm_amd64.s:463
        runtime.cgocall
        	/src/runtime/cgocall.go:168
        main._Cfunc_cCalledFromGo
        	_cgo_gotypes.go:263
        main.goCalledFromGo
        	/src/runtime/testdata/testprogcgo/trace.go:46
        main.Trace
        	/src/runtime/testdata/testprogcgo/trace.go:37
        main.main
        	/src/runtime/testdata/testprogcgo/main.go:34

For #16638

Change-Id: I95fa27a3170c5abd923afc6eadab4eae777ced31
Reviewed-on: https://go-review.googlesource.com/c/go/+/474916
TryBot-Result: Gopher Robot <[email protected]>
Reviewed-by: Michael Pratt <[email protected]>
Reviewed-by: Michael Knyszek <[email protected]>
Run-TryBot: Felix Geisendörfer <[email protected]>
gopherbot pushed a commit that referenced this issue Mar 30, 2023
This allows comparing frame pointer unwinding against the default
unwinder as shown below.

goos: linux
goarch: amd64
pkg: runtime
cpu: Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz
                    │ callers.txt │
                    │   sec/op    │
Callers/cached-32     1.254µ ± 0%
FPCallers/cached-32   24.99n ± 0%

For #16638

Change-Id: I4dd05f82254726152ef4a5d5beceab33641e9d2b
Reviewed-on: https://go-review.googlesource.com/c/go/+/475795
TryBot-Result: Gopher Robot <[email protected]>
Run-TryBot: Felix Geisendörfer <[email protected]>
Reviewed-by: Michael Knyszek <[email protected]>
Reviewed-by: Cherry Mui <[email protected]>
@prattmic
Copy link
Member

@felixge it looks like one of these CLs is causing failures on solaris on https://build.golang.org/

https://build.golang.org/log/3ed4fadb9027af0908e4ca04ea104b69651ad68d

@felixge
Copy link
Contributor

felixge commented Apr 1, 2023

@prattmic tracking this in #59350 now. Will try to fix ASAP.

@gopherbot
Copy link
Contributor

Change https://go.dev/cl/481617 mentions this issue: runtime: improve getcallerfp comment

gopherbot pushed a commit that referenced this issue May 11, 2023
The previous name was wrong due to the mistaken assumption that calling
f->g->getcallerpc and f->g->getcallersp would respectively return the
pc/sp at g. However, they are actually referring to their caller's
caller, i.e. f.

Rename getcallerfp to getfp in order to stay consistent with this
naming convention.

Also see discussion on CL 463835.

For #16638

Change-Id: I07990645da78819efd3db92f643326652ee516f8
Reviewed-on: https://go-review.googlesource.com/c/go/+/481617
Reviewed-by: Michael Knyszek <[email protected]>
Reviewed-by: Cherry Mui <[email protected]>
Run-TryBot: Felix Geisendörfer <[email protected]>
TryBot-Result: Gopher Robot <[email protected]>
@gopherbot
Copy link
Contributor

Change https://go.dev/cl/494857 mentions this issue: runtime: rename getcallerfp to getfp

gopherbot pushed a commit that referenced this issue May 22, 2023
The previous name was wrong due to the mistaken assumption that calling
f->g->getcallerpc and f->g->getcallersp would respectively return the
pc/sp at g. However, they are actually referring to their caller's
caller, i.e. f.

Rename getcallerfp to getfp in order to stay consistent with this
naming convention.

Also see discussion on CL 463835.

For #16638

This is a redo of CL 481617 that became necessary because CL 461738
added another call site for getcallerfp().

Change-Id: If0b536e85a6c26061b65e7b5c2859fc31385d025
Reviewed-on: https://go-review.googlesource.com/c/go/+/494857
Reviewed-by: Michael Pratt <[email protected]>
TryBot-Result: Gopher Robot <[email protected]>
Reviewed-by: Cherry Mui <[email protected]>
Run-TryBot: Felix Geisendörfer <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
early-in-cycle A change that should be done early in the 3 month dev cycle. NeedsFix The path to resolution is known, but the work has not been done. Performance
Projects
None yet
Development

No branches or pull requests