
Commit 6a62d59

edits
felixge committed Apr 6, 2021
1 parent fb77221 commit 6a62d59
Showing 1 changed file with 3 additions and 3 deletions.
6 changes: 3 additions & 3 deletions stack-traces.md
@@ -141,13 +141,13 @@ There are various use cases for stack traces in Go, but they all end up hitting

Each frame lookup begins with the current `pc` which is passed to [`findfunc()`](https://github.com/golang/go/blob/go1.16.3/src/runtime/symtab.go#L671) which looks up the meta data for the function that contains the `pc`. Historically this was done using `O(log N)` binary search, but [nowadays](https://go-review.googlesource.com/c/go/+/2097/) there is a hash-map-like index of [`findfuncbucket`](https://github.com/golang/go/blob/go1.16.3/src/runtime/symtab.go#L671) structs that usually directly guides us to the right entry using an `O(1)` algorithm. So at this point the overall complexity is still the same as frame pointer unwinding, but it's worth noting that the constant overheads are already significantly higher.
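To make the bucket idea concrete, here is a minimal sketch of such a pc-to-function lookup. This is not the runtime's actual data layout; the `funcTable`/`funcInfo` types and the bucket scheme below are simplified stand-ins for what `findfunc()`/`findfuncbucket` do conceptually:

```go
// Conceptual sketch of a near-O(1) pc -> function lookup via a bucket index.
// The real runtime uses findfuncbucket with sub-buckets inside the module
// data; the types here are simplified stand-ins for illustration only.
package main

import "fmt"

type funcInfo struct {
	name  string
	entry uintptr // first pc of the function
}

type funcTable struct {
	minPC, maxPC uintptr
	bucketSize   uintptr    // pc range covered by one bucket
	buckets      []int      // bucket -> index of first func at or before that pc
	funcs        []funcInfo // sorted by entry pc
}

// findfunc maps a pc to its containing function: jump to the right bucket,
// then scan forward over the (few) functions that share the bucket.
func (t *funcTable) findfunc(pc uintptr) *funcInfo {
	if pc < t.minPC || pc >= t.maxPC {
		return nil
	}
	i := t.buckets[(pc-t.minPC)/t.bucketSize]
	for i+1 < len(t.funcs) && t.funcs[i+1].entry <= pc {
		i++
	}
	return &t.funcs[i]
}

func main() {
	t := &funcTable{
		minPC: 0x1000, maxPC: 0x3000, bucketSize: 0x1000,
		buckets: []int{0, 1},
		funcs:   []funcInfo{{"main.foo", 0x1000}, {"main.bar", 0x2000}},
	}
	fmt.Println(t.findfunc(0x2042).name) // main.bar
}
```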

The [_func](https://github.com/golang/go/blob/9baddd3f21230c55f0ad2a10f5f20579dcf0a0bb/src/runtime/runtime2.go#L825) meta data that we just retrieved contains a `pcsp` offset into the `pctab` table that maps program counters to stack pointer deltas. To decode this information, we call [`funcspdelta()`](https://github.com/golang/go/blob/go1.16.3/src/runtime/symtab.go#L903) which does a `O(N)` linear search over all program counters of the function until it finds the (`pc`, `sp delta`) pair were are looking for. For stacks with recursive call cycles, a tiny program counter cache is used to avoids doing lots of duplicated work.
The [_func](https://github.com/golang/go/blob/9baddd3f21230c55f0ad2a10f5f20579dcf0a0bb/src/runtime/runtime2.go#L825) meta data that we just retrieved contains a `pcsp` offset into the `pctab` table that maps program counters to stack pointer deltas. To decode this information, we call [`funcspdelta()`](https://github.com/golang/go/blob/go1.16.3/src/runtime/symtab.go#L903) which does an `O(N)` linear search over all program counters of the function until it finds the (`pc`, `sp delta`) pair we are looking for. For stacks with recursive call cycles, a tiny program counter cache is used to avoid doing lots of duplicated work.
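And a minimal sketch of that `pcsp` decoding step, assuming a simplified slice of (`pc`, `sp delta`) pairs instead of the runtime's compact varint-encoded `pctab` stream:

```go
// Simplified sketch of what funcspdelta conceptually does: scan the
// function's (pc, sp delta) pairs until we find the range containing
// targetpc. The real runtime decodes a compact varint stream from pctab
// instead of iterating over a slice like this.
package main

import "fmt"

// pcSP says: "delta is in effect from pc (inclusive) until the next entry's pc".
type pcSP struct {
	pc    uintptr
	delta int32
}

func spDeltaFor(table []pcSP, targetpc uintptr) (int32, bool) {
	delta, found := int32(-1), false
	for _, e := range table {
		if e.pc > targetpc {
			break
		}
		delta, found = e.delta, true // O(N) scan: last entry with pc <= targetpc wins
	}
	return delta, found
}

func main() {
	// Typical shape: zero delta in the prologue, the fixed frame size in
	// the body, zero again right before RET.
	table := []pcSP{{0x0, 0x0}, {0x22, 0x448}, {0xe2, 0x0}}
	d, _ := spDeltaFor(table, 0x80)
	fmt.Printf("sp delta at pc 0x80: %#x\n", d) // 0x448
}
```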

@prattmic (Apr 6, 2021)

This is accurate, though I'll note that pcvalue maintains a small cache for repeated lookups (TBH, I don't know what the hit rate is like). Also, AIUI the table doesn't contain an entry for every PC, just one entry each time the delta changes (so we look for an entry with targetpc < pc). This is still O(n), but there aren't quite so many entries as this makes it sound.

@felixge (Author, Apr 7, 2021)

> This is accurate, though I'll note that pcvalue maintains a small cache for repeated lookups (TBH, I don't know what the hit rate is like).

That's what I was trying to describe by saying "a tiny program counter cache is used to avoid doing lots of duplicated work". Do you think I should reword that?

> Also, AIUI the table doesn't contain an entry for every PC, just one entry each time the delta changes (so we look for an entry with targetpc < pc). This is still O(n), but there aren't quite so many entries as this makes it sound.

That's a good point, I didn't realize that not every pc is in the table, but that makes a lot of sense. Tried to reword it here: b022d6e

While discussing this, one thing I was wondering: Does the compiler guarantee an invariant that every pc has a 1:1 mapping to an sp delta? Or in other words, what happens if something is allocated on the stack inside an if statement, wouldn't the sp delta depend on whether that branch gets hit or not?

@prattmic (Apr 7, 2021)

> > This is accurate, though I'll note that pcvalue maintains a small cache for repeated lookups (TBH, I don't know what the hit rate is like).

> That's what I was trying to describe by saying "a tiny program counter cache is used to avoid doing lots of duplicated work". Do you think I should reword that?

No, that's fine, I'm not sure how I missed that. :)

> > Also, AIUI the table doesn't contain an entry for every PC, just one entry each time the delta changes (so we look for an entry with targetpc < pc). This is still O(n), but there aren't quite so many entries as this makes it sound.

> That's a good point, I didn't realize that not every pc is in the table, but that makes a lot of sense. Tried to reword it here: fb4cbbc

> While discussing this, one thing I was wondering: Does the compiler guarantee an invariant that every pc has a 1:1 mapping to an sp delta? Or in other words, what happens if something is allocated on the stack inside an if statement, wouldn't the sp delta depend on whether that branch gets hit or not?

Every function has a fixed frame size, based on the maximum amount of stack space required by all branches, so the offset should always be constant.
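To illustrate that point with a hedged, hypothetical example (not from this thread): even though `buf` below is only used on one branch, the compiler reserves its space in the fixed frame up front, so the body of the function has a single constant sp delta; `go tool compile -S` shows the actual frame size your toolchain picks.

```go
// Illustration of the "fixed frame size" point: buf only exists on one
// branch, but the frame is sized for the worst case, so there is a single
// sp delta for the whole body (prologue/RET aside). Hypothetical example.
package main

import "fmt"

//go:noinline
func branchy(big bool) int {
	if big {
		var buf [256]int // stack space reserved unconditionally in the frame
		for i := range buf {
			buf[i] = i
		}
		return buf[255]
	}
	return 0
}

func main() {
	fmt.Println(branchy(true), branchy(false))
}
```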

@felixge (Author, Apr 7, 2021)

> Every function has a fixed frame size, based on the maximum amount of stack space required by all branches, so the offset should always be constant.

Oh ... really? Then what's the point of having a pctab and not just putting this info directly in the _func meta data? I feel like I'm missing something 🙈

@prattmic (Apr 7, 2021)

That's a great question. Now I'm not sure if this just covers the function prologue (before SP is decremented), or if there are other special cases I'm not thinking of.

@prattmic (Apr 7, 2021)

That is certainly the normal case. Dumping the deltas [1], most functions look something like:

```
1: [0x0, 0x22): 0x0
1: [0x22, 0xe2): 0x448
1: [0xe2, 0xed): 0x0
```

The delta is zero in the prologue and morestack call (end of the function), and the frame size elsewhere.

[1] hack alert:

```diff
diff --git a/src/cmd/link/internal/ld/lib.go b/src/cmd/link/internal/ld/lib.go
index 4d5be30d82..6677002bde 100644
--- a/src/cmd/link/internal/ld/lib.go
+++ b/src/cmd/link/internal/ld/lib.go
@@ -2374,6 +2374,8 @@ func (sc *stkChk) check(up *chain, depth int) int {
        for pcsp.Init(ldr.Data(info.Pcsp())); !pcsp.Done; pcsp.Next() {
                // pcsp.value is in effect for [pcsp.pc, pcsp.nextpc).

+               fmt.Printf("%v: [%#x, %#x): %#x\n", s, pcsp.PC, pcsp.NextPC, pcsp.Value)
+
                // Check stack size in effect for this span.
                if int32(limit)-pcsp.Value < 0 {
                        sc.broke(up, int(int32(limit)-pcsp.Value))
```

@felixge (Author, Apr 8, 2021)

@prattmic huh, interesting! I had assumed that pctab can grow large, and based on this assumption made a benchmark showing how that impacts unwinding performance:

https://github.com/DataDog/go-profiler-notes/tree/main/examples/stack-unwind-overhead#result-2-on-function-size

Now that it seems like pctab will always contain just 3 entries, I'm wondering if my benchmark is wrong or if I need a new explanation.

I'll take a closer look at that, but if you have any ideas, please let me know!

@felixge (Author, Apr 8, 2021)

Mystery solved. My benchmark was flawed. I simulated "big functions" by relying on inlining, which meant I had up to 256 inlined functions in the function I was unwinding. So all the time ended up being spent in walking the inline tree 🙈.


Ok this means I'll have to revise my writing a bit. Bigger functions will not be slower to unwind unless they are full of inlined function calls, but that's probably a rare edge case.

Anyway, back to the observation that the pctab encodes just 3 entries: I assume that while this is true for most functions, there might be exceptions (if not currently, then in the past or perhaps in the future), so it's probably not a good idea to attempt to optimize this by inlining it into the _func struct.

Thank you so much for your great comments here!

@felixge (Author, Apr 8, 2021)

Related: I've just noticed two potential issues with the code walking the inline tree.

The first issue is that it seems to reuse the same cache that is used for the funcspdelta. My understanding is that this is a bug, because it ends up using the same cache for values from two different pc tables?

The second issue is that in my testing the `inldata != nil` predicate seems to trigger for pretty much every frame I'm unwinding, only to bail out with `ix < 0` right away. The useless `pcdatavalue` calls involved in this seem to eat up about 30% of overall unwinding time. That's bad, right?

@prattmic (Apr 8, 2021)

> Anyway, back to the observation that the pctab encodes just 3 entries

FWIW, it isn't always exactly 3 entries. e.g.,

```
1518 (runtime.adjusttimers): [0x0, 0x17): 0x0
1518 (runtime.adjusttimers): [0x17, 0x66): 0x78
1518 (runtime.adjusttimers): [0x66, 0x67): 0x0
1518 (runtime.adjusttimers): [0x67, 0x83): 0x78
1518 (runtime.adjusttimers): [0x83, 0x84): 0x0
1518 (runtime.adjusttimers): [0x84, 0x371): 0x78
```

The delta is 0 for each RET, of which there are multiple in this function. But generally it shouldn't be too many. I believe assembly can also have an arbitrary number of adjustments.

I'll take a look at the issues you noticed in a bit.

@prattmic (Apr 8, 2021)

> The first issue is that it seems to reuse the same cache that is used for the funcspdelta.

The key for the cache is the target PC and off, which is the start of the table we're doing a lookup in, i.e. the second argument to `pcvalue`:

runtime/symtab.go:

```
925:    fileno, _ := pcvalue(f, f.pcfile, targetpc, nil, strict)
926:    line, _ = pcvalue(f, f.pcln, targetpc, nil, strict)
940:    x, _ := pcvalue(f, f.pcsp, targetpc, cache, true)
974:    r, _ := pcvalue(f, pcdatastart(f, table), targetpc, cache, true)
982:    r, _ := pcvalue(f, pcdatastart(f, table), targetpc, cache, strict)
992:    return pcvalue(f, pcdatastart(f, table), targetpc, nil, true)
```

> The second issue is that in my testing the `inldata != nil` predicate seems to trigger for pretty much every frame I'm unwinding, only to bail out with `ix < 0` right away.

The FUNCDATA is laid out in a flat array, so any function with open-coded defers (https://cs.opensource.google/go/go/+/master:src/cmd/internal/objabi/funcdata.go;l=22;drc=a7e16abb22f1b249d2691b32a5d20206282898f2;bpv=1;bpt=1) will also be forced to include an (empty) InlTree. That said, I'm not seeing much difference in a quick benchmark of `runtime.Callers` with defers in the parents vs without. If you have better benchmarks, it may be worth opening an issue, as it may be possible to change the FUNCDATA indices to keep InlTree at the end if that helps things.

@felixge (Author, Apr 10, 2021)

> The key for the cache is the target PC and off, which is the start of the table we're doing a lookup in, i.e. the second argument to `pcvalue`:

I missed that! Subtle, but yeah, that works.

> If you have better benchmarks, it may be worth opening an issue, as it may be possible to change the FUNCDATA indices to keep InlTree at the end if that helps things.

I'm going to look at this in more detail now and let you know if I find anything.

Thanks for your patience and help!


Now that that we have the stack pointer delta, we we are almost ready to locate the next `return address (pc)` value of the caller and do the same lookup for it until we reach the "bottom" of the stack. But before that, we need to check if the current `pc` is part of one or more inlined function calls. This is done by checking the `_FUNCDATA_InlTree` data for the current `_func` and doing another linear search over the (`pc`, `inline index`) pairs that exist. Any inlined call found this way gets its own virtual stack frame `pc` added to the list. Then we continue with `return address (pc)` as mentioned in the beginning of the paragraph.
Now that we have the stack pointer delta, we are almost ready to locate the next `return address (pc)` value of the caller and do the same lookup for it until we reach the "bottom" of the stack. But before that, we need to check if the current `pc` is part of one or more inlined function calls. This is done by checking the `_FUNCDATA_InlTree` data for the current `_func` and doing another linear search over the (`pc`, `inline index`) pairs in that table. Any inlined call found this way gets its own virtual stack frame `pc` added to the list. Then we continue with `return address (pc)` as mentioned in the beginning of the paragraph.
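A hedged sketch of that inline expansion step, using made-up `inlinedCall`/`inlRange` types rather than the runtime's actual `_FUNCDATA_InlTree` encoding:

```go
// Conceptual sketch of inline expansion during unwinding: a single physical
// frame's pc may map to a chain of inlined calls, each of which gets its own
// virtual frame. The types here are simplified stand-ins for illustration.
package main

import "fmt"

type inlinedCall struct {
	funcName string
	parent   int // index of parent in the inline tree, -1 for the physical function
}

type inlRange struct {
	startPC, endPC uintptr
	index          int // index into the inline tree for pcs in [startPC, endPC)
}

// expandFrame returns the logical call chain (innermost first) for a pc
// inside the physical function physName.
func expandFrame(physName string, tree []inlinedCall, ranges []inlRange, pc uintptr) []string {
	idx := -1
	for _, r := range ranges { // another linear pc table scan
		if pc >= r.startPC && pc < r.endPC {
			idx = r.index
			break
		}
	}
	var frames []string
	for idx >= 0 {
		frames = append(frames, tree[idx].funcName)
		idx = tree[idx].parent
	}
	return append(frames, physName)
}

func main() {
	tree := []inlinedCall{{"main.inner", 1}, {"main.middle", -1}}
	ranges := []inlRange{{0x40, 0x60, 0}, {0x60, 0x80, 1}}
	fmt.Println(expandFrame("main.outer", tree, ranges, 0x50))
	// [main.inner main.middle main.outer]
}
```

The real runtime does this inside the same unwinding loop; the sketch just isolates the pc-to-inline-chain mapping.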

Putting it all together, for non-recursive call stacks without inlining, the complexity for `gopclntab` unwinding is `O(N*M)` where `N` is the number of frames on the stack, and `M` is the average size of the generated machine code per function. This can be validated [experimentally](https://github.com/DataDog/go-profiler-notes/tree/main/examples/stack-unwind-overhead), but in the real world I'd expect the average `N` and `M` to be fairly similar for most non-trivial Go applications, so unwinding a stack (without symbolization) will generally cost `1-10µs`. That being said, naive frame pointer unwinding appears to be [50x faster](https://github.com/felixge/gounwind), and does less cache thrashing, so high-resolution profiling and tracing use cases would likely benefit from seeing [support for it in the core](https://github.com/golang/go/issues/16638).
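If you want to sanity-check these numbers on your own machine, a quick (if crude) way is to benchmark `runtime.Callers` at a fixed stack depth; the depth of 32 and the recursion helper below are arbitrary choices, and the linked stack-unwind-overhead example is the more careful measurement:

```go
// Rough benchmark of gopclntab unwinding via runtime.Callers (no
// symbolization). Save as callers_test.go and run: go test -bench=Callers
// The stack depth of 32 and this recursion helper are arbitrary choices.
package main

import (
	"runtime"
	"testing"
)

//go:noinline
func deepStack(depth int, f func()) {
	if depth <= 0 {
		f()
		return
	}
	deepStack(depth-1, f)
}

func BenchmarkCallers(b *testing.B) {
	pcs := make([]uintptr, 64)
	deepStack(32, func() {
		for i := 0; i < b.N; i++ {
			runtime.Callers(1, pcs) // walks the stack using .gopclntab data
		}
	})
}
```

Dividing the per-op time by the number of frames gives a rough per-frame unwinding cost to compare against the range quoted above.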

Last but not least, it's worth noting that Go ships with two `.gopclntab` implementations. In addition to the one I've just described, there is also the [debug/gosym](https://golang.org/pkg/debug/gosym/) package that implements some of the same code as `runtime/symtab.go` and seems to be used by the linker, `go tool addr2line` and others. If you want, you can use it yourself in combination with [debug/elf](./examples/pclnttab/linux.go) or ([debug/macho](./examples/pclnttab/darwin.go)) to go on your [gopclntab adventures](./examples/pclnttab).
Last but not least, it's worth noting that Go ships with two `.gopclntab` implementations. In addition to the one I've just described, there is another one in the [debug/gosym](https://golang.org/pkg/debug/gosym/) package that seems to be used by the linker, `go tool addr2line` and others. If you want, you can use it yourself in combination with [debug/elf](./examples/pclnttab/linux.go) or [debug/macho](./examples/pclnttab/darwin.go) as a starting point for your own [gopclntab adventures](./examples/pclnttab) for good or [evil](https://tuanlinh.gitbook.io/ctf/golang-function-name-obfuscation-how-to-fool-analysis-tools).
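As a hedged starting point, a minimal Linux sketch using `debug/gosym` with `debug/elf` might look like this; error handling is trimmed, the binary path is hypothetical, and the section names assume a default (non-stripped) build:

```go
// Minimal sketch of symbolizing an address with debug/gosym + debug/elf.
// Assumes a standard Linux ELF build; see the examples/pclnttab code in
// this repo for a more complete version.
package main

import (
	"debug/elf"
	"debug/gosym"
	"fmt"
	"log"
)

func main() {
	f, err := elf.Open("/path/to/some/go/binary") // hypothetical path
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	pclnSec := f.Section(".gopclntab")
	if pclnSec == nil {
		log.Fatal("no .gopclntab section (stripped or non-standard build?)")
	}
	pclntab, err := pclnSec.Data()
	if err != nil {
		log.Fatal(err)
	}

	var symtab []byte
	if s := f.Section(".gosymtab"); s != nil {
		symtab, _ = s.Data() // often empty in modern binaries; the line table is enough
	}

	textAddr := f.Section(".text").Addr
	table, err := gosym.NewTable(symtab, gosym.NewLineTable(pclntab, textAddr))
	if err != nil {
		log.Fatal(err)
	}

	// Resolve an arbitrary pc inside .text back to file, line and function.
	pc := textAddr + 0x100 // hypothetical pc
	file, line, fn := table.PCToLine(pc)
	if fn != nil {
		fmt.Printf("%#x -> %s:%d (%s)\n", pc, file, line, fn.Name)
	}
}
```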

### DWARF

