-
Notifications
You must be signed in to change notification settings - Fork 17.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
runtime: CL 561635 changes relative PCs printed in tracebacks #65761
Comments
Thanks for the throrough analysis. Yes, I (actually x/telemetry) need PCs only to compute func-relative line numbers, so option 2 would work. But I expect other telemetry systems will surely want as much information as they can get, especially if it relieves them of the need to (a) fork a child process with the same pclntab to decode the message printed via SetCrashOutput and (b) subtract the relative text segment offset from all addresses. A structured complete dump would address that. One concern: GOTRACEBACK=foo can increase the level but not decrease it, so a crash monitor would want "structured" to be the maximum value, perpetually, in case a user inadvertently lowers the value, causing the crash reports to be useless to the monitor. Or perhaps we should change SetCrashOutput to always emit structured output before the API freezes. I think this is my favorite option, but it's definitely more work on both sides. I'd be willing to implement it. I don't think it's that hard if we can agree on the schema. |
@adonovan , I think I'm not clear on how it's using even the absolute PCs to get relative line numbers. Doesn't it still have to parse the functab to get function start lines? Or is it assuming the symbolic information for the entry PC of a function represents its first line? What do you think of options 1a or 1b? If we're considering option 2, then we're already open to tweaking the traceback text format, and 1a/1b avoid a lot of gross work on the telemetry side. Structured tracebacks will be a fair amount of work. We may want to make sure the |
It parses the absolute PCs out of the SetCrashOutput text, subtracts the text segment offset (difference between parent and child process, same executable), then formats the stack using the same runtime.CallersFrames-based function that is used in the ordinary "grab current stack and increment counter" workflow. The function-relative line number is computed using
I suspect most users won't want to see relative line numbers in the traceback, and will want to see absolute ones, so there's some redundancy there. (Relative numbers can easily be confused for absolute ones, though I recently changed x/telemetry to include the explicit sign But I think options 1a and 1b may be overfitting to the current rather ad hoc needs of x/telemetry, and that other telemetry systems (including future variants of ours) may want more information, including (de-ASLR'd) program counters plus whatever else can be scraped out of CallersFrames--in essence, the full structured dump. |
Adding release-blocker because the change as it stands has lost information about the PC. |
Here's a starting point for a schema for a structured dump. JSON seems like the obvious syntax; I don't think it should be that hard to encode in a single pass to a file descriptor. I chose the schema below to avoid a commitment that every non-leaf frame has a PC, since (as you've mentioned), it's a fiction, and the non-leaf PCs are not all CALLs. The non-leaf PCs are in most cases just a means to obtain File/Line/etc information, but since that's provided in the dump, the issue is moot. [Edit: the diff shows the evolution of the first draft during subsequent discssion.] // A Crash describes a runtime crash.
type Crash struct {
Message string // fatal message or panic value
PanicType string // operand type of panic(x) call; "" => not a panic
- TextOffset uintptr // ASLR offset of text segment
+ RuntimeText uintptr // address of runtime.text symbol (for un-applying ASLR)
Goroutines []Goroutines // set of all goroutines
}
// A Goroutine describes a goroutine that was live at the moment of the crash.
type Goroutine struct {
ID int // unique identifier
State string // running, runnable, wait, etc.
Stack []Frame // stack of active Go function calls, outermost first
}
// A Frame is a logical stack frame entry.
// The last frame in the Stack is the "leaf".
type Frame struct {
// information about the logical function
Function string // canonical name of function (e.g "pkg.T.f"); "" if unknown
Filename string // name of file declaring function; "" if unknown
EntryPC uintptr // text segment address of function entrypoint (is this meaningful if Inlined??)
StartLine int // line number of declaration's 'func' token; 0 if unknown
// information about the goroutine's position within this function
PC uintptr // text segment address of current PC (if leaf), active call (if callee is not inlined), or 0 if callee inlined
Line int // line number of PC (if leaf) or active call (non-leaf)
Inlined bool // this function was inlined (=> parent Frame has no PC)
} We needn't gate the above issue on implementing a full structured dump, but I'd be willing to invest a day to get a prototype together to see if this looks like the right approach. Let me know what you think. |
The address of text segment is pretty hard to get in a portable way. Maybe we could use the address of the
For inlined frames, the entry PC is probably not meaningful. As the On the other hand, how would one compute relative PC (assuming it is useful)? Say, G is inlined to F, G crashes, we'd have
G
So we'd need |
Sounds good. RuntimeText uintptr // address of runtime.text symbol (for un-applying ASLR)
It's not obvious to me that it is useful. The virtual call (NOP) PCs were a means to an end (file/line/etc) that would now be served directly. |
This comment in traceback makes me wonder whether JSON is the right encoding, since a truncated JSON document is not a JSON document:
Is this rare enough in practice not to matter? |
Traceback crashing is usually a runtime bug or memory corruption e.g. due to misuse of unsafe or cgo. If we say that this feature is mostly for debugging user code, not runtime bugs, then maybe we could say it is rare. |
Change https://go.dev/cl/571676 mentions this issue: |
Change https://go.dev/cl/571797 mentions this issue: |
The JSON should use the same field names as exported runtime data structures and methods whenever possible. This means s/Filename/File/ and s/EntryPC/Entry/. |
I like the general shape of this, but have two big comments, both of which take inspiration from
|
@prattmic and I were just discussing this; like @cherrymui he thought truncation would arise only very rarely due to obscure runtime bugs (e.g. signal race with vdso), and that losing those records wouldn't stop application developers from getting good coverage of their bugs from field reports. (Of course, the runtime team might have a different valuation of exactly those bug reports.)
By "everything", do you mean all the strings printed during the traceback operation---the unstructured crash report? That would be redundant with the structured report, though perhaps slightly easier to read. (I can't imagine you are proposing that the entire process's stderr, even before the crash, would be JSON-framed and emitted to the crash fd.) |
Also, that would require buffering, and we can't be allocating at this point. |
Sorry, yes, I mean just the traceback. I'm thinking, if we tie this to
It's not straightforward, but it is doable. One way I can think of (there may be simpler ways) is that traceback would construct up a packet of information representing the frame to be printed and that can be printed as many times and different ways as necessary. There's a strictly bounded amount of information at that level. It only becomes unbounded and difficult to work with once you actually start emitting the strings. |
It would be much simpler to define a String method on Crash that formats the (decoded) crash report in a similar (but not identical) form. Would that suffice?
That's essentially the approach I was thinking of in any case, to avoid the need to duplicate the sprinkling of two kinds of print statements throughout the traceback algorithm: each packet would be (a) |
This reverts commit 643d816 (CL 561635). Reason for revert: This works for telemetry but broke various other properties of the tracebacks as well as some programs that read tracebacks. We should figure out a solution that works for all uses, and in the interim we should not be making telemetry work at the cost of breaking other, existing valid uses. See #65761 for details. Change-Id: I467993ae778887e5bd3cca4c0fb54e9d44802ee1 Reviewed-on: https://go-review.googlesource.com/c/go/+/571797 Auto-Submit: Russ Cox <[email protected]> LUCI-TryBot-Result: Go LUCI <[email protected]> Reviewed-by: Austin Clements <[email protected]> Reviewed-by: Cherry Mui <[email protected]>
Change https://go.dev/cl/571656 mentions this issue: |
This CL changes the test expectations to match the runtime behavior after the revert of CL 561635, which caused the stack to emit program counters that are valid for CallersFrames. However, CallersFrames' expectations are... idiosyncratic and it was considered undesirable to transfer them to the runtime traceback format. Unfortunately that means that telemetry for crash reports will drop inline frames. Hopefully this is only temporary. Updates golang/go#65761 Change-Id: I0f8601cbb1328715f357249b91bd21941860bd0e Reviewed-on: https://go-review.googlesource.com/c/telemetry/+/571656 LUCI-TryBot-Result: Go LUCI <[email protected]> Reviewed-by: Michael Matloob <[email protected]> Auto-Submit: Alan Donovan <[email protected]>
I'm not sure I understand what this would mean. What's |
Crash as defined here. Yes, that's what I had in mind.
Isn't the purpose of a structured dump to make well-documented promises about the schema of what the reader can expect, just as we do for (say) |
With Russ's CLs the original breaking change is resolved, so I think this doesn't need to be a release blocker now. |
The issue with relative PCs has long since been resolved by backing out CL 561635 and tweaking runtime.CallersFrames to work with the PCs that are printed in the traceback. #67182 is going to add an options struct to debug.SetCrashOutput that we can use in the future to add structured crash output. Closing this as completed. We'll have a follow-on issue to propose structured crash output. |
This CL changes the test expectations to match the runtime behavior after the revert of CL 561635, which caused the stack to emit program counters that are valid for CallersFrames. However, CallersFrames' expectations are... idiosyncratic and it was considered undesirable to transfer them to the runtime traceback format. Unfortunately that means that telemetry for crash reports will drop inline frames. Hopefully this is only temporary. Updates golang/go#65761 Change-Id: I0f8601cbb1328715f357249b91bd21941860bd0e Reviewed-on: https://go-review.googlesource.com/c/telemetry/+/571656 LUCI-TryBot-Result: Go LUCI <[email protected]> Reviewed-by: Michael Matloob <[email protected]> Auto-Submit: Alan Donovan <[email protected]>
The intent of CL 561635 was to add absolute
pc=0x%x
s to inlined frames in tracebacks when GOTRACEBACK is set tosystem
or higher. The ultimate goal of this was to be able to compute function-relative line numbers for telemetry reports.Unfortunately, it had the side-effect of changing the relative PCs printed in all traceback modes. Specifically, if f calls g calls h and g and h are inlined into f, we used to print the "innermost" relative PC on the f frame, even though that technically points into h's body, and not print relative PCs on g or h, like this:
where 0x26 points to an instruction in the inlined copy of h.
With @adonovan 's change, we now print the "outermost" relative PC on the f frame, which is a NOP marker for where g was inlined. This is, in some sense, more consistent with the absolute PCs that can be printed, but in the common case it loses information.
There are a few hard constraints here:
And several nice-to-haves:
GOTRACEBACK=system
traceback.GOTRACEBACK=system
.I think @adonovan 's changes to traceback do satisfy 1–5 and 7, but violate 6.
Here are the options that come to my mind:
We could revert the change. Then we satisfy 1–4 and 6, but violate 5 and 7. Then we're back at the original problem that telemetry had. We have a few alternatives for telemetry to get relative line numbers:
a. We just start printing relative line numbers in the traceback.
b. 1a but gated by a GOTRACEBACK setting (orthogonal to its current settings)
c. An option for structured tracebacks.
We include meaningful relative PCs for every frame, including inlined frames. Something like
To reduce line noise, I've stripped the package path from the outer symbol printed for the inlined frames. We could maybe reduce it even further by just saying
...+0x26
orparent+0x26
or something. I think this satisfies all of the constraints and nice-to-haves.Options 1a and 1b are pretty straightforward in the runtime.
Option 1c is probably a good idea, but is also a whole endeavor.
Option 2 is a heavier lift but arguably more principled.
The text was updated successfully, but these errors were encountered: