Implement compact unwinding info #372

Gankra · 2021-04-26T23:23:30Z

This is a WIP implementation of #368. It appears to generate reasonable-looking STACK CFI records, though I haven't tested them yet.

A few major questions that need feedback from the symbolic folks:

Where should this implementation go (gimli, goblin, another symbolic- crate..?)
What should this implementation use as the basis for its parsing/error-handling (currently using gimli for everything, bad choice for the errors)
How should this handle conflicting STACK CFI records between the DWARF CFI sections and the compact unwinding sections? This is not theoretical -- this happens on e.g. libmozglue in release firefox distributions.

The conflicting records has been kind of useful for being able to see that my output does indeed seem reasonable, as it largely agrees with the DWARF CFI-based STACK CFI records. Although the compact unwind info entries seem to ignore properly adjusting .cfa during the function prelude, where registers are being saved. e.g. DWARF CFI produces this:

STACK CFI INIT c90 75 .cfa: $rsp 8 + .ra: .cfa -8 + ^
STACK CFI c91 .cfa: $rsp 16 +
STACK CFI c93 .cfa: $rsp 24 +
STACK CFI c94 .cfa: $rsp 32 + $rbx: .cfa -32 + ^ $r14: .cfa -24 + ^ $rbp: .cfa -16 + ^

STACK CFI INIT d10 2fa .cfa: $rsp 8 + .ra: .cfa -8 + ^
STACK CFI d11 .cfa: $rsp 16 +
STACK CFI d13 .cfa: $rsp 24 +
STACK CFI d15 .cfa: $rsp 32 +
STACK CFI d17 .cfa: $rsp 40 +
STACK CFI d19 .cfa: $rsp 48 +
STACK CFI d1a .cfa: $rsp 56 +
STACK CFI d1e .cfa: $rsp 160 + $rbx: .cfa -56 + ^ $r12: .cfa -48 + ^ $r13: .cfa -40 + ^ $r14: .cfa -32 + ^ $r15: .cfa -24 + ^ $rbp: .cfa -16 + ^

While compact unwind info produces this (note how the final STACK CFI records of the above agree with the below):

STACK CFI INIT c90 80 .cfa: $rsp 32 + .ra: .cfa -8 + ^ $rbp: .cfa -16 + ^ $r14: .cfa -24 + ^ $rbx: .cfa -32 + ^

STACK CFI INIT d10 300 .cfa: $rsp 160 + .ra: .cfa -8 + ^ $rbp: .cfa -16 + ^ $r15: .cfa -24 + ^ $r14: .cfa -32 + ^ $r13: .cfa -40 + ^ $r12: .cfa -48 + ^ $rbx: .cfa -56 + ^

This is a bit funky, but also kinda reasonable since these preludes necessarily shouldn't show up in a backtrace (except for maybe in the top frame, but then that frame is probably not very interesting to someone debugging a crash).

Gankra · 2021-04-26T23:27:47Z

I highly recommend pulling this PR and running cargo doc --open to get a nice pretty formatted version of the ~700 line comment describing the format.

Gankra · 2021-04-27T00:51:05Z

Potentially important empirical observation: the conflicting info seems to come from specifically the eh_frame section, which otherwise contains seemingly nothing else, so it's possible we can Just Work by doing:

if unwind_info {
  use_unwind_info()
} else if eh_frame {
  use_eh_frame()
}

Gankra · 2021-04-27T01:13:12Z

Pushed up that version for people to test out

jan-auer · 2021-04-27T12:27:06Z

Thanks, @Gankra! I'll need some time to get to reviewing this in detail.

Where should this implementation go

Apart from the Breakpad output, this would be well suited to be added to goblin::mach. It would be similar to how this has been done for PE exception data / unwind info.

So far, we kept debugging functionality in symbolic, while trying to upstream file format support as much as possible. If for any reason it is not an option to upstream to goblin, my second choice would be somewhere in symbolic. Our plan is to expose unwind info in a generalized form in symbolic_debuginfo::ObjectLike.

What should this implementation use as the basis for its parsing/error-handling

Going with my above argument around goblin, what are your thoughts on using scroll?

How should this handle conflicting STACK CFI records between the DWARF CFI sections and the compact unwinding sections?

I'll try to read up on why those differences might exist. If eh_frame provides strictly higher quality unwind information, you could consume both together and then choose one over the other on a per-record basis. Do you see a reason to prefer unwind_info over eh_frame? It's probably worth noting that __unwind_info already is a combination of __compact_unwind and __eh_frame according to here. IIRC, there were cases where the information stored in the debug_frame and eh_frame sections were disjunct but overlapping and it could be the same here, but seems highly unlikely.

The rest is guesswork: It's extremely unlikely but not entirely impossible that the function prelude crashes a program. Maybe, compact_unwind is optimized for cases where the program needs to unwind, i.e. exception handling, but not hard crashes like stack overflows.

loewenheim · 2021-04-27T12:54:04Z

While compact unwind info produces this (note how the final STACK CFI records of the above agree with the below):

STACK CFI INIT c90 80 .cfa: $rsp 32 + .ra: .cfa -8 + ^ $rbp: .cfa -16 + ^ $r14: .cfa -24 + ^ $rbx: .cfa -32 + ^

STACK CFI INIT d10 300 .cfa: $rsp 160 + .ra: .cfa -8 + ^ $rbp: .cfa -16 + ^ $r15: .cfa -24 + ^ $r14: .cfa -32 + ^ $r13: .cfa -40 + ^ $r12: .cfa -48 + ^ $rbx: .cfa -56 + ^

I have two, possibly stupid, questions:

Why do the compact records each cover 5 bytes more than the “normal” init records?
How can there be a single rule for computing .cfa (in the second example, .cfa = $rsp + 160) for the entire memory block from d10 to 300?

Gankra · 2021-04-27T13:14:33Z

It's probably worth noting that __unwind_info already is a combination of __compact_unwind and __eh_frame according to here.

I believe __unwind_info is strictly compact, and the comment at the start of the file is discussing the DWARF opcode inside of compact unwinding, which is just "this is too complicated, use the DWARF" and a 24-bit pointer to the FDE in the __eh_frame that has that info. I don't think this happens much, so a first implementation can just ignore these.

Why do the compact records each cover 5 bytes more than the “normal” init records?

I believe these are ~padding bytes in the binary which CFI carefully doesn't map, but Compact Unwinding incidentally maps because it doesn't care about having incorrect mappings for places that don't matter (the range covered by an opcode is only implicitly defined by where the next opcode starts, so there were necessarily be an entry for every address in the binary, although for our purposes there can still be holes because we don't emit STACK CFI entries for some of the opcodes).

How can there be a single rule for computing .cfa (in the second example, .cfa = $rsp + 160) for the entire memory block from d10 to 300?

It basically just ignores support for the function prologue where callee-saved registers are being individually PUSHed to the stack. I believe @jan-auer is correct in describing the format as:

optimized for cases where the program needs to unwind, i.e. exception handling, but not hard crashes like stack overflows.

That said, in Firefox we're seeing that this is the only unwinding info for most of our x64 macos builds, so it's better than nothing. And it's still the case that most stack overflows will happen after the registers are saved (as the rest of the frame is usually bulk-allocated upfront).

Gankra · 2021-04-27T13:15:03Z

(force-push was just updating some of the comments with more details I've found)

loewenheim · 2021-04-27T13:16:42Z

Thank you for the explanations!

Gankra · 2021-04-27T13:36:18Z

(more comment cleanup)

Gankra · 2021-04-27T13:51:39Z

Starting to look into what this would look like if defined in goblin but will need a while to acclimate to the structure of the library. Definitely wouldn't complain if this is something @jan-auer was willing to figure out for me, but will be proceeding on the assumption that I need to figure it out myself.

Gankra · 2021-04-27T15:47:10Z

force-push: tweaked the structure to still allow for spidering the structure in ARM64 even though we don't support decoding its opcodes, because the objdump-style "dump" operation doesn't need that. This also brings the structure closer to what it would be in its final form, properly stubbing out space for where the ARM impl would go (and properly marking all the x86-exclusive stuff as such).

Also everything that should be an Err instead of a panic should be now.

Swatinem · 2021-04-28T15:00:44Z

@Gankra this is awesome stuff! Just read the rustdocs and will try to decode some of these unwind entries by hand, which are not being handled by either symbolic master or your branch yet:

(output from llvm-objdump -u)

(from libsystem_kernel.dylib)
  Common encodings: (count = 8)
    encoding[0]: 0x01010001
    encoding[1]: 0x00000000
[…]
function offset=0x00006fd4, encoding[1]=0x00000000

(from libsystem_platform.dylib)
  Common encodings: (count = 8)
    encoding[0]: 0x010558d1
[…]
function offset=0x00003d60, encoding[63]=0x04000a90

I will continue reviewing this tomorrow as well :-)

Gankra · 2021-04-28T15:57:07Z

force-push:

moved the implementation to public module symbolic_debuginfo::macho::compact
removed all the extra setup types, now you directly create a CompactUnwindInfoIter from the section
changed the impl to use scroll/goblin APIs, so that it can be ported to goblin with minimal effort
- minor exceptions: used some symbolic stuff at the API boundary to make integration simpler
- but convert to only goblin/internal types immediately to minimize impl contamination
exposed it through symbolic_debuginfo::macho::MachObject::compact_unwind_info()
- had to do a kinda gross hack where I "unwrap" a Cow::Borrow to get the lifetimes to work :(
renamed CfiOp and CfiRegister to CompactCfiOp and CompactCfiRegister to minimize namespace pollution
added AsciiCfiWriter::process_macho which handles the compact_unwind_info format
changed AsciiCfiWriter::process_dwarf to include a skip_eh_frame flag to maintain my current hack (may revert)

Gankra · 2021-04-28T15:57:58Z

@Swatinem could you elaborate on your comment? What isn't my branch handling?

Edit: ah, you found an instance of "old" and "dwarf"

Gankra · 2021-04-28T19:18:22Z

Added the dwarf mode.

Some inspection of the functions that are being mapped to "old" (0x00000000) mode suggests that it's basically "there is no info", as they seem to be hand written assembly routines (jsimd_rgb_ycc_convert_avx2, lucet_context_bootstrap, ...).

Gankra · 2021-04-30T19:09:24Z

Added tests

Gankra · 2021-04-30T22:02:43Z

I've added a section at the end of the documentation discussing all the corner cases that I was able to think of in the format, and how the implementation handles them.

Gankra · 2021-05-03T15:55:05Z

I've replaced the Vec for CompactCfiOps with a little adhoc ArrayVec impl to avoid allocating and avoid actually depending on ArrayVec.

With this I regard the implementation no longer "WIP" and something that could land, modulo how you feel about the outstanding TODOs:

I expose entry_for_address publicly but it's just unimplemented!() and will crash if anyone uses it. I can implement it if you really want, or I can comment it out and remove references to it from the docs.
There is no ARM64 opcode impl. This is not needed for the current usecase as Firefox does not target iOS, although presumably it's useful for M1 macs? I could roll up my sleeves and do it but I am Tired of working on this so I am defaulting to not bothering until someone tells me to.
The x86 "stackless indirect" opcode is not implemented. This is needed for >2K stack frames. It does not appear to occur in Firefox's binary. It's possible llvm/clang avoid emitting it at all because it was actually briefly misimplemented. Implementing this opcode is otherwise annoying because it requires access to the actual function's instructions to decode a constant embedded in a subq/subl instruction. This would require some breaking API changes (passing in the full binary..?) I haven't really thought through, so it's possible you'll want to figure this out now just to have those API changes done. Again, Tired, so not bothering unless told.
I don't touch the LSDA/personality stuff at all, because that's only needed for runtime unwinding (libunwind). There's some sketches and gestures towards
Although there isn't really any reason to use this code directly, I make the module public so that its enormous top-level doc-comment is published somewhere. I could see you disliking that.

jrmuizel · 2021-05-03T17:14:38Z

The x86 "stackless indirect" opcode is not implemented. This is needed for >2K stack frames. It does not appear to occur in Firefox's binary. It's possible llvm/clang avoid emitting it at all because it was actually briefly misimplemented. Implementing this opcode is otherwise annoying because it requires access to the actual function's instructions to decode a constant embedded in a subq/subl instruction. This would require some breaking API changes (passing in the full binary..?) I haven't really thought through, so it's possible you'll want to figure this out now just to have those API changes done. Again, Tired, so not bothering unless told.

FWIW, having access to the full binary is also useful when doing stack scanning because you can check that potential return addresses are proceeded by call instructions.

Gankra · 2021-05-03T17:43:39Z

You're not doing stack scanning in the context of dumping the CFI into the breakpad format, which is the usecase here. It's not hard to get the binary (I don't think?), just an API change from what I have now (CompactUnwindInfoIter::new would take it as an (optional?) input).

jrmuizel · 2021-05-03T17:45:20Z

Ah right, this is the dumping not the unwinding. I was confused.

Swatinem

This looks awesome! I have been looking through the docs and the implementation. I only skimmed the tests superficially since I’m super tired right now.

In general:

The docs could use a few intra-doc links ;-)
The iterator implementation is a bit hard to follow. How about we actually have a two-level iterator internally (not exposed to the outside), that would make it a bit easier to follow I think. If I understood correctly, the first level pages are marked by a sentinel, and the second level pages have a fixed size.
Instead of having separate methods to get each of the pieces from the encoding, such as x86_frameless_stack_size and similar with their own shift and mask inside, I wish that would be inlined into x86_instructions, together with a bit of ascii art to visually show what the exact pieces are that you are extracting via shift&mask. Maybe a bit along these lines: https://github.com/getsentry/sentry-native/blob/3fdaa5d5eeb306fa2d13ca66ec25e656960d19b8/src/sentry_value.c#L38-L45
Please avoid using unimplemented,unreachable or similar panics. I vaguely remember that we actually hit one in production at some time. Rather just return an error or comment the function out.

In general I would be happy to land this rather soon, maybe even as crate-private allow(dead_code). That way we can improve this in incremental PRs. I also haven’t really reviewed this for its API surface, regrading what we want to expose, what needs to be #[non_exhaustive], etc. Would like to get another opinion from the team on this. Anything that would make this easier to review incrementally would be nice.

I will be looking through your goblin PR soon, though I have no authority on that repo :-D

Gankra · 2021-05-06T01:55:07Z

Realized ARM64 opcodes were really simple, so I added support for them (not tested on real-world data, don't have an M1 to test against).

The docs could use a few intra-doc links ;-)

Spent so long without them I never think to add them now that they exist. Will do.

Please avoid using unimplemented,unreachable or similar panics. I vaguely remember that we actually hit one in production at some time. Rather just return an error or comment the function out.

I removed all of them but one which is basically a ward against someone using the internal pointer_size function with an unknown architecture (currently can't happen). Most notably I have now fully commented out the theoretical API entry point and marked its example as ignore.

Instead of having separate methods to get each of the pieces from the encoding, such as x86_frameless_stack_size and similar with their own shift and mask inside, I wish that would be inlined into x86_instructions.

While I'm usually a big proponent of megafunctions inlining everything they can, in this case I found it a bit too distracting, preferring instead to keep *_instructions mostly focused on the higher-level semantics. Also a few of these do genuinely get some reuse.

The iterator implementation is a bit hard to follow. How about we actually have a two-level iterator internally (not exposed to the outside), that would make it a bit easier to follow I think.

Hmm, I'll think about this. This code is a bit finicky because of the requirement that we need to peek at the next item and handle the sentinel as if it were an item in a peek but never fully return it.

Swatinem · 2021-05-06T08:28:34Z

I have been looking at the CFI snapshot changes.
Most of them are actually more CFI, which is great!
In one case the covered range was increased, I think thats the case you mentioned about function padding being covered as well.
And in another case a few consecutive records were folded into one, which also looks great.

clippy is being quite pedantic though :-D

Gankra · 2021-05-06T12:22:22Z

I fixed up the clippy complaints, supressing the one complaining about next being used as a method name.

I added intradoc links, an also fixed up some old gunk in the docs I missed.

glandium was gracious enough to run this on an ARM64 macos XUL and the output seems good?

loewenheim

Please put impl blocks next to their types.

loewenheim · 2021-05-06T08:24:46Z