Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement compact unwinding info #372

Merged
merged 13 commits into from
May 18, 2021
Merged

Implement compact unwinding info #372

merged 13 commits into from
May 18, 2021

Conversation

Gankra
Copy link
Contributor

@Gankra Gankra commented Apr 26, 2021

This is a WIP implementation of #368. It appears to generate reasonable-looking STACK CFI records, though I haven't tested them yet.

A few major questions that need feedback from the symbolic folks:

  • Where should this implementation go (gimli, goblin, another symbolic- crate..?)
  • What should this implementation use as the basis for its parsing/error-handling (currently using gimli for everything, bad choice for the errors)
  • How should this handle conflicting STACK CFI records between the DWARF CFI sections and the compact unwinding sections? This is not theoretical -- this happens on e.g. libmozglue in release firefox distributions.

The conflicting records has been kind of useful for being able to see that my output does indeed seem reasonable, as it largely agrees with the DWARF CFI-based STACK CFI records. Although the compact unwind info entries seem to ignore properly adjusting .cfa during the function prelude, where registers are being saved. e.g. DWARF CFI produces this:

STACK CFI INIT c90 75 .cfa: $rsp 8 + .ra: .cfa -8 + ^
STACK CFI c91 .cfa: $rsp 16 +
STACK CFI c93 .cfa: $rsp 24 +
STACK CFI c94 .cfa: $rsp 32 + $rbx: .cfa -32 + ^ $r14: .cfa -24 + ^ $rbp: .cfa -16 + ^

STACK CFI INIT d10 2fa .cfa: $rsp 8 + .ra: .cfa -8 + ^
STACK CFI d11 .cfa: $rsp 16 +
STACK CFI d13 .cfa: $rsp 24 +
STACK CFI d15 .cfa: $rsp 32 +
STACK CFI d17 .cfa: $rsp 40 +
STACK CFI d19 .cfa: $rsp 48 +
STACK CFI d1a .cfa: $rsp 56 +
STACK CFI d1e .cfa: $rsp 160 + $rbx: .cfa -56 + ^ $r12: .cfa -48 + ^ $r13: .cfa -40 + ^ $r14: .cfa -32 + ^ $r15: .cfa -24 + ^ $rbp: .cfa -16 + ^

While compact unwind info produces this (note how the final STACK CFI records of the above agree with the below):

STACK CFI INIT c90 80 .cfa: $rsp 32 + .ra: .cfa -8 + ^ $rbp: .cfa -16 + ^ $r14: .cfa -24 + ^ $rbx: .cfa -32 + ^

STACK CFI INIT d10 300 .cfa: $rsp 160 + .ra: .cfa -8 + ^ $rbp: .cfa -16 + ^ $r15: .cfa -24 + ^ $r14: .cfa -32 + ^ $r13: .cfa -40 + ^ $r12: .cfa -48 + ^ $rbx: .cfa -56 + ^

This is a bit funky, but also kinda reasonable since these preludes necessarily shouldn't show up in a backtrace (except for maybe in the top frame, but then that frame is probably not very interesting to someone debugging a crash).

@Gankra Gankra requested a review from a team April 26, 2021 23:23
@Gankra
Copy link
Contributor Author

Gankra commented Apr 26, 2021

I highly recommend pulling this PR and running cargo doc --open to get a nice pretty formatted version of the ~700 line comment describing the format.

@Gankra
Copy link
Contributor Author

Gankra commented Apr 27, 2021

Potentially important empirical observation: the conflicting info seems to come from specifically the eh_frame section, which otherwise contains seemingly nothing else, so it's possible we can Just Work by doing:

if unwind_info {
  use_unwind_info()
} else if eh_frame {
  use_eh_frame()
}

@Gankra
Copy link
Contributor Author

Gankra commented Apr 27, 2021

Pushed up that version for people to test out

@jan-auer
Copy link
Member

Thanks, @Gankra! I'll need some time to get to reviewing this in detail.

Where should this implementation go

Apart from the Breakpad output, this would be well suited to be added to goblin::mach. It would be similar to how this has been done for PE exception data / unwind info.

So far, we kept debugging functionality in symbolic, while trying to upstream file format support as much as possible. If for any reason it is not an option to upstream to goblin, my second choice would be somewhere in symbolic. Our plan is to expose unwind info in a generalized form in symbolic_debuginfo::ObjectLike.

What should this implementation use as the basis for its parsing/error-handling

Going with my above argument around goblin, what are your thoughts on using scroll?

How should this handle conflicting STACK CFI records between the DWARF CFI sections and the compact unwinding sections?

I'll try to read up on why those differences might exist. If eh_frame provides strictly higher quality unwind information, you could consume both together and then choose one over the other on a per-record basis. Do you see a reason to prefer unwind_info over eh_frame? It's probably worth noting that __unwind_info already is a combination of __compact_unwind and __eh_frame according to here. IIRC, there were cases where the information stored in the debug_frame and eh_frame sections were disjunct but overlapping and it could be the same here, but seems highly unlikely.

The rest is guesswork: It's extremely unlikely but not entirely impossible that the function prelude crashes a program. Maybe, compact_unwind is optimized for cases where the program needs to unwind, i.e. exception handling, but not hard crashes like stack overflows.

@loewenheim
Copy link
Contributor

While compact unwind info produces this (note how the final STACK CFI records of the above agree with the below):

STACK CFI INIT c90 80 .cfa: $rsp 32 + .ra: .cfa -8 + ^ $rbp: .cfa -16 + ^ $r14: .cfa -24 + ^ $rbx: .cfa -32 + ^

STACK CFI INIT d10 300 .cfa: $rsp 160 + .ra: .cfa -8 + ^ $rbp: .cfa -16 + ^ $r15: .cfa -24 + ^ $r14: .cfa -32 + ^ $r13: .cfa -40 + ^ $r12: .cfa -48 + ^ $rbx: .cfa -56 + ^

I have two, possibly stupid, questions:

  1. Why do the compact records each cover 5 bytes more than the “normal” init records?
  2. How can there be a single rule for computing .cfa (in the second example, .cfa = $rsp + 160) for the entire memory block from d10 to 300?

@Gankra
Copy link
Contributor Author

Gankra commented Apr 27, 2021

It's probably worth noting that __unwind_info already is a combination of __compact_unwind and __eh_frame according to here.

I believe __unwind_info is strictly compact, and the comment at the start of the file is discussing the DWARF opcode inside of compact unwinding, which is just "this is too complicated, use the DWARF" and a 24-bit pointer to the FDE in the __eh_frame that has that info. I don't think this happens much, so a first implementation can just ignore these.

Why do the compact records each cover 5 bytes more than the “normal” init records?

I believe these are ~padding bytes in the binary which CFI carefully doesn't map, but Compact Unwinding incidentally maps because it doesn't care about having incorrect mappings for places that don't matter (the range covered by an opcode is only implicitly defined by where the next opcode starts, so there were necessarily be an entry for every address in the binary, although for our purposes there can still be holes because we don't emit STACK CFI entries for some of the opcodes).

How can there be a single rule for computing .cfa (in the second example, .cfa = $rsp + 160) for the entire memory block from d10 to 300?

It basically just ignores support for the function prologue where callee-saved registers are being individually PUSHed to the stack. I believe @jan-auer is correct in describing the format as:

optimized for cases where the program needs to unwind, i.e. exception handling, but not hard crashes like stack overflows.

That said, in Firefox we're seeing that this is the only unwinding info for most of our x64 macos builds, so it's better than nothing. And it's still the case that most stack overflows will happen after the registers are saved (as the rest of the frame is usually bulk-allocated upfront).

@Gankra
Copy link
Contributor Author

Gankra commented Apr 27, 2021

(force-push was just updating some of the comments with more details I've found)

@loewenheim
Copy link
Contributor

Thank you for the explanations!

@Gankra
Copy link
Contributor Author

Gankra commented Apr 27, 2021

(more comment cleanup)

@Gankra
Copy link
Contributor Author

Gankra commented Apr 27, 2021

Starting to look into what this would look like if defined in goblin but will need a while to acclimate to the structure of the library. Definitely wouldn't complain if this is something @jan-auer was willing to figure out for me, but will be proceeding on the assumption that I need to figure it out myself.

@Gankra
Copy link
Contributor Author

Gankra commented Apr 27, 2021

force-push: tweaked the structure to still allow for spidering the structure in ARM64 even though we don't support decoding its opcodes, because the objdump-style "dump" operation doesn't need that. This also brings the structure closer to what it would be in its final form, properly stubbing out space for where the ARM impl would go (and properly marking all the x86-exclusive stuff as such).

Also everything that should be an Err instead of a panic should be now.

@Swatinem
Copy link
Member

@Gankra this is awesome stuff! Just read the rustdocs and will try to decode some of these unwind entries by hand, which are not being handled by either symbolic master or your branch yet:

(output from llvm-objdump -u)

(from libsystem_kernel.dylib)
  Common encodings: (count = 8)
    encoding[0]: 0x01010001
    encoding[1]: 0x00000000
[…]
function offset=0x00006fd4, encoding[1]=0x00000000

(from libsystem_platform.dylib)
  Common encodings: (count = 8)
    encoding[0]: 0x010558d1
[…]
function offset=0x00003d60, encoding[63]=0x04000a90

I will continue reviewing this tomorrow as well :-)

@Gankra
Copy link
Contributor Author

Gankra commented Apr 28, 2021

force-push:

  • moved the implementation to public module symbolic_debuginfo::macho::compact
  • removed all the extra setup types, now you directly create a CompactUnwindInfoIter from the section
  • changed the impl to use scroll/goblin APIs, so that it can be ported to goblin with minimal effort
    • minor exceptions: used some symbolic stuff at the API boundary to make integration simpler
    • but convert to only goblin/internal types immediately to minimize impl contamination
  • exposed it through symbolic_debuginfo::macho::MachObject::compact_unwind_info()
    • had to do a kinda gross hack where I "unwrap" a Cow::Borrow to get the lifetimes to work :(
  • renamed CfiOp and CfiRegister to CompactCfiOp and CompactCfiRegister to minimize namespace pollution
  • added AsciiCfiWriter::process_macho which handles the compact_unwind_info format
  • changed AsciiCfiWriter::process_dwarf to include a skip_eh_frame flag to maintain my current hack (may revert)

@Gankra
Copy link
Contributor Author

Gankra commented Apr 28, 2021

@Swatinem could you elaborate on your comment? What isn't my branch handling?

Edit: ah, you found an instance of "old" and "dwarf"

@Gankra
Copy link
Contributor Author

Gankra commented Apr 28, 2021

Added the dwarf mode.

Some inspection of the functions that are being mapped to "old" (0x00000000) mode suggests that it's basically "there is no info", as they seem to be hand written assembly routines (jsimd_rgb_ycc_convert_avx2, lucet_context_bootstrap, ...).

@Gankra
Copy link
Contributor Author

Gankra commented Apr 30, 2021

Added tests

@Gankra
Copy link
Contributor Author

Gankra commented Apr 30, 2021

I've added a section at the end of the documentation discussing all the corner cases that I was able to think of in the format, and how the implementation handles them.

@Gankra
Copy link
Contributor Author

Gankra commented May 3, 2021

I've replaced the Vec for CompactCfiOps with a little adhoc ArrayVec impl to avoid allocating and avoid actually depending on ArrayVec.

With this I regard the implementation no longer "WIP" and something that could land, modulo how you feel about the outstanding TODOs:

  • I expose entry_for_address publicly but it's just unimplemented!() and will crash if anyone uses it. I can implement it if you really want, or I can comment it out and remove references to it from the docs.

  • There is no ARM64 opcode impl. This is not needed for the current usecase as Firefox does not target iOS, although presumably it's useful for M1 macs? I could roll up my sleeves and do it but I am Tired of working on this so I am defaulting to not bothering until someone tells me to.

  • The x86 "stackless indirect" opcode is not implemented. This is needed for >2K stack frames. It does not appear to occur in Firefox's binary. It's possible llvm/clang avoid emitting it at all because it was actually briefly misimplemented. Implementing this opcode is otherwise annoying because it requires access to the actual function's instructions to decode a constant embedded in a subq/subl instruction. This would require some breaking API changes (passing in the full binary..?) I haven't really thought through, so it's possible you'll want to figure this out now just to have those API changes done. Again, Tired, so not bothering unless told.

  • I don't touch the LSDA/personality stuff at all, because that's only needed for runtime unwinding (libunwind). There's some sketches and gestures towards

  • Although there isn't really any reason to use this code directly, I make the module public so that its enormous top-level doc-comment is published somewhere. I could see you disliking that.

@Gankra Gankra changed the title WIP compact unwinding info Implement compact unwinding info May 3, 2021
@jrmuizel
Copy link

jrmuizel commented May 3, 2021

  • The x86 "stackless indirect" opcode is not implemented. This is needed for >2K stack frames. It does not appear to occur in Firefox's binary. It's possible llvm/clang avoid emitting it at all because it was actually briefly misimplemented. Implementing this opcode is otherwise annoying because it requires access to the actual function's instructions to decode a constant embedded in a subq/subl instruction. This would require some breaking API changes (passing in the full binary..?) I haven't really thought through, so it's possible you'll want to figure this out now just to have those API changes done. Again, Tired, so not bothering unless told.

FWIW, having access to the full binary is also useful when doing stack scanning because you can check that potential return addresses are proceeded by call instructions.

@Gankra
Copy link
Contributor Author

Gankra commented May 3, 2021

You're not doing stack scanning in the context of dumping the CFI into the breakpad format, which is the usecase here. It's not hard to get the binary (I don't think?), just an API change from what I have now (CompactUnwindInfoIter::new would take it as an (optional?) input).

@jrmuizel
Copy link

jrmuizel commented May 3, 2021

Ah right, this is the dumping not the unwinding. I was confused.

Copy link
Member

@Swatinem Swatinem left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks awesome! I have been looking through the docs and the implementation. I only skimmed the tests superficially since I’m super tired right now.

In general:

  • The docs could use a few intra-doc links ;-)
  • The iterator implementation is a bit hard to follow. How about we actually have a two-level iterator internally (not exposed to the outside), that would make it a bit easier to follow I think. If I understood correctly, the first level pages are marked by a sentinel, and the second level pages have a fixed size.
  • Instead of having separate methods to get each of the pieces from the encoding, such as x86_frameless_stack_size and similar with their own shift and mask inside, I wish that would be inlined into x86_instructions, together with a bit of ascii art to visually show what the exact pieces are that you are extracting via shift&mask. Maybe a bit along these lines: https://github.com/getsentry/sentry-native/blob/3fdaa5d5eeb306fa2d13ca66ec25e656960d19b8/src/sentry_value.c#L38-L45
  • Please avoid using unimplemented,unreachable or similar panics. I vaguely remember that we actually hit one in production at some time. Rather just return an error or comment the function out.

In general I would be happy to land this rather soon, maybe even as crate-private allow(dead_code). That way we can improve this in incremental PRs. I also haven’t really reviewed this for its API surface, regrading what we want to expose, what needs to be #[non_exhaustive], etc. Would like to get another opinion from the team on this. Anything that would make this easier to review incrementally would be nice.

I will be looking through your goblin PR soon, though I have no authority on that repo :-D

@Gankra
Copy link
Contributor Author

Gankra commented May 6, 2021

Realized ARM64 opcodes were really simple, so I added support for them (not tested on real-world data, don't have an M1 to test against).

The docs could use a few intra-doc links ;-)

Spent so long without them I never think to add them now that they exist. Will do.

Please avoid using unimplemented,unreachable or similar panics. I vaguely remember that we actually hit one in production at some time. Rather just return an error or comment the function out.

I removed all of them but one which is basically a ward against someone using the internal pointer_size function with an unknown architecture (currently can't happen). Most notably I have now fully commented out the theoretical API entry point and marked its example as ignore.

Instead of having separate methods to get each of the pieces from the encoding, such as x86_frameless_stack_size and similar with their own shift and mask inside, I wish that would be inlined into x86_instructions.

While I'm usually a big proponent of megafunctions inlining everything they can, in this case I found it a bit too distracting, preferring instead to keep *_instructions mostly focused on the higher-level semantics. Also a few of these do genuinely get some reuse.

The iterator implementation is a bit hard to follow. How about we actually have a two-level iterator internally (not exposed to the outside), that would make it a bit easier to follow I think.

Hmm, I'll think about this. This code is a bit finicky because of the requirement that we need to peek at the next item and handle the sentinel as if it were an item in a peek but never fully return it.

@Swatinem
Copy link
Member

Swatinem commented May 6, 2021

I have been looking at the CFI snapshot changes.
Most of them are actually more CFI, which is great!
In one case the covered range was increased, I think thats the case you mentioned about function padding being covered as well.
And in another case a few consecutive records were folded into one, which also looks great.

clippy is being quite pedantic though :-D

@Gankra
Copy link
Contributor Author

Gankra commented May 6, 2021

I fixed up the clippy complaints, supressing the one complaining about next being used as a method name.

I added intradoc links, an also fixed up some old gunk in the docs I missed.

glandium was gracious enough to run this on an ARM64 macos XUL and the output seems good?

Copy link
Contributor

@loewenheim loewenheim left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please put impl blocks next to their types.

//!
//! Similarly, the way ranges of instructions are mapped means that Compact
//! Unwinding will generally incorrectly map the padding bytes between functions
//! (attributing them to the previous function), while DWARF CFI tends to more
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a "more" too many here

//! There are two high-level concepts in this format that enable significant
//! compression of the tables:
//!
//! 1. Eliding duplicate function offsets
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Function offsets" should be "instruction addresses" per the earlier terminology explanation.

//!
//! Trick 2 is more novel: At the first level a global palette of up to 127 opcodes
//! is defined. Each second-level "compressed" (leaf) page can also define up to 128 local
//! opcodes. Then the entries mapping function offsets to opcodes can use 8-bit
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here.

//!
//! (Currently Unimplemented)
//!
//! Stack-Indirect is exactly the same situation as Stack-Immediate, but the
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's one "the" too many here.

//! It will never violate memory safety but it may start yielding chaotic
//! values.
//!
//! If this implementation ever panics, that should be regarded as an
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Duplicate "an"

//!
//! * A corrupt unwind_info section may have its entries out of order. Since
//! the next entry's instruction_address is always needed to compute the
//! number of bytes the current entry cover, the implementation will report
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be "covers"

Other,
}

/// An iterator over the CompactUnwindInfoEntry's of a `.unwind_info` section.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could be an intra-doc link.

/// The instructions can be described with simple CFI operations.
CfiOps(CompactCfiOpIter),
/// Instructions can't be encoded by Compact Unwinding, but an FDE
/// with real DWARF CFI instructions are stored in the eh_frame section.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"are" should be "is"

@Gankra
Copy link
Contributor Author

Gankra commented May 6, 2021

review addressed, thanks!

@Gankra
Copy link
Contributor Author

Gankra commented May 10, 2021

So the style/lint issues seem to be pre-existing issues flaring up because i happened to touch the file, and I'd rather not make unrelated changes here.

The python tests failing appear to be needing a regeneration on the cache file used by test_macos_cficache, but I can't find the way to regenerate it.

@Swatinem
Copy link
Member

I have updated the snapshots and merged master here: https://github.com/getsentry/symbolic/tree/Gankra-compact

Otherwise I have been trying to testdrive your branch looking at some cases that we currently don’t handle well.

More specifically, we are failing to correctly unwind through libsystem_platform.dylib/_sigtramp for which we still fail to get any CFI info even with your branch.

llvm-objdump gives me this:

      [21]: function offset=0x00003d51, encoding[64]=0x04000458
      [22]: function offset=0x00003d60, encoding[63]=0x04000a90 <- this is `_sigtramp`
      [23]: function offset=0x00003d96, encoding[62]=0x04000378

dump_cfi omits that record:

STACK CFI INIT 3d51 7 .cfa: $rsp 8 + .ra: .cfa -8 + ^
STACK CFI INIT 3d96 6 .cfa: $rsp 8 + .ra: .cfa -8 + ^

I still haven’t figured out why that happens though

@Gankra
Copy link
Contributor Author

Gankra commented May 17, 2021

Assuming this is x86/x64, the opcode decodes to:

UseDwarfFde { offset_in_eh_frame: 2704 }

Is there an eh_frame entry there? It's possible it's running into the limits of your dwarf decoding.

@Swatinem
Copy link
Member

It's possible it's running into the limits of your dwarf decoding.

I will check this, thanks for looking this up for me!

@Gankra
Copy link
Contributor Author

Gankra commented May 17, 2021

(tests broke because the drive-by patch removed trailing whitespace, fixed)

@Swatinem Swatinem merged commit 7fbc10e into getsentry:master May 18, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants