TLS for MacOS 64-bit #1568

Open
derekbruening opened this issue Dec 5, 2014 · 7 comments

Comments

@derekbruening
Contributor

Split from #58 as this may have some overlap with ARM work. I'm going to paste my notes here:

** TODO 64-bit: can only set gs, not fs, and can't read gs

For 64-bit: thread_fast_set_cthread_self64 which sets MSR_IA32_KERNEL_GS_BASE.
May not be a way to set MSR for FS: no reference to MSR_IA32_KERNEL_FS_BASE in
xnu sources.

*** TODO option #1 for DR: is there some free padding space in TLS mmap?

Maybe beyond pthread data structs, since stack beyond that is page-aligned?

*** TODO option #2 for DR: early injection and use privlib w/ larger mmap + app mangling?

Add extra page to TLS mmap, maybe to the left so out of way (16-bit offsets
will still reach).

We'd need to mangle the app's references even w/o priv loader.

*** TODO option #3 for DR: like Windows, can we steal some slots from app's TLS?

Like Windows, request official TLS slots like app would and take ~20 (and
ensure directly addressable)?

*** TODO for priv libs, have to mangle app's refs

If go w/ option #2, have to do so even w/o priv libs.

*** TODO how read current gs?
*** TODO on Ivybridge+, use OP_wrfsbase or OP_wrgsbase?!
*** TODO steal register? later update: leverage ARM code?
Though this only helps for the code cache and our gencode: in our C code
we need a separate TLS mechanism.

@lunixbochs

lunixbochs commented Aug 15, 2016

I think OS X uses GS exclusively for user-space TLS on x86_64 and FS is not used, so setting FS shouldn't be required.

The rest of my comment will talk about reading GS. Can you expand upon your other questions above in case I can help? FWIW it doesn't look like (RD|WR)(FS|GS)BASE are enabled in userspace on OS X.

afaict TLS on OS X by default is managed in dyld and libpthread sources on opensource.apple.com, which ends up in the /usr/lib/system/libdyld.dylib and /usr/lib/dyld binaries.

You can read the GS_base on x86_64 on 10.11.6 using this nasty hack: https://github.com/lunixbochs/precorn/blob/master/src/osx/x86_64.c#L44

All of the pthread structs are private and they removed the get_cthread MDEP syscalls on x86_64, so I'm not sure if there's an actual portable way to do this. The only forward-compat problem I'm actually worried about with this method is the magic offset into the pthread_t struct (28 * 8). This can be made more robust as follows:

  1. Grab the address at gs:[0]. This is a pthread_t struct, the same as calling pthread_self().
  2. Search memory starting at that address for the address itself. The first TLS slot holds the same pthread_t pointer, so the match gives the offset of the TLS slots, and this keeps working if they change the private struct.

If you want to make it even more robust, read ~32 bytes to grab the first few slots and do a larger memory search, or link libpthread, reset one of the slots to a magic value, and search for it.
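The search described above can be sketched roughly as follows. This is only an illustration, not code from precorn or DynamoRIO: `find_self_offset` and its parameters are hypothetical names, and the caller is assumed to pass the pointer read from gs:[0] plus a bound on how many words it is safe to scan.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: scan the words of the block that starts at `self`
 * for a word whose value is the address of the block itself.
 * Returns the byte offset of the first match at or after start_word,
 * or -1 if no word in [start_word, max_words) matches. */
static long
find_self_offset(const uintptr_t *self, size_t start_word, size_t max_words)
{
    assert(self != NULL);
    for (size_t i = start_word; i < max_words; i++) {
        if (self[i] == (uintptr_t)self)
            return (long)(i * sizeof(uintptr_t));
    }
    return -1;
}
```

The `start_word` parameter is there because a first match is not guaranteed to be the TLS slot: in the lldb dump later in this thread, the second thread's struct also holds its own address at offset 0xb0, ahead of the slots at 0xe0. A caller may want to cross-check the result, for example against the magic-value approach.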

I went through all this mess to make QEMU work on x86_64 OS X in a similar way to DynamoRIO, only to run into the fact QEMU has terrible AVX2 support, so now I'm rooting for you :)

@derekbruening
Contributor Author

Thank you for the information. Reading the %gs base is the smaller problem, though: the bigger one is that there's no way to set the %fs base, which requires implementing a different TLS scheme from the one we use on other unix-ish x86 platforms.

We are short on manpower for Mac work, unfortunately. We would welcome contributions.

@derekbruening
Contributor Author

I found some notes from a while back that seem to be an expansion of the initial entry's notes. Pasting here for reference:

option #1 for DR: is there some free padding space in TLS mmap?

Maybe beyond pthread data structs, since stack beyond that is page-aligned?

WINNER option #2 for DR: early injection and use privlib w/ larger mmap + app mangling?

Add extra page to TLS mmap, maybe to the left so out of way (16-bit offsets
will still reach).

We'd need to mangle the app's references even w/o priv loader.

option #3 for DR: like Windows, can we steal some slots from app's TLS?

Like Windows, request official TLS slots like app would and take ~20 (and
ensure directly addressable)?

Qin: on Linux static TLS is directly addressable so app could take it all

could have DR loaded by ld.so request a bunch of static

option #4: table lookup by thread id: too slow though

xref tls_table in os.c
that was a quick impl -- linear lookup, could be improved
but get_thread_private_dcontext is called a lot and it is important to have it be fast
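To make the cost of option #4 concrete, here is a minimal sketch of a linear thread-id table. It is not the tls_table code from os.c: the names `tls_entry_t`, `table_set`, and `table_get` are made up for illustration, and the sketch leaves out the locking and entry removal a real implementation would need.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_THREADS 64

/* Hypothetical table entry: tid 0 marks a free entry. */
typedef struct {
    long tid;
    void *dcontext;
} tls_entry_t;

static tls_entry_t table[MAX_THREADS];

/* Stores dcontext for tid in the first matching or free entry.
 * Returns 0 on success, -1 if the table is full. */
static int
table_set(long tid, void *dcontext)
{
    assert(tid != 0);
    for (size_t i = 0; i < MAX_THREADS; i++) {
        if (table[i].tid == tid || table[i].tid == 0) {
            table[i].tid = tid;
            table[i].dcontext = dcontext;
            return 0;
        }
    }
    return -1;
}

/* Linear scan: every lookup walks the entries until it finds tid. */
static void *
table_get(long tid)
{
    for (size_t i = 0; i < MAX_THREADS; i++) {
        if (table[i].tid == tid)
            return table[i].dcontext;
    }
    return NULL;
}
```

Each lookup is O(number of threads) and also needs the current thread id first, which is why this compares poorly with a single segment-relative load on a path as hot as get_thread_private_dcontext.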

option #5: replace TLS mmap at DR init permanently w/ a larger one

first part is copy of original

Qin: 1st TLS may be brk not mmap, and later ones combine TLS and stack.
and to see brk call happen need early injection which need priv loader =>
may as well just do option #2.

option #6: steal register? later update: leverage ARM code?

Though this only helps for the code cache and our gencode: in our C code
we need a separate TLS mechanism.

Stealing a register is more work than the other options.

for priv libs, have to mangle app's refs

If go w/ option #2, have to do so even w/o priv libs.

@derekbruening
Contributor Author

Re-visiting the options here. Unlike Linux, where each thread's TLS is allocated in its own mmap and it's easy to add some space in order to have DR share the priv lib's layout, the Mac TLS lives inside the pthread library's __DATA segment (initial thread) or inside the thread's stack region (new threads):

thread 166878: self is 0x7fffa77f2380, gs base is 0x7fffa77f2460
thread 166947: self is 0x70000ced6000, gs base is 0x70000ced60e0

__DATA                 00007fffa77f1000-00007fffa77f2000 [    4K     4K     4K     0K] rw-/rwx SM=COW          /usr/lib/system/libsystem_platform.dylib
__DATA                 00007fffa77f2000-00007fffa77f6000 [   16K    16K    12K     0K] rw-/rwx SM=COW          /usr/lib/system/libsystem_pthread.dylib
__DATA                 00007fffa77f6000-00007fffa77f7000 [    4K     4K     0K     0K] rw-/rwx SM=COW          /usr/lib/system/libsystem_sandbox.dylib

Stack                  000070000ce56000-000070000ced8000 [  520K     8K     8K     0K] rw-/rwx SM=PRV          thread 1
Stack                  00007ffeef400000-00007ffeefc00000 [ 8192K    16K    16K     0K] rw-/rwx SM=PRV          thread 0
(lldb) x/80gx 0x7fffa77f2380
0x7fffa77f2380: 0x0000000054485244 0x0000000000000000
0x7fffa77f2390: 0x0000000000000000 0x0000000000010105
0x7fffa77f23a0: 0x0000000000000000 0x0000000000000000
0x7fffa77f23b0: 0x0000000000000000 0x0000002800000000
0x7fffa77f23c0: 0x0000000000000023 0x0000000000000000
0x7fffa77f23d0: 0x0000000000000000 0x0000000a0000001f
0x7fffa77f23e0: 0x000070000ced6000 0x00007fffa77f22a0
0x7fffa77f23f0: 0x0000000000000000 0x0000000000000000
0x7fffa77f2400: 0x0000000000000000 0x0000000000000000
0x7fffa77f2410: 0x0000000000000000 0x0000000000000000
0x7fffa77f2420: 0x0000000000000000 0x0000000000000000
0x7fffa77f2430: 0x00007ffeefc00000 0x0000000000800000
0x7fffa77f2440: 0x00007ffeebc00000 0x0000000004000000
0x7fffa77f2450: 0x0000000000001000 0x0000000000028bde <-- tid
0x7fffa77f2460: 0x00007fffa77f2380 0x00007fffa77f23c8 <-- tls slots, self first
0x7fffa77f2470: 0x0000000000000607 0x0000000000000307
0x7fffa77f2480: 0x00000000000008ff 0x0000000000000000
0x7fffa77f2490: 0x0000000000000000 0x0000000000000000
(lldb) x/60gx 0x70000ced6000
0x70000ced6000: 0x0000000054485244 0x0000000000000000
0x70000ced6010: 0x00000a0300000003 0x0000000000010109
0x70000ced6020: 0x0000000100000e30 0x0000000000000000
0x70000ced6030: 0x0000000000000000 0x0000000000000000
0x70000ced6040: 0x0000000000000022 0x0000000000000000
0x70000ced6050: 0x0000000000000000 0x0000000a0000001f
0x70000ced6060: 0x0000000000000000 0x00007fffa77f23e0
0x70000ced6070: 0x0000000000000000 0x0000000000000000
0x70000ced6080: 0x0000000000000000 0x0000000000000000
0x70000ced6090: 0x0000000000000000 0x0000000000000000
0x70000ced60a0: 0x0000000000000000 0x0000000000000000
0x70000ced60b0: 0x000070000ced6000 0x0000000000080000
0x70000ced60c0: 0x000070000ce55000 0x0000000000083000
0x70000ced60d0: 0x0000000000001000 0x0000000000028c23 <-- tid
0x70000ced60e0: 0x000070000ced6000 0x000070000ced6048 <-- tls slots, self first
0x70000ced60f0: 0x0000000000000b03 0x0000000000000a03
0x70000ced6100: 0x00000000000008ff 0x0000000000000000
0x70000ced6110: 0x0000000000000000 0x0000000000000000

That makes it harder to adjust the sizes.
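As a quick check on the dump, the distance from each thread's pthread_t to its gs base can be computed directly. The helper name `tls_offset` is just for illustration; the addresses are the ones printed above.

```c
#include <assert.h>
#include <stdint.h>

/* Byte distance from the start of the pthread_t struct (self) to the
 * first TLS slot, which is where the gs base points. */
static uint64_t
tls_offset(uint64_t self, uint64_t gs_base)
{
    assert(gs_base >= self);
    return gs_base - self;
}
```

For both threads this gives 0xe0, i.e. 224 bytes or 28 eight-byte words, which matches the "28 * 8" magic offset mentioned earlier in the thread for 10.11.6.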

Given the extra work in making our own loader, I'm looking at reviving @shawndenbow's PR #2293 approach of implementing #3 above and stealing some slots from the app. This is definitely the simplest solution to make progress w/o having to first or simultaneously solve complex problems outside of TLS.

@derekbruening
Contributor Author

Following Qin's suggestion: since we're injecting late anyway and libpthread is always there, we could have DR depend on libpthread and invoke pthread_key_create a bunch of times to reserve TLS slots: i.e., actually use the user-mode interfaces to get our resources. Long-term we'd like to be independent of the user libs but that will require more developer effort than we currently have.
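A rough sketch of what reserving slots through the user-mode interface might look like. This is not the code that went into DR: `reserve_keys` and `keys_contiguous` are hypothetical helpers, and treating pthread_key_t as an integer slot index is an implementation detail of particular pthread libraries, not something POSIX promises.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical helper: create `count` keys with no destructor.
 * Returns 0 on success. On failure, deletes the keys created so far
 * and returns the error code from pthread_key_create. */
static int
reserve_keys(pthread_key_t *keys, size_t count)
{
    assert(keys != NULL);
    for (size_t i = 0; i < count; i++) {
        int res = pthread_key_create(&keys[i], NULL);
        if (res != 0) {
            while (i > 0)
                pthread_key_delete(keys[--i]);
            return res;
        }
    }
    return 0;
}

/* Assumption: pthread_key_t is an integer type whose value is the slot
 * index. Returns 1 if the keys are consecutive, 0 otherwise. */
static int
keys_contiguous(const pthread_key_t *keys, size_t count)
{
    for (size_t i = 1; i < count; i++) {
        if ((unsigned long)keys[i] != (unsigned long)keys[i - 1] + 1)
            return 0;
    }
    return 1;
}
```

Nothing guarantees that consecutive calls return adjacent slots, so a caller that needs a contiguous run would have to check and handle the failure case rather than assume it.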

@derekbruening
Contributor Author

We could use the same approach with private libraries: load a private pthread lib and invoke its pthread_key_create. Unlike Windows with its 64 dynamic slots, on Mac there are 768. The Mac system libraries reserve 256, leaving 512: presumably we could steal some from the tool, which should not be using a crazy amount of non-system libs. That's not as bad as Windows where we steal them from the app.

That leaves the question of what to do with early injection and no client, where normally we wouldn't bother trying to load any private libs: if it doesn't complicate the code too much, we could just do an mmap there and set up the gsbase ourselves.

derekbruening added a commit that referenced this issue Apr 15, 2019
For 64-bit MacOS, there is no way to set the %fs base which stops
us from using DR's scheme used on other unix platforms. This commit
provides initial support to MacOS 64-bit by stealing a TLS slot
from the app for DR's TLS base.
+ implement is_thread_tls_initialized for MacOS 64-bit
+ implement tls_thread_init and tls_thread_free
+ set MACOS64 define in cmake script
+ add WRITE_TLS_SLOT_IMM etc. for MacOS 64-bit
+ add read_thread_register for MacOS 64-bit to get pthread_t base

Issue: #1568, #1979
hgreving2304 pushed a commit that referenced this issue Apr 22, 2019
For 64-bit MacOS, there is no way to set the %fs base which stops
us from using DR's scheme used on other unix platforms. This commit
provides initial support to MacOS 64-bit by stealing a TLS slot
from the app for DR's TLS base.
+ implement is_thread_tls_initialized for MacOS 64-bit
+ implement tls_thread_init and tls_thread_free
+ set MACOS64 define in cmake script
+ add WRITE_TLS_SLOT_IMM etc. for MacOS 64-bit
+ add read_thread_register for MacOS 64-bit to get pthread_t base

Issue: #1568, #1979
derekbruening added a commit that referenced this issue Sep 14, 2019
Uses pthread_key_create() to allocate enough contiguous and aligned TLS
slots to fit our os_local_state_t struct.  This makes it easier to share
Linux code for Mac64.

Keeps the scheme from ce8e803 of storing a pointer to the base of
os_local_state_t in TLS slot 6.  This is indirection we don't need with the
entire os_local_state_t struct in TLS but it is not clear we can take that
many TLS slots for large applications, so I'm leaving this mixture until
we're sure which direction to go in.

Disables the options -mangle_app_seg and -safe_read_tls_init for Mac64.

Issue: #1568, #1979
@derekbruening
Contributor Author

I put in the pthread_key_create approach in #3832. It works for small apps at least.