Access to thread locals isn't inlined across crates #25088
Comments
@veddan can you quantify the cost that you are seeing? For example this is what I get:

    #![feature(test)]
    #![allow(warnings)]

    extern crate test;

    use std::sync::Mutex;

    #[bench]
    fn pthreads(b: &mut test::Bencher) {
        let mut buf = [0u8; 1024];
        enum pthread_mutex_t {}
        enum pthread_attr_t {}
        extern {
            fn pthread_mutex_init(lock: *mut pthread_mutex_t,
                                  attr: *mut pthread_attr_t) -> i32;
            fn pthread_mutex_lock(lock: *mut pthread_mutex_t) -> i32;
            fn pthread_mutex_unlock(lock: *mut pthread_mutex_t) -> i32;
        }
        unsafe {
            let ptr = buf.as_ptr() as *mut pthread_mutex_t;
            assert_eq!(pthread_mutex_init(ptr, 0 as *mut _), 0);
            b.iter(|| {
                assert_eq!(pthread_mutex_lock(ptr), 0);
                assert_eq!(pthread_mutex_unlock(ptr), 0);
            });
        }
    }

    #[bench]
    fn libstd(b: &mut test::Bencher) {
        let m = Mutex::new(());
        b.iter(|| drop(m.lock()));
    }

When inlining everything with LTO the costs basically entirely go away, and when calling across the library boundary (note that there's no virtual call, and this is intentional!) the cost is quite negligible. I definitely agree that the current design requires a virtual call when inlined across crates, and I've experimented with tweaking it such that this is not necessary, but I don't think that it's easily possible due to various safety concerns. I just want to get an idea of what perf hit you're seeing.
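For readers without a nightly toolchain (the benchmark above relies on the unstable `test` crate), a rough stable-Rust sketch of timing uncontended `std::sync::Mutex` lock/unlock pairs might look like the following. The helper name `lock_unlock_many` is made up for this illustration, and wall-clock timing like this is much noisier than `#[bench]`, so any numbers it produces are indicative only.

```rust
use std::sync::Mutex;
use std::time::{Duration, Instant};

/// Hypothetical helper: performs `iters` uncontended lock/unlock pairs on `m`
/// and returns the elapsed wall-clock time.
fn lock_unlock_many(m: &Mutex<u32>, iters: u32) -> Duration {
    let start = Instant::now();
    for _ in 0..iters {
        // The guard is a temporary, so the mutex is unlocked again at the
        // end of this statement. Incrementing the data gives the loop an
        // observable effect that the optimizer cannot discard.
        *m.lock().unwrap() += 1;
    }
    start.elapsed()
}
```

Calling `lock_unlock_many(&Mutex::new(0), 1_000_000)` and printing the result with `{:?}` gives a coarse per-run figure; building with and without LTO is one way to compare the two situations described above.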
@alexcrichton wouldn't this be solved by marking the static containing the accessors, and the accessors themselves as |
Functionally, yes, but I don't think that |
@alexcrichton oh, I could've sworn I've seen |
I believe it is being silently ignored, yes.
This may be possible, but the semantics of ELF-like TLS and OS-based TLS should be the same, and OS-based TLS requires a |
@alexcrichton the OS-based TLS |
It's somewhat of an implementation detail, but there still needs to be one true address of the TLS static. Right now a |
@alexcrichton just the pair of accessors would be |
At some point in the past I believe |
@alexcrichton I believe we are talking past each other: I'm not suggesting we touch the platform-dependent hidden static at all. As it stands, we allow taking I would personally prefer using |
Could you elaborate on why you think this doesn't have a significant address? For OS-based TLS it needs a significant address as it mutates the memory and everyone needs to see the update. For ELF-based TLS it's also significant because of the mutations which need to be visible (and everyone needs to reference the same memory).
Right, but we just need to inline the getter into destination crates, probably via an
Unfortunately due to the current usage of |
Oh another thing is that a

    static FOO: u32 = 4;
    const BAR: &'static u32 = &FOO;

Okay, As for the other thing, I'm talking about the outer For both OS-based and |
I've updated the description of this issue with a more detailed explanation about what I believe the solution here is (basically just a summary of the discussion @eddyb and I had a year ago I believe)
Just how fast can TLS get? I thought that on x86-64 it could indirect off of a segment register or something like that, and be really fast.
Hi, any consensus on this one?
@cyplo no change to the updated description, which I believe is still the best solution
Rvalue promotion is getting close to stabilization (#38865 (comment)), so we'll use that + The other concerns raised in the past seem inconsequential now - allowing references to |
I'm writing an application whose performance is impacted by the fact that TLS accesses are not generated using a segment register, but go through a function call and are then not subject to the usual optimizations like CSE. I found this issue, which seems to be the culprit. It looks like #50252 fixed the problem for Linux and Mac but not Windows (as far as I can tell?). Given that 9 months have passed, would it be acceptable to refresh #50252 and activate it only for Linux/Mac?
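While the accessor is not inlined across crates, one possible workaround is to read the thread-local once and reuse the value, doing by hand the hoisting that CSE would otherwise perform. A minimal sketch, assuming a `Cell`-based thread-local whose value does not change during the loop; `SCALE` and both function names are hypothetical:

```rust
use std::cell::Cell;

thread_local! {
    // Hypothetical per-thread scale factor, used only for this illustration.
    static SCALE: Cell<u64> = Cell::new(3);
}

/// Goes through the thread-local accessor on every iteration.
fn sum_scaled_per_iteration(values: &[u64]) -> u64 {
    values.iter().map(|v| v * SCALE.with(|s| s.get())).sum()
}

/// Reads the thread-local once, then works on a plain local copy.
fn sum_scaled_hoisted(values: &[u64]) -> u64 {
    let scale = SCALE.with(|s| s.get());
    values.iter().map(|v| v * scale).sum()
}
```

Both functions return the same result as long as nothing writes to `SCALE` while the loop runs; the hoisted version simply avoids relying on the optimizer to see through the accessor call.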
…ss-crate, r=Mark-Simulacrum
std: Attempt again to inline thread-local-init across crates
Issue rust-lang#25088 has been part of `thread_local!` for quite some time now. Historical attempts have been made to add `#[inline]` to `__getit` in rust-lang#43931, rust-lang#50252, and rust-lang#59720, but these attempts ended up not landing at the time due to segfaults on Windows. In the interim though with `const`-initialized thread locals AFAIK this is the only remaining bug which is why you might want to use `#[thread_local]` over `thread_local!`. As a result I figured it was time to resubmit this and see how it fares on CI and if I can help debugging any issues that crop up.
Closes rust-lang#25088
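For context, the `const`-initialized form of `thread_local!` mentioned in that message looks roughly like the sketch below. It assumes Rust 1.59 or later, where the `const { ... }` initializer syntax is stable; `COUNTER` and `next_id` are hypothetical names chosen for the example.

```rust
use std::cell::Cell;

thread_local! {
    // The `const { ... }` block states that the initializer is a constant
    // expression, so the value can be set up without a lazy-init closure.
    static COUNTER: Cell<u32> = const { Cell::new(0) };
}

/// Hypothetical helper: returns the current per-thread counter value and
/// then increments it.
fn next_id() -> u32 {
    COUNTER.with(|c| {
        let id = c.get();
        c.set(id + 1);
        id
    })
}
```

Each thread gets its own `COUNTER`, so the first call to `next_id()` on any thread returns 0 regardless of what other threads have done.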
Proposed solution
Right now accesses to thread locals defined by `thread_local!` aren't inlined across crates, causing performance problems that wouldn't otherwise be seen within one crate. This can probably be solved with a few new minor language features:
- The `#[inline]` annotation could be processed on `static` variables. If the variable does not have any internal mutability, then the definition can be inlined into other LLVM modules and tagged with `available_externally`. That means that the contents are available for optimization, but if you're taking the address it's available elsewhere.
- […] `#[inline]`.

Those two pieces I believe should provide enough inlining opportunities to ensure that accesses are as fast when done from external crates as they are when done within one crate.
Original description
This hurts performance for the locks in `std::sync`, as they call `std::rt::unwind::panicking()` (which just reads a thread-local). For uncontended locks the cost is quite significant.
There are two problems:
- `std::rt::unwind::panicking()` isn't marked inline. This is trivial to solve.
- `thread_local!` goes through function pointers, which LLVM fails to see through. These are the `__getit` functions in `libstd/thread/local.rs`.

Consider these two files:
`call_foo` gets the following IR with everything compiled with full optimization. Note the call through a function pointer:
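As a rough illustration of the second problem (this is not the original pair of files, and not the real `libstd` code), the shape being described is a static that stores its accessors as function pointers. In the sketch below `Key`, `HITS`, `HITS_KEY` and the accessor functions are all hypothetical; only the name `call_foo` is borrowed from the description above.

```rust
use std::cell::Cell;

thread_local! {
    // Hypothetical per-thread counter standing in for the real thread-local.
    static HITS: Cell<u64> = Cell::new(0);
}

// Hypothetical key type: the accessors are stored as function pointers,
// so every use is an indirect call unless the optimizer can see the
// static's initializer.
struct Key {
    get: fn() -> u64,
    bump: fn(),
}

fn get_hits() -> u64 {
    HITS.with(|h| h.get())
}

fn bump_hits() {
    HITS.with(|h| h.set(h.get() + 1))
}

static HITS_KEY: Key = Key { get: get_hits, bump: bump_hits };

/// Increments the counter and reads it back, both through the pointers.
fn call_foo() -> u64 {
    (HITS_KEY.bump)();
    (HITS_KEY.get)()
}
```

Within a single crate the optimizer can usually see the initializer of `HITS_KEY` and resolve the pointers. The situation described above arises when `HITS_KEY` lives in one crate and `call_foo` in another, where only the pointer loads are visible to the caller.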