Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

macos tlv workaround #60341

Merged
merged 15 commits into from
Jun 20, 2019
Merged

macos tlv workaround #60341

merged 15 commits into from
Jun 20, 2019

Conversation

mtak-
Copy link
Contributor

@mtak- mtak- commented Apr 27, 2019

fixes: #60141

Includes:

  • remove dead code: requires_move_before_drop. This hasn't been needed for a while now (oops I should have removed it in OSX: fix #57534 registering thread dtors while running thread dtors #57655)
  • redox had a copy of fast::Key (not sure why?). That has been removed.
  • Perform a read_volatile on OSX to reduce tlv_get_addr calls per __getit from (4-2 depending on context) to 1.

tlv_get_addr is relatively expensive (~1.5ns on my machine).

Previously, in contexts where __getit was inlined, 4 calls to tlv_get_addr were performed per lookup. For some reason when __getit is not inlined this is reduced to 2x - and performance improves to match.

After this PR, I have only ever seen 1x call to tlv_get_addr per __getit, and macos now benefits from situations where __getit is inlined.

I'm not sure if the read_volatile(&&__KEY) trick is working around an LLVM bug, or a rustc bug, or neither.

r? @alexcrichton

@rust-highfive rust-highfive added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Apr 27, 2019
@alexcrichton
Copy link
Member

Thanks for this! The last commit here looks like the real meat of the PR but I'm not sure that I entirely understand what's going on, it's definitely counterintuitive that a volatile load makes things more performant!

Do you know what's happening at the LLVM layer? Do you, for example, have some LLVM IR of before/after to see what's going on? It'd be nice if we could avoid this trick and instead just tweak the codegen slightly to be more amenable to better code generation on the backend.

@mtak-
Copy link
Contributor Author

mtak- commented Apr 29, 2019

The last commit here looks like the real meat of the PR but I'm not sure that I entirely understand what's going on, it's definitely counterintuitive that a volatile load makes things more performant!

The last commit is indeed the meat. The other two commits are just deleting dead code.

Do you know what's happening at the LLVM layer? Do you, for example, have some LLVM IR of before/after to see what's going on?

Here's some sample code and associated codegen:

#[inline(never)]
#[cold]
fn init() -> String {
    String::from("hello world. this string is long")
}

pub fn get_the_thread_local() -> String {
    thread_local! {
        static X: String = init();
    }
    X.with(|x| x.clone())
}
IR _without_ `read_volatile` (`--release`)
; thread_local::get_the_thread_local
; Function Attrs: uwtable
define void @_ZN12thread_local20get_the_thread_local17h709f40512ac1fd32E(%"alloc::string::String"* noalias nocapture sret dereferenceable(24)) unnamed_addr #1 personality i32 (i32, i32, i64, %"unwind::libunwind::_Unwind_Exception"*, %"unwind::libunwind::_Unwind_Context"*)* @rust_eh_personality {
start:
  %value.i.i.i = alloca %"alloc::string::String", align 16
  %_15.i.i = alloca %"alloc::string::String", align 8
  %_3.sroa.6.i = alloca [2 x i64], align 8
  %_3.sroa.6.0.sroa_cast9.i = bitcast [2 x i64]* %_3.sroa.6.i to i8*
  call void @llvm.lifetime.start.p0i8(i64 16, i8* nonnull %_3.sroa.6.0.sroa_cast9.i)
  %.val.i.i.i.i = load i8, i8* getelementptr inbounds (<{ [32 x i8] }>, <{ [32 x i8] }>* @_ZN12thread_local20get_the_thread_local1X7__getit5__KEY17h53cabb331a543277E, i64 0, i32 0, i64 25), align 1, !noalias !25
  %1 = icmp eq i8 %.val.i.i.i.i, 0
  br i1 %1, label %bb7.i.i.i.i, label %bb4.i.i

bb7.i.i.i.i:                                      ; preds = %start
  %.val.i.i.i.i.i = load i8, i8* getelementptr inbounds (<{ [32 x i8] }>, <{ [32 x i8] }>* @_ZN12thread_local20get_the_thread_local1X7__getit5__KEY17h53cabb331a543277E, i64 0, i32 0, i64 24), align 8, !noalias !25
  %2 = icmp eq i8 %.val.i.i.i.i.i, 0
  br i1 %2, label %bb7.i.i.i.i.i, label %bb11.i.i

bb7.i.i.i.i.i:                                    ; preds = %bb7.i.i.i.i
; call std::sys::unix::fast_thread_local::register_dtor
  call void @_ZN3std3sys4unix17fast_thread_local13register_dtor17hea5fd2f6f1030afcE(i8* getelementptr inbounds (<{ [32 x i8] }>, <{ [32 x i8] }>* @_ZN12thread_local20get_the_thread_local1X7__getit5__KEY17h53cabb331a543277E, i64 0, i32 0, i64 0), void (i8*)* nonnull @_ZN3std6thread5local4fast13destroy_value17h328a717a41c9449aE), !noalias !25
  store i8 1, i8* getelementptr inbounds (<{ [32 x i8] }>, <{ [32 x i8] }>* @_ZN12thread_local20get_the_thread_local1X7__getit5__KEY17h53cabb331a543277E, i64 0, i32 0, i64 24), align 8, !noalias !25
  br label %bb11.i.i

bb11.i.i:                                         ; preds = %bb7.i.i.i.i.i, %bb7.i.i.i.i
  %3 = bitcast %"alloc::string::String"* %_15.i.i to i8*
  call void @llvm.lifetime.start.p0i8(i64 24, i8* nonnull %3), !noalias !25
  %4 = load {}*, {}** bitcast (<{ [32 x i8] }>* @_ZN12thread_local20get_the_thread_local1X7__getit5__KEY17h53cabb331a543277E to {}**), align 8, !noalias !25
  %5 = icmp eq {}* %4, null
  br i1 %5, label %bb13.i.i, label %bb15.i.i

bb13.i.i:                                         ; preds = %bb11.i.i
  %6 = bitcast %"alloc::string::String"* %value.i.i.i to i8*
  call void @llvm.lifetime.start.p0i8(i64 24, i8* nonnull %6), !noalias !25
; call thread_local::init
  call fastcc void @_ZN12thread_local4init17h88c40ced9b6cbe50E(%"alloc::string::String"* noalias nocapture nonnull dereferenceable(24) %value.i.i.i) #7, !noalias !25
  %7 = bitcast %"alloc::string::String"* %value.i.i.i to <2 x i64>*
  %8 = load <2 x i64>, <2 x i64>* %7, align 16, !noalias !25
  %_11.sroa.0.sroa.5.0..sroa_idx21.i.i.i = getelementptr inbounds %"alloc::string::String", %"alloc::string::String"* %value.i.i.i, i64 0, i32 1, i32 3
  %_11.sroa.0.sroa.5.0.copyload.i.i.i = load i64, i64* %_11.sroa.0.sroa.5.0..sroa_idx21.i.i.i, align 16, !noalias !25
  %z.i.i.i.sroa.0.0.copyload.i.i.i = load {}*, {}** bitcast (<{ [32 x i8] }>* @_ZN12thread_local20get_the_thread_local1X7__getit5__KEY17h53cabb331a543277E to {}**), align 8, !noalias !30
  %z.i.i.i.sroa.4.0.copyload.i.i.i = load i64, i64* bitcast (i8* getelementptr inbounds (<{ [32 x i8] }>, <{ [32 x i8] }>* @_ZN12thread_local20get_the_thread_local1X7__getit5__KEY17h53cabb331a543277E, i64 0, i32 0, i64 8) to i64*), align 8, !noalias !30
  store <2 x i64> %8, <2 x i64>* bitcast (<{ [32 x i8] }>* @_ZN12thread_local20get_the_thread_local1X7__getit5__KEY17h53cabb331a543277E to <2 x i64>*), align 8, !noalias !34
  store i64 %_11.sroa.0.sroa.5.0.copyload.i.i.i, i64* bitcast (i8* getelementptr inbounds (<{ [32 x i8] }>, <{ [32 x i8] }>* @_ZN12thread_local20get_the_thread_local1X7__getit5__KEY17h53cabb331a543277E, i64 0, i32 0, i64 16) to i64*), align 8, !noalias !34
  %9 = icmp eq {}* %z.i.i.i.sroa.0.0.copyload.i.i.i, null
  %10 = icmp eq i64 %z.i.i.i.sroa.4.0.copyload.i.i.i, 0
  %or.cond.i.i.i = or i1 %9, %10
  br i1 %or.cond.i.i.i, label %"_ZN3std6thread5local17LocalKey$LT$T$GT$4init17hf02e3e1800d44f2aE.exit.i.i", label %bb4.i.i.i.i.i.i.i.i.i

bb4.i.i.i.i.i.i.i.i.i:                            ; preds = %bb13.i.i
  %11 = bitcast {}* %z.i.i.i.sroa.0.0.copyload.i.i.i to i8*
  tail call void @__rust_dealloc(i8* nonnull %11, i64 %z.i.i.i.sroa.4.0.copyload.i.i.i, i64 1) #7, !noalias !25
  br label %"_ZN3std6thread5local17LocalKey$LT$T$GT$4init17hf02e3e1800d44f2aE.exit.i.i"

"_ZN3std6thread5local17LocalKey$LT$T$GT$4init17hf02e3e1800d44f2aE.exit.i.i": ; preds = %bb4.i.i.i.i.i.i.i.i.i, %bb13.i.i
  call void @llvm.lifetime.end.p0i8(i64 24, i8* nonnull %6), !noalias !25
  br label %bb15.i.i

bb15.i.i:                                         ; preds = %"_ZN3std6thread5local17LocalKey$LT$T$GT$4init17hf02e3e1800d44f2aE.exit.i.i", %bb11.i.i
; call <alloc::string::String as core::clone::Clone>::clone
  call void @"_ZN60_$LT$alloc..string..String$u20$as$u20$core..clone..Clone$GT$5clone17h5e0882d4896c9b5eE"(%"alloc::string::String"* noalias nocapture nonnull sret dereferenceable(24) %_15.i.i, %"alloc::string::String"* noalias nonnull readonly align 8 dereferenceable(24) bitcast (<{ [32 x i8] }>* @_ZN12thread_local20get_the_thread_local1X7__getit5__KEY17h53cabb331a543277E to %"alloc::string::String"*)), !noalias !25
  %_3.sroa.0.0..sroa_cast.i = bitcast %"alloc::string::String"* %_15.i.i to {}**
  %_3.sroa.0.0.copyload.i = load {}*, {}** %_3.sroa.0.0..sroa_cast.i, align 8, !noalias !35
  %_3.sroa.6.0..sroa_idx.i = getelementptr inbounds %"alloc::string::String", %"alloc::string::String"* %_15.i.i, i64 0, i32 1, i32 1, i32 1
  %_3.sroa.6.0..sroa_cast.i = bitcast i64* %_3.sroa.6.0..sroa_idx.i to i8*
  call void @llvm.memcpy.p0i8.p0i8.i64(i8* nonnull align 8 %_3.sroa.6.0.sroa_cast9.i, i8* nonnull align 8 %_3.sroa.6.0..sroa_cast.i, i64 16, i1 false), !noalias !35
  call void @llvm.lifetime.end.p0i8(i64 24, i8* nonnull %3), !noalias !25
  %12 = icmp eq {}* %_3.sroa.0.0.copyload.i, null
  br i1 %12, label %bb4.i.i, label %"_ZN3std6thread5local17LocalKey$LT$T$GT$4with17hf1fd224675692f45E.exit"

bb4.i.i:                                          ; preds = %bb15.i.i, %start
; call core::result::unwrap_failed
  tail call fastcc void @_ZN4core6result13unwrap_failed17hf64e0430bdaf4bcaE(), !noalias !36
  unreachable

"_ZN3std6thread5local17LocalKey$LT$T$GT$4with17hf1fd224675692f45E.exit": ; preds = %bb15.i.i
  %_3.sroa.0.0..sroa_cast2.i = bitcast %"alloc::string::String"* %0 to {}**
  store {}* %_3.sroa.0.0.copyload.i, {}** %_3.sroa.0.0..sroa_cast2.i, align 8, !alias.scope !36
  %_3.sroa.6.0..sroa_idx6.i = getelementptr inbounds %"alloc::string::String", %"alloc::string::String"* %0, i64 0, i32 1, i32 1, i32 1
  %_3.sroa.6.0..sroa_cast7.i = bitcast i64* %_3.sroa.6.0..sroa_idx6.i to i8*
  call void @llvm.memcpy.p0i8.p0i8.i64(i8* nonnull align 8 %_3.sroa.6.0..sroa_cast7.i, i8* nonnull align 8 %_3.sroa.6.0.sroa_cast9.i, i64 16, i1 false), !alias.scope !40
  call void @llvm.lifetime.end.p0i8(i64 16, i8* nonnull %_3.sroa.6.0.sroa_cast9.i)
  ret void
}
IR with `read_volatile` (`--release`)
; thread_local::get_the_thread_local
; Function Attrs: uwtable
define void @_ZN12thread_local20get_the_thread_local17hceac182570e9506dE(%"alloc::string::String"* noalias nocapture sret dereferenceable(24)) unnamed_addr #1 personality i32 (i32, i32, i64, %"unwind::libunwind::_Unwind_Exception"*, %"unwind::libunwind::_Unwind_Context"*)* @rust_eh_personality {
start:
  %value.i.i.i = alloca %"alloc::string::String", align 16
  %self.i.i.i.i = alloca %"std::thread::local::fast::Key<alloc::string::String>"*, align 8
  %_15.i.i = alloca %"alloc::string::String", align 8
  %_3.sroa.6.i = alloca [2 x i64], align 8
  %_3.sroa.6.0.sroa_cast9.i = bitcast [2 x i64]* %_3.sroa.6.i to i8*
  call void @llvm.lifetime.start.p0i8(i64 16, i8* nonnull %_3.sroa.6.0.sroa_cast9.i)
  %self.i.i.i.i.0.sroa_cast = bitcast %"std::thread::local::fast::Key<alloc::string::String>"** %self.i.i.i.i to i8*
  call void @llvm.lifetime.start.p0i8(i64 8, i8* nonnull %self.i.i.i.i.0.sroa_cast)
  store %"std::thread::local::fast::Key<alloc::string::String>"* bitcast (<{ [32 x i8] }>* @_ZN12thread_local20get_the_thread_local1X7__getit5__KEY17h1f31f65c90417afeE to %"std::thread::local::fast::Key<alloc::string::String>"*), %"std::thread::local::fast::Key<alloc::string::String>"** %self.i.i.i.i, align 8, !noalias !25
  %self.i.i.i.i.0.self.i.i.i.0.self.i.i.0.self.i.0.self.0..i.i.i.i = load volatile %"std::thread::local::fast::Key<alloc::string::String>"*, %"std::thread::local::fast::Key<alloc::string::String>"** %self.i.i.i.i, align 8, !noalias !25, !nonnull !0
  %1 = getelementptr inbounds %"std::thread::local::fast::Key<alloc::string::String>", %"std::thread::local::fast::Key<alloc::string::String>"* %self.i.i.i.i.0.self.i.i.i.0.self.i.i.0.self.i.0.self.0..i.i.i.i, i64 0, i32 5
  %.val.i.i.i.i = load i8, i8* %1, align 1, !noalias !25
  %2 = icmp eq i8 %.val.i.i.i.i, 0
  br i1 %2, label %bb8.i.i.i.i, label %"_ZN3std6thread5local17LocalKey$LT$T$GT$8try_with17ha49ffcfe07458aa0E.exit.thread.i"

bb8.i.i.i.i:                                      ; preds = %start
  %3 = getelementptr inbounds %"std::thread::local::fast::Key<alloc::string::String>", %"std::thread::local::fast::Key<alloc::string::String>"* %self.i.i.i.i.0.self.i.i.i.0.self.i.i.0.self.i.0.self.0..i.i.i.i, i64 0, i32 3
  %.val.i.i.i.i.i = load i8, i8* %3, align 8, !noalias !25
  %4 = icmp eq i8 %.val.i.i.i.i.i, 0
  br i1 %4, label %bb7.i.i.i.i.i, label %bb11.i.i

bb7.i.i.i.i.i:                                    ; preds = %bb8.i.i.i.i
  %5 = bitcast %"std::thread::local::fast::Key<alloc::string::String>"* %self.i.i.i.i.0.self.i.i.i.0.self.i.i.0.self.i.0.self.0..i.i.i.i to i8*
; call std::sys::unix::fast_thread_local::register_dtor
  call void @_ZN3std3sys4unix17fast_thread_local13register_dtor17ha1a2dbcd05fabbafE(i8* nonnull %5, void (i8*)* nonnull @_ZN3std6thread5local4fast13destroy_value17haead139569d8605fE), !noalias !25
  store i8 1, i8* %3, align 8, !noalias !25
  br label %bb11.i.i

"_ZN3std6thread5local17LocalKey$LT$T$GT$8try_with17ha49ffcfe07458aa0E.exit.thread.i": ; preds = %start
  call void @llvm.lifetime.end.p0i8(i64 8, i8* nonnull %self.i.i.i.i.0.sroa_cast)
  br label %bb4.i.i

bb11.i.i:                                         ; preds = %bb7.i.i.i.i.i, %bb8.i.i.i.i
  call void @llvm.lifetime.end.p0i8(i64 8, i8* nonnull %self.i.i.i.i.0.sroa_cast)
  %6 = bitcast %"alloc::string::String"* %_15.i.i to i8*
  call void @llvm.lifetime.start.p0i8(i64 24, i8* nonnull %6), !noalias !25
  %7 = bitcast %"std::thread::local::fast::Key<alloc::string::String>"* %self.i.i.i.i.0.self.i.i.i.0.self.i.i.0.self.i.0.self.0..i.i.i.i to {}**
  %8 = load {}*, {}** %7, align 8, !noalias !25
  %9 = icmp eq {}* %8, null
  br i1 %9, label %bb13.i.i, label %bb15.i.i

bb13.i.i:                                         ; preds = %bb11.i.i
  %10 = bitcast %"alloc::string::String"* %value.i.i.i to i8*
  call void @llvm.lifetime.start.p0i8(i64 24, i8* nonnull %10), !noalias !25
; call thread_local::init
  call fastcc void @_ZN12thread_local4init17h08fde1884c30376bE(%"alloc::string::String"* noalias nocapture nonnull dereferenceable(24) %value.i.i.i) #6, !noalias !25
  %11 = bitcast %"alloc::string::String"* %value.i.i.i to <2 x i64>*
  %12 = load <2 x i64>, <2 x i64>* %11, align 16, !noalias !25
  %_11.sroa.0.sroa.5.0..sroa_idx21.i.i.i = getelementptr inbounds %"alloc::string::String", %"alloc::string::String"* %value.i.i.i, i64 0, i32 1, i32 3
  %_11.sroa.0.sroa.5.0.copyload.i.i.i = load i64, i64* %_11.sroa.0.sroa.5.0..sroa_idx21.i.i.i, align 16, !noalias !25
  %z.i.i.i.sroa.0.0.copyload.i.i.i = load {}*, {}** %7, align 8, !noalias !30
  %13 = getelementptr inbounds %"std::thread::local::fast::Key<alloc::string::String>", %"std::thread::local::fast::Key<alloc::string::String>"* %self.i.i.i.i.0.self.i.i.i.0.self.i.i.0.self.i.0.self.0..i.i.i.i, i64 0, i32 1, i32 1, i32 2, i64 0
  %z.i.i.i.sroa.4.0.copyload.i.i.i = load i64, i64* %13, align 8, !noalias !30
  %14 = getelementptr inbounds %"std::thread::local::fast::Key<alloc::string::String>", %"std::thread::local::fast::Key<alloc::string::String>"* %self.i.i.i.i.0.self.i.i.i.0.self.i.i.0.self.i.0.self.0..i.i.i.i, i64 0, i32 1, i32 1, i32 2, i64 1
  %15 = bitcast %"std::thread::local::fast::Key<alloc::string::String>"* %self.i.i.i.i.0.self.i.i.i.0.self.i.i.0.self.i.0.self.0..i.i.i.i to <2 x i64>*
  store <2 x i64> %12, <2 x i64>* %15, align 8, !noalias !34
  store i64 %_11.sroa.0.sroa.5.0.copyload.i.i.i, i64* %14, align 8, !noalias !34
  %16 = icmp eq {}* %z.i.i.i.sroa.0.0.copyload.i.i.i, null
  %17 = icmp eq i64 %z.i.i.i.sroa.4.0.copyload.i.i.i, 0
  %or.cond.i.i.i = or i1 %16, %17
  br i1 %or.cond.i.i.i, label %"_ZN3std6thread5local17LocalKey$LT$T$GT$4init17h4a7d4a29fdf48028E.exit.i.i", label %bb4.i.i.i.i.i.i.i.i.i

bb4.i.i.i.i.i.i.i.i.i:                            ; preds = %bb13.i.i
  %18 = bitcast {}* %z.i.i.i.sroa.0.0.copyload.i.i.i to i8*
  tail call void @__rust_dealloc(i8* nonnull %18, i64 %z.i.i.i.sroa.4.0.copyload.i.i.i, i64 1) #6, !noalias !25
  br label %"_ZN3std6thread5local17LocalKey$LT$T$GT$4init17h4a7d4a29fdf48028E.exit.i.i"

"_ZN3std6thread5local17LocalKey$LT$T$GT$4init17h4a7d4a29fdf48028E.exit.i.i": ; preds = %bb4.i.i.i.i.i.i.i.i.i, %bb13.i.i
  call void @llvm.lifetime.end.p0i8(i64 24, i8* nonnull %10), !noalias !25
  br label %bb15.i.i

bb15.i.i:                                         ; preds = %"_ZN3std6thread5local17LocalKey$LT$T$GT$4init17h4a7d4a29fdf48028E.exit.i.i", %bb11.i.i
  %_19.0.i.i = bitcast %"std::thread::local::fast::Key<alloc::string::String>"* %self.i.i.i.i.0.self.i.i.i.0.self.i.i.0.self.i.0.self.0..i.i.i.i to %"alloc::string::String"*
; call <alloc::string::String as core::clone::Clone>::clone
  call void @"_ZN60_$LT$alloc..string..String$u20$as$u20$core..clone..Clone$GT$5clone17h9234dcb674122143E"(%"alloc::string::String"* noalias nocapture nonnull sret dereferenceable(24) %_15.i.i, %"alloc::string::String"* noalias nonnull readonly align 8 dereferenceable(24) %_19.0.i.i), !noalias !25
  %_3.sroa.0.0..sroa_cast.i = bitcast %"alloc::string::String"* %_15.i.i to {}**
  %_3.sroa.0.0.copyload.i = load {}*, {}** %_3.sroa.0.0..sroa_cast.i, align 8, !noalias !35
  %_3.sroa.6.0..sroa_idx.i = getelementptr inbounds %"alloc::string::String", %"alloc::string::String"* %_15.i.i, i64 0, i32 1, i32 1, i32 1
  %_3.sroa.6.0..sroa_cast.i = bitcast i64* %_3.sroa.6.0..sroa_idx.i to i8*
  call void @llvm.memcpy.p0i8.p0i8.i64(i8* nonnull align 8 %_3.sroa.6.0.sroa_cast9.i, i8* nonnull align 8 %_3.sroa.6.0..sroa_cast.i, i64 16, i1 false), !noalias !35
  call void @llvm.lifetime.end.p0i8(i64 24, i8* nonnull %6), !noalias !25
  %19 = icmp eq {}* %_3.sroa.0.0.copyload.i, null
  br i1 %19, label %bb4.i.i, label %"_ZN3std6thread5local17LocalKey$LT$T$GT$4with17h0b7fbfbefd581825E.exit"

bb4.i.i:                                          ; preds = %bb15.i.i, %"_ZN3std6thread5local17LocalKey$LT$T$GT$8try_with17ha49ffcfe07458aa0E.exit.thread.i"
; call core::result::unwrap_failed
  tail call fastcc void @_ZN4core6result13unwrap_failed17hcfbdd175a3d3f832E(), !noalias !36
  unreachable

"_ZN3std6thread5local17LocalKey$LT$T$GT$4with17h0b7fbfbefd581825E.exit": ; preds = %bb15.i.i
  %_3.sroa.0.0..sroa_cast2.i = bitcast %"alloc::string::String"* %0 to {}**
  store {}* %_3.sroa.0.0.copyload.i, {}** %_3.sroa.0.0..sroa_cast2.i, align 8, !alias.scope !36
  %_3.sroa.6.0..sroa_idx6.i = getelementptr inbounds %"alloc::string::String", %"alloc::string::String"* %0, i64 0, i32 1, i32 1, i32 1
  %_3.sroa.6.0..sroa_cast7.i = bitcast i64* %_3.sroa.6.0..sroa_idx6.i to i8*
  call void @llvm.memcpy.p0i8.p0i8.i64(i8* nonnull align 8 %_3.sroa.6.0..sroa_cast7.i, i8* nonnull align 8 %_3.sroa.6.0.sroa_cast9.i, i64 16, i1 false), !alias.scope !40
  call void @llvm.lifetime.end.p0i8(i64 16, i8* nonnull %_3.sroa.6.0.sroa_cast9.i)
  ret void
}

I can only guess at what's going on at the LLVM layer, and unfortunately I don't know much about LLVM's inner workings or where to look for tweaking codgen in rustc.

The actual fast thread local @_ZN12thread_local20get_the_thread_local1X7__getit5__KEY ... is mentioned 1x in the read_volatile code, whereas it is referenced 10x in the non-read_volatile code. Six of those mentions use getelementptr inbounds which appears to correspond to the 6 tlv_get_addr (callq *(%rdi)) calls in the final asm.

asm without `read_volatile`
__ZN12thread_local20get_the_thread_local17ha0569ef21dbb73eaE:
	pushq	%rbp
	movq	%rsp, %rbp
	pushq	%r14
	pushq	%rbx
	subq	$64, %rsp
	movq	%rdi, %rbx
	movq	__ZN12thread_local20get_the_thread_local1X7__getit5__KEY17h5103305ae64a8553E@TLVP(%rip), %rdi
	callq	*(%rdi)
	cmpb	$0, 25(%rax)
	jne	LBB4_9
	movq	__ZN12thread_local20get_the_thread_local1X7__getit5__KEY17h5103305ae64a8553E@TLVP(%rip), %rdi
	callq	*(%rdi)
	cmpb	$0, 24(%rax)
	jne	LBB4_3
	movq	__ZN12thread_local20get_the_thread_local1X7__getit5__KEY17h5103305ae64a8553E@TLVP(%rip), %rdi
	callq	*(%rdi)
	movq	%rax, %r14
	leaq	__ZN3std6thread5local4fast13destroy_value17h2b601cce828bc311E(%rip), %rsi
	movq	%rax, %rdi
	callq	__ZN3std3sys4unix17fast_thread_local13register_dtor17hea5fd2f6f1030afcE
	movb	$1, 24(%r14)
LBB4_3:
	movq	__ZN12thread_local20get_the_thread_local1X7__getit5__KEY17h5103305ae64a8553E@TLVP(%rip), %rdi
	callq	*(%rdi)
	cmpq	$0, (%rax)
	je	LBB4_4
LBB4_7:
	movq	__ZN12thread_local20get_the_thread_local1X7__getit5__KEY17h5103305ae64a8553E@TLVP(%rip), %rdi
	callq	*(%rdi)
	leaq	-48(%rbp), %rdi
	movq	%rax, %rsi
	callq	__ZN60_$LT$alloc..string..String$u20$as$u20$core..clone..Clone$GT$5clone17h5e0882d4896c9b5eE
	movq	-48(%rbp), %rax
	movq	-40(%rbp), %rcx
	movq	%rcx, -64(%rbp)
	movq	-32(%rbp), %rcx
	movq	%rcx, -56(%rbp)
	testq	%rax, %rax
	je	LBB4_9
	movq	%rax, (%rbx)
	movq	-64(%rbp), %rax
	movq	-56(%rbp), %rcx
	movq	%rcx, 16(%rbx)
	movq	%rax, 8(%rbx)
	movq	%rbx, %rax
	addq	$64, %rsp
	popq	%rbx
	popq	%r14
	popq	%rbp
	retq
LBB4_4:
	leaq	-48(%rbp), %rdi
	callq	__ZN12thread_local4init17h71551753bba23e2bE
	movaps	-48(%rbp), %xmm0
	movaps	%xmm0, -80(%rbp)
	movq	-32(%rbp), %rcx
	movq	__ZN12thread_local20get_the_thread_local1X7__getit5__KEY17h5103305ae64a8553E@TLVP(%rip), %rdi
	callq	*(%rdi)
	movq	(%rax), %rdi
	movq	8(%rax), %rsi
	movaps	-80(%rbp), %xmm0
	movups	%xmm0, (%rax)
	movq	%rcx, 16(%rax)
	testq	%rdi, %rdi
	je	LBB4_7
	testq	%rsi, %rsi
	je	LBB4_7
	movl	$1, %edx
	callq	___rust_dealloc
	jmp	LBB4_7
LBB4_9:
	callq	__ZN4core6result13unwrap_failed17ha41e6baa6f090bd5E

It's also interesting that the IR for a linux release build is the same, but the assembly only performs callq __tls_get_addr@PLT once at the start of the function.

linux asm
_ZN12thread_local20get_the_thread_local17h53234738c97e2accE:
	pushq	%r15
	pushq	%r14
	pushq	%rbx
	subq	$48, %rsp
	movq	%rdi, %r14
	leaq	_ZN12thread_local20get_the_thread_local1X7__getit5__KEY17hde48327e2980a364E@TLSLD(%rip), %rdi
	callq	__tls_get_addr@PLT
	cmpb	$0, _ZN12thread_local20get_the_thread_local1X7__getit5__KEY17hde48327e2980a364E@DTPOFF+25(%rax)
	jne	.LBB4_8
	movq	%rax, %rbx
	cmpb	$0, _ZN12thread_local20get_the_thread_local1X7__getit5__KEY17hde48327e2980a364E@DTPOFF+24(%rbx)
	jne	.LBB4_3
	movq	%rbx, %rax
	movq	%rax, %r15
	leaq	_ZN12thread_local20get_the_thread_local1X7__getit5__KEY17hde48327e2980a364E@DTPOFF(%rax), %rdi
	leaq	_ZN3std6thread5local4fast13destroy_value17h5fc509333a064006E(%rip), %rsi
	callq	*_ZN3std3sys4unix17fast_thread_local13register_dtor17h7d5b1813798466baE@GOTPCREL(%rip)
	movb	$1, _ZN12thread_local20get_the_thread_local1X7__getit5__KEY17hde48327e2980a364E@DTPOFF+24(%r15)
.LBB4_3:
	movq	%rbx, %rax
	cmpq	$0, _ZN12thread_local20get_the_thread_local1X7__getit5__KEY17hde48327e2980a364E@DTPOFF(%rbx)
	je	.LBB4_4
.LBB4_7:
	leaq	_ZN12thread_local20get_the_thread_local1X7__getit5__KEY17hde48327e2980a364E@DTPOFF(%rbx), %rsi
	movq	%rsp, %rdi
	callq	*_ZN60_$LT$alloc..string..String$u20$as$u20$core..clone..Clone$GT$5clone17h5526323efb1899c4E@GOTPCREL(%rip)
	movq	(%rsp), %rax
	movups	8(%rsp), %xmm0
	movaps	%xmm0, 32(%rsp)
	testq	%rax, %rax
	je	.LBB4_8
	movq	%rax, (%r14)
	movaps	32(%rsp), %xmm0
	movups	%xmm0, 8(%r14)
	movq	%r14, %rax
	addq	$48, %rsp
	popq	%rbx
	popq	%r14
	popq	%r15
	retq
.LBB4_4:
	movq	%rsp, %rdi
	callq	_ZN12thread_local4init17h8519144e0d22f3f8E
	movaps	(%rsp), %xmm0
	movq	16(%rsp), %rcx
	movq	%rbx, %rax
	movq	_ZN12thread_local20get_the_thread_local1X7__getit5__KEY17hde48327e2980a364E@DTPOFF(%rbx), %rdi
	movq	_ZN12thread_local20get_the_thread_local1X7__getit5__KEY17hde48327e2980a364E@DTPOFF+8(%rbx), %rsi
	movaps	%xmm0, _ZN12thread_local20get_the_thread_local1X7__getit5__KEY17hde48327e2980a364E@DTPOFF(%rbx)
	movq	%rcx, _ZN12thread_local20get_the_thread_local1X7__getit5__KEY17hde48327e2980a364E@DTPOFF+16(%rbx)
	testq	%rdi, %rdi
	je	.LBB4_7
	testq	%rsi, %rsi
	je	.LBB4_7
	movl	$1, %edx
	callq	*__rust_dealloc@GOTPCREL(%rip)
	jmp	.LBB4_7
.LBB4_8:
	callq	_ZN4core6result13unwrap_failed17h1d58e680aff0dbf0E
	ud2

It'd be nice if we could avoid this trick and instead just tweak the codegen slightly to be more amenable to better code generation on the backend.

Any ideas/tips for fixing the codegen, or thoughts on where to look in the source? I am pretty motivated to get something done on this.

@alexcrichton
Copy link
Member

Ok thanks for the info! I think I see what's going on here, and this is arguably an LLVM bug I think. What appears to be happening is that LLVM's optimizing for raw accesses to the global, so it's optimizing as much as possible to reference items through the global rather than through some intermediate value (because then it has more knowledge about what's happening). On Linux I believe TLS is a two-level system where you call some fancy function to get the base address and then you offset from that to get the actual item. It looks like LLVM is optimized at the backend to understand that the base address is memoizable so it only needs to call that function once rather than once-per-reference.

On OSX, however, it looks like TLS works differently where it's a dynamic function call to get the address of each variable (I think?). It looks like LLVM isn't realizing that it's probably always the same address and as a result is calling the function once per reference.

Note though that the use of read_volatile which protects us from referencing the global multiple times in the LLVM IR can also work against us. LLVM now has less knowledge about what's being read and can deduce less information about global properties (as volatile reads are sort of a black box to the optimizer AFAIK).

So at it's root it'd be probably good to report this bug to upstream LLVM. This reproduction I believe shows the issue where the callq *(%rdi) happens twice when it should in theory only need to happen once (or so we claim).

Otherwise though it seems reasonable to fix this issue otherwise on our side of things in the meantime. I'd lean towards a solution without volatile reads though to ensure that we don't accidentally break other optimizations others are relying on with thread locals. Would it be possible to perhaps reorganize the code such that it only references the address of the thread local once? This would likely involve some trickery to ensure LLVM is convinced to only reference the global once, but there may be some cleverness we could restructure with raw pointers or something like that maybe?

If it's not possible without volatile though I don't know of anything concrete that gets pessimized with volatile loads so I think we should probably go ahead with the PR as-is. Just figure it's good to try without the volatile first and see how far we can go! It'd also be great to file an upstream LLVM bug and reference that in the code where the volatile read happens if that stays.

@mtak-
Copy link
Contributor Author

mtak- commented Apr 30, 2019

@alexcrichton Thank you for all the information, and the repro!

Everything you said makes perfect sense. I'll file an LLVM bug report, attempt a more optimizer friendly version of the final commit, update this PR and ping you when there's something worth reviewing.

I have some ideas for working around the misoptimization by packing all the metadata about the thread local into a single word. Instead of an Option and two bool's, just a single:

enum State {
  Uninitialized,
  ..
  DtorRunning,
}

or similar packing would reduce the number of offsets that need to be read on every call to __getit. If that isn't good enough, I'll fall back to the more brittle pointer workarounds.

@alexcrichton
Copy link
Member

That sounds reasonable to me yeah, and will probably be a win across the board for reorganizing the internals :)

@rust-highfive
Copy link
Collaborator

The job x86_64-gnu-llvm-6.0 of your PR failed on Travis (raw log). Through arcane magic we have determined that the following fragments from the build log may contain information about the problem.

Click to expand the log.
travis_time:end:072b724e:start=1556862230604515911,finish=1556862317054375468,duration=86449859557
$ git checkout -qf FETCH_HEAD
travis_fold:end:git.checkout

Encrypted environment variables have been removed for security reasons.
See https://docs.travis-ci.com/user/pull-requests/#pull-requests-and-security-restrictions
$ export SCCACHE_BUCKET=rust-lang-ci-sccache2
$ export SCCACHE_REGION=us-west-1
$ export GCP_CACHE_BUCKET=rust-lang-ci-cache
$ export AWS_ACCESS_KEY_ID=AKIA46X5W6CZEJZ6XT55
---
[01:11:09] .................................................................................................... 2400/5483
[01:11:14] .................................................................................................... 2500/5483
[01:11:18] .................................................................................................... 2600/5483
[01:11:21] .................................................................................................... 2700/5483
[01:11:26] .............F...................................................................................... 2800/5483
[01:11:34] .................................................................................................... 3000/5483
[01:11:37] .................................................................................................... 3100/5483
[01:11:40] .................................................................................................... 3200/5483
[01:11:44] ...............................................................................i.................... 3300/5483
---
[01:13:05] 
[01:13:05] - error[E0133]: call to unsafe function is unsafe and requires unsafe function or block
[01:13:05] -   --> $DIR/issue-43733.rs:18:5
[01:13:05] -    |
[01:13:05] - LL |     __KEY.get(init)
[01:13:05] -    |     ^^^^^^^^^^^^^^^ call to unsafe function
[01:13:05] -    |
[01:13:05] -    = note: consult the function's documentation for information on how to avoid undefined behavior
[01:13:05] - error[E0133]: call to unsafe function is unsafe and requires unsafe function or block
[01:13:05] + error[E0061]: this function takes 1 parameter but 2 parameters were supplied
[01:13:05] 10   --> $DIR/issue-43733.rs:22:5
[01:13:05] 11    |
[01:13:05] 11    |
[01:13:05] 12 LL |     std::thread::LocalKey::new(__getit, Default::default);
[01:13:05] -    |     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ call to unsafe function
[01:13:05] -    |
[01:13:05] -    |
[01:13:05] -    = note: consult the function's documentation for information on how to avoid undefined behavior
[01:13:05] 16 
[01:13:05] - error: aborting due to 2 previous errors
[01:13:05] + error: aborting due to previous error
[01:13:05] 18 
[01:13:05] 18 
[01:13:05] - For more information about this error, try `rustc --explain E0133`.
[01:13:05] + For more information about this error, try `rustc --explain E0061`.
[01:13:05] 20 
[01:13:05] 
[01:13:05] 
[01:13:05] The actual stderr differed from the expected stderr.
[01:13:05] Actual stderr saved to /checkout/obj/build/x86_64-unknown-linux-gnu/test/ui/issues/issue-43733/issue-43733.stderr
[01:13:05] To update references, rerun the tests and pass the `--bless` flag
[01:13:05] To only update this specific test, also pass `--test-args issues/issue-43733.rs`
[01:13:05] error: 1 errors occurred comparing output.
[01:13:05] status: exit code: 1
[01:13:05] status: exit code: 1
[01:13:05] command: "/checkout/obj/build/x86_64-unknown-linux-gnu/stage2/bin/rustc" "/checkout/src/test/ui/issues/issue-43733.rs" "-Zthreads=1" "--target=x86_64-unknown-linux-gnu" "--error-format" "json" "-Zui-testing" "-C" "prefer-dynamic" "-o" "/checkout/obj/build/x86_64-unknown-linux-gnu/test/ui/issues/issue-43733/a" "-Crpath" "-O" "-Zunstable-options" "-Lnative=/checkout/obj/build/x86_64-unknown-linux-gnu/native/rust-test-helpers" "-L" "/checkout/obj/build/x86_64-unknown-linux-gnu/test/ui/issues/issue-43733/auxiliary" "-A" "unused"
[01:13:05] ------------------------------------------
[01:13:05] 
[01:13:05] ------------------------------------------
[01:13:05] stderr:
[01:13:05] stderr:
[01:13:05] ------------------------------------------
[01:13:05] error[E0061]: this function takes 1 parameter but 2 parameters were supplied
[01:13:05]   --> /checkout/src/test/ui/issues/issue-43733.rs:22:5
[01:13:05]    |
[01:13:05] LL |     std::thread::LocalKey::new(__getit, Default::default);
[01:13:05] 
[01:13:05] error: aborting due to previous error
[01:13:05] 
[01:13:05] For more information about this error, try `rustc --explain E0061`.
---
[01:13:05] thread 'main' panicked at 'Some tests failed', src/tools/compiletest/src/main.rs:519:22
[01:13:05] note: Run with `RUST_BACKTRACE=1` environment variable to display a backtrace.
[01:13:05] 
[01:13:05] 
[01:13:05] command did not execute successfully: "/checkout/obj/build/x86_64-unknown-linux-gnu/stage0-tools-bin/compiletest" "--compile-lib-path" "/checkout/obj/build/x86_64-unknown-linux-gnu/stage2/lib" "--run-lib-path" "/checkout/obj/build/x86_64-unknown-linux-gnu/stage2/lib/rustlib/x86_64-unknown-linux-gnu/lib" "--rustc-path" "/checkout/obj/build/x86_64-unknown-linux-gnu/stage2/bin/rustc" "--src-base" "/checkout/src/test/ui" "--build-base" "/checkout/obj/build/x86_64-unknown-linux-gnu/test/ui" "--stage-id" "stage2-x86_64-unknown-linux-gnu" "--mode" "ui" "--target" "x86_64-unknown-linux-gnu" "--host" "x86_64-unknown-linux-gnu" "--llvm-filecheck" "/usr/lib/llvm-6.0/bin/FileCheck" "--host-rustcflags" "-Crpath -O -Zunstable-options  -Lnative=/checkout/obj/build/x86_64-unknown-linux-gnu/native/rust-test-helpers" "--target-rustcflags" "-Crpath -O -Zunstable-options  -Lnative=/checkout/obj/build/x86_64-unknown-linux-gnu/native/rust-test-helpers" "--docck-python" "/usr/bin/python2.7" "--lldb-python" "/usr/bin/python2.7" "--gdb" "/usr/bin/gdb" "--quiet" "--llvm-version" "6.0.0\n" "--system-llvm" "--cc" "" "--cxx" "" "--cflags" "" "--llvm-components" "" "--llvm-cxxflags" "" "--adb-path" "adb" "--adb-test-dir" "/data/tmp/work" "--android-cross-path" "" "--color" "always"
[01:13:05] 
[01:13:05] 
[01:13:05] failed to run: /checkout/obj/build/bootstrap/debug/bootstrap test
[01:13:05] Build completed unsuccessfully in 0:04:17
[01:13:05] Build completed unsuccessfully in 0:04:17
[01:13:05] Makefile:48: recipe for target 'check' failed
[01:13:05] make: *** [check] Error 1
The command "stamp sh -x -c "$RUN_SCRIPT"" exited with 2.
travis_time:start:0f843bee
$ date && (curl -fs --head https://google.com | grep ^Date: | sed 's/Date: //g' || true)
Fri May  3 06:58:32 UTC 2019

I'm a bot! I can only do what humans tell me to, so if this was not helpful or you have suggestions for improvements, please ping or otherwise contact @TimNN. (Feature Requests)

@rust-highfive
Copy link
Collaborator

The job x86_64-gnu-llvm-6.0 of your PR failed on Travis (raw log). Through arcane magic we have determined that the following fragments from the build log may contain information about the problem.

Click to expand the log.
travis_time:end:1ab9ce60:start=1556896367533439355,finish=1556896369597919973,duration=2064480618
$ git checkout -qf FETCH_HEAD
travis_fold:end:git.checkout

Encrypted environment variables have been removed for security reasons.
See https://docs.travis-ci.com/user/pull-requests/#pull-requests-and-security-restrictions
$ export SCCACHE_BUCKET=rust-lang-ci-sccache2
$ export SCCACHE_REGION=us-west-1
$ export GCP_CACHE_BUCKET=rust-lang-ci-cache
$ export AWS_ACCESS_KEY_ID=AKIA46X5W6CZEJZ6XT55
---
Check compiletest suite=compile-fail mode=compile-fail (x86_64-unknown-linux-gnu -> x86_64-unknown-linux-gnu)
[01:21:18] 
[01:21:18] running 30 tests
[01:21:19] ERROR 2019-05-03T16:34:19Z: compiletest::runtest: fatal error, panic: "no error pattern specified in \"/checkout/src/test/compile-fail/meta-expected-error-wrong-rev.rs\""
[01:21:19] ................F.............
[01:21:19] 
[01:21:19] ---- [compile-fail] compile-fail/issue-43733-2.rs stdout ----
[01:21:19] thread 'main' panicked at 'Some tests failed', src/tools/compiletest/src/main.rs:519:22
[01:21:19] 
[01:21:19] 
[01:21:19] error: /checkout/src/test/compile-fail/issue-43733-2.rs:24: unexpected error: '24:1: 24:36: `std::cell::Cell<std::thread::local::fast::DtorState>` cannot be shared between threads safely [E0277]'
[01:21:19] 
[01:21:19] error: /checkout/src/test/compile-fail/issue-43733-2.rs:24: expected error not found: `std::cell::Cell<bool>` cannot be shared between threads safely [E0277]
[01:21:19] error: 1 unexpected errors found, 1 expected errors not found
[01:21:19] status: exit code: 1
[01:21:19] status: exit code: 1
[01:21:19] command: "/checkout/obj/build/x86_64-unknown-linux-gnu/stage2/bin/rustc" "/checkout/src/test/compile-fail/issue-43733-2.rs" "-Zthreads=1" "--target=x86_64-unknown-linux-gnu" "--error-format" "json" "-Zui-testing" "-C" "prefer-dynamic" "-o" "/checkout/obj/build/x86_64-unknown-linux-gnu/test/compile-fail/issue-43733-2/a" "-Crpath" "-O" "-Zunstable-options" "-Lnative=/checkout/obj/build/x86_64-unknown-linux-gnu/native/rust-test-helpers" "-L" "/checkout/obj/build/x86_64-unknown-linux-gnu/test/compile-fail/issue-43733-2/auxiliary" "-A" "unused"
[01:21:19]     Error {
[01:21:19]         line_num: 24,
[01:21:19]         kind: Some(
[01:21:19]             Error,
[01:21:19]             Error,
[01:21:19]         ),
[01:21:19]         msg: "24:1: 24:36: `std::cell::Cell<std::thread::local::fast::DtorState>` cannot be shared between threads safely [E0277]",
[01:21:19] ]
[01:21:19] 
[01:21:19] not found errors (from test file): [
[01:21:19]     Error {
[01:21:19]     Error {
[01:21:19]         line_num: 24,
[01:21:19]         kind: Some(
[01:21:19]             Error,
[01:21:19]         ),
[01:21:19]         msg: "`std::cell::Cell<bool>` cannot be shared between threads safely [E0277]",
[01:21:19] ]
[01:21:19] 
[01:21:19] thread '[compile-fail] compile-fail/issue-43733-2.rs' panicked at 'explicit panic', src/tools/compiletest/src/runtest.rs:1402:13
[01:21:19] note: Run with `RUST_BACKTRACE=1` environment variable to display a backtrace.
---
[01:21:19] test result: FAILED. 29 passed; 1 failed; 0 ignored; 0 measured; 0 filtered out
[01:21:19] 
[01:21:19] 
[01:21:19] 
[01:21:19] command did not execute successfully: "/checkout/obj/build/x86_64-unknown-linux-gnu/stage0-tools-bin/compiletest" "--compile-lib-path" "/checkout/obj/build/x86_64-unknown-linux-gnu/stage2/lib" "--run-lib-path" "/checkout/obj/build/x86_64-unknown-linux-gnu/stage2/lib/rustlib/x86_64-unknown-linux-gnu/lib" "--rustc-path" "/checkout/obj/build/x86_64-unknown-linux-gnu/stage2/bin/rustc" "--src-base" "/checkout/src/test/compile-fail" "--build-base" "/checkout/obj/build/x86_64-unknown-linux-gnu/test/compile-fail" "--stage-id" "stage2-x86_64-unknown-linux-gnu" "--mode" "compile-fail" "--target" "x86_64-unknown-linux-gnu" "--host" "x86_64-unknown-linux-gnu" "--llvm-filecheck" "/usr/lib/llvm-6.0/bin/FileCheck" "--host-rustcflags" "-Crpath -O -Zunstable-options  -Lnative=/checkout/obj/build/x86_64-unknown-linux-gnu/native/rust-test-helpers" "--target-rustcflags" "-Crpath -O -Zunstable-options  -Lnative=/checkout/obj/build/x86_64-unknown-linux-gnu/native/rust-test-helpers" "--docck-python" "/usr/bin/python2.7" "--lldb-python" "/usr/bin/python2.7" "--gdb" "/usr/bin/gdb" "--quiet" "--llvm-version" "6.0.0\n" "--system-llvm" "--cc" "" "--cxx" "" "--cflags" "" "--llvm-components" "" "--llvm-cxxflags" "" "--adb-path" "adb" "--adb-test-dir" "/data/tmp/work" "--android-cross-path" "" "--color" "always"
[01:21:19] 
[01:21:19] 
[01:21:19] failed to run: /checkout/obj/build/bootstrap/debug/bootstrap test
[01:21:19] Build completed unsuccessfully in 0:10:40
[01:21:19] Build completed unsuccessfully in 0:10:40
[01:21:19] Makefile:48: recipe for target 'check' failed
[01:21:19] make: *** [check] Error 1
The command "stamp sh -x -c "$RUN_SCRIPT"" exited with 2.
travis_time:start:079a1c33
$ date && (curl -fs --head https://google.com | grep ^Date: | sed 's/Date: //g' || true)
Fri May  3 16:34:20 UTC 2019

I'm a bot! I can only do what humans tell me to, so if this was not helpful or you have suggestions for improvements, please ping or otherwise contact @TimNN. (Feature Requests)

@rust-highfive
Copy link
Collaborator

The job x86_64-gnu-llvm-6.0 of your PR failed on Travis (raw log). Through arcane magic we have determined that the following fragments from the build log may contain information about the problem.

Click to expand the log.
travis_time:end:234f6d2c:start=1556902085358861084,finish=1556902087680618046,duration=2321756962
$ git checkout -qf FETCH_HEAD
travis_fold:end:git.checkout

Encrypted environment variables have been removed for security reasons.
See https://docs.travis-ci.com/user/pull-requests/#pull-requests-and-security-restrictions
$ export SCCACHE_BUCKET=rust-lang-ci-sccache2
$ export SCCACHE_REGION=us-west-1
$ export GCP_CACHE_BUCKET=rust-lang-ci-cache
$ export AWS_ACCESS_KEY_ID=AKIA46X5W6CZEJZ6XT55
---

[00:03:53] travis_fold:start:tidy
travis_time:start:tidy
tidy check
[00:03:54] tidy error: /checkout/src/test/compile-fail/issue-43733-2.rs:26: line longer than 100 chars
[00:03:55] some tidy checks failed
[00:03:55] 
[00:03:55] 
[00:03:55] command did not execute successfully: "/checkout/obj/build/x86_64-unknown-linux-gnu/stage0-tools-bin/tidy" "/checkout/src" "/checkout/obj/build/x86_64-unknown-linux-gnu/stage0/bin/cargo" "--no-vendor" "--quiet"
[00:03:55] 
[00:03:55] 
[00:03:55] failed to run: /checkout/obj/build/bootstrap/debug/bootstrap test src/tools/tidy
[00:03:55] Build completed unsuccessfully in 0:00:45
[00:03:55] Build completed unsuccessfully in 0:00:45
[00:03:56] make: *** [tidy] Error 1
[00:03:56] Makefile:67: recipe for target 'tidy' failed
The command "stamp sh -x -c "$RUN_SCRIPT"" exited with 2.
travis_time:start:2b0226c0
$ date && (curl -fs --head https://google.com | grep ^Date: | sed 's/Date: //g' || true)
Fri May  3 16:52:14 UTC 2019
---
travis_time:end:13836984:start=1556902335499543878,finish=1556902335504434800,duration=4890922
travis_fold:end:after_failure.3
travis_fold:start:after_failure.4
travis_time:start:142a3e6e
$ ln -s . checkout && for CORE in obj/cores/core.*; do EXE=$(echo $CORE | sed 's|obj/cores/core\.[0-9]*\.!checkout!\(.*\)|\1|;y|!|/|'); if [ -f "$EXE" ]; then printf travis_fold":start:crashlog\n\033[31;1m%s\033[0m\n" "$CORE"; gdb --batch -q -c "$CORE" "$EXE" -iex 'set auto-load off' -iex 'dir src/' -iex 'set sysroot .' -ex bt -ex q; echo travis_fold":"end:crashlog; fi; done || true
travis_fold:end:after_failure.4
travis_fold:start:after_failure.5
travis_time:start:29d0d940
travis_time:start:29d0d940
$ cat ./obj/build/x86_64-unknown-linux-gnu/native/asan/build/lib/asan/clang_rt.asan-dynamic-i386.vers || true
cat: ./obj/build/x86_64-unknown-linux-gnu/native/asan/build/lib/asan/clang_rt.asan-dynamic-i386.vers: No such file or directory
travis_fold:end:after_failure.5
travis_fold:start:after_failure.6
travis_time:start:187be89b
$ dmesg | grep -i kill

I'm a bot! I can only do what humans tell me to, so if this was not helpful or you have suggestions for improvements, please ping or otherwise contact @TimNN. (Feature Requests)

@mtak-
Copy link
Contributor Author

mtak- commented May 4, 2019

Would it be possible to perhaps reorganize the code such that it only references the address of the thread local once?

@alexcrichton I have something working. Unfortunately we need at least 1 byte (except in NonZero* cases) to say whether the value has been initialized or not. This ordinarily results in two address calculations.

match &self.opt {             // calculate address of discriminant
  Some(ref val) => Some(val), // calculate address of value storage
  None => self.slow_path(),   // returns Option<&'static T>
}

However, when self.slow_path() is not inlined, the second address calculation occurs within a phi node. Come asm time, this results in linux style offsetting as it's not sure which branch was taken.

With that in mind, here's what a simpler version of the original example now looks like (.len() instead of .clone()):

pub fn get_the_thread_local() -> usize {
    thread_local! {
        static X: String = String::from("hello world. this string is long");
    }
    X.with(|x| x.len())
}

LLVM-IR

; thread_local::get_the_thread_local
; Function Attrs: uwtable
define i64 @_ZN12thread_local20get_the_thread_local17h9b1ad8597fd10cadE() unnamed_addr #4 personality i32 (i32, i32, i64, %"unwind::libunwind::_Unwind_Exception"*, %"unwind::libunwind::_Unwind_Context"*)* @rust_eh_personality {
start:
  %0 = load {}*, {}** bitcast (<{ [32 x i8] }>* @_ZN12thread_local20get_the_thread_local1X7__getit5__KEY17h6da2233c675cfa90E to {}**), align 8, !alias.scope !51
  %1 = icmp eq {}* %0, null
  br i1 %1, label %_ZN12thread_local20get_the_thread_local1X7__getit17h0fb9551d9b330632E.exit.i.i, label %"_ZN3std6thread5local17LocalKey$LT$T$GT$4with17ha7d145a93e39ed07E.exit"

_ZN12thread_local20get_the_thread_local1X7__getit17h0fb9551d9b330632E.exit.i.i: ; preds = %start
; call std::thread::local::fast::Key<T>::try_initialize_drop
  %2 = tail call fastcc align 8 dereferenceable_or_null(24) i64* @"_ZN3std6thread5local4fast12Key$LT$T$GT$19try_initialize_drop17hdbbedea7d28cfb9fE"()
  %3 = icmp eq i64* %2, null
  br i1 %3, label %bb4.i.i, label %"_ZN3std6thread5local17LocalKey$LT$T$GT$4with17ha7d145a93e39ed07E.exit"

bb4.i.i:                                          ; preds = %_ZN12thread_local20get_the_thread_local1X7__getit17h0fb9551d9b330632E.exit.i.i
; call core::result::unwrap_failed
  tail call fastcc void @_ZN4core6result13unwrap_failed17h73420553132c6252E()
  unreachable

"_ZN3std6thread5local17LocalKey$LT$T$GT$4with17ha7d145a93e39ed07E.exit": ; preds = %start, %_ZN12thread_local20get_the_thread_local1X7__getit17h0fb9551d9b330632E.exit.i.i
  %_0.0.i.i1.i.i = phi i64* [ %2, %_ZN12thread_local20get_the_thread_local1X7__getit17h0fb9551d9b330632E.exit.i.i ], [ bitcast (<{ [32 x i8] }>* @_ZN12thread_local20get_the_thread_local1X7__getit5__KEY17h6da2233c675cfa90E to i64*), %start ]
  %4 = getelementptr i64, i64* %_0.0.i.i1.i.i, i64 2
  %.idx.val.i.i = load i64, i64* %4, align 8, !alias.scope !54
  ret i64 %.idx.val.i.i
}

asm

__ZN12thread_local20get_the_thread_local17h3ebe4a7ff0ca9672E:
	pushq	%rax
	movq	__ZN12thread_local20get_the_thread_local1X7__getit5__KEY17hcd2875a690ec9ddbE@TLVP(%rip), %rdi
	callq	*(%rdi)
	cmpq	$0, (%rax)
	je	LBB10_1
LBB10_2:
	movq	16(%rax), %rax
	popq	%rcx
	retq
LBB10_1:
	callq	__ZN3std6thread5local4fast12Key$LT$T$GT$19try_initialize_drop17hb0e61de580a0b1eeE
	testq	%rax, %rax
	jne	LBB10_2
	callq	__ZN4core6result13unwrap_failed17h6d9d014b9b2876d6E

The final asm is now very tight for the fast path. This is likely a win for other platforms as well.

fast::Key still has an UnsafeCell<Option<T>>, but now None simply means, take the slow path - try_initialize_*. The slow path checks dtor_state to see if it needs to:

  • register and initialize the value
  • return None (dtor running)
  • initialize the value (recursive initialization)

The Key types now handle initialization, and LocalKey no longer has an init fn pointer - backwards incompatible change of #[doc(hidden)] unstable public API.

fast::Key no longer does a drop_in_place on the contained value, instead performing a take() on the option and dropping that.

The try_initialize_* functions are now marked #[inline(never)] and #[cold]. They are only run 1x per thread except in corner cases like recursive initialization or during/after the destructor has run. There's not much precedent for #[cold] except for use in std::sync::Once and std::panicking. Is it OK here?

The only other ways I found of reorganizing the code that resulted in 1x tlv_get_addr call on macos involved Boxing up the fast::Key or using read_volatile.

Looking forward to reading your feedback on this. I would understand if these changes are too drastic.

llvm bug report

Copy link
Member

@alexcrichton alexcrichton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice wins! The new structure looks great to me. Could you place a comment somewhere that it's somewhat carefully crafted to ensure that the address of the thread local is read only once due to the LLVM bug? We probably can't really easily add a codegen test for this or verify it doesn't regress over time, but that's also fine since this doesn't really change all that often.

src/libstd/thread/local.rs Outdated Show resolved Hide resolved
src/libstd/thread/local.rs Outdated Show resolved Hide resolved
src/libstd/thread/local.rs Show resolved Hide resolved
src/libstd/thread/local.rs Outdated Show resolved Hide resolved
@rust-highfive
Copy link
Collaborator

The job x86_64-gnu-llvm-6.0 of your PR failed on Travis (raw log). Through arcane magic we have determined that the following fragments from the build log may contain information about the problem.

Click to expand the log.
travis_time:end:05b8cf2c:start=1557594977169819880,finish=1557595062623797864,duration=85453977984
$ git checkout -qf FETCH_HEAD
travis_fold:end:git.checkout

Encrypted environment variables have been removed for security reasons.
See https://docs.travis-ci.com/user/pull-requests/#pull-requests-and-security-restrictions
$ export SCCACHE_BUCKET=rust-lang-ci-sccache2
$ export SCCACHE_REGION=us-west-1
$ export GCP_CACHE_BUCKET=rust-lang-ci-cache
$ export AWS_ACCESS_KEY_ID=AKIA46X5W6CZEJZ6XT55
---
[00:03:57] tidy error: /checkout/src/libstd/thread/local.rs:377: trailing whitespace
[00:04:02] some tidy checks failed
[00:04:02] 
[00:04:02] 
[00:04:02] command did not execute successfully: "/checkout/obj/build/x86_64-unknown-linux-gnu/stage0-tools-bin/tidy" "/checkout/src" "/checkout/obj/build/x86_64-unknown-linux-gnu/stage0/bin/cargo" "--no-vendor" "--quiet"
[00:04:02] 
[00:04:02] 
[00:04:02] failed to run: /checkout/obj/build/bootstrap/debug/bootstrap test src/tools/tidy
[00:04:02] Build completed unsuccessfully in 0:01:10
[00:04:02] Build completed unsuccessfully in 0:01:10
[00:04:02] make: *** [tidy] Error 1
[00:04:02] Makefile:67: recipe for target 'tidy' failed
The command "stamp sh -x -c "$RUN_SCRIPT"" exited with 2.
travis_time:start:02b3a874
$ date && (curl -fs --head https://google.com | grep ^Date: | sed 's/Date: //g' || true)
Sat May 11 17:21:53 UTC 2019
---
travis_time:end:00962157:start=1557595313903822327,finish=1557595313908505674,duration=4683347
travis_fold:end:after_failure.3
travis_fold:start:after_failure.4
travis_time:start:11dd118b
$ ln -s . checkout && for CORE in obj/cores/core.*; do EXE=$(echo $CORE | sed 's|obj/cores/core\.[0-9]*\.!checkout!\(.*\)|\1|;y|!|/|'); if [ -f "$EXE" ]; then printf travis_fold":start:crashlog\n\033[31;1m%s\033[0m\n" "$CORE"; gdb --batch -q -c "$CORE" "$EXE" -iex 'set auto-load off' -iex 'dir src/' -iex 'set sysroot .' -ex bt -ex q; echo travis_fold":"end:crashlog; fi; done || true
travis_fold:end:after_failure.4
travis_fold:start:after_failure.5
travis_time:start:1509167b
travis_time:start:1509167b
$ cat ./obj/build/x86_64-unknown-linux-gnu/native/asan/build/lib/asan/clang_rt.asan-dynamic-i386.vers || true
cat: ./obj/build/x86_64-unknown-linux-gnu/native/asan/build/lib/asan/clang_rt.asan-dynamic-i386.vers: No such file or directory
travis_fold:end:after_failure.5
travis_fold:start:after_failure.6
travis_time:start:0b783588
$ dmesg | grep -i kill

I'm a bot! I can only do what humans tell me to, so if this was not helpful or you have suggestions for improvements, please ping or otherwise contact @TimNN. (Feature Requests)

pub unsafe fn get<F: FnOnce() -> T>(&self, init: F) -> Option<&'static T> {
match self.inner.get() {
Some(val) => Some(val),
None => self.try_initialize(init),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So looking at this again, I'm just trying to follow the branching here and I'm thinking now, would this produce the same codegen?

match self.inner.get() {
    Some(val) => Some(val),
    None => {
        if mem::needs_drop::<T>() {
            self.register_dtor()?;
        }
        Some(self.inner.initialize(init))
    }
}

where register_dtor is basically what try_initialize_drop is right now except it's missing the last self.inner.initialize.

I'd want to be careful to not tamper with the codegen too much though. Is #[cold] on try_initialize needed for that? If so, then could try_initialize be updated to have a similar structure to the match I'm thinking of?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAICT self.inner.initialize has to be behind a #[inline(never)] function or else there's two tlv_get_addr calls on macos. #[cold] is optional, but seems to generate better asm IMO.

Putting the needs_drop check behind an inline(never) like so:

#[inline(never)]
#[cold]
unsafe fn try_initialize<F: FnOnce() -> T>(&self, init: F) -> Option<&'static T> {
    if !mem::needs_drop::<T>() || self.try_register_dtor() {
        Some(self.inner.initialize(init))
    } else {
        None
    }
}

causes unwrapping code to be generated for LocalKey::with, even for cases which always return Some (needs_drop::<T>() == false). I'll poke a little more at this on the weekend, but it is tricky to merge drop/nodrop paths together without hurting one or the other.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok I think #[cold] is definitely an easy win (and one we should do here).

If #[inline(never)] is only on try_register_dtor (this hypothetical function) I think that the codegen should be basically the same as this PR currently is, meaning no unwrapping code but still two accesses perhaps?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The generated code does come out to be the same! I think that's a pretty good win.

Moreover, I was wrong about needing #[inline(never)] anywhere. Only using #[cold] and only on try_initialize gives pretty equivalent results. Sometimes two accesses (try_initialize is inlined), sometimes 1 (try_initialize is not inlined).

@alexcrichton Thanks for help and time spent on this PR! I pushed up the inlining/refactor changes.

@alexcrichton
Copy link
Member

Ok I've got one tiny nit about code organization, but other than that r=me. I didn't mean to cause this to go up to two address lookups though so if you want to restore to the previous code that only had one access that's also ok. So long as it's documented as a bit funky because we're minimizing accesses that's fine by me!

@mati865
Copy link
Contributor

mati865 commented May 13, 2019

Shouldn't this go through perf.rlo first?

@tesuji
Copy link
Contributor

tesuji commented May 13, 2019

Isn't perf Linux only?

@mati865
Copy link
Contributor

mati865 commented May 13, 2019

Unless I'm blind it changes generic code.

@bors
Copy link
Contributor

bors commented May 16, 2019

⌛ Testing commit 2c37961 with merge 2e75d00...

@bors
Copy link
Contributor

bors commented May 16, 2019

💔 Test failed - checks-travis

@rust-highfive
Copy link
Collaborator

The job test-various of your PR failed on Travis (raw log). Through arcane magic we have determined that the following fragments from the build log may contain information about the problem.

Click to expand the log.
[01:13:56] failures:
[01:13:56] 
[01:13:56] ---- [compile-fail] compile-fail/issue-43733-2.rs stdout ----
[01:13:56] 
[01:13:56] error: /checkout/src/test/compile-fail/issue-43733-2.rs:24: unexpected error: '24:1: 24:36: `std::cell::Cell<bool>` cannot be shared between threads safely [E0277]'
[01:13:56] 
[01:13:56] error: /checkout/src/test/compile-fail/issue-43733-2.rs:24: expected error not found: `std::cell::Cell<std::thread::local::fast::DtorState>` cannot be shared between threads
[01:13:56] error: 1 unexpected errors found, 1 expected errors not found
[01:13:56] status: exit code: 1
[01:13:56] status: exit code: 1
[01:13:56] command: "/checkout/obj/build/x86_64-unknown-linux-gnu/stage2/bin/rustc" "/checkout/src/test/compile-fail/issue-43733-2.rs" "-Zthreads=1" "--target=wasm32-unknown-unknown" "--error-format" "json" "-Zui-testing" "--out-dir" "/checkout/obj/build/x86_64-unknown-linux-gnu/test/compile-fail/issue-43733-2" "-Crpath" "-O" "-Zunstable-options" "-Lnative=/checkout/obj/build/wasm32-unknown-unknown/native/rust-test-helpers" "-L" "/checkout/obj/build/x86_64-unknown-linux-gnu/test/compile-fail/issue-43733-2/auxiliary" "-A" "unused"
[01:13:56]     Error {
[01:13:56]         line_num: 24,
[01:13:56]         kind: Some(
[01:13:56]             Error,
[01:13:56]             Error,
[01:13:56]         ),
[01:13:56]         msg: "24:1: 24:36: `std::cell::Cell<bool>` cannot be shared between threads safely [E0277]",
[01:13:56] ]
[01:13:56] 
[01:13:56] not found errors (from test file): [
[01:13:56]     Error {
[01:13:56]     Error {
[01:13:56]         line_num: 24,
[01:13:56]         kind: Some(
[01:13:56]             Error,
[01:13:56]         ),
[01:13:56]         msg: "`std::cell::Cell<std::thread::local::fast::DtorState>` cannot be shared between threads",
[01:13:56] ]
[01:13:56] 
[01:13:56] thread '[compile-fail] compile-fail/issue-43733-2.rs' panicked at 'explicit panic', src/tools/compiletest/src/runtest.rs:1405:13
[01:13:56] note: Run with `RUST_BACKTRACE=1` environment variable to display a backtrace.
---
[01:13:56] 
[01:13:56] thread 'main' panicked at 'Some tests failed', src/tools/compiletest/src/main.rs:512:22
[01:13:56] 
[01:13:56] 
[01:13:56] command did not execute successfully: "/checkout/obj/build/x86_64-unknown-linux-gnu/stage0-tools-bin/compiletest" "--compile-lib-path" "/checkout/obj/build/x86_64-unknown-linux-gnu/stage2/lib" "--run-lib-path" "/checkout/obj/build/x86_64-unknown-linux-gnu/stage2/lib/rustlib/wasm32-unknown-unknown/lib" "--rustc-path" "/checkout/obj/build/x86_64-unknown-linux-gnu/stage2/bin/rustc" "--src-base" "/checkout/src/test/compile-fail" "--build-base" "/checkout/obj/build/x86_64-unknown-linux-gnu/test/compile-fail" "--stage-id" "stage2-wasm32-unknown-unknown" "--mode" "compile-fail" "--target" "wasm32-unknown-unknown" "--host" "x86_64-unknown-linux-gnu" "--llvm-filecheck" "/checkout/obj/build/x86_64-unknown-linux-gnu/llvm/build/bin/FileCheck" "--nodejs" "/node-v9.2.0-linux-x64/bin/node" "--host-rustcflags" "-Crpath -O -Zunstable-options  -Lnative=/checkout/obj/build/x86_64-unknown-linux-gnu/native/rust-test-helpers" "--target-rustcflags" "-Crpath -O -Zunstable-options  -Lnative=/checkout/obj/build/wasm32-unknown-unknown/native/rust-test-helpers" "--docck-python" "/usr/bin/python2.7" "--lldb-python" "/usr/bin/python2.7" "--gdb" "/usr/bin/gdb" "--llvm-version" "8.0.0-rust-1.36.0-dev\n" "--cc" "" "--cxx" "" "--cflags" "" "--llvm-components" "" "--llvm-cxxflags" "" "--adb-path" "adb" "--adb-test-dir" "/data/tmp/work" "--android-cross-path" "" "--color" "always"
[01:13:56] 
[01:13:56] 
[01:13:56] failed to run: /checkout/obj/build/bootstrap/debug/bootstrap test --target wasm32-unknown-unknown src/test/run-make src/test/ui src/test/run-pass src/test/compile-fail src/test/mir-opt src/test/codegen-units src/libcore
[01:13:56] Build completed unsuccessfully in 1:05:09
---
travis_time:end:210a2cc0:start=1558026885625713844,finish=1558026885637748304,duration=12034460
travis_fold:end:after_failure.3
travis_fold:start:after_failure.4
travis_time:start:015b4fe8
$ ln -s . checkout && for CORE in obj/cores/core.*; do EXE=$(echo $CORE | sed 's|obj/cores/core\.[0-9]*\.!checkout!\(.*\)|\1|;y|!|/|'); if [ -f "$EXE" ]; then printf travis_fold":start:crashlog\n\033[31;1m%s\033[0m\n" "$CORE"; gdb --batch -q -c "$CORE" "$EXE" -iex 'set auto-load off' -iex 'dir src/' -iex 'set sysroot .' -ex bt -ex q; echo travis_fold":"end:crashlog; fi; done || true
travis_fold:end:after_failure.4
travis_fold:start:after_failure.5
travis_time:start:2f6740bf
travis_time:start:2f6740bf
$ cat ./obj/build/x86_64-unknown-linux-gnu/native/asan/build/lib/asan/clang_rt.asan-dynamic-i386.vers || true
cat: ./obj/build/x86_64-unknown-linux-gnu/native/asan/build/lib/asan/clang_rt.asan-dynamic-i386.vers: No such file or directory
travis_fold:end:after_failure.5
travis_fold:start:after_failure.6
travis_time:start:03d100d0
$ dmesg | grep -i kill

I'm a bot! I can only do what humans tell me to, so if this was not helpful or you have suggestions for improvements, please ping or otherwise contact @TimNN. (Feature Requests)

@bors bors added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. and removed S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. labels May 16, 2019
@pietroalbini
Copy link
Member

@bors r-

@bors bors added S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels May 18, 2019
@jonas-schievink
Copy link
Contributor

Ping from triage @mtak-, I think there's still CI failures to be addressed?

@mtak-
Copy link
Contributor Author

mtak- commented Jun 18, 2019

@jonas-schievink Yeah! Sorry for letting this hang. I'll look into the failures tonight (PST).

@mtak-
Copy link
Contributor Author

mtak- commented Jun 19, 2019

@jonas-schievink Should be good to go now - barring other CI failures.

@jonas-schievink
Copy link
Contributor

Great! I don't really understand the changed test, so I'll defer to the reviewer for one last check-in.

@jonas-schievink jonas-schievink added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. and removed S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. labels Jun 19, 2019
@alexcrichton
Copy link
Member

@bors: r+

@bors
Copy link
Contributor

bors commented Jun 20, 2019

📌 Commit b148c25 has been approved by alexcrichton

@bors bors added S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Jun 20, 2019
@bors
Copy link
Contributor

bors commented Jun 20, 2019

⌛ Testing commit b148c25 with merge 3c805ce...

bors added a commit that referenced this pull request Jun 20, 2019
macos tlv workaround

fixes: #60141

Includes:
* remove dead code: `requires_move_before_drop`. This hasn't been needed for a while now (oops I should have removed it in #57655)
* redox had a copy of `fast::Key` (not sure why?). That has been removed.
* Perform a `read_volatile` on OSX to reduce `tlv_get_addr` calls per `__getit` from (4-2 depending on context) to 1.

`tlv_get_addr` is relatively expensive (~1.5ns on my machine).

Previously, in contexts where `__getit` was inlined, 4 calls to `tlv_get_addr` were performed per lookup. For some reason when `__getit` is not inlined this is reduced to 2x - and performance improves to match.

After this PR, I have only ever seen 1x call to `tlv_get_addr` per `__getit`, and macos now benefits from situations where `__getit` is inlined.

I'm not sure if the `read_volatile(&&__KEY)` trick is working around an LLVM bug, or a rustc bug, or neither.

r? @alexcrichton
@tesuji
Copy link
Contributor

tesuji commented Jun 20, 2019

It would be better if this PR is squashed appropriately (But don't after it is testing).

@bors
Copy link
Contributor

bors commented Jun 20, 2019

☀️ Test successful - checks-travis, status-appveyor
Approved by: alexcrichton
Pushing 3c805ce to master...

@bors bors added the merged-by-bors This PR was explicitly merged by bors. label Jun 20, 2019
@bors bors merged commit b148c25 into rust-lang:master Jun 20, 2019
@mati865
Copy link
Contributor

mati865 commented Jun 20, 2019

The performance is rather good although there are few regressions: https://perf.rust-lang.org/compare.html?start=&end=3c805ce183840bd20bd81a54a02b3c8b6dccc9dd&stat=instructions%3Au

@RalfJung RalfJung mentioned this pull request Jun 20, 2019
@mtak-
Copy link
Contributor Author

mtak- commented Jun 20, 2019

That might be giving this PR too much credit/blame. Here's a more narrow time range showing tiny improvements across the board, but mostly noise: https://perf.rust-lang.org/compare.html?start=7d107613498c500699dcf89aab7231c28a8fa3af&end=3c805ce183840bd20bd81a54a02b3c8b6dccc9dd&stat=instructions%3Au

@mati865
Copy link
Contributor

mati865 commented Jun 20, 2019

Right I messed up.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
merged-by-bors This PR was explicitly merged by bors. S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

very bad codegen for thread_local! on OSX
9 participants