Improve ptr::read code in debug builds #81163

RalfJung · 2021-01-18T11:50:10Z

In #80290, some people raised concerns about the quality of the code that ptr::write compiles to in debug builds. Given that, to my knowledge, reads are much more common than writes, I would think that one should be much more concerned with the code that ptr::read compiles to -- and currently, there's going to be quite a few function calls in there, so without inlining, that code will be pretty slow.

ptr::read could be improved with techniques similar to what I did for ptr::write (call intrinsics directly, and inline everything else by hand). This would result in (something like) the following implementation: (EDIT see below for why this is wrong)

pub const unsafe fn read<T>(src: *const T) -> T {
    // We are calling the intrinsics directly to avoid function calls in the generated code
    // as `intrinsics::copy_nonoverlapping` is a wrapper function.
    extern "rust-intrinsic" {
        fn copy_nonoverlapping<T>(src: *const T, dst: *mut T, count: usize);
    }

    // For the same reason, we also side-step `mem::MaybeUninit` and use a custom `union` directly.
    #[repr(C)]
    union MaybeUninit<T> {
        init: mem::ManuallyDrop<T>,
        uninit: (),
    }

    let mut tmp: MaybeUninit<T> = MaybeUninit { uninit: () };
    // SAFETY: the caller must guarantee that `src` is valid for reads.
    // `src` cannot overlap `tmp` because `tmp` was just allocated on
    // the stack as a separate allocated object.
    //
    // `MaybeUninit` is repr(C), so we can assume `init` is at offset 0, justifying the pointer
    // casts.
    //
    // Finally, since we just wrote a valid value into `tmp`, it is guaranteed
    // to be properly initialized.
    unsafe {
        copy_nonoverlapping(src, &mut tmp as *mut _ as *mut T, 1);
        mem::transmute_copy(&tmp.init)
    }
}

However, here we have the extra difficulty that read is (unstably) a const fn, so the above implementation is rejected. &tmp.init can be replaced by &mut tmp.init and that works (or we wait for a bootstrap bump so we can make use of #80418), but transmute_copy is non-const, so there's still more work to be done. (transmute does not work since the compiler does not recognize that T and ManuallyDrop<T> have the same size.)

I will stop here, but if someone else strongly cares about ptr::read performance/codesize in debug builds, feel free to pick this up and drive it to completion.

Cc @bjorn3 @therealprof @usbalbin

The text was updated successfully, but these errors were encountered:

RalfJung · 2021-01-18T11:52:34Z

Also Cc #80377

RalfJung · 2021-01-18T11:59:32Z

Actually, transmute_copy is implemented using read, so scratch that plan.^^
I don't think there is a way to do this without calling ManuallyDrop::into_inner.

nagisa · 2021-01-18T22:25:28Z

I don't think there is a way to do this without calling ManuallyDrop::into_inner.

If all of ManuallyDrop, MaybeUninit and ptr::read are defined in the same crate, the fields of the wrappers could be made pub to ptr::read.

RalfJung · 2021-01-19T09:00:06Z

Yeah, that would work... but we'd have to make them public to the entire ptr module I think, so there is a significant risk of this being misused down the road. IOW, this is a bit too hacky for my taste.^^

therealprof · 2021-01-19T10:17:13Z

Any chance of implementing those lowest level functions exclusively with intrinsics? It seems non-ideal to rely on a (non-existing in debug mode) optimizer to turn this into a decent binary code.

I would argue that the most fundamental building blocks to low-level applications should never cause unnecessary overhead, even in debug mode.

RalfJung · 2021-01-19T10:18:50Z

Intrinsics carry a significant implementation cost in various layers of the compiler (typechecking, CTFE/Miri, codegen), so IMO we should keep the number of intrinsics to the absolute minimum necessary.

It seems non-ideal to rely on a (non-existing in debug mode) optimizer to turn this into a decent binary code.

This is not entirely true; some MIR simplification passes do run even in debug mode.

therealprof · 2021-01-19T10:29:56Z

Intrinsics carry a significant implementation cost in various layers of the compiler (typechecking, CTFE/Miri, codegen), so IMO we should keep the number of intrinsics to the absolute minimum necessary.

Fair enough. I still think write, read, write_volatile and read_volatile qualify as essentials for a systems programming language.

This is not entirely true; some MIR simplification passes do run even in debug mode.

If we can get efficient implementations for the mentioned functions another way that would be great.

RalfJung · 2021-01-19T10:38:47Z

I still think write, read, write_volatile and read_volatile qualify as essentials for a systems programming language.

That might be a good enough justification to make the fields of MaybeUninit and ManuallyDrop "public enough" to avoid function calls in read.

OTOH, the volatile operations have to be intrinsics anyway, so potentially much of the code could be shared between read_volatile and a read intrinsic.

alecmocatta · 2021-01-19T11:31:41Z

I may be off the mark here, but might we see net perf gains with certain cheap wrapper functions (e.g. ptr::copy_nonoverlapping) just being marked #[inline(always)]? Might be worth testing before expending effort on manual inlining.

bjorn3 · 2021-01-19T11:37:38Z

#[inline(always)] won't help for cg_clif until the MIR inliner is enabled by default. It also requires the LLVM inliner to do more work, thus causing debug builds to become a bit slower.

RalfJung · 2021-01-19T11:45:57Z

I thought inline(always) only affects release build? If it also affects debug builds, that might still be better overall than some of the more hacky alternatives.

bjorn3 · 2021-01-19T11:55:00Z

#[inline(always)] is also enabled when optimizations are disabled. Except when -Cno-prepopulate-passes is used.

For the old pass manager:

rust/compiler/rustc_codegen_llvm/src/back/write.rs

Line 1047 in 4e208f6

llvm::LLVMRustAddAlwaysInlinePass(builder, config.emit_lifetime_markers);

For the new pass manager:

rust/compiler/rustc_llvm/llvm-wrapper/PassWrapper.cpp

Line 911 in 2ba7ca2

MPM.addPass(AlwaysInlinerPass(EmitLifetimeMarkers));

alecmocatta · 2021-01-19T12:03:35Z

It doesn't seem implausible to me that #[inline(always)] might be a net gain for debug build times, while also improving perf of debug binaries. The llvm.memcpy intrinsic might be cheaper for LLVM to reason about than a call to the external ptr::copy_nonoverlapping, possibly making up for the increased cost to the inliner.

bjorn3 · 2021-01-19T12:07:34Z

I mean slower compared to manual inlining.

alecmocatta · 2021-01-19T12:17:23Z

Sure, #[inline(always)] may have a (I would assume very small) cost over manual inlining. The potential "net gain" I'm referring to is from automatically inlining in every place where ptr::copy_nonoverlapping is called (including user crates and applications) rather than a limited number of manual inlinings in std.

bjorn3 · 2021-01-19T13:10:31Z

ptr::copy_nonoverlapping is a thin wrapper around the intrinsic named the same. I would very much like all intrinsic wrappers to become re-exports of the intrinsics themself. This would eliminate the need for inlining but would require some way to change the "abi" of intrinsics from "rust-intrinsic" to "Rust" to make them indistinguishable from real functions.

RalfJung · 2021-01-19T13:14:18Z

I would very much like all intrinsic wrappers to become re-exports of the intrinsics themself.

FWIW, copy_nonoverlapping was such a reexport until we added the debug assertions. Now that the debug assertions are disabled, if we feel like we don't want them back anyway, we could change it back to a reeexport.

This would eliminate the need for inlining but would require some way to change the "abi" of intrinsics from "rust-intrinsic" to "Rust" to make them indistinguishable from real functions.

I think the main issue here is that using re-exports makes the intrinsics stably callable at std::intrinsics::X. Rust stability is not fine-grained enough to let us stabilize the reexport without stabilizing the thing that is being reexported.

oli-obk · 2021-01-20T12:44:55Z

I don't think we should be trying to solve this at the library level. We are getting rather close to making MIR inlining work everywhere, and in my personal opinion we should look into improving debug mode by adding more mir-opt-level=1 optimizations that do not affect debug info negatively. There are still various operations that require intrinsics and function calls at the code level, but have equivalent basic MIR statements available that we could optimize to. I mean... ptr::write is equivalent to a single MIR statement: StatementKind::Assign(Place(dest, [Deref]), Rvalue::Use(Operand::Move(source)). If we can figure this out properly (optimize away forget, lower copy_nonoverlapping to StatementKind::Assign), then we may actually see better debug compile-times as well as smaller binary code than we did before.

RalfJung · 2021-01-20T12:51:02Z

If we can figure this out properly (optimize away forget, lower copy_nonoverlapping to StatementKind::Assign), then we may actually see better debug compile-times as well as smaller binary code than we did before.

forget is already removed by #79049. So the only thing that's left is this optimization (and I keep repeating this but somehow it seems to get lost all the time^^).

RalfJung · 2021-01-20T12:51:33Z

Anyway, this issue is about ptr::read, not ptr::write. Let's stay on topic please.

wesleywiser · 2021-01-20T14:40:11Z

I started working on that optimization last night.

tmiasko · 2021-01-20T16:23:06Z

Lowering copy_nonoverlapping to an assignment removes the dead cleanup pad (MIR building currently doesn't understand that some intrinsics don't unwind). It still introduces an extra alloca for src during LLVM code generation because src has its address taken, unlike what happened with move_val_init.

RalfJung · 2021-01-21T08:57:28Z

FWIW, the MaybeUninit and ManuallyDrop functions called by ptr::read are all already inline(always).

So together with #81238, ptr::read doesn't call any non-inline(always) function any more.

directly expose copy and copy_nonoverlapping intrinsics This effectively un-does rust-lang#57997. That should help with `ptr::read` codegen in debug builds (and any other of these low-level functions that bottoms out at `copy`/`copy_nonoverlapping`), where the wrapper function will not get inlined. See the discussion in rust-lang#80290 and rust-lang#81163. Cc `@bjorn3` `@therealprof`

RalfJung · 2021-02-14T10:35:11Z

ptr::read should have gotten better now in debug builds, since mem::copy_nonoverlapping is directly exposing the intrinsic.

@wesleywiser

I started working on that optimization last night.

What is the status of that? Both ptr::read and ptr::write would now benefit from this.

therealprof · 2021-02-14T13:14:56Z

ptr::read should have gotten better now in debug builds, since mem::copy_nonoverlapping is directly exposing the intrinsic.

I can confirm that I'm seeing slight binary size reductions (< 1% in the best case) when comparing some STM32F0 applications using rustc 1.52.0-nightly (8e54a2113 2021-02-13) to either 1.50 or a previous nightly from a days ago in debug binaries.

wesleywiser · 2021-02-16T00:57:15Z

I started working on that optimization last night.

What is the status of that? Both ptr::read and ptr::write would now benefit from this.

@RalfJung I opened #81344 with that optimization but completely forgot that we approved an MCP a few months ago to add a MIR StatementKind::CopyNonOverlapping. That work is happening in #77511. Once that is finished, I'll redo #81344 if it's still relevant.

directly expose copy and copy_nonoverlapping intrinsics This effectively un-does rust-lang/rust#57997. That should help with `ptr::read` codegen in debug builds (and any other of these low-level functions that bottoms out at `copy`/`copy_nonoverlapping`), where the wrapper function will not get inlined. See the discussion in rust-lang/rust#80290 and rust-lang/rust#81163. Cc `@bjorn3` `@therealprof`

mati865 · 2021-03-14T19:18:57Z

FYI #77511 has landed.

jrmuizel · 2021-03-31T20:44:39Z

@wesleywiser do you still plan on redoing #81344?

wesleywiser · 2021-04-08T01:04:31Z

Thanks for the reminder! I've opened #83785 which shows some small, positive improvements to compilation time.

RalfJung · 2022-11-27T08:19:30Z

read currently calls the intrinsic directly

rust/library/core/src/ptr/mod.rs

Lines 1116 to 1121 in 5fa44b5

    
           // We are calling the intrinsics directly to avoid function calls in the generated code 
        
           // as `intrinsics::copy_nonoverlapping` is a wrapper function. 
        
           extern "rust-intrinsic" { 
        
               #[rustc_const_stable(feature = "const_intrinsic_copy", since = "1.63.0")] 
        
               fn copy_nonoverlapping<T>(src: *const T, dst: *mut T, count: usize); 
        
           }

Is there anything left to do in this issue?

RalfJung · 2022-11-27T08:24:35Z

For now, considering this fixed by #87827.

camelid added T-libs Relevant to the library team, which will review and decide on the PR/issue. A-raw-pointers Area: raw pointers, MaybeUninit, NonNull I-slow Issue: Problems and improvements with respect to performance of generated code. labels Jan 18, 2021

therealprof mentioned this issue Jan 20, 2021

implement ptr::write without dedicated intrinsic #80290

Merged

oli-obk added A-mir-opt Area: MIR optimizations A-mir-opt-inlining Area: MIR inlining labels Jan 20, 2021

wesleywiser self-assigned this Jan 20, 2021

RalfJung mentioned this issue Jan 21, 2021

directly expose copy and copy_nonoverlapping intrinsics #81238

Merged

RalfJung closed this as completed Nov 27, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve ptr::read code in debug builds #81163

Improve ptr::read code in debug builds #81163

RalfJung commented Jan 18, 2021 •

edited

Loading

RalfJung commented Jan 18, 2021

RalfJung commented Jan 18, 2021

nagisa commented Jan 18, 2021

RalfJung commented Jan 19, 2021

therealprof commented Jan 19, 2021

RalfJung commented Jan 19, 2021 •

edited

Loading

therealprof commented Jan 19, 2021

RalfJung commented Jan 19, 2021

alecmocatta commented Jan 19, 2021

bjorn3 commented Jan 19, 2021

RalfJung commented Jan 19, 2021

bjorn3 commented Jan 19, 2021

alecmocatta commented Jan 19, 2021

bjorn3 commented Jan 19, 2021 •

edited

Loading

alecmocatta commented Jan 19, 2021

bjorn3 commented Jan 19, 2021

RalfJung commented Jan 19, 2021 •

edited

Loading

oli-obk commented Jan 20, 2021

RalfJung commented Jan 20, 2021 •

edited

Loading

RalfJung commented Jan 20, 2021

wesleywiser commented Jan 20, 2021

tmiasko commented Jan 20, 2021

RalfJung commented Jan 21, 2021 •

edited

Loading

RalfJung commented Feb 14, 2021

therealprof commented Feb 14, 2021

wesleywiser commented Feb 16, 2021

mati865 commented Mar 14, 2021

jrmuizel commented Mar 31, 2021

wesleywiser commented Apr 8, 2021

RalfJung commented Nov 27, 2022

RalfJung commented Nov 27, 2022

Improve ptr::read code in debug builds #81163

Improve ptr::read code in debug builds #81163

Comments

RalfJung commented Jan 18, 2021 • edited Loading

RalfJung commented Jan 18, 2021

RalfJung commented Jan 18, 2021

nagisa commented Jan 18, 2021

RalfJung commented Jan 19, 2021

therealprof commented Jan 19, 2021

RalfJung commented Jan 19, 2021 • edited Loading

therealprof commented Jan 19, 2021

RalfJung commented Jan 19, 2021

alecmocatta commented Jan 19, 2021

bjorn3 commented Jan 19, 2021

RalfJung commented Jan 19, 2021

bjorn3 commented Jan 19, 2021

alecmocatta commented Jan 19, 2021

bjorn3 commented Jan 19, 2021 • edited Loading

alecmocatta commented Jan 19, 2021

bjorn3 commented Jan 19, 2021

RalfJung commented Jan 19, 2021 • edited Loading

oli-obk commented Jan 20, 2021

RalfJung commented Jan 20, 2021 • edited Loading

RalfJung commented Jan 20, 2021

wesleywiser commented Jan 20, 2021

tmiasko commented Jan 20, 2021

RalfJung commented Jan 21, 2021 • edited Loading

RalfJung commented Feb 14, 2021

therealprof commented Feb 14, 2021

wesleywiser commented Feb 16, 2021

mati865 commented Mar 14, 2021

jrmuizel commented Mar 31, 2021

wesleywiser commented Apr 8, 2021

RalfJung commented Nov 27, 2022

RalfJung commented Nov 27, 2022

RalfJung commented Jan 18, 2021 •

edited

Loading

RalfJung commented Jan 19, 2021 •

edited

Loading

bjorn3 commented Jan 19, 2021 •

edited

Loading

RalfJung commented Jan 19, 2021 •

edited

Loading

RalfJung commented Jan 20, 2021 •

edited

Loading

RalfJung commented Jan 21, 2021 •

edited

Loading