implement ptr::write without dedicated intrinsic #80290
Conversation
r? @m-ou-se (rust-highfive has picked a reviewer for you, use r? to override)
Let's see if perf says anything. @bors try
⌛ Trying commit e10fd772a10fa762806af973306b1513d40ac1c1 with merge 035e759b99e57ac8055a4d0c71b48e2ceb0beb36...
@rust-timer queue
Awaiting bors try build completion.
☀️ Try build successful - checks-actions
Queued 035e759b99e57ac8055a4d0c71b48e2ceb0beb36 with parent 793931f, future comparison URL. @rustbot label: +S-waiting-on-perf
Finished benchmarking try commit (035e759b99e57ac8055a4d0c71b48e2ceb0beb36): comparison url. Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. Please note that if the perf results are neutral, you should likely undo the rollup=never given below. Importantly, though, if the results of this run are non-neutral, do not roll this PR up -- it will mask other regressions or improvements in the rollup. @bors rollup=never
Let's see what happens if we only call intrinsics directly in …
@bors try
@rust-timer queue
Awaiting bors try build completion.
⌛ Trying commit 9b2ca1ac97baf9624f2c5fb8bff9959e6e91d712 with merge 63c71d4cdc96bd631e4a4da9b77587035aac443b...
☀️ Try build successful - checks-actions
Queued 63c71d4cdc96bd631e4a4da9b77587035aac443b with parent 75e1acb, future comparison URL. @rustbot label: +S-waiting-on-perf
Finished benchmarking try commit (63c71d4cdc96bd631e4a4da9b77587035aac443b): comparison url. Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. Please note that if the perf results are neutral, you should likely undo the rollup=never given below. Importantly, though, if the results of this run are non-neutral, do not roll this PR up -- it will mask other regressions or improvements in the rollup. @bors rollup=never
Now perf is looking clean (but @bjorn3 remarked that the impact would likely be largest when running debug builds).
Perf results for ebobby/simple-raytracer@804a7a2:
The perf difference is so small because simple-raytracer only spends a tiny bit of time in `ptr::write`.
Before: …
After: …
@bjorn3 what kind of build of the raytracer is that (debug/release, which codegen backend)?
Debug build using cg_llvm.
Much better. The old implementation

```rust
#![feature(core_intrinsics)]
pub unsafe fn write<T>(dst: *mut T, src: T) {
    std::intrinsics::move_val_init(&mut *dst, src)
}
```

generates this MIR:

```
fn write(_1: *mut T, _2: T) -> () {
    debug dst => _1;
    debug src => _2;
    let mut _0: ();
    let mut _3: *mut T;
    let mut _4: &mut T;

    bb0: {
        StorageLive(_4);
        _4 = &mut (*_1);
        _3 = &raw mut (*_4);
        (*_3) = move _2;
        StorageDead(_4);
        return;
    }
}
```

The new implementation

```rust
#![feature(core_intrinsics)]
pub unsafe fn write<T>(dst: *mut T, src: T) {
    std::intrinsics::copy_nonoverlapping(&src as *const T, dst, 1);
    std::intrinsics::forget(src);
}
```

generates this MIR:

```
fn write(_1: *mut T, _2: T) -> () {
    debug dst => _1;
    debug src => _2;
    let mut _0: ();
    let _3: ();
    let mut _4: *const T;
    let _5: &T;
    let mut _6: *mut T;
    let mut _7: bool;

    bb0: {
        _7 = const false;
        _7 = const true;
        StorageLive(_3);
        StorageLive(_4);
        StorageLive(_5);
        _5 = &_2;
        _4 = &raw const (*_5);
        StorageLive(_6);
        _6 = _1;
        _3 = copy_nonoverlapping::<T>(move _4, move _6, const 1_usize) -> [return: bb1, unwind: bb4];
    }

    bb1: {
        StorageDead(_6);
        StorageDead(_4);
        StorageDead(_5);
        StorageDead(_3);
        _7 = const false;
        _0 = const ();
        return;
    }

    bb2 (cleanup): {
        resume;
    }

    bb3 (cleanup): {
        drop(_2) -> bb2;
    }

    bb4 (cleanup): {
        switchInt(_7) -> [false: bb2, otherwise: bb3];
    }
}
```
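For readers without a nightly toolchain, the same shape can be written in stable Rust using the public wrappers. This is a sketch mirroring the PR's approach, not the actual libcore source; `write_via_copy` is a made-up name:

```rust
use std::{mem, ptr};

// Sketch: `ptr::write` in terms of the stable `copy_nonoverlapping`
// wrapper. `write_via_copy` is a hypothetical name, not a libcore API.
pub unsafe fn write_via_copy<T>(dst: *mut T, src: T) {
    // Copy the bytes of `src` over `*dst` without reading or dropping
    // the old contents of `*dst`.
    ptr::copy_nonoverlapping(&src as *const T, dst, 1);
    // Ownership has logically moved into `*dst`; prevent `src`'s
    // destructor from running.
    mem::forget(src);
}

fn main() {
    let mut x = 0u32;
    unsafe { write_via_copy(&mut x, 42) };
    assert_eq!(x, 42);
}
```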
☀️ Test successful - checks-actions
There's a noticeable binary size regression between … vs …
The only observable difference I can spot is: …
@therealprof so is the binary of rustc itself bigger, or is some rustc-generated binary bigger? Is this a debug build or a release build? https://github.com/kennytm/rustup-toolchain-install-master could be used to confirm that it is this PR vs some other PR that landed that day.
Generated binaries in dev (or debug mode, if you prefer) are larger.
Sorry, I don't have time to bisect this. I was just browsing the recently merged PRs, and the mention of regressions piqued my interest, so I decided to run my tools (https://github.com/stm32-rs/stm32f0xx-hal/blob/master/tools/capture_nightly_example_bloat.sh) on the latest nightly, and sure enough I can see regressions happening between …
Some regression for debug builds was expected with this PR. Depending on how much this matters, I sketched an idea for how to mitigate this: …
Well, debug mode in Rust is atrocious in every regard compared to other languages used for embedded development. The binaries are huge (often too big to fit into the flash of smaller microcontrollers) and slow (some peripherals like USB can't even be used, since the generated code is orders of magnitude too slow to react to USB events). Every little improvement helps! I'd be happy to test and benchmark any changes once they've landed in nightly. I've collected a nice dataset over different versions in the current format (each stable release back to 1.41 and quite a few more nightlies in between).
The reason this PR was deemed acceptable is that the issue should only affect debug builds. Anybody who cares about such issues would use release (or size-optimized) builds, we figured, and those should not be affected. Why does the size of debug builds matter so much to you? (Btw, discussions in a closed PR are bound to get lost, so if you think something should be done here or if you think there's something worth tracking, please open an issue.)
I won't have time to work on this, but maybe someone from @rust-lang/wg-mir-opt would be interested. Cc @tmiasko who recently added a closely related intrinsic lowering MIR pass. |
I can confirm release builds are not affected.
Well, only debug builds are really debuggable (in an embedded context), for starters, and debugging with a debugger has extra relevance there due to the limited interaction capabilities compared with a regular application. Release builds are also built with all the optimisation features Rust has to offer, which makes them really slow to compile. And then there's the usual trap for young players: the default build mode is dev mode, so beginners frequently end up in debug mode; best case, they'll get a linker error telling them about their mistake...
Sure thing.
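One common mitigation for the dev-build size problem described above, independent of this PR, is to enable some optimization in Cargo's dev profile while keeping debug info. A minimal sketch with illustrative values:

```toml
# Cargo.toml: make dev builds smaller and faster while staying debuggable.
# Illustrative values, not taken from this thread.
[profile.dev]
opt-level = "z"  # optimize for size; "s" or 1 trade size for debuggability
debug = true     # keep full debug info for the debugger
```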
@RalfJung @lcnr There seems to be a small compile-time perf regression after all in regex-debug full builds. Looks like LLVM_module_codegen_emit_obj regressed. Thoughts on whether addressing this is worth it?
Now, this is only for regex-debug, and we generally don't put too much effort into making debug builds efficient, so the runtime perf loss is acceptable; but it is still not great that we lost some compile time. I guess regex uses … heavily. I don't think it is possible to regain this perf directly. We may get it back via always-run MIR optimizations that turn this …
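Presumably this refers to lowering the intrinsic call to a primitive MIR statement, in the spirit of the intrinsic-lowering MIR pass mentioned above. A hand-written sketch in MIR-like notation (not actual compiler output) of what such a rewrite could do to the MIR shown earlier:

```
// Before: `copy_nonoverlapping` is a real call terminator, with an unwind edge.
_3 = copy_nonoverlapping::<T>(move _4, move _6, const 1_usize) -> [return: bb1, unwind: bb4];

// After lowering: a primitive MIR statement plus a goto; no call, no unwind edge.
copy_nonoverlapping(dst = move _6, src = move _4, count = const 1_usize);
goto -> bb1;
```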
cf. #81163 for ongoing discussion
I very much disagree with this statement. The performance and binary size of debug builds are a huge problem for embedded Rust already; regressions are definitely not acceptable for us.
Also see #80290 (comment) for an approach that should regain perf.
Should we create an issue for #80290 (comment) to make sure it's not lost?
Yeah, if people care enough, there should be a version of #81163 for …
directly expose copy and copy_nonoverlapping intrinsics: This effectively un-does rust-lang#57997. That should help with `ptr::read` codegen in debug builds (and any other of these low-level functions that bottom out at `copy`/`copy_nonoverlapping`), where the wrapper function will not get inlined. See the discussion in rust-lang#80290 and rust-lang#81163. Cc `@bjorn3` `@therealprof`
The test mentioned by this comment was deleted long ago by <rust-lang#80290>.
This makes `ptr::write` more consistent with `ptr::write_unaligned`, `ptr::read`, and `ptr::read_unaligned`, all of which are implemented in terms of `copy_nonoverlapping`. This means we can also remove the `move_val_init` implementations in codegen and Miri, and its special handling in the borrow checker. Also see this Zulip discussion.