mem::swap the obvious way for types smaller than the SIMD optimization's block size #52051

scottmcm · 2018-07-04T10:23:24Z

LLVM isn't able to remove the alloca for the unaligned block in the post-SIMD tail in some cases, so doing this helps SRoA work in cases where it currently doesn't. Found in the replace_with RFC discussion.
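For readers who want to see what "the obvious way" looks like, here is a minimal sketch of a three-step typed swap. The function name `swap_obvious` is hypothetical and this is not the exact code added to libcore; it only illustrates the read/copy/write shape that is discussed later in the thread.

```rust
use std::ptr;

/// Hypothetical helper: swap `*x` and `*y` with one typed read,
/// one typed copy, and one typed write (no byte-level scratch block).
pub fn swap_obvious<T>(x: &mut T, y: &mut T) {
    let x: *mut T = x;
    let y: *mut T = y;
    // SAFETY: both pointers come from distinct `&mut` references, so they
    // are valid, aligned, and non-overlapping. `z` is a bitwise copy of
    // `*x` that is moved into `*y` by `write`, so nothing is dropped twice.
    unsafe {
        let z = ptr::read(x);
        ptr::copy_nonoverlapping(y, x, 1);
        ptr::write(y, z);
    }
}
```

Because every step is typed, the temporary `z` is an ordinary local of type `T` rather than an untyped block on the stack, which is presumably what lets SRoA split it into registers.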

Examples of the improvements:

swapping `[u16; 3]` takes 1/3 fewer instructions and no stackalloc

type Demo = [u16; 3];
pub fn swap_demo(x: &mut Demo, y: &mut Demo) {
    std::mem::swap(x, y);
}

nightly:

_ZN4blah9swap_demo17ha1732a9b71393a7eE:
.seh_proc _ZN4blah9swap_demo17ha1732a9b71393a7eE
	sub	rsp, 32
	.seh_stackalloc 32
	.seh_endprologue
	movzx	eax, word ptr [rcx + 4]
	mov	word ptr [rsp + 4], ax
	mov	eax, dword ptr [rcx]
	mov	dword ptr [rsp], eax
	movzx	eax, word ptr [rdx + 4]
	mov	word ptr [rcx + 4], ax
	mov	eax, dword ptr [rdx]
	mov	dword ptr [rcx], eax
	movzx	eax, word ptr [rsp + 4]
	mov	word ptr [rdx + 4], ax
	mov	eax, dword ptr [rsp]
	mov	dword ptr [rdx], eax
	add	rsp, 32
	ret
	.seh_handlerdata
	.section	.text,"xr",one_only,_ZN4blah9swap_demo17ha1732a9b71393a7eE
	.seh_endproc

this PR:

_ZN4blah9swap_demo17ha1732a9b71393a7eE:
	mov	r8d, dword ptr [rcx]
	movzx	r9d, word ptr [rcx + 4]
	movzx	eax, word ptr [rdx + 4]
	mov	word ptr [rcx + 4], ax
	mov	eax, dword ptr [rdx]
	mov	dword ptr [rcx], eax
	mov	word ptr [rdx + 4], r9w
	mov	dword ptr [rdx], r8d
	ret

`replace_with` optimizes down much better

Inspired by rust-lang/rfcs#2490,

fn replace_with<T, F>(x: &mut Option<T>, f: F)
    where F: FnOnce(Option<T>) -> Option<T>
{
    *x = f(x.take());
}

pub fn inc_opt(mut x: &mut Option<i32>) {
    replace_with(&mut x, |i| i.map(|j| j + 1));
}

Rust 1.26.0:

_ZN4blah7inc_opt17heb0acb64c51777cfE:
	mov	rax, qword ptr [rcx]
	movabs	r8, 4294967296
	add	r8, rax
	shl	rax, 32
	movabs	rdx, -4294967296
	and	rdx, r8
	xor	r8d, r8d
	test	rax, rax
	cmove	rdx, rax
	setne	r8b
	or	rdx, r8
	mov	qword ptr [rcx], rdx
	ret

Nightly (better thanks to ScalarPair, maybe?):

_ZN4blah7inc_opt17h66df690be0b5899dE:
	mov	r8, qword ptr [rcx]
	mov	rdx, r8
	shr	rdx, 32
	xor	eax, eax
	test	r8d, r8d
	setne	al
	add	edx, 1
	mov	dword ptr [rcx], eax
	mov	dword ptr [rcx + 4], edx
	ret

This PR:

_ZN4blah7inc_opt17h1426dc215ecbdb19E:
	xor	eax, eax
	cmp	dword ptr [rcx], 0
	setne	al
	mov	dword ptr [rcx], eax
	add	dword ptr [rcx + 4], 1
	ret

Where that add is beautiful -- using an addressing mode to not even need to explicitly go through a register -- and the remaining imperfection is well-known (#49420 (comment)).
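As a behavioral reference for the assembly above, the following sketch pairs `inc_opt` with a hand-written in-place version (`inc_opt_direct`, a hypothetical name not in the PR). It passes the reference straight through rather than re-borrowing it, which should not change the semantics. `Option::take` goes through `mem::replace`, and (judging by the later remark that replace stopped using swap in rust-lang#83022) `mem::replace` was built on `mem::swap` at the time, which is why the swap codegen shows up here.

```rust
fn replace_with<T, F>(x: &mut Option<T>, f: F)
    where F: FnOnce(Option<T>) -> Option<T>
{
    // `take` moves the value out and leaves `None` behind.
    *x = f(x.take());
}

pub fn inc_opt(x: &mut Option<i32>) {
    replace_with(x, |i| i.map(|j| j + 1));
}

/// Hypothetical hand-written equivalent: mutate the payload in place.
pub fn inc_opt_direct(x: &mut Option<i32>) {
    if let Some(j) = x {
        *j += 1;
    }
}
```

Both functions leave `None` untouched and increment the payload of `Some`; the optimized assembly for this PR is close to what one would expect from the direct version.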

rust-highfive · 2018-07-04T10:23:28Z

r? @KodrAus

(rust_highfive has picked a reviewer for you, use r? to override)

kennytm · 2018-07-04T10:39:52Z

I don't think this has anything to do with SIMD, as the longer codegen is still produced when commenting out the #[repr(simd)].

OTOH, changing this line:

rust/src/libcore/ptr.rs

Line 229 in a739c51

let mut t: UnalignedBlock = mem::uninitialized();

to

        let mut t = UnalignedBlock(0,0,0,0);

is sufficient to get this ASM on godbolt:

example::swap_i48:
  mov eax, dword ptr [rdi]
  movzx ecx, word ptr [rdi + 4]
  movzx edx, word ptr [rsi + 4]
  mov word ptr [rdi + 4], dx
  mov edx, dword ptr [rsi]
  mov dword ptr [rdi], edx
  mov word ptr [rsi + 4], cx
  mov dword ptr [rsi], eax
  ret

scottmcm · 2018-07-04T10:47:49Z

Interesting, @kennytm. Also really disturbing, since initializing to undef should in all situations allow the optimizer more freedom than initializing to a concrete value. It makes me wonder if the whole "swapping in u64s" approach actually has undef problems, perhaps in padding or similar...

(Agreed that it's not actually about the SIMD, but about the tail-too-small-to-SIMD part of the method.)
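To make the two variants under discussion concrete, here is a rough sketch of a tail swap through a scratch block, once uninitialized and once zeroed. It is an approximation, not the libcore code: the names are hypothetical, the 32-byte block size is taken from the `size_of::<T>() < 32` check elsewhere in this thread, and `MaybeUninit` stands in for `mem::uninitialized()`, which has since been deprecated.

```rust
use std::mem::{self, MaybeUninit};
use std::ptr;

const BLOCK: usize = 32; // assumed block size

/// Hypothetical helper: swap `len` bytes of `x` and `y` through scratch `t`.
unsafe fn swap_through(x: *mut u8, y: *mut u8, t: *mut u8, len: usize) {
    // SAFETY: the caller guarantees all three regions are valid for `len`
    // bytes and do not overlap one another.
    unsafe {
        ptr::copy_nonoverlapping(x, t, len);
        ptr::copy_nonoverlapping(y, x, len);
        ptr::copy_nonoverlapping(t, y, len);
    }
}

/// Scratch block left uninitialized (the `mem::uninitialized()` shape).
pub fn swap_tail_uninit<T>(x: &mut T, y: &mut T) {
    let len = mem::size_of::<T>();
    assert!(len < BLOCK);
    let mut scratch = MaybeUninit::<[u8; BLOCK]>::uninit();
    let t = scratch.as_mut_ptr() as *mut u8;
    unsafe { swap_through(x as *mut T as *mut u8, y as *mut T as *mut u8, t, len) }
}

/// Scratch block zero-initialized (the `UnalignedBlock(0,0,0,0)` shape).
pub fn swap_tail_zeroed<T>(x: &mut T, y: &mut T) {
    let len = mem::size_of::<T>();
    assert!(len < BLOCK);
    let mut scratch = [0u8; BLOCK];
    let t = scratch.as_mut_ptr();
    unsafe { swap_through(x as *mut T as *mut u8, y as *mut T as *mut u8, t, len) }
}
```

The two functions are observably identical; the thread is only about whether LLVM manages to remove the scratch block in each case, which may differ between compiler versions.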

kennytm · 2018-07-04T10:54:09Z

I've also checked the inc_opt case, and my minor change in #52051 (comment) is not sufficient to reproduce the ultimate optimized code.

scottmcm · 2018-07-04T11:30:42Z

@kennytm Can you share your godbolt? When I try the change locally, I get worse codegen as it zero-inits the part of the stack that the memcpys don't write.

ASM with zero-init for the unaligned block

_ZN4blah9swap_demo17ha1732a9b71393a7eE:
.seh_proc _ZN4blah9swap_demo17ha1732a9b71393a7eE
	sub	rsp, 32
	.seh_stackalloc 32
	.seh_endprologue
	xorps	xmm0, xmm0
	movups	xmmword ptr [rsp + 16], xmm0
	movups	xmmword ptr [rsp + 6], xmm0
	movzx	eax, word ptr [rcx + 4]
	mov	word ptr [rsp + 4], ax
	mov	eax, dword ptr [rcx]
	mov	dword ptr [rsp], eax
	movzx	eax, word ptr [rdx + 4]
	mov	word ptr [rcx + 4], ax
	mov	eax, dword ptr [rdx]
	mov	dword ptr [rcx], eax
	mov	eax, dword ptr [rsp]
	mov	dword ptr [rdx], eax
	movzx	eax, word ptr [rsp + 4]
	mov	word ptr [rdx + 4], ax
	add	rsp, 32
	ret
	.seh_handlerdata
	.section	.text,"xr",one_only,_ZN4blah9swap_demo17ha1732a9b71393a7eE
	.seh_endproc

As for inc_opt, that illustrates another advantage: the memcpys in swap_nonoverlapping_bytes are always align 1, whereas this PR keeps alignment information. I suspect that makes the difference, since I see mov r8, qword ptr [rcx] with the zero-init (or on nightly), which probably indicates that LLVM isn't willing to split up the unaligned load and thus can't fold as much away.
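The alignment point can be illustrated with a small sketch that swaps the same value two ways. The function names are hypothetical, and `ptr::swap_nonoverlapping` from today's standard library stands in for the internal helpers of the time; the only thing being shown is which pointer type the callee gets to see.

```rust
use std::mem;
use std::ptr;

/// Byte view: the callee only sees `*mut u8`, so the type's alignment
/// is no longer visible from the pointer type (u8 has alignment 1).
pub fn swap_as_bytes<T>(x: &mut T, y: &mut T) {
    let len = mem::size_of::<T>();
    // SAFETY: distinct `&mut` references are valid for `len` bytes each
    // and do not overlap.
    unsafe { ptr::swap_nonoverlapping(x as *mut T as *mut u8, y as *mut T as *mut u8, len) }
}

/// Typed view: the pointers keep type `*mut T`, so `align_of::<T>()`
/// is still known to whatever performs the copies.
pub fn swap_typed<T>(x: &mut T, y: &mut T) {
    // SAFETY: distinct `&mut` references are valid, aligned, non-overlapping.
    unsafe { ptr::swap_nonoverlapping(x as *mut T, y as *mut T, 1) }
}
```

Both produce the same result for a value such as `Option<i32>`; the difference, if any, shows up only in the alignment attached to the generated loads and stores.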

kennytm · 2018-07-04T13:12:16Z

@scottmcm https://godbolt.org/g/1vgFXw

Including inc_opt in the same snippet spoils the optimization though 🤔.

scottmcm · 2018-07-04T21:53:28Z

Using an array of 4 0_u64s instead of the struct of them also seems to break the optimization, putting in the explicit stack zeroing like I'm seeing locally: https://godbolt.org/g/smQfre

Given how subtle all these changes seem to be, I'm liking the "just do it the obvious way" code even more.

kennytm · 2018-07-04T22:59:57Z

Could we modify swap_nonoverlapping instead of creating a new function swap_nonoverlapping_one?

unsafe fn swap_nonoverlapping<T>(x: *mut T, y: *mut T, count: usize) {
    if count == 1 && mem::size_of::<T>() < 32 {
        let z = read(x);
        copy_nonoverlapping(y, x, 1);
        write(y, z);
    } else {
        let x = x as *mut u8;
        let y = y as *mut u8;
        let len = mem::size_of::<T>() * count;
        swap_nonoverlapping_bytes(x, y, len)
    }
}

scottmcm · 2018-07-07T07:28:11Z

@kennytm I originally had the code in mem::swap directly, but that meant the 32 was rather out of context, so I did the pub(crate) function. I'm not a huge fan of putting the count == 1 check in swap_nonoverlapping since it feels wrong to bother checking in cases where it's not a compile-time constant and both sides of the if/else would need to get codegened...
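A sketch of the separate-function approach described here, under stated assumptions: the name `swap_nonoverlapping_one` comes from the thread, the constant name `BLOCK_SIZE` is hypothetical, and the fallback uses the public `ptr::swap_nonoverlapping` instead of the internal byte-swapping routine. The point is that `size_of::<T>()` is a constant in each monomorphized copy, so only one branch should survive optimization, whereas a `count == 1` check on a runtime argument would keep both.

```rust
use std::mem;
use std::ptr;

const BLOCK_SIZE: usize = 32; // assumed, from the `< 32` check in the thread

/// Swap exactly one `T`, choosing the strategy from the type's size.
///
/// # Safety
/// `x` and `y` must be valid, aligned, and non-overlapping.
pub unsafe fn swap_nonoverlapping_one<T>(x: *mut T, y: *mut T) {
    // Known at compile time for each concrete `T`.
    if mem::size_of::<T>() < BLOCK_SIZE {
        // Small types: plain typed read/copy/write.
        unsafe {
            let z = ptr::read(x);
            ptr::copy_nonoverlapping(y, x, 1);
            ptr::write(y, z);
        }
    } else {
        // Larger types: defer to the general routine.
        unsafe { ptr::swap_nonoverlapping(x, y, 1) }
    }
}
```

A caller such as `mem::swap` would then forward its two `&mut T` arguments to this function, keeping the magic number next to the code that depends on it.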

stokhos · 2018-07-21T04:36:22Z

Ping from triage! @rust-lang/libs anyone got time to review this PR?

alexcrichton · 2018-07-21T15:46:59Z

@bors: r+

bors · 2018-07-21T15:47:00Z

📌 Commit 1f73144 has been approved by alexcrichton

alexcrichton · 2018-07-21T15:47:28Z

I agree with @scottmcm that we should keep this as a separate function for now, but if the need arises we can always merge it with the original swap_nonoverlapping function!

bors · 2018-07-22T01:54:11Z

⌛ Testing commit 1f73144 with merge c069aa6bbc1d3c7a183764f584cd1ce8c0508957...

bors · 2018-07-22T02:43:52Z

💔 Test failed - status-appveyor

scottmcm · 2018-07-22T05:06:00Z

Does the appveyor build use a different LLVM from tree? I just rebased this locally on windows and the test passes fine...

kennytm · 2018-07-22T06:05:27Z

@scottmcm Are you targeting i686-pc-windows-msvc?

Smaller platforms don't merge the loads the same way.

scottmcm · 2018-07-22T06:15:58Z

Thanks, @kennytm! That makes sense -- it probably doesn't unify the loads/stores on 32-bit. I've pushed an update to the test to not try to run it everywhere, since it depends on optimizations.

alexcrichton · 2018-07-22T16:07:48Z

@bors: r+

bors · 2018-07-22T16:07:49Z

📌 Commit c9482f7 has been approved by alexcrichton

Rollup of 11 pull requests

Successful merges:

- #51807 (Deprecation of str::slice_unchecked(_mut))
- #52051 (mem::swap the obvious way for types smaller than the SIMD optimization's block size)
- #52465 (Add CI test harness for `thumb*` targets. [IRR-2018-embedded])
- #52507 (Reword when `_` couldn't be inferred)
- #52508 (Document that Unique::empty() and NonNull::dangling() aren't sentinel values)
- #52521 (Fix links in rustdoc book.)
- #52581 (Avoid using `#[macro_export]` for documenting builtin macros)
- #52582 (Typo)
- #52587 (Add missing backtick in UniversalRegions doc comment)
- #52594 (Run the error index tool against the sysroot libdir)
- #52615 (Added new lines to .gitignore.)

…iler-errors

Use `load`+`store` instead of `memcpy` for small integer arrays

I was inspired by rust-lang#98892 to see whether, rather than making `mem::swap` do something smart in the library, we could update MIR assignments like `*_1 = *_2` to do something smarter than `memcpy` for sufficiently-small types that doing it inline is going to be better than a `memcpy` call in assembly anyway. After all, special code may help `mem::swap`, but if the "obvious" MIR can just result in the correct thing that helps everything -- other code like `mem::replace`, people doing it manually, and just passing around by value in general -- as well as makes MIR inlining happier since it doesn't need to deal with all the complicated library code if it just sees a couple assignments.

LLVM will turn the short, known-length `memcpy`s into direct instructions in the backend, but that's too late for it to be able to remove `alloca`s. In general, replacing `memcpy`s with typed instructions in the middle-end -- even for `memcpy.inline` where it knows it won't be a function call -- is hard [due to poison propagation issues](https://rust-lang.zulipchat.com/#narrow/stream/187780-t-compiler.2Fwg-llvm/topic/memcpy.20vs.20load-store.20for.20MIR.20assignments/near/360376712). So because we know more about the type invariants -- these are typed copies -- rustc can emit something more specific, allowing LLVM to `mem2reg` away the `alloca`s in some situations.

rust-lang#52051 previously did something like this in the library for `mem::swap`, but it ended up regressing during enabling mir inlining (rust-lang@cbbf06b), so this has been suboptimal on stable for ≈5 releases now.

The code in this PR is narrowly targeted at just integer arrays in LLVM, but works via a new method on the [`LayoutTypeMethods`](https://doc.rust-lang.org/nightly/nightly-rustc/rustc_codegen_ssa/traits/trait.LayoutTypeMethods.html) trait, so specific backends based on cg_ssa can enable this for more situations over time, as we find them. I don't want to try to bite off too much in this PR, though. (Transparent newtypes and simple things like the 3×usize `String` would be obvious candidates for a follow-up.)

Codegen demonstrations: <https://llvm.godbolt.org/z/fK8hT9aqv>

Before:

```llvm
define void @swap_rgb48_old(ptr noalias nocapture noundef align 2 dereferenceable(6) %x, ptr noalias nocapture noundef align 2 dereferenceable(6) %y) unnamed_addr #1 {
  %a.i = alloca [3 x i16], align 2
  call void @llvm.lifetime.start.p0(i64 6, ptr nonnull %a.i)
  call void @llvm.memcpy.p0.p0.i64(ptr noundef nonnull align 2 dereferenceable(6) %a.i, ptr noundef nonnull align 2 dereferenceable(6) %x, i64 6, i1 false)
  tail call void @llvm.memcpy.p0.p0.i64(ptr noundef nonnull align 2 dereferenceable(6) %x, ptr noundef nonnull align 2 dereferenceable(6) %y, i64 6, i1 false)
  call void @llvm.memcpy.p0.p0.i64(ptr noundef nonnull align 2 dereferenceable(6) %y, ptr noundef nonnull align 2 dereferenceable(6) %a.i, i64 6, i1 false)
  call void @llvm.lifetime.end.p0(i64 6, ptr nonnull %a.i)
  ret void
}
```

Note it going to stack:

```nasm
swap_rgb48_old: # @swap_rgb48_old
  movzx eax, word ptr [rdi + 4]
  mov word ptr [rsp - 4], ax
  mov eax, dword ptr [rdi]
  mov dword ptr [rsp - 8], eax
  movzx eax, word ptr [rsi + 4]
  mov word ptr [rdi + 4], ax
  mov eax, dword ptr [rsi]
  mov dword ptr [rdi], eax
  movzx eax, word ptr [rsp - 4]
  mov word ptr [rsi + 4], ax
  mov eax, dword ptr [rsp - 8]
  mov dword ptr [rsi], eax
  ret
```

Now:

```llvm
define void @swap_rgb48(ptr noalias nocapture noundef align 2 dereferenceable(6) %x, ptr noalias nocapture noundef align 2 dereferenceable(6) %y) unnamed_addr #0 {
start:
  %0 = load <3 x i16>, ptr %x, align 2
  %1 = load <3 x i16>, ptr %y, align 2
  store <3 x i16> %1, ptr %x, align 2
  store <3 x i16> %0, ptr %y, align 2
  ret void
}
```

still lowers to `dword`+`word` operations, but has no stack traffic:

```nasm
swap_rgb48: # @swap_rgb48
  mov eax, dword ptr [rdi]
  movzx ecx, word ptr [rdi + 4]
  movzx edx, word ptr [rsi + 4]
  mov r8d, dword ptr [rsi]
  mov dword ptr [rdi], r8d
  mov word ptr [rdi + 4], dx
  mov word ptr [rsi + 4], cx
  mov dword ptr [rsi], eax
  ret
```

And as a demonstration that this isn't just `mem::swap`, a `mem::replace` on a small array (since replace doesn't use swap since rust-lang#83022), which used to be `memcpy`s in LLVM, changes in IR to

```llvm
define void @replace_short_array(ptr noalias nocapture noundef sret([3 x i32]) dereferenceable(12) %0, ptr noalias noundef align 4 dereferenceable(12) %r, ptr noalias nocapture noundef readonly dereferenceable(12) %v) unnamed_addr #0 {
start:
  %1 = load <3 x i32>, ptr %r, align 4
  store <3 x i32> %1, ptr %0, align 4
  %2 = load <3 x i32>, ptr %v, align 4
  store <3 x i32> %2, ptr %r, align 4
  ret void
}
```

but that lowers to reasonable `dword`+`qword` instructions still

```nasm
replace_short_array: # @replace_short_array
  mov rax, rdi
  mov rcx, qword ptr [rsi]
  mov edi, dword ptr [rsi + 8]
  mov dword ptr [rax + 8], edi
  mov qword ptr [rax], rcx
  mov rcx, qword ptr [rdx]
  mov edx, dword ptr [rdx + 8]
  mov dword ptr [rsi + 8], edx
  mov qword ptr [rsi], rcx
  ret
```
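The Rust source behind these listings is not quoted in the message, so the following is a guess reconstructed from the IR signatures: the function names match the symbols above, while the `Rgb48` alias and the choice of `u32` (the IR only says `i32`, which does not record signedness) are assumptions.

```rust
use std::mem;

// Hypothetical alias: three 16-bit channels, 6 bytes, alignment 2.
type Rgb48 = [u16; 3];

/// Matches the shape of `@swap_rgb48`: two `noalias` pointers to `[3 x i16]`.
pub fn swap_rgb48(x: &mut Rgb48, y: &mut Rgb48) {
    mem::swap(x, y)
}

/// Matches the shape of `@replace_short_array`: the old value is returned
/// through an `sret` pointer, `r` is written, and `v` is read-only.
pub fn replace_short_array(r: &mut [u32; 3], v: &[u32; 3]) -> [u32; 3] {
    mem::replace(r, *v)
}
```

Neither function contains any library-specific trickery; the improvement described above comes from how rustc lowers the plain array assignments inside `mem::swap` and `mem::replace`.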

rust-highfive assigned KodrAus Jul 4, 2018

rust-highfive added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Jul 4, 2018

scottmcm added the WG-llvm Working group: LLVM backend code generation label Jul 4, 2018

scottmcm mentioned this pull request Jul 4, 2018

RFC: Add a replace_with method to Option rust-lang/rfcs#2490

Closed

scottmcm changed the title ~~Don't use SIMD optimizations in mem::swap for types smaller than the block size~~ mem::swap the obvious way for types smaller than the SIMD optimization's block size Jul 4, 2018

stokhos added S-waiting-on-team Status: Awaiting decision from the relevant subteam (see the T-<team> label). and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Jul 21, 2018

stokhos added the T-libs-api Relevant to the library API team, which will review and decide on the PR/issue. label Jul 21, 2018

bors added S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. and removed S-waiting-on-team Status: Awaiting decision from the relevant subteam (see the T-<team> label). labels Jul 21, 2018

kennytm mentioned this pull request Jul 21, 2018

Rollup of 11 pull requests #52588

Closed

bors added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. and removed S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. labels Jul 22, 2018

Don't use SIMD in mem::swap for types smaller than the block size

e6fc62a

LLVM isn't able to remove the alloca for the unaligned block in the SIMD tail in some cases, so doing this helps SRoA work in cases where it currently doesn't. Found in the `replace_with` RFC discussion.

kennytm added S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Jul 22, 2018

scottmcm force-pushed the swap-directly branch from 1f73144 to 173410a Compare July 22, 2018 06:13

Only run the test on x86_64

c9482f7

Smaller platforms don't merge the loads the same way.

scottmcm force-pushed the swap-directly branch from 173410a to c9482f7 Compare July 22, 2018 06:14

bors added S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. and removed S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. labels Jul 22, 2018

kennytm mentioned this pull request Jul 22, 2018

Rollup of 11 pull requests #52616

Merged

bors merged commit c9482f7 into rust-lang:master Jul 22, 2018

scottmcm deleted the swap-directly branch August 15, 2018 03:18

scottmcm mentioned this pull request Jun 1, 2021

dead-code optimize if const { expr } even in opt-level=0 #85836

Closed

scottmcm mentioned this pull request May 14, 2023

[WIP]Use unaligned read/writes for core::mem::swap on x86_64 #98892

Closed

scottmcm mentioned this pull request May 26, 2023

Use load+store instead of memcpy for small integer arrays #111999

Merged
