Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MaybeUninit::assume_init optimizes poorly #74267

Open
dtolnay opened this issue Jul 12, 2020 · 12 comments
Open

MaybeUninit::assume_init optimizes poorly #74267

dtolnay opened this issue Jul 12, 2020 · 12 comments
Labels
A-codegen Area: Code generation E-needs-test Call for participation: An issue has been fixed and does not reproduce, but no test has been added. I-slow Issue: Problems and improvements with respect to performance of generated code. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.

Comments

@dtolnay
Copy link
Member

dtolnay commented Jul 12, 2020

In #74254 we observed that returning expr.assume_init() from a function unexpectedly inhibits the return value from being constructed in place up front.

https://rust.godbolt.org/z/hr77qM

#![allow(deprecated)]

use std::mem::{self, MaybeUninit};
use std::ptr;

type T = String;
const N: usize = 2;

// fast
pub fn a() -> [T; N] {
    Default::default()
}

// fast
pub fn b() -> [T; N] {
    unsafe {
        // ignore the UB for now
        let mut array: [T; N] = mem::uninitialized();
        for slot in &mut array {
            ptr::write(slot, T::default());
        }
        array
    }
}

// slow
pub fn c() -> [T; N] {
    let mut array: MaybeUninit<[T; N]> = MaybeUninit::uninit();
    unsafe {
        // ignore the UB for now
        // ordinarily would cast to &mut [MaybeUninit<T>; N]
        // but here we try to minimize difference from `b`
        let slots = &mut *array.as_mut_ptr();
        for slot in slots {
            ptr::write(slot, T::default());
        }
        array.assume_init()
    }
}

Notice that in the slow function the return value is constructed exactly the same as in both of the fast functions (6 instructions) except in the wrong place, then relocated from [rsp-48] to [rdi] 😢 (12 instructions).

example::a:
        mov     rax, rdi
        mov     qword ptr [rdi], 1
        vxorps  xmm0, xmm0, xmm0
        vmovups xmmword ptr [rdi + 8], xmm0
        mov     qword ptr [rdi + 24], 1
        vmovups xmmword ptr [rdi + 32], xmm0
        ret

example::b:
        mov     rax, rdi
        mov     qword ptr [rdi], 1
        vxorps  xmm0, xmm0, xmm0
        vmovups xmmword ptr [rdi + 8], xmm0
        mov     qword ptr [rdi + 24], 1
        vmovups xmmword ptr [rdi + 32], xmm0
        ret

example::c:
        sub     rsp, 48
        mov     rax, rdi
        mov     qword ptr [rsp], 1
        vxorps  xmm0, xmm0, xmm0
        vmovups xmmword ptr [rsp + 8], xmm0
        mov     qword ptr [rsp + 24], 1
        vmovups xmmword ptr [rsp + 32], xmm0
        mov     rcx, qword ptr [rsp]
        mov     qword ptr [rdi], rcx
        vmovups xmm0, xmmword ptr [rsp + 8]
        vmovups xmmword ptr [rdi + 8], xmm0
        mov     rcx, qword ptr [rsp + 24]
        mov     qword ptr [rdi + 24], rcx
        mov     rcx, qword ptr [rsp + 16]
        mov     qword ptr [rdi + 16], rcx
        mov     rcx, qword ptr [rsp + 24]
        mov     qword ptr [rdi + 24], rcx
        vmovups xmm0, xmmword ptr [rsp + 32]
        vmovups xmmword ptr [rdi + 32], xmm0
        add     rsp, 48
        ret
@dtolnay dtolnay added I-slow Issue: Problems and improvements with respect to performance of generated code. A-codegen Area: Code generation T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. labels Jul 12, 2020
@dtolnay
Copy link
Member Author

dtolnay commented Jul 12, 2020

The 12 instruction copy is extra baffling but it's possible that part is outside of Rust's control.

    ; copy word 0
        mov     rcx, qword ptr [rsp]
        mov     qword ptr [rdi], rcx
    ; copy words 1-2
        vmovups xmm0, xmmword ptr [rsp + 8]
        vmovups xmmword ptr [rdi + 8], xmm0
    ; copy word 3
        mov     rcx, qword ptr [rsp + 24]
        mov     qword ptr [rdi + 24], rcx
    ; copy word 2 again (??)
        mov     rcx, qword ptr [rsp + 16]
        mov     qword ptr [rdi + 16], rcx
    ; copy word 3 again (??)
        mov     rcx, qword ptr [rsp + 24]
        mov     qword ptr [rdi + 24], rcx
    ; copy words 4-5
        vmovups xmm0, xmmword ptr [rsp + 32]
        vmovups xmmword ptr [rdi + 32], xmm0

I might have expected something like this:

        vmovups ymm0, ymmword ptr [rsp]
        vmovups ymm1, ymmword ptr [rsp + 16]
        vmovups ymmword ptr [rdi], ymm0
        vmovups ymmword ptr [rdi + 16], ymm1

which is what we get from fn cpy(input: [usize; 6]) -> [usize; 6] { input }. https://rust.godbolt.org/z/P5PzYs

@RalfJung
Copy link
Member

Is this something we can even fix on the Rust side, or would we expect LLVM to handle this better?

@nikic
Copy link
Contributor

nikic commented Jul 12, 2020

@dtolnay The ymm movups get broken up to avoid x86 store forwarding stalls. Of course, that shouldn't be introducing duplicate copies of the same bytes...

@alex
Copy link
Member

alex commented Aug 9, 2020

Is there any way to express the semantics of assume_init in C code? That's my traditional next step in trying to understand LLVM optimization differences.

@RalfJung
Copy link
Member

RalfJung commented Aug 9, 2020

It's basically an identity function -- a function that takes an array of some fixed size and returns the same array.

@alex
Copy link
Member

alex commented Aug 9, 2020

https://godbolt.org/z/5eKxE5 looks like it's a roughly correct C rendition of this problem. And it reproduces. I'm afraid part of the answer here is "aliasing", but adding -Z mutable-noalias=yes to the Rust reproducer didn't solve it.

@tesuji
Copy link
Contributor

tesuji commented Aug 10, 2020

GCC seems to optimize better than LLVM: https://godbolt.org/z/hWEWvM

b:
        mov     QWORD PTR [rdi], 0
        mov     QWORD PTR [rdi+8], 0
        mov     QWORD PTR [rdi+16], 0
        mov     QWORD PTR [rdi+24], 0
        mov     QWORD PTR [rdi+32], 0
        mov     QWORD PTR [rdi+40], 0
        mov     rax, rdi
        ret
c:
        mov     QWORD PTR [rdi], 0
        mov     QWORD PTR [rdi+8], 0
        mov     QWORD PTR [rdi+16], 0
        mov     QWORD PTR [rdi+24], 0
        mov     QWORD PTR [rdi+32], 0
        mov     QWORD PTR [rdi+40], 0
        mov     rax, rdi
        ret

@alex
Copy link
Member

alex commented Aug 11, 2020

I've filed an LLVM bug for this: https://bugs.llvm.org/show_bug.cgi?id=47114

@RalfJung
Copy link
Member

@nikic do you know if the LLVM upgrade helped with this as well?

@nikic
Copy link
Contributor

nikic commented Mar 13, 2021

@RalfJung Yes, this now seems to optimize well, presumably due to additional SROA after fully unrolling the initialization loop. We only start seeing memcpys at N=26 and higher, probably because some arbitrary cutoff is reached. I would expect that there would still be an unnecessary memcpy if the loop didn't get unrolled, but I didn't manage to make LLVM not unroll it...

@dtolnay
Copy link
Member Author

dtolnay commented Aug 21, 2021

Closing because the godbolt link from the top of this issue (https://rust.godbolt.org/z/hr77qM) now produces effectively identical asm for all 3 functions.

@dtolnay dtolnay closed this as completed Aug 21, 2021
@RalfJung
Copy link
Member

Is there a test for this? Should we have one?

@RalfJung RalfJung added the E-needs-test Call for participation: An issue has been fixed and does not reproduce, but no test has been added. label Aug 30, 2021
@RalfJung RalfJung reopened this Aug 30, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-codegen Area: Code generation E-needs-test Call for participation: An issue has been fixed and does not reproduce, but no test has been added. I-slow Issue: Problems and improvements with respect to performance of generated code. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.
Projects
None yet
Development

No branches or pull requests

5 participants