Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unnecessary memcpy caused by ordering of unwrap #56172

Closed
jrmuizel opened this issue Nov 22, 2018 · 8 comments
Closed

Unnecessary memcpy caused by ordering of unwrap #56172

jrmuizel opened this issue Nov 22, 2018 · 8 comments
Labels
A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. A-mir-opt Area: MIR optimizations A-mir-opt-nrvo Fixed by NRVO C-enhancement Category: An issue proposing an enhancement or a PR with one. I-slow Issue: Problems and improvements with respect to performance of generated code. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.

Comments

@jrmuizel
Copy link
Contributor

In the following code f and g have these lines swapped:

    let item = SpecificDisplayItem::PopStackingContext;
    clip.unwrap();

Unwrapping later causes f to have an additional memcpy.

#[inline(never)]
pub fn f(clip: Option<&bool>) {
    let item = SpecificDisplayItem::PopStackingContext;
    clip.unwrap();
    do_item(&DI {
            item,
    });
}

#[inline(never)]
pub fn g(clip: Option<&bool>) {
    clip.unwrap();
    let item = SpecificDisplayItem::PopStackingContext;
    do_item(&DI {
            item,
    });
}

pub enum SpecificDisplayItem {
    PopStackingContext,
    Other([f64; 22]),
}

struct DI {
    item: SpecificDisplayItem,
}


fn do_item(di: &DI) { unsafe { ext(di) } }
extern {
    fn ext(di: &DI);
}

Compiles to:

example::f:
        sub     rsp, 360
        test    rdi, rdi
        je      .LBB0_1
        mov     qword ptr [rsp], 0
        lea     rdi, [rsp + 8]
        lea     rsi, [rsp + 184]
        mov     edx, 176
        call    memcpy@PLT
        mov     rdi, rsp
        call    ext@PLT
        add     rsp, 360
        ret
.LBB0_1:
        lea     rdi, [rip + .Lbyte_str.2]
        call    core::panicking::panic@PLT
        ud2

example::g:
        sub     rsp, 184
        test    rdi, rdi
        je      .LBB1_1
        mov     qword ptr [rsp], 0
        mov     rdi, rsp
        call    ext@PLT
        add     rsp, 184
        ret
.LBB1_1:
        lea     rdi, [rip + .Lbyte_str.2]
        call    core::panicking::panic@PLT
        ud2

Ideally, f and g should compile to the same thing.

jrmuizel added a commit to jrmuizel/webrender that referenced this issue Nov 22, 2018
LLVM's ability to eliminate memcpy's across basic blocks is bad.
Because of this, we want to make sure that we avoid putting branches
in code where we want the memcpy elimination to happen.
rust-lang/rust#56172 has a reduced test case
of this happening.

This change lifts the branch caused by unwrap() above the creation of
the SpecificDisplayItem. It ends up saving a memcpy of 127 bytes along
with reducing pop_reference_frame by 18 instructions.
bors-servo pushed a commit to servo/webrender that referenced this issue Nov 22, 2018
Improve code quality for push_new_empty_item.

LLVM's ability to eliminate memcpy's across basic blocks is bad.
Because of this, we want to make sure that we avoid putting branches
in code where we want the memcpy elimination to happen.
rust-lang/rust#56172 has a reduced test case
of this happening.

This change lifts the branch caused by unwrap() above the creation of
the SpecificDisplayItem. It ends up saving a memcpy of 127 bytes along
with reducing pop_reference_frame by 18 instructions.

<!-- Reviewable:start -->
---
This change is [<img src="https://reviewable.io/review_button.svg" height="34" align="absmiddle" alt="Reviewable"/>](https://reviewable.io/reviews/servo/webrender/3341)
<!-- Reviewable:end -->
@lqd
Copy link
Member

lqd commented Nov 23, 2018

cc @rust-lang/wg-codegen

@eddyb
Copy link
Member

eddyb commented Nov 23, 2018

IIRC @pcwalton found that LLVM only elides copies in the same basic block (and unwrap can panic, so it terminates a basic block, splitting the code before and after into separate basic blocks).

@nox
Copy link
Contributor

nox commented Nov 23, 2018

Yes, cf. @jrmuizel's PR on Webrender:

LLVM's ability to eliminate memcpy's across basic blocks is bad.

@nikic nikic added A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. I-slow Issue: Problems and improvements with respect to performance of generated code. labels Dec 1, 2018
@nikic
Copy link
Contributor

nikic commented Dec 1, 2018

There were some attempts to make memcpyopt work across BBs, but they were reverted due to regressions, see https://bugs.llvm.org/show_bug.cgi?id=35519.

@jonas-schievink jonas-schievink added A-mir-opt Area: MIR optimizations C-enhancement Category: An issue proposing an enhancement or a PR with one. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. A-mir-opt-nrvo Fixed by NRVO labels Jun 10, 2020
@dotdash
Copy link
Contributor

dotdash commented Jun 26, 2020

It might be worth noting that the memcpy here is especially pointless because it copies uninitialized memory. When the memcpy optimizations fold a memset into a memcpy it checks whether the memcpy copies more memory than what has been memset, and if so and the remainder is uninitialized, the memcpy for the uninitialized part is dropped.

In this case here, SROA splits the alloca for the item, into the discriminant part, and 176 uninitialized bytes. But since it runs without memory dependency analysis, is has no easy way to drop the memcpy for the uninitialized part. And before the MemCpy pass can kill the memcpy, the inliner comes and breaks the code into multiple basic blocks. So for this constellation, it would help to run the MemCpy pass before the inliner, but I have no real idea what tradeoff that makes.

Another option would be to add an optimization that drops memcpys from uninitialized sources to a pass that does cross-bb memory dependence analysis anyway, but a quick look didn't reveal any obvious place to do that. Given the rather common use of enums like this, where only some variants have a payload, that might be worth a try though.

@oli-obk
Copy link
Contributor

oli-obk commented Jun 26, 2020

This may get resolved with #72632 by doing the necessary work on MIR.

@dotdash
Copy link
Contributor

dotdash commented Jun 26, 2020

The approach from #72632 breaks if you assign the same source to multiple destinations, because there's no simple chain that can be reduced to a single destination. I think you need to do copy-propagation (replacing the destination with the source, instead of the other way around) to handle that.

The following doesn't get properly optimized by #72632, but is handled by the memcpy pass being run before the inliner. That approach of course also doesn't handle all the cases, thus the proposal for the optimization to catch copies from uninitialized memory.

#[inline(never)]
pub fn f(clip: Option<&bool>) {
    let item = SpecificDisplayItem::PopStackingContext;
    clip.unwrap();
    do_item(&DI {
            item,
    });
   do_item(&DI {
            item,
    });
}}

In fact #72632 even stops the patched (MemCpyOpt before Inliner) LLVM from optimizing this version, because SROA can no longer split the alloca and so there's no memcpy that copies only uninitialized memory. For the modified f function #72632 still produces better code than nightly, but if you apply the same change to g, then nightly produces a properly optimized version, while dest-prop causes the lifetimes of the two DI instances to overlap, forcing double stack usage and a memcpy.

Edit: I'm not trying to criticize the work that went into #72632. I didn't notice the comment on the PR (and didn't realize it had progressed so far) when I started to look into this. After being made aware of it, I wanted to see that it works for myself and got confused by the way that optimization pass works because I (for some reason) always assumed it would do copy propagation and was just named weirdly, so I tried to figure out why it does things the way it does, and noticed that it breaks this modified example.

@jrmuizel
Copy link
Contributor Author

Fixed by #82806

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. A-mir-opt Area: MIR optimizations A-mir-opt-nrvo Fixed by NRVO C-enhancement Category: An issue proposing an enhancement or a PR with one. I-slow Issue: Problems and improvements with respect to performance of generated code. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.
Projects
None yet
Development

No branches or pull requests

8 participants