Rust will memmov |self| when passed by value to inline function #42763
Comments
(Deleted my last comment, which was intended for somewhere else.)
Actually, I was wrong. This totally affects SmallVec::into_iter() :-( Paste the following into play: https://gist.github.com/bholley/69dcf3654dfde0ebd8bb8bd8a7c2386d I guess using drain() is a reasonable workaround. Really bad footgun though.
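A sketch of the workaround pattern described above, using std Vec in place of SmallVec (which is an external crate and has a large inline buffer, making the by-value move far more expensive): drain() yields owned elements through a mutable borrow, so the container itself is never moved.

```rust
// Illustrative only: Vec stands in for SmallVec here. With SmallVec's
// inline buffer, the by-value move in into_iter() can compile to a
// large memcpy; drain() only borrows the container.
fn sum_by_value(v: Vec<u64>) -> u64 {
    // into_iter() takes the container by value.
    v.into_iter().sum()
}

fn sum_by_drain(v: &mut Vec<u64>) -> u64 {
    // drain(..) yields the same owned elements but only borrows the
    // container, avoiding the by-value move.
    v.drain(..).sum()
}

fn main() {
    let mut v = vec![5u64; 50];
    assert_eq!(sum_by_drain(&mut v), 250);
    assert!(v.is_empty()); // drain consumed the elements, not the Vec

    let v2 = vec![5u64; 50];
    assert_eq!(sum_by_value(v2), 250); // v2 is moved and gone
}
```

The trade-off is that drain() leaves an empty container behind rather than consuming it, which is exactly what makes it cheap for types with large inline storage.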
cc @rust-lang/compiler
cc @pcwalton Is this the same memcpy optimization problem?
@eddyb - I'm going to show my ignorance here a bit, but why is this a memcpy optimization? If the function is inlined, shouldn't the variable be reused rather than memcpy'd? When @bholley described it, it just sounded like perhaps a bug in the inliner. (Though, again, it's been a while since I worked in that area.)
@jonathandturner LLVM inlines after rustc generates the copies.
It isn't in general safe to remove these copies when an inline happens. For example, modifications to the inner function's copy shouldn't be visible through the outer function's copy. Similarly, pointers in the inner function may alias the outer function's copy of the value. Rust has a lot of information that can prove eliminating these copies is safe, but LLVM doesn't generally have that information. For instance, you can easily argue the copy is unnecessary in cases where Rust's ownership rules rule out mutation and aliasing.

So either we need to feed LLVM stronger alias information and/or improve LLVM's optimization passes, or we need to make Rust (MIR) do the inlining so it can apply its domain-specific knowledge to elide the copies. Either way, fixing this is actually pretty non-trivial, afaict.
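A minimal example (mine, not from the thread) of why LLVM cannot blindly elide the copy: the callee receives its own copy of the value, and mutations to that copy must never be observable through the caller's value.

```rust
// A large Copy type, so passing it by value duplicates all of it.
#[derive(Clone, Copy)]
struct Big {
    data: [u64; 64],
}

#[inline(never)]
fn consume(mut b: Big) -> u64 {
    // Mutating the callee's copy...
    b.data[0] = 999;
    b.data[0] + b.data[1]
}

fn main() {
    let big = Big { data: [5; 64] };
    let r = consume(big); // a full copy of `big` is passed here
    // ...must not be visible through the caller's value:
    assert_eq!(big.data[0], 5);
    assert_eq!(r, 1004); // 999 + 5
}
```

Rust's move semantics and borrow rules often prove the caller never touches the value again, which is exactly the "domain-specific knowledge" a MIR-level inliner could exploit but LLVM, seeing only two allocas, cannot.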
I'm trying to think of a situation where this occurs; it seems like Rust protects you:
Though I guess maybe that's less the case than it seems at first blush? |
Doesn't apply to
This is important because we're about to start storing a parallel list of pushed hashes, and the current behavior here will cause mismatches there. We take the opportunity to bump the SmallVec size to 16, since each entry is only a word and we really want to avoid heap-allocating. And then we switch to drain(), because of rust-lang/rust#42763
This issue and others like it apparently come up quite regularly in stylo and are something of a hindrance to achieving its promised performance. It would be good if the compiler team could think a bit about how to unblock some of the advanced optimizations necessary here. cc @nikomatsakis @rust-lang/compiler |
Hmm. We discussed this for some time in the @rust-lang/compiler team meeting but didn't decide on a precise course of action. It seems really important to make progress here, but we're not sure how to allocate the "person power" to make it happen, and we didn't want to call this P-high without somebody who is planning to follow up. One thing that I personally think would be helpful is to drill a bit more into the details of when memmoves etc. occur and see if we can categorize the various cases better. Also to get a clearer idea of what sorts of annotations or other changes would help us do better optimization. cc @arielb1, who might have thoughts on the matter
triage: P-medium |
Are there any suggestions on ways to catch instances of this issue? I came across a very obvious case where the run-time of my long-running program increased by a factor of 15x because I did something like this in a loop:
I wasn't expecting a memcpy to occur here, but clearly it was copying the entire array every time I ran this code. I work with large arrays all over my code, so I'm sure more subtle instances are occurring. This seems like a pretty serious performance issue for anyone working with large Copy types.
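A hypothetical reconstruction of the pattern described above (the array size and function names are mine): passing a large `[u64; N]` by value inside a loop copies the whole array on every iteration, while taking a reference passes only a pointer.

```rust
const N: usize = 4096; // a 32 KiB array of u64

#[inline(never)]
fn first_by_value(arr: [u64; N]) -> u64 {
    arr[0] // the entire array was copied across the call boundary
}

#[inline(never)]
fn first_by_ref(arr: &[u64; N]) -> u64 {
    arr[0] // only a pointer crosses the call boundary
}

fn main() {
    let arr = [7u64; N];
    let mut total = 0;
    for _ in 0..10 {
        total += first_by_ref(&arr);  // cheap every iteration
        total += first_by_value(arr); // memcpy every iteration
    }
    assert_eq!(total, 140);
}
```

Both calls compute the same result; only the calling convention differs, which is why the slowdown is so easy to miss in review.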
You can dump the assembly and then grep for memcpy with one line of leading context, which should show you the sizes of all the immediate memcpys in the binary. From there it's pretty straightforward to look for the biggest ones. I did this successfully for Firefox.
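A sketch of that technique (file names are illustrative): grep the emitted assembly for memcpy calls with one line of leading context, so the copy size, typically loaded into edx/rdx just before the call, is visible. Here a small sample file stands in for the real output of `rustc --emit asm` or `objdump -d`.

```shell
# Simulate a fragment of disassembly; in practice point grep at the .s
# files produced by `cargo rustc --release -- --emit asm` or at
# `objdump -d target/release/<binary>` output.
cat > sample.s <<'EOF'
        mov edx, 400
        call memcpy@PLT
        mov eax, 0
EOF

# -B1 prints the line before each match, exposing the size argument.
grep -B1 'memcpy' sample.s
```

Sorting those size lines numerically then surfaces the biggest copies first.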
FWIW, I put a tool together that helps find these memcpys: https://github.com/jrmuizel/memcpy-find
It looks like the memcpy in the original test case is removed when running with:

example::main:
sub rsp, 472
mov qword ptr [rsp + 24], 5
mov qword ptr [rsp + 32], 5
mov qword ptr [rsp + 40], 5
mov qword ptr [rsp + 48], 5
mov qword ptr [rsp + 56], 5
mov qword ptr [rsp + 64], 5
mov qword ptr [rsp + 72], 5
mov qword ptr [rsp + 80], 5
mov qword ptr [rsp + 88], 5
mov qword ptr [rsp + 96], 5
lea rax, [rsp + 104]
mov qword ptr [rsp + 104], 5
mov qword ptr [rsp + 112], 5
mov qword ptr [rsp + 120], 5
mov qword ptr [rsp + 128], 5
mov qword ptr [rsp + 136], 5
mov qword ptr [rsp + 144], 5
mov qword ptr [rsp + 152], 5
mov qword ptr [rsp + 160], 5
mov qword ptr [rsp + 168], 5
mov qword ptr [rsp + 176], 5
mov qword ptr [rsp + 184], 5
mov qword ptr [rsp + 192], 5
mov qword ptr [rsp + 200], 5
mov qword ptr [rsp + 208], 5
mov qword ptr [rsp + 216], 5
mov qword ptr [rsp + 224], 5
mov qword ptr [rsp + 232], 5
mov qword ptr [rsp + 240], 5
mov qword ptr [rsp + 248], 5
mov qword ptr [rsp + 256], 5
mov qword ptr [rsp + 264], 5
mov qword ptr [rsp + 272], 5
mov qword ptr [rsp + 280], 5
mov qword ptr [rsp + 288], 5
mov qword ptr [rsp + 296], 5
mov qword ptr [rsp + 304], 5
mov qword ptr [rsp + 312], 5
mov qword ptr [rsp + 320], 5
mov qword ptr [rsp + 328], 5
mov qword ptr [rsp + 336], 5
mov qword ptr [rsp + 344], 5
mov qword ptr [rsp + 352], 5
mov qword ptr [rsp + 360], 5
mov qword ptr [rsp + 368], 5
mov qword ptr [rsp + 376], 5
mov qword ptr [rsp + 384], 5
mov qword ptr [rsp + 392], 5
mov qword ptr [rsp + 400], 5
mov qword ptr [rsp + 408], 5
mov qword ptr [rsp + 416], 5
lea rcx, [rip + .L__unnamed_1]
mov qword ptr [rsp + 424], rcx
mov qword ptr [rsp + 432], 2
mov qword ptr [rsp + 8], rax
mov rax, qword ptr [rip + core::fmt::num::imp::<impl core::fmt::Display for usize>::fmt@GOTPCREL]
mov qword ptr [rsp + 16], rax
mov qword ptr [rsp + 440], 0
lea rax, [rsp + 8]
mov qword ptr [rsp + 456], rax
mov qword ptr [rsp + 464], 1
lea rdi, [rsp + 424]
call qword ptr [rip + std::io::stdio::_print@GOTPCREL]
add rsp, 472
ret
Is that just replacing the memcpy with an inline series of moves? If so I hope we can do better. :-)
No. There are 50 stores there; that's the array initialization, not a copy. Here's what it looks like by default, without that option:

example::main:
push r14
push rbx
sub rsp, 872
mov qword ptr [rsp + 24], 5
mov qword ptr [rsp + 32], 5
mov qword ptr [rsp + 40], 5
mov qword ptr [rsp + 48], 5
mov qword ptr [rsp + 56], 5
mov qword ptr [rsp + 64], 5
mov qword ptr [rsp + 72], 5
mov qword ptr [rsp + 80], 5
mov qword ptr [rsp + 88], 5
mov qword ptr [rsp + 96], 5
mov qword ptr [rsp + 104], 5
mov qword ptr [rsp + 112], 5
mov qword ptr [rsp + 120], 5
mov qword ptr [rsp + 128], 5
mov qword ptr [rsp + 136], 5
mov qword ptr [rsp + 144], 5
mov qword ptr [rsp + 152], 5
mov qword ptr [rsp + 160], 5
mov qword ptr [rsp + 168], 5
mov qword ptr [rsp + 176], 5
mov qword ptr [rsp + 184], 5
mov qword ptr [rsp + 192], 5
mov qword ptr [rsp + 200], 5
mov qword ptr [rsp + 208], 5
mov qword ptr [rsp + 216], 5
mov qword ptr [rsp + 224], 5
mov qword ptr [rsp + 232], 5
mov qword ptr [rsp + 240], 5
mov qword ptr [rsp + 248], 5
mov qword ptr [rsp + 256], 5
mov qword ptr [rsp + 264], 5
mov qword ptr [rsp + 272], 5
mov qword ptr [rsp + 280], 5
mov qword ptr [rsp + 288], 5
mov qword ptr [rsp + 296], 5
mov qword ptr [rsp + 304], 5
mov qword ptr [rsp + 312], 5
mov qword ptr [rsp + 320], 5
mov qword ptr [rsp + 328], 5
mov qword ptr [rsp + 336], 5
mov qword ptr [rsp + 344], 5
mov qword ptr [rsp + 352], 5
mov qword ptr [rsp + 360], 5
mov qword ptr [rsp + 368], 5
mov qword ptr [rsp + 376], 5
mov qword ptr [rsp + 384], 5
mov qword ptr [rsp + 392], 5
mov qword ptr [rsp + 400], 5
mov qword ptr [rsp + 408], 5
mov qword ptr [rsp + 416], 5
lea r14, [rsp + 472]
lea rsi, [rsp + 24]
mov rbx, qword ptr [rip + memcpy@GOTPCREL]
mov edx, 400
mov rdi, r14
call rbx
lea rdi, [rsp + 24]
mov edx, 400
mov rsi, r14
call rbx
lea rax, [rsp + 104]
mov qword ptr [rsp + 8], rax
mov rax, qword ptr [rip + core::fmt::num::imp::<impl core::fmt::Display for usize>::fmt@GOTPCREL]
mov qword ptr [rsp + 16], rax
lea rax, [rip + .L__unnamed_1]
mov qword ptr [rsp + 424], rax
mov qword ptr [rsp + 432], 2
mov qword ptr [rsp + 440], 0
lea rax, [rsp + 8]
mov qword ptr [rsp + 456], rax
mov qword ptr [rsp + 464], 1
lea rdi, [rsp + 424]
call qword ptr [rip + std::io::stdio::_print@GOTPCREL]
add rsp, 872
pop rbx
pop r14
ret

The initialization is there, along with two 400-byte memcpys.
Looks like nightly (probably #82806) removes the memcpy from the original example (https://godbolt.org/z/nGfxaj), but not from the smallvec example (https://godbolt.org/z/Y34K36) cc @nikic
The memcpy was removed from the original example in 1.52: https://godbolt.org/z/5fnGqG1KG. The memcpy was removed from the smallvec example in 1.70: https://godbolt.org/z/6oGcfMaor.
See the assembly for https://is.gd/0Owhgw . Note that you'll need to explicitly specify Release, because I can't link to Release code for some reason.
Note this bit here:
That looks like a very unnecessary and large copy.
I ran into this working on rust-selectors for stylo. In the parser, we have a very large stack-allocated SmallVec, under the assumption that it is never moved. I added some inline functions that take the buffer as |self|, and parsing got several percent slower. I rewrote the code to pass things as &mut, and the overhead went away.
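A sketch of the rewrite described above (the struct and methods are illustrative, not the actual rust-selectors code): a method that takes a large buffer by value forces the whole inline buffer to be moved into and out of the call, while `&mut self` passes only a pointer.

```rust
struct ParseBuffer {
    data: [u64; 512], // large stack-allocated buffer
    len: usize,
}

impl ParseBuffer {
    // Before: consumes self, so the caller's whole buffer is moved
    // (potentially memcpy'd) in and then back out.
    fn push_by_value(mut self, v: u64) -> ParseBuffer {
        self.data[self.len] = v;
        self.len += 1;
        self
    }

    // After: borrows mutably; only a pointer is passed.
    fn push_by_ref(&mut self, v: u64) {
        self.data[self.len] = v;
        self.len += 1;
    }
}

fn main() {
    let mut buf = ParseBuffer { data: [0; 512], len: 0 };
    buf = buf.push_by_value(1); // moves the 4 KiB buffer in and out
    buf.push_by_ref(2);         // no move
    assert_eq!(buf.len, 2);
    assert_eq!(buf.data[..2], [1, 2]);
}
```

The two methods are semantically equivalent here; the by-value version just gives the compiler a copy it cannot always prove removable, which is the several-percent regression the comment describes.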
Interestingly, doing:
for x in some_small_vec.into_iter() {
...
}
Doesn't have any memcpy in the disassembly, despite the fact that into_iter() takes self. Presumably some iterator optimization going on.
CC @SimonSapin @mbrubeck @Manishearth @gankro @jonathandturner