Rust will memmov |self| when passed by value to inline function #42763
Comments
(Deleted my last comment, which was intended for somewhere else.)
Actually, I was wrong. This totally affects SmallVec::into_iter() :-( Paste the following into play: https://gist.github.com/bholley/69dcf3654dfde0ebd8bb8bd8a7c2386d I guess using drain() is a reasonable workaround. Really bad footgun though.
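A sketch of the workaround pattern described above, using std Vec in place of SmallVec (which is an external crate and has a large inline buffer, making the by-value move far more expensive): drain() yields owned elements through a mutable borrow, so the container itself is never moved.

```rust
// Illustrative only: Vec stands in for SmallVec here. With SmallVec's
// inline buffer, the by-value move in into_iter() can compile to a
// large memcpy; drain() only borrows the container.
fn sum_by_value(v: Vec<u64>) -> u64 {
    // into_iter() takes the container by value.
    v.into_iter().sum()
}

fn sum_by_drain(v: &mut Vec<u64>) -> u64 {
    // drain(..) yields the same owned elements but only borrows the
    // container, avoiding the by-value move.
    v.drain(..).sum()
}

fn main() {
    let mut v = vec![5u64; 50];
    assert_eq!(sum_by_drain(&mut v), 250);
    assert!(v.is_empty()); // drain consumed the elements, not the Vec

    let v2 = vec![5u64; 50];
    assert_eq!(sum_by_value(v2), 250); // v2 is moved and gone
}
```

The trade-off is that drain() leaves an empty container behind rather than consuming it, which is exactly what makes it cheap for types with large inline storage.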
cc @rust-lang/compiler
cc @pcwalton Is this the same memcpy optimization problem?
@eddyb - I'm going to show my ignorance here a bit, but why is this a memcpy optimization? If the function is inlined, shouldn't the variable be reused rather than memcpy'd? When @bholley described it, it just sounded like perhaps a bug in the inliner. (Though, again, it's been a while since I worked in that area.)
@jonathandturner LLVM inlines after rustc generates the copies.
It isn't in general safe to remove these copies when an inline happens. For example, modifications to the inner function's copy shouldn't be visible through the outer function's copy. Similarly, pointers in the inner function may alias the outer function's copy of the value. Rust has a lot of information that can prove eliminating these copies is safe, but LLVM doesn't generally have that information. For instance, you can easily argue the copy is unnecessary in cases where Rust's ownership rules rule out mutation and aliasing.

So either we need to feed LLVM stronger alias information and/or improve LLVM's optimization passes, or we need to make Rust (MIR) do the inlining so it can apply its domain-specific knowledge to elide the copies. Either way, fixing this is actually pretty non-trivial, afaict.
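A minimal example (mine, not from the thread) of why LLVM cannot blindly elide the copy: the callee receives its own copy of the value, and mutations to that copy must never be observable through the caller's value.

```rust
// A large Copy type, so passing it by value duplicates all of it.
#[derive(Clone, Copy)]
struct Big {
    data: [u64; 64],
}

#[inline(never)]
fn consume(mut b: Big) -> u64 {
    // Mutating the callee's copy...
    b.data[0] = 999;
    b.data[0] + b.data[1]
}

fn main() {
    let big = Big { data: [5; 64] };
    let r = consume(big); // a full copy of `big` is passed here
    // ...must not be visible through the caller's value:
    assert_eq!(big.data[0], 5);
    assert_eq!(r, 1004); // 999 + 5
}
```

Rust's move semantics and borrow rules often prove the caller never touches the value again, which is exactly the "domain-specific knowledge" a MIR-level inliner could exploit but LLVM, seeing only two allocas, cannot.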
I'm trying to think of a situation where this occurs; it seems like Rust protects you:
Though I guess maybe that's less the case than it seems at first blush? |
Doesn't apply to
This is important because we're about to start storing a parallel list of pushed hashes, and the current behavior here will cause mismatches there. We take the opportunity to bump the SmallVec size to 16, since each entry is only a word and we really want to avoid heap-allocating. And then we switch to drain(), because of rust-lang/rust#42763
This issue and others like it apparently come up quite regularly in stylo and are something of a hindrance to achieving its promised performance. It would be good if the compiler team could think a bit about how to unblock some of the advanced optimizations necessary here. cc @nikomatsakis @rust-lang/compiler |
Hmm. We discussed this for some time in the @rust-lang/compiler team meeting but didn't decide on a precise course of action. It seems really important to make progress here, but we're not sure how to allocate the "person power" to make it happen, and we didn't want to call this P-high without somebody who is planning to follow up. One thing that I personally think would be helpful is to drill a bit more into the details of when memmoves etc. occur and see if we can categorize the various cases better. Also to get a clearer idea of what sorts of annotations or other changes would help us do better optimization. cc @arielb1, who might have thoughts on the matter
triage: P-medium |
Are there any suggestions on ways to catch instances of this issue? I came across a very obvious case where the run-time of my long-running program increased by a factor of 15x because I did something like this in a loop:
I wasn't expecting a memcpy to occur here, but clearly it was copying the entire array every time I ran this code. I work with large arrays all over my code, so I'm sure more subtle instances are occurring. This seems like a pretty serious performance issue for anyone working with large Copy types.
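A hypothetical reconstruction of the pattern described above (the array size and function names are mine): passing a large `[u64; N]` by value inside a loop copies the whole array on every iteration, while taking a reference passes only a pointer.

```rust
const N: usize = 4096; // a 32 KiB array of u64

#[inline(never)]
fn first_by_value(arr: [u64; N]) -> u64 {
    arr[0] // the entire array was copied across the call boundary
}

#[inline(never)]
fn first_by_ref(arr: &[u64; N]) -> u64 {
    arr[0] // only a pointer crosses the call boundary
}

fn main() {
    let arr = [7u64; N];
    let mut total = 0;
    for _ in 0..10 {
        total += first_by_ref(&arr);  // cheap every iteration
        total += first_by_value(arr); // memcpy every iteration
    }
    assert_eq!(total, 140);
}
```

Both calls compute the same result; only the calling convention differs, which is why the slowdown is so easy to miss in review.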
You can dump the assembly and then grep for memcpy with one line of leading context, which should show you the sizes of all the immediate memcpys in the binary. From there it's pretty straightforward to look for the biggest ones. I did this successfully for Firefox.
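A sketch of that technique (file names are illustrative): grep the emitted assembly for memcpy calls with one line of leading context, so the copy size, typically loaded into edx/rdx just before the call, is visible. Here a small sample file stands in for the real output of `rustc --emit asm` or `objdump -d`.

```shell
# Simulate a fragment of disassembly; in practice point grep at the .s
# files produced by `cargo rustc --release -- --emit asm` or at
# `objdump -d target/release/<binary>` output.
cat > sample.s <<'EOF'
        mov edx, 400
        call memcpy@PLT
        mov eax, 0
EOF

# -B1 prints the line before each match, exposing the size argument.
grep -B1 'memcpy' sample.s
```

Sorting those size lines numerically then surfaces the biggest copies first.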
FWIW, I put a tool together that helps find these memcpys: https://github.com/jrmuizel/memcpy-find
It looks like the memcpy in the original test case is removed when running with:

example::main:
sub rsp, 472
mov qword ptr [rsp + 24], 5
mov qword ptr [rsp + 32], 5
mov qword ptr [rsp + 40], 5
mov qword ptr [rsp + 48], 5
mov qword ptr [rsp + 56], 5
mov qword ptr [rsp + 64], 5
mov qword ptr [rsp + 72], 5
mov qword ptr [rsp + 80], 5
mov qword ptr [rsp + 88], 5
mov qword ptr [rsp + 96], 5
lea rax, [rsp + 104]
mov qword ptr [rsp + 104], 5
mov qword ptr [rsp + 112], 5
mov qword ptr [rsp + 120], 5
mov qword ptr [rsp + 128], 5
mov qword ptr [rsp + 136], 5
mov qword ptr [rsp + 144], 5
mov qword ptr [rsp + 152], 5
mov qword ptr [rsp + 160], 5
mov qword ptr [rsp + 168], 5
mov qword ptr [rsp + 176], 5
mov qword ptr [rsp + 184], 5
mov qword ptr [rsp + 192], 5
mov qword ptr [rsp + 200], 5
mov qword ptr [rsp + 208], 5
mov qword ptr [rsp + 216], 5
mov qword ptr [rsp + 224], 5
mov qword ptr [rsp + 232], 5
mov qword ptr [rsp + 240], 5
mov qword ptr [rsp + 248], 5
mov qword ptr [rsp + 256], 5
mov qword ptr [rsp + 264], 5
mov qword ptr [rsp + 272], 5
mov qword ptr [rsp + 280], 5
mov qword ptr [rsp + 288], 5
mov qword ptr [rsp + 296], 5
mov qword ptr [rsp + 304], 5
mov qword ptr [rsp + 312], 5
mov qword ptr [rsp + 320], 5
mov qword ptr [rsp + 328], 5
mov qword ptr [rsp + 336], 5
mov qword ptr [rsp + 344], 5
mov qword ptr [rsp + 352], 5
mov qword ptr [rsp + 360], 5
mov qword ptr [rsp + 368], 5
mov qword ptr [rsp + 376], 5
mov qword ptr [rsp + 384], 5
mov qword ptr [rsp + 392], 5
mov qword ptr [rsp + 400], 5
mov qword ptr [rsp + 408], 5
mov qword ptr [rsp + 416], 5
lea rcx, [rip + .L__unnamed_1]
mov qword ptr [rsp + 424], rcx
mov qword ptr [rsp + 432], 2
mov qword ptr [rsp + 8], rax
mov rax, qword ptr [rip + core::fmt::num::imp::<impl core::fmt::Display for usize>::fmt@GOTPCREL]
mov qword ptr [rsp + 16], rax
mov qword ptr [rsp + 440], 0
lea rax, [rsp + 8]
mov qword ptr [rsp + 456], rax
mov qword ptr [rsp + 464], 1
lea rdi, [rsp + 424]
call qword ptr [rip + std::io::stdio::_print@GOTPCREL]
add rsp, 472
ret
Is that just replacing the memcpy with an inline series of moves? If so I hope we can do better. :-)
No. There are 50 stores there; that's the array initialization, not a copy. Here's what it looks like by default, without that option:

example::main:
push r14
push rbx
sub rsp, 872
mov qword ptr [rsp + 24], 5
mov qword ptr [rsp + 32], 5
mov qword ptr [rsp + 40], 5
mov qword ptr [rsp + 48], 5
mov qword ptr [rsp + 56], 5
mov qword ptr [rsp + 64], 5
mov qword ptr [rsp + 72], 5
mov qword ptr [rsp + 80], 5
mov qword ptr [rsp + 88], 5
mov qword ptr [rsp + 96], 5
mov qword ptr [rsp + 104], 5
mov qword ptr [rsp + 112], 5
mov qword ptr [rsp + 120], 5
mov qword ptr [rsp + 128], 5
mov qword ptr [rsp + 136], 5
mov qword ptr [rsp + 144], 5
mov qword ptr [rsp + 152], 5
mov qword ptr [rsp + 160], 5
mov qword ptr [rsp + 168], 5
mov qword ptr [rsp + 176], 5
mov qword ptr [rsp + 184], 5
mov qword ptr [rsp + 192], 5
mov qword ptr [rsp + 200], 5
mov qword ptr [rsp + 208], 5
mov qword ptr [rsp + 216], 5
mov qword ptr [rsp + 224], 5
mov qword ptr [rsp + 232], 5
mov qword ptr [rsp + 240], 5
mov qword ptr [rsp + 248], 5
mov qword ptr [rsp + 256], 5
mov qword ptr [rsp + 264], 5
mov qword ptr [rsp + 272], 5
mov qword ptr [rsp + 280], 5
mov qword ptr [rsp + 288], 5
mov qword ptr [rsp + 296], 5
mov qword ptr [rsp + 304], 5
mov qword ptr [rsp + 312], 5
mov qword ptr [rsp + 320], 5
mov qword ptr [rsp + 328], 5
mov qword ptr [rsp + 336], 5
mov qword ptr [rsp + 344], 5
mov qword ptr [rsp + 352], 5
mov qword ptr [rsp + 360], 5
mov qword ptr [rsp + 368], 5
mov qword ptr [rsp + 376], 5
mov qword ptr [rsp + 384], 5
mov qword ptr [rsp + 392], 5
mov qword ptr [rsp + 400], 5
mov qword ptr [rsp + 408], 5
mov qword ptr [rsp + 416], 5
lea r14, [rsp + 472]
lea rsi, [rsp + 24]
mov rbx, qword ptr [rip + memcpy@GOTPCREL]
mov edx, 400
mov rdi, r14
call rbx
lea rdi, [rsp + 24]
mov edx, 400
mov rsi, r14
call rbx
lea rax, [rsp + 104]
mov qword ptr [rsp + 8], rax
mov rax, qword ptr [rip + core::fmt::num::imp::<impl core::fmt::Display for usize>::fmt@GOTPCREL]
mov qword ptr [rsp + 16], rax
lea rax, [rip + .L__unnamed_1]
mov qword ptr [rsp + 424], rax
mov qword ptr [rsp + 432], 2
mov qword ptr [rsp + 440], 0
lea rax, [rsp + 8]
mov qword ptr [rsp + 456], rax
mov qword ptr [rsp + 464], 1
lea rdi, [rsp + 424]
call qword ptr [rip + std::io::stdio::_print@GOTPCREL]
add rsp, 872
pop rbx
pop r14
ret

The initialization is there, along with two 400-byte memcpys.
Looks like nightly (probably #82806) removes the memcpy from the original example (https://godbolt.org/z/nGfxaj), but not from the smallvec example (https://godbolt.org/z/Y34K36) cc @nikic
The memcpy was removed from the original example in 1.52: https://godbolt.org/z/5fnGqG1KG. The memcpy was removed from the smallvec example in 1.70: https://godbolt.org/z/6oGcfMaor.
See the assembly for https://is.gd/0Owhgw . Note that you'll need to explicitly specify Release, because I can't link to Release code for some reason.
Note this bit here:
That looks like a very unnecessary and large copy.
I ran into this working on rust-selectors for stylo. In the parser, we have a very large stack-allocated SmallVec, under the assumption that it is never moved. I added some inline functions that take the buffer as |self|, and parsing got several percent slower. I rewrote the code to pass things as &mut, and the overhead went away.
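A sketch of the rewrite described above (the struct and methods are illustrative, not the actual rust-selectors code): a method that takes a large buffer by value forces the whole inline buffer to be moved into and out of the call, while `&mut self` passes only a pointer.

```rust
struct ParseBuffer {
    data: [u64; 512], // large stack-allocated buffer
    len: usize,
}

impl ParseBuffer {
    // Before: consumes self, so the caller's whole buffer is moved
    // (potentially memcpy'd) in and then back out.
    fn push_by_value(mut self, v: u64) -> ParseBuffer {
        self.data[self.len] = v;
        self.len += 1;
        self
    }

    // After: borrows mutably; only a pointer is passed.
    fn push_by_ref(&mut self, v: u64) {
        self.data[self.len] = v;
        self.len += 1;
    }
}

fn main() {
    let mut buf = ParseBuffer { data: [0; 512], len: 0 };
    buf = buf.push_by_value(1); // moves the 4 KiB buffer in and out
    buf.push_by_ref(2);         // no move
    assert_eq!(buf.len, 2);
    assert_eq!(buf.data[..2], [1, 2]);
}
```

The two methods are semantically equivalent here; the by-value version just gives the compiler a copy it cannot always prove removable, which is the several-percent regression the comment describes.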
Interestingly, doing:
for x in some_small_vec.into_iter() {
...
}
Doesn't have any memcpy in the disassembly, despite the fact that into_iter() takes self. Presumably some iterator optimization going on.
CC @SimonSapin @mbrubeck @Manishearth @gankro @jonathandturner