
Bug: Very inefficient code generated for async functions setup (and likely for generators in general) #99504

Open
schreter opened this issue Jul 20, 2022 · 18 comments
Labels
A-mir-opt Area: MIR optimizations I-slow Issue: Problems and improvements with respect to performance of generated code.

Comments

@schreter

What we see in our project, which also uses some rather large Futures, is a lot of unnecessary memory copying. These memcpy() calls are the hottest function in the profile (and in some cases very dominant, around 90%, even at optimization level 3). I searched for similar issues but found none, so here we go:

What happens:

  • Calling an async function in fact calls a stub producing a Future (which is really a generator that is later polled). This Future is stored in the caller's async function "stack" (i.e., in the caller's own Future), so the caller's Future is the aggregate of the parent function's state and the called function's Future (with some layout optimizations when calling multiple async functions in independent blocks).
  • Unfortunately, instead of telling the Future generator to materialize the Future directly into the proper place in the caller's state, the Future is first materialized on the regular stack and then copied from the stack into the caller's Future.
  • Now, we have a loop calling a request handler which supports various request types (dispatched by a match statement), where one or more of them produce a large Future. The call to the request handler then materializes a Future by writing a few words to the stack, and this (practically empty) space is copied in full from the stack into the caller's Future (i.e., including the uninitialized space - it's simply a binary move).

This wastes cycles like there is no tomorrow - instead of materializing the callee's Future on the stack, async functions should materialize the callee's Future directly in place in the caller's Future. That would save the copying (and, especially, the copying of still-uninitialized space!).

A minimal example in Compiler Explorer is here: https://godbolt.org/z/b45MTex3e. You can see that callee().await first materializes the Future on the stack and then copies it into the proper place.
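For reference, a std-only sketch approximating the linked example (names and sizes are my assumptions, not the exact godbolt code):

    use std::future::ready;

    #[inline(never)] // keep the generator-producing stub out of line
    pub async fn callee() {
        let mut data = [0_u32; 128];          // makes the callee's Future large
        ready(()).await;                      // suspension point: `data` becomes part of the state
        println!("addr {:p}", &mut data);     // used after the await, so it must be kept alive
    }

    pub async fn caller() {
        // The callee's Future is first built in a stack temporary and then
        // memcpy-ed into this function's Future state before being polled.
        callee().await;
    }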

Generated instructions (Intel):

        sub     rsp, 520    # in function prologue, reserve space for temporaries
...
        mov     r15, rdi    # in function prologue, arg0 (&mut Future) of the caller
...
        mov     rbx, rsp    # the address of a temporary for the callee's Future
        mov     rdi, rbx    # set as arg0 of the stub generating the Future
        call    qword ptr [rip + example::callee@GOTPCREL]
        mov     edx, 516    # size of the callee's Future (including any uninitialized parts)
        mov     rdi, r15    # position of the callee's Future inside the caller's Future
        mov     rsi, rbx    # temporary holding the callee's Future
        call    qword ptr [rip + memcpy@GOTPCREL]

What I'd expect to see:

        (no space reservation for the temporary of callee's Future)
...
        mov     r15, rdi    # in function prologue, arg0 (&mut Future) of the caller
...
        mov     rdi, r15    # set as arg0 of the stub generating the Future to the proper position
        call    qword ptr [rip + example::callee@GOTPCREL]
        (no memcpy, since the callee's Future is already materialized in the right place)

This might be related to #97540, I also posted it there first.

Interestingly, the same problem does NOT happen when calling a function that produces a large result and storing it in a variable in the async closure, then using that variable later. In that case, the function producing the large result writes the value directly into the future's state. The same is true when storing the large generated future in a variable, pinning it explicitly, and awaiting it (as demonstrated in https://godbolt.org/z/dWzoqEjh1).
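A sketch of that pinning workaround (assumed shape, reusing callee() from above; the godbolt link has the exact code):

    use std::pin::Pin;

    pub async fn caller_workaround() {
        // The callee's Future is written directly into this function's state...
        let mut fut = callee();
        // SAFETY: `fut` is not moved again after being pinned here.
        let fut = unsafe { Pin::new_unchecked(&mut fut) };
        // ...and only the 8-byte Pin<&mut _> flows through into_future/poll.
        fut.await;
    }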

We found some temporary workarounds for the problem: boxing some especially large futures and/or the explicit pinning described above. This helps improve performance somewhat, but memory allocation is not particularly cheap either. Further, hunting down these issues requires a lot of analysis time (especially since there is no way to enable a warning about large futures). Therefore, these workarounds are not practicable.

A real solution to the problem, removing the unnecessary memcpy, would be very desirable, since it would help the performance of async code in general. It looks like some move optimization is missing.

BTW, I tried to post this directly as a bug, but the "Submit new issue" button was grayed out. Therefore, I'm submitting it as a blank issue.

@nikic
Contributor

nikic commented Jul 20, 2022

From a cursory look, I believe this requires a MIR optimization; LLVM doesn't have enough information for this, and I don't think it can be provided either (heck, %self isn't even noalias).

@schreter
Author

@nikic Well, I'm no expert on LLVM or MIR; my knowledge is pretty superficial, so I can't judge.

But the workaround shown in https://godbolt.org/z/dWzoqEjh1 basically implements manually what the Rust compiler should be doing, and it works w/o the temporary on the stack and thus w/o memcpy(). So I hope the fix for the problem will be trivial and the performance improvement across the board quite high.

BTW, would you mind adding appropriate labels to the issue (bug, performance, ...)? I don't seem to be able to add anything. Thanks.

@nikic nikic added I-slow Issue: Problems and improvements with respect to performance of generated code. A-mir-opt Area: MIR optimizations labels Jul 20, 2022
@schreter
Author

I did some more digging in the MIR. What I see is the following (slow with memcpy: https://godbolt.org/z/eKGqjhWWP, fast with the Pin workaround: https://godbolt.org/z/G4xfs3hxz):

With "regular" call and await, the code generated is like this:

    bb1: {
        _17 = move _2;                   // scope 0 at /app/example.rs:4:21: 9:2
        _4 = callee() -> [return: bb2, unwind: bb11]; // scope 0 at /app/example.rs:8:5: 8:13
    }

    bb2: {
        _3 = <impl Future<Output = ()> as IntoFuture>::into_future(move _4) -> [return: bb3, unwind: bb11]; // scope 0 at /app/example.rs:8:13: 8:19
    }

    bb3: {
        (((*(_1.0: &mut [static generator@/app/example.rs:4:21: 9:2])) as variant#3).0: impl std::future::Future<Output = ()>) = move _3; // scope 0 at /app/example.rs:8:13: 8:19
        goto -> bb4;                     // scope 1 at /app/example.rs:8:13: 8:19
    }

    bb4: {
        _8 = &mut (((*(_1.0: &mut [static generator@/app/example.rs:4:21: 9:2])) as variant#3).0: impl std::future::Future<Output = ()>); // scope 2 at /app/example.rs:8:13: 8:19
        _7 = &mut (*_8);                 // scope 2 at /app/example.rs:8:13: 8:19
        _6 = Pin::<&mut impl Future<Output = ()>>::new_unchecked(move _7) -> [return: bb5, unwind: bb11]; // scope 2 at /app/example.rs:8:13: 8:19
    }

    bb5: {
        _11 = _17;                       // scope 2 at /app/example.rs:8:13: 8:19
        _10 = get_context(move _11) -> [return: bb6, unwind: bb11]; // scope 2 at /app/example.rs:8:5: 8:19
    }

    bb6: {
        _9 = &mut (*_10);                // scope 2 at /app/example.rs:8:5: 8:19
        _5 = <impl Future<Output = ()> as Future>::poll(move _6, move _9) -> [return: bb7, unwind: bb11]; // scope 2 at /app/example.rs:8:13: 8:19
    }

In bb1, the generated future is stored in _4; bb2 moves it into _3 via the call to into_future (which is an identity function returning self, and is not marked #[inline]); then bb3 moves _3 into _1.0, which is the place in the generator's state. Afterwards, bb4 pins this state via new_unchecked, bb5 acquires the context, and bb6 calls poll on the future.

So either the move in bb2 (inside the into_future call) or the one in bb3 is not optimized out.

The workaround I made with pinning looks a bit different:

    bb1: {
        _19 = move _2;                   // scope 0 at /app/example.rs:4:21: 9:2
        (((*(_1.0: &mut [static generator@/app/example.rs:4:21: 9:2])) as variant#3).0: impl std::future::Future<Output = ()>) = callee() -> [return: bb2, unwind: bb12]; // scope 0 at /app/example.rs:6:19: 6:27
    }

    bb2: {
        _4 = &mut (((*(_1.0: &mut [static generator@/app/example.rs:4:21: 9:2])) as variant#3).0: impl std::future::Future<Output = ()>); // scope 3 at /app/example.rs:7:53: 7:61
        _3 = Pin::<&mut impl Future<Output = ()>>::new_unchecked(move _4) -> [return: bb3, unwind: bb12]; // scope 3 at /app/example.rs:7:24: 7:62
    }

    bb3: {
        _6 = move _3;                    // scope 2 at /app/example.rs:8:5: 8:8
        _5 = <Pin<&mut impl Future<Output = ()>> as IntoFuture>::into_future(move _6) -> [return: bb4, unwind: bb12]; // scope 2 at /app/example.rs:8:8: 8:14
    }

    bb4: {
        (((*(_1.0: &mut [static generator@/app/example.rs:4:21: 9:2])) as variant#3).1: std::pin::Pin<&mut impl std::future::Future<Output = ()>>) = move _5; // scope 2 at /app/example.rs:8:8: 8:14
        goto -> bb5;                     // scope 4 at /app/example.rs:8:8: 8:14
    }

    bb5: {
        _10 = &mut (((*(_1.0: &mut [static generator@/app/example.rs:4:21: 9:2])) as variant#3).1: std::pin::Pin<&mut impl std::future::Future<Output = ()>>); // scope 5 at /app/example.rs:8:8: 8:14
        _9 = &mut (*_10);                // scope 5 at /app/example.rs:8:8: 8:14
        _8 = Pin::<&mut Pin<&mut impl Future<Output = ()>>>::new_unchecked(move _9) -> [return: bb6, unwind: bb12]; // scope 5 at /app/example.rs:8:8: 8:14
    }

    bb6: {
        _13 = _19;                       // scope 5 at /app/example.rs:8:8: 8:14
        _12 = get_context(move _13) -> [return: bb7, unwind: bb12]; // scope 5 at /app/example.rs:8:5: 8:14
    }

    bb7: {
        _11 = &mut (*_12);               // scope 5 at /app/example.rs:8:5: 8:14
        _7 = <Pin<&mut impl Future<Output = ()>> as Future>::poll(move _8, move _11) -> [return: bb8, unwind: bb12]; // scope 5 at /app/example.rs:8:8: 8:14
    }

Here, the future is stored directly into the caller's state (_1.0) in bb1. Then bb2 performs the pinning, and in bb3 we again see a call to into_future, which still moves the future, but this future is only a pinned reference (i.e., an 8-byte value). This one is then polled in bb7. Therefore, we don't see the memcpy, since only an 8-byte value is moved (i.e., in registers).

We see that await calling into_future effectively forces the callee's future onto the stack so that into_future (an identity function returning self) can be called on it. If the call to into_future were inlined into the caller - i.e., into bb3 in the "slow" example above - this move should be optimized out by further layers as well.
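For reference, the blanket impl in std looks roughly like this (paraphrased from core::future::IntoFuture):

    impl<F: Future> IntoFuture for F {
        type Output = F::Output;
        type IntoFuture = F;

        fn into_future(self) -> Self::IntoFuture {
            self // pure identity: the future is just moved through
        }
    }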

So I thought, maybe it's just a question of adding #[inline] to the implementation of <Future as IntoFuture>::into_future()? Unfortunately, I couldn't get std recompiled properly with my change, so I can't verify it. I tried to simulate it with a newtype around an explicit future type, but then the into_future function isn't an identity; no matter whether it is inlined or not, the memcpy is produced.

I'm stopping the analysis here for now. Maybe someone can take it further (e.g., look at the generated LLVM IR).

@schreter
Author

Well, I did take yet another look. I can confirm that the call to memcpy is already present in the generated LLVM IR (i.e., LLVM is not guilty here).

@nikic You mentioned that this requires a MIR optimization. In the MIR output, there is no explicit call to memcpy; there are just calls to IntoFuture::into_future (which is an identity function for a Future) and "regular" moves. But yes, this call to into_future seems to be what causes the memcpy in the LLVM IR.

In my pinned-future workaround, I added an explicit call to into_future() or to a simple identity function before pinning the future. See https://godbolt.org/z/9vcYdsMT8 (please ignore the hacks I needed to get it to compile w/o Tokio on nightly). In both cases, the identity function call caused a memcpy() call in the resulting code. In other words, Rust generates a memcpy() for the identity function whenever the destination is in the async context, even though it's basically a simple move that should be optimized out! So in that sense, I believe you are right that this is about MIR optimizations, which should optimize out the identity function call when the destination doesn't need to run a destructor (e.g., when it's not yet live in the future's state). A sketch of the experiment follows below.
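A sketch of that identity experiment (hypothetical names, with callee() as before; the godbolt link has the real code):

    // Even this trivial wrapper reintroduces the stack temporary and the
    // memcpy when its result ends up in the async state.
    #[inline(always)]
    fn identity<T>(x: T) -> T {
        x
    }

    pub async fn caller() {
        let fut = identity(callee()); // forces a stack temporary + memcpy
        fut.await;
    }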

Note that this seems to have something to do with the destination being in the caller's future's state. I could not reproduce a similar problem with regular functions - there, the identity function is always optimized out, whether assigning to temporaries, to composite objects, via RVO, etc. A contrasting sketch follows below.
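For contrast, a non-async sketch (hypothetical names) where the identity call is optimized out and the result is constructed directly in the caller's slot:

    fn identity<T>(x: T) -> T {
        x
    }

    fn make_big() -> [u32; 128] {
        [0; 128]
    }

    pub fn regular_caller() -> [u32; 128] {
        // No memcpy here: the array is written straight into the return
        // slot provided by regular_caller's own caller (RVO).
        identity(make_big())
    }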

The question is: what can be done to solve this? All async programs are heavily impacted by it.

@JakobDegen
Contributor

This is fixable in principle, but it's not easy. The first thing to do would be to fix the fact that MIR inlining into generators is turned off; without that, we have no shot at doing this at all. From a cursory look, this seems non-trivial. cc @cjgillot - do you have any idea how much work this would be, and/or is it on your radar (or anyone else's who's working on inlining)?

@schreter
Author

@JakobDegen Just an understanding question:

inlining into generators is turned off

Which inlining do you mean? I do see stuff being inlined into async functions, even multiple async functions into each other, in the final compiler output (after all phases).

@JakobDegen
Contributor

Sorry, I mean in MIR. LLVM still inlines, but since this is something that needs to be fixed in MIR, we need to do the inlining there as well.

@schreter
Author

@JakobDegen Thanks for the explanation. I thought it was something like that :-).

BTW, just an idea which might help short-term: when generating code for await, the compiler unconditionally generates a call to IntoFuture::into_future(), independently of whether the temporary already is a future or not. In 99.9% of cases, it will already be a Future.

Would it be possible to determine this in the compiler, in the code-generation path in lower_expr_await(), and for "true" Futures (i.e., a type which already implements Future) simply skip generating the call to into_future()? This check should be doable with the existing information (the expression type), right?

An alternative would be to check there whether the implementation of into_future() is an identity function and, if it is, optimize it out on the spot. However, I assume that this is impossible with the information available there.

This wouldn't be a "proper" fix, but it would alleviate the most pressing issue, which makes async programs slow. And, provided the expression type can be checked for implementing the Future trait, it should be quite trivial to implement (a rough sketch follows below).
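A very rough pseudocode sketch of the proposal (hypothetical method names, not actual rustc code; the real lowering may not have the type information in this form):

    // Hypothetical shape of the check inside the await lowering:
    fn lower_expr_await(&mut self, operand: &Expr) {
        if self.expr_type_implements_future(operand) {
            // Already a Future: skip the IntoFuture::into_future()
            // identity call and poll the operand in place.
            self.lower_await_in_place(operand)
        } else {
            self.lower_into_future_then_await(operand)
        }
    }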

What do you think?

@JakobDegen
Contributor

It would be possible to not emit the IntoFuture call, I think, but I'm not so sure I want to go down that route - we should at least evaluate how complicated the proper fix for this is first. If it is too complicated, not emitting it for Futures seems feasible.

@schreter
Author

I'm not so sure I want to go down that route - we should at least evaluate how complicated the proper fix

Well, I fully agree :-). But based on the discussion above, it seems like the proper fix is quite far away. That's why I proposed the workaround for now, which can likely be backported to a stable release soon.

@cjgillot
Contributor

I hacked #99782 to try and enable MIR inlining for generators. Does this branch solve the issue you are having?

@JakobDegen
Contributor

That on its own is probably not enough, but try it with #96451 rebased on top. Even if that doesn't solve this particular issue, it should still get rid of a bunch of copies (I would love to see a usage report if you have some internal benchmarks).

@schreter
Author

@cjgillot I tested with your changes (I hope correctly) by building the Rust compiler from source, adding it as a new stage1 toolchain, and then using it from cargo via cargo +stage1, as documented in https://rustc-dev-guide.rust-lang.org/building/how-to-build-and-run.html.

Unfortunately, it doesn't seem to help. The callee's future is still materialized on the stack first and then memcpy-ed into the proper place in the caller's future's space.

@JakobDegen Your changes can't be applied conflict-free, neither on the current master nor on top of the above changes. I resolved the relevant technical conflicts (not much, basically one function), but there are further logical conflicts (like match statements that are no longer exhaustive), so I didn't investigate further. It seems you'll have to rebase your PR on the current master and fix those; I can try again afterwards.

@JakobDegen
Contributor

Ah, crap. Yeah, I'll try and do that tonight

@JakobDegen
Contributor

@schreter rebased, try now

@schreter
Author

schreter commented Aug 4, 2022

@JakobDegen (CC @workingjubilee) This time I could merge both changes cleanly; yours had just a minor conflict in some test files for generated LLVM instructions.

However, it doesn't seem to help - memcpy() is still generated.

Of course, the error may be on my side and the compiler used may not be the newly-built one (I'm not sure how to check that; I just followed the instructions mentioned above).

Can you please test yourself? The minimal example is this:

#[tokio::main]
pub async fn main() {
    // this call causes the Future from `callee()`
    // to be materialized on stack and memcpy-moved
    // to the future of this function
    callee().await;
}

#[inline(never)] // prevent inlining of the generator-creating stub
pub async fn callee() {
    let mut data = [0_u32; 128];
    tokio::task::yield_now().await;
    println!("addr {:p}", &mut data);
}

If you set a breakpoint at callee (the generator-producing function), you can see in its caller that the generator is materialized on the stack and then memcpy-ed into the parent's future.

@JakobDegen
Contributor

I can probably take a look at some point, but it might be a bit

@schreter
Author

@JakobDegen Any news? This, together with #97540, causes quite big and measurable performance issues in our project. Thanks.
