Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Consider making Interlocked.And and friends into JIT intrinsics #32239

Closed
stephentoub opened this issue Feb 13, 2020 · 21 comments · Fixed by #46253 or #96258
Closed

Consider making Interlocked.And and friends into JIT intrinsics #32239

stephentoub opened this issue Feb 13, 2020 · 21 comments · Fixed by #46253 or #96258
Assignees
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
Milestone

Comments

@stephentoub
Copy link
Member

stephentoub commented Feb 13, 2020

From #32216 (comment)

#32216 adds new methods like Interlocked.And. Some of these could be done as JIT intrinsics and replaced by better implementations/instructions in some cases, like lock and on platforms that have it.

cc: @tannergooding

category:cq
theme:jit-intrinsics
skill-level:intermediate
cost:medium
impact:medium

@Dotnet-GitSync-Bot Dotnet-GitSync-Bot added area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI untriaged New issue has not been triaged by the area owner labels Feb 13, 2020
@BruceForstall
Copy link
Member

@AndyAyersMS

@GrabYourPitchforks
Copy link
Member

For context, it turns out the JIT already optimizes this for the existing Interlocked.Add and similar APIs.

using System;
using System.Threading;

public class C {
    public static int M(ref int i) {
        return Interlocked.Add(ref i, 1);
    }
    
    public static void N(ref int i) {
        Interlocked.Add(ref i, 1);
    }
}
; C.M(Int32 ByRef)
    L0000: mov eax, 0x1
    L0005: lock xadd [rcx], eax
    L0009: inc eax
    L000b: ret

; C.N(Int32 ByRef)
    L0000: lock add dword [rcx], 0x1
    L0004: ret

@stephentoub
Copy link
Member Author

stephentoub commented Feb 14, 2020

Nice. This can be closed then, right?

Oh.. you said Add, not And. Those are way too closely named :-) I was very impressed it was able to reverse the pattern in Interlocked.And :-)

@AndyAyersMS AndyAyersMS removed the untriaged New issue has not been triaged by the area owner label Feb 18, 2020
@AndyAyersMS AndyAyersMS added this to the 5.0 milestone Feb 18, 2020
@AndyAyersMS
Copy link
Member

These should be done via new-style intrinsics. We may also want to convert some of the existing intrinsified interlocked ops over to this mechanism as well.

monojenkins pushed a commit to monojenkins/mono that referenced this issue Mar 5, 2020
These APIs were introduced recently in dotnet/runtime#33042 and already used in several places e.g. in `Task`, `InterlockedBitVector32`, `SafeHandle`, `RegexCharClass`

It should emit `%reg = atomicrmw and (or)` in LLVM IR. (and unlike `Interlocked.Add` these API return old value, see dotnet/runtime#33102

Addresses dotnet/runtime#32239 for Mono.
EgorBo added a commit to mono/mono that referenced this issue Mar 10, 2020
These APIs were introduced recently in dotnet/runtime#33042 and already used in several places e.g. in `Task`, `InterlockedBitVector32`, `SafeHandle`, `RegexCharClass`

It should emit `%reg = atomicrmw and (or)` in LLVM IR. (and unlike `Interlocked.Add` these API return old value, see dotnet/runtime#33102

Addresses dotnet/runtime#32239 for Mono.

Co-authored-by: EgorBo <[email protected]>
@AndyAyersMS
Copy link
Member

@Stoub we're triaging proposed 5.0 codegen work -- any thoughts on the priority of this issue?

@stephentoub
Copy link
Member Author

From my perspective, "nice to have", but I expect there's other higher priority JIT work that'll be more impactful. These APIs are brand new.

@AndyAyersMS
Copy link
Member

@GrabYourPitchforks can you make a case for addressing this in 5.0?

@GrabYourPitchforks
Copy link
Member

I agree with Steve's assessment. Optimizing these APIs might save a few cycles in a handful of places, such as below.

// Atomically clear the TASK_STATE_WAIT_COMPLETION_NOTIFICATION bit
Interlocked.And(ref m_stateFlags, ~TASK_STATE_WAIT_COMPLETION_NOTIFICATION);

// Set closed state (low order bit of the _state field).
Interlocked.Or(ref _state, StateBits.Closed);

These are hot code paths, but not so hot that a few extra cycles are going to kill us. :) If you have other more pressing matters, so be it.

@AndyAyersMS
Copy link
Member

Ok, I'm going to move this to future.

@AndyAyersMS AndyAyersMS modified the milestones: 5.0, Future May 1, 2020
@JulieLeeMSFT
Copy link
Member

@echesakovMSFT Possible candidate for .NET 6.

@EgorBo
Copy link
Member

EgorBo commented Oct 15, 2020

this is a bit tricky for the cases when return value of And/Or is used 🙂

@GrabYourPitchforks
Copy link
Member

Wouldn't it be able to use the same logic Interlocked.Add and friends use? See my earlier comment at #32239 (comment).

@EgorBo
Copy link
Member

EgorBo commented Oct 15, 2020

@GrabYourPitchforks here is what LLVM emits for us (mono-llvm): https://godbolt.org/z/vbsYz3

Quoting LLVM:

      // There are two different ways of expanding RMW instructions:
      // - into a load if it is idempotent
      // - into a Cmpxchg/LL-SC loop otherwise

Or/And aren't idempotent, Add is.

@EgorBo
Copy link
Member

EgorBo commented Oct 15, 2020

better demo: https://godbolt.org/z/cMj5Tx

@echesakov
Copy link
Contributor

When using Armv8.1 instructions the clang/llvm generated code seems much better than its Armv8.0 counterpart https://godbolt.org/z/Yvjcqb and there is no loops required.

Armv8.0

InterlockedAdd(int*, int):                  // @InterlockedAdd(int*, int)
.LBB0_1:                                // =>This Inner Loop Header: Depth=1
        ldaxr   w8, [x0]
        add     w8, w8, w1
        stlxr   w9, w8, [x0]
        cbnz    w9, .LBB0_1
        mov     w0, w8
        ret
InterlockedAnd(int*, int):                  // @InterlockedAnd(int*, int)
.LBB1_1:                                // =>This Inner Loop Header: Depth=1
        ldaxr   w8, [x0]
        and     w8, w8, w1
        stlxr   w9, w8, [x0]
        cbnz    w9, .LBB1_1
        mov     w0, w8
        ret
InterlockedOr(int*, int):                   // @InterlockedOr(int*, int)
.LBB2_1:                                // =>This Inner Loop Header: Depth=1
        ldaxr   w8, [x0]
        orr     w8, w8, w1
        stlxr   w9, w8, [x0]
        cbnz    w9, .LBB2_1
        mov     w0, w8
        ret

Armv8.1

InterlockedAdd(int*, int):                  // @InterlockedAdd(int*, int)
        ldaddal w1, w8, [x0]
        add     w0, w8, w1
        ret
InterlockedAnd(int*, int):                  // @InterlockedAnd(int*, int)
        mvn     w8, w1
        ldclral w8, w8, [x0]
        and     w0, w8, w1
        ret
InterlockedOr(int*, int):                   // @InterlockedOr(int*, int)
        ldsetal w1, w8, [x0]
        orr     w0, w8, w1
        ret

@EgorBo
Copy link
Member

EgorBo commented Dec 12, 2020

@echesakovMSFT Do you mind if I try to implement Or and And for arm ? (with the atomics)
In case if you haven't started it yet of course. I just want to do something for Arm back-end 🙂

@echesakov
Copy link
Contributor

@EgorBo I haven't started working on this. If you want to work on this, feel free to do so and un-assign me.

@EgorBo
Copy link
Member

EgorBo commented Feb 10, 2021

Reopening since #46253 only handled them for arm64

@EgorBo EgorBo removed the preview label Apr 9, 2021
@EgorBo EgorBo modified the milestones: 6.0.0, Future Apr 9, 2021
@EgorBo
Copy link
Member

EgorBo commented Apr 9, 2021

Moving to Future, it's not easy to optimize it for x86 unfortunately because we only can optimize cases where return value of these methods won't be used and we can't check that fact at the importer phase easily.

@tannergooding
Copy link
Member

tannergooding commented Apr 9, 2021

Moving to Future, it's not easy to optimize it for x86 unfortunately because we only can optimize cases where return value of these methods won't be used and we can't check that fact at the importer phase easily.

This is actually a decently broad issue with intrinsics. There are many cases where we want to optimistically create an intrinsic tree node but where we won't know if it can actually be an intrinsic or call until much later.
With hardware intrinsics, this leads to codegen issues for things that are detected to be constants after importation, for example.

Some of the older intrinsics (GT_INTRINSIC) have support in rationalizer to be rewritting as user calls: RewriteIntrinsicAsUserCall
I had taken a look at enabling this more broadly for cases like GT_HWINTRINSIC so we could fix cases like this but ran into issues due to the non-primitive inputs/returns (and these were out of my depth to resolve at the time): #11062 (comment)

Given that these interlocked methods only deal with primitives, its possible that it won't hit the same problems.

@EgorBo
Copy link
Member

EgorBo commented Apr 9, 2021

Some of the older intrinsics (GT_INTRINSIC) have support in rationalizer to be rewritting as user calls: RewriteIntrinsicAsUserCall

@tannergooding the problem that if we re-write GT_XAND to a call like that - we'll regress the current case when we inline it as complex expression (contains a loop) as is. see https://sharplab.io/#v2:EYLgxg9gTgpgtADwGwBYA0AXEBDAzgWwB8ABAJgEYBYAKGIAYACY8gOgBUALWbAEwEsAdgHMA3DRrEAzE1IMAwgwDeNBqoYq1G1VKYoGbGLgwAKWADMGgjA2zk0lgdeykAlFqXu1DAPTeGAQQEeS1wHABtBGGCOGFhPNQBJR1iwiDAAayiWQJ5TGAtbe2cXMWovAF8acqA==

Perhaps, it's not a big deal ("call" overhead for the case where return value is used and super fast implementation for cases where it's not -- e.g. all BCL usages (there are just a few of them) don't care about return value.

@ghost ghost added the in-pr There is an active PR which will close this issue when it is merged label Dec 21, 2023
@stephentoub stephentoub modified the milestones: Future, 9.0.0 Dec 26, 2023
@ghost ghost removed the in-pr There is an active PR which will close this issue when it is merged label Jan 5, 2024
@github-actions github-actions bot locked and limited conversation to collaborators Feb 5, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
Projects
None yet
9 participants