Consider making Interlocked.And and friends into JIT intrinsics #32239
For context, it turns out the JIT already optimizes this for the existing `Interlocked.Add`:

```csharp
using System;
using System.Threading;

public class C {
    public static int M(ref int i) {
        return Interlocked.Add(ref i, 1);
    }

    public static void N(ref int i) {
        Interlocked.Add(ref i, 1);
    }
}
```

```asm
; C.M(Int32 ByRef)
L0000: mov eax, 0x1
L0005: lock xadd [rcx], eax
L0009: inc eax
L000b: ret

; C.N(Int32 ByRef)
L0000: lock add dword [rcx], 0x1
L0004: ret
```
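The same pairing can be reproduced in C via `<stdatomic.h>` (a sketch for comparison; the function names `m` and `n` are hypothetical, chosen to mirror the C# methods above). When the result is used, compilers need `lock xadd`; when it is discarded, a plain `lock add` suffices:

```c
#include <stdatomic.h>

/* C analogue of the C# example above. Note that atomic_fetch_add
   returns the OLD value, while Interlocked.Add returns the new sum,
   hence the +1 in m (mirroring the `inc eax` in the JIT's output). */
int m(atomic_int *i) {
    return atomic_fetch_add(i, 1) + 1; /* result used: lock xadd */
}

void n(atomic_int *i) {
    atomic_fetch_add(i, 1); /* result discarded: lock add suffices */
}
```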
Oh.. you said Add, not And. Those are way too closely named :-) I was very impressed it was able to reverse the pattern in `Interlocked.And` :-)
These should be done via new-style intrinsics. We may also want to convert some of the existing intrinsified interlocked ops over to this mechanism.
These APIs were introduced recently in dotnet/runtime#33042 and are already used in several places, e.g. in `Task`, `InterlockedBitVector32`, `SafeHandle`, and `RegexCharClass`. It should emit `%reg = atomicrmw and` (or `atomicrmw or`) in LLVM IR. (Unlike `Interlocked.Add`, these APIs return the old value; see dotnet/runtime#33102.) Addresses dotnet/runtime#32239 for Mono.
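The old-value contract mentioned above is the same one C's fetch-style atomics have; a minimal sketch (the helper name `fetch_and_demo` is hypothetical):

```c
#include <stdatomic.h>

/* atomic_fetch_and returns the value the target held BEFORE the
   operation, matching the Interlocked.And/Or contract described
   above (and unlike Interlocked.Add, which returns the new sum). */
int fetch_and_demo(void) {
    atomic_int flags;
    atomic_init(&flags, 0xFF);
    int old = atomic_fetch_and(&flags, 0x0F); /* old value: 0xFF */
    return old + atomic_load(&flags);         /* 0xFF + 0x0F */
}
```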
@stephentoub we're triaging proposed 5.0 codegen work -- any thoughts on the priority of this issue?
From my perspective, "nice to have", but I expect there's other higher priority JIT work that'll be more impactful. These APIs are brand new.

@GrabYourPitchforks can you make a case for addressing this in 5.0?
I agree with Steve's assessment. Optimizing these APIs might save a few cycles in a handful of places, such as below.

runtime/src/libraries/System.Private.CoreLib/src/System/Threading/Tasks/Task.cs (lines 739 to 740 in ef2ea8a)

runtime/src/libraries/System.Private.CoreLib/src/System/Runtime/InteropServices/SafeHandle.cs (lines 104 to 105 in ef2ea8a)

These are hot code paths, but not so hot that a few extra cycles are going to kill us. :) If you have other more pressing matters, so be it.
Ok, I'm going to move this to Future.
@echesakovMSFT Possible candidate for .NET 6. |
This is a bit tricky for the cases when the return value of And/Or is used 🙂
Wouldn't it be able to use the same logic?
@GrabYourPitchforks here is what LLVM emits for us (mono-llvm): https://godbolt.org/z/vbsYz3

Better demo: https://godbolt.org/z/cMj5Tx
When using Armv8.1 instructions, the clang/llvm generated code seems much better than its Armv8.0 counterpart (https://godbolt.org/z/Yvjcqb), and no loops are required.

Armv8.0:

```asm
InterlockedAdd(int*, int):   // @InterlockedAdd(int*, int)
.LBB0_1:                     // =>This Inner Loop Header: Depth=1
    ldaxr w8, [x0]
    add w8, w8, w1
    stlxr w9, w8, [x0]
    cbnz w9, .LBB0_1
    mov w0, w8
    ret
InterlockedAnd(int*, int):   // @InterlockedAnd(int*, int)
.LBB1_1:                     // =>This Inner Loop Header: Depth=1
    ldaxr w8, [x0]
    and w8, w8, w1
    stlxr w9, w8, [x0]
    cbnz w9, .LBB1_1
    mov w0, w8
    ret
InterlockedOr(int*, int):    // @InterlockedOr(int*, int)
.LBB2_1:                     // =>This Inner Loop Header: Depth=1
    ldaxr w8, [x0]
    orr w8, w8, w1
    stlxr w9, w8, [x0]
    cbnz w9, .LBB2_1
    mov w0, w8
    ret
```

Armv8.1:

```asm
InterlockedAdd(int*, int):   // @InterlockedAdd(int*, int)
    ldaddal w1, w8, [x0]
    add w0, w8, w1
    ret
InterlockedAnd(int*, int):   // @InterlockedAnd(int*, int)
    mvn w8, w1
    ldclral w8, w8, [x0]
    and w0, w8, w1
    ret
InterlockedOr(int*, int):    // @InterlockedOr(int*, int)
    ldsetal w1, w8, [x0]
    orr w0, w8, w1
    ret
```
@echesakovMSFT Do you mind if I try to implement this?

@EgorBo I haven't started working on this. If you want to work on this, feel free to do so and un-assign me.
Reopening since #46253 only handled them for arm64 |
Moving to Future; unfortunately it's not easy to optimize for x86, because we can only optimize cases where the return value of these methods isn't used, and we can't easily check that fact at the importer phase.
This is actually a decently broad issue with intrinsics. There are many cases where we want to optimistically create an intrinsic tree node but where we won't know if it can actually be an intrinsic or a call until much later. Some of the older intrinsics have this problem. Given that these interlocked methods only deal with primitives, it's possible that they won't hit the same problems.
@tannergooding the problem is that when we re-write these in the importer, we don't yet know whether the return value is used. Perhaps it's not a big deal: "call" overhead for the case where the return value is used, and a super fast implementation for cases where it's not; e.g. all BCL usages (there are just a few of them) don't care about the return value.
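The x86 dilemma above can be sketched in C (an illustrative sketch, not the runtime's code; `interlocked_and` is a hypothetical name). x86's `lock and` does not produce the old value, so when the caller needs it the operation must be a compare-exchange loop; only when the result is discarded can a single lock'd instruction be emitted:

```c
#include <stdatomic.h>

/* And-with-return via a compare-exchange loop: the shape the JIT
   must fall back to on x86 whenever the old value is consumed. */
int interlocked_and(atomic_int *target, int value) {
    int old = atomic_load(target);
    /* on failure, compare_exchange reloads `old` with the current value */
    while (!atomic_compare_exchange_weak(target, &old, old & value)) {
    }
    return old; /* the old value, matching Interlocked.And's contract */
}
```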
From #32216 (comment):

#32216 adds new methods like `Interlocked.And`. Some of these could be done as JIT intrinsics and replaced by better implementations/instructions in some cases, like `lock and` on platforms that have it.

cc: @tannergooding
category:cq
theme:jit-intrinsics
skill-level:intermediate
cost:medium
impact:medium