-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add Metal atomics backend following the AtomixCUDA package - almost verbatim. #39
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
Did you also verify this against KernelAbstractions?
No concrete plans, but it shouldn't be too hard to generalize the functionality for that in Metal.jl I've also bumped the version numbers here to v1.0 to get out of the dreaded v0.x regime, so it will take until KA.jl bumps compat for this to actually be installable. |
Thanks a lot for going over this! We're waiting on atomics in ImplicitBVH.jl to make it work across the JuliaGPU stacks. I made a local copy of KernelAbstractions and bumped the using KernelAbstractions
using Atomix
using Metal
@kernel cpu=false function atomic_add_ka!(v)
i = @index(Global)
Atomix.@atomic v[i] += eltype(v)(1)
end
v = Metal.zeros(Int32, 1000)
atomic_add_ka!(get_backend(v), 128)(v, ndrange=length(v))
@assert all(Array(v) .== 1) Which gives me the following error:
When running julia> @macroexpand @kernel cpu=false function atomic_add_ka!(v)
i = @index(Global)
Atomix.@atomic v[i] += eltype(v)(1)
end
quote
function gpu_atomic_add_ka!(__ctx__, v; )
let
#= /Users/anicusan/Prog/Julia/Packages/Atomix.jl-fork/prototype/KernelAbstractions.jl/src/macros.jl:96 =#
if (KernelAbstractions.__validindex)(__ctx__)
#= /Users/anicusan/Prog/Julia/Packages/Atomix.jl-fork/prototype/KernelAbstractions.jl/src/macros.jl:97 =#
begin
#= REPL[15]:1 =#
#= REPL[15]:2 =#
i = KernelAbstractions.__index_Global_Linear(__ctx__)
#= REPL[15]:3 =#
((Atomix.Internal.Atomix).modify!((Atomix.Internal.referenceable(v))[i], +, (eltype(v))(1), UnsafeAtomics.seq_cst))[2]
end
end
#= /Users/anicusan/Prog/Julia/Packages/Atomix.jl-fork/prototype/KernelAbstractions.jl/src/macros.jl:99 =#
return nothing
end
end
begin
#= /Users/anicusan/Prog/Julia/Packages/Atomix.jl-fork/prototype/KernelAbstractions.jl/src/macros.jl:54 =#
if !($(Expr(:isdefined, :atomic_add_ka!)))
#= /Users/anicusan/Prog/Julia/Packages/Atomix.jl-fork/prototype/KernelAbstractions.jl/src/macros.jl:55 =#
begin
$(Expr(:meta, :doc))
atomic_add_ka!(dev) = begin
#= /Users/anicusan/Prog/Julia/Packages/Atomix.jl-fork/prototype/KernelAbstractions.jl/src/macros.jl:55 =#
atomic_add_ka!(dev, (KernelAbstractions.NDIteration.DynamicSize)(), (KernelAbstractions.NDIteration.DynamicSize)())
end
end
#= /Users/anicusan/Prog/Julia/Packages/Atomix.jl-fork/prototype/KernelAbstractions.jl/src/macros.jl:56 =#
atomic_add_ka!(dev, size) = begin
#= /Users/anicusan/Prog/Julia/Packages/Atomix.jl-fork/prototype/KernelAbstractions.jl/src/macros.jl:56 =#
atomic_add_ka!(dev, (KernelAbstractions.NDIteration.StaticSize)(size), (KernelAbstractions.NDIteration.DynamicSize)())
end
#= /Users/anicusan/Prog/Julia/Packages/Atomix.jl-fork/prototype/KernelAbstractions.jl/src/macros.jl:57 =#
atomic_add_ka!(dev, size, range) = begin
#= /Users/anicusan/Prog/Julia/Packages/Atomix.jl-fork/prototype/KernelAbstractions.jl/src/macros.jl:57 =#
atomic_add_ka!(dev, (KernelAbstractions.NDIteration.StaticSize)(size), (KernelAbstractions.NDIteration.StaticSize)(range))
end
#= /Users/anicusan/Prog/Julia/Packages/Atomix.jl-fork/prototype/KernelAbstractions.jl/src/macros.jl:58 =#
function atomic_add_ka!(dev::Dev, sz::S, range::NDRange) where {Dev, S <: KernelAbstractions.NDIteration._Size, NDRange <: KernelAbstractions.NDIteration._Size}
#= /Users/anicusan/Prog/Julia/Packages/Atomix.jl-fork/prototype/KernelAbstractions.jl/src/macros.jl:58 =#
#= /Users/anicusan/Prog/Julia/Packages/Atomix.jl-fork/prototype/KernelAbstractions.jl/src/macros.jl:59 =#
if (KernelAbstractions.isgpu)(dev)
#= /Users/anicusan/Prog/Julia/Packages/Atomix.jl-fork/prototype/KernelAbstractions.jl/src/macros.jl:60 =#
return (KernelAbstractions.construct)(dev, sz, range, gpu_atomic_add_ka!)
else
#= /Users/anicusan/Prog/Julia/Packages/Atomix.jl-fork/prototype/KernelAbstractions.jl/src/macros.jl:62 =#
if false
#= /Users/anicusan/Prog/Julia/Packages/Atomix.jl-fork/prototype/KernelAbstractions.jl/src/macros.jl:63 =#
return (KernelAbstractions.construct)(dev, sz, range, cpu_atomic_add_ka!)
else
#= /Users/anicusan/Prog/Julia/Packages/Atomix.jl-fork/prototype/KernelAbstractions.jl/src/macros.jl:65 =#
error("This kernel is unavailable for backend CPU")
end
end
end
end
end
end
And
I do not know enough about the KernelAbstractions/JuliaGPU compilation pipeline to debug this further - @vchuravy or @maleadt , would you have any pointers? |
If we make this work, I will implement |
Hmm, that doesn't bode well. Some bad LLVM IR is being generated here, causing the Metal back-end compiler to abort. Can you upload the metallib from the error message here? |
Here's a zip with the |
added an issue to track progress on this: JuliaGPU/Metal.jl#477 |
Are you using a non-official version of Julia? If so, it's strongly recommended to use |
Hi, no, I am using the official Julia distribution installed via Juliaup - see my |
It looks like the IR is actually corrupt -- the producer/reader mismatch is only a red herring. Probably an issue with our IR downgrader. I'll take a closer look. |
The invalid IR comes from EDIT: The unsupported IR in question: define void @kernel(ptr %ptr) {
%1 = atomicrmw add ptr %ptr, i32 0 monotonic, align 4
%2 = cmpxchg ptr %ptr, i32 0, i32 1 monotonic monotonic
ret void
} |
That's odd - why did the tests (copied verbatim from AtomixCUDA) work then? I'm asking to 1) understand it, and 2) then write a test covering this failed case. The operations in @inline function Atomix.modify!(ref::MtlIndexableRef, op::OP, x, order) where {OP}
x = convert(eltype(ref), x)
ptr = Atomix.pointer(ref)
begin
old = if op === (+)
Metal.atomic_fetch_add_explicit(ptr, x)
elseif op === (-)
Metal.atomic_fetch_sub_explicit(ptr, x)
elseif op === (&)
Metal.atomic_fetch_and_explicit(ptr, x)
elseif op === (|)
Metal.atomic_fetch_or_explicit(ptr, x)
elseif op === xor
Metal.atomic_fetch_xor_explicit(ptr, x)
elseif op === min
Metal.atomic_fetch_min_explicit(ptr, x)
elseif op === max
Metal.atomic_fetch_max_explicit(ptr, x)
else
error("not implemented")
end
end
return old => op(old, x)
end |
Presumably because those tests didn't trigger Here's where the ; │┌ @ /Users/tim/Julia/pkg/Atomix/src/core.jl:30 within `modify!`
; ││┌ @ /Users/tim/Julia/pkg/Atomix/src/references.jl:99 within `pointer` @ /Users/tim/Julia/pkg/Metal/src/device/array.jl:64
; │││┌ @ abstractarray.jl:1236 within `_memory_offset`
; ││││┌ @ int.jl:88 within `*`
%16 = shl nuw nsw i64 %14, 2
%17 = add nsw i64 %16, -4
; │││└└
; │││┌ @ /Users/tim/.julia/packages/LLVM/wMjUU/src/interop/pointer.jl:147 within `+`
; ││││┌ @ /Users/tim/.julia/packages/LLVM/wMjUU/src/interop/pointer.jl:114 within `add_ptr`
; │││││┌ @ /Users/tim/.julia/packages/LLVM/wMjUU/src/interop/pointer.jl:114 within `macro expansion` @ /Users/tim/.julia/packages/LLVM/wMjUU/src/interop/base.jl:39
%18 = getelementptr i8, i8 addrspace(1)* %.unpack, i64 %17
; ││└└└└
; ││ @ /Users/tim/Julia/pkg/Atomix/src/core.jl:33 within `modify!` @ /Users/tim/.julia/packages/UnsafeAtomicsLLVM/LPqS5/src/internal.jl:23 @ /Users/tim/.julia/packages/UnsafeAtomicsLLVM/LPqS5/src/internal.jl:23
; ││┌ @ /Users/tim/.julia/packages/UnsafeAtomicsLLVM/LPqS5/src/atomics.jl:399 within `atomic_pointermodify`
; │││┌ @ /Users/tim/.julia/packages/UnsafeAtomicsLLVM/LPqS5/src/atomics.jl:260 within `llvm_atomic_op`
; ││││┌ @ /Users/tim/.julia/packages/UnsafeAtomicsLLVM/LPqS5/src/atomics.jl:260 within `macro expansion` @ /Users/tim/.julia/packages/LLVM/wMjUU/src/interop/base.jl:39
%19 = bitcast i8 addrspace(1)* %18 to i32 addrspace(1)*
%20 = atomicrmw add i32 addrspace(1)* %19, i32 1 seq_cst, align 4 Interestingly though, after fixing the downgrader your example does just work. I guess Apple has recently added some support for native LLVM atomics to Metal? Something to look into, but if you want to be sure I'd try to use the explicit AIR atomic intrinsics where possible for now. |
First, thank you again for taking the time to investigate all this. The The odd part is that KernelAbstractions kernels using atomics end up emitting LLVM IR for them, and not anything via AtomixMetal - as seen in your call stack, the IR comes from But wasn't KernelAbstractions use of Atomix supposed to use the right If it seems we now accidentally have Metal atomics in KA because of LLVM IR I'm happy, but I'm still curious about the AtomixMetal / AtomixCUDA stacks, which may still be needed for other backends in the future. |
I thought so as well; cc @vchuravy. Note that a better design would be to use LLVM atomics everywhere and do the lowering to backend-specific intrinsics (like AIR's) in GPUCompiler.jl, but that's a redesign I don't have the time for (JuliaGPU/GPUCompiler.jl#479). |
AMDGPU.jl already uses Atomix for atomics and it does not need any special handling, since we rely directly on LLVM atomics for this. |
I believe this fails because you forgot to I think both AtomixCUDA and AtomixMetal should be deprecated and converted to extensions. It might even make more sense to have those extensions live in their respective repositories instead of here. Although it may be easier for CI if they live here. |
That seems to be correct. ; │┌ @ /Users/tim/Julia/pkg/Atomix/lib/AtomixMetal/src/AtomixMetal.jl:35 within `modify!`
; ││┌ @ /Users/tim/Julia/pkg/Atomix/src/references.jl:99 within `pointer` @ /Users/tim/Julia/pkg/Metal/src/device/array.jl:64
; │││┌ @ abstractarray.jl:1236 within `_memory_offset`
; ││││┌ @ int.jl:88 within `*`
%16 = shl nuw nsw i64 %14, 2
%17 = add nsw i64 %16, -4
; │││└└
; │││┌ @ /Users/tim/.julia/packages/LLVM/wMjUU/src/interop/pointer.jl:147 within `+`
; ││││┌ @ /Users/tim/.julia/packages/LLVM/wMjUU/src/interop/pointer.jl:114 within `add_ptr`
; │││││┌ @ /Users/tim/.julia/packages/LLVM/wMjUU/src/interop/pointer.jl:114 within `macro expansion` @ /Users/tim/.julia/packages/LLVM/wMjUU/src/interop/base.jl:39
%18 = getelementptr i8, i8 addrspace(1)* %.unpack, i64 %17
; ││└└└└
; ││ @ /Users/tim/Julia/pkg/Atomix/lib/AtomixMetal/src/AtomixMetal.jl:38 within `modify!`
; ││┌ @ /Users/tim/Julia/pkg/Metal/src/device/intrinsics/atomics.jl:84 within `atomic_fetch_add_explicit`
; │││┌ @ /Users/tim/.julia/packages/LLVM/wMjUU/src/interop/pointer.jl:344 within `macro expansion`
; ││││┌ @ /Users/tim/.julia/packages/LLVM/wMjUU/src/interop/pointer.jl:182 within `_typed_llvmcall`
; │││││┌ @ /Users/tim/.julia/packages/LLVM/wMjUU/src/interop/pointer.jl:182 within `macro expansion` @ /Users/tim/.julia/packages/LLVM/wMjUU/src/interop/base.jl:39
%19 = bitcast i8 addrspace(1)* %18 to i32 addrspace(1)*
%20 = call i32 @air.atomic.global.add.s.i32(i32 addrspace(1)* %19, i32 1, i32 0, i32 2, i1 true) |
I opened JuliaGPU/CUDA.jl#2549. If it works out, I think we should do the same with the Metal backend instead of a new package. |
I suggest we merge this, and then focus on converting the current subpackages to extensions (as proposed by @christiangnrd) before releasing v1.0. |
I'll have a PR for extensions within a couple hours. Unless it's already being worked on. Then I won't bother finishing it. |
Awesome! I won't be able to get to it this week, so feel free 🙂 |
Thanks for all the help in this conversation! Yes, adding using KernelAbstractions
using Atomix
using AtomixMetal
using Metal
# Have two threads concurrently increment each element
@kernel cpu=false function atomic_add_ka!(v)
i = @index(Global)
Atomix.@atomic v[(i - 1) ÷ 2 + 1] += eltype(v)(1)
end
v = Metal.zeros(Int32, 1000)
atomic_add_ka!(get_backend(v), 128)(v, ndrange=length(v) * 2)
@assert all(Array(v) .== 2) This is a bit surprising - @christiangnrd 's PR will be very useful in codebases using KernelAbstractions with atomics. Finally - does the oneAPI backend support UnsafeAtomicsLLVM directly, or do we need a similar |
oneAPI.jl and OpenCL.jl use SPIRVIntrinsics.jl, which (contrary to its name) currently rely on OpenCL-style atomics, which are explicit function calls detected by the back-end: https://github.com/JuliaGPU/OpenCL.jl/blob/master/lib/intrinsics/src/atomic.jl |
Ok didn't have the free time I expected to work on this and I don't know when I'll be able to get to it, so someone else should probably pick this up if they need this. |
@christiangnrd I made a pull request for this: #42 |
Tests pass on my Mac M3. Only changes to AtomixCUDA is the use of Int32 instead of Int - are there any plans to add support for 64-bit atomics, at least when natively supported on the M2 and up?